Splitting a set of object into several subsets of 'similar' objects

Posted by doublep on Stack Overflow See other posts from Stack Overflow or by doublep
Published on 2010-06-12T19:00:17Z Indexed on 2010/06/12 19:02 UTC
Read the original article Hit count: 275

Filed under:

Suppose I have a set of objects, S. There is an algorithm f that, given a set S builds certain data structure D on it: f(S) = D. If S is large and/or contains vastly different objects, D becomes large, to the point of being unusable (i.e. not fitting in allotted memory). To overcome this, I split S into several non-intersecting subsets: S = S1 + S2 + ... + Sn and build Di for each subset. Using n structures is less efficient than using one, but at least this way I can fit into memory constraints. Since size of f(S) grows faster than S itself, combined size of Di is much less than size of D.

However, it is still desirable to reduce n, i.e. the number of subsets; or reduce the combined size of Di. For this, I need to split S in such a way that each Si contains "similar" objects, because then f will produce a smaller output structure if input objects are "similar enough" to each other.

The problems is that while "similarity" of objects in S and size of f(S) do correlate, there is no way to compute the latter other than just evaluating f(S), and f is not quite fast.

Algorithm I have currently is to iteratively add each next object from S into one of Si, so that this results in the least possible (at this stage) increase in combined Di size:

for x in S:
    i = such i that
             size(f(Si + {x})) - size(f(Si))
             is min
    Si = Si + {x}

This gives practically useful results, but certainly pretty far from optimum (i.e. the minimal possible combined size). Also, this is slow. To speed up somewhat, I compute size(f(Si + {x})) - size(f(Si)) only for those i where x is "similar enough" to objects already in Si.

Is there any standard approach to such kinds of problems?

I know of branch and bounds algorithm family, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute optimal distribution of S into Si in reasonable time. But is there some common iteratively improving algorithm?

Developer IT

Splitting a set of object into several subsets of 'similar' objects - Developer IT

Splitting a set of object into several subsets of 'similar' objects

algorithm

fuzzy

Related posts about algorithm

Jpeg Algorithm vs BMP Algorithm?

word disambiguation algorithm (Lesk algorithm)

Search algorithm (with a sort algorithm already implemented)

Is there any algorithm for finding LINES by PIXEL COLORS on picture?

collsion issues with quadtree [on hold]

Related posts about fuzzy

"Go to file" feature in various editors

Where can I find FuzzyGK or other fuzzy clustering algorithms to use in Weka?

Lucene Fuzzy Match on Phrase instead of Single Word

SDL_image/C++ OpenGL Program: IMG_Load() produces fuzzy images

Lucene.NET (strings fuzzy matching)

Categories cloud