ybungalobill - Developer IT

Search Results

Search found 2 results on 1 pages for 'ybungalobill'.

Page 1/1 | 1

How to calculate this string-dissimilarity function efficiently?

- by ybungalobill

Hello, I was looking for a string metric that have the property that moving around large blocks in a string won't affect the distance so much. So "helloworld" is close to "worldhello". Obviously Levenshtein distance and Longest common subsequence don't fulfill this requirement. Using Jaccard distance on the set of n-grams gives good results but has other drawbacks (it's a pseudometric and higher n results in higher penalty for changing single character). [original research] As I thought about it, what I'm looking for is a function f(A,B) such that f(A,B)+1 equals the minimum number of blocks that one have to divide A into (A1 ... An), apply a permutation on the blocks and get B: f("hello", "hello") = 0 f("helloworld", "worldhello") = 1 // hello world -> world hello f("abba", "baba") = 2 // ab b a -> b ab a f("computer", "copmuter") = 3 // co m p uter -> co p m uter This can be extended for A and B that aren't necessarily permutations of each other: any additional character that can't be matched is considered as one additional block. f("computer", "combuter") = 3 // com uter -> com uter, unmatched: p and b. Observing that instead of counting blocks we can count the number of pairs of indices that are taken apart by a permutation, we can write f(A,B) formally as: f(A,B) = min { C(P) | P:|A|?|B|, P is bijective, ?i?dom(P) A[P(i)]=B[P(i)] } C(P) = |A| + |B| - |dom(P)| - |{ i | i,i+1?dom(P) and P(i)+1=P(i+1) }| - 1 The problem is... guess what... ... that I'm not able to calculate this in polynomial time. Can someone suggest a way to do this efficiently? Or perhaps point me to already known metric that exhibits similar properties?

Read the article

How string accepting interface should look like?

- by ybungalobill

Hello, This is a follow up of this question. Suppose I write a C++ interface that accepts or returns a const string. I can use a const char* zero-terminated string: void f(const char* str); // (1) The other way would be to use an std::string: void f(const string& str); // (2) It's also possible to write an overload and accept both: void f(const char* str); // (3) void f(const string& str); Or even a template in conjunction with boost string algorithms: template<class Range> void f(const Range& str); // (4) My thoughts are: (1) is not C++ish and may be less efficient when subsequent operations may need to know the string length. (2) is bad because now f("long very long C string"); invokes a construction of std::string which involves a heap allocation. If f uses that string just to pass it to some low-level interface that expects a C-string (like fopen) then it is just a waste of resources. (3) causes code duplication. Although one f can call the other depending on what is the most efficient implementation. However we can't overload based on return type, like in case of std::exception::what() that returns a const char*. (4) doesn't work with separate compilation and may cause even larger code bloat. Choosing between (1) and (2) based on what's needed by the implementation is, well, leaking an implementation detail to the interface. The question is: what is the preffered way? Is there any single guideline I can follow? What's your experience?