categorize a set of phrases into a set of similar phrases

Posted by Dingo on Stack Overflow See other posts from Stack Overflow or by Dingo
Published on 2010-12-26T09:32:53Z Indexed on 2010/12/26 9:54 UTC
Read the original article Hit count: 374

Filed under:

algorithm

|

string-matching

|

categorization

I have a few apps that generate textual tracing information (logs) to log files. The tracing information is the typical printf() style - i.e. there are a lot of log entries that are similar (same format argument to printf), but differ where the format string had parameters.

What would be an algorithm (url, books, articles, ...) that will allow me to analyze the log entries and categorize them into several bins/containers, where each bin has one associated format?
Essentially, what I would like is to transform the raw log entries into (formatA, arg0 ... argN) instances, where formatA is shared among many log entries. The formatA does not have to be the exact format used to generate the entry (even more so if this makes the algo simpler).

Most of the literature and web-info I found deals with exact matching, a max substring matching, or a k-difference (with k known/fixed ahead of time). Also, it focuses on matching a pair of (long) strings, or a single bin output (one match among all input). My case is somewhat different, since I have to discover what represents a (good-enough) match (generally a sequence of discontinuous strings), and then categorize each input entries to one of the discovered matches.

Lastly, I'm not looking for a perfect algorithm, but something simple/easy to maintain.

Thanks!

© Stack Overflow or respective owner

Related posts about algorithm

Jpeg Algorithm vs BMP Algorithm?

as seen on Super User - Search for 'Super User'
I'm just wonder, what the differences are between creating a BMP file algorithm and JPG file algorithm ? If you know the others images' format algorithm, please post them. Thanks. >>> More
word disambiguation algorithm (Lesk algorithm)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hii.. Can anybody help me to find an algorithm in Java code to find synonyms of a search word based on the context and I want to implement the algorithm with WordNet database. For example, "I am running a Java program". From the context, I want to find the synonyms for the word "running", but the… >>> More
Search algorithm (with a sort algorithm already implemented)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello, Im doing a Java application and Im facing some doubts in which concerns performance. I have a PriorityQueue which guarantees me the element removed is the one with greater priority. That PriorityQueue has instances of class Event (which implements Comparable interface). Each Event is associated… >>> More
Is there any algorithm for finding LINES by PIXEL COLORS on picture?

as seen on Stack Overflow - Search for 'Stack Overflow'
So I have Image like this I want to get something like this (I hevent drawn all lines I want but I hope you can get my idea) I need algorithm for finding all straight lines on it by just reading colors of pixels. No hard math, no Haar, no Hough. Some algorithm which would be based on points… >>> More
collsion issues with quadtree [on hold]

as seen on Game Development - Search for 'Game Development'
So i implemented a Quad tree in Java for my 2D game and everything works fine except for when i run my collision detection algorithm, which checks if a object has hit another object and which side it hit.My problem is 80% of the time the collision algorithm works but sometimes the objects just go… >>> More

Related posts about string-matching

Approximate string matching with a letter confusion matrix?

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do… >>> More
sample java code for approximate string matching or boyer-moore extended for approximate string matc

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi I need to find 1.mismatch(incorrectly played notes), 2.insertion(additional played), & 3.deletion (missed notes), in a music piece (e.g. note pitches [string values] stored in a table) against a reference music piece. This is either possible through exact string matching algorithms or dynamic… >>> More
Ranking based string matching algorithm..for Midi Music

as seen on Stack Overflow - Search for 'Stack Overflow'
i am working on midi music project. What i am trying to do is:- matching the Instrument midi track with the similar instrument midi track... for example Flute track in a some midi music is matched against the Flute track in some other music midi file... After matching ,the results should come ranking… >>> More
String matching.

as seen on Stack Overflow - Search for 'Stack Overflow'
How to match the string Net-----Amount (or here between Net and Amount there can be any number of space) with net amount ? Consider ----- as space because I could not keep the space between these two words in the editor. >>> More
String Matching.

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a string String mainString="///BUY/SELL///ORDERTIME///RT///QTY///BROKERAGE///NETRATE///AMOUNTRS///RATE///SCNM///"; Now I have another strings String str1= "RT"; which should be matched only with RT which is substring of string mainString but not with ORDERTIME which is also substring… >>> More