table partitioning - Page 315

What algorithms can I use to detect if articles or posts are duplicates?

- by michael

I'm trying to detect if an article or forum post is a duplicate entry within the database. I've given this some thought, coming to the conclusion that someone who duplicate content will do so using one of the three (in descending difficult to detect): simple copy paste the whole text copy and paste parts of text merging it with their own copy an article from an external site and masquerade as their own Prepping Text For Analysis Basically any anomalies; the goal is to make the text as "pure" as possible. For more accurate results, the text is "standardized" by: Stripping duplicate white spaces and trimming leading and trailing. Newlines are standardized to \n. HTML tags are removed. Using a RegEx called Daring Fireball URLs are stripped. I use BB code in my application so that goes to. (ä)ccented and foreign (besides Enlgish) are converted to their non foreign form. I store information about each article in (1) statistics table and in (2) keywords table. (1) Statistics Table The following statistics are stored about the textual content (much like this post) text length letter count word count sentence count average words per sentence automated readability index gunning fog score For European languages Coleman-Liau and Automated Readability Index should be used as they do not use syllable counting, so should produce a reasonably accurate score. (2) Keywords Table The keywords are generated by excluding a huge list of stop words (common words), e.g., 'the', 'a', 'of', 'to', etc, etc. Sample Data text_length, 3963 letter_count, 3052 word_count, 684 sentence_count, 33 word_per_sentence, 21 gunning_fog, 11.5 auto_read_index, 9.9 keyword 1, killed keyword 2, officers keyword 3, police It should be noted that once an article gets updated all of the above statistics are regenerated and could be completely different values. How could I use the above information to detect if an article that's being published for the first time, is already existing within the database? I'm aware anything I'll design will not be perfect, the biggest risk being (1) Content that is not a duplicate will be flagged as duplicate (2) The system allows the duplicate content through. So the algorithm should generate a risk assessment number from 0 being no duplicate risk 5 being possible duplicate and 10 being duplicate. Anything above 5 then there's a good possibility that the content is duplicate. In this case the content could be flagged and linked to the article's that are possible duplicates and a human could decide whether to delete or allow. As I said before I'm storing keywords for the whole article, however I wonder if I could do the same on paragraph basis; this would also mean further separating my data in the DB but it would also make it easier for detecting (2) in my initial post. I'm thinking weighted average between the statistics, but in what order and what would be the consequences...

Search Results

Search found 25600 results on 1024 pages for 'table partitioning'.

Page 315/1024 | < Previous Page | 311 312 313 314 315 316 317 318 319 320 321 322 | Next Page >

- by michael

- by Tanu Sood

- by altsyset

- by John Svensson

- by AllenMWhite

- by Richard Lefebvre

- by Jie Chen

- by NimChimpsky

- by jeffrey.waterman

- by kingsley osime

- by Davide Mauri

- by TiborKaraszi

- by Michael Zilberstein

- by Melissa Centurio Lopes

- by Michael Zilberstein

- by Chris

- by Sam M

- by TiborKaraszi

< Previous Page | 311 312 313 314 315 316 317 318 319 320 321 322 | Next Page >