What algorithms can I use to detect if articles or posts are duplicates?

I'm trying to detect if an article or forum post is a duplicate entry within the database. I've given this some thought and come to the conclusion that someone who duplicates content will do so in one of three ways (in descending order of difficulty to detect):

  1. a simple copy-paste of the whole text
  2. copying and pasting parts of the text and merging them with their own
  3. copying an article from an external site and masquerading it as their own

Prepping Text For Analysis

Basically, remove any anomalies; the goal is to make the text as "pure" as possible. For more accurate results, the text is "standardized" by the following steps (a rough sketch of this pass follows the list):

  1. Duplicate white space is stripped, and leading and trailing whitespace is trimmed.
  2. Newlines are standardized to \n.
  3. HTML tags are removed.
  4. URLs are stripped using a RegEx from Daring Fireball.
  5. I use BB code in my application, so that is stripped too.
  6. Accented and other non-English characters (e.g., ä) are converted to their plain equivalents.
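Here's a minimal Python sketch of such a normalization pass. The URL pattern is a short stand-in for the much longer Daring Fireball regex, and the BB code pattern only covers simple tags:

```python
import re
import unicodedata

# Short stand-in for the Daring Fireball URL-matching regex.
URL_RE = re.compile(r'https?://\S+|www\.\S+', re.IGNORECASE)
HTML_TAG_RE = re.compile(r'<[^>]+>')
# Matches simple BB code tags like [b], [/b], [url=...], [*].
BBCODE_RE = re.compile(r'\[/?[a-z*]+(?:=[^\]]*)?\]', re.IGNORECASE)

def normalize_text(text):
    """Standardize text before it is analyzed."""
    text = text.replace('\r\n', '\n').replace('\r', '\n')  # newlines -> \n
    text = HTML_TAG_RE.sub(' ', text)                      # strip HTML tags
    text = URL_RE.sub(' ', text)                           # strip URLs
    text = BBCODE_RE.sub(' ', text)                        # strip BB code
    # Fold accented characters to their closest ASCII form (ä -> a).
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    text = re.sub(r'[ \t]+', ' ', text)                    # collapse duplicate whitespace
    return text.strip()
```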

I store information about each article in (1) a statistics table and (2) a keywords table.

(1) Statistics Table

The following statistics are stored about the textual content (much like this post):

  1. text length
  2. letter count
  3. word count
  4. sentence count
  5. average words per sentence
  6. automated readability index
  7. gunning fog score

For European languages, Coleman-Liau and the Automated Readability Index should be used, as they do not rely on syllable counting and so should produce reasonably accurate scores.
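Here is a rough Python sketch of how these statistics could be computed. The sentence splitter and the vowel-group syllable heuristic used for the fog index are crude approximations, not the official definitions:

```python
import re

def text_statistics(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    letters = sum(c.isalpha() for c in text)
    w, s = max(len(words), 1), max(len(sentences), 1)

    # Automated Readability Index: character- and word-based, no syllables.
    ari = 4.71 * (letters / w) + 0.5 * (w / s) - 21.43

    # Gunning fog needs "complex" (3+ syllable) words; counting vowel
    # groups is only a rough syllable heuristic.
    def syllables(word):
        return max(len(re.findall(r'[aeiouy]+', word.lower())), 1)

    complex_words = sum(1 for word in words if syllables(word) >= 3)
    fog = 0.4 * ((w / s) + 100 * complex_words / w)

    return {
        'text_length': len(text),
        'letter_count': letters,
        'word_count': len(words),
        'sentence_count': len(sentences),
        'word_per_sentence': round(w / s, 1),
        'gunning_fog': round(fog, 1),
        'auto_read_index': round(ari, 1),
    }
```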

(2) Keywords Table

The keywords are generated by excluding a huge list of stop words (common words), e.g., 'the', 'a', 'of', 'to', and so on.
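A minimal sketch of frequency-based keyword extraction along these lines; the stop-word set below is only illustrative, since a real list runs to hundreds of entries:

```python
import re
from collections import Counter

# Illustrative stop words only; a production list is far longer.
STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'it', 'that', 'for'}

def extract_keywords(text, top_n=10):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```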

Sample Data

  • text_length, 3963
  • letter_count, 3052
  • word_count, 684
  • sentence_count, 33
  • word_per_sentence, 21
  • gunning_fog, 11.5
  • auto_read_index, 9.9
  • keyword 1, killed
  • keyword 2, officers
  • keyword 3, police

It should be noted that once an article is updated, all of the above statistics are regenerated and could come out with completely different values.

How could I use the above information to detect whether an article that's being published for the first time already exists within the database?


I'm aware that anything I design will not be perfect. The biggest risks are that (1) content that is not a duplicate gets flagged as a duplicate, and (2) the system lets duplicate content through.

So the algorithm should generate a risk score from 0 (no duplicate risk) through 5 (possible duplicate) to 10 (duplicate). Anything above 5 means there's a good possibility that the content is a duplicate; in that case the content could be flagged and linked to the articles that are possible duplicates, and a human could decide whether to delete it or allow it.

As I said before, I'm storing keywords for the whole article, but I wonder if I could do the same on a per-paragraph basis. This would mean further splitting up my data in the DB, but it would also make it easier to detect case (2) from my initial list.

I'm thinking of a weighted average across the statistics, but with which weights, and what would the consequences be...
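For illustration, a weighted comparison between two articles might look like the sketch below, where the stats dicts come from something like text_statistics above and the keyword arguments are sets built from extract_keywords output. The weights, and the choice to lean on keyword overlap more heavily than on the surface statistics, are assumptions that would need tuning against real data:

```python
def duplicate_risk(stats_a, stats_b, keywords_a, keywords_b):
    """Return a 0-10 risk score; all weights here are illustrative guesses."""
    # Relative closeness of one statistic: 1.0 = identical, 0.0 = far apart.
    def closeness(x, y):
        return 1.0 - min(abs(x - y) / max(x, y, 1), 1.0)

    # Weighted similarity of the stored statistics (weights sum to 1.0).
    stat_score = sum(
        weight * closeness(stats_a[key], stats_b[key])
        for key, weight in [('word_count', 0.3),
                            ('word_per_sentence', 0.2),
                            ('gunning_fog', 0.25),
                            ('auto_read_index', 0.25)]
    )

    # Jaccard overlap of keyword sets catches partial copy-paste better
    # than surface statistics do.
    union = keywords_a | keywords_b
    keyword_score = len(keywords_a & keywords_b) / len(union) if union else 0.0

    # Keywords weighted more heavily than statistics; scale to 0-10.
    return round(10 * (0.4 * stat_score + 0.6 * keyword_score), 1)
```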
