How to identify a PDF classification problem?

Posted by burtonic on Programmers
Published on 2012-06-24T17:55:53Z Indexed on 2012/06/30 3:23 UTC

We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages.

The PDFs are scanned and a database is populated with, among other things, each document's:

  • Title
  • Contents (full text)
  • Page count
  • Word count
  • Orientation
  • First line

Using this data, we check for obvious phrases such as:

  • Annual report
  • Financial statement
  • Quarterly report
  • Interim report

We then record the frequency of these phrases and others. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not.
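
To make the phrase counting concrete, here is a minimal sketch of how it might look in Ruby. The constant `PHRASES` and the method `phrase_frequencies` are made-up names for illustration, and the matching is a plain case-insensitive substring count; the real pipeline may well do something different.

```ruby
# Hypothetical phrase list, taken from the examples listed in the post.
PHRASES = [
  "annual report",
  "financial statement",
  "quarterly report",
  "interim report"
].freeze

# Count case-insensitive occurrences of each phrase in a document's text.
# Runs of whitespace (including line breaks from PDF extraction) are
# collapsed to a single space so phrases split across lines still match.
def phrase_frequencies(text)
  normalized = text.downcase.gsub(/\s+/, " ")
  PHRASES.each_with_object({}) do |phrase, counts|
    counts[phrase] = normalized.scan(phrase).length
  end
end
```

The resulting hash (phrase to count) could be stored next to the other per-document fields and used as input features for whatever classifier is chosen.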

We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
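
For reference, the Bayesian approach mentioned above could be sketched as a small multinomial naive Bayes classifier with add-one (Laplace) smoothing. This is only an illustration under assumptions: the class name `NaiveBayes`, its methods, and the simple word tokenizer are all hypothetical and are not the code we are actually running.

```ruby
# Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
# All names here are hypothetical, chosen for illustration only.
class NaiveBayes
  def initialize
    @doc_counts  = Hash.new(0)                             # label => number of training documents
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # label => { word => count }
    @vocabulary  = {}                                      # every distinct word seen in training
  end

  # Very simple tokenizer: lowercase runs of ASCII letters.
  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end

  # Add one labelled document to the model.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log-posterior score.
  def classify(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    @doc_counts.keys.max_by do |label|
      total_words = @word_counts[label].values.sum
      log_prior = Math.log(@doc_counts[label].to_f / total_docs)
      log_likelihood = words.sum do |word|
        Math.log((@word_counts[label][word] + 1.0) / (total_words + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end
end
```

Usage would be along the lines of calling `train(:report, text)` or `train(:other, text)` for each of the 4,000 manually classified documents, then `classify(text)` on the rest. Working in log space avoids numeric underflow on long documents, which matters when the full text of a PDF is the input.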
