Algorithm to match natural text in mail
- by snøreven
I need to separate natural, coherent text/sentences in emails from lists, signatures, greetings, and so on before further processing.
Example:
  Hi tom,
  
  last monday we did bla bla, lore Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et
  dolore magna aliqua.
  
  
  list item 2
  list item 3
  list item 3
  
  
  Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid x ea commodi consequat. Quis aute iure reprehenderit
  in voluptate velit
  
  regards, K.
  
  ---line-of-funny-characters-#######
  
  example inc. 
  
  33 evil street, london
  
  mobile: 00 234534/234345
Ideally the algorithm would match only the coherent paragraphs (the ones beginning "last monday we did" and "Ut enim ad minim").
Is there a recommended approach, or are there even existing algorithms for this problem? Should I try approximate regular expressions, or something more statistical based on the number of punctuation marks, line length, and so on?
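For illustration, the statistical idea could be sketched roughly like this in Python. It splits the mail into blocks at blank lines and keeps a block only if it is long enough, consists mostly of letters, and contains sentence-style punctuation. The function names, the chosen features, and the thresholds (`min_words`, `min_line_len`, `min_alpha_ratio`) are all made-up assumptions that would need tuning on real mails; this is not an established algorithm.

```python
import re


def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def looks_like_prose(block, min_words=8, min_line_len=40, min_alpha_ratio=0.8):
    """Heuristic guess whether a block is running text (thresholds are assumptions)."""
    # Greetings, sign-offs and address lines tend to have very few words.
    if len(block.split()) < min_words:
        return False
    # List items tend to be short lines; prose has at least one long line.
    if max(len(line.strip()) for line in block.splitlines()) < min_line_len:
        return False
    # Separator lines and phone numbers are dominated by non-letters.
    letters_and_spaces = sum(ch.isalpha() or ch.isspace() for ch in block)
    if letters_and_spaces / len(block) < min_alpha_ratio:
        return False
    # Prose normally contains punctuation followed by whitespace or end of text.
    return bool(re.search(r"[.,;:!?](\s|$)", block))


def extract_prose(text):
    """Return only the blocks that the heuristic classifies as prose."""
    return [b for b in split_blocks(text) if looks_like_prose(b)]
```

Calling `extract_prose(mail_body)` on a plain-text mail body returns the list of blocks judged to be prose. Hard thresholds like these are brittle, so the same features could instead be fed into a trained classifier once some labelled mails are available.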