Scraping text from multiple HTML files into a single CSV file
Posted by Lulu on Stack Overflow
Published on 2011-01-11T14:00:41Z
Indexed on 2011/01/15 1:53 UTC
Tags: html-parsing | beautifulsoup
I have just over 1500 HTML pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need, but it misses some of the data within the table.
My Input: e.g. file 1500.html
My Code:
#!/usr/bin/env python
import glob
import codecs
from BeautifulSoup import BeautifulSoup

with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        print 'Processing', file
        soup = BeautifulSoup(open(file).read())
        rows = soup.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            #print >> csvfile,"#".join(col.string for col in cols)
            #print >> csvfile,"#".join(td.find(text=True))
            for col in cols:
                print >> csvfile, col.string
            print >> csvfile, "==="   # separator: end of a table row
        print >> csvfile, "***"       # separator: end of a file
Output:
One CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data: for example, the Address1 and Address 2 data at the start of the table do not come out. I modified the code to write *** and === separators, and I then use Perl to turn the output into a clean CSV file. Unfortunately, I'm not sure how to change my code to get all the data I'm looking for!
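One possible cause, assuming the missing cells (such as Address1) contain nested markup like <br> or <b>: in Beautiful Soup, .string is None whenever a tag has more than one child, so those cells would be written out as None. The sketch below is not the code from the question. It swaps in Python 3's standard-library html.parser and csv modules to show the general idea of collecting all the text inside each cell and writing one CSV row per table row. The names TableText and table_rows and the sample HTML are hypothetical, and the parser assumes every <td> and <tr> is explicitly closed and that tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of every <td>, including text inside nested tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> being read, or None
        self._cell = None   # text fragments of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(" ")  # treat a line break as a space

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of cell texts."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample input; the real pages are not shown in the question.
sample = """
<table>
  <tr><td>Address1</td><td>12 High St<br>Flat 2</td></tr>
  <tr><td>Phone</td><td><b>555</b> 0100</td></tr>
</table>
"""

rows = table_rows(sample)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # the csv module handles quoting
print(buffer.getvalue())
```

To write a real file, the StringIO buffer could be replaced with open('dump2.csv', 'w', newline='', encoding='utf-8'), calling table_rows once per input file. If staying with Beautiful Soup 3 as in the question, joining col.findAll(text=True) for each cell gathers the same nested text instead of relying on col.string.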
© Stack Overflow or respective owner