BeautifulSoup can't parse a webpage?

Posted by JLTChiu on Stack Overflow
Published on 2012-10-14T21:18:02Z Indexed on 2012/10/14 21:37 UTC

I am using Beautiful Soup to parse a webpage. I've heard it is popular and works well, but it doesn't seem to work properly for me.

Here's what I did:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
soup = BeautifulSoup(page)
print soup.prettify()

I think this is fairly straightforward: I open the webpage and pass it to BeautifulSoup. But here's what I got:

Warning (from warnings module):

File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149

"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))

...

HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

I would have expected the CNN website to be well-formed, so I am not sure what's going on. Does anyone have an idea what is causing this?
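
For reference, the warning's own suggestion (install lxml or html5lib and tell Beautiful Soup to use it) might look roughly like the sketch below. It assumes Python 3, unlike the Python 2 snippet above, and the helper names `choose_parser` and `make_soup` are made up for illustration. Whether a different parser actually copes with this particular CNN page has not been verified here.

```python
import importlib.util


def choose_parser():
    """Return the name of an installed external parser, if any.

    Hypothetical helper: tries lxml first, then html5lib (the two parsers
    named in the warning), and falls back to the built-in "html.parser".
    """
    for module_name in ("lxml", "html5lib"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return "html.parser"


def make_soup(markup):
    """Parse markup with an explicitly named parser instead of the default."""
    # Imported here so choose_parser() can be used even without bs4 installed.
    from bs4 import BeautifulSoup

    # The second argument names the parser Beautiful Soup should use.
    return BeautifulSoup(markup, choose_parser())
```

Passing the parser name explicitly also means the same code does not silently switch parsers between machines that have different libraries installed.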

© Stack Overflow or respective owner
