Search Results

Search found 101 results on 5 pages for 'beautifulsoup'.

Page 1/5 | 1 2 3 4 5  | Next Page >

  • Getting BeautifulSoup to find a specific <p>

    - by Ryan
    I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html. I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p> tags, but I can't seem to figure out how to isolate them. I thought it would be something simple like:

        from BeautifulSoup import BeautifulSoup
        import re
        import urllib2

        address = "http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
        html = urllib2.urlopen(address).read()
        soup = BeautifulSoup(html)
        abstract = soup.find('p', attrs={'class': 'lead'})
        print abstract

    Using Python 2.5 and BeautifulSoup 3.0.8, running this returns 'None'. I have no option of using anything else that needs to be compiled/installed (like lxml). Is BeautifulSoup confused, or am I?
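    A possible first step (untested sketch; it assumes the abstract really is present in the HTML the server sends to a script, which may differ from what a browser shows) is to dump what was fetched and what BeautifulSoup parsed:

        # Untested sketch: check whether the <p class="lead"> element is in the raw
        # HTML the server returned, and what BeautifulSoup made of the <p> tags.
        from BeautifulSoup import BeautifulSoup
        import urllib2

        address = "http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html"
        html = urllib2.urlopen(address).read()
        print 'class="lead"' in html        # is the element in the fetched markup at all?

        soup = BeautifulSoup(html)
        for p in soup.findAll('p'):
            print p.attrs                   # list of (attribute, value) pairs for each <p>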

    Read the article

  • Trying to grab just absolute links from a webpage using BeautifulSoup

    - by Kevin
    I am reading the contents of a webpage using BeautifulSoup. What I want is to grab just the <a href> links that start with http://. I know that in BeautifulSoup you can search by attributes; I guess I am just having a syntax issue. I would imagine it would go something like:

        page = urllib2.urlopen("http://www.linkpages.com")
        soup = BeautifulSoup(page)
        for link in soup.findAll('a'):
            if link['href'].startswith('http://'):
                print links

    But that returns:

        Traceback (most recent call last):
          File "<stdin>", line 2, in <module>
          File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
            return self._getAttrMap()[key]
        KeyError: 'href'

    Any ideas? Thanks in advance.

    EDIT: This isn't for any site in particular; the script gets the URL from the user. So internal link targets would be an issue, which is also why I only want the <a> tags from the pages. If I turn it towards www.reddit.com, it parses the beginning links and then gets to this:

        <a href="http://www.reddit.com/top/">top</a>
        <a href="http://www.reddit.com/saved/">saved</a>
        Traceback (most recent call last):
          File "<stdin>", line 2, in <module>
          File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
            return self._getAttrMap()[key]
        KeyError: 'href'
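    A likely fix (untested sketch): some <a> tags have no href attribute at all (named anchors, for example), which is what raises the KeyError. Restricting findAll to tags that do have an href avoids it:

        # Untested sketch: href=True makes findAll skip <a> tags without an href.
        import urllib2
        from BeautifulSoup import BeautifulSoup

        page = urllib2.urlopen("http://www.linkpages.com")   # URL comes from the user in practice
        soup = BeautifulSoup(page)
        for link in soup.findAll('a', href=True):
            if link['href'].startswith('http://'):
                print link['href']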

    Read the article

  • Optimizing BeautifulSoup (Python) code

    - by user283405
    I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me with this? I am using BeautifulSoup for parsing and then saving into a DB. If I comment out the save statement, it still takes a long time, so there is no problem with the database.

        def parse(self, text):
            soup = BeautifulSoup(text)
            arr = soup.findAll('tbody')
            for i in range(0, len(arr)-1):
                data = Data()
                soup2 = BeautifulSoup(str(arr[i]))
                arr2 = soup2.findAll('td')
                c = 0
                for j in arr2:
                    if str(j).find("<a href=") > 0:
                        data.sourceURL = self.getAttributeValue(str(j), '<a href="')
                    else:
                        if c == 2:
                            data.Hits = j.renderContents()  #and few others...
                    c = c + 1
                data.save()

    Any suggestions? Note: I already asked this question here, but it was closed due to incomplete information.
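    One untested idea for speeding this up: the inner BeautifulSoup(str(arr[i])) call re-serializes and re-parses every <tbody>, and str(j).find(...) scans re-serialized strings; the already-parsed Tag objects can be searched directly instead.

        # Untested sketch: search the Tag objects returned by findAll rather than
        # re-parsing their string form. Data() and .save() are from the question.
        def parse(self, text):
            soup = BeautifulSoup(text)
            for tbody in soup.findAll('tbody'):
                data = Data()
                for c, cell in enumerate(tbody.findAll('td')):
                    link = cell.find('a', href=True)
                    if link is not None:
                        data.sourceURL = link['href']
                    elif c == 2:
                        data.Hits = cell.renderContents()
                data.save()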

    Read the article

  • BeautifulSoup can't find an existing href in a file

    - by young001
    I have an HTML file like the following:

        <form action="/2811457/follow?gsid=3_5bce9b871484d3af90c89f37" method="post">
        <div>
        <a href="/2811457/follow?page=2&amp;gsid=3_5bce9b871484d3af90c89f37">next_page</a>
        &nbsp;<input name="mp" type="hidden" value="3" />
        <input type="text" name="page" size="2" style='-wap-input-format: "*N"' />
        <input type="submit" value="jump" />&nbsp;1/3
        </div>
        </form>

    How do I extract the "1/3" from the file? This is only part of the HTML; I included it to make things clear. I'm new to BeautifulSoup, and I have looked at the documentation but am still confused. total_urls_num = soup.find(re.compile('.*/d\//d.*')) doesn't work. As JBernardo said, \d should be a number; when I change it to .*\d/\d.*, it doesn't work either. My code:

        from BeautifulSoup import BeautifulSoup
        import re

        with open("html.txt", "r") as f:
            response = f.read()
        print response
        soup = BeautifulSoup(response)
        delete_urls = soup.findAll('a', href=re.compile('follow\?page'))  # works
        print delete_urls
        #total_urls_num = soup.find(re.compile('.*\d/\d.*'))
        total_urls_num = soup.find('input', style='submit')  # can't work
        print total_urls_num
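    One possible approach (untested sketch): the "1/3" is a bare text node rather than a tag, so search the document text with a regular expression:

        # Untested sketch: find the text node that looks like "<number>/<number>".
        import re
        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(response)                 # response read from the file as above
        counter = soup.find(text=re.compile(r'\d+/\d+'))
        if counter is not None:
            print counter.strip()                      # should print 1/3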

    Read the article

  • BeautifulSoup Parser Confusion - HTML

    - by lyngbym
    I'm trying to scrape some content off another site and I'm not sure why BeautifulSoup is producing this output. It is only finding a blank space inside the match, but the real HTML contains a large amount of markup. I apologize if this is something stupid on my part; I'm new to Python. Here's my code:

        import sys
        import os
        import mechanize
        import re
        from BeautifulSoup import BeautifulSoup

        def scrape_trails(BASE_URL, data):
            #Get the trail names
            soup = BeautifulSoup(data)
            sitesDiv = soup.findAll("div", attrs={"id" : "sitesDiv"})
            print sitesDiv

        def main():
            BASE_URL = "http://www.dnr.state.mn.us/skiing/skipass/list.html"
            br = mechanize.Browser()
            data = br.open(BASE_URL).get_data()
            links = scrape_trails(BASE_URL, data)

        if __name__ == '__main__':
            main()

    If you follow that URL you can see the sitesDiv contains a lot of markup. I'm not sure if I'm doing something wrong or if this is just malformed markup that the script can't handle. Thanks!
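    One thing worth checking (untested sketch, built on an assumption: the div you see in the browser may be filled in by JavaScript after the page loads, in which case the fetched HTML only contains an empty placeholder):

        # Untested sketch: inspect the raw markup around sitesDiv in the response
        # the script actually received, rather than what the browser renders.
        import mechanize

        BASE_URL = "http://www.dnr.state.mn.us/skiing/skipass/list.html"
        br = mechanize.Browser()
        data = br.open(BASE_URL).get_data()

        start = data.find('sitesDiv')
        if start != -1:
            print data[start:start + 300]    # the div and whatever immediately follows it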

    Read the article

  • BeautifulSoup can't parse a webpage?

    - by JLTChiu
    I am using Beautiful Soup to parse a webpage. I've heard it's very popular and good, but it doesn't seem to work properly for me. Here's what I did:

        import urllib2
        from bs4 import BeautifulSoup

        page = urllib2.urlopen("http://www.cnn.com/2012/10/14/us/skydiver-record-attempt/index.html?hpt=hp_t1")
        soup = BeautifulSoup(page)
        print soup.prettify()

    I think this is fairly straightforward: I open the webpage and pass it to BeautifulSoup. But here's what I got:

        Warning (from warnings module):
          File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 149
        "Python's built-in HTMLParser cannot parse the given document. This is not a bug in
        Beautiful Soup. The best solution is to install an external parser (lxml or html5lib),
        and use Beautiful Soup with that parser. See
        http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."
        ...
        HTMLParseError: bad end tag: u'</"+"script>', at line 634, column 94

    I thought the CNN website would be well designed, so I am not very sure what's going on. Does anyone have any idea about this?
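    A likely fix, following the advice in the warning itself (untested sketch; it assumes lxml or html5lib has been installed, e.g. via pip):

        # Untested sketch: name an external parser when building the soup.
        import urllib2
        from bs4 import BeautifulSoup

        url = ("http://www.cnn.com/2012/10/14/us/"
               "skydiver-record-attempt/index.html?hpt=hp_t1")
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, "html5lib")   # or "lxml"
        print soup.prettify()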

    Read the article

  • Extracting an attribute value with beautifulsoup

    - by Barnabe
    I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:

        import urllib
        f = urllib.urlopen("http://58.68.130.147")
        s = f.read()
        f.close()

        from BeautifulSoup import BeautifulStoneSoup
        soup = BeautifulStoneSoup(s)
        inputTag = soup.findAll(attrs={"name" : "stainfo"})
        output = inputTag['value']
        print str(output)

    I get a TypeError: list indices must be integers, not str, even though from the BeautifulSoup documentation I understand that strings should not be a problem here... but I am no specialist and I may have misunderstood. Any suggestion is greatly appreciated! Thanks in advance.
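    A possible fix (untested sketch): findAll returns a list of matching tags, so the 'value' lookup has to happen on one element of that list (or use find, which returns a single tag or None):

        # Untested sketch: take a single tag before reading its attribute.
        from BeautifulSoup import BeautifulStoneSoup

        soup = BeautifulStoneSoup(s)                     # s fetched as in the question
        inputTag = soup.find(attrs={"name": "stainfo"})
        if inputTag is not None and inputTag.has_key('value'):
            print inputTag['value']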

    Read the article

  • BeautifulSoup, but for CSS?

    - by MTsoul
    BeautifulSoup parses HTML and offers various ways to manipulate and search within HTML. Is there something similar for CSS? Specifically, I'd like to know if a given piece of HTML text is rendered as bold. Either it has an ancestor that is a <strong> or <bold> tag (which can be done with BeautifulSoup), or it has an ancestor (or is itself an element) that has the CSS attribute font-weight: bold. Is this possible without resorting to writing my own library?

    Read the article

  • BeautifulSoup: Get the contents of a specific table

    - by Adam Matan
    Hi,
    My local airport disgracefully blocks users without IE, and looks awful. I want to write a Python script that would get the contents of the Arrivals and Departures pages every few minutes, and show them in a more readable manner. My tools of choice are mechanize for tricking the site into believing I use IE, and BeautifulSoup for parsing the page to get the flights data table. Quite honestly, I got lost in the BeautifulSoup documentation, and can't understand how to get the table (whose title I know) from the entire document, or how to get a list of rows from that table. Any ideas?
    Adam
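    A rough starting point (untested sketch; the title text and the html variable are assumptions standing in for the real page):

        # Untested sketch: locate the table through its known title text, then walk
        # its rows and cells.
        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(html)                       # html fetched via mechanize
        title = soup.find(text='Arrivals')               # hypothetical table title text
        if title is not None:
            table = title.findParent('table')            # the enclosing table
            for row in table.findAll('tr'):
                cells = [''.join(td.findAll(text=True)).strip() for td in row.findAll('td')]
                print cells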

    Read the article

  • How to find links and modify an Html using BeautifulSoup in Python

    - by systempuntoout
    Starting from an HTML input like this:

        <p>
        <a href="http://www.foo.com">this if foo</a>
        <a href="http://www.bar.com">this if bar</a>
        </p>

    using BeautifulSoup, I would like to change this HTML into:

        <p>
        <a href="http://www.foo.com">this if foo[1]</a>
        <a href="http://www.bar.com">this if bar[2]</a>
        </p>

    saving the parsed links in a dictionary, with a result like this:

        links_dict = {"1":"http://www.foo.com","2":"http://www.bar.com"}

    Is it possible to do this using BeautifulSoup? Any valid alternative?
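    A possible approach (untested sketch): enumerate the anchors, append the running index to each link's text, and record the href under that index:

        # Untested sketch: number each <a>, rewrite its text, and build the dict.
        from BeautifulSoup import BeautifulSoup

        html = ('<p><a href="http://www.foo.com">this if foo</a> '
                '<a href="http://www.bar.com">this if bar</a></p>')

        soup = BeautifulSoup(html)
        links_dict = {}
        for i, a in enumerate(soup.findAll('a', href=True)):
            n = i + 1
            links_dict[str(n)] = a['href']
            a.string.replaceWith(a.string + '[%d]' % n)   # append the index to the link text

        print soup
        print links_dict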

    Read the article

  • Matching id's in BeautifulSoup

    - by Ockonal
    Hello, I'm using BeautifulSoup (the Python module). I have to find any reference to the divs with an id like 'post-#'. For example:

        <div id="post-45">...</div>
        <div id="post-334">...</div>

    How can I filter for this?

        html = '<div id="post-45">...</div> <div id="post-334">...</div>'
        soupHandler = BeautifulSoup(html)
        print soupHandler.findAll('div', id='post-*')
        > []
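    A possible fix (untested sketch): the id argument does not accept glob patterns, but it does accept a compiled regular expression:

        # Untested sketch: match ids of the form post-<digits> with a regex.
        import re
        from BeautifulSoup import BeautifulSoup

        html = '<div id="post-45">...</div> <div id="post-334">...</div>'
        soupHandler = BeautifulSoup(html)
        print soupHandler.findAll('div', id=re.compile(r'^post-\d+$'))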

    Read the article

  • How to prevent BeautifulSoup from stripping lines

    - by Oli
    I'm trying to translate an online HTML page into text. I have a problem with this structure:

        <div align="justify"><b>Available in
        <a href="http://www.example.com.be/book.php?number=1"> French</a> and
        <a href="http://www.example.com.be/book.php?number=5"> English</a>.
        </div>

    Here is its representation as a Python string:

        '<div align="justify"><b>Available in \r\n<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a>; \r\n<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n</div>'

    When using:

        html_content = get_html_div_from_above()
        para = BeautifulSoup(html_content)
        txt = para.text

    BeautifulSoup translates it (in the 'txt' variable) as:

        u'Available inFrenchandEnglish.'

    It probably strips each line in the original HTML string. Do you have a clean solution to this problem? Thanks.
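    One possible workaround (untested sketch): join the text nodes with an explicit separator and then collapse the leftover line breaks into single spaces:

        # Untested sketch: keep a separator between text nodes, then normalise whitespace.
        import re
        from BeautifulSoup import BeautifulSoup

        para = BeautifulSoup(html_content)        # html_content as in the question
        txt = ' '.join(para.findAll(text=True))
        txt = re.sub(r'\s+', ' ', txt).strip()
        print txt    # words stay separated; stray spaces around punctuation may remain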

    Read the article

  • BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

    - by AP257
    I'm trying to scrape all the inner html from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text. For example, for:

        <p>Red</p>
        <p><i>Blue</i></p>
        <p>Yellow</p>
        <p>Light <b>green</b></p>

    How can I extract:

        Red
        Blue
        Yellow
        Light green

    Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur. Is there a 'just get the visible HTML' type of method in BeautifulSoup?

    ----UPDATE----

    On advice, trying:

        p_tags = page.findAll('p', text=True)
        for i, p_tag in enumerate(p_tags):
            print str(p_tag)

    But that doesn't help - it just prints out:

        Red
        <i>Blue</i>
        Yellow
        Light <b>green</b>
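    A possible approach (untested sketch): for each <p>, join all of its text nodes, which flattens away any nested tags:

        # Untested sketch: findAll(text=True) on each paragraph returns only the
        # text nodes, whatever tags they happen to be nested inside.
        from BeautifulSoup import BeautifulSoup

        html = ('<p>Red</p><p><i>Blue</i></p><p>Yellow</p>'
                '<p>Light <b>green</b></p>')
        soup = BeautifulSoup(html)
        for p in soup.findAll('p'):
            print ''.join(p.findAll(text=True))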

    Read the article

  • beautifulsoup and mechanize to get ajax call result

    - by nabizan
    Hi, I'm building a scraper using Python 2.5 and BeautifulSoup, but I've stumbled upon a problem... Part of the web page is generated after the user clicks a button, which starts an AJAX request by calling a specific JavaScript function with the proper parameters. Is there a way to simulate that user interaction and get the result? I came across the mechanize module, but it seems to me that it is mostly used to work with forms... I would appreciate any links or some code samples. Thanks.

    Read the article

  • Beautifulsoup recursive attribute

    - by Marcos Placona
    Hi, I'm trying to parse an XML file with BeautifulSoup, but I hit a brick wall when trying to use the "recursive" attribute with findAll(). I have a pretty odd XML format, shown below:

        <?xml version="1.0"?>
        <catalog>
          <book id="bk101">
            <author>Gambardella, Matthew</author>
            <title>XML Developer's Guide</title>
            <genre>Computer</genre>
            <price>44.95</price>
            <publish_date>2000-10-01</publish_date>
            <description>An in-depth look at creating applications with XML.</description>
            <catalog>true</catalog>
          </book>
          <book id="bk102">
            <author>Ralls, Kim</author>
            <title>Midnight Rain</title>
            <genre>Fantasy</genre>
            <price>5.95</price>
            <publish_date>2000-12-16</publish_date>
            <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
            <catalog>false</catalog>
          </book>
        </catalog>

    As you can see, the catalog tag repeats inside the book tag, which causes an error when I try something like this:

        from BeautifulSoup import BeautifulStoneSoup as BSS

        catalog = "catalog.xml"

        def open_rss():
            f = open(catalog, 'r')
            return f.read()

        def rss_parser():
            rss_contents = open_rss()
            soup = BSS(rss_contents)
            items = soup.findAll('catalog', recursive=False)
            for item in items:
                print item.title.string

        rss_parser()

    As you will see, on my soup.findAll I've added recursive=False, which in theory would make it not recurse through the item found, but skip to the next one. This doesn't seem to work, as I always get the following error:

        File "catalog.py", line 17, in rss_parser
            print item.title.string
        AttributeError: 'NoneType' object has no attribute 'string'

    I'm sure I'm doing something stupid here, and would appreciate it if someone could give me some help on how to solve this problem. Changing the XML structure is not an option, and this code needs to perform well as it will potentially parse a large XML file. Thanks in advance, Marcos
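    One possible explanation and workaround (untested sketch, based on the assumption that BeautifulStoneSoup treats a repeated tag name as closing the earlier open tag of the same name, which scrambles the tree): iterate over the <book> elements instead of the top-level <catalog>:

        # Untested sketch: search for the <book> elements directly, so the inner
        # <catalog> tags no longer get in the way.
        from BeautifulSoup import BeautifulStoneSoup as BSS

        def rss_parser():
            soup = BSS(open("catalog.xml").read())
            for book in soup.findAll('book'):
                if book.title is not None:       # guard against oddly nested entries
                    print book.title.string

        rss_parser()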

    Read the article

  • Extracting value in Beautifulsoup

    - by Seth
    I have the following code:

        f = open(path, 'r')
        html = f.read()  # no parameters => reads to eof and returns string
        soup = BeautifulSoup(html)
        schoolname = soup.findAll(attrs={'id':'ctl00_ContentPlaceHolder1_SchoolProfileUserControl_SchoolHeaderLabel'})
        print schoolname

    which gives:

        [<span id="ctl00_ContentPlaceHolder1_SchoolProfileUserControl_SchoolHeaderLabel">A B Paterson College, Arundel, QLD</span>]

    When I try to access the value (i.e. 'A B Paterson College, Arundel, QLD') by using schoolname['value'], I get the following error:

        print schoolname['value']
        TypeError: list indices must be integers, not str

    What am I doing wrong to get that value?
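    A possible fix (untested sketch): findAll returns a list, and the school name is the tag's text rather than a 'value' attribute:

        # Untested sketch: index into the result list and read the tag's text.
        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(html)     # html read from the file as in the question
        matches = soup.findAll(attrs={'id': 'ctl00_ContentPlaceHolder1_SchoolProfileUserControl_SchoolHeaderLabel'})
        if matches:
            print matches[0].string    # A B Paterson College, Arundel, QLD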

    Read the article

  • What is the return value of BeautifulSoup.find?

    - by prosseek
    I run the following to get some value as score:

        score = soup.find('div', attrs={'class' : 'summarycount'})

    Running 'print score' gives me:

        <div class="summarycount">524</div>

    I need to extract the number part. I used the re module but failed:

        m = re.search("[^\d]+(\d+)", score)
        TypeError: expected string or buffer
        function search in re.py at line 142
            return _compile(pattern, flags).search(string)

    What's the return type of the find function? How do I get the number from the score variable? Is there an easy way to let BeautifulSoup return the value (in this case 524) itself?
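    A possible answer (untested sketch): find returns a Tag object (or None if nothing matched), so take the tag's text before converting it to a number:

        # Untested sketch: find returns a Tag; its .string holds the text inside it.
        score = soup.find('div', attrs={'class': 'summarycount'})   # soup as in the question
        if score is not None:
            print int(score.string)    # 524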

    Read the article

  • Parsing with BeautifulSoup, error message TypeError: coercing to Unicode: need string or buffer, NoneType found

    - by Samsun Knight
    So I'm trying to scrape an Amazon page for data, and I'm getting an error when I try to parse for where the seller is located. Here's my code:

        #getting the html
        request = urllib2.Request('http://www.amazon.com/gp/offer-listing/0393934241/')
        opener = urllib2.build_opener()
        #hiding that I'm a webscraper
        request.add_header('User-Agent', 'Mozilla/5 (Solaris 10) Gecko')
        #opening it up, putting into soup form
        html = opener.open(request).read()
        soup = BeautifulSoup(html, "html5lib")

        #parsing for the seller info
        sellers = soup.findAll('div', {'class' : 'a-row a-spacing-medium olpOffer'})
        for eachseller in sellers:
            #parsing for price
            price = eachseller.find('span', {'class' : 'a-size-large a-color-price olpOfferPrice a-text-bold'})
            #parsing for shipping costs
            shippingprice = eachseller.find('span', {'class' : 'olpShippingPrice'})
            #parsing for condition
            condition = eachseller.find('span', {'class' : 'a-size-medium'})
            #parsing for seller name
            sellername = eachseller.find('b')
            #parsing for seller location
            location = eachseller.find('div', {'class' : 'olpAvailability'})
            #printing it all out
            print "price, " + price.string + ", shipping price, " + shippingprice.string + ", condition," + condition.string + ", seller name, " + sellername.string + ", location, " + location.string

    I get an error message pertaining to the 'print' command at the end: "TypeError: coercing to Unicode: need string or buffer, NoneType found". I know that it's coming from this line - location = eachseller.find('div', {'class' : 'olpAvailability'}) - because the code works fine without that line, and I know that I'm getting NoneType because the line isn't finding anything. Here's the HTML from the section I'm looking to parse:

        <div class="olpAvailability">
        In Stock. Ships from WI, United States.
        <br/><a href="/gp/aag/details/ref=olp_merch_ship_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=shipping-rates#aag_shipping">Domestic shipping rates</a> and
        <a href="/gp/aag/details/ref=olp_merch_return_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=returns#aag_returns">return policy</a>.
        </div>

    I don't see what the problem is with the 'location' line of code, or why it's not pulling the data I want. Help?
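    One possible explanation (untested sketch): the find may actually be succeeding, but .string is None whenever a tag has more than one child, and the olpAvailability div also contains <br/> and <a> children. Collecting all of the div's text, and guarding against a missing element, avoids the TypeError:

        # Untested sketch (bs4): gather the div's text nodes instead of using .string.
        location = eachseller.find('div', {'class': 'olpAvailability'})
        if location is not None:
            location_text = location.get_text(" ", strip=True)
        else:
            location_text = "unknown"
        print "location, " + location_text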

    Read the article

  • How to get these values with BeautifulSoup?

    - by Damiano
    Hello everybody, I have this HTML table:

        <table>
          <tr>
            <td class="datax">a</td>
            <td class="datax">b</td>
            <td class="datax">c</td>
            <td class="datax">d</td>
          </tr>
          <tr>
            <td class="datax">e</td>
            <td class="datax">f</td>
            <td class="datax">g</td>
            <td class="datax">h</td>
          </tr>
        </table>

    How do I get the second and the fourth value of each <tr>? If I do:

        bs.findAll('td', {'class':'datax'})

    I get:

        <td class="datax">a</td>
        <td class="datax">b</td>
        <td class="datax">c</td>
        <td class="datax">d</td>
        <td class="datax">e</td>
        <td class="datax">f</td>
        <td class="datax">g</td>
        <td class="datax">h</td>

    That's correct, but I would like to have this result:

        <td class="datax">b</td>
        <td class="datax">d</td>
        <td class="datax">f</td>
        <td class="datax">h</td>

    So the values I want are b, d, f and h (the second and the fourth <td> of each <tr>). Is it possible with the BeautifulSoup module? Thank you very much!
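    A possible approach (untested sketch): walk each row and slice its cells to keep the second and fourth ones (indexes 1 and 3):

        # Untested sketch: per-row slicing keeps every second cell: b, d, f, h.
        from BeautifulSoup import BeautifulSoup

        soup = BeautifulSoup(html)       # html holds the table from the question
        for tr in soup.findAll('tr'):
            for td in tr.findAll('td', {'class': 'datax'})[1::2]:
                print td.string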

    Read the article

  • How to unescape special characters from BeautifulSoup output?

    - by Suhail
    Hi, I am facing issues with special characters like ° and ®, which represent the degree Fahrenheit sign and the registered sign. When I print the string that contains the special characters, the output looks like this:

        Preheat oven to 350&deg; F
        Welcome to Lorem Ipsum Inc&reg;

    Is there a way I can output the actual characters and not their entity codes? Please let me know.
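    A possible fix (untested sketch, assuming BeautifulSoup 3): ask the parser to convert HTML entities to Unicode characters when the soup is built:

        # Untested sketch: convertEntities turns &deg; / &reg; into real characters.
        from BeautifulSoup import BeautifulSoup

        html = "Preheat oven to 350&deg; F<br/>Welcome to Lorem Ipsum Inc&reg;"
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
        text = u'\n'.join(soup.findAll(text=True))
        print text.encode('utf-8')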

    Read the article

  • How to move tags in BeautifulSoup (Python)

    - by JJ
    I have a partially converted XML document in soup, coming from HTML. After some replacement and editing in the soup, the body is essentially:

        <Text...></Text>   # This replaces <a href..> tags but automatically creates the </Text>
        <p class=norm ...</p>
        <p class=norm ...</p>
        <Text...></Text>
        <p class=norm ...</p>

    and so forth. I need to "move" the <p> tags to be children of <Text>, or know how to suppress the </Text>. I want:

        <Text...>
        <p class=norm ...</p>
        <p class=norm ...</p>
        </Text>
        <Text...>
        <p class=norm ...</p>
        </Text>

    I've tried using item.insert and item.append, but I'm thinking there must be a more elegant solution.

        for item in soup.findAll(['p','span']):
            if item.name == 'span' and item.has_key('class') and item['class'] == 'section':
                xBCV = short_2_long(item._getAttrMap().get('value',''))
                if currentnode:
                    pass
                currentnode = Tag(soup, 'Text', attrs=[('TypeOf', 'Section'),... ])
                item.replaceWith(currentnode)  # works but creates end tag
            elif item.name == 'p' and item.has_key('class') and item['class'] == 'norm':
                childcdatanode = None
                for ahref in item.findAll('a'):
                    if childcdatanode:
                        pass
                    newlink = filter_hrefs(str(ahref))
                    childcdatanode = Tag(soup, newlink)
                    ahref.replaceWith(childcdatanode)

    Thanks

    Read the article

  • beautifulsoup: find the n-th element's sibling

    - by deostroll
    I have a complex HTML DOM tree of the following nature:

        <table>
        ...
        <tr>
          <td>
          ...
          </td>
          <td>
            <table>
              <tr>
                <td>
                  <!-- inner most table -->
                  <table>
                  ...
                  </table>
                  <h2>This is hell!</h2>
                <td>
              </tr>
            </table>
          </td>
        </tr>
        </table>

    I have some logic to find out the innermost table. But after having found it, I need to get the next sibling element (h2). Is there any way to do this?
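    A possible approach (untested sketch): once the innermost table Tag is in hand, findNextSibling walks forward among its siblings, optionally filtered by tag name:

        # Untested sketch: a minimal nested-table example standing in for the real tree.
        from BeautifulSoup import BeautifulSoup

        html = ('<table><tr><td>'
                '<table><tr><td>inner</td></tr></table>'
                '<h2>This is hell!</h2>'
                '</td></tr></table>')
        soup = BeautifulSoup(html)
        inner_table = soup.findAll('table')[-1]        # stands in for the existing innermost-table logic
        heading = inner_table.findNextSibling('h2')
        print heading.string                           # This is hell!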

    Read the article

  • How do I make BeautifulSoup parse the contents of textarea tags as HTML?

    - by brofield
    Before 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats it as text. The document I am parsing has HTML inside the textarea tags, and I am trying to process it. I've tried:

        for textarea in soup.findAll('textarea'):
            contents = BeautifulSoup.BeautifulSoup(textarea.contents)
            textarea.replaceWith(contents.html(text=True))

    But I'm getting errors. I can't find this in the documentation, and the alternative parsers aren't helping. Does anyone know how I can parse the textareas as HTML?
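    A possible approach (untested sketch): parse each textarea's text content as a fresh document and swap the parsed tree in for the original element. Note that textarea.contents is a list, so the text has to be joined first:

        # Untested sketch: re-parse the textarea's inner text and splice it back in.
        # soup is built from the original document, as in the question.
        import BeautifulSoup

        for textarea in soup.findAll('textarea'):
            inner_html = ''.join(textarea.findAll(text=True))      # raw text inside the tag
            parsed = BeautifulSoup.BeautifulSoup(inner_html)
            textarea.replaceWith(parsed)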

    Read the article

1 2 3 4 5  | Next Page >