Search Results

Search found 101 results on 5 pages for 'beautifulsoup'.

Page 2/5 | < Previous Page | 1 2 3 4 5 | Next Page >

Is it possibile to modify a link value with Beautifulsoup without recreating the all link?

- by systempuntoout

Starting from an Html input like this: <p> <a href="http://www.foo.com" rel="nofollow">this is foo</a> <a href="http://www.bar.com" rel="nofollow">this is bar</a> </p> is it possible to modify the <a> node values ("this i foo" and "this is bar") adding the suffix "PARSED" to the value without recreating the all link? The result need to be like this: <p> <a href="http://www.foo.com" rel="nofollow">this is foo_PARSED</a> <a href="http://www.bar.com" rel="nofollow">this is bar_PARSED</a> </p> And code should be something like: from BeautifulSoup import BeautifulSoup soup = BeautifulSoup(html) for link_tag in soup.findAll('a'): link_tag.string = link_tag.string + '_PARSED' #This obviously does not work

Read the article
Is it possible for BeautifulSoup to work in a case-insensitive manner?

- by Nitin

I am trying to extract Meta Description for fetched webpages. But here I am facing the problem of case sensitivity of BeautifulSoup. As some of the pages have <meta name="Description and some have <meta name="description. My problem is very much similar to that of Question on Stackoverflow The only difference is that I can't use lxml .. I have to stick with Beautifulsoup.

Read the article
BeautifulSoup HTMLParseError. What's wrong with this?

- by user1915496

This is my code: from bs4 import BeautifulSoup as BS import urllib2 url = "http://services.runescape.com/m=news/recruit-a-friend-for-free-membership-and-xp" res = urllib2.urlopen(url) soup = BS(res.read()) other_content = soup.find_all('div',{'class':'Content'})[0] print other_content Yet an error comes up: /Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help. "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.")) Traceback (most recent call last): File "web.py", line 5, in <module> soup = BS(res.read()) File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__ self._feed() File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed self.builder.feed(self.markup) File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed raise e I've let two other people use this code, and it works for them perfectly fine. Why is it not working for me? I have bs4 installed...

Read the article
Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)

- by Michael Altman

I have the following HTML that is within a larger document <br /> Important Text 1 <br /> <br /> Not Important Text <br /> Important Text 2 <br /> Important Text 3 <br /> <br /> Non Important Text <br /> Important Text 4 <br /> I'm currently using BeautifulSoup to obtain other elements within the HTML, but I have not been able to find a way to get the important lines of text between <br /> tags. I can isolate and navigate to each of the <br /> elements, but can't find a way to get the text in between. Any help would be greatly appreciated. Thanks.

Read the article
Python web scraping involving HTML tags with attributes

- by rohanbk

I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </body> </html> I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do? At the moment, my code looks like what is below: import re import urllib2,sys import lxml from lxml import etree from lxml.html.soupparser import fromstring from lxml.etree import tostring from lxml.cssselect import CSSSelector from BeautifulSoup import BeautifulSoup, NavigableString address='http://www.example.com/' html = urllib2.urlopen(address).read() soup = BeautifulSoup(html) html=soup.prettify() html=html.replace('&nbsp', '&#160') html=html.replace('&iacute','&#237') root=fromstring(html) I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in more source file. EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in page that I want to scrape.

Read the article
Beautiful Soup Unicode encode error

- by iamrohitbanga

I am trying the following code with a particular HTML file from BeautifulSoup import BeautifulSoup import re import codecs import sys f = open('test1.html') html = f.read() soup = BeautifulSoup(html) body = soup.body.contents para = soup.findAll('p') print str(para).encode('utf-8') I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128) How do I debug this?

Read the article
Python beautiful soup arguments

- by scott

Hi I have this code that fetches some text from a page using BeautifulSoup soup= BeautifulSoup(html) body = soup.find('div' , {'id':'body'}) print body I would like to make this as a reusable function that takes in some htmltext and the tags to match it like the following def parse(html, atrs): soup= BeautifulSoup(html) body = soup.find(atrs) return body But if i make a call like this parse(htmlpage, ('div' , {'id':'body'}")) or like parse(htmlpage, ['div' , {'id':'body'}"]) I get only the div element, the body attribute seems to get ignored. Is there a way to fix this?

Read the article
optimize python code

- by user283405

i have code that uses BeautifulSoup library for parsing. But it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me about this? I am using beautifulsoup library for parsing and than save in DB. if i comment the save statement, than still it takes time so there is no problem with database. def parse(self,text): soup = BeautifulSoup(text) arr = soup.findAll('tbody') for i in range(0,len(arr)-1): data=Data() soup2 = BeautifulSoup(str(arr[i])) arr2 = soup2.findAll('td') c=0 for j in arr2: if str(j).find("<a href=") > 0: data.sourceURL = self.getAttributeValue(str(j),'<a href="') else: if c == 2: data.Hits=j.renderContents() #and few others... #... c = c+1 data.save() Any suggestions? Note: I already ask this question here but that was closed due to incomplete information.

Read the article
Python Beautiful Soup .content Property

- by Robert Birch

What does BeautifulSoup's .content do? I am working through crummy.com's tutorial and I don't really understand what .content does. I have looked at the forums and I have not seen any answers. Looking at the code below.... from BeautifulSoup import BeautifulSoup import re doc = ['<html><head><title>Page title</title></head>', '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', '<p id="secondpara" align="blah">This is paragraph <b>two</b>.', '</html>'] soup = BeautifulSoup(''.join(doc)) print soup.contents[0].contents[0].contents[0].contents[0].name I would expect the last line of the code to print out 'body' instead of... File "pe_ratio.py", line 29, in <module> print soup.contents[0].contents[0].contents[0].contents[0].name File "C:\Python27\lib\BeautifulSoup.py", line 473, in __getattr__ raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) AttributeError: 'NavigableString' object has no attribute 'name' Is .content only concerned with html, head and title? If, so why is that? Thanks for the help in advance.

Read the article
Python BeautifulSoup Print Info in CSV

- by Codin

I can print the information I am pulling from a site with no problem. But when I try to place the street names in one column and the zipcodes into another column into a CSV file that is when I run into problems. All I get in the CSV is the two column names and every thing in its own column across the page. Here is my code. Also I am using Python 2.7.5 and Beautiful soup 4 from bs4 import BeautifulSoup import csv import urllib2 url="http://www.conakat.com/states/ohio/cities/defiance/road_maps/" page=urllib2.urlopen(url) soup = BeautifulSoup(page.read()) f = csv.writer(open("Defiance Steets1.csv", "w")) f.writerow(["Name", "ZipCodes"]) # Write column headers as the first line links = soup.find_all(['i','a']) for link in links: names = link.contents[0] print unicode(names) f.writerow(names)

Read the article
From escaped html -> to regular html? - Python

- by RadiantHex

Hi folks, I used BeautifulSoup to handle XML files that I have collected through a REST API. The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely. Unfortunately I need the HTML code. How would I go on about transforming the escaped HTML into proper markup? Help would be very much appreciated!

Read the article
Massage with BeatifulSoup or clean with Regex

- by Mridang Agarwalla

Could someone tell me whats a better way to clean up bad HTML so BeautifulSoup can handle it - should one use the massage methods of BeautifulSoup or clean it up using regular expressions? Thanks.

Read the article
scraping text from multiple html files into a single csv file

- by Lulu

I have just over 1500 html pages (1.html to 1500.html). I have written a code using Beautiful Soup that extracts most of the data I need but "misses" out some of the data within the table. My Input: e.g file 1500.html My Code: #!/usr/bin/env python import glob import codecs from BeautifulSoup import BeautifulSoup with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile: for file in glob.glob('*html*'): print 'Processing', file soup = BeautifulSoup(open(file).read()) rows = soup.findAll('tr') for tr in rows: cols = tr.findAll('td') #print >> csvfile,"#".join(col.string for col in cols) #print >> csvfile,"#".join(td.find(text=True)) for col in cols: print >> csvfile, col.string print >> csvfile, "===" print >> csvfile, "***" Output: One CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some data, e.g the Address1 and Address 2 data at the start of the table do not come out. I modified the code to put in * and === separators, I then use perl to put into a clean csv file, unfortunately I'm not sure how to work my code to get all the data I'm looking for!

Read the article
Beautiful soup how print a tag while iterating over it .

- by Bunny Rabbit

<?xml version="1.0" encoding="UTF-8"?> <playlist version="1" xmlns="http://xspf.org/ns/0/"> <trackList> <track> <location>file:///home/ashu/Music/Collections/randomPicks/ipod%20on%20sep%2009/Coldplay-Sparks.mp3</location> <title>Coldplay-Sparks</title> </track> <track> <location>file:///home/ashu/Music/Collections/randomPicks/gud%201s/Coldplay%20Warning%20sign.mp3</location> <title>Coldplay Warning sign</title> </track>.... My xml looks like this , i want to get the locations, i am trying from BeautifulSoup import BeautifulSoup as bs soup = bs (the_above_xml_text) for track in soup.tracklist: print track.location.string but that is not working because i am getting AttributeError: 'NavigableString' object has no attribute 'location' how can i achive the result , thanks in advance.

Read the article
How to convert Beautiful Soup Unicode into a decimal value?

- by MikeTheCoder

I'm trying to Use python's Beautiful Soup Library to grab a bunch of divs from an html file, and from there get the string - which is a money value - that's inside the div. Then remove the dollar sign and convert it to a decimal so that I can use a greater than and less than conditional statement to compare values. I have googled the heck out of it and can't seem to come up with a way to convert this unicode string into a decimal value. I really could use some help here. How do I convert unicode into a decimal value? This was my last attempt: import unicodedata from bs4 import BeautifulSoup soup = BeautifulSoup(open("/Users/sm/Documents/python/htmldemo.html")) for tag in soup.findAll("div",attrs={"itemprop":"price"}) : val = tag.string new_val = val[8:] workable = int(new_val) if workable > 250: print(type(workable)) else: print(type(workable)) Edit: When I print the type of new_val I get : print(type(new_val))

Read the article
Need help specifying a ending while condition

- by johnthexiii

I have written a Python script to download all of the xkcd comic images. The only problem is I can't tell it to stop when it gets to the last one... Here is what I have so far. import re, mechanize from urllib import urlretrieve from BeautifulSoup import BeautifulSoup as bs baseUrl = "http://xkcd.com/1/" #Specify the first comic page br = mechanize.Browser() #Create a browser response = br.open(baseUrl) #Create an initial response x = 1 #Assign an initial file name while (SomeCondition): soup = bs(response.get_data()) #Create an instance of bs that contains the response data img = soup.findAll('img')[1] #Get the online file path of the image localFile = "C:\\Comics\\xkcd\\" + str(x) + ".jpg" #Come up with a local file name urlretrieve(img["src"], localFile) #Download the image file response = br.follow_link(text = "Next >") #Store the response of the next button x += 1 #Increase x by 1 print "All xkcd comics downloaded" #Let the user know the images have been downloaded Initially what I had was something like while br.follow_link(text = "Next >") != br.follow_link(text = ">|"): but by doing this I actually send skip to the last page before the script has a chance to perform the intended purpose.

Read the article
How to append a tag after a link with BeaufitulSoup

- by systempuntoout

Starting from an Html input like this: <p> <a href="http://www.foo.com">this if foo</a> <a href="http://www.bar.com">this if bar</a> </p> using BeautifulSoup, i would like to change this Html in: <p> <a href="http://www.foo.com">this if foo</a><b>OK</b> <a href="http://www.bar.com">this if bar</a><b>OK</b> </p> Is it possible to do this using BeautifulSoup?

Read the article
I want to find the span tag beween the LI tag and its attributes but no luck.

- by Mahesh

I want to find the span tag beween the LI tag and its attributes. Trying with beautful soap but no luck. Details of my code. Is any one point me right methodlogy In this this code, my getId function should return me id = "0_False-2" Any one know right method? from BeautifulSoup import BeautifulSoup as bs import re html = '<ul>\ <li class="line"> </li>\ <li class="folder-open-last" id="0">\ <img style="float: left;" class="trigger" src="/media/images/spacer.gif" border="0">\ <span class="text" id="0_False">NOC</span><ul style="display: block;"><li class="line"> </li><li class="doc" id="1"><span class="active text" id="0_False-1">PNQAIPMS1</span></li><li class="line"> </li><li class="doc-last" id="2"><span class="text" id="0_False-2">PNQAIPMS2</span></li><li class="line-last"></li></ul></li><li class="line-last"></li>\ </ul>' def getId(html, txt): soup = bs(html) soup.findAll('ul',recursive=False) head = soup.contents[0] temp = head elements = {} while True: # It temp is None that means no HTML tags are available if temp == None: break #print temp if re.search('li', str( temp)) != None: attr = str(temp.attrs).encode('ascii','ignore') attr = attr.replace(' ', '') attr = attr.replace('[', '') attr = attr.replace(']', '') attr = attr.replace(')', '') attr = attr.replace('(', '') attr = attr.replace('u\'', '') attr = attr.replace('\'', '') attr = attr.split(',') span = str(temp.text) if span == txt: return attr[3] temp = temp.next else: temp = temp.next id = getId(html,"PNQAIPMS2") print "ID = " + id

Read the article
Generate a table of contents from HTML with Python

- by Oli

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags. My plan so far was to: Extract a list of headers using beautifulsoup Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- There might be a method for replacing inside beautifulsoup? Output a nested list of links to the headers in a predefined spot. It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear. Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel? A example: <p>This is an introduction</p> <h2>This is a sub-header</h2> <p>...</p> <h3>This is a sub-sub-header</h3> <p>...</p> <h2>This is a sub-header</h2> <p>...</p>

Read the article
Prettify working using python idle but not when using python script

- by Loclip

So when I use print soup.prettify() on idle it's prettifying my html but when I calling the script from server the prettify() not working #!/usr/bin/python import cgi, cgitb, urllib2, sys from bs4 import BeautifulSoup styles = [] i = 1 site = "www.example.com" page = urllib2.urlopen(site) soup = BeautifulSoup(page) table = soup.find('table', {'class' :'style5'}) alltd = table.findAll('td') allspans = table.findAll('span') while i<55: styles.append("style" + str(i)) i+=1 for td in alltd: if any(x in td["class"] for x in styles): td["class"] = "align" for span in allspans: span.replaceWithChildren() print "Content-type: text/html" print print "<!DOCTYPE html>" print "<html>" print "<head>" print '<meta http-equiv="content-type" content="text/html; charset=utf-8">' print '<link rel="stylesheet" href="lightbox/css/lightbox.css" type="text/css" media="screen">' print '<style type="text/css">' print "body {background-color:#b0c4de;}" print '.align {text-align: center;}' print "</style>" print "</head>" print "<body>" print table.pretiffy() print "</body>" print "</html>"

Read the article
grabbing a substring while scraping with Python2.6

- by Diego

Hey can someone help with the following? I'm trying to scrape a site that has the following information.. I need to pull just the number after the </strong> tag.. [<li><strong>ISBN-13:</strong> 9780375853401</li>, <li><strong>Pub. Date: </strong> 05/11/2010</li>] [<li><strong>UPC:</strong> 490355000372</li>, <li><strong>Catalog No:</strong> 15024/25</li>, <li><strong>Label:</strong> CAMERATA</li>] here's a piece of the code I've been using to grab the above data using mechanize and BeautifulSoup. I'm stuck here as it won't let me use the find() function for a list br_results = mechanize.urlopen(br_results) html = br_results.read() soup = BeautifulSoup(html) local_links = soup.findAll("a", {"class" : "down-arrow csa"}) upc_code = soup.findAll("ul", {"class" : "bc-meta3"}) for upc in upc_code: upc_text = upc.contents.contents print upc_text

Read the article
urllib2.Request() with data returns empty url

- by Mr. Polywhirl

My main concern is the function: getUrlAndHtml() If I manually build and append the query to the end of the uri, I can get the response.url(), but if I pass a dictionary as the request data, the url does not come back. Is there anyway to guarantee the redirected url? In my example below, if thisWorks = True I get back a url, but the returned url is the request url as opposed to a redirect link. On a sidenote, the encoding for .E2.80.93 does not translate to - for some reason? #!/usr/bin/python import pprint import urllib import urllib2 from bs4 import BeautifulSoup from sys import argv URL = 'http://en.wikipedia.org/w/index.php?' def yesOrNo(boolVal): return 'yes' if boolVal else 'no' def getTitleFromRaw(page): return page.strip().replace(' ', '_') def getUrlAndHtml(title, printable=False): thisWorks = False if thisWorks: query = 'title={:s}&printable={:s}'.format(title, yesOrNo(printable)) opener = urllib2.build_opener() opener.addheaders = [('User-agent', 'Mozilla/5.0')] response = opener.open(URL + query) else: params = {'title':title,'printable':yesOrNo(printable)} data = urllib.urlencode(params) headers = {'User-agent':'Mozilla/5.0'}; request = urllib2.Request(URL, data, headers) response = urllib2.urlopen(request) return response.geturl(), response.read() def getSoup(html, name=None, attrs=None): soup = BeautifulSoup(html) if name is None: return None return soup.find(name, attrs) def setTitle(soup, newTitle): title = soup.find('div', {'id':'toctitle'}) h2 = title.find('h2') h2.contents[0].replaceWith('{:s} for {:s}'.format(h2.getText(), newTitle)) def updateLinks(soup, url): fragment = '#' for a in soup.findAll('a', href=True): a['href'] = a['href'].replace(fragment, url + fragment) def writeToFile(soup, filename='out.html', indentLevel=2): with open(filename, 'wt') as out: pp = pprint.PrettyPrinter(indent=indentLevel, stream=out) pp.pprint(soup) print('Wrote {:s} successfully.'.format(filename)) if __name__ == '__main__': def exitPgrm(): print('usage: {:s} "<PAGE>" <FILE>'.format(argv[0])) exit(0) if len(argv) == 2: help = argv[1] if help == '-h' or help == '--help': exitPgrm() if False:''' if not len(argv) == 3: exitPgrm() ''' page = 'Led Zeppelin' # argv[1] filename = 'test.html' # argv[2] title = getTitleFromRaw(page) url, html = getUrlAndHtml(title) soup = getSoup(html, 'div', {'id':'toc'}) setTitle(soup, page) updateLinks(soup, url) writeToFile(soup, filename)

Read the article
Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

- by Diego

Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site-owner (barnesandnoble.com).. i'm building a site that would bring them more sales, not sure why they would deny access at a certain depth. I'm using mechanize and BeautifulSoup on Python2.6. hoping for a work-around

Read the article
optimize python code

- by user283405

i have code that uses beautifulsoup library for parsing. But it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me about this?

Read the article
Python beautifulsoup trying to remove html tags 'span'

- by Michelle Jun Lee

I am trying to remove [<span class="street-address"> 510 E Airline Way </span>] and I have used this clean function to remove the one that is in between < > def clean(val): if type(val) is not StringType: val = str(val) val = re.sub(r'<.*?>', '',val) val = re.sub("\s+" , " ", val) return val.strip() and it produces [ 510 E Airline Way ]` i am trying to add within "clean" function to remove the char '[' and ']' and basically i just want to get the "510 E Airline Way". anyone has any clue what can i add to clean function? thank you

Read the article

< Previous Page | 1 2 3 4 5 | Next Page >