Search Results

Search found 117 results on 5 pages for 'soup'.

Page 2/5 | < Previous Page | 1 2 3 4 5  | Next Page >

  • Removing an Element on Certain Location

    - by Chicken Soup
    Let's say the window's location is on htt://stackoverflow.com/index.php, I want to remove an element in the index page with jQuery. This is what I have and it's not working: $(document).ready(function() { var location = window.location; var locQuery = /index/i; if (location.match(locQuery)) { $('.someClass').removeClass(); } });

    Read the article

  • Using sqlalchemy to query using multiple column where in clause

    - by crunkchitis
    I'm looking to execute this query using sqlalchemy. SELECT name, age, favorite_color, favorite_food FROM kindergarten_classroom WHERE (favorite_color, favorite_food) IN (('lavender','lentil soup'),('black','carrot juice')); I only want kids that like (lavender AND lentil soup) OR (black and carrot juice). This is similar, but doesn't get me all of the way there: Sqlalchemy in clause

    Read the article

  • How to use Regular Expression to extract information from a HTML webpage?

    - by user569248
    How to use Regular Expression to extract the answer "Here is the answer" from a HTML webpage like this? <b>Last Question:</b> <b>Here is the answer</b> ..:: Update ::.. Thanks everybody! Here is my solution by using BeautifulSoup since I'm using Python framework: response = opener.open(url) the_page = response.read() soup = BeautifulSoup(''.join(the_page)) paraText1 = soup.body.find('div', 'div_id', text = u'Last Question:') if paraText1: answer = paraText1.next

    Read the article

  • Parsing with BeautifulSoup, error message TypeError: coercing to Unicode: need string or buffer, NoneType found

    - by Samsun Knight
    so I'm trying to scrape an Amazon page for data, and I'm getting an error when I try to parse for where the seller is located. Here's my code: #getting the html request = urllib2.Request('http://www.amazon.com/gp/offer-listing/0393934241/') opener = urllib2.build_opener() #hiding that I'm a webscraper request.add_header('User-Agent', 'Mozilla/5 (Solaris 10) Gecko') #opening it up, putting into soup form html = opener.open(request).read() soup = BeautifulSoup(html, "html5lib") #parsing for the seller info sellers = soup.findAll('div', {'class' : 'a-row a-spacing-medium olpOffer'}) for eachseller in sellers: #parsing for price price = eachseller.find('span', {'class' : 'a-size-large a-color-price olpOfferPrice a-text-bold'}) #parsing for shipping costs shippingprice = eachseller.find('span' , {'class' : 'olpShippingPrice'}) #parsing for condition condition = eachseller.find('span', {'class' : 'a-size-medium'}) #parsing for seller name sellername = eachseller.find('b') #parsing for seller location location = eachseller.find('div', {'class' : 'olpAvailability'}) #printing it all out print "price, " + price.string + ", shipping price, " + shippingprice.string + ", condition," + condition.string + ", seller name, " + sellername.string + ", location, " + location.string I get the error message, pertaining to the 'print' command at the end, "TypeError: coercing to Unicode: need string or buffer, NoneType found" I know that it's coming from this line - location = eachseller.find('div', {'class' : 'olpAvailability'}) - because the code works fine without that line, and I know that I'm getting NoneType because the line isn't finding anything. Here's the html from the section I'm looking to parse: <*div class="olpAvailability"> In Stock. Ships from WI, United States. <*br/><*a href="/gp/aag/details/ref=olp_merch_ship_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=shipping-rates#aag_shipping">Domestic shipping rates</a> and <*a href="/gp/aag/details/ref=olp_merch_return_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=returns#aag_returns">return policy</a>. <*/div> (but without the stars - just making sure the HTML doesn't compile out of code form) I don't see what's the problem with the 'location' line of code, or why it's not pulling the data I want. Help?

    Read the article

  • scraping text from multiple html files into a single csv file

    - by Lulu
    I have just over 1500 html pages (1.html to 1500.html). I have written a code using Beautiful Soup that extracts most of the data I need but "misses" out some of the data within the table. My Input: e.g file 1500.html My Code: #!/usr/bin/env python import glob import codecs from BeautifulSoup import BeautifulSoup with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile: for file in glob.glob('*html*'): print 'Processing', file soup = BeautifulSoup(open(file).read()) rows = soup.findAll('tr') for tr in rows: cols = tr.findAll('td') #print >> csvfile,"#".join(col.string for col in cols) #print >> csvfile,"#".join(td.find(text=True)) for col in cols: print >> csvfile, col.string print >> csvfile, "===" print >> csvfile, "***" Output: One CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some data, e.g the Address1 and Address 2 data at the start of the table do not come out. I modified the code to put in * and === separators, I then use perl to put into a clean csv file, unfortunately I'm not sure how to work my code to get all the data I'm looking for!

    Read the article

  • I want to find the span tag beween the LI tag and its attributes but no luck.

    - by Mahesh
    I want to find the span tag beween the LI tag and its attributes. Trying with beautful soap but no luck. Details of my code. Is any one point me right methodlogy In this this code, my getId function should return me id = "0_False-2" Any one know right method? from BeautifulSoup import BeautifulSoup as bs import re html = '<ul>\ <li class="line">&nbsp;</li>\ <li class="folder-open-last" id="0">\ <img style="float: left;" class="trigger" src="/media/images/spacer.gif" border="0">\ <span class="text" id="0_False">NOC</span><ul style="display: block;"><li class="line">&nbsp;</li><li class="doc" id="1"><span class="active text" id="0_False-1">PNQAIPMS1</span></li><li class="line">&nbsp;</li><li class="doc-last" id="2"><span class="text" id="0_False-2">PNQAIPMS2</span></li><li class="line-last"></li></ul></li><li class="line-last"></li>\ </ul>' def getId(html, txt): soup = bs(html) soup.findAll('ul',recursive=False) head = soup.contents[0] temp = head elements = {} while True: # It temp is None that means no HTML tags are available if temp == None: break #print temp if re.search('li', str( temp)) != None: attr = str(temp.attrs).encode('ascii','ignore') attr = attr.replace(' ', '') attr = attr.replace('[', '') attr = attr.replace(']', '') attr = attr.replace(')', '') attr = attr.replace('(', '') attr = attr.replace('u\'', '') attr = attr.replace('\'', '') attr = attr.split(',') span = str(temp.text) if span == txt: return attr[3] temp = temp.next else: temp = temp.next id = getId(html,"PNQAIPMS2") print "ID = " + id

    Read the article

  • How to store multiple cookies through PHP Curl

    - by Ahmad
    'SOUP.IO' is not providing any api. So Iam trying to use 'PHP Curl' to login and submit data through PHP. Iam able to login the website successfully(through cUrl), but when I try to submit data through cUrl, it gives me error of 'invalid user'. When I tried to analysed the code and website, I came to know that cUrl is getting values of only 1-2 cookies. Where as when I open the same page in FireFox, it shows me 6-7 cookies related to 'SOUP.IO'. Can some one guide me how to get all these 7 cookies values. Following cookies are getable by cUrl: soup_session_id Following cookies are shown in Firefox (not through cUrl): __qca, __utma, __utmb, __utmc, __utmz Following is my cUrl code: $cookie_file_path = getcwd()."/cookie/cookie.txt"; $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, 'http://www.soup.io'); curl_setopt($ch, CURLOPT_VERBOSE, 1); curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE); curl_setopt($ch, CURLOPT_HEADER, TRUE); curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate'); curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path); curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path); curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 (.NET CLR 3.5.30729) FirePHP/0.4'); curl_setopt($ch, CURLOPT_MAXREDIRS, 10); curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); $result = curl_exec($ch); curl_close($ch); print_r($result); ? Can some one guide me in this regards Thanks in advance

    Read the article

  • grabbing a substring while scraping with Python2.6

    - by Diego
    Hey can someone help with the following? I'm trying to scrape a site that has the following information.. I need to pull just the number after the </strong> tag.. [<li><strong>ISBN-13:</strong> 9780375853401</li>, <li><strong>Pub. Date: </strong> 05/11/2010</li>] [<li><strong>UPC:</strong> 490355000372</li>, <li><strong>Catalog No:</strong> 15024/25</li>, <li><strong>Label:</strong> CAMERATA</li>] here's a piece of the code I've been using to grab the above data using mechanize and BeautifulSoup. I'm stuck here as it won't let me use the find() function for a list br_results = mechanize.urlopen(br_results) html = br_results.read() soup = BeautifulSoup(html) local_links = soup.findAll("a", {"class" : "down-arrow csa"}) upc_code = soup.findAll("ul", {"class" : "bc-meta3"}) for upc in upc_code: upc_text = upc.contents.contents print upc_text

    Read the article

  • Prettify working using python idle but not when using python script

    - by Loclip
    So when I use print soup.prettify() on idle it's prettifying my html but when I calling the script from server the prettify() not working #!/usr/bin/python import cgi, cgitb, urllib2, sys from bs4 import BeautifulSoup styles = [] i = 1 site = "www.example.com" page = urllib2.urlopen(site) soup = BeautifulSoup(page) table = soup.find('table', {'class' :'style5'}) alltd = table.findAll('td') allspans = table.findAll('span') while i<55: styles.append("style" + str(i)) i+=1 for td in alltd: if any(x in td["class"] for x in styles): td["class"] = "align" for span in allspans: span.replaceWithChildren() print "Content-type: text/html" print print "<!DOCTYPE html>" print "<html>" print "<head>" print '<meta http-equiv="content-type" content="text/html; charset=utf-8">' print '<link rel="stylesheet" href="lightbox/css/lightbox.css" type="text/css" media="screen">' print '<style type="text/css">' print "body {background-color:#b0c4de;}" print '.align {text-align: center;}' print "</style>" print "</head>" print "<body>" print table.pretiffy() print "</body>" print "</html>"

    Read the article

  • Python: Removing particular character (u"\u2610") from string

    - by duhaime
    I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different list of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character. (You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.) To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove. for work in glob.glob(pathtofiles): openfile = open(work) readfile = openfile.read() stringfile = str(readfile) decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line? soup = BeautifulSoup(decodefile) textwithtags = soup.findAll('text') textwithtagsasstring = str(textwithtags) #this method strips everything between anglebrackets as it should textwithouttags = stripTags(textwithtagsasstring) #clean text nonewlines = textwithouttags.replace("\n", " ") noextrawhitespace = re.sub(' +',' ', nonewlines) print noextrawhitespace #the boxes appear I tried to remove the boxes by using noboxes = noextrawhitespace.replace(u"\u2610", "") But Python threw an error flag: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128) Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.

    Read the article

  • Getting BeautifulSoup to find a specific <p>

    - by Ryan
    I'm trying to put together a basic HTML scraper for a variety of scientific journal websites, specifically trying to get the abstract or introductory paragraph. The current journal I'm working on is Nature, and the article I've been using as my sample can be seen at http://www.nature.com/nature/journal/v463/n7284/abs/nature08715.html. I can't get the abstract out of that page, however. I'm searching for everything between the <p class="lead">...</p> tags, but I can't seem to figure out how to isolate them. I thought it would be something simple like from BeautifulSoup import BeautifulSoup import re import urllib2 address="http://www.nature.com/nature/journal/v463/n7284/full/nature08715.html" html = urllib2.urlopen(address).read() soup = BeautifulSoup(html) abstract = soup.find('p', attrs={'class' : 'lead'}) print abstract Using Python 2.5, BeautifulSoup 3.0.8, running this returns 'None'. I have no option of using anything else that needs to be compiled/installed (like lxml). Is BeautifulSoup confused, or am I?

    Read the article

  • Extracting an attribute value with beautifulsoup

    - by Barnabe
    I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code: import urllib f = urllib.urlopen("http://58.68.130.147") s = f.read() f.close() from BeautifulSoup import BeautifulStoneSoup soup = BeautifulStoneSoup(s) inputTag = soup.findAll(attrs={"name" : "stainfo"}) output = inputTag['value'] print str(output) I get a TypeError: list indices must be integers, not str even though from the Beautifulsoup documentation i understand that strings should not be a problem here... but i a no specialist and i may have misunderstood. Any suggestion is greatly appreciated! Thanks in advance.

    Read the article

  • Beautifulsoup recursive attribute

    - by Marcos Placona
    Hi, trying to parse an XML with Beautifulsoup, but hit a brick wall when trying to use the "recursive" attribute with findall() I have a pretty odd xml format shown below: <?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> <catalog>true</catalog> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> <catalog>false</catalog> </book> </catalog> As you can see, the catalog tag repeats inside the book tag, which causes an error when I try to to something like: from BeautifulSoup import BeautifulStoneSoup as BSS catalog = "catalog.xml" def open_rss(): f = open(catalog, 'r') return f.read() def rss_parser(): rss_contents = open_rss() soup = BSS(rss_contents) items = soup.findAll('catalog', recursive=False) for item in items: print item.title.string rss_parser() As you will see, on my soup.findAll I've added recursive=false, which in theory would make it no recurse through the item found, but skip to the next one. This doesn't seem to work, as I always get the following error: File "catalog.py", line 17, in rss_parser print item.title.string AttributeError: 'NoneType' object has no attribute 'string' I'm sure I'm doing something stupid here, and would appreciate if someone could give me some help on how to solve this problem. Changing the HTML structure is not an option, this this code needs to perform well as it will potentially parse a large XML file. Thanks in advance, Marcos

    Read the article

  • How to get the original variable name of variable passed to a function

    - by Acorn
    Is it possible to get the original variable name of a variable passed to a function? E.g. foobar = "foo" def func(var): print var.origname So that: func(foobar) Returns: >>foobar EDIT: All I was trying to do was make a function like: def log(soup): f = open(varname+'.html', 'w') print >>f, soup.prettify() f.close() .. and have the function generate the filename from the name of the variable passed to it. I suppose if it's not possible I'll just have to pass the variable and the variable's name as a string each time.

    Read the article

  • Error while trying to parse a website url using python . how to debug it ?

    - by mekasperasky
    #!/usr/bin/python import json import urllib from BeautifulSoup import BeautifulSoup from BeautifulSoup import BeautifulStoneSoup import BeautifulSoup def showsome(searchfor): query = urllib.urlencode({'q': searchfor}) url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query search_response = urllib.urlopen(url) search_results = search_response.read() results = json.loads(search_results) data = results['responseData'] print 'Total results: %s' % data['cursor']['estimatedResultCount'] hits = data['results'] print 'Top %d hits:' % len(hits) for h in hits: print ' ', h['url'] resp = urllib.urlopen(h['url']) res = resp.read() soup = BeautifulSoup(res) print soup.prettify() print 'For more results, see %s' % data['cursor']['moreResultsUrl'] showsome('sachin') What is the wrong in this code ? Note all the 4 links that I am getting out of the search , I am feeding it back to extract the contents out of it , and then use BeautifulSoup to parse it . How should I go about it ?

    Read the article

  • Optimizing BeautifulSoup (Python) code

    - by user283405
    I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me with this? I am using BeautifulSoup for parsing and than save into a DB. If I comment out the save statement, it still takes a long time, so there is no problem with the database. def parse(self,text): soup = BeautifulSoup(text) arr = soup.findAll('tbody') for i in range(0,len(arr)-1): data=Data() soup2 = BeautifulSoup(str(arr[i])) arr2 = soup2.findAll('td') c=0 for j in arr2: if str(j).find("<a href=") > 0: data.sourceURL = self.getAttributeValue(str(j),'<a href="') else: if c == 2: data.Hits=j.renderContents() #and few others... c = c+1 data.save() Any suggestions? Note: I already ask this question here but that was closed due to incomplete information.

    Read the article

  • Regular expressions in python unicode

    - by Remy
    I need to remove all the html tags from a given webpage data. I tried this using regular expressions: import urllib2 import re page = urllib2.urlopen("http://www.frugalrules.com") from bs4 import BeautifulSoup, NavigableString, Comment soup = BeautifulSoup(page) link = soup.find('link', type='application/rss+xml') print link['href'] rss = urllib2.urlopen(link['href']).read() souprss = BeautifulSoup(rss) description_tag = souprss.find_all('description') content_tag = souprss.find_all('content:encoded') print re.sub('<[^>]*>', '', content_tag) But the syntax of the re.sub is: re.sub(pattern, repl, string, count=0) So, I modified the code as (instead of the print statement above): for row in content_tag: print re.sub(ur"<[^>]*>",'',row,re.UNICODE But it gives the following error: Traceback (most recent call last): File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module> print re.sub(ur"<[^>]*>",'',row,re.UNICODE) File "C:\Python27\lib\re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or buffer What am I doing wrong?

    Read the article

  • optimize python code

    - by user283405
    i have code that uses BeautifulSoup library for parsing. But it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me about this? I am using beautifulsoup library for parsing and than save in DB. if i comment the save statement, than still it takes time so there is no problem with database. def parse(self,text): soup = BeautifulSoup(text) arr = soup.findAll('tbody') for i in range(0,len(arr)-1): data=Data() soup2 = BeautifulSoup(str(arr[i])) arr2 = soup2.findAll('td') c=0 for j in arr2: if str(j).find("<a href=") > 0: data.sourceURL = self.getAttributeValue(str(j),'<a href="') else: if c == 2: data.Hits=j.renderContents() #and few others... #... c = c+1 data.save() Any suggestions? Note: I already ask this question here but that was closed due to incomplete information.

    Read the article

  • BeautifulSoup Parser Confusion - HTML

    - by lyngbym
    I'm trying to scrape some content off another site and I'm not sure why BeautifulSoup is producing this output. It is only finding a blank space inside the match, but the real HTML contains a large amount of markup. I apologize if this is something stupid on my part. I'm new to python. Here's my code: import sys import os import mechanize import re from BeautifulSoup import BeautifulSoup def scrape_trails(BASE_URL, data): #Get the trail names soup = BeautifulSoup(data) sitesDiv = soup.findAll("div", attrs={"id" : "sitesDiv"}) print sitesDiv def main(): BASE_URL = "http://www.dnr.state.mn.us/skiing/skipass/list.html" br = mechanize.Browser() data = br.open(BASE_URL).get_data() links = scrape_trails(BASE_URL, data) if __name__ == '__main__': main() If you follow that URL you can see the sitesDiv contains a lot of markup. I'm not sure if I'm doing something wrong or if this is just malformed markup that the script can't handle. Thanks!

    Read the article

  • How do I print the Images?

    - by user1477539
    I want to print the images of the 30 nba teams drafting in the first round. However when I tell it to print it prints out the link instead of the image. How do I get it to print out the image instead of giving me the image link. Here's my code: import urllib2 from BeautifulSoup import BeautifulSoup # or if your're using BeautifulSoup4: # from bs4 import BeautifulSoup soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()) rows = soup.findAll("table", attrs = {'class': 'data borderTop'})[0].tbody.findAll("tr")[2:] for row in rows: fields = row.findAll("td") if len(fields) >= 3: anchor = row.findAll("td")[1].find("a") if anchor: print anchor

    Read the article

  • Need help specifying a ending while condition

    - by johnthexiii
    I have written a Python script to download all of the xkcd comic images. The only problem is I can't tell it to stop when it gets to the last one... Here is what I have so far. import re, mechanize from urllib import urlretrieve from BeautifulSoup import BeautifulSoup as bs baseUrl = "http://xkcd.com/1/" #Specify the first comic page br = mechanize.Browser() #Create a browser response = br.open(baseUrl) #Create an initial response x = 1 #Assign an initial file name while (SomeCondition): soup = bs(response.get_data()) #Create an instance of bs that contains the response data img = soup.findAll('img')[1] #Get the online file path of the image localFile = "C:\\Comics\\xkcd\\" + str(x) + ".jpg" #Come up with a local file name urlretrieve(img["src"], localFile) #Download the image file response = br.follow_link(text = "Next >") #Store the response of the next button x += 1 #Increase x by 1 print "All xkcd comics downloaded" #Let the user know the images have been downloaded Initially what I had was something like while br.follow_link(text = "Next >") != br.follow_link(text = ">|"): but by doing this I actually send skip to the last page before the script has a chance to perform the intended purpose.

    Read the article

  • Is it possibile to modify a link value with Beautifulsoup without recreating the all link?

    - by systempuntoout
    Starting from an Html input like this: <p> <a href="http://www.foo.com" rel="nofollow">this is foo</a> <a href="http://www.bar.com" rel="nofollow">this is bar</a> </p> is it possible to modify the <a> node values ("this i foo" and "this is bar") adding the suffix "PARSED" to the value without recreating the all link? The result need to be like this: <p> <a href="http://www.foo.com" rel="nofollow">this is foo_PARSED</a> <a href="http://www.bar.com" rel="nofollow">this is bar_PARSED</a> </p> And code should be something like: from BeautifulSoup import BeautifulSoup soup = BeautifulSoup(html) for link_tag in soup.findAll('a'): link_tag.string = link_tag.string + '_PARSED' #This obviously does not work

    Read the article

< Previous Page | 1 2 3 4 5  | Next Page >