Search Results

Search found 101 results on 5 pages for 'beautifulsoup'.

Page 3 of 5

  • Element Based XML Parsing

    - by demos
    I have an XML document which reads like this: <xml> <web:Web> <web:Total>4000</web:Total> <web:Offset>0</web:Offset> </web:Web> </xml> My question is: how do I access these values using a library like BeautifulSoup in Python? xmlDom.web["Web"].Total does not work.
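
    A minimal sketch of one way to read those values, assuming Beautiful Soup 3's XML parser (BeautifulStoneSoup). Its sgmllib-based parsing may lowercase tag names, so both spellings are tried:

        from BeautifulSoup import BeautifulStoneSoup

        doc = '<xml><web:Web><web:Total>4000</web:Total><web:Offset>0</web:Offset></web:Web></xml>'
        soup = BeautifulStoneSoup(doc)
        # the parser may have lowercased the namespaced tag names, so try both
        total = soup.find('web:total') or soup.find('web:Total')
        offset = soup.find('web:offset') or soup.find('web:Offset')
        print total.string   # 4000
        print offset.string  # 0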

    Read the article

  • Python GUI Scraper hanging issues.

    - by bball
    I wrote a scraper using Python a while back, and it worked fine in the command line. I have now made a GUI for the application, but I am having trouble with one issue. When I attempt to update text inside the GUI (e.g. 'fetching URL 12/50'), I am unable to, since the function within the scraper is busy grabbing 100+ links. Also, when going from one scraping function to a function that should update the GUI and then to another function, the GUI update function seems to be skipped over while the next scrape function runs. An example would be: scrapeLinksA() #takes 20 seconds updateInfo("LinksA done") scrapeLinksB() #takes another 20 seconds In the above example, updateInfo is never executed unless I end the program with a KeyboardInterrupt. I'm thinking my solution is threading, but I'm not sure. What can I do to fix this? I am using: PyQt4, urllib2, BeautifulSoup.
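
    Threading is the usual fix here: the GUI freezes because the scrape runs on the Qt event loop's thread, so repaints and label updates queue up behind it. A minimal sketch moving the work onto a QThread and signalling progress back, assuming PyQt4 and the asker's existing scrapeLinksA/scrapeLinksB functions (the statusLabel name is hypothetical):

        from PyQt4 import QtCore

        class ScrapeThread(QtCore.QThread):
            progress = QtCore.pyqtSignal(str)  # carries status text to the GUI

            def run(self):
                scrapeLinksA()                     # takes ~20 seconds
                self.progress.emit("LinksA done")  # delivered on the GUI thread
                scrapeLinksB()
                self.progress.emit("LinksB done")

        # in the main window:
        #   self.thread = ScrapeThread()
        #   self.thread.progress.connect(self.statusLabel.setText)
        #   self.thread.start()  # returns immediately; the GUI stays responsive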

    Read the article

  • Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?

    - by Eric Wilson
    I want to do some screen-scraping with Python 2.7, and I have no context for the differences between HTMLParser, SGMLParser, and Beautiful Soup. Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice? Also, please let me know if I have overlooked a significant option. Edit: I should mention that I'm not particularly experienced in HTML parsing, and I'm particularly interested in which will get me moving the quickest, with the goal of parsing HTML on one particular site.
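
    For "moving the quickest" on one site, Beautiful Soup is the usual answer: HTMLParser and SGMLParser are low-level event-driven parsers you build on top of, while Beautiful Soup hands you a searchable tree and tolerates bad markup. A minimal starter, assuming Beautiful Soup 3 and a placeholder URL:

        import urllib2
        from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3.x

        html = urllib2.urlopen('http://example.com/').read()  # placeholder URL
        soup = BeautifulSoup(html)
        for link in soup.findAll('a', href=True):  # every <a> with an href
            print link['href']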

    Read the article

  • Python: using a regular expression to match one line of HTML

    - by skylarking
    This simple Python method I put together just checks to see if Tomcat is running on one of our servers. import urllib2 import re import sys def tomcat_check(): tomcat_status = urllib2.urlopen('http://10.1.1.20:7880') results = tomcat_status.read() pattern = re.compile('<body>Tomcat is running...</body>',re.M|re.DOTALL) q = pattern.search(results) if q == []: notify_us() else: print ("Tomcat appears to be running") sys.exit() If this line is not found: <body>Tomcat is running...</body> it calls notify_us(), which uses SMTP to send an email message to myself and another admin that Tomcat is no longer running on the server. I have not used the re module in Python before, so I am assuming there is a better way to do this. I am also open to a more graceful solution with Beautiful Soup, but I haven't used that either. Just trying to keep this as simple as possible.
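
    One bug worth noting: re.search returns None (not []) when there is no match, so the q == [] test never fires. For a fixed literal like this, a plain substring test is simpler than re anyway; a sketch, assuming the asker's existing notify_us():

        import urllib2
        import sys

        def tomcat_check():
            results = urllib2.urlopen('http://10.1.1.20:7880').read()
            # substring test; no regex needed for a fixed literal
            if '<body>Tomcat is running...</body>' not in results:
                notify_us()   # the asker's existing SMTP alert
            else:
                print "Tomcat appears to be running"
                sys.exit()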

    Read the article

  • Downloading a picture via urllib and Python

    - by Mike
    So I'm trying to make a Python script that downloads webcomics and puts them in a folder on my desktop. I've found a few programs on here that do something similar, but nothing quite like what I need. The one I found most similar is right here (http://bytes.com/topic/python/answers/850927-problem-using-urllib-download-images). I tried using this code: >>> import urllib >>> image = urllib.URLopener() >>> image.retrieve("http://www.gunnerkrigg.com//comics/00000001.jpg","00000001.jpg") ('00000001.jpg', <httplib.HTTPMessage instance at 0x1457a80>) I then searched my computer for a file "00000001.jpg", but all I found was the cached picture of it. I'm not even sure it saved the file to my computer. Once I understand how to get the file downloaded, I think I know how to handle the rest. Essentially just use a for loop, split the string at '00000000'.'jpg', and increment '00000000' up to the largest number, which I would have to somehow determine. Any recommendations on the best way to do this or how to download the file correctly? Thanks!
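
    retrieve() with a relative filename saves into whatever the current working directory happens to be, which is the likely reason the file seemed to vanish. A sketch using urllib.urlretrieve with an explicit destination (the desktop path and the range of comic numbers are hypothetical):

        import urllib

        for i in range(1, 11):                      # hypothetical: comics 1-10
            name = '%08d.jpg' % i                   # 00000001.jpg, 00000002.jpg, ...
            url = 'http://www.gunnerkrigg.com//comics/' + name
            # an absolute destination path makes it obvious where the file went
            urllib.urlretrieve(url, '/home/me/Desktop/comics/' + name)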

    Read the article

  • Scrape zipcode table for different urls based on county

    - by Dr.Venkman
    I used lxml and ran into a wall, as my new computer won't install lxml and the code doesn't work. I know this is simple - maybe someone can help with a Beautiful Soup script. This is my code: import codecs import lxml as lh from selenium import webdriver import time import re results = [] city = [ 'amador'] state = [ 'CA'] for state in states: for city in citys: browser = webdriver.Firefox() link2 = 'http://www.getzips.com/cgi-bin/ziplook.exe?What=3&County='+ city +'&State=' + state + '&Submit=Look+It+Up' browser.get(link2) bcontent = browser.page_source zipcode = bcontent[bcontent.find('<td width="15%"'):bcontent.find('<p>')+0] if len(zipcode) > 0: print zipcode else: print 'none' browser.quit() Thanks for the help
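
    A Beautiful Soup version that drops both lxml and the browser, assuming Beautiful Soup 3 and urllib2. Note the original defines city/state but loops over citys/states, so the plural names are used consistently here:

        import urllib2
        from BeautifulSoup import BeautifulSoup

        citys = ['amador']
        states = ['CA']

        for state in states:
            for city in citys:
                url = ('http://www.getzips.com/cgi-bin/ziplook.exe?What=3'
                       '&County=%s&State=%s&Submit=Look+It+Up' % (city, state))
                soup = BeautifulSoup(urllib2.urlopen(url).read())
                cells = soup.findAll('td', width='15%')   # the zipcode column
                if cells:
                    for td in cells:
                        print ''.join(td.findAll(text=True)).strip()
                else:
                    print 'none'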

    Read the article

  • Error while trying to parse a website URL using Python. How do I debug it?

    - by mekasperasky
    #!/usr/bin/python import json import urllib from BeautifulSoup import BeautifulSoup from BeautifulSoup import BeautifulStoneSoup import BeautifulSoup def showsome(searchfor): query = urllib.urlencode({'q': searchfor}) url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query search_response = urllib.urlopen(url) search_results = search_response.read() results = json.loads(search_results) data = results['responseData'] print 'Total results: %s' % data['cursor']['estimatedResultCount'] hits = data['results'] print 'Top %d hits:' % len(hits) for h in hits: print ' ', h['url'] resp = urllib.urlopen(h['url']) res = resp.read() soup = BeautifulSoup(res) print soup.prettify() print 'For more results, see %s' % data['cursor']['moreResultsUrl'] showsome('sachin') What is wrong with this code? Note: I feed each of the four links that come out of the search back in, extract its contents, and then use BeautifulSoup to parse it. How should I go about it?
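
    One concrete bug, assuming Beautiful Soup 3: the bare import BeautifulSoup on the third import line rebinds the name BeautifulSoup from the class (bound by the from ... import) back to the module, so BeautifulSoup(res) tries to call a module. Keeping a single import fixes it; a trimmed sketch:

        import json
        import urllib
        from BeautifulSoup import BeautifulSoup   # keep only this import

        def showsome(searchfor):
            query = urllib.urlencode({'q': searchfor})
            url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
            results = json.loads(urllib.urlopen(url).read())
            data = results['responseData']
            for h in data['results']:
                print h['url']
                # BeautifulSoup is the class again, so this call works
                soup = BeautifulSoup(urllib.urlopen(h['url']).read())
                print soup.prettify()

        showsome('sachin')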

    Read the article

  • Web scraping with Python

    - by Jack
    I'm currently trying to scrape a website that has fairly poorly-formatted HTML (often missing closing tags, no use of classes or ids so it's incredibly difficult to go straight to the element you want, etc.). I've been using BeautifulSoup with some success so far, but every once in a while (though quite rarely) I run into a page where BeautifulSoup creates the HTML tree a bit differently from (for example) Firefox or WebKit. While this is understandable, since the formatting of the HTML leaves it ambiguous, if I were able to get the same parse tree that Firefox or WebKit produces I would be able to parse things much more easily. The problems are usually something like the site opening a <b> tag twice: when BeautifulSoup sees the second <b> tag, it immediately closes the first, while Firefox and WebKit nest the <b> tags. Is there a web scraping library for Python (or even any other language; I'm getting desperate) that can reproduce the parse tree generated by Firefox or WebKit (or at least get closer than BeautifulSoup in cases of ambiguity)?
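
    html5lib implements the same error-recovery rules the browsers follow, so it usually reproduces their tree in exactly these ambiguous cases. A sketch, assuming Beautiful Soup 4 with html5lib installed as its parser backend:

        from bs4 import BeautifulSoup   # requires: pip install html5lib

        # html5lib nests the second <b> inside the first, as Firefox/WebKit do,
        # instead of closing the first one early
        soup = BeautifulSoup('<p><b>one<b>two</p>', 'html5lib')
        print soup.prettify()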

    Read the article

  • How do I print the Images?

    - by user1477539
    I want to print the images of the 30 NBA teams drafting in the first round. However, when I tell it to print, it prints out the link instead of the image. How do I get it to print out the image instead of giving me the image link? Here's my code: import urllib2 from BeautifulSoup import BeautifulSoup # or if you're using BeautifulSoup4: # from bs4 import BeautifulSoup soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()) rows = soup.findAll("table", attrs = {'class': 'data borderTop'})[0].tbody.findAll("tr")[2:] for row in rows: fields = row.findAll("td") if len(fields) >= 3: anchor = row.findAll("td")[1].find("a") if anchor: print anchor
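
    A terminal can only print text, so "printing the image" really means downloading the file an <img> tag points to (or printing its src). A sketch for inside the existing loop, assuming the logo renders as an <img> in the same cell the asker's code indexes:

        import urllib

        # inside the `if len(fields) >= 3:` block:
        img = row.findAll("td")[1].find("img")
        if img is not None:
            src = img["src"]
            # save the logo locally under its original file name
            urllib.urlretrieve(src, src.split("/")[-1])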

    Read the article

  • Regular expressions in Python with Unicode

    - by Remy
    I need to remove all the HTML tags from given webpage data. I tried this using regular expressions: import urllib2 import re page = urllib2.urlopen("http://www.frugalrules.com") from bs4 import BeautifulSoup, NavigableString, Comment soup = BeautifulSoup(page) link = soup.find('link', type='application/rss+xml') print link['href'] rss = urllib2.urlopen(link['href']).read() souprss = BeautifulSoup(rss) description_tag = souprss.find_all('description') content_tag = souprss.find_all('content:encoded') print re.sub('<[^>]*>', '', content_tag) But the signature of re.sub is: re.sub(pattern, repl, string, count=0) So I modified the code as follows (instead of the print statement above): for row in content_tag: print re.sub(ur"<[^>]*>",'',row,re.UNICODE) But it gives the following error: Traceback (most recent call last): File "C:\beautifulsoup4-4.3.2\collocation.py", line 20, in <module> print re.sub(ur"<[^>]*>",'',row,re.UNICODE) File "C:\Python27\lib\re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or buffer What am I doing wrong?
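
    Two separate problems, assuming bs4 and reusing the content_tag list from the question: re.sub's fourth positional argument is count, not flags, so re.UNICODE was silently being passed as a count; and row is a Tag object while re needs a string. A sketch fixing both:

        for row in content_tag:
            text = unicode(row)   # Tag -> unicode string; re needs a string
            # pass flags by keyword: the 4th positional argument is `count`
            print re.sub(ur'<[^>]*>', u'', text, flags=re.UNICODE)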

    Read the article

  • How to use a regular expression to extract information from an HTML webpage?

    - by user569248
    How do I use a regular expression to extract the answer "Here is the answer" from an HTML webpage like this? <b>Last Question:</b> <b>Here is the answer</b> ..:: Update ::.. Thanks everybody! Here is my solution using BeautifulSoup, since I'm working in Python: response = opener.open(url) the_page = response.read() soup = BeautifulSoup(''.join(the_page)) paraText1 = soup.body.find('div', 'div_id', text = u'Last Question:') if paraText1: answer = paraText1.next
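
    For completeness, since the title asks for a regular expression: the pair of <b> tags can be matched directly, though the BeautifulSoup route in the update is more robust against whitespace and attribute changes. A sketch:

        import re

        html = '<b>Last Question:</b> <b>Here is the answer</b>'
        m = re.search(r'<b>Last Question:</b>\s*<b>([^<]*)</b>', html)
        if m:
            print m.group(1)   # Here is the answer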

    Read the article

  • Regex to remove conditional comments

    - by cnu
    I want a regex which can match conditional comments in an HTML source page so I can remove only those. I want to preserve the regular comments. I would also like to avoid using the .*? notation if possible. The text is foo <!--[if IE]> <style type="text/css"> ul.menu ul li{ font-size: 10px; font-weight:normal; padding-top:0px; } </style> <![endif]--> bar and I want to remove everything in <!--[if IE]> and <![endif]--> EDIT: It is because of BeautifulSoup that I want to remove these tags. BeautifulSoup fails to parse and gives an incomplete source. EDIT2: [if IE] isn't the only condition. There are lots more, and I don't have a list of all possible combinations. EDIT3: Vinko Vrsalovic's solution works, but the actual reason BeautifulSoup failed was a rogue comment within the conditional comment, like <!--[if lt IE 7.]> <script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE--> <![endif]--> Notice the <!--png fix for IE--> comment? Though my problem was solved, I would love to get a regex solution for this.
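
    A sketch that handles the rogue inner comment: anchor the end of the match on the literal <![endif]--> so plain comments inside are skipped over. It does use the .*? the asker hoped to avoid, but the laziness is exactly what stops the match at the first endif rather than swallowing the rest of the page:

        import re

        # matches <!--[if anything]> ... <![endif]-->, across newlines,
        # including regular comments nested inside the conditional block
        cond_comment = re.compile(r'<!--\[if[^\]]*\]>.*?<!\[endif\]-->',
                                  re.DOTALL | re.IGNORECASE)
        cleaned = cond_comment.sub('', html)   # html: the page source string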

    Read the article

  • How to install suggested packages in apt-get

    - by Alaa Ali
    EDIT: I solved my issue. I will answer my own question, but in 5 hours, because I don't have permission now. I know the question has been asked before, but please hear me out. I wanted to install screenlets, so I ran sudo apt-get install screenlets, and this is what I got: The following extra packages will be installed: libart-2.0-2 libbonobo2-0 libbonobo2-common libbonoboui2-0 libbonoboui2-common libgnome2-0 libgnomecanvas2-0 libgnomecanvas2-common libgnomeui-0 libgnomeui-common libtidy-0.99-0 python-beautifulsoup python-evolution python-feedparser python-gmenu python-gnome2 python-numpy python-pyorbit python-rsvg python-tz python-utidylib screenlets-pack-basic Suggested packages: libbonobo2-bin python-gnome2-doc python-numpy-doc python-numpy-dbg python-nose python-dev gfortran python-pyorbit-dbg screenlets-pack-all python-dcop Recommended packages: python-numeric python-gnome2-extras The following NEW packages will be installed: libart-2.0-2 libbonobo2-0 libbonobo2-common libbonoboui2-0 libbonoboui2-common libgnome2-0 libgnomecanvas2-0 libgnomecanvas2-common libgnomeui-0 libgnomeui-common libtidy-0.99-0 python-beautifulsoup python-evolution python-feedparser python-gmenu python-gnome2 python-numpy python-pyorbit python-rsvg python-tz python-utidylib screenlets screenlets-pack-basic 0 upgraded, 23 newly installed, 0 to remove and 2 not upgraded. People say that Recommended packages are installed by default, but they are clearly not included in the NEW packages listed above. I also decided to include the Suggested packages in the installation, so I ran sudo apt-get --install-suggests install screenlets instead, but I got a HUGE list of NEW packages to be installed; the count was precisely 0 upgraded, 944 newly installed, 0 to remove and 2 not upgraded. Shouldn't I be getting only around 10 extra packages?
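
    The likely reason for the explosion: --install-suggests applies recursively, so the suggestions of every suggested package (and theirs, and so on) get pulled in too, which is how 23 packages become 944. One hedged workaround is to name the direct suggestions from the first run explicitly:

        sudo apt-get install screenlets libbonobo2-bin python-gnome2-doc python-numpy-doc python-numpy-dbg python-nose python-dev gfortran python-pyorbit-dbg screenlets-pack-all python-dcop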

    Read the article

  • Python regex on list

    - by Peter Nielsen
    Hi there. I am trying to build a parser and save the results as an XML file, but I have problems. For instance, I get a TypeError: expected string or buffer when I try to run the code. Would you experts please have a look at my code? import urllib2, re from xml.dom.minidom import Document from BeautifulSoup import BeautifulSoup as bs osc = open('OSCTEST.html','r') oscread = osc.read() soup=bs(oscread) doc = Document() root = doc.createElement('root') doc.appendChild(root) countries = doc.createElement('countries') root.appendChild(countries) findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)</h1>', re.DOTALL | re.IGNORECASE).findall(soup) findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL | re.IGNORECASE).findall(soup) for header in findtags1: title_elem = doc.createElement('title') countries.appendChild(title_elem) header_elem = doc.createTextNode(header) title_elem.appendChild(header_elem) for item in findtags2: art_elem = doc.createElement('artikel') countries.appendChild(art_elem) s = item.replace('<P>','') t = s.replace('</P>','') text_elem = doc.createTextNode(t) art_elem.appendChild(text_elem) print doc.toprettyxml()
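
    The TypeError comes from handing re a BeautifulSoup object: findall() wants a string, and here it gets the soup. Either run the patterns over the raw oscread text, or drop the regexes and let BeautifulSoup do the matching. A sketch of the second, assuming Beautiful Soup 3, the soup object from the question, and the class names taken from the original patterns:

        import re

        # search the tree directly instead of running regexes over `soup`
        headers = soup.findAll('h1', {'class': re.compile(r'^title metadata_title')})
        spans = soup.findAll('span', {'class': 'content_text'})

        for h in headers:
            print ''.join(h.findAll(text=True))
        for s in spans:
            print ''.join(s.findAll(text=True))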

    Read the article

  • URL open encoding

    - by badc0re
    I have the following code using urllib and BeautifulSoup: getSite = urllib.urlopen(pageName) # open current site getSitesoup = BeautifulSoup(getSite.read()) # read the site content print getSitesoup.originalEncoding for value in getSitesoup.find_all('link'): # extract all <link> tags defLinks.append(value.get('href')) The result of it: /usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. "Some characters could not be decoded, and were " And when I try to read the site I get: ?7?e????0*"I??G?H????F??????9-??????;??E?YÞBs????????????4i???)?????^W?????`w?Ke??%??*9?.'OQB???V??@?????]???(P??^??q?$?S5???tT*?Z
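
    The REPLACEMENT CHARACTER warning means the bytes did not decode with the charset bs4 guessed. A sketch for diagnosing and overriding the guess, assuming bs4, the pageName variable from the question, and a hypothetical 'cp1251' as the known-correct charset:

        import urllib
        from bs4 import BeautifulSoup

        getSite = urllib.urlopen(pageName)
        print getSite.headers.get('content-type')   # what the server claims
        html = getSite.read()

        # force the declared/known charset instead of letting bs4 guess
        soup = BeautifulSoup(html, from_encoding='cp1251')  # hypothetical charset
        print soup.original_encoding
        for value in soup.find_all('link'):
            print value.get('href')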

    Read the article

  • Python: replace urls with title names from a string

    - by Hellnar
    Hello, I would like to remove URLs in a string and replace them with the titles of the pages they point to. For example: mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com" sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images" For turning a URL into its title, I have written this snippet: #get_title: string -> string def get_title(url): """Returns the title of the input URL""" output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url)) return output.title.string
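
    With get_title in hand, sanitize is a re.sub with a function as the replacement. The pattern below is a deliberately loose sketch (real URL grammars are messier), and trailing punctuation like the period after stackoverflow.com is stripped before fetching:

        import re
        import urllib
        import BeautifulSoup

        def get_title(url):
            """Returns the title of the input URL."""
            output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
            return output.title.string

        def sanitize(text):
            def repl(match):
                url = match.group(0).rstrip('.,')   # drop trailing punctuation
                tail = match.group(0)[len(url):]    # ...but keep it in the text
                return get_title(url) + tail
            return re.sub(r'https?://\S+', repl, text)

        print sanitize("Ah I like this site: http://www.stackoverflow.com.")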

    Read the article

  • Python Continue Loop

    - by Rob B.
    I am using the following code from this tutorial (http://jeriwieringa.com/blog/2012/11/04/beautiful-soup-tutorial-part-1/). from bs4 import BeautifulSoup soup = BeautifulSoup (open("43rd-congress.html")) final_link = soup.p.a final_link.decompose() trs = soup.find_all('tr') for tr in trs: for link in tr.find_all('a'): fulllink = link.get ('href') print fulllink #print in terminal to verify results tds = tr.find_all("td") try: #we are using "try" because the table is not well formatted. This allows the program to continue after encountering an error. names = str(tds[0].get_text()) # This structure isolates the item by its column in the table and converts it into a string. years = str(tds[1].get_text()) positions = str(tds[2].get_text()) parties = str(tds[3].get_text()) states = str(tds[4].get_text()) congress = tds[5].get_text() except: print "bad tr string" continue #This tells the computer to move on to the next item after it encounters an error print names, years, positions, parties, states, congress However, I get an error saying that 'continue' is not properly in the loop on line 27. I am using Notepad++ and Windows PowerShell. How do I make this code work?
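
    "'continue' not properly in loop" almost always means the indentation has drifted: the try/except must sit inside the for tr in trs: body for continue to be legal. A sketch of the loop with consistent four-space indents:

        for tr in trs:
            for link in tr.find_all('a'):
                print link.get('href')       # verify results in the terminal
            tds = tr.find_all("td")
            try:
                names = str(tds[0].get_text())
                years = str(tds[1].get_text())
                positions = str(tds[2].get_text())
                parties = str(tds[3].get_text())
                states = str(tds[4].get_text())
                congress = tds[5].get_text()
            except IndexError:               # badly formatted rows
                print "bad tr string"
                continue                     # legal: we are inside the for loop
            print names, years, positions, parties, states, congress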

    Read the article

  • ASP.NET, JavaScript, AjaxControlToolkit - get results with Selenium?

    - by Seth
    I'm a newbie to web stuff. However, I wish to scrape some data from multiple websites. I'm currently using the following technologies: Selenium, Python, and BeautifulSoup. I believe the site I am trying to scrape uses a combination of ASP.NET, JavaScript, and the AjaxControlToolkit. I believe the key results I am looking for are in the following script: <script type="text/javascript"> //<![CDATA[ Sys.Application.initialize(); Sys.Application.add_init(function() { $create(AjaxControlToolkit.AutoCompleteBehavior, {"completionInterval":50,"completionListCssClass":"autocomplete_completionListElement","completionListItemCssClass":"autocomplete_listItem","completionSetCount":20,"delimiterCharacters":"","highlightedItemCssClass":"autocomplete_highlightedListItem","id":"ctl00_ContentPlaceHolder1_AutoCompleteExtender1","minimumPrefixLength":4,"serviceMethod":"GetSchoolNames","servicePath":"AutoComplete.asmx"}, {"itemSelected":ItemSelected}, null, $get("ctl00_ContentPlaceHolder1_SchoolNameTextBox")); }); Sys.Application.add_init(function() { $create(AjaxControlToolkit.AutoCompleteBehavior, {"completionInterval":50,"completionListCssClass":"autocomplete_completionListElement","completionListItemCssClass":"autocomplete_listItem","delimiterCharacters":"","highlightedItemCssClass":"autocomplete_highlightedListItem","id":"ctl00_ContentPlaceHolder1_AutoCompleteExtender2","minimumPrefixLength":2,"serviceMethod":"GetSuburbNames","servicePath":"AutoComplete.asmx"}, null, null, $get("ctl00_ContentPlaceHolder1_SuburbTownTextBox")); }); //]]> </script> Is there an easy way to get the results of the above script processed using Selenium so that I may pass them to BeautifulSoup?
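
    Since the completion list only exists after the AutoComplete service fires, one approach is to let Selenium type into the box, wait, and then hand the rendered source to BeautifulSoup. A sketch, assuming the IDs and CSS class from the script above, that the completion entries render as <li> elements with that class, and a hypothetical search prefix 'smit' (minimumPrefixLength is 4):

        import time
        from selenium import webdriver
        from bs4 import BeautifulSoup

        driver = webdriver.Firefox()
        driver.get(url)   # url: the page containing the form

        box = driver.find_element_by_id("ctl00_ContentPlaceHolder1_SchoolNameTextBox")
        box.send_keys("smit")   # 4+ chars triggers GetSchoolNames
        time.sleep(2)           # crude wait for the AJAX completion list

        soup = BeautifulSoup(driver.page_source)
        for item in soup.find_all("li", class_="autocomplete_listItem"):
            print item.get_text()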

    Read the article

  • Python mechanize.browser submit() problem

    - by paul
    Hello all, I'm making a script with the mechanize.browser module. The problem is that everything else works, but submitting the form does not, so I looked for the suspicious part of the source. In the HTML source I found the following: <form method="post" onsubmit="return loginCheck(this)" name="FRMLOGIN"/> I'm thinking loginCheck(this) causes the problem when the form is submitted, but how do I handle this kind of JavaScript function with the mechanize module so I can successfully submit the form and receive the result? The following is my current script. If anyone can help, much appreciated! # -*- coding: cp949-*- import sys,os import mechanize, urllib import cookielib from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag import datetime, time, socket import re br = mechanize.Browser() cj = cookielib.LWPCookieJar() br.set_cookiejar(cj) # Browser options br.set_handle_equiv(True) br.set_handle_gzip(True) br.set_handle_redirect(True) br.set_handle_referer(True) br.set_handle_robots(False) # Follows refresh 0 but not hangs on refresh > 0 br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) # Want debugging messages? br.set_debug_http(True) br.set_debug_redirects(True) br.set_debug_responses(True) # User-Agent (this is cheating, ok?) br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')] br.open('http://user.buddybuddy.co.kr/Login/LoginForm.asp?URL=') html = br.response().read() print html br.select_form(name='FRMLOGIN') print br.viewing_html() br.form['ID']='zero1zero2' br.form['PWD']='012045' br.submit() print br.response().read()
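
    mechanize never executes JavaScript, so onsubmit="return loginCheck(this)" simply does not run; the POST itself still goes out when submit() is called. If the login fails, it is usually because loginCheck() would have filled in or altered a field client-side, and those values have to be set by hand. A sketch, reusing the br object from the script above:

        br.select_form(name='FRMLOGIN')
        # allow writing to hidden/readonly controls that loginCheck() would
        # normally populate in the browser
        br.form.set_all_readonly(False)
        br.form['ID'] = 'zero1zero2'
        br.form['PWD'] = '012045'
        response = br.submit()          # sends the POST; no JavaScript involved
        print response.read()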

    Read the article

  • How to Select Items in Dropdown in Selenium

    - by Marcus Gladir
    Firstly, I have been trying to get the dropdown from this web page: http://solutions.3m.com/wps/portal/3M/en_US/Interconnect/Home/Products/ProductCatalog/Catalog/?PC_Z7_RJH9U5230O73D0ISNF9B3C3SI1000000_nid=RFCNF5FK7WitWK7G49LP38glNZJXPCDXLDbl This is the code I have: import urllib2 from bs4 import BeautifulSoup import re from pprint import pprint import sys from selenium import common from selenium import webdriver import selenium.webdriver.support.ui as ui from boto.s3.key import Key import requests url = 'http://solutions.3m.com/wps/portal/3M/en_US/Interconnect/Home/Products/ProductCatalog/Catalog/?PC_Z7_RJH9U5230O73D0ISNF9B3C3SI1000000_nid=RFCNF5FK7WitWK7G49LP38glNZJXPCDXLDbl' element_xpath = '//*[@id="Component1"]' driver = webdriver.PhantomJS() driver.get(url) element = driver.find_element_by_xpath(element_xpath) element_xpath = '/option[@value="02"]' all_options = element.find_elements_by_tag_name("option") for option in all_options: print("Value is: %s" % option.get_attribute("value")) option.click() source = driver.page_source.encode('utf-8', 'ignore') driver.quit() source = str(source) soup = BeautifulSoup(source, 'html.parser') print soup What prints out is this: Traceback (most recent call last): File "../../../../test.py", line 58, in <module> Value is: XX main() File "../../../../test.py", line 46, in main option.click() File "/home/eric/dev/octocrawler-env/local/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webelement.py", line 54, in click self._execute(Command.CLICK_ELEMENT) File "/home/eric/dev/octocrawler-env/local/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webelement.py", line 228, in _execute return self._parent.execute(command, params) File "/home/eric/dev/octocrawler-env/local/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/webdriver.py", line 165, in execute self.error_handler.check_response(response) File "/home/eric/dev/octocrawler-env/local/lib/python2.7/site-packages/selenium-2.33.0-py2.7.egg/selenium/webdriver/remote/errorhandler.py", line 158, in check_response raise exception_class(message, screen, stacktrace) selenium.common.exceptions.ElementNotVisibleException: Message: u'{"errorMessage":"Element is not currently visible and may not be manipulated","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"81","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:51413","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"POST","post":"{\\"sessionId\\": \\"30e4fd50-f0e4-11e3-8685-6983e831d856\\", \\"id\\": \\":wdc:1402434863875\\"}","url":"/click","urlParsed":{"anchor":"","query":"","file":"click","directory":"/","path":"/click","relative":"/click","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/click","queryKey":{},"chunks":["click"]},"urlOriginal":"/session/30e4fd50-f0e4-11e3-8685-6983e831d856/element/%3Awdc%3A1402434863875/click"}}' ; Screenshot: available via screen And the weirdest, most infuriating bit of it all is that sometimes it actually works. I have no clue what's going on here.
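
    ElementNotVisibleException usually means the <select> (or its options) was not displayed at click time; PhantomJS is stricter about this than a headed browser, which would also explain the intermittent successes. Selenium's Select helper picks options without clicking each one, and an explicit wait covers the late-rendering case. A sketch, assuming the "Component1" id from the XPath above:

        from selenium import webdriver
        from selenium.webdriver.support.ui import Select, WebDriverWait

        driver = webdriver.PhantomJS()
        driver.get(url)   # url: the 3M catalog URL from the question

        # wait up to 10s for the dropdown to actually be displayed
        WebDriverWait(driver, 10).until(
            lambda d: d.find_element_by_id("Component1").is_displayed())

        # select by value instead of clicking each <option>
        Select(driver.find_element_by_id("Component1")).select_by_value("02")
        print driver.page_source[:200]  # sanity check

        driver.quit()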

    Read the article
