Search Results

Search found 74 results on 3 pages for 'lxml'.

Page 1/3 | 1 2 3 | Next Page >

Installing both lxml 3.1.2 and lxml2 on ubuntu 12.04

- by wgw

I asked this on SO: http://stackoverflow.com/questions/19852911/lxml-3-1-2-and-lxml2-both-on-ubuntu/19856674#19856674 But it is perhaps more appropriate for AskUbuntu. So here it is again, reformulated. On the lxml site they suggest that it is possible to have both lxml2 and the newest version of lxml on ubuntu: Using lxml with python-libxml2 If you want to use lxml together with the official libxml2 Python bindings (maybe because one of your dependencies uses it), you must build lxml statically. Otherwise, the two packages will interfere in places where the libxml2 library requires global configuration, which can have any kind of effect from disappearing functionality to crashes in either of the two. To get a static build, either pass the --static-deps option to the setup.py script, or run pip with the STATIC_DEPS or STATICBUILD environment variable set to true, i.e. STATIC_DEPS=true pip install lxml The STATICBUILD environment variable is handled equivalently to the STATIC_DEPS variable, but is used by some other extension packages, too. I am generally confused about how pip packages and ubuntu packages get along, so I hesitate to run STATIC_DEPS=true pip install lxml. Will it damage/confuse my installed lxml2 package? The suggestion on SO was to install the new lxml in a virtualenv. That looks like the best way to go, but the lxml site is suggesting that a dual installation will work also. In general: what happens if I use pip (to get a newer install) for a package that is already installed by apt-get?

Read the article
how to pass an xml file to lxml to parse?

- by BeeBand

I'm trying to parse an xml file using lxml. xml.etree allowed me to simply pass the file name as a parameter to the parse function, so I attempted to do the same with lxml. My code: from lxml import etree from lxml import objectify file = "C:\Projects\python\cb.xml" tree = etree.parse(file) but I get the error: Traceback (most recent call last): File "cb.py", line 5, in <module> tree = etree.parse(file) File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:4 9590) File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etre e.c:71205) File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lx ml.etree.c:71488) File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.e tree.c:70583) File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/ lxml/lxml.etree.c:67736) File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDo c (src/lxml/lxml.etree.c:63820) File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.e tree.c:64741) File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etr ee.c:64084) lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26 What am I doing wrong?

Read the article
Python question, how to pass an xml file to lxml to parse?

- by BeeBand

I'm relatively new to python, my code is: from lxml import etree from lxml import objectify file = "C:\Projects\python\cb.xml" tree = etree.parse(file) but I get the error: Traceback (most recent call last): File "cb.py", line 5, in <module> tree = etree.parse(file) File "lxml.etree.pyx", line 2698, in lxml.etree.parse (src/lxml/lxml.etree.c:4 9590) File "parser.pxi", line 1491, in lxml.etree._parseDocument (src/lxml/lxml.etre e.c:71205) File "parser.pxi", line 1520, in lxml.etree._parseDocumentFromURL (src/lxml/lx ml.etree.c:71488) File "parser.pxi", line 1420, in lxml.etree._parseDocFromFile (src/lxml/lxml.e tree.c:70583) File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/ lxml/lxml.etree.c:67736) File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDo c (src/lxml/lxml.etree.c:63820) File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.e tree.c:64741) File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etr ee.c:64084) lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 26 What am I doing wrong?

Read the article
Default or fink python and lxml under 10.6.8

- by songdogtech

Ah, confusion. I'm trying to install a python library called lxml as needed by a python script. I've been through numerous SU quesitons and answers. I haven't been able to make much progress. I run easy_install lxml and get: Processing lxml-3.0.1-py2.6-macosx-10.6-universal.egg lxml 3.0.1 is already the active version in easy-install.pth Using /Library/Python/2.6/site-packages/lxml-3.0.1-py2.6-macosx-10.6-universal.egg Processing dependencies for lxml Finished processing dependencies for lxml but when I run my script, I get: File "scraper.py", line 3, in import lxml.html File "/Library/Python/2.6/site-packages/lxml-3.0.1-py2.6-macosx-10.6-universal.egg/lxml/html/init.py", line 42, in from lxml import etree ImportError: dlopen(/Library/Python/2.6/site-packages/lxml-3.0.1-py2.6-macosx-10.6-universal.egg/lxml/etree.so, 2): Symbol not found: _htmlParseChunk Referenced from: /Library/Python/2.6/site-packages/lxml-3.0.1-py2.6-macosx-10.6-universal.egg/lxml/etree.so Expected in: flat namespace in /Library/Python/2.6/site-packages/lxml-3.0.1-py2.6-macosx-10.6-universal.egg/lxml/etree.so I think that maybe I'm not using the correct python install? I've installed python with fink, but should I use OS X's python? This is in my .profile: test -r /sw/bin/init.sh && . /sw/bin/init.sh which points to the fink install. echo $PATH gives me: /sw/bin:/sw/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin Should I change that to point to snow leopard's python? (Which is 2.6.1) In Library/, there is: which are the lxml libaries I need, it appears, as well as requests. And whereis python gives me /usr/bin/python What do I do? How do I get python to use these libraries. And which python?

Read the article
Encoding in python with lxml - complex solution

- by Vojtech R.

Hi, I need to download and parse webpage with lxml and build UTF-8 xml output. I thing schema in pseudocode is more illustrative: from lxml import etree webfile = urllib2.urlopen(url) root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True)) txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8)) output = etree.Element("out") output.text = txt outputfile.write(etree.tostring(output, encoding=utf8)) So webfile can be in any encoding (lxml should handle this). Outputfile have to be in utf-8. I'm not sure where to use encoding/coding. Is this schema ok? (I cant find good tutorial about lxml and encoding, but I can find many problems with this...) I need robust approved solution so I ask you seniors. Many thanks

Read the article
Setting timeouts to parse webpages using python lxml

- by Saurabh

I am using python lxml library to parse html pages: import lxml.html # this might run indefinitely page = lxml.html.parse('http://stackoverflow.com/') Is there any way to set timeout for parsing?

Read the article
Help with parsing lxml

- by Casey

Hi To implement a college project, I need to handle XML files. For this I choose lxml after doing some research. However I can't seem to find some nice tutorial to help me get started. I can't choose most specifically which type of parsing I need to use. My XML files don't have that much data but speed is main concern, not memory. Can anyone point me to some tutorial that would help me or some book that I can lookup? I have already tried the tutorial on lxml site but that didn't help me much. Is there some small application I can look up to get a hang of parsing XML with lxml

Read the article
python [lxml] - cleaning out html tags

- by sadhu_

from lxml.html.clean import clean_html, Cleaner def clean(text): try: cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True, remove_tags = ['a', 'li', 'td']) print (len(cleaner.clean_html(text))- len(text)) return cleaner.clean_html(text) except: print 'Error in clean_html' print sys.exc_info() return text I put together the above (ugly) code as my initial forays into python land. I'm trying to use lxml cleaner to clean out a couple of html pages, so in the end i am just left with the text and nothing else - but try as i might, the above doesnt appear to work as such, i'm still left with a substial amount of markup (and it doesnt appear to be broken html), and particularly links, which aren't getting removed, despite the args i use in remove_tags and links=True any idea whats going on, perhaps im barking up the wrong tree with lxml ? i thought this was the way to go with html parsing in python?

Read the article
Write xml file using lxml library in Python

- by systempuntoout

I'm using lxml to create an XML file from scratch; having a code like this: from lxml import etree root = etree.Element("root") root.set("interesting", "somewhat") child1 = etree.SubElement(root, "test") How do i write root Element object to an xml file using write() method of ElementTree class?

Read the article
lxml unicode entity parse problems

- by Jon Hadley

I'm using lxml as follows to parse an exported XML file from another system: xmldoc = open(filename) etree.parse(xmldoc) But im getting: lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46 Obviously it's having problems with unicode entity names - but how would i get round this? Via open() or parse()?

Read the article
In-document schema declarations and lxml

- by shylent

As per the official documentation of lxml, if one wants to validate a xml document against a xml schema document, one has to construct the XMLSchema object (basically, parse the schema document) construct the XMLParser, passing the XMLSchema object as its schema argument parse the actual xml document (instance document) using the constructed parser There can be variations, but the essense is pretty much the same no matter how you do it, - the schema is specified 'externally' (as opposed to specifying it inside the actual xml document). If you follow this procedure, the validation occurs, sure enough, but if I understand it correctly, that completely ignores the whole idea of the schemaLocation and noNamespaceSchemaLocation attributes from xsi. This introduces a whole bunch of limitations, starting with the fact, that you have to deal with instance<-schema relation all by yourself (either store it externally or write some hack to retrieve the schema location from the root element of the instance document), you can not validate the document using multiple schemata (say, when each schema governs its own namespace) and so on. So the question is: maybe I am missing something completely trivial or doing it wrong? Or are my statements about lxml's limitations regarding schema validation true? To recap, I'd like to be able to: have the parser use the schema location declarations in the instance document at parse/validation time use multiple schemata to validate a xml document declare schema locations on non-root elements (not of extreme importance) Maybe I should look for a different library? Although, that'd be a real shame, - lxml is a de-facto xml processing library for python and is regarded by everyone as the best one in terms of performace/features/convenience (and rightfully so, to a certain extent)

Read the article
How to get a html elements with python lxml

- by Damiano

Hello! I have this html code: <table> <tr> <td class="test"><b><a href="">aaa</a></b></td> <td class="test">bbb</td> <td class="test">ccc</td> <td class="test"><small>ddd</small></td> </tr> <tr> <td class="test"><b><a href="">eee</a></b></td> <td class="test">fff</td> <td class="test">ggg</td> <td class="test"><small>hhh</small></td> </tr> </table> I use this Python code to extract all <td class="test"> with lxml module. import urllib2 import lxml.html code = urllib.urlopen("http://www.example.com/page.html").read() html = lxml.html.fromstring(code) result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]') It works good! The result is: <td class="test"><b><a href="">aaa</a></b></td> <td class="test"><small>ddd</small></td> <td class="test"><b><a href="">eee</a></b></td> <td class="test"><small>hhh</small></td> (so the first and the fourth column of each <tr>) Now, I have to extract: aaa (the title of the link) ddd (text between <small> tag) eee (the title of the link) hhh (text between <small> tag) How could I extract these values? (the problem is that I have to remove <b> tag and get the title of the anchor on the first column and remove <small> tag on the forth column) Thank you!

Read the article
How to prevent lxml prom compacting elements?

- by Almad

Having following Python code: >>> from lxml import etree >>> root = etree.XML("<a><b></b></a>") >>> etree.tostring(root) '<a><b/></a>' How can I force lxml to use "long" version? Like >>> etree.tostring(root) '<a><b></b></a>'

Read the article
extract specific element from nested elements using lxml html

- by Dan.StackOverflow

Hi all I am having some problems that I think can be attributed to xpath problems. I am using the html module from the lxml package to try and get at some data. I am providing the most simplified situation below, but keep in mind the html I am working with is much uglier. <table> <tr> <td> <table> <tr><td></td></tr> <tr><td> <table> <tr><td><u><b>Header1</b></u></td></tr> <tr><td>Data</td></tr> </table> </td></tr> </table> </td></tr> </table> What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so: from lxml import html page = '...' tree = html.fromstring(page) print tree.xpath('//table[//*[contains(text(), "Header1")]]') but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex. Any thoughts?

Read the article
Confused as to use a class or a function: Writing XML files using lxml and Python

- by PulpFiction

Hi. I need to write XML files using lxml and Python. However, I can't figure out whether to use a class to do this or a function. The point being, this is the first time I am developing a proper software and deciding where and why to use a class still seems mysterious. I will illustrate my point. For example, consider the following function based code I wrote for adding a subelement to a etree root. from lxml import etree root = etree.Element('document') def createSubElement(text, tagText = ""): etree.SubElement(root, text) # How do I do this: element.text = tagText createSubElement('firstChild') createSubElement('SecondChild') As expected, the output of this is: <document> <firstChild/> <SecondChild/> </document> However as you can notice the comment, I have no idea how to do set the text variable using this approach. Is using a class the only way to solve this? And if yes, can you give me some pointers on how to achieve this?

Read the article
How to use regular expression in lxml xpath?

- by Arty

I'm using construction like this: doc = parse(url).getroot() links = doc.xpath("//a[text()='some text']") But I need to select all links which have text beginning with "some text", so I'm wondering is there any way to use regexp here? Didn't find anything in lxml documentation

Read the article
xml to Python data structure using lxml

- by Samuel Taylor

How can I convert xml to Python data structure using lxml? I have searched high and low but can't find anything.

Read the article
Write xml file with lxml

- by systempuntoout

Having a code like this: from lxml import etree root = etree.Element("root") root.set("interesting", "somewhat") child1 = etree.SubElement(root, "test") How do i write root Element object to an xml file using write() method of ElementTree class?

Read the article
LXML E builder for java?

- by directedition

There is one thing I really love about LXML, and that the E builder. I love that I can throw XML together like this: message = E.Person( E.Name( E.First("jack") E.Last("Ripper") ) E.PhoneNumber("555-555-5555") ) To make: <Person> <Name> <First>Jack</First> <Last>Ripper</Last> </Name> <PhoneNumber>555-555-5555</PhoneNumber> </Person> As opposed to the painstaking way DOM works. I am going to be moving a bunch of my software to Java soon and it is very very heavy on its usage of E. Does Java have anything near equivalent to that usage?

Read the article
Find elements based on xsd type with lxml

- by joet3ch

I am trying to get a list of elements with a specific xsd type with lxml 2.x and I can't figure out how to traverse the xsd for specific types. Example of schema: <xsd:element name="ServerOwner" type="srvrs:string90" minOccurs="0"> <xsd:element name="HostName" type="srvrs:string35" minOccurs="0"> Example xml data: <srvrs:ServerOwner>John Doe</srvrs:ServerOwner> <srvrs:HostName>box01.example.com</srvrs:HostName> The ideal function would look like: elements = getElems(xml_doc, 'string90') def getElems(xml_doc, xsd_type): ** xpath or something to find the elements and build a dict return elements

Read the article
python lxml problem

- by David ???

I'm trying to print/save a certain element's HTML from a web-page. I've retrieved the requested element's XPath from firebug. All I wish is to save this element to a file. I don't seem to succeed in doing so. (tried the XPath with and without a /text() at the end) I would appreciate any help, or past experience. 10x, David import urllib2,StringIO from lxml import etree url='http://www.tutiempo.net/en/Climate/Londres_Heathrow_Airport/12-2009/37720.htm' seite = urllib2.urlopen(url) html = seite.read() seite.close() parser = etree.HTMLParser() tree = etree.parse(StringIO.StringIO(html), parser) xpath = "/html/body/table/tbody/tr/td[2]/div/table/tbody/tr[6]/td/table/tbody/tr/td[3]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/text()" elem = tree.xpath(xpath) print elem[0].strip().encode("utf-8")

Read the article
Close a tag with no text in lxml

- by PulpFiction

I am trying to output a XML file using Python and lxml However, I notice one thing that if a tag has no text, it does not close itself. An example of this would be: root = etree.Element('document') rootTree = etree.ElementTree(root) firstChild = etree.SubElement(root, 'test') The output of this is: <document> <test/> </document I want the output to be: <document> <test> </test> </document> So basically I want to close a tag which has no text, but is used to the attribute value. How do I do that? And also, what is such a tag called? I would have Googled it, but I don't know how to search for it.

Read the article
Python lxml - returns null list

- by Chris Finlayson

I cannot figure out what is wrong with the XPATH when trying to extract a value from a webpage table. The method seems correct as I can extract the page title and other attributes, but I cannot extract the third value, it always returns an empty list? from lxml import html import requests test_url = 'SC312226' page = ('https://www.opencompany.co.uk/company/'+test_url) print 'Now searching URL: '+page data = requests.get(page) tree = html.fromstring(data.text) print tree.xpath('//title/text()') # Get page title print tree.xpath('//a/@href') # Get href attribute of all links print tree.xpath('//*[@id="financial"]/table/tbody/tr/td[1]/table/tbody/tr[2]/td[1]/div[2]/text()') Unless i'm missing something, it would appear the XPATH is correct: Chrome screenshot I checked Chrome console, appears ok! So i'm at a loss $x ('//*[@id="financial"]/table/tbody/tr/td[1]/table/tbody/tr[2]/td[1]/div[2]/text()') [ "£432,272" ]

Read the article
Finding inline style with lxml.cssselector

- by ropa

New to this library (no more familiar with BeautifulSoup either, sadly), trying to do something very simple (search by inline style): <td style="padding: 20px">blah blah </td> I just want to select all tds where style="padding: 20px", but I can't seem to figure it out. All the examples show how to select td, such as: for col in page.cssselect('td'): but that doesn't help me much.

Read the article
Lxml or Xpath content print

- by Angelo

function= def parseTitle(self, post): """ Returns title string with spaces replaced by dots "" return post.xpath('h2')[0].text.replace('.', ' ') i would to see the content of post. Tried all . How can i properly debug the content. ir is an webstite of movies where i rip links and tittle. So this one should parse the title.and i am sure H@ is not existing , how to print/debug it?

Read the article

1 2 3 | Next Page >