Command line tool to query HTML elements (linux)

Posted by ipsec on Super User
Published on 2012-11-18T12:01:45Z, indexed on 2012/11/18 17:06 UTC

Filed under: html | linux | command-line

I am looking for a Linux command-line tool to parse HTML files and extract some elements, ideally with an XPath-like syntax.

I have the following requirements:

It must be able to parse arbitrary HTML files (which may contain errors) in a robust manner
It must be able to extract text of elements and attributes
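
To make the two requirements concrete, here is a rough sketch of the behavior I am after, written with Python's standard html.parser module. It is only an illustration, not one of the tools listed below and not an XPath solution; the names LinkExtractor, extract_links and SAMPLE are made up for this example. It reads deliberately broken HTML and collects the href value and text of every a element, not just the first one.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute and the text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text) tuples
        self._in_a = False   # True while inside an open <a> element
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_a = False


def extract_links(html):
    """Return (href, text) for each <a> element found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately broken input: an undefined entity and unclosed <p> and <b> tags.
SAMPLE = '<p>See <a href="foo">first &bogus; link</a><p><b>and <a href="bar">second</a>'

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

The parser does not raise on the undefined entity or the unclosed tags, which is the kind of robustness I mean in the first requirement.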

What I have tried so far:

xmlstarlet: would be perfect, but it reports errors for most files (e.g. entity not defined), and even running xml fo or htmltidy first does not help.

xmllint: the best I have found so far, but it cannot extract attribute values. Something like //a/@href reports <a href="foo">, when what I need is just foo. string(//a/@href) works, but returns only the first match. The data function is not supported.

hxextract: works, but cannot extract attributes.

XQilla: would support XPath 2.0 and thus data. It also supports xqilla:parse-html, but I have had no luck making this work.

Can you recommend another tool?

© Super User or respective owner
