XML: Process large data

Posted by Atmocreations on Stack Overflow
Published on 2010-02-20T10:57:45Z Indexed on 2010/03/19 15:51 UTC
Filed under: java, xml, blackberry, xslt, large-files

Hello

Which XML parser would you recommend for the following purpose?

The XML file (formatted, so it contains whitespace) is around 800 MB. It mostly contains three types of tags (let's call them n, w and r). Each has an attribute called id, which I need to search for as fast as possible.

Removing attributes I don't need could save around 30%, maybe a bit more.

First part, which is meant to optimize the second: Is there a good tool (command line, for Linux and Windows if possible) to easily remove unused attributes from certain tags? I know that XSLT could be used. Are there any easier alternatives? I could also split the file into three, one per tag type, to speed up later parsing. Speed is not too important for this data preparation step, though it would be nice if it took minutes rather than hours.
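To illustrate what such a preparation step could look like without XSLT, here is a minimal sketch using the StAX event API that ships with Java 6 (javax.xml.stream). It copies the document event by event and drops every attribute whose name is not in a keep-list. The class name AttributeStripper and the keep-list are assumptions made for this example, and nothing here has been measured against an 800 MB file.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams XML from in to out, keeping only the
// attributes whose local name appears in keep.
public class AttributeStripper {

    public static void strip(Reader in, Writer out, Set<String> keep)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                // Collect only the attributes we want to keep.
                List<Attribute> kept = new ArrayList<Attribute>();
                Iterator<?> it = start.getAttributes();
                while (it.hasNext()) {
                    Attribute a = (Attribute) it.next();
                    if (keep.contains(a.getName().getLocalPart())) {
                        kept.add(a);
                    }
                }
                // Rebuild the start tag with the reduced attribute list.
                event = events.createStartElement(
                        start.getName(), kept.iterator(), start.getNamespaces());
            }
            // All other events (text, end tags, ...) are copied unchanged.
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

For the real file one would pass a buffered file reader and writer instead of in-memory ones. Because only one event is held at a time, memory use should not depend on the file size, but the running time would have to be measured.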

Second part: Once the data is prepared, shortened or not, I need to search for the id attribute mentioned above, and this part is time-critical.

Estimates using wc -l tell me that there are around 3M N-tags and around 418K W-tags. The latter can contain up to approximately 20 subtags each. W-Tags also contain some, but they would be stripped away.

"All I have to do" is navigate between tags with certain id attributes. Some tags reference other ids, which gives me a tree, maybe even a graph. The original data is big (as mentioned), but the result set shouldn't be too big, as I only have to pick out certain elements.

Now the question: What XML parsing library should I use for this kind of processing? I would use Java 6 at first, with a view to porting it to BlackBerry later.
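As a point of comparison, the kind of lookup described above can be sketched with the StAX cursor API (XMLStreamReader) that is part of Java 6. The sketch streams through the document once and collects the attributes of every element whose id is in a wanted set. The class name IdScanner and the "tag:id" key format are assumptions for this example, and whether StAX is usable on BlackBerry is not addressed here.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: one streaming pass that picks out elements by id.
public class IdScanner {

    // Returns the attributes of each matching element, keyed by "tag:id"
    // (ids are assumed to be unique per tag type, not necessarily globally).
    public static Map<String, Map<String, String>> find(Reader in, Set<String> wanted)
            throws XMLStreamException {
        Map<String, Map<String, String>> hits =
                new LinkedHashMap<String, Map<String, String>>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String id = r.getAttributeValue(null, "id");
                if (id == null || !wanted.contains(id)) {
                    continue;
                }
                // Copy the attributes of the matching element.
                Map<String, String> attrs = new LinkedHashMap<String, String>();
                for (int i = 0; i < r.getAttributeCount(); i++) {
                    attrs.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
                }
                hits.put(r.getLocalName() + ":" + id, attrs);
            }
        } finally {
            r.close();
        }
        return hits;
    }
}
```

A pass like this still reads the whole file for every query, so by itself it does not make single lookups fast. It is better suited to resolving a whole batch of ids in one pass.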

Would it be useful to just create a flat file that indexes the ids and points to an offset in the XML file? Are the optimizations mentioned above even necessary? Or are there parsers known to be fast enough with the original data?
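The offset-index idea could be sketched roughly as follows. This is not a real XML parser: it assumes that every n, w or r element starts on its own line, that the id attribute sits on that same line, and that it is written with double quotes. The class name OffsetIndex and the "tag:id" key format are made up for the example. The index is kept in memory here; writing it to a flat file would be a further step.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps "tag:id" to the byte offset of the line
// on which that element starts.
public class OffsetIndex {

    // Matches an opening n, w or r tag with a double-quoted id attribute.
    private static final Pattern TAG_WITH_ID =
            Pattern.compile("^\\s*<(n|w|r)\\b[^>]*?\\sid=\"([^\"]*)\"");

    public static Map<String, Long> build(File xml) throws IOException {
        Map<String, Long> index = new HashMap<String, Long>();
        InputStream in = new BufferedInputStream(new FileInputStream(xml));
        try {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long lineStart = 0; // byte offset where the current line begins
            long pos = 0;       // number of bytes consumed so far
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    record(index, line, lineStart);
                    line.reset();
                    lineStart = pos;
                } else {
                    line.write(b);
                }
            }
            record(index, line, lineStart); // last line without trailing newline
        } finally {
            in.close();
        }
        return index;
    }

    private static void record(Map<String, Long> index, ByteArrayOutputStream line,
                               long lineStart) throws IOException {
        // ISO-8859-1 maps bytes to chars one-to-one, which is enough to
        // match the ASCII tag and attribute names.
        Matcher m = TAG_WITH_ID.matcher(line.toString("ISO-8859-1"));
        if (m.find()) {
            index.put(m.group(1) + ":" + m.group(2), lineStart);
        }
    }

    // Jumps straight to an indexed offset and returns that line.
    public static String readLineAt(File xml, long offset) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(xml, "r");
        try {
            raf.seek(offset);
            return raf.readLine();
        } finally {
            raf.close();
        }
    }
}
```

Offsets are stored as long values, so a file larger than 2 GB is not a problem for the offsets themselves. With a few million entries, though, a HashMap of strings may use a lot of memory, and a more compact structure or a sorted index file might be needed.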

A small note: as a test, I took the id on the very last line of the file and searched for it using grep. This took around a minute on a Core 2 Duo.

What happens if the file grows even bigger, let's say 5 GB?

I appreciate any advice or recommendation. Thank you all very much in advance, and regards.
