Parsing Wiki XML Dumps ver0.4 just got tough

Posted by syed on Stack Overflow
Published on 2010-06-05T16:51:36Z Indexed on 2010/06/05 19:12 UTC

Hello, I am trying to parse a Wikipedia XML dump using "Parse-MediaWikiDump-1.0.4" together with the "wikiprep.pl" script. The script seems to work with version 0.3 wiki XML dumps but not with the latest version 0.4 dumps. I get the following error:

Can't locate object method "page" via package "Parse::MediaWikiDump::Pages" at wikiprep.pl line 390.

Also, the "Parse-MediaWikiDump-1.0.4" documentation at http://search.cpan.org/~triddle/Parse-MediaWikiDump-1.0.4/lib/Parse/MediaWikiDump/Pages.pm says, under "LIMITATIONS", "Version 0.4": "This class was updated to support version 0.4 dump files from a MediaWiki instance but it does not currently support any of the new information available in those files."

Any workarounds would help me get to the next level.

Note: one may wonder why I do not use a SAX or StAX parser directly. The Wikipedia dump is a single file of more than 25 GB, so I expect stack/memory issues. The Perl script above gets around that, but I am currently stuck on this version problem.
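
For reference, a streaming parser does not need to keep the whole dump in memory as long as each finished page is discarded. Below is a minimal sketch in Python, not a fix for wikiprep.pl or Parse::MediaWikiDump. It assumes each article sits in a `page` element with `title` and `text` descendants; `iter_pages`, `local_name` and `SAMPLE` are hypothetical names made up for the illustration, and the namespace is stripped so the export version (0.3 or 0.4) should not matter.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix so the code ignores the export version."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, discarding each page after use."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title, text = None, None
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            root.clear()  # drop finished pages so memory use stays bounded


# Tiny stand-in for a real dump (assumption: same element names as the dump).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" version="0.4">
  <page><title>A</title><revision><text>alpha</text></revision></page>
  <page><title>B</title><revision><text>beta</text></revision></page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, len(text))
```

For a real dump, `iter_pages` would be given the open dump file instead of the in-memory sample; whether this is fast enough for a 25 GB file is something I have not measured.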

© Stack Overflow or respective owner