Read huge free text docs in one file for lucene indexing

Posted by Jun on Stack Overflow
Published on 2012-05-31T04:36:51Z Indexed on 2012/05/31 4:40 UTC

Filed under: text | lucene | indexing | huge

I have a large number of free-text news documents stored in one big file. Each document is structured like this:

(Header line) Category, Doc1, Date (day, month, year)
(body text)
...
...
...

(Header line) Category, Doc2, Date (day, month, year)
(body text)
...
...
...

Extracting each document from the big file first takes too much time and is inefficient. Therefore, I decided to read the file line by line and feed the information to Lucene at the same time. I am writing C# code to index each document into Lucene, like this:

using (StreamReader sr = new StreamReader(file))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // How can I tell whether this line is a document header line or a body text line,
        // and collect the metadata and all the text lines of a document for Lucene to index?
        //
        // Also, the text comes from OCR, which does not produce correct line breaks,
        // and captions are mixed in with the content text.
        //
        // Repeat this until the end of the file.
    }
}

with thanks
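
One possible way to approach the loop described above is to treat every line that matches a header pattern as the start of a new document and to buffer everything else as body text. The sketch below is only an illustration under assumptions: the names NewsDoc, NewsFileParser and ReadDocs are hypothetical, and the regular expression assumes a header such as "Sports, Doc12, 31 May 2012", which may not match the real file. OCR noise in the header lines would need a more tolerant pattern.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical container for one news document recovered from the big file.
public sealed class NewsDoc
{
    public string Category = "";
    public string DocId = "";
    public string Date = "";
    public string Body = "";
}

public static class NewsFileParser
{
    // ASSUMPTION: a header looks like "Sports, Doc12, 31 May 2012".
    // Adjust the pattern to whatever the real header lines contain.
    static readonly Regex Header = new Regex(
        @"^\s*(?<cat>[^,]+),\s*(?<id>Doc\d+)\s*,\s*(?<date>\d{1,2}\s+\S+\s+\d{4})\s*$",
        RegexOptions.Compiled);

    // Yields documents one at a time, so the whole file is never held in memory.
    public static IEnumerable<NewsDoc> ReadDocs(TextReader reader)
    {
        NewsDoc current = null;
        var body = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = Header.Match(line);
            if (m.Success)
            {
                // A new header closes the previous document.
                if (current != null)
                {
                    current.Body = body.ToString().Trim();
                    yield return current;
                }
                current = new NewsDoc
                {
                    Category = m.Groups["cat"].Value.Trim(),
                    DocId = m.Groups["id"].Value,
                    Date = m.Groups["date"].Value
                };
                body.Clear();
            }
            else if (current != null && line.Trim().Length > 0)
            {
                // OCR line breaks are unreliable, so join body lines with spaces.
                body.Append(line.Trim()).Append(' ');
            }
        }
        // Flush the last document at end of file.
        if (current != null)
        {
            current.Body = body.ToString().Trim();
            yield return current;
        }
    }
}
```

Calling NewsFileParser.ReadDocs(new StreamReader(file)) in a foreach loop would hand back one NewsDoc at a time. Each one could then be mapped to a Lucene.Net Document (category, id and date as stored fields, body as analyzed text) and passed to IndexWriter.AddDocument. Captions mixed into the body by OCR are not separated here; they simply end up in the indexed text.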

© Stack Overflow or respective owner
