Tokenizing Twitter Posts in Lucene

Posted by Amaç Herdagdelen on Stack Overflow
Published on 2010-03-31T17:26:09Z Indexed on 2010/04/01 6:23 UTC

Filed under: lucene | twitter | tokenizing | tokenize

Hello,

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work for this because it discards punctuation, including the leading @ and #. It does other useful things, though, such as keeping domain names and email addresses together and recognizing acronyms. How can I get an analyzer that does everything StandardTokenizer does but leaves terms like @user and #hashtag untouched?

My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing the # and @ characters with alphanumeric marker strings. For example,

// tweetText is the raw text of the tweet
String newText = tweetText.replace("#", "hashtag");
newText = newText.replace("@", "addresstag");

Unfortunately, this method breaks legitimate email addresses, because every @ is replaced, including the one inside an address. I can live with that. Does this approach make sense?
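A narrower version of the same preprocessing might avoid the email problem. The following is only a sketch under assumptions: the class name TweetPreprocessor and the method protect are made up for illustration, and the rule "a # or @ counts as a tag only at the start of the text or after whitespace, and only when followed by a word character" is a guess at what a mention or hashtag looks like, not Twitter's actual definition. It uses only java.util.regex, so the result can be passed to whatever analyzer is in use.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites leading '#' and '@' into alphanumeric markers
// so a punctuation-stripping tokenizer keeps the information.
final class TweetPreprocessor {

    // '#' not preceded by a non-whitespace character and followed by a word character.
    private static final Pattern HASHTAG = Pattern.compile("(?<!\\S)#(?=\\w)");

    // Same rule for '@'; the '@' inside me@example.com is preceded by a letter,
    // so it does not match and the address is left alone.
    private static final Pattern MENTION = Pattern.compile("(?<!\\S)@(?=\\w)");

    private TweetPreprocessor() {}

    static String protect(String tweetText) {
        String result = HASHTAG.matcher(tweetText).replaceAll("hashtag");
        return MENTION.matcher(result).replaceAll("addresstag");
    }

    public static void main(String[] args) {
        System.out.println(protect("@user likes #lucene, mail me@example.com"));
        // prints: addresstaguser likes hashtaglucene, mail me@example.com
    }
}
```

The trade-off is that a tag directly after punctuation, such as "(@user", is not rewritten under this rule, and an ordinary word that happens to start with "hashtag" or "addresstag" becomes indistinguishable from a rewritten tag.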

Thanks in advance!

Amaç

© Stack Overflow or respective owner
