Using a "white list" for extracting terms for Text Mining

Posted by [email protected] on Oracle Blogs
Published on Tue, 23 Mar 2010 17:00:00 +0000

In Part 1 of my post on "Generating cluster names from a document clustering model" (part 1, part 2, part 3), I showed how to build a clustering model from text documents using Oracle Data Miner, which automates preparing data for text mining. In this process we specified a custom stoplist and lexer and relied on Oracle Text to identify important terms.  However, there is an alternative approach, the white list, which uses a thesaurus object with the Oracle Text CTXRULE index to allow you to specify the important terms.

INTRODUCTION
A stoplist is used to exclude, i.e., black list, specific words in your documents from being indexed. For example, words like a, if, and, or, and but normally add no value when text mining. Other words can also be excluded if they do not help to differentiate documents, e.g., the word Oracle is ubiquitous in the Oracle product literature.
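
In Oracle Text, a stoplist like this can be defined with the CTX_DDL package. The following is only a minimal sketch: the name my_stoplist and the particular stopwords are assumptions chosen for illustration, not the stoplist used in the earlier posts.

```sql
BEGIN
  -- Create an empty stoplist (my_stoplist is a hypothetical name)
  CTX_DDL.CREATE_STOPLIST('my_stoplist', 'BASIC_STOPLIST');
  -- Common function words that rarely help text mining
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'and');
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'but');
  -- A domain-specific word that appears in nearly every document
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'Oracle');
END;
/
```

The stoplist would then be named in the PARAMETERS clause when the text index is created, for example parameters ('stoplist my_stoplist').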

One problem with stoplists is determining which words to specify. This usually requires inspecting the terms that are extracted, manually identifying which ones you don't want, and then re-indexing the documents to determine if you missed any. Since a corpus of documents could contain thousands of words, this could be a tedious exercise. Moreover, since every word is considered as an individual token, a term excluded in one context may be needed to help identify a term in another context. For example, in our Oracle product literature example, the words "Oracle Data Mining" taken individually are not particularly helpful. The term "Oracle" may be found in nearly all documents, as may the term "Data." The term "Mining" is more distinctive, but could also refer to the mining industry. If we exclude "Oracle" and "Data" by specifying them in the stoplist, we lose valuable information. But if we include them, they may introduce too much noise.

Still, when you have a broad vocabulary or don't have a list of specific terms of interest, you rely on the text engine to identify important terms, often by computing the term frequency-inverse document frequency (tf-idf) metric. (This is effectively a weight associated with each term indicating its relative importance in a document within a collection of documents. We'll revisit this later.) The results of this technique are often quite valuable.
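
To make the tf-idf idea concrete, here is a rough sketch of one common formulation, the term count multiplied by the natural log of (number of documents / number of documents containing the term). The table doc_terms and its columns are hypothetical, and the weighting Oracle computes internally may differ from this.

```sql
-- doc_terms(doc_id, term, freq) is a hypothetical table: one row per
-- distinct term in each document, freq = occurrences in that document.
SELECT dt.doc_id,
       dt.term,
       dt.freq * LN(n.total_docs / df.doc_freq) AS tf_idf
FROM   doc_terms dt
JOIN   (SELECT term, COUNT(DISTINCT doc_id) AS doc_freq
        FROM   doc_terms
        GROUP  BY term) df
       ON df.term = dt.term
CROSS JOIN (SELECT COUNT(DISTINCT doc_id) AS total_docs
            FROM   doc_terms) n
ORDER  BY dt.doc_id, tf_idf DESC;
```

Under this formulation, a term that occurs in every document gets a weight of LN(1) = 0, which matches the intuition that a word like "Oracle" does not help differentiate Oracle product documents.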

As noted above, an alternative to the subtractive nature of the stoplist is to specify a white list, or a list of terms--perhaps multi-word--that we want to extract and use for data mining. The obvious downside to this approach is the need to specify the set of terms of interest. However, this may not be as daunting a task as it seems. For example, in a given domain (Oracle product literature), there is often a recognized glossary, or a list of keywords and phrases (Oracle product names, industry names, product categories, etc.). Being able to identify multi-word terms, e.g., "Oracle Data Mining" or "Customer Relationship Management" as a single token can greatly increase the quality of the data mining results.

The remainder of this post and subsequent posts will focus on how to produce a dataset that contains white list terms, suitable for mining.

CREATING A WHITE LIST
We'll leverage the thesaurus capability of Oracle Text. Using a thesaurus, we create a set of rules that are in effect our mapping from single and multi-word terms to the tokens used to represent those terms. For example, "Oracle Data Mining" becomes "ORACLEDATAMINING."

First, we'll create and populate a mapping table called my_term_token_map. All text has been converted to upper case and values in the TERM column are intended to be mapped to the token in the TOKEN column.

TERM                                TOKEN
DATA MINING                         DATAMINING
ORACLE DATA MINING                  ORACLEDATAMINING
11G                                 ORACLE11G
JAVA                                JAVA
CRM                                 CRM
CUSTOMER RELATIONSHIP MANAGEMENT    CRM
...
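
For reference, a table like this could be created and loaded as sketched below. The column sizes are my assumptions, and only the rows listed above are inserted.

```sql
-- Column sizes are assumptions; adjust them to fit your own terms.
CREATE TABLE my_term_token_map (term  VARCHAR2(100),
                                token VARCHAR2(100));

-- Several terms may map to the same token (see CRM below).
INSERT INTO my_term_token_map (term, token) VALUES ('DATA MINING', 'DATAMINING');
INSERT INTO my_term_token_map (term, token) VALUES ('ORACLE DATA MINING', 'ORACLEDATAMINING');
INSERT INTO my_term_token_map (term, token) VALUES ('11G', 'ORACLE11G');
INSERT INTO my_term_token_map (term, token) VALUES ('JAVA', 'JAVA');
INSERT INTO my_term_token_map (term, token) VALUES ('CRM', 'CRM');
INSERT INTO my_term_token_map (term, token) VALUES ('CUSTOMER RELATIONSHIP MANAGEMENT', 'CRM');
COMMIT;
```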

Next, we'll create a thesaurus object my_thesaurus and a rules table my_thesaurus_rules:

BEGIN
  -- FALSE: create a case-insensitive thesaurus
  CTX_THES.CREATE_THESAURUS('my_thesaurus', FALSE);
END;
/

CREATE TABLE my_thesaurus_rules (main_term     VARCHAR2(100),
                                 query_string  VARCHAR2(400));

We next populate the thesaurus object and rules table using the term token map. A cursor is defined over my_term_token_map. As we iterate over the rows, we insert a synonym relationship 'SYN' into the thesaurus. We also insert into the table my_thesaurus_rules the main term and the corresponding query string, which specifies synonyms for the token in the thesaurus.

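DECLARE
  -- One row per (term, token) pair in the mapping table
  cursor c2 is
    select token, term
    from my_term_token_map;
BEGIN
  for r_c2 in c2 loop
    -- Register the term as a synonym of its token in the thesaurus
    CTX_THES.CREATE_RELATION('my_thesaurus',r_c2.token,'SYN',r_c2.term);
    -- Store the token with the query that expands to its synonyms
    -- (static SQL avoids concatenating values into a dynamic statement)
    insert into my_thesaurus_rules (main_term, query_string)
    values (r_c2.token, 'SYN(' || r_c2.token || ', my_thesaurus)');
  end loop;
END;
/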


We are effectively inserting the token to return and the corresponding query that will look up synonyms in our thesaurus into the my_thesaurus_rules table, for example:

     ORACLEDATAMINING        SYN(ORACLEDATAMINING, my_thesaurus)
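
If you want to confirm that the relations were stored as intended, one option is to ask the thesaurus for the synonym expansion of a token. This is just a sketch; it assumes server output is enabled in your session, and the variable name is arbitrary.

```sql
DECLARE
  expansion VARCHAR2(2000);
BEGIN
  -- Ask my_thesaurus for the synonym expansion of one token
  expansion := CTX_THES.SYN('ORACLEDATAMINING', 'my_thesaurus');
  DBMS_OUTPUT.PUT_LINE('Expansion: ' || expansion);
END;
/
```

The expansion should include the token together with the terms mapped to it in my_term_token_map.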

At this point, we create a CTXRULE index on the my_thesaurus_rules table:

create index my_thesaurus_rules_idx on
       my_thesaurus_rules(query_string)
       indextype is ctxsys.ctxrule;
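
As a quick sanity check before moving on, a CTXRULE index can be queried with the MATCHES operator, which takes the indexed query column and a piece of document text. The sample sentence below is made up for illustration.

```sql
-- List the tokens whose rules match the supplied text
SELECT main_term
FROM   my_thesaurus_rules
WHERE  MATCHES(query_string,
               'We evaluated Oracle Data Mining for our CRM project') > 0;
```

Each row returned corresponds to a rule whose synonym expansion was found in the text, so the main_term values are the white list tokens present in that document.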


In my next post, this index will be used to extract the tokens that match each of the rules specified. We'll then compute the tf-idf weights for each of the terms and create a nested table suitable for mining.

© Oracle Blogs or respective owner
