In Part 1 of this series on generating cluster names from a document clustering model (part 1, part 2, part 3), I showed how to build a clustering model from text documents using Oracle Data Miner, which automates preparing data for text mining. In that process, we specified a custom stoplist and lexer and relied on Oracle Text to identify the important terms.
However, there is an alternative approach, the white list, which uses a thesaurus object with the Oracle Text CTXRULE index to allow you to specify the important terms.
INTRODUCTION

A stoplist is used to exclude, i.e., black list, specific words in your documents from being indexed. For example, words like "a," "if," "and," "or," and "but" normally add no value when text mining. Other words can also be excluded if they do not help to differentiate documents, e.g., the word "Oracle" is ubiquitous in Oracle product literature.
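For reference, a custom stoplist is created with the CTX_DDL package. The sketch below is illustrative only: the stoplist name and the particular stopwords are my own examples, not ones from this series.

```sql
-- Hedged sketch: creating a custom stoplist with Oracle Text.
-- The stoplist name and stopwords are illustrative examples.
BEGIN
  CTX_DDL.CREATE_STOPLIST('my_stoplist', 'BASIC_STOPLIST');
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'and');
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'or');
  CTX_DDL.ADD_STOPWORD('my_stoplist', 'oracle');
END;
/
```

The stoplist can then be referenced in the parameters clause when creating a CONTEXT index.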
One problem with stoplists is determining which words to specify. This
usually requires inspecting the terms that are extracted, manually identifying which ones you don't want, and then re-indexing the
documents to determine if you missed any. Since a corpus of documents could contain thousands of
words, this could be a tedious exercise. Moreover,
since every word is considered as an individual token, a term excluded
in one context may be needed to help identify a term in another
context. For example, in our Oracle product literature example, the
words "Oracle Data Mining" taken individually are not particularly
helpful. The term "Oracle" may be found in nearly all documents, as
may the term "Data." The term "Mining" is more distinctive, but could also
refer to the mining industry. If we exclude "Oracle" and "Data" by
specifying them in the stoplist, we lose valuable information; but if
we include them, they may introduce too much noise. Still, when you have a broad vocabulary or don't have a list of specific terms
of interest, you rely on the
text engine to identify important terms, often by computing the term
frequency-inverse document frequency (tf-idf) metric. (This is
effectively a weight associated with each term, indicating its relative
importance in a
document within a collection of documents. We'll revisit this later.) The results using this technique are often quite valuable.
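As a rough illustration of the metric, and assuming a hypothetical token_counts table with one row per (docid, token) pair and a cnt column holding the term frequency, the weight could be computed in SQL along these lines (the table and column names are mine, not part of this series):

```sql
-- Hedged sketch: tf-idf over a hypothetical token_counts table
-- with columns (docid, token, cnt); all names are illustrative.
select t.docid,
       t.token,
       -- term frequency * log(total docs / docs containing the token)
       t.cnt * ln(d.total_docs / t.df) as tfidf
from (select docid, token, cnt,
             count(*) over (partition by token) as df
      from token_counts) t
cross join (select count(distinct docid) as total_docs
            from token_counts) d;
```

A term appearing in every document gets a weight near zero, while a term concentrated in a few documents gets a high weight.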
As noted above, an alternative to the subtractive nature of the stoplist is to specify a white list:
a list of terms, perhaps multi-word, that we want to extract and use for data mining. The obvious downside to
this approach is the need to specify the set of terms of interest.
However, this may not be as daunting a task as it seems. For example,
in a given domain (Oracle product literature), there is often a
recognized glossary, or a list of keywords and phrases (Oracle product
names, industry names, product categories, etc.). Being able to
identify multi-word terms, e.g., "Oracle Data Mining" or "Customer
Relationship Management" as a single token can greatly increase the
quality of the data mining results.
The remainder of this post and subsequent posts will focus on how to
produce a dataset that contains white list terms, suitable for mining.
CREATING A WHITE LIST
We'll leverage the thesaurus capability of Oracle Text. Using a thesaurus, we create a set of rules
that are in effect our mapping from single and multi-word terms to the
tokens used to represent those terms. For example, "Oracle Data Mining"
becomes "ORACLEDATAMINING."
First, we'll create and populate a mapping table called my_term_token_map. All text has been converted to upper case, and each value in the TERM column is mapped to the token in the TOKEN column.
TERM                              TOKEN
----                              -----
DATA MINING                       DATAMINING
ORACLE DATA MINING                ORACLEDATAMINING
11G                               ORACLE11G
JAVA                              JAVA
CRM                               CRM
CUSTOMER RELATIONSHIP MANAGEMENT  CRM
...
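The mapping table itself can be created and populated with ordinary DDL and inserts; a minimal sketch, where the column sizes are my assumption:

```sql
-- Hedged sketch: creating and populating my_term_token_map.
-- Column sizes are assumptions; the values come from the table above.
create table my_term_token_map (
  term  varchar2(400),
  token varchar2(100)
);

insert into my_term_token_map values ('DATA MINING', 'DATAMINING');
insert into my_term_token_map values ('ORACLE DATA MINING', 'ORACLEDATAMINING');
insert into my_term_token_map values ('CUSTOMER RELATIONSHIP MANAGEMENT', 'CRM');
```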
Next, we'll create a thesaurus object my_thesaurus and a rules table my_thesaurus_rules:

exec CTX_THES.CREATE_THESAURUS('my_thesaurus', FALSE);

CREATE TABLE my_thesaurus_rules (
  main_term    VARCHAR2(100),
  query_string VARCHAR2(400)
);
We next populate the thesaurus object and rules table using the term token map. A cursor is defined over my_term_token_map. As we iterate over the rows, we insert a synonym relationship 'SYN' into the thesaurus. We also insert into the table my_thesaurus_rules the main term and the corresponding query string, which specifies synonyms for the token in the thesaurus.
DECLARE
  cursor c2 is
    select token, term
      from my_term_token_map;
BEGIN
  for r_c2 in c2 loop
    -- register the term as a synonym of its token in the thesaurus
    CTX_THES.CREATE_RELATION('my_thesaurus', r_c2.token, 'SYN', r_c2.term);
    -- store the token and the query that finds its synonyms
    EXECUTE IMMEDIATE
      'insert into my_thesaurus_rules values
         (:1, ''SYN(' || r_c2.token || ', my_thesaurus)'')'
      using r_c2.token;
  end loop;
END;
/
We are effectively inserting into the my_thesaurus_rules table the token to return and the corresponding query that will look up that token's synonyms in our thesaurus, for example:

'ORACLEDATAMINING'   SYN ('ORACLEDATAMINING', my_thesaurus)

At this point, we create a CTXRULE index on the my_thesaurus_rules table:
create index my_thesaurus_rules_idx on
my_thesaurus_rules(query_string)
indextype is ctxsys.ctxrule;
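To give a flavor of how such an index is used, a CTXRULE index is queried with the MATCHES operator, which returns the rules whose query strings match a given piece of text. A hedged sketch, where the sample document text is my own example:

```sql
-- Hedged sketch: querying the CTXRULE index with MATCHES.
-- Returns the tokens whose thesaurus rules match the sample text;
-- the sample document text is an illustrative example.
select main_term
  from my_thesaurus_rules
 where matches(query_string,
               'Oracle Data Mining supports classification') > 0;
```

Because the rule for 'ORACLEDATAMINING' lists "ORACLE DATA MINING" as a synonym, a document containing that phrase matches the rule and the multi-word term is captured as a single token.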
In my next post, this index will be used to extract the tokens that
match each of the rules specified. We'll then compute the tf-idf
weights for each of the terms and create a nested table suitable for
mining.