In my last post, we laid the groundwork for extracting specific tokens from a whitelist using a CTXRULE index. In this post, we will populate a table with the extracted tokens and produce a case table suitable for clustering with Oracle Data Mining.
Our corpus of documents will be stored in a database table that is defined as
create table documents(id NUMBER, text VARCHAR2(4000));
However, any suitable Oracle Text-accepted data type can be used for the text.
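For illustration, a couple of hypothetical rows (the text values here are made up; real content would come from your corpus):

```sql
-- Hypothetical sample documents for testing the pipeline.
insert into documents values (1, 'Oracle Text can index and search free-form text.');
insert into documents values (2, 'Clustering groups similar documents together.');
commit;
```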
We then create a table to contain the extracted tokens. The id column contains the unique identifier (or case id)
of the document. The token column contains the extracted token. Note
that a given document may have many tokens, so there will be one row
per token for a given document.
create table extracted_tokens (id NUMBER, token VARCHAR2(4000));
The next step is to iterate over the documents and extract the matching
tokens using the index and insert them into our token table. We use the
MATCHES function for matching the query_string from my_thesaurus_rules
with the text.
DECLARE
  cursor c2 is
    select id, text
      from documents;
BEGIN
  for r_c2 in c2 loop
    insert into extracted_tokens
      select r_c2.id id, main_term token
        from my_thesaurus_rules
       where matches(query_string, r_c2.text) > 0;
  end loop;
END;
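Before moving on, it can be useful to sanity-check the extraction, for example by counting tokens per document. A quick sketch against the tables defined above:

```sql
-- Count extracted tokens per document; documents that matched no
-- rule appear with a count of 0 thanks to the outer join.
select d.id, count(t.token) token_cnt
  from documents d, extracted_tokens t
 where d.id = t.id (+)
 group by d.id
 order by d.id;
```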
Now that we have the tokens, we can compute the term frequency-inverse document frequency (TF-IDF) for each token of each document.
create table extracted_tokens_tfidf as
with num_docs as (
  select count(distinct id) doc_cnt
    from extracted_tokens),
tf as (
  select a.id, a.token, a.token_cnt / b.num_tokens token_freq
    from (select id, token, count(*) token_cnt
            from extracted_tokens
           group by id, token) a,
         (select id, count(*) num_tokens
            from extracted_tokens
           group by id) b
   where a.id = b.id),
doc_freq as (
  select token, count(*) overall_token_cnt
    from extracted_tokens
   group by token)
select tf.id, tf.token,
       token_freq * ln(doc_cnt / df.overall_token_cnt) tf_idf
  from num_docs, tf, doc_freq df
 where df.token = tf.token;
In the WITH clause, the num_docs query simply counts the number of documents in the corpus. The tf query computes the term (token) frequency: the number of times each token appears in a document, divided by the total number of tokens found in that document. The doc_freq query counts the number of times each token appears overall in the corpus. In the SELECT clause, we combine these to compute the tf_idf.
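As a concrete check of the formula, with made-up counts: a token appearing 2 times in a 10-token document, and 5 times across a 100-document corpus, gets tf_idf = (2/10) * ln(100/5), roughly 0.599:

```sql
-- Worked TF-IDF example with made-up counts:
--   token_freq = 2/10, doc_cnt = 100, overall_token_cnt = 5
select (2/10) * ln(100/5) tf_idf from dual;
```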
Next, we create the nested table required to produce one record per
case, where a case corresponds to an individual document. Here, we
COLLECT all the tokens for a given document into the nested column extracted_tokens_tfidf_1.
CREATE TABLE extracted_tokens_tfidf_nt
NESTED TABLE extracted_tokens_tfidf_1
STORE AS extracted_tokens_tfidf_tab AS
select id,
cast(collect(DM_NESTED_NUMERICAL(token,tf_idf)) as DM_NESTED_NUMERICALS) extracted_tokens_tfidf_1
from extracted_tokens_tfidf
group by id;
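To inspect the nested representation, the collection can be unnested again with the TABLE operator; the DM_NESTED_NUMERICAL type exposes attribute_name and value fields. A sketch, assuming a document with id 1 exists:

```sql
-- Unnest the collection to see one row per token again.
select n.id, t.attribute_name token, t.value tf_idf
  from extracted_tokens_tfidf_nt n,
       table(n.extracted_tokens_tfidf_1) t
 where n.id = 1;
```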
To build the clustering model, we create a settings table and then
insert the various settings. Most notable are the number of clusters (20); cosine distance, which works better for text; turning off automatic data preparation, since the values are already prepared for mining; the number of iterations (20), to get a better model; and a split criterion of size, which yields clusters that are roughly balanced in the number of cases assigned.
CREATE TABLE km_settings (setting_name VARCHAR2(30), setting_value VARCHAR2(30));
BEGIN
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.clus_num_clusters, 20);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.kmns_distance, dbms_data_mining.kmns_cosine);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.kmns_iterations, 20);
  INSERT INTO km_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.kmns_split_criterion, dbms_data_mining.kmns_size);
  COMMIT;
END;
With this in place, we can now build the clustering model.
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'TEXT_CLUSTERING_MODEL',
mining_function => dbms_data_mining.clustering,
data_table_name => 'extracted_tokens_tfidf_nt',
case_id_column_name => 'id',
settings_table_name => 'km_settings');
END;
To generate cluster names from this model, check out my earlier post on that topic.
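Once the model is built, the SQL scoring functions CLUSTER_ID and CLUSTER_PROBABILITY can assign each case to its nearest cluster. A sketch against the same case table:

```sql
-- Score each document: most likely cluster and its probability.
select id,
       cluster_id(text_clustering_model using *) cls,
       cluster_probability(text_clustering_model using *) prob
  from extracted_tokens_tfidf_nt
 order by id;
```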