What is a Document Classification Application?
A document classification application is one that
classifies an incoming stream of documents based
on their content. These applications are also
known as document routing or filtering
applications. For example, an online news agency
might need to classify its incoming stream of
articles as they arrive into different categories
like politics, business, sports, health, and
travel.
What is the CTXRULE Index Type?
CTXRULE is a new index type designed for classification, which is
the inverse of information retrieval. In
traditional information retrieval applications,
you index a set of documents and find documents
with a text query. In classification, you index a
set of queries, and find queries with a document.
Creating a CTXRULE Index
The first step is to create a table of queries that define our
classifications:
create table queries (
  query_id number primary key,
  category varchar2(40),
  query    varchar2(80)
);
We now populate the table with the classifications and the queries
that define each:
insert into queries values (1,'Sports','tennis or football or soccer');
insert into queries values (2,'Health','cardiovascular medicine');
Now we create the CTXRULE index. We can also specify lexer, storage,
section group, and wordlist parameters if needed:
create index query_idx on queries(query) indextype is ctxsys.ctxrule;
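As a rough sketch of the optional parameters mentioned above, the index could instead be created with a custom lexer preference. The preference name query_lexer and the printjoins setting are illustrative assumptions, not part of the original example. This form is an alternative to the plain statement, not something to run in addition to it:

```sql
-- Hypothetical lexer preference; the name and attribute are for illustration only.
begin
  ctx_ddl.create_preference('query_lexer', 'BASIC_LEXER');
  -- Treat hyphenated terms such as "e-mail" as single tokens.
  ctx_ddl.set_attribute('query_lexer', 'printjoins', '-');
end;
/

-- Same index as above, but with the preference passed in the PARAMETERS clause.
create index query_idx on queries(query)
  indextype is ctxsys.ctxrule
  parameters ('lexer query_lexer');
```

Storage, section group, and wordlist preferences would be named in the same PARAMETERS string.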
Classifying a Document
We'll be using the MATCHES operator instead of the CONTAINS operator used for
a CONTEXT index. Assuming that the incoming
documents are stored in a table like:
create table news (
  newsid       number,
  author       varchar2(40),
  source       varchar2(30),
  category     varchar2(50),
  news_article clob
);
We can create a trigger with MATCHES to classify each document:
...
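begin
  -- find every stored query that matches the incoming article
  for c in (select category from queries
            where matches(query, :new.news_article) > 0 )
  loop
    -- if several queries match, the last one fetched wins
    :new.category := c.category;
  end loop;
end;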
...
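Before wiring MATCHES into a trigger, it can help to try the rules from an interactive session. The following is a minimal sketch that passes a literal piece of text in place of a stored article. The sample sentence is made up for illustration:

```sql
-- Ad hoc check: which stored queries match this piece of text?
select query_id, category
  from queries
 where matches(query, 'The tennis final was delayed by rain.') > 0;
```

With the two rules inserted earlier, only the Sports rule contains a term that appears in this sentence, so that is the row one would expect back.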
An article could match multiple categories. In our example, the
trigger keeps only one of them; recording every match would
require another table that stores category/article id pairs.
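One way this could look is sketched below. The table news_categories, the trigger name, and the choice of a BEFORE INSERT row trigger are all assumptions made for illustration. The original leaves the trigger header elided:

```sql
-- Hypothetical link table: one row per (article, matching category).
create table news_categories (
  newsid   number,
  category varchar2(40)
);

-- Hypothetical trigger: records every matching category instead of just one.
create or replace trigger news_categories_trg
before insert on news
for each row
begin
  insert into news_categories (newsid, category)
  select :new.newsid, q.category
    from queries q
   where matches(q.query, :new.news_article) > 0;
end;
/
```

The categories for an article can then be found by joining news to news_categories on newsid.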
Instead of indexing the article text and periodically running stored
queries against it, we index the query strings and use the incoming
document text to query the queries. This solves the scalability
problem of the stored-query approach, because we no longer have to
run every query each time new documents arrive.
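The contrast between the two approaches can be sketched as follows. The CONTEXT index news_idx and the bind variable :article_text are hypothetical and appear only to make the comparison concrete:

```sql
-- Traditional retrieval: index the documents, then run each stored query
-- against them (one CONTAINS call per category, repeated as articles arrive).
create index news_idx on news(news_article) indextype is ctxsys.context;

select newsid
  from news
 where contains(news_article, 'tennis or football or soccer') > 0;

-- Classification: the queries are indexed (query_idx), so a single MATCHES
-- call with the document text finds all the categories that apply.
select category
  from queries
 where matches(query, :article_text) > 0;
```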
CTXRULE indexes require sync and
optimize, just like a CONTEXT index. Simply use
the ctx_ddl.sync_index and ctx_ddl.optimize_index
calls, passing a CTXRULE index name instead of a
CONTEXT index name.
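For the query_idx index created earlier, the maintenance calls might look like this. The 'FULL' optimization level is an assumption about what suits a given workload, not a recommendation from the original:

```sql
begin
  -- Pick up rows inserted or updated in QUERIES since the last sync.
  ctx_ddl.sync_index('query_idx');
  -- Defragment the index; 'FAST' is the lighter-weight alternative.
  ctx_ddl.optimize_index('query_idx', 'FULL');
end;
/
```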
More Info
Oracle9i Database Daily Features