Oracle9i Database Daily Feature - The CTXRULE Index Type

The CTXRULE Index Type - A New Oracle Text Indextype for Building Document Classification Applications

What is a Document Classification Application?

A document classification application is one that classifies an incoming stream of documents based on their content. These applications are also known as document routing or filtering applications. For example, an online news agency might need to classify its incoming stream of articles as they arrive into different categories like politics, business, sports, health, and travel.

What is the CTXRULE Index Type?

CTXRULE is a new index type designed for classification, which is the inverse of information retrieval. In traditional information retrieval applications, you index a set of documents and find documents with a text query. In classification, you index a set of queries, and find queries with a document.

Creating a CTXRULE index

The first step is to create a table of queries that define our classifications:


       create table queries (
         query_id      number primary key,
         category      varchar2(40),
         query         varchar2(80)
       );

We now populate the table with the classifications and the queries that define each:


       insert into queries values (1,'Sports','tennis or football or soccer');
       insert into queries values (2,'Health','cardiovascular medicine');

Now we create the CTXRULE index. We can also specify lexer, storage, section group, and wordlist parameters if needed:


       create index query_idx on queries(query) indextype is ctxsys.ctxrule;
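As a sketch of the optional parameters mentioned above, the statements below create a lexer preference and pass it to the index through a PARAMETERS clause. The preference name rule_lexer and the printjoins setting are illustrative assumptions, not part of the original example. This form would be used in place of the plain CREATE INDEX statement, not in addition to it.

```sql
-- Hypothetical lexer preference; the name rule_lexer is an assumption.
begin
  ctx_ddl.create_preference('rule_lexer', 'BASIC_LEXER');
  -- Treat hyphenated terms such as "e-mail" as single tokens (illustrative choice).
  ctx_ddl.set_attribute('rule_lexer', 'printjoins', '-');
end;
/

-- Same index as above, created with the custom lexer instead of the defaults.
create index query_idx on queries(query)
  indextype is ctxsys.ctxrule
  parameters ('lexer rule_lexer');
```

Storage, section group, and wordlist preferences follow the same pattern: create the preference with ctx_ddl, then name it in the PARAMETERS string.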

Classifying a Document

We'll be using the MATCHES operator instead of the CONTAINS operator used for a CONTEXT index. Assuming that the incoming documents are stored in a table like:


      create table news (
         newsid         number,
         author         varchar2(40),
         source         varchar2(30),
         category       varchar2(50),
         news_article   clob
      );

We can create a trigger with MATCHES to classify each document:


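      ...
      begin
        -- MATCHES returns a value greater than 0 for each stored query
        -- that the article text satisfies
        for c in (select category from queries
                  where matches(query, :new.news_article) > 0)
        loop
          -- if several queries match, the last one fetched wins
          :new.category := c.category;
        end loop;
      end;
      ...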
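MATCHES can also be tried directly from SQL before it is wired into a trigger. The query below is a minimal sketch that assumes the queries table and query_idx index created earlier; the sample sentence is made up for illustration.

```sql
-- Which stored queries does this piece of text satisfy?
-- The document text is passed as the second argument of MATCHES.
select query_id, category
  from queries
 where matches(query, 'The home side won the football final on penalties') > 0;
```

Only rows whose query expression is satisfied by the supplied text are returned, so a sentence that mentions football would be expected to select the Sports row and not the Health row.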

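An article can match more than one category. The trigger above keeps only one of them in news.category (whichever matching row the loop fetches last), so recording every match would require another table that stores article ID and category pairs.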
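One way to record several categories per article is sketched below. The news_categories table and its column names are hypothetical, and the block classifies the stored articles in a batch rather than in a trigger; it assumes the news and queries tables and the query_idx index from the earlier examples.

```sql
-- Hypothetical link table: one row per (article, category) pair.
create table news_categories (
  newsid    number,
  category  varchar2(40)
);

-- Classify every stored article and record every matching category.
begin
  for doc in (select newsid, news_article from news) loop
    -- DISTINCT avoids duplicate rows when two queries share a category.
    for c in (select distinct category from queries
               where matches(query, doc.news_article) > 0)
    loop
      insert into news_categories (newsid, category)
      values (doc.newsid, c.category);
    end loop;
  end loop;
  commit;
end;
/
```

Keeping the categories in a separate table leaves the news table unchanged and lets an article belong to any number of categories.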

Instead of indexing the article text and periodically running every stored query against it, we index the query strings and use the text of each incoming document to query the queries. This addresses the scalability problem of the first approach, because we don't have to run every query all the time.

CTXRULE indexes require sync and optimize, just like a CONTEXT index. Simply use the ctx_ddl.sync_index and ctx_ddl.optimize_index calls, passing a CTXRULE index name instead of a CONTEXT index name.
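A minimal sketch of that maintenance cycle, assuming the query_idx index from above. The Travel rule is a made-up example row, and the FULL optimization level is just one possible choice.

```sql
-- Add a new classification rule (hypothetical example row).
insert into queries values (3, 'Travel', 'airline or hotel or cruise');
commit;

begin
  -- Process pending changes so the new rule takes part in MATCHES.
  ctx_ddl.sync_index('query_idx');
  -- Defragment the index; 'FAST' is the lighter alternative to 'FULL'.
  ctx_ddl.optimize_index('query_idx', 'FULL');
end;
/
```

Until the index is synchronized, rows added to or changed in the queries table are not reflected in MATCHES results.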

More Info

	Visit the Oracle9i Text page for more information
	Visit the OTN Discussion Forum that the Oracle Text team monitors
	Oracle9i Text Reference
	Oracle9i Text Application Developers' Guide
