Tuesday, April 26, 2011

Beware of Nulls

I ran across what I think was an interesting issue recently.
We've been trying to implement a text search component using Oracle 11.2.0.2, featuring an 11g enhancement called SDATA (structured data) sections. It helps solve one of Oracle Text's biggest issues -- that of "mixed" queries (a query that contains a basic text search with structured data qualifiers -like date limits- that are supposed to make the query faster).
In past versions, this type of query would never perform as fast as either a straight text query or a structured-only query; the optimizer would pick one way or the other to generate the list of matching rowids for one part of the query, then run the other part to generate the "final" list of matching rows. So the query takes longer, when the user intuitively feels that the additional information provided should help make the query faster.
In this case, our testing was giving us some unexpected results when we were filtering SDATA negation-- chiefly, the resultset count of matches was lower than it should have been. So, for example, a query for something like 'Ben Franklin and SDATA(language != "FRENCH")' did not return the right number of rows based on the tester's description of the test data.
Investigation revealed that the core issue was that the language field was null in some documents (even though the XML document described the element as mandatory).
Here's an example:
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Oracle Label Security and OLAP options

SQL> create table test_tab
  2  (text        varchar2 (30),
  3  language  varchar2 (10));

Table created.

SQL> insert all
  2    into test_tab values ('Bon Homme Richard', 'ENGLISH')
  3    into test_tab values ('Bon Homme Richard', 'FRENCH')
  4    into test_tab values ('Bon Homme Richard', 'CANADIAN')
  5    into test_tab values ('Bon Homme Richard', '')
  6    select * from dual;

4 rows created.

SQL> insert into test_tab select object_name, null from all_objects;

54562 rows created.

SQL> create index test_idx on test_tab (text)
  2    indextype is ctxsys.context filter by language;

Index created.

SQL> select score (1), text, language from test_tab
  2    where contains(text,'(Bon Homme Richard)', 1) > 0;

  SCORE(1) TEXT                           LANGUAGE
---------- ------------------------------ ----------
        15 Bon Homme Richard              ENGLISH
        15 Bon Homme Richard              FRENCH
        15 Bon Homme Richard              CANADIAN
        15 Bon Homme Richard


SQL> select score (1), text, language from test_tab
  2    where  contains(text,'(Bon Homme Richard)
  3    AND sdata (language != "FRENCH")', 1) > 0;

  SCORE(1) TEXT                           LANGUAGE
---------- ------------------------------ ----------
        15 Bon Homme Richard              ENGLISH
        15 Bon Homme Richard              CANADIAN
In this case, the fix was to correct the source data to set the language field properly. In a case where null values are permitted, the searcher [or developer] needs to be aware of these comparison rules:
select score (1), text, language from test_tab
where  contains(text,'(Bon Homme Richard)
AND (sdata (language != "FRENCH") OR sdata(language is null))', 1) > 0;

  SCORE(1) TEXT                           LANGUAGE
---------- ------------------------------ ----------
        15 Bon Homme Richard              ENGLISH
        15 Bon Homme Richard              CANADIAN
        15 Bon Homme Richard              

No comments: