Indexing Numeric Data with Isite

Beginning with Isite release 2, Isearch can index numeric data, dates and geospatial bounding boxes. Sample code showing which methods a doctype requires is available in the FGDC doctype. This document gives a general idea of how to take advantage of the new data types.

Iindex identifies fields just as it always has, using the ParseFields() method in the doctype. Once the pointers to all of the fields have been located, the indexing process writes out a field coordinate table for each of the fields.

In the case of text fields, this table will contain the starting and ending offsets to the instance of the field being stored. In the case of numeric fields, the table will contain a pointer to the start of the field in the document, and the numeric value (or values - cf. dates and intervals).
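As a rough illustration of the two table layouts just described, the entries might be modeled as below. The struct names, member names and integer widths are hypothetical and are not taken from the Isearch source or its on-disk format.

```cpp
#include <cstdint>

// Hypothetical entry in a text field coordinate table:
// the offsets of one instance of the field within the document.
struct TextFieldEntry {
    std::uint32_t start;  // starting offset of the field instance
    std::uint32_t end;    // ending offset of the field instance
};

// Hypothetical entry in a numeric field table: where the field starts,
// plus the parsed value or values. A plain number would only need one
// value; dates and intervals carry a starting and an ending value.
struct NumericFieldEntry {
    std::uint32_t start;  // offset of the start of the field
    double value_start;   // numeric value, or start of the interval
    double value_end;     // end of the interval (same as start if single)
};
```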

In order to know which type of field table to write out, the indexer must have access to a table of field types. In the FGDC doctype, this field type file is loaded by the doctype and stored internally. The field type table is simply a list of the non-text fields to be found in the documents, along with their types. For the FGDC doctype, the field type table looks like this:


The allowed field types are "num", "date", "date-range", "time" and "gpoly". Note that "time" is not currently implemented. The type "gpoly" is currently implemented for a geospatial bounding rectangle, but the terminology hints that we may extend this to arbitrary polygons someday. The types "date" and "date-range" are similar, but with slight differences, described below.
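A doctype that loads the field type file has to map each type name to something the indexer can switch on. The sketch below shows one way to do that for the five names listed above. The enum and the function LookupFieldType() are hypothetical and do not appear in Isearch.

```cpp
#include <string>

// Hypothetical enumeration of the non-text field types named in the text.
enum class FieldType { Num, Date, DateRange, Time, GPoly };

// Maps a type name from the field type file to a FieldType.
// Returns false (leaving *type untouched) for unrecognized names.
bool LookupFieldType(const std::string& name, FieldType* type) {
    if (name == "num")        { *type = FieldType::Num;       return true; }
    if (name == "date")       { *type = FieldType::Date;      return true; }
    if (name == "date-range") { *type = FieldType::DateRange; return true; }
    if (name == "time")       { *type = FieldType::Time;      return true; }
    if (name == "gpoly")      { *type = FieldType::GPoly;     return true; }
    return false;
}
```

Rejecting unknown names explicitly, rather than silently treating them as text, makes typos in the field type file easier to catch at load time.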

The FGDC doctype receives the name of the field type file from one of the doctype command line options:

Iindex -d mydata -t fgdc -o fieldtype=fgdc.fields *.sgml

where the field type file is called "fgdc.fields".

Now, back to indexing. When the method WriteFieldData() in the INDEX class writes out the field tables, it first looks up the field type. If the field is numeric or date, it calls a parsing routine in the doctype to convert the text contents of the field to the appropriate numeric values. Numeric fields (including individual latitude and longitude fields) are converted by the doctype method ParseNumeric(). Its job is to take the text string in the field and return the appropriate numeric value as a double. Different doctypes will have different text representations of the values, so you will have to write a parser for whatever doctype you're implementing.
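For a doctype whose numeric fields are plain decimal strings, the parser can be very small. The following is a minimal sketch under that assumption. The function name ParseNumericSketch() and its signature are hypothetical, and the real ParseNumeric() interface in Isearch may differ.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical stand-in for a doctype's ParseNumeric().
// Converts the text of a field to a double. Leading and trailing
// whitespace is allowed; any other trailing characters are an error.
// Returns true and stores the result in *value on success.
bool ParseNumericSketch(const std::string& text, double* value) {
    const char* begin = text.c_str();
    char* end = nullptr;
    double parsed = std::strtod(begin, &end);  // skips leading whitespace
    if (end == begin) {
        return false;  // nothing that looks like a number
    }
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r') {
        ++end;  // tolerate trailing whitespace
    }
    if (*end != '\0') {
        return false;  // trailing garbage such as "12abc"
    }
    *value = parsed;
    return true;
}
```

A doctype with units, degree symbols or hemisphere letters in its numeric fields would need to strip or interpret those before calling something like this.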

Date fields are converted by the doctype methods ParseDate() and ParseDateRange(). They take the contents of the field and calculate two numeric (double) values - the starting and ending values of a date range interval. The computed values can be of differing precision, but should be of the form YYYY (e.g., 1986), YYYYMM (e.g., January 1997 would be converted to 199701) or YYYYMMDD (e.g., 15 April 1984 would be converted to 19840415). Fractional days should be converted to the obvious thing.
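The three precisions can be produced with simple arithmetic. The helper below is a hypothetical illustration, not part of Isearch. It assumes that a month or day of 0 means "not given".

```cpp
// Hypothetical helper: encodes a date as YYYY, YYYYMM or YYYYMMDD,
// depending on how much of the date is known.
// Pass month == 0 and/or day == 0 for components that are not given.
double EncodeDate(int year, int month, int day) {
    if (month == 0) {
        return year;                              // YYYY
    }
    if (day == 0) {
        return year * 100.0 + month;              // YYYYMM
    }
    return year * 10000.0 + month * 100.0 + day;  // YYYYMMDD
}
```

For instance, EncodeDate(1997, 1, 0) yields 199701 and EncodeDate(1984, 4, 15) yields 19840415, matching the examples in the text.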

If you examine FGDC::ParseDate() and FGDC::ParseDateRange() you will note some differences. The buffer passed to the routines can be a numeric date, or can be a tagged field - FGDC uses <CALDATE> to tag a single date (for example, Publication date is a single date), and uses <BEGDATE> and <ENDDATE> to tag a date interval. Both date and date-range fields store intervals, but differ in how they treat single dates. If the field is defined as type "date", the beginning and ending points of the interval will be the same - the specified date. If the field is defined as "date-range", and if the dates are not of the full precision (for example, <CALDATE>1996</CALDATE>), then the beginning date will be extended to the starting date of the interval (i.e., 19960101), and the ending date will be extended to the ending date of the interval (i.e., 19961231), at the full precision.
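The "date-range" extension can be sketched as follows. The source only gives the year-precision case (1996 becomes 19960101 to 19961231); extending a YYYYMM value to the first and last day of the month is an assumption made here by analogy. The function names are hypothetical, and four-digit years are assumed.

```cpp
// Days in the given month, or 0 if the month is out of range.
int DaysInMonth(int year, int month) {
    static const int kDays[] = {31, 28, 31, 30, 31, 30,
                                31, 31, 30, 31, 30, 31};
    if (month < 1 || month > 12) {
        return 0;
    }
    bool leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    return (month == 2 && leap) ? 29 : kDays[month - 1];
}

// Hypothetical helper: extends a YYYY, YYYYMM or YYYYMMDD value to a
// full-precision interval [*start, *end]. Returns false if the value
// does not look like one of the three forms.
bool ExpandToRange(long date, double* start, double* end) {
    if (date >= 1000 && date <= 9999) {          // YYYY
        *start = date * 10000 + 101;             // 1 January
        *end = date * 10000 + 1231;              // 31 December
        return true;
    }
    if (date >= 100001 && date <= 999912) {      // YYYYMM
        int year = static_cast<int>(date / 100);
        int month = static_cast<int>(date % 100);
        int last = DaysInMonth(year, month);
        if (last == 0) {
            return false;                        // invalid month
        }
        *start = date * 100 + 1;                 // first day of the month
        *end = date * 100 + last;                // last day of the month
        return true;
    }
    if (date >= 10000101 && date <= 99991231) {  // YYYYMMDD
        *start = *end = date;                    // already full precision
        return true;
    }
    return false;
}
```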

This makes the code that executes the search simpler, but may produce unexpected matches - hopefully you'll get too many results rather than too few.

There are a couple of special dates, as well. The current date (that is, the date on which the search is being run) can be encoded as the constant DATE_PRESENT. Errors can be returned as DATE_ERROR and unknown dates can be returned as DATE_UNKNOWN. No negative dates are allowed (for those of you who did your research in previous millennia).

If your date field is single valued (that is, not an interval), you can return the same value for the starting and ending dates. The search engine will treat it as a trivial interval.

To review, there are only three steps necessary to handle indexing the new data types.

  1. Create the field type file
  2. Add methods to the doctype to load the field table, LoadFieldTable(), and to parse the numeric and date data, ParseNumeric(), ParseDate() and ParseDateRange().
  3. Index the data with Iindex, using the doctype -o command line option to pass the name of the field type file to the indexer.


Searching Numeric Data with Isite

There is no command parser for searching numeric data or dates with the command line Isearch or CGI gateway Isearch-cgi. We'd welcome suggestions for a command syntax, but it has to be easy. Right now, queries on numeric data and dates have to be submitted using Z39.50, since that protocol supports the full range of parameters a user might need to specify.

Spatial queries can, however, be submitted using Isearch. See the command line help for the syntax of the -rect parameter. Currently, Isearch returns a record if there is an overlap between the region specified in the user's query and the bounding box in the data record.
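The overlap test between two bounding rectangles is a standard interval comparison on each axis. The sketch below is not taken from the Isearch source. The struct and function names are hypothetical, rectangles that merely touch are counted as overlapping, and boxes that cross the 180-degree meridian are not handled.

```cpp
// Hypothetical bounding rectangle in decimal degrees.
struct BoundingBox {
    double north;  // northernmost latitude
    double south;  // southernmost latitude
    double east;   // easternmost longitude
    double west;   // westernmost longitude
};

// Returns true if the two rectangles share at least one point.
// Assumes south <= north and west <= east for both boxes.
bool Overlaps(const BoundingBox& a, const BoundingBox& b) {
    bool latitudes_overlap = a.south <= b.north && b.south <= a.north;
    bool longitudes_overlap = a.west <= b.east && b.west <= a.east;
    return latitudes_overlap && longitudes_overlap;
}
```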

Some explanation is required to understand the way the search engine matches dates.

zclient localhost 6668 test 199601[1,31,2,14,4,5]
zclient localhost 6668 test 199601[1,31,2,16,4,5]
zclient localhost 6668 test 199601[1,31,2,18,4,5]
zclient localhost 6668 test "19960101 19961004[1,31,2,16,4,115,5,100]"
zclient localhost 6668 test "90 -90 180 -180[1,3111]"