Configuring Indexes


Indexes are the primary means of locating records in the system, and hence need to be well thought out and specified in advance. Each index consists of one or more paths to tags in the record, together with instructions for how to process the data once it has been located.


Example index configurations:

01  <subConfig id = "xtitle-idx">
02    <objectType>index.SimpleIndex</objectType>
03    <paths>
04      <object type="indexStore" ref="indexStore"/>
05    </paths>
06    <source>
07      <xpath>/ead/eadheader/filedesc/titlestmt/titleproper</xpath>
08      <process>
09        <object type="extractor" ref="SimpleExtractor"/>
10        <object type="normalizer" ref="SpaceNormalizer"/>
11        <object type="normalizer" ref="CaseNormalizer"/>
12      </process>
13    </source>
14    <options>
15      <setting type="sortStore">true</setting>
16    </options>
17  </subConfig>
19  <subConfig id = "stemtitleword-idx">
20    <objectType>index.ProximityIndex</objectType>
21    <paths>
22      <object type="indexStore" ref="indexStore"/>
23    </paths>
24    <source>
25      <xpath>titleproper</xpath>
26      <process>
27        <object type="extractor" ref="SimpleExtractor" />
28        <object type="tokenizer" ref="RegexpFindOffsetTokenizer"/>
29        <object type="tokenMerger" ref="OffsetProxTokenMerger"/>
30        <object type="normalizer" ref="CaseNormalizer"/>
31        <object type="normalizer" ref="PossessiveNormalizer"/>
32        <object type="normalizer" ref="EnglishStemNormalizer"/>
33      </process>
34    </source>
35  </subConfig>

Lines 1 and 2, and likewise lines 19 and 20, should be second nature by now. Line 4, and its counterpart at line 22, is a reference to the indexStore in which the index will be maintained.

This brings us to the source section starting in line 6. It must contain one or more xpath elements. These XPaths will be evaluated against the record to find a node, nodeSet or attribute value. This is the base data that will be indexed after some processing. In the first case, we give the full path, but in the second only the final element. In Cheshire3, it is generally most efficient to give the shortest path that still identifies exactly the elements you want to index, so the path at line 25 is cheaper than the path at line 7.
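
Since a source may hold more than one xpath element, a minimal sketch of that case may help. In the fragment below, both paths feed the same process chain. The second path, /ead/archdesc/did/unittitle, is an assumption chosen purely for illustration and is not part of the configurations above.

```xml
<!-- Illustrative fragment only: the second XPath is an assumed example -->
<source>
  <xpath>/ead/eadheader/filedesc/titlestmt/titleproper</xpath>
  <xpath>/ead/archdesc/did/unittitle</xpath>
  <process>
    <object type="extractor" ref="SimpleExtractor"/>
    <object type="normalizer" ref="SpaceNormalizer"/>
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>
```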

If the records contain XML Namespaces, then there are two approaches available. If the element names are unique between all the namespaces in the document, you can simply omit them. For example /srw:record/dc:title could be written as just /record/title. The alternative is to define the meanings of 'srw' and 'dc' on the xpath element in the normal xmlns fashion.
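
As a rough sketch of the second approach, the prefixes can be declared directly on the xpath element. The namespace URIs shown here are assumptions (the commonly used SRW and Dublin Core ones) and must be replaced with whatever URIs your records actually declare.

```xml
<!-- Illustrative fragment only: the namespace URIs are assumptions
     and must match the ones declared in your own records -->
<source>
  <xpath xmlns:srw="http://www.loc.gov/zing/srw/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">/srw:record/dc:title</xpath>
  <process>
    <object type="extractor" ref="SimpleExtractor"/>
    <object type="normalizer" ref="SpaceNormalizer"/>
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>
```

The prefixes themselves are arbitrary; only the URIs they are bound to need to match the record.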

After the XPath(s), we need to tell the system how to process the data that gets pulled out. This happens in the process section, which is a list of objects through which the data is fed in sequence. The first object must be an extractor. This may be followed by a Tokenizer and a TokenMerger, which split the extracted data into tokens of a particular type and then merge them into discrete index entries. If a Tokenizer is used, a TokenMerger must also be used. Generally any further processing objects in the chain are normalizers.

The first index uses the SimpleExtractor to pull out the text exactly as it appears, as a single term. This is followed by a SpaceNormalizer on line 10, to remove leading and trailing whitespace and normalize multiple adjacent whitespace characters (e.g. newlines followed by tabs, spaces etc.) into single spaces. The second index also uses the SimpleExtractor, but it then uses a RegexpFindOffsetTokenizer to identify word tokens, their positions and character offsets. It then uses the necessary OffsetProxTokenMerger to merge identical tokens into discrete index entries, maintaining the word positions and character offsets identified by the Tokenizer. Both indexes then send the extracted terms to a CaseNormalizer, which will reduce all characters to lowercase. The second index then gives the lowercase terms to a PossessiveNormalizer to strip off 's and s' from the end, and then to an EnglishStemNormalizer to apply linguistic stemming.

After these processes have happened, the system will store the transformed terms in the indexStore referenced in the paths section.

Finally, in the first example, we have a setting called 'sortStore'. If this is given, then the system will create a map from record to term for the index, so that a record's term can be retrieved quickly for the purposes of sorting.