Cheshire3 Configuration: Index

Introduction

Indexes need to be configured to know where to find the data that they should extract, how to process it once it's extracted and where to store it once processed.

Example 1
<subConfig type="index" id="zrx-idx-9">
  <objectType>index.ProximityIndex</objectType>
  <paths>
    <object type="indexStore" ref="zrxIndexStore"/>
  </paths>
  <source>
    <preprocess>
       <object type="transformer" ref="zeerexTxr"/>
       <object type="parser" ref="SaxParser"/>
    </preprocess>
    <xpath>name/value</xpath>
    <xpath xmlns:zrx="http://explain.z3950.org/dtd/2.0">zrx:name/zrx:value</xpath>
    <process>
      <object type="extractor" ref="ExactParentProximityExtractor"/>
      <object type="normalizer" ref="CaseNormalizer"/>
    </process>
  </source>
  <options>
    <setting type="sortStore">1</setting>
    <setting type="lr_constant0">-3.7</setting>
  </options>
</subConfig>
            
Example 2
<subConfig type="XPathProcessor" id="indexXPath">
  <objectType>xpathProcessor.SimpleXPathProcessor</objectType>
  <source>
    <xpath>/explain/indexInfo/index/title</xpath>
    <xpath>/explain/indexInfo/index/description</xpath>
  </source>
</subConfig>

<subConfig type="index" id="zrx-idx-10">
  <objectType>index.ProximityIndex</objectType>
  <paths>
    <object type="indexStore" ref="zrxIndexStore"/>
  </paths> 
  <source mode="data">
    <xpath ref="indexXPath"/>
    <process>
      <object type="extractor" ref="ProximityExtractor"/>
      <object type="normalizer" ref="CaseNormalizer"/>
      <object type="normalizer" ref="PossessiveNormalizer"/>
    </process>
  </source>
  <source mode="any|all|=">
    <process>
      <object type="extractor" ref="PreserveMaskingProximityExtractor"/>
      <object type="normalizer" ref="CaseNormalizer"/>
      <object type="normalizer" ref="PossessiveNormalizer"/>
    </process>
  </source> 
</subConfig>
            
            
<source>

An index configuration must contain at least one source element. Each source block configures a way of treating the data that the index is asked to process.

Note that the index object is asked to process incoming search terms as well as data from records being indexed. A source element may have a 'mode' attribute to specify when the processing configured within that source block should be applied. The 'mode' attribute may take the name of any relation defined by CQL (any, all, =, exact, etc.), indicating that the processing in this source should be applied when the index is searched using that relation.

The 'mode' attribute may also have the value 'data', indicating that the processing in the source block should be applied to records at the time they are indexed. Multiple modes can be specified for a single source block by separating them with a vertical pipe (|) character within the value of the 'mode' attribute. If no 'mode' attribute is specified, the source defaults to being a 'data' source. Example 2 demonstrates the use of the 'mode' attribute to apply a different Extractor object when carrying out searches using the 'any', 'all' or '=' CQL relations, in this case to preserve masking/wildcard characters.
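
As a hedged sketch of the 'mode' values described above, the fragment below shows a source with no 'mode' attribute (treated as 'data'), a source for a single relation and a source shared by two relations. The object references are reused from the examples on this page. Which extractor is paired with which relation is an assumption made purely for illustration.

```xml
<!-- Illustrative fragment: these <source> blocks would sit inside an index <subConfig> -->

<!-- No mode attribute, so this block is treated as mode="data" -->
<source>
  <xpath>/explain/indexInfo/index/title</xpath>
  <process>
    <object type="extractor" ref="ProximityExtractor"/>
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>

<!-- Applied only when the index is searched with the CQL 'exact' relation -->
<source mode="exact">
  <process>
    <!-- extractor choice here is hypothetical -->
    <object type="extractor" ref="ProximityExtractor"/>
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>

<!-- One block shared by two relations, separated by a vertical pipe -->
<source mode="any|all">
  <process>
    <object type="extractor" ref="PreserveMaskingProximityExtractor"/>
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>
```

The same normalizer is kept in every block so that, in this sketch, search terms are normalized in the same way as the indexed data.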

Each data mode source block configures one or more XPaths used to extract data from the record, a workflow of objects to process the results of the XPath evaluation and, optionally, a workflow of objects to preprocess the record into a state suitable for XPath evaluation. The system processes each data mode source block in turn for each record during indexing.

For source blocks with modes other than data, only the <process> element, which configures the workflow of objects used to process the incoming term, is required. Any <xpath> and <preprocess> elements will be ignored.

<xpath>

This element contains either an XPath expression (Example 1) or a reference to a configured XPathProcessor (Example 2), to use in extracting data from a record.

The element may be repeated when it contains an XPath expression, but not when it refers to a configured XPathProcessor, since an XPathProcessor may itself specify multiple XPath expressions. When the element is repeated, the results of every expression are passed through the same process chain (as described below).
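
The two forms can be set side by side in a minimal sketch. It reuses the expressions and the 'indexXPath' identifier from Example 2, and assumes that the referenced XPathProcessor is configured with the same two expressions, so that both blocks would select the same data.

```xml
<!-- Form 1: inline expressions; <xpath> may be repeated -->
<source>
  <xpath>/explain/indexInfo/index/title</xpath>
  <xpath>/explain/indexInfo/index/description</xpath>
  <process>
    <object type="extractor" ref="ProximityExtractor"/>
  </process>
</source>

<!-- Form 2: a single reference to a configured XPathProcessor; not repeated -->
<source>
  <xpath ref="indexXPath"/>
  <process>
    <object type="extractor" ref="ProximityExtractor"/>
  </process>
</source>
```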

If the XPath makes use of XML namespaces, then the mappings for the namespace prefixes must be declared on the <xpath> element itself. This can be seen in Example 1.
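
A hedged sketch of an expression that uses two prefixes is shown here. The 'zrx' mapping is taken from Example 1. The 'dc' prefix (bound to the Dublin Core elements namespace) and the path itself are hypothetical, chosen only to show that every prefix used in the expression is declared on the same <xpath> element.

```xml
<source>
  <!-- Both prefixes used in the expression are declared on this element -->
  <xpath xmlns:zrx="http://explain.z3950.org/dtd/2.0"
         xmlns:dc="http://purl.org/dc/elements/1.1/">zrx:name/dc:title</xpath>
  <process>
    <object type="extractor" ref="ProximityExtractor"/>
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>
```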

<process> and <preprocess>

These elements contain an ordered list of objects. The result of the first object is passed to the second, and so on down the chain.

The first object in a process chain must be an Extractor, as the input data is either a string, a DOM node or a SAX event list, as appropriate to the XPath evaluation. The result of a process chain must be a hash, typically from an Extractor or a Normalizer. However, if the last object is an IndexStore, it will be used to store the terms rather than the default.

The input to a preprocess chain is a Record, so the first object is most likely to be a Transformer. The result must also be a Record, so the last object is most likely to be a Parser.
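
The flow of data through the two chains can be sketched by annotating the source block from Example 1. The comments restate the input and output types described above. The form of the intermediate value between the Transformer and the Parser is not stated on this page and is left unspecified.

```xml
<source>
  <preprocess>
    <!-- Chain input: the Record being indexed -->
    <object type="transformer" ref="zeerexTxr"/>
    <!-- Last object is a Parser, so the chain output is again a Record -->
    <object type="parser" ref="SaxParser"/>
  </preprocess>

  <!-- Evaluated against the Record produced by the preprocess chain -->
  <xpath>name/value</xpath>

  <process>
    <!-- First object: receives the XPath result (string, DOM node or SAX event list) -->
    <object type="extractor" ref="ExactParentProximityExtractor"/>
    <!-- Subsequent objects: receive and return a hash of terms -->
    <object type="normalizer" ref="CaseNormalizer"/>
  </process>
</source>
```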

For existing processing objects that can be used in these fields, see the object documentation.

Paths
Settings

The value for any true/false type settings must be 0 or 1.
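
As a minimal sketch of this rule, using the setting names that appear in Example 1 (the values are illustrative):

```xml
<options>
  <!-- true/false settings are written as 1 or 0 -->
  <setting type="sortStore">1</setting>
  <!-- other settings carry their own kind of value, here a number -->
  <setting type="lr_constant0">-3.7</setting>
</options>
```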