Configuring Extractors, Tokenizers, TokenMergers and Normalizers

Introduction

Extractors locate and extract data from either a string, a DOM node tree, or a list of SAX events. An Extractor must be the first object in an index's process workflow. Tokenizers may then be used to split the extracted data into tokens. If a Tokenizer is used, it must be followed by a TokenMerger. Normalizers may then be used to process the terms into a standard form for storing in an index.
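
For orientation, here is a sketch of how these objects are chained together within an index's configuration. The index id, xpath and object references below are illustrative, assuming objects built by the default server configuration:

01 <subConfig type="index" id="idx-example">
02   <objectType>index.SimpleIndex</objectType>
03   <paths>
04     <object type="indexStore" ref="indexStore"/>
05   </paths>
06   <source>
07     <xpath>title</xpath>
08     <process>
09       <object type="extractor" ref="SimpleExtractor"/>
10       <object type="tokenizer" ref="RegexpFindTokenizer"/>
11       <object type="tokenMerger" ref="SimpleTokenMerger"/>
12       <object type="normalizer" ref="CaseNormalizer"/>
13     </process>
14   </source>
15 </subConfig>

The objects in the process chain are applied in the order given: the extractor first, then the tokenizer, the tokenMerger, and finally any normalizers.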

Unless you're using a new extractor or normalizer class, these objects should all be built by the default server configuration, but for completeness we'll go through their configuration below.

Example: Extractors

Example extractor configurations:

01 <subConfig type="extractor" id="SimpleExtractor">
02   <objectType>extractor.SimpleExtractor</objectType>
03 </subConfig>
04
05 <subConfig type="extractor" id="ProxExtractor">
06   <objectType>extractor.SimpleExtractor</objectType>
07   <options>
08     <setting type="prox">1</setting>
09     <setting type="reversable">1</setting>
10   </options>
11 </subConfig>

Explanation: Extractors

There's not much to say about the first subConfig, for SimpleExtractor. It does just one thing and has no paths or settings to configure.

The second subConfig, for ProxExtractor, is a little more complex. Firstly, at line 8 it has the setting "prox", which tells the extractor to maintain which element in the record the data was extracted from. This is important if you want to be able to conduct proximity, adjacency or phrase searches on the extracted data later. The second setting, "reversable" on line 9, tells the extractor to maintain this location information in such a way that it can later be used to identify where the data originally came from. If this setting is not set, or is set to 0, the location will be maintained in such a way that it is only possible to tell whether two pieces of data were extracted from the same element.
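
To make use of the maintained proximity information at search time, a ProxExtractor would typically feed a proximity-aware index. The following is a sketch only; it assumes the ProximityIndex and ProximityTokenMerger classes from the distribution, and the index id and xpath are illustrative:

01 <subConfig type="index" id="idx-text-prox">
02   <objectType>index.ProximityIndex</objectType>
03   <paths>
04     <object type="indexStore" ref="indexStore"/>
05   </paths>
06   <source>
07     <xpath>text</xpath>
08     <process>
09       <object type="extractor" ref="ProxExtractor"/>
10       <object type="tokenizer" ref="RegexpFindTokenizer"/>
11       <object type="tokenMerger" ref="ProximityTokenMerger"/>
12       <object type="normalizer" ref="CaseNormalizer"/>
13     </process>
14   </source>
15 </subConfig>

The proximity-aware tokenMerger is what carries the word positions maintained by the prox setting through to the index.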

Some of the currently available extractors:

Example: Tokenizers

[ Coming soon ]
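
In the meantime, a minimal sketch of a tokenizer subConfig, assuming the RegexpFindTokenizer class (which splits the extracted string into word tokens using a regular expression):

01 <subConfig type="tokenizer" id="RegexpFindTokenizer">
02   <objectType>tokenizer.RegexpFindTokenizer</objectType>
03 </subConfig>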

Explanation: Tokenizers

Some of the currently available tokenizers:

Example: TokenMergers

[ Coming soon ]
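
In the meantime, a minimal sketch of a tokenMerger subConfig, assuming the SimpleTokenMerger class (which merges the tokens produced by a tokenizer back into individual terms for indexing):

01 <subConfig type="tokenMerger" id="SimpleTokenMerger">
02   <objectType>tokenMerger.SimpleTokenMerger</objectType>
03 </subConfig>

Where proximity information is being maintained (see the ProxExtractor example above), a proximity-aware merger such as ProximityTokenMerger would be used instead.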

Explanation: TokenMergers

Some of the currently available tokenMergers:

Example: Normalizers

Example normalizer configurations:

01 <subConfig type="normalizer" id="CaseNormalizer">
02   <objectType>normalizer.CaseNormalizer</objectType>
03 </subConfig>
04
05 <subConfig type="normalizer" id="StoplistNormalizer">
06   <objectType>normalizer.StoplistNormalizer</objectType>
07   <paths>
08     <path type="stoplist">stopwords.txt</path>
09   </paths>
10 </subConfig>
        
Explanation: Normalizers

Normalizers usually do just one pre-defined job, so there aren't many options or paths to set.

The second example (lines 5-10) is a rare exception. This is a StoplistNormalizer, which requires a path of type 'stoplist' (line 8). The stoplist file should have one word per line; the normalizer will remove all occurrences of these words from the data.
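
For illustration, the first few lines of such a stopwords.txt file might look like this (the words are examples only):

a
an
and
the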

Some of the currently available normalizers: