Configuring Stores


There are several types of Store object, but for now we're mainly concerned with RecordStores and IndexStores. DocumentStores are practically identical to RecordStores in terms of configuration, so we'll talk about the two together.

Database-specific stores are included in the subConfigs section of a database configuration file.


Example store configurations:

01  <subConfig type="recordStore" id="eadRecordStore">
02    <objectType>recordStore.BdbRecordStore</objectType>
03    <paths>
04      <path type="databasePath">recordStore.bdb</path>
05      <object type="idNormalizer" ref="StringIntNormalizer"/>
06    </paths>
07    <options>
08      <setting type="digest">sha</setting>
09    </options>
10  </subConfig>
12  <subConfig type="indexStore" id="eadIndexStore">
13    <objectType>indexStore.BdbIndexStore</objectType>
14    <paths>
15      <path type="defaultPath">indexes</path>
16      <path type="tempPath">temp</path>
17      <path type="recordStoreHash">eadRecordStore</path>
18    </paths>
19  </subConfig>


Line 1 starts a new recordStore configuration for an object called 'eadRecordStore', and the following line declares that it should be instantiated using the recordStore.BdbRecordStore class. Several store classes are distributed with Cheshire3; the other main one is PostgresRecordStore, which maintains the data and associated metadata in a PostgreSQL relational database. The default is the much faster Berkeley DB store.

Then we have two fields wrapped in the paths section. Line 4 gives the filename of the database to use, in this case 'recordStore.bdb'. Remember that this will be relative to the current defaultPath. Line 5 has a reference to a Normalizer object -- this is used to turn the record identifiers into something appropriate for the underlying storage system. In this case, it turns integers into strings, as Berkeley DB only has string keys. It's safest to leave this setting unchanged.

Then in line 8, we have a setting called 'digest'. This configures the recordStore to maintain a checksum for each record, which it uses to ensure that the record remains unique within the store. There are two checksum algorithms available at the moment, 'sha' and 'md5'. If the setting is left out, the store will be slightly faster, but will allow (potentially inadvertent) duplicate records.
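
As a small illustration of the alternative, the options section could select the other available algorithm. This is only a sketch of the one fragment that changes; the rest of the recordStore configuration is assumed to stay as in the example above.

    <options>
      <!-- 'md5' is the other checksum algorithm named above -->
      <setting type="digest">md5</setting>
    </options>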

For documentStores, all we would change is the identifier and the class; everything else can remain the same. DocumentStores do, however, have some additional objects that can be referenced in the paths section: 'inWorkflow', 'inPreParser', 'outWorkflow' and 'outPreParser'. These are objects that are called to transform the document as it comes into the store (the 'in' objects) or as it is fetched out of it (the 'out' objects).
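
A documentStore configuration along these lines might look like the following sketch. It is not taken from a shipped configuration: the class name documentStore.BdbDocumentStore is assumed by analogy with recordStore.BdbRecordStore, and the identifiers 'eadDocumentStore' and 'eadCleanupWorkflow' and the filename 'documentStore.bdb' are hypothetical. The workflow would need to be configured elsewhere under that identifier.

    <subConfig type="documentStore" id="eadDocumentStore">
      <!-- class name assumed by analogy with the recordStore example -->
      <objectType>documentStore.BdbDocumentStore</objectType>
      <paths>
        <path type="databasePath">documentStore.bdb</path>
        <object type="idNormalizer" ref="StringIntNormalizer"/>
        <!-- hypothetical workflow, applied to each document as it comes into the store -->
        <object type="inWorkflow" ref="eadCleanupWorkflow"/>
      </paths>
    </subConfig>

An 'outWorkflow', 'inPreParser' or 'outPreParser' reference would presumably be added in the same way, as a further object element in the paths section.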

At line 12 we start configuring an indexStore called 'eadIndexStore', and, as always, the following line gives the class used to instantiate it.

In the paths section for this object we have three path elements. The first, at line 15, we've seen before -- it adds another link in the defaultPath. Line 16 gives a path in which to store temporary files. During indexing, the store creates a couple of temporary files per index and, for efficiency, loads them into the database only at the end, after some pre-processing. Line 17 then has a path of type 'recordStoreHash'. This is a space-separated list of the recordStores that contain records that will appear in the indexes. Normally there will be just the one store, but for larger or composite databases it may be appropriate to split the data into multiple stores.
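
For a database split across two recordStores, the indexStore configuration might be adjusted as in this sketch. The second identifier, 'eadComponentStore', is hypothetical and stands for an additional recordStore configured in the same way as 'eadRecordStore'.

    <subConfig type="indexStore" id="eadIndexStore">
      <objectType>indexStore.BdbIndexStore</objectType>
      <paths>
        <path type="defaultPath">indexes</path>
        <path type="tempPath">temp</path>
        <!-- space-separated list; 'eadComponentStore' is a hypothetical second store -->
        <path type="recordStoreHash">eadRecordStore eadComponentStore</path>
      </paths>
    </subConfig>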