Build Script

Introduction

This is a sample, fairly straightforward script that builds a database from a single file containing XML documents. We go through it section by section and explain how things work. Stylistically, the Python code could be slightly improved, but it is easy to understand. It can be used as a template for other scripts, or as a starting point for more complicated versions.

Python Environment (1 - 5)
import sys, os                                                          #  1
                                                                        #  2
from cheshire3.baseObjects import Session                               #  3
from cheshire3.internal import cheshire3Root                            #  4
from cheshire3.server import SimpleServer                               #  5
                                                                        #  6
            

The first thing to do in any script is to set up Python so that you can use the various Cheshire3 objects. Line 1 imports the standard sys and os modules. Lines 3 and 5 import the two Cheshire3 classes that we use directly, Session and SimpleServer. Line 4 imports cheshire3Root, which line 9 uses as the base directory when locating the server configuration file.
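
If the imports on lines 3 - 5 fail, nothing else in the script can run, so it can help to check for the package up front. The following is a minimal sketch using only the standard library. The helper name cheshire3_available is an assumption made for illustration and is not part of Cheshire3.

```python
import importlib


def cheshire3_available():
    """Return True if the cheshire3 package can be imported, else False."""
    try:
        importlib.import_module('cheshire3')
    except ImportError:
        # Hypothetical handling: the caller decides what to do next.
        return False
    return True
```

A build script could call this helper before line 3 and print a readable message instead of a traceback when the package is missing.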

Cheshire3 Environment (7 - 18)
# Build environment...                                                  #  7
session = Session()                                                     #  8
servConfig = os.path.join(cheshire3Root, 'configs', 'serverConfig.xml') #  9
serv = SimpleServer(session, servConfig)                                # 10
db = serv.get_object(session, 'db_tei')                                 # 11
docFac = db.get_object(session, 'defaultDocumentFactory')               # 12
docParser = db.get_object(session, 'TeiParser')                         # 13
recStore = db.get_object(session, 'TeiRecordStore')                     # 14
                                                                        # 15

            

Next we need to set up the Cheshire3 environment that has been configured. A session is created first (line 8). The server is then built (line 10) by giving it the path to a configuration file (line 9). The database is retrieved from the server using its identifier (line 11). From there, the other objects we need, the documentFactory, parser, and recordStore, are retrieved from the database by their identifiers (lines 12 - 14).
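
Line 9 only builds a path, so a wrong root directory does not show up until the server is constructed. As a sketch of one way to fail earlier with a clearer message, the helpers below build the same path and check that the file exists. The function names server_config_path and check_server_config are assumptions for illustration, not Cheshire3 functions.

```python
import os


def server_config_path(root):
    """Build the path to serverConfig.xml beneath the given root directory."""
    return os.path.join(root, 'configs', 'serverConfig.xml')


def check_server_config(root):
    """Return the config path, or raise IOError if the file is missing."""
    path = server_config_path(root)
    if not os.path.isfile(path):
        raise IOError('server configuration not found: %s' % path)
    return path
```

In the build script, the value returned by check_server_config(cheshire3Root) could be passed to SimpleServer in place of servConfig.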

# Load some data                                                        # 16
docFac.load(session, "tei_files.xml", cache=2, tagName='tei')           # 17
                                                                        # 18
            

In order to store and index records, we need to have them in a processable form. Line 17 loads a file named 'tei_files.xml', which contains a number of discrete XML documents. The 'cache' argument with value 2 tells the documentFactory to store all located documents in memory until they're needed. The 'tagName' argument tells the documentFactory to look for documents contained within <tei> tags.
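
To make the idea of "documents contained within <tei> tags" concrete, here is a small standard-library sketch that pulls each <tei> element out of one larger string. It is only a stand-in to show the shape of the input file. It is not how the documentFactory is implemented, and the function name split_documents is hypothetical. It does not handle nested <tei> elements or self-closing <tei/> tags.

```python
import re


def split_documents(text, tag_name):
    """Yield each <tag_name>...</tag_name> span found in text, in order."""
    name = re.escape(tag_name)
    # [\s>] stops <tei> from also matching longer names such as <teiHeader>.
    pattern = re.compile(r'<%s[\s>].*?</%s>' % (name, name), re.DOTALL)
    for match in pattern.finditer(text):
        yield match.group(0)


# A made-up sample standing in for the contents of tei_files.xml.
sample = (
    '<collection>\n'
    '<tei><title>First</title></tei>\n'
    '<tei id="b"><teiHeader/><title>Second</title></tei>\n'
    '</collection>'
)
documents = list(split_documents(sample, 'tei'))
```

Each string in documents corresponds to one document that the loop starting on line 21 would receive, one at a time.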

Load and Index (19 - 29)
db.begin_indexing(session)                                              # 19
recStore.begin_storing(session)                                         # 20
for doc in docFac:                                                      # 21
    try:                                                                # 22
        rec = docParser.process_document(session, doc)                  # 23
    except Exception:                                                   # 24
        print(doc.get_raw(session))                                     # 25
        sys.exit()                                                      # 26
    id = recStore.create_record(session, rec)                           # 27
    db.add_record(session, rec)                                         # 28
    db.index_record(session, rec)                                       # 29
                                                                        # 30

            

First (line 19) we need to tell the database that we're going to be indexing some data. This lets the system handle all of the loading in one go at the end (line 33) and store only temporary information until then. Likewise, line 20 tells the recordStore that data is about to come in. The recordStore is closed again at line 31.

Then we step through each document in the loaded documentFactory (21). Parsing (23) the record from the raw XML should always happen in a try: (22) block so that if the XML isn't well-formed, you can do something sensible with it. The 'sensible' thing in this case is to print it to the screen and then exit the script (25 - 26).

Once we have a record, we need to store it in the recordStore (line 27). When we do this, the identifier assigned to the record by the recordStore is returned - we can assign this to a variable, and use it later if necessary. Then we add it to the database (28) [recall that records may be in more than one database] and then index it (29).
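
Exiting on the first bad document is one choice. Another is to set failures aside and carry on with the rest. The sketch below shows that pattern with the standard library's xml.etree.ElementTree standing in for the Cheshire3 parser, so the exception type ET.ParseError and the function name parse_all are assumptions of this example rather than anything the script above uses.

```python
import xml.etree.ElementTree as ET


def parse_all(raw_docs):
    """Parse each XML string; return (parsed elements, failures)."""
    parsed, failed = [], []
    for raw in raw_docs:
        try:
            parsed.append(ET.fromstring(raw))
        except ET.ParseError as err:
            # Keep the raw text and the reason so it can be reported later.
            failed.append((raw, str(err)))
    return parsed, failed


# The second sample is deliberately not well-formed (<a> is never closed).
parsed, failed = parse_all(['<tei><a/></tei>', '<tei><a></tei>'])
```

With this approach the good documents are still stored and indexed, and the list of failures can be printed once the loop has finished.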

Committing to persistent storage (31 - 33)
recStore.commit_storing(session)                                        # 31
db.commit_metadata(session)                                             # 32
db.commit_indexing(session)                                             # 33

            

Because we're not going to add any more records, we can close the recordStore (line 31), ensuring that any records are flushed to disk, rather than being kept in memory. We also need to commit the metadata (32) about the database (such as the newly added records) to disk and then finally we commit the indexing (line 33).
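
Each begin call (lines 19 and 20) has a matching commit call (lines 33 and 31). One way to keep such pairs together is a small context manager, sketched here for the recordStore pair. The name storing and the RecordingStore class are hypothetical and exist only so the example runs without Cheshire3. Only the method names begin_storing and commit_storing come from the script above.

```python
from contextlib import contextmanager


@contextmanager
def storing(store, session):
    """Call begin_storing on entry and commit_storing on a clean exit."""
    store.begin_storing(session)
    yield store
    # Reached only when the with-block finished without raising.
    store.commit_storing(session)


class RecordingStore(object):
    """Hypothetical stand-in that just records which methods were called."""

    def __init__(self):
        self.calls = []

    def begin_storing(self, session):
        self.calls.append('begin_storing')

    def commit_storing(self, session):
        self.calls.append('commit_storing')


store = RecordingStore()
with storing(store, session=None):
    store.calls.append('create_record')
```

In this sketch the commit is skipped if the body raises, which mirrors the script above: when parsing fails and the script exits at line 26, line 31 is never reached.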

Complete Example
import sys, os                                                          #  1
                                                                        #  2
from cheshire3.baseObjects import Session                               #  3
from cheshire3.internal import cheshire3Root                            #  4
from cheshire3.server import SimpleServer                               #  5
                                                                        #  6
# Build environment...                                                  #  7
session = Session()                                                     #  8
servConfig = os.path.join(cheshire3Root, 'configs', 'serverConfig.xml') #  9
serv = SimpleServer(session, servConfig)                                # 10
db = serv.get_object(session, 'db_tei')                                 # 11
docFac = db.get_object(session, 'defaultDocumentFactory')               # 12
docParser = db.get_object(session, 'TeiParser')                         # 13
recStore = db.get_object(session, 'TeiRecordStore')                     # 14
                                                                        # 15
# Load some data                                                        # 16
docFac.load(session, "tei_files.xml", cache=2, tagName='tei')           # 17
                                                                        # 18
db.begin_indexing(session)                                              # 19
recStore.begin_storing(session)                                         # 20
for doc in docFac:                                                      # 21
    try:                                                                # 22
        rec = docParser.process_document(session, doc)                  # 23
    except Exception:                                                   # 24
        print(doc.get_raw(session))                                     # 25
        sys.exit()                                                      # 26
    id = recStore.create_record(session, rec)                           # 27
    db.add_record(session, rec)                                         # 28
    db.index_record(session, rec)                                       # 29
                                                                        # 30
recStore.commit_storing(session)                                        # 31
db.commit_metadata(session)                                             # 32
db.commit_indexing(session)                                             # 33