Cheshire3 Objects: DocumentFactory

Description

DocumentFactories are the main means by which Documents are ingested into the system. Once the 'load' method has been called, a DocumentFactory should be able to return, on request, one or more Documents. How it does this depends on how it has been configured and how 'load' was called. For example, it may locate all documents and cache them internally (e.g. for multiple XML documents within a single file), or it may crawl, locating and returning the documents one at a time (e.g. for many large files in a directory structure).
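The difference between the two strategies can be sketched in plain Python. This is only an illustration of the idea and does not use Cheshire3: the function names load_cached and load_crawling, and the blank-line separator, are assumptions made for the example.

```python
import os
import tempfile
from typing import Iterator, List


def load_cached(data: str, separator: str = "\n\n") -> List[str]:
    """Eager strategy: split one blob into documents and keep them all in memory."""
    return [chunk for chunk in data.split(separator) if chunk.strip()]


def load_crawling(top: str) -> Iterator[str]:
    """Lazy strategy: walk a directory tree and read a file only when it is requested."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                yield fh.read()


# Eager: every document is available as soon as the call returns.
cached = load_cached("first doc\n\nsecond doc\n\nthird doc")

# Lazy: nothing is read from disk until the generator is advanced.
with tempfile.TemporaryDirectory() as top:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(top, name), "w", encoding="utf-8") as fh:
            fh.write("contents of " + name)
    crawled = list(load_crawling(top))
```

The eager form suits a single file holding many small documents; the lazy form keeps memory use flat when the source is a large directory tree.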

Implementations

The following implementations are pre-configured and ready to use.
They may be used out-of-the-box in configurations for Workflows, or in code by getting the object from a Server.

The DocumentFactory will try to guess the format based on the data argument passed to it. If you know the format, however, you can tell the DocumentFactory by using the format keyword argument, e.g.

documentFactory.load(session, "/home/user/data/", format="dir")
documentFactory.load(session, "/home/user/data.zip", format="zip")
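To illustrate what "guessing the format" might involve, here is a minimal sketch that maps a data argument to one of the short names used in this document. The function guess_format and its detection rules are hypothetical; they are not the rules Cheshire3 itself applies.

```python
import os
import tarfile
import tempfile
import zipfile


def guess_format(data, format=None):
    """Return a format short name; an explicitly supplied format always wins.

    The detection rules below are illustrative assumptions only.
    """
    if format is not None:
        return format
    if os.path.isdir(data):
        return "dir"
    if os.path.isfile(data):
        if zipfile.is_zipfile(data):
            return "zip"
        if tarfile.is_tarfile(data):
            return "tar"
        return "file"
    if data.lstrip().startswith("<"):
        return "xml"
    raise ValueError("cannot guess a format for %r" % (data,))


with tempfile.TemporaryDirectory() as top:
    archive = os.path.join(top, "data.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("one.xml", "<doc>1</doc>")
    guesses = [
        guess_format(top),                     # an existing directory
        guess_format(archive),                 # a zip archive on disk
        guess_format("<doc>1</doc>"),          # raw XML data
        guess_format(archive, format="file"),  # explicit format overrides the guess
    ]
```

Supplying the format explicitly avoids both the cost of detection and the risk of a wrong guess.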

A DocumentFactory will use an appropriate DocumentStream to deal with each format. Part of what makes DocumentFactories 'smart' is that DocumentStreams can be called recursively. For example, you could call 'load' on a directory containing a number of zip files, each of which is made up of a number of XML files. The DocumentFactory would use a DirectoryDocumentStream, a ZipDocumentStream, a FileDocumentStream and an XmlDocumentStream in turn to find and return XML Documents.
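The recursion described above can be sketched without Cheshire3 as a single generator that dispatches on the format and calls itself for nested containers. The function iter_documents and the default tag name "record" are assumptions for the example; for brevity, zip members are decoded and handed straight to the XML step rather than passing through a separate file step.

```python
import os
import re
import tempfile
import zipfile
from typing import Iterator


def iter_documents(data: str, fmt: str, tag_name: str = "record") -> Iterator[str]:
    """Yield XML documents, recursing through directories and zip archives."""
    if fmt == "dir":
        for dirpath, dirnames, filenames in os.walk(data):
            dirnames.sort()
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                inner = "zip" if zipfile.is_zipfile(path) else "file"
                yield from iter_documents(path, inner, tag_name)
    elif fmt == "zip":
        with zipfile.ZipFile(data) as archive:
            for member in archive.namelist():
                text = archive.read(member).decode("utf-8")
                yield from iter_documents(text, "xml", tag_name)
    elif fmt == "file":
        with open(data, encoding="utf-8") as fh:
            yield from iter_documents(fh.read(), "xml", tag_name)
    elif fmt == "xml":
        # Naive matching: does not handle nested elements of the same name.
        tag = re.escape(tag_name)
        for match in re.finditer(r"<%s\b.*?</%s>" % (tag, tag), data, re.S):
            yield match.group(0)
    else:
        raise ValueError("unsupported format: %r" % (fmt,))


with tempfile.TemporaryDirectory() as top:
    with zipfile.ZipFile(os.path.join(top, "batch.zip"), "w") as zf:
        zf.writestr("a.xml", "<record>1</record><record>2</record>")
        zf.writestr("b.xml", "<record>3</record>")
    with open(os.path.join(top, "loose.xml"), "w", encoding="utf-8") as fh:
        fh.write("<record>4</record>")
    found = list(iter_documents(top, "dir"))
```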

At the present time, the following formats are supported by defaultDocumentFactory and defaultAccumulatingDocumentFactory.
Note Well: DocumentStreams are only intended for use by DocumentFactories, and are unlikely to behave correctly if called directly by users' scripts.

Short name | DocumentStream used | Description
xml | XmlDocumentStream | Given data, finds XML instances within it and treats each as a Document. By default the DocumentFactory will use the first tag that it encounters as the basis of all future Documents, but if you know the name of the tag to use, you can supply it with the tagName keyword argument, e.g. documentFactory.load(session, "/home/user/myFile.xml", format="xml", tagName="myTag")
marc | MarcDocumentStream | Given data containing MARC records, treats each MARC record as a Document (see also the docs for MarcParser and MarcRecord).
dir | DirectoryDocumentStream | Given a directory name, walks through all files and sub-directories within it looking for Documents.
tar | TarDocumentStream | Given the data which makes up a tar file, extracts the files from it as Documents.
zip | ZipDocumentStream | Given the data which makes up a zip file, extracts the files from it as Documents.
cluster | ClusterDocumentStream | Given the path to a raw cluster data file (as created by a ClusterExtractionDocumentFactory), merges the data and creates Documents.
locate | LocateDocumentStream | Given a name or pattern, locates files whose names match.
component | ComponentDocumentStream | Given a Record, finds component Documents using a configured Selector.
termHash | TermHashDocumentStream | Given data consisting of a hash of terms, treats each term as a Document.
file | FileDocumentStream | Given the path to a file, opens it and reads the contents.
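The tagName behaviour described for the xml format (use the first tag encountered unless one is supplied) can be sketched as follows. The function split_xml is hypothetical and uses naive regular-expression matching, so it does not handle nested elements of the same name; it is not Cheshire3's implementation.

```python
import re


def split_xml(data, tag_name=None):
    """Split data into XML documents rooted at tag_name.

    If tag_name is not given, the first element tag found in the data is used.
    """
    if tag_name is None:
        # Skips '<?xml ...?>' and '<!-- -->' because '?' and '!' cannot start a name.
        first = re.search(r"<([A-Za-z_][\w.\-]*)", data)
        if first is None:
            return []
        tag_name = first.group(1)
    tag = re.escape(tag_name)
    pattern = re.compile(r"<%s\b.*?</%s>" % (tag, tag), re.S)
    return [match.group(0) for match in pattern.finditer(data)]


data = '<?xml version="1.0"?><rec><title>A</title></rec><rec><title>B</title></rec>'

by_first_tag = split_xml(data)                 # falls back to 'rec'
by_named_tag = split_xml(data, tag_name="title")
```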

API

Module: cheshire3.documentFactory
Classes:

DocumentFactory Methods:

Function | Parameters | Returns | Description
__init__ | session, config, parent | | The constructor takes the config node for the object, and its parent (usually a database).
load | session, ?data, ?cache, ?format, ?tagName, ?codec | | Load the data provided (or use the configured default if not provided). The way the data is loaded depends on the other parameters (or their configured defaults if absent):
  • cache - should documents be cached in memory: 0 = No, 1 = Locations cached but not documents, 2 = Yes
  • format - specifies how to treat the data parameter
  • tagName - the XML tag to treat as the document root
  • codec - specifies the codec to use to read the documents in
get_document | session, ?index | Document | Return the index'th document in the factory if index is provided, otherwise return the next document.
register_stream | session, format, class | | Class method to register the supplied class of DocumentStream with the document factory for the given format. This class will be used the next time 'load' is called with this format.
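How register_stream, load and get_document fit together can be sketched with stand-in classes. SketchFactory and LineStream below are hypothetical and do not import Cheshire3; the session is passed as None, the third parameter is named stream_class because class is a reserved word in Python, and only cache levels 0 and 2 are modelled.

```python
class LineStream:
    """Stand-in stream: every non-empty line of the data is one document."""

    def __init__(self, session, stream, format, tagName=None, codec=None, factory=None):
        self.stream = stream
        self.format = format
        self.factory = factory

    def find_documents(self, session, cache=0):
        for line in self.stream.splitlines():
            if line.strip():
                yield line.strip()


class SketchFactory:
    """Stand-in factory keeping a class-level registry of format -> stream class."""

    streams = {}

    @classmethod
    def register_stream(cls, session, format, stream_class):
        cls.streams[format] = stream_class

    def load(self, session, data, cache=0, format=None):
        stream = self.streams[format](session, data, format, factory=self)
        self._generator = stream.find_documents(session, cache=cache)
        # cache=2: hold every document in memory; cache=0: fetch lazily.
        self._cache = list(self._generator) if cache == 2 else None
        self._position = 0

    def get_document(self, session, index=None):
        if index is not None:
            if self._cache is None:
                raise ValueError("indexed access needs cache=2 in this sketch")
            return self._cache[index]
        if self._cache is not None:
            doc = self._cache[self._position]
            self._position += 1
            return doc
        return next(self._generator)


SketchFactory.register_stream(None, "lines", LineStream)
factory = SketchFactory()
factory.load(None, "one\ntwo\nthree", cache=2, format="lines")
first = factory.get_document(None)      # next document
third = factory.get_document(None, 2)   # document at index 2
second = factory.get_document(None)     # indexed access did not advance the cursor
```

Keeping the registry on the class means a newly registered stream is visible to every factory instance the next time load is called with that format.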

DocumentStream Methods:

Function | Parameters | Returns | Description
__init__ | session, stream, format, ?tagName, ?codec, ?factory | | The constructor takes the location of the data stream and the format. Optional arguments are the tagName to look for, the codec to use to read in data, and the DocumentFactory that initialized the stream.
open_stream | streamLocation | data stream | Perform any operations needed before the data stream can be read (e.g. open files).
fetch_document | index | data/Document | Return the index'th piece of data or Document.
find_documents | session, cache | | Find documents within the data stream.
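The following stand-in shows one way the three stream methods and the cache levels could interact. SketchLineStream is hypothetical and does not subclass any Cheshire3 class; it treats each non-empty line of a bytes object as a document, and, as an assumption of this sketch, find_documents is a generator that must be drained before fetch_document is used in the caching modes.

```python
import io


class SketchLineStream:
    """Stand-in DocumentStream over bytes; one document per non-empty line."""

    def __init__(self, session, stream, format, tagName=None, codec="utf-8", factory=None):
        self.format = format
        self.codec = codec
        self.factory = factory
        self.locations = []   # (offset, length) pairs, filled when cache == 1
        self.documents = []   # decoded documents, filled when cache == 2
        self.stream = self.open_stream(stream)

    def open_stream(self, stream):
        # Prepare the data for reading; here the raw bytes are wrapped in a file-like object.
        return io.BytesIO(stream)

    def find_documents(self, session, cache=0):
        self.stream.seek(0)
        offset = 0
        for line in self.stream:
            text = line.decode(self.codec).strip()
            if text:
                if cache == 0:
                    yield text                                  # hand over immediately
                elif cache == 1:
                    self.locations.append((offset, len(line)))  # remember where it is
                else:
                    self.documents.append(text)                 # keep the document itself
            offset += len(line)

    def fetch_document(self, index):
        if self.documents:
            return self.documents[index]
        offset, length = self.locations[index]
        self.stream.seek(offset)
        return self.stream.read(length).decode(self.codec).strip()


raw = b"alpha\n\nbeta\ngamma\n"

streamed = list(SketchLineStream(None, raw, "lines").find_documents(None, cache=0))

located = SketchLineStream(None, raw, "lines")
for _ in located.find_documents(None, cache=1):
    pass  # draining the generator fills located.locations
second_doc = located.fetch_document(1)
```

With cache level 1 only byte offsets are held in memory, so each fetch re-reads the underlying data; level 2 trades memory for faster repeated access.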

Sub-Packages

Sub-Package: web
Module: cheshire3.web.documentFactory
Classes:

Sub-Package: vdb
Module: cheshire3.vdb.documentFactory
Classes: