Cheshire3: preParser Module

Module preParser

Classes
NormalizerPreParser: Calls a named Normalizer to do the conversion
HtmlSmashPreParser: Attempts to reduce HTML to its raw text
RegexpSmashPreParser: Strip, replace, or keep data that matches a given regular expression
HtmlTidyPreParser: Calls the Tidy utility to turn HTML into XHTML for parsing
TagStripPreParser: Strip only named tags from the document, e.g. script, style
PdfToXmlPreParser: Wrapper around pdftohtml that turns PDF into XML
PdfToTxtPreParser: Convert PDF to text via the pdftotext utility
SgmlPreParser: Convert SGML into XML
AmpPreParser: Escape lone ampersands in otherwise XML text
MarcToXmlPreParser: Convert MARC into MARCXML
MarcToSgmlPreParser: Convert MARC into Cheshire2's MarcSgml
TxtToXmlPreParser: Minimally wrap text in <data> XML tags
GzipPreParser: Gunzip a gzipped document
B64EncodePreParser: Encode document in Base64
B64DecodePreParser: Decode document from Base64
OpenOfficePreParser: Use an OpenOffice server to convert documents into OpenDocument XML
PrintableOnlyPreParser: Replace or strip non-printable characters
CharacterEntityPreParser: Transform Latin-1 and broken character entities into numeric character entities
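
The summaries above describe each transformation only in a phrase. As a rough illustration of what a few of them mean, the sketch below uses only the Python standard library. It does not use or reproduce the Cheshire3 API: every function name here is hypothetical, and details such as the exact definition of a "lone" ampersand, or whether the wrapped text is escaped, are assumptions rather than documented Cheshire3 behaviour.

```python
import base64
import gzip
import re
from xml.sax.saxutils import escape

# Hypothetical stand-ins for illustration only; these are NOT the Cheshire3
# classes, and the real implementations may differ in detail.

# Assumption: a "lone" ampersand is one that does not already begin an
# entity or character reference such as &amp;, &#38; or &#x26;.
_LONE_AMP = re.compile(r"&(?!#?\w+;)")


def escape_lone_ampersands(text: str) -> str:
    """Roughly what the AmpPreParser summary describes."""
    return _LONE_AMP.sub("&amp;", text)


def wrap_text_as_xml(text: str) -> str:
    """Roughly what the TxtToXmlPreParser summary describes.

    Assumption: markup characters in the text are escaped so that the
    result is well-formed XML.
    """
    return "<data>" + escape(text) + "</data>"


def gunzip(data: bytes) -> bytes:
    """Roughly what the GzipPreParser summary describes."""
    return gzip.decompress(data)


def b64_encode(data: bytes) -> bytes:
    """Roughly what the B64EncodePreParser summary describes."""
    return base64.b64encode(data)


def b64_decode(data: bytes) -> bytes:
    """Roughly what the B64DecodePreParser summary describes."""
    return base64.b64decode(data)
```

Each function takes document data and returns transformed data, so they can be applied one after another, for example b64_decode followed by gunzip on a Base64-encoded, gzipped payload.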