Class XMLFileIndexingWriter
- All Implemented Interfaces:
DocWriter
- Direct Known Subclasses:
DleseAnnoFileIndexingServiceWriter,DleseCollectionFileIndexingWriter,ItemFileIndexingWriter,NCSCollectionFileIndexingWriter,NewsOppsFileIndexingWriter,SimpleXMLFileIndexingWriter
Creates a Lucene Document from any XML file by stripping the XML tags
to extract and index the content. The reader for this type of Document is XMLDocReader.
The Lucene Document fields that are created by this class are (in addition to the ones listed for
FileIndexingServiceWriter):
collection - The collection associated with this resource.
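The tag-stripping idea described above can be illustrated with a minimal sketch. It uses the JDK's built-in DOM parser rather than the dom4j classes this writer exposes, and the class name XmlTextExtractor, the method stripTags and the sample record are hypothetical, not part of this API. As with addCustomFields, the input must be well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the DLESE API.
class XmlTextExtractor {

    /** Parses well-formed XML and returns its text content with runs of whitespace collapsed. */
    static String stripTags(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        // getTextContent() concatenates the text of all descendant nodes, dropping the tags.
        String text = doc.getDocumentElement().getTextContent();
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample record.
        String xml = "<record><title>Plate tectonics</title> <subject>geology</subject></record>";
        System.out.println(stripTags(xml));
    }
}
```

The resulting string is the kind of content that would be added to a default search field; how the real class tokenizes and stores it is not shown here.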
- Author:
- John Weatherley
Constructor Summary
Constructors
XMLFileIndexingWriter()
    Constructor for the XMLFileIndexingWriter.
-
Method Summary
Modifier and Type, Method
    Description
protected abstract String[] _getIds()
    Return unique IDs for the item being indexed, one for each collection that catalogs the resource.
protected void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)
    Adds the full content of the XML to the default search field.
protected abstract void addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)
    Adds additional fields that are unique to the document format being indexed.
protected BoundingBox getBoundingBox()
    Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none apply.
protected String[] getCollections()
    Returns unique collection keys for the item being indexed.
org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc)
    Creates a Lucene Document for the XML that is equal to the existing Document.
abstract String getDescription()
    Return a description for the document being indexed, or null if none applies.
getDocGroup()
    Gets the collection specifier, for example 'dcc', 'comet'.
protected org.dom4j.Document getDom4jDoc()
    Gets the dom4j Document for use by sub-classes.
protected String getFieldContent(String[] values, String useVocabMapping, String metadataFormat)
    Gets the vocab encoded keys for the given values, separated by the '+' symbol.
protected String getFieldContent(String value, String useVocabMapping, String metadataFormat)
    Gets the encoded vocab key for the given content.
protected String getFieldName(String vocabFieldString, String metadataFormat)
    Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'.
String[] getIds()
    Returns the ids for the item being indexed.
protected SimpleLuceneIndex getIndex()
    Gets the index used by this XML File Indexer.
protected ResultDocList getMyAnnoResultDocs()
    Gets the annotations for this record, null or zero length if none available.
protected DleseCollectionDocReader getMyCollectionDoc()
    Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.
static final String getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)
    Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.
getPrimaryId()
    Returns the unique primary record ID for the item being indexed.
protected RecordDataService getRecordDataService()
    Gets the recordDataService used by this XML File Indexer.
getRelatedIds()
    Gets the ids of related records.
getRelatedIdsMap()
    Gets the ids of related records.
getRelatedUrls()
    Gets the urls of related records.
getRelatedUrlsMap()
    Gets the urls of related records.
protected String getTermStringFromStringArray(String[] vals)
    Gets the appropriate terms from a string array of metadata fields.
abstract String getTitle()
    Return a title for the document being indexed, or null if none applies.
abstract String[] getUrls()
    Return the URL(s) to the resource being indexed, or null if none apply.
protected abstract Date getWhatsNewDate()
    Returns the date used to determine "What's new" in the library, or null if none is available.
protected abstract String getWhatsNewType()
    Returns the type of category for "What's new" in the library, or null if none is available.
protected XMLIndexer getXmlIndexer()
    Gets the XMLIndexer for use by sub-classes.
protected XMLIndexerFieldsConfig getXmlIndexerFieldsConfig()
    Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.
abstract boolean indexFullContentInDefaultAndStems()
    Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class.
abstract void init(File source, org.apache.lucene.document.Document existingDoc)
    This method is called prior to processing and may be used for any necessary set-up.
Methods inherited from class org.dlese.dpc.index.writer.FileIndexingServiceWriter
abortIndexing, addDocToRemove, addToAdminDefaultField, addToDefaultField, create, destroy, getConfigAttributes, getDocsource, getDocType, getFileContent, getFileIndexingPlugin, getFileIndexingService, getLuceneDoc, getPreviousRecordDoc, getReaderClass, getSessionAttributes, getSourceDir, getSourceFile, getValidationReport, isMakingDeletedDoc, isValidationEnabled, prtln, prtlnErr, setConfigAttributes, setDebug, setFileIndexingPlugin, setFileIndexingService, setIsMakingDeletedDoc, setValidationEnabled
-
Constructor Details
-
XMLFileIndexingWriter
public XMLFileIndexingWriter()
Constructor for the XMLFileIndexingWriter.
-
-
Method Details
-
getIds
Returns the ids for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.
- Returns:
- The id String
- Throws:
Exception - If error
-
getPrimaryId
Returns the unique primary record ID for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.
- Returns:
- The id String
- Throws:
Exception - If error
-
getRelatedIds
Gets the ids of related records.
- Returns:
- The related ids value, or null if none
- Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error
-
getRelatedUrls
Gets the urls of related records.
- Returns:
- The related urls value, or null if none
- Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error
-
getRelatedIdsMap
Gets the ids of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the ids of the target records.
- Returns:
- The related ids value, or null if none
- Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error
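A minimal sketch of the Map shape described above, and one way a caller might consume it. The exact generic type returned by getRelatedIdsMap is not shown on this page, so Map<String, List<String>> is an assumption; the class RelatedIdsSketch, the relationship name isRelatedTo and the ANNO ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the DLESE API.
class RelatedIdsSketch {

    /** Flattens relationship -> ids into one list of target ids, keeping first-seen order and dropping duplicates. */
    static List<String> flatten(Map<String, List<String>> relatedIdsMap) {
        List<String> all = new ArrayList<>();
        if (relatedIdsMap == null) {
            // The documented return value may be null if there are no related records.
            return all;
        }
        for (List<String> ids : relatedIdsMap.values()) {
            for (String id : ids) {
                if (!all.contains(id)) {
                    all.add(id);
                }
            }
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, List<String>> related = new LinkedHashMap<>();
        related.put("isAnnotatedBy", Arrays.asList("ANNO-1", "ANNO-2"));               // relationship named in the docs
        related.put("isRelatedTo", Arrays.asList("ANNO-2", "DLESE-000-000-000-001"));  // hypothetical relationship
        System.out.println(flatten(related));
    }
}
```

Whether getRelatedIds is derived from getRelatedIdsMap in this way is not stated here; the sketch only illustrates the data structure.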
-
getRelatedUrlsMap
Gets the urls of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the urls of the target records.
- Returns:
- The related urls value, or null if none
- Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error
-
getCollections
Returns unique collection keys for the item being indexed. For example "dcc" (single collection) or "dcc dwel" (multiple collections). If more than one collection is provided, the first one must be the primary collection. May be overridden by sub-classes as appropriate (overridden by ADNFileIndexingWriter).
- Returns:
- The collection keys
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getDocGroup
Gets the collection specifier, for example 'dcc', 'comet'.
- Specified by:
getDocGroup in class FileIndexingServiceWriter
- Returns:
- The collection specifier
- Throws:
Exception - If an error occurred
-
getBoundingBox
Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none apply. Override if necessary.
- Returns:
- BoundingBox, or null
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
init
public abstract void init(File source, org.apache.lucene.document.Document existingDoc) throws Exception
This method is called prior to processing and may be used for any necessary set-up. This method should throw an Exception with an appropriate message if an error occurs.
- Specified by:
init in class FileIndexingServiceWriter
- Parameters:
source - The source file being indexed
existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
- Throws:
Exception - If an error occurred during set-up.
-
_getIds
Return unique IDs for the item being indexed, one for each collection that catalogs the resource. For example "DLESE-000-000-000-001" (single ID) or "DLESE-000-000-000-036 COMET-60" (multiple IDs). If more than one ID is present, the first one is the primary.
- Returns:
- The id(s)
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
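The "first one is the primary" convention can be sketched as follows. This is an illustration only: the class PrimaryIdSketch and its method are hypothetical, and returning null for an empty array is an assumption, not documented behavior of getPrimaryId.

```java
// Hypothetical helper, not part of the DLESE API.
class PrimaryIdSketch {

    /** Returns the first ID, treated as the primary one, or null if no IDs are present (assumed behavior). */
    static String primaryId(String[] ids) {
        if (ids == null || ids.length == 0) {
            return null;
        }
        return ids[0];
    }

    public static void main(String[] args) {
        // IDs taken from the example in the method description.
        String[] ids = {"DLESE-000-000-000-036", "COMET-60"};
        System.out.println(primaryId(ids));
    }
}
```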
-
getTitle
Return a title for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'title' and is also indexed in the 'default' field.
- Returns:
- The title String
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getDescription
Return a description for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'description' and is also indexed in the 'default' field.
- Returns:
- The description String
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getUrls
Return the URL(s) to the resource being indexed, or null if none apply. If more than one URL references the resource, the first one is the primary. The URL Strings are tokenized and indexed under the field key 'uri' and are also indexed in the 'default' field. They are also stored in the index untokenized under the field key 'url'.
- Returns:
- The url String(s)
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
indexFullContentInDefaultAndStems
public abstract boolean indexFullContentInDefaultAndStems()
Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class. If true, the content is indexed using the #addToDefaultField method.
- Returns:
- True to have the full XML content indexed in the 'default' and 'stems' fields
-
getWhatsNewDate
Returns the date used to determine "What's new" in the library, or null if none is available.
- Returns:
- The what's new date for the item or null if not available.
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getWhatsNewType
Returns the type of category for "What's new" in the library, or null if none is available. Must be a simple lower case String with no spaces, for example 'itemnew', 'itemannocomplete', 'itemannoinprogress', 'annocomplete', 'annoinprogress', 'collection'.
- Returns:
- The what's new type.
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
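An implementer may want to check a candidate type against the "simple lower case String with no spaces" rule before returning it. The sketch below is one possible check; the class WhatsNewTypeCheck is hypothetical, and restricting the value to the letters a-z is an assumption that is stricter than the wording above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the DLESE API.
class WhatsNewTypeCheck {

    // Assumption: "simple lower case with no spaces" is read as one or more letters a-z.
    private static final Pattern SIMPLE_LOWER = Pattern.compile("[a-z]+");

    /** Returns true if the type is non-null and consists only of lower case letters. */
    static boolean isValid(String type) {
        return type != null && SIMPLE_LOWER.matcher(type).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("itemannocomplete")); // a documented example value
        System.out.println(isValid("Item New"));         // contains upper case and a space
    }
}
```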
-
addFields
protected abstract void addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile) throws Exception
Adds additional fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.
The following Lucene Field types are available for indexing with the Document:
Field.Text(String name, String value) -- tokenized, indexed, stored
Field.UnStored(String name, String value) -- tokenized, indexed, not stored
Field.Keyword(String name, String value) -- not tokenized, indexed, stored
Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
Field(String name, String string, boolean store, boolean index, boolean tokenize) -- allows full control over whether the field is stored, indexed and tokenized
Example code:
protected void addFields(Document newDoc, Document existingDoc, File sourceFile) throws Exception {
    // Add a tokenized, indexed and stored field to the new Document
    String customContent = "Some content";
    newDoc.add(Field.Text("mycustomfield", customContent));
}
- Parameters:
newDoc - The new Document that is being created for this resource
existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
sourceFile - The source File that is being indexed
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
addCustomFields
protected void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile) throws Exception
Adds the full content of the XML to the default search field. Strips the XML tags to extract the content. Will not work properly if the XML is not well-formed.
- Specified by:
addCustomFields in class FileIndexingServiceWriter
- Parameters:
newDoc - The new Document that is being created for this resource
existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
sourceFile - The source File that is being indexed
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getDeletedDoc
public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc) throws Throwable
Creates a Lucene Document for the XML that is equal to the existing Document.
- Overrides:
getDeletedDoc in class FileIndexingServiceWriter
- Parameters:
existingDoc - An existing FileIndexingService Document that currently resides in the index for the given file
- Returns:
- A Lucene FileIndexingService Document
- Throws:
Throwable - Thrown if error occurs
-
getMyAnnoResultDocs
Gets the annotations for this record, null or zero length if none available.
- Returns:
- The myAnnoResultDocs value
- Throws:
Exception - NOT YET DOCUMENTED
-
getXmlIndexerFieldsConfig
Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.
- Returns:
- The xmlIndexerFieldsConfig value
-
getFieldContent
protected String getFieldContent(String[] values, String useVocabMapping, String metadataFormat) throws Exception
Gets the vocab encoded keys for the given values, separated by the '+' symbol.
- Parameters:
values - The values to encode.
useVocabMapping - The mapping to use, for example "contentStandards".
metadataFormat - The metadata format, for example 'adn'
- Returns:
- The encoded vocab keys.
- Throws:
Exception - If error.
-
getFieldContent
protected String getFieldContent(String value, String useVocabMapping, String metadataFormat) throws Exception
Gets the encoded vocab key for the given content.
- Parameters:
value - The value to encode
useVocabMapping - The vocab mapping to use, for example 'contentStandard'
metadataFormat - The metadata format, for example 'adn'
- Returns:
- The encoded value, or unchanged if unable to encode
- Throws:
Exception - If error
-
getFieldName
Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'. If unable to get the field ID, the vocab field String is returned unchanged.
- Parameters:
vocabFieldString - The field, for example 'gradeRange'
metadataFormat - The metadata format, for example 'adn'
- Returns:
- The field key, for example 'gr', or unchanged if unable to determine
- Throws:
Exception - If error
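The look-up-with-fallback behavior described for getFieldContent and getFieldName can be sketched with plain Maps standing in for the vocabulary service. The class VocabEncodingSketch, its methods, and the encoded keys '0a' and '0b' are hypothetical; only the 'gradeRange' to 'gr' example, the '+' separator and the "unchanged if unable to encode" rule come from the descriptions above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the DLESE API.
class VocabEncodingSketch {

    /** Returns the encoded key for a value, or the value unchanged if no mapping exists. */
    static String encode(Map<String, String> mapping, String value) {
        return mapping.getOrDefault(value, value);
    }

    /** Encodes each value and joins the results with the '+' symbol. */
    static String encodeAll(Map<String, String> mapping, String[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append('+');
            }
            sb.append(encode(mapping, values[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> gradeRangeKeys = new HashMap<>();
        gradeRangeKeys.put("Primary elementary", "0a"); // hypothetical encoded key
        gradeRangeKeys.put("Middle school", "0b");      // hypothetical encoded key

        Map<String, String> fieldIds = new HashMap<>();
        fieldIds.put("gradeRange", "gr");               // example from the getFieldName description

        System.out.println(encodeAll(gradeRangeKeys, new String[] {"Primary elementary", "Middle school"}));
        System.out.println(encode(fieldIds, "gradeRange"));
        System.out.println(encode(fieldIds, "unknownField")); // falls back to the input
    }
}
```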
-
getTermStringFromStringArray
Gets the appropriate terms from a string array of metadata fields. Uses all terms found after the last colon ":" found in the string.
- Parameters:
vals - Metadata fields that must be delimited by colons.
- Returns:
- The individual terms used for indexing.
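The "after the last colon" rule can be sketched as below. The class TermStringSketch is hypothetical, the sample values are invented, and joining the extracted terms with a single space is an assumption, since the exact output format is not specified here.

```java
// Hypothetical helper, not part of the DLESE API.
class TermStringSketch {

    /** Takes the text after the last ':' in each value and joins the results with spaces (assumed separator). */
    static String termsAfterLastColon(String[] vals) {
        if (vals == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (String val : vals) {
            // lastIndexOf returns -1 when there is no colon, so the whole value is kept in that case.
            String term = val.substring(val.lastIndexOf(':') + 1).trim();
            if (term.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical colon-delimited metadata values.
        String[] vals = {"DLESE:Physical sciences:Physics", "DLESE:Geology"};
        System.out.println(termsAfterLastColon(vals));
    }
}
```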
-
getXmlIndexer
Gets the XMLIndexer for use by sub-classes.
- Returns:
- The XMLIndexer
- Throws:
Exception - If error
-
getDom4jDoc
Gets the dom4j Document for use by sub-classes.
- Returns:
- The Document
- Throws:
Exception - If error
-
getMyCollectionDoc
Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.
- Returns:
- The myCollectionDoc value
-
getOaiModtime
public static final String getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)
Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.
- Parameters:
sourceFile - The source file
existingDoc - The existing Doc
- Returns:
- The oaiModtime value
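The 3-minute padding can be illustrated with java.time. This is a sketch under stated assumptions: the class OaiModtimeSketch is hypothetical, the base timestamp is assumed to be a last-modified time in epoch milliseconds, and the ISO-8601 output of Instant.toString() is a stand-in, since the string format actually used by getOaiModtime is not given here.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the DLESE API.
class OaiModtimeSketch {

    // Padding taken from the method description: 3 minutes into the future.
    static final Duration INDEXING_DELAY = Duration.ofMinutes(3);

    /** Returns the given modification time plus the padding, as an ISO-8601 UTC string (assumed format). */
    static String paddedModtime(long lastModifiedMillis) {
        return Instant.ofEpochMilli(lastModifiedMillis).plus(INDEXING_DELAY).toString();
    }

    public static void main(String[] args) {
        // For a real file, the input could be new java.io.File(path).lastModified().
        System.out.println(paddedModtime(System.currentTimeMillis()));
    }
}
```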
-
getRecordDataService
Gets the recordDataService used by this XML File Indexer.
- Returns:
- The recordDataService, or null if not available.
-
getIndex
Gets the index used by this XML File Indexer.
- Returns:
- The index, or null if not available.
-