Class XMLFileIndexingWriter

java.lang.Object
org.dlese.dpc.index.writer.FileIndexingServiceWriter
org.dlese.dpc.index.writer.XMLFileIndexingWriter
All Implemented Interfaces:
DocWriter
Direct Known Subclasses:
DleseAnnoFileIndexingServiceWriter, DleseCollectionFileIndexingWriter, ItemFileIndexingWriter, NCSCollectionFileIndexingWriter, NewsOppsFileIndexingWriter, SimpleXMLFileIndexingWriter

public abstract class XMLFileIndexingWriter extends FileIndexingServiceWriter
Creates a Lucene Document from any XML file by stripping the XML tags to extract and index the content. The reader for this type of Document is XMLDocReader.

The Lucene Document fields created by this class (in addition to the ones listed for FileIndexingServiceWriter) are:

collection - The collection associated with this resource.

Author:
John Weatherley
  • Constructor Details

    • XMLFileIndexingWriter

      public XMLFileIndexingWriter()
      Constructor for the XMLFileIndexingWriter.
  • Method Details

    • getIds

      public String[] getIds() throws Exception
      Returns the ids for the item being indexed. If more than one record catalogs the same item, the first ID is the primary one.
      Returns:
      The id Strings
      Throws:
      Exception - If error
    • getPrimaryId

      public String getPrimaryId() throws Exception
      Returns the unique primary record ID for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.
      Returns:
      The id String
      Throws:
      Exception - If error
    • getRelatedIds

      public List getRelatedIds() throws IllegalStateException, Exception
      Gets the ids of related records.
      Returns:
      The related ids value, or null if none
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
      Exception - If error
    • getRelatedUrls

      public List getRelatedUrls() throws IllegalStateException, Exception
      Gets the urls of related records.
      Returns:
      The related urls value, or null if none
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
      Exception - If error
    • getRelatedIdsMap

      public Map getRelatedIdsMap() throws IllegalStateException, Exception
      Gets the ids of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the ids of the target records.
      Returns:
      The related ids value, or null if none
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
      Exception - If error
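      As a rough illustration of the Map shape described above, the sketch below flattens a relationship-to-ids map into one list of target ids. The class and method names are hypothetical, and the generic types are an assumption drawn from the description (the real method returns a raw Map).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of XMLFileIndexingWriter.
final class RelatedIdsSketch {

    // Collects the target ids of every relationship into a single list,
    // in the iteration order of the map.
    static List<String> allTargetIds(Map<String, List<String>> relatedIdsMap) {
        List<String> ids = new ArrayList<>();
        if (relatedIdsMap == null) {
            return ids; // the getter is documented to return null when there are none
        }
        for (List<String> targets : relatedIdsMap.values()) {
            ids.addAll(targets);
        }
        return ids;
    }
}
```

      A map such as {isAnnotatedBy=[A-1, A-2]} would yield the list [A-1, A-2]; the ids here are made up for illustration.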
    • getRelatedUrlsMap

      public Map getRelatedUrlsMap() throws IllegalStateException, Exception
      Gets the urls of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the urls of the target records.
      Returns:
      The related urls value, or null if none
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
      Exception - If error
    • getCollections

      protected String[] getCollections() throws Exception
      Returns unique collection keys for the item being indexed. For example "dcc" (single collection) or "dcc dwel" (multiple collections). If more than one collection is provided, the first one must be the primary collection. May be overridden by sub-classes as appropriate (overridden by ADNFileIndexingWriter).
      Returns:
      The collection keys
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • getDocGroup

      public String getDocGroup() throws Exception
      Gets the collection specifier, for example 'dcc', 'comet'.
      Specified by:
      getDocGroup in class FileIndexingServiceWriter
      Returns:
      The collection specifier
      Throws:
      Exception - If an error occurred
    • getBoundingBox

      protected BoundingBox getBoundingBox() throws Exception
      Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies. Override if necessary.
      Returns:
      BoundingBox, or null
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • init

      public abstract void init(File source, org.apache.lucene.document.Document existingDoc) throws Exception
      This method is called prior to processing and may be used for any necessary set-up. This method should throw an Exception with an appropriate message if an error occurs.
      Specified by:
      init in class FileIndexingServiceWriter
      Parameters:
      source - The source file being indexed
      existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
      Throws:
      Exception - If an error occurred during set-up.
    • _getIds

      protected abstract String[] _getIds() throws Exception
      Return unique IDs for the item being indexed, one for each collection that catalogs the resource. For example "DLESE-000-000-000-001" (single ID) or "DLESE-000-000-000-036 COMET-60" (multiple IDs). If more than one ID is present, the first one is the primary.
      Returns:
      The id(s)
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
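      The "first one is the primary" convention can be sketched as follows. This is a minimal illustration with hypothetical names, not the class's own code; treating an empty array as having no primary id is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the documented ordering convention.
final class PrimaryIdSketch {

    // The first id is the primary one, per the description above.
    static String primaryId(String[] ids) {
        if (ids == null || ids.length == 0) {
            return null; // assumption: no ids means no primary id
        }
        return ids[0];
    }

    // Every id after the first one.
    static String[] otherIds(String[] ids) {
        if (ids == null || ids.length < 2) {
            return new String[0];
        }
        return Arrays.copyOfRange(ids, 1, ids.length);
    }
}
```

      For the ids DLESE-000-000-000-036 and COMET-60, primaryId gives DLESE-000-000-000-036 and otherIds gives a one-element array holding COMET-60.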
    • getTitle

      public abstract String getTitle() throws Exception
      Return a title for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'title' and is also indexed in the 'default' field.
      Returns:
      The title String
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • getDescription

      public abstract String getDescription() throws Exception
      Return a description for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'description' and is also indexed in the 'default' field.
      Returns:
      The description String
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • getUrls

      public abstract String[] getUrls() throws Exception
      Return the URL(s) to the resource being indexed, or null if none apply. If more than one URL references the resource, the first one is the primary. The URL Strings are tokenized and indexed under the field key 'uri' and are also indexed in the 'default' field. They are also stored in the index untokenized under the field key 'url'.
      Returns:
      The url String(s)
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • indexFullContentInDefaultAndStems

      public abstract boolean indexFullContentInDefaultAndStems()
      Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class. If true, the content is indexed using the #addToDefaultField method.
      Returns:
      True to have the full XML content indexed in the 'default' and 'stems' fields
    • getWhatsNewDate

      protected abstract Date getWhatsNewDate() throws Exception
      Returns the date used to determine "What's new" in the library, or null if none is available.
      Returns:
      The what's new date for the item or null if not available.
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • getWhatsNewType

      protected abstract String getWhatsNewType() throws Exception
      Returns the category type for "What's new" in the library, or null if none is available. Must be a simple lower case String with no spaces, for example 'itemnew', 'itemannocomplete', 'itemannoinprogress', 'annocomplete', 'annoinprogress', 'collection'.
      Returns:
      The what's new type.
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
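      A sub-class might check its return value against the constraint above before handing it back. The sketch below is one possible reading: it assumes "simple lower case String with no spaces" means lower-case ASCII letters only, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator, not part of the DLESE API.
final class WhatsNewTypeSketch {

    // Assumption: a valid type consists solely of the letters a-z.
    private static final Pattern SIMPLE_LOWER = Pattern.compile("[a-z]+");

    static boolean isValidType(String type) {
        return type != null && SIMPLE_LOWER.matcher(type).matches();
    }
}
```

      Under this reading 'itemnew' passes, while 'Item New' is rejected because of the capital letters and the space.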
    • addFields

      protected abstract void addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile) throws Exception
      Adds additional fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.

      The following Lucene Field types are available for indexing with the Document:
      Field.Text(String name, String value) -- tokenized, indexed, stored
      Field.UnStored(String name, String value) -- tokenized, indexed, not stored
      Field.Keyword(String name, String value) -- not tokenized, indexed, stored
      Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
      Field(String name, String string, boolean store, boolean index, boolean tokenize) -- gives full control over whether the value is stored, indexed and tokenized

      Example code:
      protected void addFields(Document newDoc, Document existingDoc, File sourceFile) throws Exception {
        String customContent = "Some content";
        // Tokenized, indexed and stored under the field name 'mycustomfield'
        newDoc.add(Field.Text("mycustomfield", customContent));
      }

      Parameters:
      newDoc - The new Document that is being created for this resource
      existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
      sourceFile - The sourceFile that is being indexed
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • addCustomFields

      protected void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile) throws Exception
      Adds the full content of the XML to the default search field. Strips the XML tags to extract the content. Will not work properly if the XML is not well-formed.

      Specified by:
      addCustomFields in class FileIndexingServiceWriter
      Parameters:
      newDoc - The new Document that is being created for this resource
      existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
      sourceFile - The source file that is being indexed
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • getDeletedDoc

      public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc) throws Throwable
      Creates a Lucene Document for the XML that is equal to the existing Document.
      Overrides:
      getDeletedDoc in class FileIndexingServiceWriter
      Parameters:
      existingDoc - An existing FileIndexingService Document that currently resides in the index for the given file
      Returns:
      A Lucene FileIndexingService Document
      Throws:
      Throwable - Thrown if error occurs
    • getMyAnnoResultDocs

      protected ResultDocList getMyAnnoResultDocs() throws Exception
      Gets the annotations for this record, null or zero length if none available.
      Returns:
      The myAnnoResultDocs value
      Throws:
      Exception - If error
    • getXmlIndexerFieldsConfig

      protected XMLIndexerFieldsConfig getXmlIndexerFieldsConfig()
      Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.
      Returns:
      The xmlIndexerFieldsConfig value
    • getFieldContent

      protected String getFieldContent(String[] values, String useVocabMapping, String metadataFormat) throws Exception
      Gets the vocab encoded keys for the given values, separated by the '+' symbol.
      Parameters:
      values - The values to encode.
      useVocabMapping - The mapping to use, for example "contentStandards".
      metadataFormat - The metadata format, for example 'adn'
      Returns:
      The encoded vocab keys.
      Throws:
      Exception - If error.
    • getFieldContent

      protected String getFieldContent(String value, String useVocabMapping, String metadataFormat) throws Exception
      Gets the encoded vocab key for the given content.
      Parameters:
      value - The value to encode
      useVocabMapping - The vocab mapping to use, for example 'contentStandard'
      metadataFormat - The metadata format, for example 'adn'
      Returns:
      The encoded value, or unchanged if unable to encode
      Throws:
      Exception - If error
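      The two getFieldContent overloads can be pictured with the sketch below. It stands in a plain Map for the vocabulary mapping, which is an assumption: the real lookup, its arguments and its key values are not shown in this documentation. The fallback to the unchanged value is documented only for the single-value overload; applying it to each array element is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the vocab encoding; names and keys are illustrative.
final class VocabEncodingSketch {

    // Returns the encoded key, or the value unchanged if no key is known.
    static String encode(String value, Map<String, String> vocabKeys) {
        String key = vocabKeys.get(value);
        return key != null ? key : value;
    }

    // Encodes each value and joins the results with the '+' symbol.
    static String encodeAll(String[] values, Map<String, String> vocabKeys) {
        List<String> encoded = new ArrayList<>();
        for (String value : values) {
            encoded.add(encode(value, vocabKeys));
        }
        return String.join("+", encoded);
    }
}
```

      With a made-up mapping of 'High school' to 'hs' and 'Middle school' to 'ms', encoding both values gives 'hs+ms', and an unmapped value passes through as-is.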
    • getFieldName

      protected String getFieldName(String vocabFieldString, String metadataFormat) throws Exception
      Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'. If unable to get the field ID, the vocab field String is returned unchanged.
      Parameters:
      vocabFieldString - The field, for example 'gradeRange'
      metadataFormat - The metadata format, for example 'adn'
      Returns:
      The field key, for example 'gr', or unchanged if unable to determine
      Throws:
      Exception - If error
    • getTermStringFromStringArray

      protected String getTermStringFromStringArray(String[] vals)
      Gets the appropriate terms from a string array of metadata fields. Uses all terms found after the last colon ":" found in the string.
      Parameters:
      vals - Metadata fields that must be delimited by colons.
      Returns:
      The individual terms used for indexing.
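      The "after the last colon" rule might look like the sketch below. It is a guess at the behavior from the description alone: joining the extracted terms with single spaces and skipping empty terms are assumptions, and the names are hypothetical.

```java
// Hypothetical re-creation of the described rule, not the actual implementation.
final class TermStringSketch {

    // Keeps the text after the last ':' of each value and joins the pieces with spaces.
    static String termsAfterLastColon(String[] vals) {
        StringBuilder terms = new StringBuilder();
        for (String val : vals) {
            // lastIndexOf returns -1 when there is no colon, so the whole value is kept
            String term = val.substring(val.lastIndexOf(':') + 1).trim();
            if (term.isEmpty()) {
                continue;
            }
            if (terms.length() > 0) {
                terms.append(' ');
            }
            terms.append(term);
        }
        return terms.toString();
    }
}
```

      For the illustrative inputs 'DLESE:Primary elementary' and 'DLESE:Geoscience:Geology', the sketch returns 'Primary elementary Geology'.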
    • getXmlIndexer

      protected XMLIndexer getXmlIndexer() throws Exception
      Gets the XMLIndexer for use by sub-classes
      Returns:
      The XMLIndexer
      Throws:
      Exception - If error
    • getDom4jDoc

      protected org.dom4j.Document getDom4jDoc() throws Exception
      Gets the dom4j Document for use by sub-classes
      Returns:
      The Document
      Throws:
      Exception - If error
    • getMyCollectionDoc

      protected DleseCollectionDocReader getMyCollectionDoc()
      Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.
      Returns:
      The myCollectionDoc value
    • getOaiModtime

      public static final String getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)
      Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.
      Parameters:
      sourceFile - The source file
      existingDoc - The existing Doc
      Returns:
      The oaiModtime value
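      The three-minute offset can be sketched with java.time. The documentation does not say which base time is used or how the returned String is formatted, so the sketch below only shows the padding step on an Instant supplied by the caller; all names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the padding; not the actual getOaiModtime code.
final class OaiModtimeSketch {

    // The delay allowance mentioned in the description.
    static final Duration INDEXING_DELAY = Duration.ofMinutes(3);

    // Shifts the given base time three minutes into the future.
    static Instant padded(Instant base) {
        return base.plus(INDEXING_DELAY);
    }
}
```

      A caller working from a file could pass Instant.ofEpochMilli(sourceFile.lastModified()) as the base, though whether the class uses the file time or the current time is not stated here.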
    • getRecordDataService

      protected RecordDataService getRecordDataService()
      Gets the recordDataService used by this XML File Indexer
      Returns:
      The recordDataService, or null if not available.
    • getIndex

      protected SimpleLuceneIndex getIndex()
      Gets the index used by this XML File Indexer
      Returns:
      The index, or null if not available.