Class FileIndexingServiceWriter

java.lang.Object
org.dlese.dpc.index.writer.FileIndexingServiceWriter
All Implemented Interfaces:
DocWriter
Direct Known Subclasses:
ErrorFileIndexingWriter, XMLFileIndexingWriter

public abstract class FileIndexingServiceWriter extends Object implements DocWriter
Abstract class for creating customized Lucene Documents for different file formats such as DLESE-IMS, ADN-item, ADN-collection, etc. Concrete sub-classes may be used with a FileIndexingService to enable automatic updating of the index whenever changes in the source file are made. This class, along with the FileIndexingService, may be used with a SimpleLuceneIndex to provide simple search support over files.

Note: after creating a new concrete FileIndexingServiceWriter, add a switch in RepositoryManager, method putDirInIndex(DirInfo, String) to select it for indexing.


The Lucene fields that are created by this class are:

  • doctype - The document format type (e.g. dlese_ims, adn, oai_dc, etc.) defined by concrete classes, with '0' appended to support wildcard searching.
  • readerclass - The class which is used to read typed Documents created by the concrete classes, for example "ItemDocReader".
  • default - The default field containing content added by concrete classes. Generally this is the field assigned in the Lucene index for default searching.
  • docsource - The absolute path to the file, which is used by the FileIndexingService for updating/deleting and may be used by beans or other classes that wish to have access to the source file.
  • docdir - The absolute path to the directory where the file resides, which is used by the FileIndexingService for updating/deleting and may be used by beans or other classes.
  • modtime - The file modification time, which is used by the FileIndexingService to determine if the file has changed and needs update and may be used by beans or other classes that wish to query the modtime for the record.
  • filecontent - The full content of the file, stored but not indexed.
  • deleted - Set to 'true' if the file or record for this document has been deleted, otherwise this field does not exist. Stored.
  • valid - Set to 'true' if the file or record for this document is valid, otherwise 'false'. This field may also be omitted. Not stored.
  • validationreport - Contains a report that provides validation information about the underlying file. This field may be omitted. Not stored.
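As a rough illustration of the file-related fields listed above, the following sketch builds them into a plain Map. The Map stands in for a Lucene Document, and the class name, method names, and value encodings (for example, modtime as a decimal millisecond string) are assumptions made for this example, not the class's actual implementation.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

class BaseFieldsSketch {

    // Builds a name -> value map mirroring some of the fields listed above.
    // A plain Map stands in for a Lucene Document; value encodings are assumed.
    static Map<String, String> baseFields(File sourceFile, String docType, String readerClass) {
        Map<String, String> fields = new LinkedHashMap<>();
        File abs = sourceFile.getAbsoluteFile();
        fields.put("doctype", docType + "0");   // '0' appended, as the field list describes
        fields.put("readerclass", readerClass);
        fields.put("docsource", abs.getPath());
        fields.put("docdir", abs.getParent());
        fields.put("modtime", Long.toString(abs.lastModified()));
        return fields;
    }

    // True when no modtime was indexed or it differs from the file's current one.
    static boolean needsUpdate(File sourceFile, Map<String, String> indexedFields) {
        String indexed = indexedFields.get("modtime");
        return indexed == null || Long.parseLong(indexed) != sourceFile.lastModified();
    }
}
```

Comparing the stored modtime with File.lastModified() is one simple way a service could decide whether a file needs re-indexing; the real FileIndexingService may use a different comparison.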
Author:
John Weatherley
  • Constructor Summary

    Constructors
    Constructor
    Description
    FileIndexingServiceWriter()
  • Method Summary

    Modifier and Type
    Method
    Description
    protected void
    abortIndexing()
    Aborts the indexing process by returning a null index document.
    protected abstract void
    addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document previousRecordDoc, File sourceFile)
    Adds additional custom fields that are unique to the document format being indexed.
    protected void
    addDocToRemove(String field, String value)
    Removes a matching item from the index during the FileIndexingService update.
    protected void
    addToAdminDefaultField(String value)
    Adds the given String to a text field referenced in the index by the field name 'admindefault'.
    protected void
    addToDefaultField(String value)
    Adds the given String to the 'default' and 'stems' fields as text and stemmed text, respectively.
    FileIndexingServiceData
    create(File sourceFile, org.apache.lucene.document.Document existingLuceneDoc, FileIndexingPlugin plugin, HashMap sessionAttr)
    Creates the Lucene Document for the given resource or returns null if unable to create.
    protected abstract void
    destroy()
    This method is called at the conclusion of processing and may be used for tear-down.
    HashMap
    getConfigAttributes()
    Gets the configuration attributes that were set when the writer was created.
    org.apache.lucene.document.Document
    getDeletedDoc(org.apache.lucene.document.Document previousRecordDoc)
    Creates a Lucene Document equal to the existing FileIndexingService Document except that the field "deleted" is set to "true" and the field "modtime" is set to the current time.
    abstract String
    getDocGroup()
    Gets the specifier associated with this group of files or null if no group association exists.
    String
    getDocsource()
    Gets the absolute path to the file, which is indexed under the 'docsource' field.
    abstract String
    getDocType()
    Gets a unique document type key for this kind of record, corresponding to the format type.
    String
    getFileContent()
    Gets the full content of the file as a String.
    FileIndexingPlugin
    getFileIndexingPlugin()
    Gets the FileIndexingPlugin that has been set for use during indexing, or null if none.
    FileIndexingService
    getFileIndexingService()
    Gets the fileIndexingService attribute of the FileIndexingServiceWriter object
    org.apache.lucene.document.Document
    getLuceneDoc()
    Gets the Lucene Document that this Writer is building.
    org.apache.lucene.document.Document
    getPreviousRecordDoc()
    Gets the previous Document that currently resides in the index for the given resource, or null if none was previously present.
    abstract String
    getReaderClass()
    Gets the fully qualified name of the concrete DocReader class that is used to read this type of Document, for example "org.dlese.dpc.index.reader.ItemDocReader".
    HashMap
    getSessionAttributes()
    Gets a Map of attributes used in a single indexing session.
    File
    getSourceDir()
    Gets the sourceDir that holds the file being indexed.
    File
    getSourceFile()
    Gets the sourceFile that is being indexed.
    protected String
    getValidationReport()
    Gets a report detailing any errors found in the validation of the file, or null if no error was found.
    abstract void
    init(File source, org.apache.lucene.document.Document previousRecordDoc)
    This method is called prior to processing and may be used for any necessary set-up.
    protected final boolean
    isMakingDeletedDoc()
    True if a deleted doc is being created in the current execution.
    boolean
    isValidationEnabled()
    Returns true if the files being indexed should be validated, otherwise false.
    protected final void
    prtln(String s)
    Output a line of text to standard out, with datestamp, if debug is set to true.
    protected final void
    prtlnErr(String s)
    Output a line of text to error out, with datestamp.
    void
    setConfigAttributes(HashMap attributes)
    Sets the configuration attributes - called by the factory method that creates the FileIndexingServiceWriter.
    static final void
    setDebug(boolean db)
    Sets the debug attribute of the FileIndexingServiceWriter object
    void
    setFileIndexingPlugin(FileIndexingPlugin plugin)
    Sets the FileIndexingPlugin that will be used during the indexing process to index additional fields.
    void
    setFileIndexingService(FileIndexingService fileIndexingService)
    Sets the fileIndexingService attribute of the FileIndexingServiceWriter object
    protected void
    setIsMakingDeletedDoc(boolean isMakingDeletedDoc)
    Sets whether this DocWriter is making a deleted document.
    void
    setValidationEnabled(boolean validateFiles)
    Sets whether or not to validate the files being indexed and create a validation report, which is indexed.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • FileIndexingServiceWriter

      public FileIndexingServiceWriter()
  • Method Details

    • getDocType

      public abstract String getDocType() throws Exception
      Gets a unique document type key for this kind of record, corresponding to the format type. In the DLESE metadata repository, this corresponds to the XML format, for example "oai_dc," "adn," "dlese_ims," or "dlese_anno". The string is parsed using the Lucene StandardAnalyzer so it must be lowercase and should not contain any stop words.
      Specified by:
      getDocType in interface DocWriter
      Returns:
      The docType String
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
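The lowercase and stop-word constraints on the doc type key can be checked before a new concrete writer is put into use. The helper below is a minimal sketch: the class and method names are hypothetical, the stop-word set is a small illustrative sample rather than the analyzer's actual list, and splitting on underscores is only an approximation of how the analyzer tokenizes the key.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class DocTypeKeyCheck {

    // Illustrative sample of common English stop words; NOT the analyzer's actual list.
    static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "as", "at", "for", "in", "of", "the", "to"));

    // True if the key is non-empty, all lowercase, and no '_'-separated part
    // is one of the sample stop words.
    static boolean looksLikeValidDocType(String key) {
        if (key == null || key.isEmpty()) {
            return false;
        }
        if (!key.equals(key.toLowerCase(Locale.ROOT))) {
            return false;
        }
        for (String part : key.split("_")) {
            if (SAMPLE_STOP_WORDS.contains(part)) {
                return false;
            }
        }
        return true;
    }
}
```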
    • getDocGroup

      public abstract String getDocGroup() throws Exception
      Gets the specifier associated with this group of files or null if no group association exists. In the DLESE metadata repository, this corresponds to the collection key, for example 'dcc', 'comet'.
      Returns:
      The docGroup specifier
      Throws:
      Exception - If an error occurred
    • getReaderClass

      public abstract String getReaderClass()
      Gets the fully qualified name of the concrete DocReader class that is used to read this type of Document, for example "org.dlese.dpc.index.reader.ItemDocReader".
      Specified by:
      getReaderClass in interface DocWriter
      Returns:
      The name of the DocReader.
    • init

      public abstract void init(File source, org.apache.lucene.document.Document previousRecordDoc) throws Exception
      This method is called prior to processing and may be used for any necessary set-up. This method should throw an exception with an appropriate message if an error occurs. The config attributes are set using the FileIndexingService.addDirectory(java.lang.String, java.lang.Class, java.util.HashMap, org.dlese.dpc.index.writer.FileIndexingPlugin, int) method.
      Parameters:
      source - The source file being indexed
      previousRecordDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
      Throws:
      Exception - If an error occurred during set-up.
    • destroy

      protected abstract void destroy()
      This method is called at the conclusion of processing and may be used for tear-down.
    • addCustomFields

      protected abstract void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document previousRecordDoc, File sourceFile) throws Exception
      Adds additional custom fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.

      The following Lucene Field types are available for indexing with the Document:
      Field.Text(String name, String value) -- tokenized, indexed, stored
      Field.UnStored(String name, String value) -- tokenized, indexed, not stored
      Field.Keyword(String name, String value) -- not tokenized, indexed, stored
      Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
      Field(String name, String string, boolean store, boolean index, boolean tokenize) -- gives full control over storing, indexing, and tokenizing

      Example code:
      // Requires: java.io.File, org.apache.lucene.document.Document, org.apache.lucene.document.Field
      protected void addCustomFields(Document newDoc, Document previousRecordDoc, File sourceFile) throws Exception {
        String customContent = "Some content";
        // Tokenized, indexed, and stored
        newDoc.add(Field.Text("mycustomfield", customContent));
      }

      Parameters:
      newDoc - The new Document that is being created for this resource
      previousRecordDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
      sourceFile - The sourceFile that is being indexed
      Throws:
      Exception - This method should throw an Exception with an appropriate error message if an error occurs.
    • getFileContent

      public String getFileContent() throws IOException
      Gets the full content of the file as a String. If the file does not exist or the writer is processing a deleted doc, the content is pulled from the existing Lucene Document rather than the file.
      Returns:
      The full content of the file
      Throws:
      IOException - If an error occurs while reading the file
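The fallback behavior described for getFileContent() can be pictured with the sketch below. It is not the class's implementation: the class name, the method signature, the storedContent parameter (standing in for the 'filecontent' value of the existing Lucene Document), and the UTF-8 decoding are all assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class FileContentSketch {

    // Reads the file when it exists and no deleted doc is being built;
    // otherwise falls back to the content previously stored in the index.
    static String fileContent(File sourceFile, boolean makingDeletedDoc, String storedContent)
            throws IOException {
        if (makingDeletedDoc || sourceFile == null || !sourceFile.exists()) {
            return storedContent;
        }
        // UTF-8 is assumed here; the real writer may detect or configure the encoding.
        return new String(Files.readAllBytes(sourceFile.toPath()), StandardCharsets.UTF_8);
    }
}
```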
    • getConfigAttributes

      public HashMap getConfigAttributes()
      Gets the configuration attributes that were set when the writer was created.
      Returns:
      The configuration attributes, or null if none were configured
    • setConfigAttributes

      public void setConfigAttributes(HashMap attributes)
      Sets the configuration attributes - called by the factory method that creates the FileIndexingServiceWriter.
      Parameters:
      attributes - The configuration attributes
    • getSessionAttributes

      public HashMap getSessionAttributes()
      Gets a Map of attributes used in a single indexing session. A session is a portion of indexing for a given directory of records that will be added to the index as a block update. Since records are added to the index at the end of the session, the index cannot be used to query information from those records during the session. Thus, these attributes can be used to communicate information across records being indexed within a given session, such as the record IDs found so far in the session. The attributes are cleared at the end of each session.
      Returns:
      A Map of session attributes (for example, record ID keys), or null
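One way to use the session attributes is to track the record IDs seen so far, so that a duplicate within the same session can be detected even though the index has not yet been updated. The sketch below assumes a hypothetical attribute key ("seenRecordIds") and helper class; the keys actually used by concrete writers are not documented here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

class SessionIdTracker {

    // Hypothetical attribute key chosen for this example.
    static final String SEEN_IDS_KEY = "seenRecordIds";

    // Records the ID in the session map and reports whether it had
    // already been seen earlier in the same session.
    @SuppressWarnings("unchecked")
    static boolean isDuplicateInSession(HashMap<String, Object> sessionAttr, String recordId) {
        Set<String> seen = (Set<String>) sessionAttr.get(SEEN_IDS_KEY);
        if (seen == null) {
            seen = new HashSet<>();
            sessionAttr.put(SEEN_IDS_KEY, seen);
        }
        // Set.add returns false when the element was already present.
        return !seen.add(recordId);
    }
}
```

Because the attributes are cleared at the end of each session, a check like this only detects duplicates within one block update, not across sessions.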
    • getSourceFile

      public File getSourceFile()
      Gets the sourceFile that is being indexed. Only available after create() has been called.
      Returns:
      The sourceFile value
    • getDocsource

      public String getDocsource()
      Gets the absolute path to the file, which is indexed under the 'docsource' field.
      Returns:
      The absolute path to the file
    • getSourceDir

      public File getSourceDir()
      Gets the sourceDir that holds the file being indexed. Only available after create() has been called.
      Returns:
      The sourceDir value
    • getLuceneDoc

      public org.apache.lucene.document.Document getLuceneDoc()
      Gets the Lucene Document that this Writer is building.
      Returns:
      The Lucene Document
    • getPreviousRecordDoc

      public org.apache.lucene.document.Document getPreviousRecordDoc()
      Gets the previous Document that currently resides in the index for the given resource, or null if none was previously present.
      Returns:
      The previousRecordDoc value
    • setFileIndexingService

      public void setFileIndexingService(FileIndexingService fileIndexingService)
      Sets the fileIndexingService attribute of the FileIndexingServiceWriter object
      Parameters:
      fileIndexingService - The new fileIndexingService.
    • getFileIndexingService

      public FileIndexingService getFileIndexingService()
      Gets the fileIndexingService attribute of the FileIndexingServiceWriter object
      Returns:
      The fileIndexingService.
    • isValidationEnabled

      public boolean isValidationEnabled()
      Returns true if the files being indexed should be validated, otherwise false. This method may be ignored by concrete classes if not needed.
      Returns:
      true if validation is enabled.
    • setValidationEnabled

      public void setValidationEnabled(boolean validateFiles)
      Sets whether or not to validate the files being indexed and create a validation report, which is indexed. This value is set by the FileIndexingService prior to indexing. If true, the method getValidationReport() will be called, otherwise it will not.
      Parameters:
      validateFiles - True to validate, else false.
    • getValidationReport

      protected String getValidationReport() throws Exception
      Gets a report detailing any errors found in the validation of the file, or null if no error was found. This method should be overridden by concrete classes that need to validate the underlying file before indexing. Otherwise, this default method will simply return null. This method is called after all other method calls.
      Returns:
      Null if no file validation errors were found, otherwise a String that details the nature of the error.
      Throws:
      Exception - If an error occurs.
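A concrete writer that overrides getValidationReport() needs to return null for a valid file and a descriptive String otherwise. The sketch below shows that contract using an XML well-formedness check from the standard JAXP API. The class name, the String-based signature, and the choice of well-formedness as the validation rule are assumptions; a real writer might validate against a schema instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class WellFormednessReport {

    // Returns null when the XML parses cleanly, otherwise a short
    // description of the first problem encountered.
    static String validationReport(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the parser print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* warnings are not reported */ }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (Exception e) {
            return "Not well-formed: " + e.getMessage();
        }
    }
}
```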
    • addToDefaultField

      protected void addToDefaultField(String value)
      Adds the given String to the 'default' and 'stems' fields as text and stemmed text, respectively. The default and stems fields may be used in queries to quickly search for text across fields. This method should be called from the addCustomFields of implementing classes.
      Parameters:
      value - A text string to be added to the indexed fields named 'default' and 'stems'
    • addToAdminDefaultField

      protected void addToAdminDefaultField(String value)
      Adds the given String to a text field referenced in the index by the field name 'admindefault'. The admindefault field may be used in queries to quickly search for text across fields. This method should be called from the addCustomFields of implementing classes.
      Parameters:
      value - A text string to be added to the indexed field named 'admindefault.'
    • getDeletedDoc

      public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document previousRecordDoc) throws Throwable
      Creates a Lucene Document equal to the existing FileIndexingService Document except that the field "deleted" is set to "true" and the field "modtime" is set to the current time.

      Design note: This method should be overridden by subclasses that require more involved logic for deletes. The overriding method should call this super method first and then check isMakingDeletedDoc() to execute as appropriate.
      Parameters:
      previousRecordDoc - An existing FileIndexingService Document that currently resides in the index for the given file
      Returns:
      A Lucene FileIndexingService Document with appropriate fields updated
      Throws:
      Throwable - Thrown if error occurs
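The effect of getDeletedDoc on the fields can be sketched with a plain Map standing in for the Lucene Document. The class and method names are hypothetical, and encoding modtime as a decimal millisecond string is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DeletedDocSketch {

    // Copies the previous fields, then marks the copy as deleted and
    // replaces modtime with the supplied time. The input map is not modified.
    static Map<String, String> deletedCopy(Map<String, String> previousFields, long nowMillis) {
        Map<String, String> copy = new LinkedHashMap<>(previousFields);
        copy.put("deleted", "true");
        copy.put("modtime", Long.toString(nowMillis));
        return copy;
    }
}
```

Passing the time in as a parameter, for example deletedCopy(previous, System.currentTimeMillis()), keeps the helper deterministic and easy to test.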
    • setIsMakingDeletedDoc

      protected void setIsMakingDeletedDoc(boolean isMakingDeletedDoc)
      Sets whether this DocWriter is making a deleted document. Used by subclasses that create a DocWriter in their getDeletedDoc(org.apache.lucene.document.Document) method.
      Parameters:
      isMakingDeletedDoc - Sets the making deleted doc status
    • isMakingDeletedDoc

      protected final boolean isMakingDeletedDoc()
      True if a deleted doc is being created in the current execution.
      Returns:
      True if a deleted doc is being created
    • abortIndexing

      protected void abortIndexing()
      Aborts the indexing process by returning a null index document.
    • addDocToRemove

      protected void addDocToRemove(String field, String value)
      Removes a matching item from the index during the FileIndexingService update. This method should be called to instruct the indexer to remove documents that should no longer be in the index.
      Parameters:
      field - The field to search in.
      value - The matching value for the item to remove.
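The deferred nature of addDocToRemove (the writer queues a request and the service performs the removal during its update) can be pictured with the sketch below. Everything here is a stand-in: the class, the in-memory List of Maps representing the index, and the exact-match comparison are assumptions, not the FileIndexingService's actual mechanism.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class RemovalQueueSketch {

    // Field/value pairs queued for removal during the next update.
    private final List<String[]> pending = new ArrayList<>();

    // Queues a removal request; nothing is removed until applyTo is called.
    void addDocToRemove(String field, String value) {
        pending.add(new String[] {field, value});
    }

    // Removes every document whose field exactly matches a queued value,
    // clears the queue, and returns the number of documents removed.
    int applyTo(List<Map<String, String>> index) {
        int removed = 0;
        for (Iterator<Map<String, String>> it = index.iterator(); it.hasNext();) {
            Map<String, String> doc = it.next();
            for (String[] pair : pending) {
                if (pair[1].equals(doc.get(pair[0]))) {
                    it.remove();
                    removed++;
                    break;
                }
            }
        }
        pending.clear();
        return removed;
    }
}
```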
    • create

      public FileIndexingServiceData create(File sourceFile, org.apache.lucene.document.Document existingLuceneDoc, FileIndexingPlugin plugin, HashMap sessionAttr) throws Throwable
      Creates the Lucene Document for the given resource or returns null if unable to create. This method is called by class FileIndexingService.
      Parameters:
      sourceFile - The source file to be indexed
      existingLuceneDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
      plugin - The FileIndexingPlugin being used, or null
      sessionAttr - Attributes used in a given indexing session
      Returns:
      A Lucene Document with its fields populated, or null.
      Throws:
      Throwable - Thrown if error occurs
    • setFileIndexingPlugin

      public void setFileIndexingPlugin(FileIndexingPlugin plugin)
      Sets the FileIndexingPlugin that will be used during the indexing process to index additional fields. Set to null to remove.
      Parameters:
      plugin - A FileIndexingPlugin to use during indexing.
    • getFileIndexingPlugin

      public FileIndexingPlugin getFileIndexingPlugin()
      Gets the FileIndexingPlugin that has been set for use during indexing, or null if none.
      Returns:
      The FileIndexingPlugin configured for use, or null.
    • prtlnErr

      protected final void prtlnErr(String s)
      Output a line of text to error out, with datestamp.
      Parameters:
      s - The text that will be output to error out.
    • prtln

      protected final void prtln(String s)
      Output a line of text to standard out, with datestamp, if debug is set to true.
      Parameters:
      s - The String that will be output.
    • setDebug

      public static final void setDebug(boolean db)
      Sets the debug attribute of the FileIndexingServiceWriter object
      Parameters:
      db - The new debug value
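The relationship between prtln, prtlnErr, and the static setDebug can be sketched as follows. Because setDebug is static, the flag is shared by every writer instance. The class name, the stamp helper, and the date format are assumptions; the actual output format of this class is not documented here.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

class DebugPrinter {

    // Shared by all callers, mirroring the static setDebug described above.
    private static boolean debug = false;

    static void setDebug(boolean db) {
        debug = db;
    }

    // Prefixes the message with a datestamp; the format is an assumed one.
    static String stamp(String s) {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()) + " " + s;
    }

    // Standard out, only when debug is enabled.
    static void prtln(String s) {
        if (debug) {
            System.out.println(stamp(s));
        }
    }

    // Error out, regardless of the debug flag.
    static void prtlnErr(String s) {
        System.err.println(stamp(s));
    }
}
```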