Class FileIndexingServiceWriter
- All Implemented Interfaces:
DocWriter
- Direct Known Subclasses:
ErrorFileIndexingWriter, XMLFileIndexingWriter
Abstract class for creating Lucene Documents for different
file formats such as DLESE-IMS, ADN-item, ADN-collection, etc. Concrete sub-classes may be used with a
FileIndexingService to enable automatic updating of the index whenever changes
in the source file are made. This class, along with the FileIndexingService,
may be used with a SimpleLuceneIndex to provide simple search support over
files.
Note: after creating a new concrete FileIndexingServiceWriter, add a switch to the putDirInIndex(DirInfo, String) method of RepositoryManager to select it for
indexing.
The Lucene fields that are created by this class are:
- doctype - The document format type (e.g. dlese_ims, adn, oai_dc, etc.) defined by concrete classes, with '0' appended to support wildcard searching.
- readerclass - The class that is used to read typed Documents created by the concrete classes, for example "ItemDocReader".
- default - The default field containing content added by concrete classes. Generally this is the field assigned in the Lucene index for default searching.
- docsource - The absolute path to the file, which is used by the FileIndexingService for updating/deleting and may be used by beans or other classes that wish to have access to the source file.
- docdir - The absolute path to the directory where the file resides, which is used by the FileIndexingService for updating/deleting and may be used by beans or other classes.
- modtime - The file modification time, which is used by the FileIndexingService to determine whether the file has changed and needs an update, and may be used by beans or other classes that wish to query the modtime for the record.
- filecontent - The full content of the file, stored but not indexed.
- deleted - Set to 'true' if the file or record for this document has been deleted, otherwise this field does not exist. Stored.
- valid - Set to 'true' if the file or record for this document is valid, otherwise 'false'. This field may also be omitted. Not stored.
- validationreport - Contains a report that provides validation information about the underlying file. This field may be omitted. Not stored.
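The modtime comparison described above can be sketched as follows. This is a minimal illustration, assuming the stored modtime is the file's last-modified time in epoch milliseconds kept as a decimal string; the encoding actually used by FileIndexingService is not stated here, and ModtimeCheck and needsUpdate are hypothetical names.

```java
import java.io.File;

// Hypothetical helper; not part of the documented API.
class ModtimeCheck {

    /**
     * Returns true when a file should be (re)indexed: either no modtime was
     * stored for it, or the stored value differs from the current one.
     * Assumption: storedModtime is epoch milliseconds as a decimal string.
     */
    static boolean needsUpdate(long currentModtime, String storedModtime) {
        if (storedModtime == null) {
            return true; // no previous index entry
        }
        return currentModtime != Long.parseLong(storedModtime);
    }

    public static void main(String[] args) {
        File sourceFile = new File(args.length > 0 ? args[0] : ".");
        long current = sourceFile.lastModified();
        System.out.println(needsUpdate(current, Long.toString(current))); // false
        System.out.println(needsUpdate(current, null));                   // true
    }
}
```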
- Author:
- John Weatherley
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
protected void | abortIndexing() | Aborts the indexing process by returning a null index document.
protected abstract void | addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document previousRecordDoc, File sourceFile) | Adds additional custom fields that are unique to the document format being indexed.
protected void | addDocToRemove(String field, String value) | Removes a matching item from the index during the FileIndexingService update.
protected void | addToAdminDefaultField(String value) | Adds the given String to a text field referenced in the index by the field name 'admindefault'.
protected void | addToDefaultField(String value) | Adds the given String to the 'default' and 'stems' fields as text and stemmed text, respectively.
FileIndexingServiceData | create(File sourceFile, org.apache.lucene.document.Document existingLuceneDoc, FileIndexingPlugin plugin, HashMap sessionAttr) | Creates the Lucene Document for the given resource or returns null if unable to create.
protected abstract void | destroy() | This method is called at the conclusion of processing and may be used for tear-down.
 | getConfigAttributes | Gets the configuration attributes that were set when the writer was created.
org.apache.lucene.document.Document | getDeletedDoc(org.apache.lucene.document.Document previousRecordDoc) | Creates a Lucene Document equal to the existing FileIndexingService Document, except that the field "deleted" is set to "true" and the field "modtime" is set to the current time.
abstract String | getDocGroup | Gets the specifier associated with this group of files or null if no group association exists.
 | getDocsource | Gets the absolute path to the file, which is indexed under the 'docsource' field.
abstract String | getDocType | Gets a unique document type key for this kind of record, corresponding to the format type.
 | getFileContent | Gets the full content of the file as a String.
 | getFileIndexingPlugin | Gets the FileIndexingPlugin that has been set for use during indexing, or null if none.
 | getFileIndexingService | Gets the fileIndexingService attribute of the FileIndexingServiceWriter object.
org.apache.lucene.document.Document | getLuceneDoc() | Gets the Lucene Document that this Writer is building.
org.apache.lucene.document.Document | getPreviousRecordDoc() | Gets the previous Document that currently resides in the index for the given resource, or null if none was previously present.
abstract String | getReaderClass | Gets the fully qualified name of the concrete DocReader class that is used to read this type of Document, for example "org.dlese.dpc.index.reader.ItemDocReader".
 | getSessionAttributes | Gets a Map of attributes used in a single indexing session.
 | getSourceDir | Gets the sourceDir that holds the file being indexed.
 | getSourceFile | Gets the sourceFile that is being indexed.
protected String | getValidationReport | Gets a report detailing any errors found in the validation of the file, or null if no error was found.
abstract void | init(File source, org.apache.lucene.document.Document previousRecordDoc) | This method is called prior to processing and may be used for any necessary set-up.
protected final boolean | isMakingDeletedDoc() | True if a deleted doc is being created.
boolean | isValidationEnabled() | Returns true if the files being indexed should be validated, otherwise false.
protected final void | prtln | Outputs a line of text to standard out, with a datestamp, if debug is set to true.
protected final void | prtlnErr | Outputs a line of text to error out, with a datestamp.
void | setConfigAttributes(HashMap attributes) | Sets the configuration attributes; called by the factory method that creates the FileIndexingServiceWriter.
static final void | setDebug(boolean db) | Sets the debug attribute of the FileIndexingServiceWriter object.
void | setFileIndexingPlugin | Sets the FileIndexingPlugin that will be used during the indexing process to index additional fields.
void | setFileIndexingService(FileIndexingService fileIndexingService) | Sets the fileIndexingService attribute of the FileIndexingServiceWriter object.
protected void | setIsMakingDeletedDoc(boolean isMakingDeletedDoc) | Sets whether this DocWriter is making a deleted document.
void | setValidationEnabled(boolean validateFiles) | Sets whether or not to validate the files being indexed and create a validation report, which is indexed.
-
Constructor Details
-
FileIndexingServiceWriter
public FileIndexingServiceWriter()
-
-
Method Details
-
getDocType
Gets a unique document type key for this kind of record, corresponding to the format type. In the DLESE metadata repository, this corresponds to the XML format, for example "oai_dc," "adn," "dlese_ims," or "dlese_anno". The string is parsed using the Lucene StandardAnalyzer, so it must be lowercase and should not contain any stop words.
- Specified by:
getDocType in interface DocWriter
- Returns:
- The docType String
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getDocGroup
Gets the specifier associated with this group of files or null if no group association exists. In the DLESE metadata repository, this corresponds to the collection key, for example 'dcc', 'comet'.
- Returns:
- The docGroup specifier
- Throws:
Exception - If an error occurred
-
getReaderClass
Gets the fully qualified name of the concrete DocReader class that is used to read this type of Document, for example "org.dlese.dpc.index.reader.ItemDocReader".
- Specified by:
getReaderClass in interface DocWriter
- Returns:
- The name of the DocReader.
-
init
public abstract void init(File source, org.apache.lucene.document.Document previousRecordDoc) throws Exception
This method is called prior to processing and may be used for any necessary set-up. This method should throw an exception with an appropriate message if an error occurs. The config attributes are set using the FileIndexingService.addDirectory(java.lang.String, java.lang.Class, java.util.HashMap, org.dlese.dpc.index.writer.FileIndexingPlugin, int) method.
- Parameters:
source - The source file being indexed
previousRecordDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
- Throws:
Exception - If an error occurred during set-up.
-
destroy
protected abstract void destroy()
This method is called at the conclusion of processing and may be used for tear-down.
-
addCustomFields
protected abstract void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document previousRecordDoc, File sourceFile) throws Exception
Adds additional custom fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.
The following Lucene Field types are available for indexing with the Document:
Field.Text(String name, String value) -- tokenized, indexed, stored
Field.UnStored(String name, String value) -- tokenized, indexed, not stored
Field.Keyword(String name, String value) -- not tokenized, indexed, stored
Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
Field(String name, String string, boolean store, boolean index, boolean tokenize) -- allows full control over storing, indexing, and tokenizing
Example code:
protected void addCustomFields(Document newDoc, Document previousRecordDoc, File sourceFile) throws Exception {
    // Add one tokenized, indexed, stored field to the Document being built
    String customContent = "Some content";
    newDoc.add(Field.Text("mycustomfield", customContent));
}
- Parameters:
newDoc - The new Document that is being created for this resource
previousRecordDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
sourceFile - The sourceFile that is being indexed
- Throws:
Exception - This method should throw an Exception with an appropriate error message if an error occurs.
-
getFileContent
Gets the full content of the file as a String. If the file does not exist or the writer is processing a deleted doc, the content is pulled from the existing Lucene Document rather than the file.
- Returns:
- The full content of the file
- Throws:
IOException - If an error occurs
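The fallback rule for getFileContent can be sketched as follows. This is only an illustration under stated assumptions: storedContent stands in for the 'filecontent' value held by the existing Lucene Document, UTF-8 is an assumed encoding, and FileContentFallback is a hypothetical name rather than the class's actual implementation.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper; not part of the documented API.
class FileContentFallback {

    /**
     * Reads the content from disk when possible; otherwise falls back to the
     * content stored with the previously indexed Document (storedContent).
     */
    static String fileContent(File sourceFile, boolean makingDeletedDoc, String storedContent)
            throws IOException {
        if (makingDeletedDoc || sourceFile == null || !sourceFile.exists()) {
            return storedContent; // the file is gone or is being marked deleted
        }
        // Assumption: files are UTF-8 encoded
        return new String(Files.readAllBytes(sourceFile.toPath()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File missing = new File("no-such-record-file.xml");
        System.out.println(fileContent(missing, false, "<old/>")); // prints <old/>
    }
}
```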
-
getConfigAttributes
Gets the configuration attributes that were set when the writer was created.
- Returns:
- The configuration attributes, or null if none were configured
-
setConfigAttributes
Sets the configuration attributes; called by the factory method that creates the FileIndexingServiceWriter.
- Parameters:
attributes - The configuration attributes
-
getSessionAttributes
Gets a Map of attributes used in a single indexing session. A session is a portion of indexing for a given directory of records that will be added to the index as a block update. Since records are added to the index at the end of the session, the index cannot be used to query information from those records during the session. Thus, these attributes can be used to communicate information across records being indexed within a given session, such as the record IDs found so far in the session. The attributes are cleared at the end of each session.
- Returns:
- A Map of record ID keys, or null
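One way the session attributes might be used to track record IDs within a session is sketched below. The key name "seenRecordIds" and the SessionIdTracker class are hypothetical, and a typed Map is used here where the documented API passes a raw HashMap.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of the documented API.
class SessionIdTracker {

    // Assumed key under which the set of IDs is kept in the session attributes
    static final String SEEN_IDS_KEY = "seenRecordIds";

    /**
     * Records the ID in the session attributes.
     * Returns false if the same ID was already seen earlier in this session.
     */
    @SuppressWarnings("unchecked")
    static boolean registerId(Map<String, Object> sessionAttr, String recordId) {
        Set<String> seen = (Set<String>) sessionAttr.get(SEEN_IDS_KEY);
        if (seen == null) {
            seen = new HashSet<>();
            sessionAttr.put(SEEN_IDS_KEY, seen);
        }
        return seen.add(recordId);
    }

    public static void main(String[] args) {
        Map<String, Object> session = new HashMap<>();
        System.out.println(registerId(session, "DLESE-000-000-000-001")); // true
        System.out.println(registerId(session, "DLESE-000-000-000-001")); // false (duplicate)
    }
}
```

Because the attributes are cleared at the end of each session, a fresh Map would start with no recorded IDs.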
-
getSourceFile
Gets the sourceFile that is being indexed. Only available after create() has been called.
- Returns:
- The sourceFile value
-
getDocsource
Gets the absolute path to the file, which is indexed under the 'docsource' field.
- Returns:
- The absolute path to the file
-
getSourceDir
Gets the sourceDir that holds the file being indexed. Only available after create() has been called.
- Returns:
- The sourceDir value
-
getLuceneDoc
public org.apache.lucene.document.Document getLuceneDoc()
Gets the Lucene Document that this Writer is building.
- Returns:
- The Lucene Document
-
getPreviousRecordDoc
public org.apache.lucene.document.Document getPreviousRecordDoc()
Gets the previous Document that currently resides in the index for the given resource, or null if none was previously present.
- Returns:
- The previousRecordDoc value
-
setFileIndexingService
Sets the fileIndexingService attribute of the FileIndexingServiceWriter object.
- Parameters:
fileIndexingService - The new fileIndexingService.
-
getFileIndexingService
Gets the fileIndexingService attribute of the FileIndexingServiceWriter object.
- Returns:
- The fileIndexingService.
-
isValidationEnabled
public boolean isValidationEnabled()
Returns true if the files being indexed should be validated, otherwise false. This method may be ignored by concrete classes if not needed.
- Returns:
- true if validation is enabled.
-
setValidationEnabled
public void setValidationEnabled(boolean validateFiles)
Sets whether or not to validate the files being indexed and create a validation report, which is indexed. This value is set by the FileIndexingService prior to indexing. If true, the method getValidationReport() will be called, otherwise it will not.
- Parameters:
validateFiles - True to validate, else false.
-
getValidationReport
Gets a report detailing any errors found in the validation of the file, or null if no error was found. This method should be overridden by concrete classes that need to validate the underlying file before indexing. Otherwise, this default method will simply return null. This method is called after all other method calls.
- Returns:
- Null if no file validation errors were found, otherwise a String that details the nature of the error.
- Throws:
Exception - If an error occurs.
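The null-or-report contract described above could be met by an override along these lines. This is a sketch under assumptions: it only checks XML well-formedness with the JDK's built-in parser, not conformance to any metadata schema, and WellFormednessReport is a hypothetical stand-alone helper rather than a real subclass.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; not part of the documented API.
class WellFormednessReport {

    /**
     * Returns null when the XML parses cleanly, otherwise a String that
     * describes the first parse error (mirroring the documented contract).
     */
    static String validationReport(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to standard error
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null; // no errors found
        } catch (SAXException e) {
            return "XML is not well-formed: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validationReport("<record><title>ok</title></record>")); // null
        System.out.println(validationReport("<record><title>broken</record>"));
    }
}
```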
-
addToDefaultField
Adds the given String to the 'default' and 'stems' fields as text and stemmed text, respectively. The default and stems fields may be used in queries to quickly search for text across fields. This method should be called from the addCustomFields method of implementing classes.
- Parameters:
value - A text string to be added to the indexed fields named 'default' and 'stems'
-
addToAdminDefaultField
Adds the given String to a text field referenced in the index by the field name 'admindefault'. The default field may be used in queries to quickly search for text across fields. This method should be called from the addCustomFields method of implementing classes.
- Parameters:
value - A text string to be added to the indexed field named 'admindefault'.
-
getDeletedDoc
public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document previousRecordDoc) throws Throwable
Creates a Lucene Document equal to the existing FileIndexingService Document, except that the field "deleted" is set to "true" and the field "modtime" is set to the current time.
Design note: This method should be overridden by subclasses that require more involved logic for deletes. The overriding method should call this super method first and then check isMakingDeletedDoc() to execute as appropriate.
- Parameters:
previousRecordDoc - An existing FileIndexingService Document that currently resides in the index for the given file
- Returns:
- A Lucene FileIndexingService Document with appropriate fields updated
- Throws:
Throwable - Thrown if an error occurs
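The effect of getDeletedDoc on the field values can be illustrated with a plain Map standing in for the Lucene Document. This is a sketch, not the class's implementation: DeletedDocSketch is a hypothetical name, and storing modtime as epoch milliseconds in a decimal string is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; a Map of field name to value stands in for a Lucene Document.
class DeletedDocSketch {

    /**
     * Copies every field of the previous document, then marks the copy as
     * deleted and stamps it with the supplied time.
     */
    static Map<String, String> deletedDoc(Map<String, String> previousRecordDoc, long now) {
        Map<String, String> copy = new LinkedHashMap<>(previousRecordDoc);
        copy.put("deleted", "true");
        copy.put("modtime", Long.toString(now)); // assumed encoding
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> previous = new LinkedHashMap<>();
        previous.put("doctype", "adn");
        previous.put("modtime", "1000");
        Map<String, String> deleted = deletedDoc(previous, System.currentTimeMillis());
        System.out.println(deleted.get("deleted")); // true
        System.out.println(deleted.get("doctype")); // adn
    }
}
```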
-
setIsMakingDeletedDoc
protected void setIsMakingDeletedDoc(boolean isMakingDeletedDoc)
Sets whether this DocWriter is making a deleted document. Used by subclasses that create a DocWriter in their getDeletedDoc(org.apache.lucene.document.Document) method.
- Parameters:
isMakingDeletedDoc - Sets the making deleted doc status
-
isMakingDeletedDoc
protected final boolean isMakingDeletedDoc()
True if a deleted doc is being created in the current execution.
- Returns:
- True if a deleted doc is being created
-
abortIndexing
protected void abortIndexing()
Aborts the indexing process by returning a null index document.
-
addDocToRemove
Removes a matching item from the index during the FileIndexingService update. This method should be called to instruct the indexer to remove documents that should no longer be in the index.
- Parameters:
field - The field to search in.
value - The matching value for the item to remove.
-
create
public FileIndexingServiceData create(File sourceFile, org.apache.lucene.document.Document existingLuceneDoc, FileIndexingPlugin plugin, HashMap sessionAttr) throws Throwable
Creates the Lucene Document for the given resource or returns null if unable to create. This method is called by class FileIndexingService.
- Parameters:
sourceFile - The source file to be indexed
existingLuceneDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
plugin - The FileIndexingPlugin being used, or null
sessionAttr - Attributes used in a given indexing session
- Returns:
- A Lucene Document with its fields populated, or null.
- Throws:
Throwable - Thrown if an error occurs
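How create might tie together init, addCustomFields, destroy, and abortIndexing is sketched below as a template method. The call order shown is an assumption, not confirmed by this page; the real create takes four parameters and returns FileIndexingServiceData, whereas this simplified stand-in takes a path String and returns a Map of fields. WriterSketch and SkippingWriter are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the abstract writer.
abstract class WriterSketch {
    private boolean aborted;

    protected abstract void init(String source);
    protected abstract void addCustomFields(Map<String, String> newDoc);
    protected abstract void destroy();

    // Subclasses call this to make create() return null
    protected void abortIndexing() {
        aborted = true;
    }

    // Assumed order: set-up, standard fields, custom fields, tear-down
    final Map<String, String> create(String source) {
        aborted = false;
        Map<String, String> doc = new LinkedHashMap<>();
        init(source);
        try {
            doc.put("docsource", source);
            addCustomFields(doc);
        } finally {
            destroy(); // tear-down runs even if a custom field fails
        }
        return aborted ? null : doc;
    }
}

// Example concrete writer that refuses to index temporary files
class SkippingWriter extends WriterSketch {
    @Override
    protected void init(String source) { }

    @Override
    protected void addCustomFields(Map<String, String> newDoc) {
        if (newDoc.get("docsource").endsWith(".tmp")) {
            abortIndexing();
            return;
        }
        newDoc.put("title", "example");
    }

    @Override
    protected void destroy() { }

    public static void main(String[] args) {
        System.out.println(new SkippingWriter().create("records/a.xml")); // {docsource=records/a.xml, title=example}
        System.out.println(new SkippingWriter().create("records/a.tmp")); // null
    }
}
```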
-
setFileIndexingPlugin
Sets the FileIndexingPlugin that will be used during the indexing process to index additional fields. Set to null to remove.
- Parameters:
plugin - A FileIndexingPlugin to use during indexing.
-
getFileIndexingPlugin
Gets the FileIndexingPlugin that has been set for use during indexing, or null if none.
- Returns:
- The FileIndexingPlugin configured for use, or null.
-
prtlnErr
Outputs a line of text to error out, with a datestamp.
- Parameters:
s - The text that will be output to error out.
-
prtln
Outputs a line of text to standard out, with a datestamp, if debug is set to true.
- Parameters:
s - The String that will be output.
-
setDebug
public static final void setDebug(boolean db)
Sets the debug attribute of the FileIndexingServiceWriter object.
- Parameters:
db - The new debug value
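The interplay of setDebug, prtln, and prtlnErr can be sketched as follows. DebugLog is a hypothetical stand-in, and the datestamp pattern is an assumption, since the format the class actually uses is not documented here.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical stand-in for the debug output helpers.
class DebugLog {
    private static boolean debug = false;

    static void setDebug(boolean db) {
        debug = db;
    }

    // Assumed datestamp pattern
    private static String stamp(String s) {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date()) + " " + s;
    }

    // Printed to standard out only when debug is true
    static void prtln(String s) {
        if (debug) {
            System.out.println(stamp(s));
        }
    }

    // Always printed to standard error
    static void prtlnErr(String s) {
        System.err.println(stamp(s));
    }

    public static void main(String[] args) {
        prtln("not shown while debug is false");
        setDebug(true);
        prtln("shown once debug is true");
        prtlnErr("errors are always shown");
    }
}
```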
-