java.lang.Object

org.dlese.dpc.index.writer.FileIndexingServiceWriter

org.dlese.dpc.index.writer.XMLFileIndexingWriter

All Implemented Interfaces: DocWriter

Direct Known Subclasses: DleseAnnoFileIndexingServiceWriter, DleseCollectionFileIndexingWriter, ItemFileIndexingWriter, NCSCollectionFileIndexingWriter, NewsOppsFileIndexingWriter, SimpleXMLFileIndexingWriter

public abstract class XMLFileIndexingWriter extends FileIndexingServiceWriter

Creates a Lucene Document from any XML file by stripping the XML tags to extract and index the content. The reader for this type of Document is XMLDocReader.

The Lucene Document fields created by this class, in addition to the ones listed for FileIndexingServiceWriter, are:

collection - The collection associated with this resource.

Author:

John Weatherley

Constructor Summary

Constructors

Constructor

Description

XMLFileIndexingWriter()

Constructor for the XMLFileIndexingWriter.
Method Summary

Modifier and Type

Method

Description

protected abstract String[]

_getIds()

Return unique IDs for the item being indexed, one for each collection that catalogs the resource.

protected void

addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)

Adds the full content of the XML to the default search field.

protected abstract void

addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)

Adds additional fields that are unique to the document format being indexed.

protected BoundingBox

getBoundingBox()

Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies.

protected String[]

getCollections()

Returns unique collection keys for the item being indexed.

org.apache.lucene.document.Document

getDeletedDoc(org.apache.lucene.document.Document existingDoc)

Creates a Lucene Document for the XML that is equal to the existing Document.

abstract String

getDescription()

Return a description for the document being indexed, or null if none applies.

String

getDocGroup()

Gets the collection specifier, for example 'dcc', 'comet'.

protected org.dom4j.Document

getDom4jDoc()

Gets the dom4j Document for use by sub-classes

protected String

getFieldContent(String[] values, String useVocabMapping, String metadataFormat)

Gets the vocab encoded keys for the given values, separated by the '+' symbol.

protected String

getFieldContent(String value, String useVocabMapping, String metadataFormat)

Gets the encoded vocab key for the given content.

protected String

getFieldName(String vocabFieldString, String metadataFormat)

Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'.

String[]

getIds()

Returns the ids for the item being indexed.

protected SimpleLuceneIndex

getIndex()

Gets the index used by this XML File Indexer

protected ResultDocList

getMyAnnoResultDocs()

Gets the annotations for this record, null or zero length if none available.

protected DleseCollectionDocReader

getMyCollectionDoc()

Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.

static final String

getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)

Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.

String

getPrimaryId()

Returns the unique primary record ID for the item being indexed.

protected RecordDataService

getRecordDataService()

Gets the recordDataService used by this XML File Indexer

List

getRelatedIds()

Gets the ids of related records.

Map

getRelatedIdsMap()

Gets the ids of related records.

List

getRelatedUrls()

Gets the urls of related records.

Map

getRelatedUrlsMap()

Gets the urls of related records.

protected String

getTermStringFromStringArray(String[] vals)

Gets the appropriate terms from a string array of metadata fields.

abstract String

getTitle()

Return a title for the document being indexed, or null if none applies.

abstract String[]

getUrls()

Return the URL(s) to the resource being indexed, or null if none apply.

protected abstract Date

getWhatsNewDate()

Returns the date used to determine "What's new" in the library, or null if none is available.

protected abstract String

getWhatsNewType()

Returns the type of category for "What's new" in the library, or null if none is available.

protected XMLIndexer

getXmlIndexer()

Gets the XMLIndexer for use by sub-classes

protected XMLIndexerFieldsConfig

getXmlIndexerFieldsConfig()

Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.

abstract boolean

indexFullContentInDefaultAndStems()

Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class.

abstract void

init(File source, org.apache.lucene.document.Document existingDoc)

This method is called prior to processing and may be used for any necessary set-up.

Methods inherited from class org.dlese.dpc.index.writer.FileIndexingServiceWriter
abortIndexing, addDocToRemove, addToAdminDefaultField, addToDefaultField, create, destroy, getConfigAttributes, getDocsource, getDocType, getFileContent, getFileIndexingPlugin, getFileIndexingService, getLuceneDoc, getPreviousRecordDoc, getReaderClass, getSessionAttributes, getSourceDir, getSourceFile, getValidationReport, isMakingDeletedDoc, isValidationEnabled, prtln, prtlnErr, setConfigAttributes, setDebug, setFileIndexingPlugin, setFileIndexingService, setIsMakingDeletedDoc, setValidationEnabled

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- XMLFileIndexingWriter
  
  public XMLFileIndexingWriter()
  
  Constructor for the XMLFileIndexingWriter.
Method Details
- getIds
  
  public String[] getIds() throws Exception
  
  Returns the ids for the item being indexed. If more than one record catalogs the same item, the first id is the primary ID.
  Returns:
  
  The id Strings
  
  Throws:
  
  Exception - If error
  
  See Also:
  
  getIds()
- getPrimaryId
  
  public String getPrimaryId() throws Exception
  
  Returns the unique primary record ID for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.
  Returns:
  
  The id String
  
  Throws:
  
  Exception - If error
  
  See Also:
  
  getIds()
- getRelatedIds
  
  public List getRelatedIds() throws IllegalStateException, Exception
  
  Gets the ids of related records.
  
  Returns:
  
  The related ids value, or null if none
  
  Throws:
  
  IllegalStateException - If called prior to calling method #indexFields
  
  Exception - If error
- getRelatedUrls
  
  public List getRelatedUrls() throws IllegalStateException, Exception
  
  Gets the urls of related records.
  
  Returns:
  
  The related urls value, or null if none
  
  Throws:
  
  IllegalStateException - If called prior to calling method #indexFields
  
  Exception - If error
- getRelatedIdsMap
  
  public Map getRelatedIdsMap() throws IllegalStateException, Exception
  
  Gets the ids of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the ids of the target records.
  
  Returns:
  
  The related ids value, or null if none
  
  Throws:
  
  IllegalStateException - If called prior to calling method #indexFields
  
  Exception - If error
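As a rough illustration of the Map shape described above, the sketch below collects the target ids for one relationship. The class `RelatedIdsSketch`, the helper `targetsFor`, and the sample ids are hypothetical and not part of this API. The real method returns a raw `Map` and may return null, so a caller would likely need to null-check and cast along these lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the DLESE API.
class RelatedIdsSketch {

    // Returns the target ids for one relationship, or an empty list when the
    // map is null (no related records) or the relationship is absent.
    static List<String> targetsFor(Map<?, ?> relatedIdsMap, String relationship) {
        if (relatedIdsMap == null) {
            return Collections.emptyList();
        }
        Object value = relatedIdsMap.get(relationship);
        if (!(value instanceof List)) {
            return Collections.emptyList();
        }
        List<String> ids = new ArrayList<>();
        for (Object id : (List<?>) value) {
            ids.add(String.valueOf(id));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample data is invented; real ids come from the index.
        Map<String, List<String>> sample = new LinkedHashMap<>();
        sample.put("isAnnotatedBy", Arrays.asList("ANNO-001", "ANNO-002"));
        System.out.println(targetsFor(sample, "isAnnotatedBy"));
    }
}
```

The same pattern would apply to the Map returned by getRelatedUrlsMap, with urls in place of ids.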
- getRelatedUrlsMap
  
  public Map getRelatedUrlsMap() throws IllegalStateException, Exception
  
  Gets the urls of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the urls of the target records.
  
  Returns:
  
  The related urls value, or null if none
  
  Throws:
  
  IllegalStateException - If called prior to calling method #indexFields
  
  Exception - If error
- getCollections
  
  protected String[] getCollections() throws Exception
  
  Returns unique collection keys for the item being indexed. For example "dcc" (single collection) or "dcc dwel" (multiple collections). If more than one collection is provided, the first one must be the primary collection. May be overridden by sub-classes as appropriate (overridden by ADNFileIndexingWriter).
  
  Returns:
  
  The collection keys
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- getDocGroup
  
  public String getDocGroup() throws Exception
  
  Gets the collection specifier, for example 'dcc', 'comet'.
  
  Specified by:
  
  getDocGroup in class FileIndexingServiceWriter
  
  Returns:
  
  The collection specifier
  
  Throws:
  
  Exception - If error occurred
- getBoundingBox
  
  protected BoundingBox getBoundingBox() throws Exception
  
  Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies. Override if necessary.
  
  Returns:
  
  BoundingBox, or null
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- init
  
  public abstract void init(File source, org.apache.lucene.document.Document existingDoc) throws Exception
  
  This method is called prior to processing and may be used for any necessary set-up. This method should throw an Exception with appropriate message if an error occurs.
  
  Specified by:
  
  init in class FileIndexingServiceWriter
  
  Parameters:
  
  source - The source file being indexed
  
  existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
  
  Throws:
  
  Exception - If an error occurred during set-up.
- _getIds
  
  protected abstract String[] _getIds() throws Exception
  
  Return unique IDs for the item being indexed, one for each collection that catalogs the resource. For example "DLESE-000-000-000-001" (single ID) or "DLESE-000-000-000-036 COMET-60" (multiple IDs). If more than one ID is present, the first one is the primary.
  
  Returns:
  
  The id(s)
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
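The "first one is the primary" convention above can be sketched as follows. This is only an assumption about how a primary id might be derived from the id array; the class `PrimaryIdSketch` and the method `primaryId` are hypothetical, and returning null for an empty array is a choice made for this example rather than documented behavior.

```java
// Hypothetical helper for illustration only; not part of the DLESE API.
class PrimaryIdSketch {

    // Picks the first id as the primary one, or null when no ids are given.
    static String primaryId(String[] ids) {
        if (ids == null || ids.length == 0) {
            return null;
        }
        return ids[0];
    }

    public static void main(String[] args) {
        // Ids taken from the example in the description above.
        String[] ids = {"DLESE-000-000-000-036", "COMET-60"};
        System.out.println(primaryId(ids)); // prints "DLESE-000-000-000-036"
    }
}
```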
- getTitle
  
  public abstract String getTitle() throws Exception
  
  Return a title for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'title' and is also indexed in the 'default' field.
  
  Returns:
  
  The title String
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- getDescription
  
  public abstract String getDescription() throws Exception
  
  Return a description for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'description' and is also indexed in the 'default' field.
  
  Returns:
  
  The description String
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- getUrls
  
  public abstract String[] getUrls() throws Exception
  
  Return the URL(s) to the resource being indexed, or null if none apply. If more than one URL references the resource, the first one is the primary. The URL Strings are tokenized and indexed under the field key 'uri' and are also indexed in the 'default' field. They are also stored in the index untokenized under the field key 'url'.
  
  Returns:
  
  The url String(s)
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- indexFullContentInDefaultAndStems
  
  public abstract boolean indexFullContentInDefaultAndStems()
  
  Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class. If true, the content is indexed using the #addToDefaultField method.
  
  Returns:
  
  True to have the full XML content indexed in the 'default' and 'stems' fields
- getWhatsNewDate
  
  protected abstract Date getWhatsNewDate() throws Exception
  
  Returns the date used to determine "What's new" in the library, or null if none is available.
  
  Returns:
  
  The what's new date for the item or null if not available.
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- getWhatsNewType
  
  protected abstract String getWhatsNewType() throws Exception
  
  Returns the type of category for "What's new" in the library, or null if none is available. Must be a simple lower case String with no spaces, for example 'itemnew', 'itemannocomplete', 'itemannoinprogress', 'annocomplete', 'annoinprogress', 'collection'.
  
  Returns:
  
  The what's new type.
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
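An implementer may want to check a candidate type against the "simple lower case String with no spaces" rule before returning it. The sketch below is one possible check, assuming "simple" means lower-case ASCII letters only; that reading, the class `WhatsNewTypeSketch`, and the method `isValidType` are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the DLESE API.
class WhatsNewTypeSketch {

    // Assumption: a "simple" type consists of lower-case ASCII letters only.
    private static final Pattern SIMPLE_TYPE = Pattern.compile("[a-z]+");

    // Returns true when the type is non-null and matches the assumed rule.
    static boolean isValidType(String type) {
        return type != null && SIMPLE_TYPE.matcher(type).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidType("itemnew"));  // prints "true"
        System.out.println(isValidType("Item New")); // prints "false"
    }
}
```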
- addFields
  
  protected abstract void addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile) throws Exception
  
  Adds additional fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.
  The following Lucene Field types are available for indexing with the Document:
  Field.Text(String name, String value) -- tokenized, indexed, stored
  Field.UnStored(String name, String value) -- tokenized, indexed, not stored
  Field.Keyword(String name, String value) -- not tokenized, indexed, stored
  Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
  Field(String name, String string, boolean store, boolean index, boolean tokenize) -- allows full control over whether the value is stored, indexed, and tokenized
  Example code:
  protected void addFields(Document newDoc, Document existingDoc, File sourceFile) throws Exception {
      // Add a tokenized, indexed, and stored field to the new Document
      String customContent = "Some content";
      newDoc.add(Field.Text("mycustomfield", customContent));
  }
  
  Parameters:
  
  newDoc - The new Document that is being created for this resource
  
  existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
  
  sourceFile - The sourceFile that is being indexed
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
- addCustomFields
  
  protected void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile) throws Exception
  
  Adds the full content of the XML to the default search field. Strips the XML tags to extract the content. Will not work properly if the XML is not well-formed.
  
  Specified by:
  
  addCustomFields in class FileIndexingServiceWriter
  
  Parameters:
  
  newDoc - The new Document that is being created for this resource
  
  existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
  
  sourceFile - The source file that is being indexed
  
  Throws:
  
  Exception - This method should throw an Exception with appropriate error message if an error occurs.
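To make the tag-stripping idea concrete, here is a minimal sketch that extracts the text content of an XML string and shows why well-formedness matters. It uses the JDK's built-in DOM parser as a stand-in; the parser this class actually uses for stripping is not stated here, and the class `XmlTextSketch` and method `extractText` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of the DLESE API.
class XmlTextSketch {

    // Parses the XML, drops the tags, and returns the remaining text with
    // runs of whitespace collapsed to single spaces. XML that is not
    // well-formed cannot be parsed and is reported as an exception.
    static String extractText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            String text = doc.getDocumentElement().getTextContent();
            return text.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<record><title>Plate tectonics</title> <description>An  overview</description></record>";
        System.out.println(extractText(xml)); // prints "Plate tectonics An overview"
    }
}
```

Note that text from adjacent elements runs together when no whitespace separates them in the source, so a production implementation would probably insert a separator between elements.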
- getDeletedDoc
  
  public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc) throws Throwable
  
  Creates a Lucene Document for the XML that is equal to the existing Document.
  
  Overrides:
  
  getDeletedDoc in class FileIndexingServiceWriter
  
  Parameters:
  
  existingDoc - An existing FileIndexingService Document that currently resides in the index for the given file
  
  Returns:
  
  A Lucene FileIndexingService Document
  
  Throws:
  
  Throwable - Thrown if error occurs
- getMyAnnoResultDocs
  
  protected ResultDocList getMyAnnoResultDocs() throws Exception
  
  Gets the annotations for this record, null or zero length if none available.
  
  Returns:
  
  The myAnnoResultDocs value
  
  Throws:
  
  Exception - If error
- getXmlIndexerFieldsConfig
  
  protected XMLIndexerFieldsConfig getXmlIndexerFieldsConfig()
  
  Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.
  
  Returns:
  
  The xmlIndexerFieldsConfig value
- getFieldContent
  
  protected String getFieldContent(String[] values, String useVocabMapping, String metadataFormat) throws Exception
  
  Gets the vocab encoded keys for the given values, separated by the '+' symbol.
  
  Parameters:
  
  values - The values to encode.
  
  useVocabMapping - The mapping to use, for example "contentStandards".
  
  metadataFormat - The metadata format, for example 'adn'
  
  Returns:
  
  The encoded vocab keys.
  
  Throws:
  
  Exception - If error.
- getFieldContent
  
  protected String getFieldContent(String value, String useVocabMapping, String metadataFormat) throws Exception
  
  Gets the encoded vocab key for the given content.
  
  Parameters:
  
  value - The value to encode
  
  useVocabMapping - The vocab mapping to use, for example 'contentStandard'
  
  metadataFormat - The metadata format, for example 'adn'
  
  Returns:
  
  The encoded value, or unchanged if unable to encode
  
  Throws:
  
  Exception - If error
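The two getFieldContent variants can be pictured with the sketch below: each value is replaced by its encoded key, a value with no known key is passed through unchanged, and multiple keys are joined with '+'. A plain `Map` stands in for the real vocabulary lookup, which depends on useVocabMapping and metadataFormat. The class `VocabEncodingSketch`, its methods, and the sample keys are all invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of the DLESE API.
class VocabEncodingSketch {

    // Encodes a single value, returning it unchanged if no key is known.
    static String encode(String value, Map<String, String> vocab) {
        String key = vocab.get(value);
        return key != null ? key : value;
    }

    // Encodes every value and joins the results with the '+' symbol.
    static String encodeAll(String[] values, Map<String, String> vocab) {
        StringJoiner joined = new StringJoiner("+");
        for (String value : values) {
            joined.add(encode(value, vocab));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Invented vocabulary entries; real keys come from the vocab mapping.
        Map<String, String> vocab = new HashMap<>();
        vocab.put("a", "01");
        vocab.put("b", "02");
        System.out.println(encodeAll(new String[] {"a", "b", "c"}, vocab)); // prints "01+02+c"
    }
}
```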
- getFieldName
  
  protected String getFieldName(String vocabFieldString, String metadataFormat) throws Exception
  
  Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'. If unable to get the field ID, the vocab field String is returned unchanged.
  
  Parameters:
  
  vocabFieldString - The field, for example 'gradeRange'
  
  metadataFormat - The metadata format, for example 'adn'
  
  Returns:
  
  The field key, for example 'gr', or unchanged if unable to determine
  
  Throws:
  
  Exception - If error
- getTermStringFromStringArray
  
  protected String getTermStringFromStringArray(String[] vals)
  
  Gets the appropriate terms from a string array of metadata fields. Uses all terms found after the last colon ":" found in the string.
  
  Parameters:
  
  vals - Metadata fields that must be delimited by colons.
  
  Returns:
  
  The individual terms used for indexing.
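The "after the last colon" rule might look like the sketch below. Joining the resulting terms with a single space is an assumption, since the separator used by the real method is not documented here; the class `TermStringSketch`, the method `termsAfterLastColon`, and the sample field values are hypothetical.

```java
// Hypothetical helper for illustration only; not part of the DLESE API.
class TermStringSketch {

    // Keeps the text after the last ':' of each field (the whole field when
    // it has no colon) and joins the results with single spaces.
    static String termsAfterLastColon(String[] vals) {
        StringBuilder terms = new StringBuilder();
        for (String val : vals) {
            if (val == null) {
                continue;
            }
            String term = val.substring(val.lastIndexOf(':') + 1).trim();
            if (term.isEmpty()) {
                continue;
            }
            if (terms.length() > 0) {
                terms.append(' ');
            }
            terms.append(term);
        }
        return terms.toString();
    }

    public static void main(String[] args) {
        // Invented sample fields in a colon-delimited style.
        String[] vals = {"DLESE:Primary elementary", "DLESE:Resource type:Visual:Map"};
        System.out.println(termsAfterLastColon(vals)); // prints "Primary elementary Map"
    }
}
```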
- getXmlIndexer
  
  protected XMLIndexer getXmlIndexer() throws Exception
  
  Gets the XMLIndexer for use by sub-classes
  
  Returns:
  
  The XMLIndexer
  
  Throws:
  
  Exception - If error
- getDom4jDoc
  
  protected org.dom4j.Document getDom4jDoc() throws Exception
  
  Gets the dom4j Document for use by sub-classes
  
  Returns:
  
  The Document
  
  Throws:
  
  Exception - If error
- getMyCollectionDoc
  
  protected DleseCollectionDocReader getMyCollectionDoc()
  
  Gets the DleseCollectionDocReader for the collection of which this item is a part, or null if not available.
  
  Returns:
  
  The myCollectionDoc value
- getOaiModtime
  
  public static final String getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)
  
  Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.
  
  Parameters:
  
  sourceFile - The source file
  
  existingDoc - The existing Doc
  
  Returns:
  
  The oaiModtime value
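The three-minute offset can be sketched with `java.time` as below. Which base timestamp the real method starts from and what String format it returns are not stated above, so the sketch takes the base time as a parameter and returns an ISO-8601 UTC string purely as an assumption. The class `OaiModtimeSketch` and the method `modtimePlusOffset` are hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for illustration only; not part of the DLESE API.
class OaiModtimeSketch {

    // Offset described in the documentation: three minutes into the future.
    static final long OFFSET_MINUTES = 3;

    // Adds the offset to a base time given in epoch milliseconds and returns
    // it as an ISO-8601 UTC string (the output format is an assumption).
    static String modtimePlusOffset(long baseMillis) {
        return Instant.ofEpochMilli(baseMillis)
                .plus(OFFSET_MINUTES, ChronoUnit.MINUTES)
                .truncatedTo(ChronoUnit.SECONDS)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(modtimePlusOffset(0L)); // prints "1970-01-01T00:03:00Z"
    }
}
```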
- getRecordDataService
  
  protected RecordDataService getRecordDataService()
  
  Gets the recordDataService used by this XML File Indexer
  
  Returns:
  
  The recordDataService, or null if not available.
- getIndex
  
  protected SimpleLuceneIndex getIndex()
  
  Gets the index used by this XML File Indexer
  
  Returns:
  
  The index, or null if not available.
