Class XMLIndexer

java.lang.Object
org.dlese.dpc.index.writer.xml.XMLIndexer

public class XMLIndexer extends Object
Adds index fields to a Lucene Document from any well-formed XML. Individual field names are derived from the xPath to each element and attribute in the XML instance document. Fields are encoded to support text, keyword and stemmed search. Also creates standard fields for IDs, URLs, title, description and geospatial bounding box footprint. The 'default' and 'stems' fields are also indexed as text and stemmed text, respectively.

An XMLIndexerFieldsConfig may be supplied to configure specific search fields for given XML formats. If a field is defined in the XMLIndexerFieldsConfig, and content is available at the given xPath, it will override the value set for ids, urls, title or description. In addition, field values configured by schema override those configured by xmlFormat.
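The override order described above can be pictured as a simple fallback chain. The following is a minimal sketch only, not the XMLIndexer implementation: the class `FieldPrecedenceSketch` and the method `resolve` are hypothetical names, and treating an empty string as "no content" is an assumption.

```java
class FieldPrecedenceSketch {

    /**
     * Returns the first candidate that has content, checked from highest
     * to lowest priority: schema config, then xmlFormat config, then the
     * value that was set otherwise. (Hypothetical helper, for illustration.)
     */
    static String resolve(String bySchema, String byXmlFormat, String otherwiseSet) {
        if (bySchema != null && !bySchema.isEmpty()) {
            return bySchema;
        }
        if (byXmlFormat != null && !byXmlFormat.isEmpty()) {
            return byXmlFormat;
        }
        return otherwiseSet;
    }

    public static void main(String[] args) {
        // No schema-level value is configured, so the xmlFormat-level value is used.
        System.out.println(resolve(null, "Title from format config", "Title set directly"));
    }
}
```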

Author:
John Weatherley
  • Constructor Details

    • XMLIndexer

      public XMLIndexer(org.dom4j.Document localizedXmlDocument, String xmlFormat, XMLIndexerFieldsConfig xmlIndexerFieldsConfig)
      Constructs an XMLIndexer from a dom4j Document.
      Parameters:
      localizedXmlDocument - A localized XML Document
      xmlFormat - The XML format being indexed, for example adn or oai_dc
      xmlIndexerFieldsConfig - The config, or null if not used
    • XMLIndexer

      public XMLIndexer(String xmlString, String xmlFormat, XMLIndexerFieldsConfig xmlIndexerFieldsConfig) throws Exception
      Constructs an XMLIndexer from an XML string.
      Parameters:
      xmlString - A valid XML string
      xmlFormat - The XML format being indexed, for example adn or oai_dc
      xmlIndexerFieldsConfig - The config, or null if not used
      Throws:
      Exception - If error
    • XMLIndexer

      public XMLIndexer(URL urlToXml, String xmlFormat, XMLIndexerFieldsConfig xmlIndexerFieldsConfig) throws Exception
      Constructs an XMLIndexer from the XML document at a URL.
      Parameters:
      urlToXml - URL to an XML document
      xmlFormat - The XML format being indexed, for example adn or oai_dc
      xmlIndexerFieldsConfig - The config, or null if not used
      Throws:
      Exception - If error
  • Method Details

    • setIndexDefaultAndStemsField

      public void setIndexDefaultAndStemsField(boolean indexDefaultAndStemsField) throws IllegalStateException
      Sets whether to index the default, admindefault, and stems fields for this record.
      Parameters:
      indexDefaultAndStemsField - The value to assign indexDefaultAndStemsField.
      Throws:
      IllegalStateException - If called after method #indexFields has been called
    • getTitle

      public String getTitle() throws IllegalStateException
      Returns the value of title.
      Returns:
      The title value
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • setTitle

      public void setTitle(String title) throws IllegalStateException
      Sets the value of title.
      Parameters:
      title - The value to assign title.
      Throws:
      IllegalStateException - If called after method #indexFields has been called
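The setter and getter contracts above imply a two-phase lifecycle: values may be set only before indexFields is called and read only afterwards. A minimal sketch of that guard pattern follows. `LifecycleSketch` is a hypothetical stand-in, not the XMLIndexer class, and its no-argument `indexFields` only flips the flag.

```java
class LifecycleSketch {
    private String title;
    private boolean indexed = false; // becomes true once indexFields has run

    /** Allowed only before indexing. */
    void setTitle(String title) {
        if (indexed) {
            throw new IllegalStateException("setTitle called after indexFields");
        }
        this.title = title;
    }

    /** Stand-in for the indexing step; here it only marks the phase change. */
    void indexFields() {
        indexed = true;
    }

    /** Allowed only after indexing. */
    String getTitle() {
        if (!indexed) {
            throw new IllegalStateException("getTitle called before indexFields");
        }
        return title;
    }

    public static void main(String[] args) {
        LifecycleSketch sketch = new LifecycleSketch();
        sketch.setTitle("Plate tectonics");
        sketch.indexFields();
        System.out.println(sketch.getTitle());
    }
}
```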
    • getDescription

      public String getDescription() throws IllegalStateException
      Returns the value of description.
      Returns:
      The description value
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • setDescription

      public void setDescription(String description) throws IllegalStateException
      Sets the value of description.
      Parameters:
      description - The value to assign description.
      Throws:
      IllegalStateException - If called after method #indexFields has been called
    • getUrls

      public String[] getUrls() throws IllegalStateException
      Returns the value of urls.
      Returns:
      The urls value
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • setUrls

      public void setUrls(String[] urls) throws IllegalStateException
      Sets the value of urls.
      Parameters:
      urls - The value to assign urls.
      Throws:
      IllegalStateException - If called after method #indexFields has been called
    • getIds

      public String[] getIds() throws IllegalStateException
      Returns the value of ids.
      Returns:
      The ids value
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • setIds

      public void setIds(String[] ids) throws IllegalStateException
      Sets the value of ids.
      Parameters:
      ids - The value to assign ids.
      Throws:
      IllegalStateException - If called after method #indexFields has been called
    • getIdsEncoded

      public String[] getIdsEncoded() throws IllegalStateException
      Returns the unique IDs for the item being indexed, encoded for indexing. If more than one ID is present, the first one is the primary.
      Returns:
      The id Strings encoded for indexing
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • getRelatedIds

      public List getRelatedIds() throws IllegalStateException
      Gets the ids of related records.
      Returns:
      The related ids
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • getRelatedUrls

      public List getRelatedUrls() throws IllegalStateException
      Gets the urls of related records.
      Returns:
      The related urls
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • getRelatedIdsMap

      public Map getRelatedIdsMap() throws IllegalStateException
      Gets the ids of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the ids of the target records.
      Returns:
      The related ids
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • getRelatedUrlsMap

      public Map getRelatedUrlsMap() throws IllegalStateException
      Gets the urls of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the urls of the target records.
      Returns:
      The related urls
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
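The two map-returning methods describe a structure keyed by relationship name, with a list of target strings as each value. Since the declared return type is a raw Map, the sketch below assumes the shape `Map<String, List<String>>` and shows one way to walk it. `RelatedMapSketch`, `sample` and `flatten` are hypothetical names, and the IDs are made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RelatedMapSketch {

    /** Builds sample data in the assumed shape: relationship -> target ids. */
    static Map<String, List<String>> sample() {
        Map<String, List<String>> related = new LinkedHashMap<>();
        related.put("isAnnotatedBy", Arrays.asList("ANNO-1", "ANNO-2"));
        related.put("isPartOf", Arrays.asList("COLL-9"));
        return related;
    }

    /** Flattens the map into "relationship -> target" strings, in map order. */
    static List<String> flatten(Map<String, List<String>> related) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : related.entrySet()) {
            for (String target : entry.getValue()) {
                pairs.add(entry.getKey() + " -> " + target);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String pair : flatten(sample())) {
            System.out.println(pair);
        }
    }
}
```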
    • getXPathFieldsPrefix

      public String getXPathFieldsPrefix()
      Returns the value of xPathFieldsPrefix, or null if none.
    • setXPathFieldsPrefix

      public void setXPathFieldsPrefix(String xPathFieldsPrefix) throws IllegalStateException
      Sets the value of xPathFieldsPrefix, which is prepended to the xPath field names when indexed. Set to null to use none (default).
      Parameters:
      xPathFieldsPrefix - The value to prepend to the xPath fields, or null for none
      Throws:
      IllegalStateException
    • getBoundingBox

      public BoundingBox getBoundingBox()
      Returns the value of boundingBox.
    • setBoundingBox

      public void setBoundingBox(BoundingBox boundingBox)
      Sets the value of boundingBox.
      Parameters:
      boundingBox - The value to assign boundingBox.
    • getFullXmlElementContent

      public String getFullXmlElementContent() throws IllegalStateException
      Gets the full content of each Element in the XML. Attribute content is not included. If this is a Java Bean, gets the content of all Bean properties. Method #indexFields must be called prior to using this method.
      Returns:
      The full Element content
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
    • getFullXmlAttributeContent

      public String getFullXmlAttributeContent() throws IllegalStateException
      Gets the full content of each Attribute in the XML. Element content is not included. Method #indexFields must be called prior to using this method.
      Returns:
      The full Attribute content
      Throws:
      IllegalStateException - If called prior to calling method #indexFields
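The distinction between element content and attribute content can be illustrated with the JDK's DOM parser. This is a sketch under assumptions, not the XMLIndexer implementation: `ContentSketch` and its methods are hypothetical, and joining values with single spaces is an arbitrary choice made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ContentSketch {

    /** Concatenates the text of every element; attribute values are skipped. */
    static String elementContent(String xml) {
        StringBuilder out = new StringBuilder();
        collect(parse(xml).getDocumentElement(), out, false);
        return out.toString().trim();
    }

    /** Concatenates the value of every attribute; element text is skipped. */
    static String attributeContent(String xml) {
        StringBuilder out = new StringBuilder();
        collect(parse(xml).getDocumentElement(), out, true);
        return out.toString().trim();
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk that gathers either attribute values or text nodes.
    private static void collect(Node node, StringBuilder out, boolean attributes) {
        if (attributes && node.getNodeType() == Node.ELEMENT_NODE) {
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                append(out, attrs.item(i).getNodeValue());
            }
        } else if (!attributes && node.getNodeType() == Node.TEXT_NODE) {
            append(out, node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out, attributes);
        }
    }

    private static void append(StringBuilder out, String value) {
        String trimmed = value.trim();
        if (!trimmed.isEmpty()) {
            out.append(trimmed).append(' ');
        }
    }
}
```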
    • getXmlDocument

      public org.dom4j.Document getXmlDocument()
      Gets the localized Dom4j Document for this XML instance.
      Returns:
      The xml Document
    • indexFields

      public void indexFields(org.apache.lucene.document.Document luceneDoc) throws Exception
      Indexes the contents of the XML, adding fields to the Lucene Document that is supplied.
      Parameters:
      luceneDoc - The Document to add fields to
      Throws:
      Exception - If error, provides an appropriate message to display in indexing reports.
    • indexXpathFields

      public void indexXpathFields(org.apache.lucene.document.Document luceneDoc) throws Exception
      Indexes the content of each element and attribute in the source XML as individual search fields, using the xPath to the element or attribute as the field name. If an xPath field prefix has been indicated it will be inserted at the beginning of the field path.
      Parameters:
      luceneDoc - The Document to add fields to
      Throws:
      Exception - If error, provides an appropriate message to display in indexing reports.
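To make the xPath-as-field-name idea concrete, the sketch below walks a document with the JDK's DOM parser and collects one entry per element and attribute path. It is an illustration under assumptions: `XPathFieldSketch` is a hypothetical class, the `/@name` form for attribute paths and plain concatenation of the prefix are guesses at a reasonable naming scheme, and it gathers values into a map rather than adding fields to a Lucene Document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathFieldSketch {

    /** Maps each path-derived field name to the values found at that path. */
    static Map<String, List<String>> fields(String xml, String prefix) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        walk(doc.getDocumentElement(), "", prefix == null ? "" : prefix, out);
        return out;
    }

    private static void walk(Element el, String parentPath, String prefix,
                             Map<String, List<String>> out) {
        String path = parentPath + "/" + el.getNodeName();
        // Each attribute becomes its own field: <path>/@<attribute name>.
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            add(out, prefix + path + "/@" + attr.getNodeName(), attr.getNodeValue());
        }
        // Direct text becomes the element's field value; child elements recurse.
        StringBuilder text = new StringBuilder();
        for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) child, path, prefix, out);
            }
        }
        add(out, prefix + path, text.toString());
    }

    private static void add(Map<String, List<String>> out, String name, String value) {
        String trimmed = value.trim();
        if (!trimmed.isEmpty()) {
            out.computeIfAbsent(name, k -> new ArrayList<>()).add(trimmed);
        }
    }
}
```

In this sketch, repeated elements at the same path share one field name and contribute multiple values.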
    • indexJavaBeanFields

      public boolean indexJavaBeanFields(org.apache.lucene.document.Document luceneDoc) throws Exception
      Indexes Java Bean XML that was encoded with the java.beans.XMLEncoder class, using the bean properties as field names. If this is not Java Bean encoded XML, nothing is done, returns false.
      Parameters:
      luceneDoc - The Document to add fields to
      Returns:
      True if this is a Java Bean and property fields were indexed.
      Throws:
      Exception - If error, provides an appropriate message to display in indexing reports.
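XML written by java.beans.XMLEncoder has a recognizable shape: a `java` root element with `class="java.beans.XMLDecoder"`, and `void` elements whose `property` attribute names each bean property. The sketch below detects that shape and pulls out top-level property values using only the JDK's DOM parser. How XMLIndexer itself recognizes bean XML is not stated here, so the detection rule is an assumption; `BeanXmlSketch`, its methods, and the `com.example.Record` class named in the sample are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BeanXmlSketch {

    // Hand-written sample in the XMLEncoder output format (hypothetical bean class).
    static final String SAMPLE =
            "<java version=\"1.8.0\" class=\"java.beans.XMLDecoder\">"
          + " <object class=\"com.example.Record\">"
          + "  <void property=\"title\"><string>Plate tectonics</string></void>"
          + "  <void property=\"year\"><int>2004</int></void>"
          + " </object>"
          + "</java>";

    /** Assumed detection rule: root is <java class="java.beans.XMLDecoder">. */
    static boolean isJavaBeanXml(String xml) {
        Element root = parse(xml).getDocumentElement();
        return "java".equals(root.getNodeName())
                && "java.beans.XMLDecoder".equals(root.getAttribute("class"));
    }

    /** Returns property name -> text value; empty if this is not bean XML. */
    static Map<String, String> beanProperties(String xml) {
        Map<String, String> props = new LinkedHashMap<>();
        if (!isJavaBeanXml(xml)) {
            return props;
        }
        NodeList voids = parse(xml).getElementsByTagName("void");
        for (int i = 0; i < voids.getLength(); i++) {
            Element el = (Element) voids.item(i);
            if (el.hasAttribute("property")) {
                props.put(el.getAttribute("property"), el.getTextContent().trim());
            }
        }
        return props;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Nested bean properties are not handled specially in this sketch; each `void` element with a `property` attribute is treated as a flat name and value pair.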