Package org.dlese.dpc.oai.harvester
Class Harvester
java.lang.Object
org.dlese.dpc.oai.harvester.Harvester
- All Implemented Interfaces:
ErrorHandler
Harvests metadata from an OAI data provider, saving the results to file or returning the raw XML as an array of Strings. Supports data providers that use resumption tokens for flow control, selective harvesting by date or set, gzip response compression, and other protocol features. Supports OAI protocol versions 1.1 and 2.0.
To perform a harvest, use one of the following methods:
- The static harvest method (for general use):
harvest(java.lang.String, java.lang.String, java.lang.String, java.util.Date, java.util.Date, java.lang.String, boolean, org.dlese.dpc.oai.harvester.HarvestMessageHandler, org.dlese.dpc.oai.harvester.OAIChangeListener, boolean, boolean, boolean, int)
- The static main method (for command-line use):
main(java.lang.String[])
- The non-static doHarvest method (provides a few additional options):
doHarvest(java.lang.String, java.lang.String, java.lang.String, java.util.Date, java.util.Date, java.lang.String, boolean, java.lang.String, java.lang.String, boolean, boolean, boolean)
- Author:
- Steve Sullivan, John Weatherley
-
Constructor Summary
- Harvester(): Creates a Harvester that uses no HarvestMessageHandler or OAIChangeListener.
- Harvester(HarvestMessageHandler msgHandler, OAIChangeListener oaiChangeListener, int timeOutMilliseconds): Creates a Harvester that uses the given HarvestMessageHandler.
-
Method Summary
- String[][] doHarvest(String baseURL, String metadataPrefix, String setSpec, Date from, Date until, String outdir, boolean splitBySet, String zipName, String zDir, boolean writeHeaders, boolean harvestAll, boolean harvestAllIfNoDeletedRecord): Performs the harvest.
- void error(SAXParseException exc): Handles errors.
- void fatalError(SAXParseException exc): Handles fatal errors.
- long getEndTime(): Gets the time at which the harvest completed, either because of an error or at the end of a successful harvest.
- getHarvestedRecordsDir(): Gets the harvestedRecordsDir attribute of the Harvester object.
- long getHarvestUid(): Returns a unique ID for this harvest.
- int getNumRecordsHarvested(): Gets the current number of records that have been harvested by this harvester.
- int getNumResumptionTokensIssued(): Gets the number of resumption tokens that have currently been issued by the data provider.
- long getStartTime(): Gets the startTime when the harvest began, or 0 if it has not begun yet.
- static String[][] harvest(String baseURL, String metadataPrefix, String setSpec, Date from, Date until, String outdir, boolean splitBySet, HarvestMessageHandler msgHandler, OAIChangeListener oaiChangeListener, boolean writeHeaders, boolean harvestAll, boolean harvestAllIfNoDeletedRecord, int timeOutMilliseconds): Harvests the given provider, saving the resulting metadata to file or returning the results as an array of Strings.
- boolean isRunning(): Determines whether this Harvester is currently running or not.
- void kill(): Gracefully kills the harvest after the current record is finished being harvested.
- static void main(String[] args): Command line interface for the harvester.
- static void setDebug(boolean db): Sets the debug attribute.
- void setNumRecordsForNotification(int numRecords): Sets the number of records harvested before statusMessage notifications to the HarvestMessageHandler are made.
- void warning(SAXParseException exc): Handles warnings.
-
Constructor Details
-
Harvester
public Harvester()
Creates a Harvester that uses no HarvestMessageHandler or OAIChangeListener.
-
Harvester
public Harvester(HarvestMessageHandler msgHandler, OAIChangeListener oaiChangeListener, int timeOutMilliseconds)
Creates a Harvester that uses the given HarvestMessageHandler.
- Parameters:
msgHandler - The HarvestMessageHandler that will receive messages as the harvest progresses, or null for none.
oaiChangeListener - The OAIChangeListener that will receive notifications, or null for none.
timeOutMilliseconds - Number of milliseconds the harvester will wait for a response from the data provider before timing out.
-
-
Method Details
-
main
Command line interface for the harvester. Harvest status messages are output to standard out.
Arguments (required arguments must be in this order, optional arguments may be in any order):
- outdir (required) - Path to the directory to write the harvested record files, for example "." or "/home/user/harvested_files"
- baseURL (required) - Base URL to harvest from, for example "http://www.dlese.org/oai/provider"
- metadataPrefix (required) - The metadata prefix, for example "oai_dc"
- [ -set:setSpec ] (optional) - The set to harvest, for example -set:myset
- [ -from:fromDate ] (optional) - The harvest from date, for example, -from:2003-12-31T23:59:59Z
- [ -until:untilDate ] (optional) - The harvest until date, for example, -until:2004-12-31T23:59:59Z
- [ -splitBySet:true|false ] (optional) - True to save each record in separate directories split by set inside outdir, false to save all records to the root of outdir (default is false)
- [ -writeHeaders:true|false ] (optional) - True to have OAI headers written to the output, false not to (default is false)
- Parameters:
args - The command line arguments
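The argument ordering above can be sketched in code. The following is a minimal illustration only: `HarvesterArgs` and its `build` method are hypothetical names that are not part of this package. It places the three required values first, in the documented order, and appends only the optional flags that were supplied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the Harvester API) that assembles arguments for main. */
final class HarvesterArgs {
    private HarvesterArgs() {}

    /**
     * Builds the argument array: outdir, baseURL and metadataPrefix first,
     * then any optional flags. Pass null for an optional value to omit it.
     */
    static String[] build(String outdir, String baseURL, String metadataPrefix,
                          String setSpec, String from, String until,
                          Boolean splitBySet, Boolean writeHeaders) {
        List<String> args = new ArrayList<>();
        // Required arguments, in the documented order.
        args.add(outdir);
        args.add(baseURL);
        args.add(metadataPrefix);
        // Optional arguments use the -name:value form and may appear in any order.
        if (setSpec != null) args.add("-set:" + setSpec);
        if (from != null) args.add("-from:" + from);
        if (until != null) args.add("-until:" + until);
        if (splitBySet != null) args.add("-splitBySet:" + splitBySet);
        if (writeHeaders != null) args.add("-writeHeaders:" + writeHeaders);
        return args.toArray(new String[0]);
    }
}
```

For example, `HarvesterArgs.build(".", "http://www.dlese.org/oai/provider", "oai_dc", "myset", null, null, null, null)` yields an array that could then be handed to `Harvester.main`.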
-
harvest
public static String[][] harvest(String baseURL, String metadataPrefix, String setSpec, Date from, Date until, String outdir, boolean splitBySet, HarvestMessageHandler msgHandler, OAIChangeListener oaiChangeListener, boolean writeHeaders, boolean harvestAll, boolean harvestAllIfNoDeletedRecord, int timeOutMilliseconds) throws Hexception, OAIErrorException
Harvests the given provider, saving the resulting metadata to file or returning the results as an array of Strings. A HarvestMessageHandler may be specified to capture harvest progress messages. Use a SimpleHarvestMessageHandler to have harvest messages sent to standard out. An OAIChangeListener may be specified to receive messages about changes to harvested records.
- Parameters:
baseURL - The baseURL of the data provider, for example "http://www.dlese.org/oai/provider"
metadataPrefix - The metadataPrefix, for example "oai_dc"
setSpec - The set to harvest, for example "testset", or null to harvest all sets
from - The from date, for example "2003-12-31T23:59:59Z", or null for none
until - The until date, for example "2003-12-31T23:59:59Z", or null for none
outdir - The path of the output dir. If null or "", we return the String[][] array; if specified we return null
splitBySet - True to save each record in separate directories split by set inside outdir, false to save all records to the root of outdir
msgHandler - A handler for status messages that occur during the harvest, or null to ignore messages
oaiChangeListener - The OAIChangeListener that will receive notifications, or null for none
writeHeaders - True to have OAI headers written to the output, false not to
harvestAll - True to delete previous harvested record files and harvest all records again from scratch; false to preserve previous record files and replace or delete only those that have changed
harvestAllIfNoDeletedRecord - True to harvest all record files from scratch if deleted records are not supported
timeOutMilliseconds - Number of milliseconds the harvester will wait for a response from the data provider before timing out
- Returns:
- If outdir is specified returns null; if outdir is null or "", returns
one row for each record harvested. Each row has two elements:
- identifier, encoded
- content xml record, or the String deleted if status=deleted.
- Throws:
Hexception - If serious error
OAIErrorException - If OAI error
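When outdir is null or "", the returned rows can be inspected directly. The sketch below is illustrative only: `HarvestRows` is a hypothetical helper, and it assumes that a deleted record is marked by the literal text "deleted" in the second element of its row, which should be verified against actual output.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the Harvester API) for reading harvest result rows. */
final class HarvestRows {
    /** Assumed marker stored in place of the XML content for deleted records. */
    static final String DELETED = "deleted";

    private HarvestRows() {}

    /** Returns the encoded identifiers of rows whose content is the deleted marker. */
    static List<String> deletedIdentifiers(String[][] rows) {
        List<String> ids = new ArrayList<>();
        if (rows == null) {
            return ids; // null is returned when outdir was specified
        }
        for (String[] row : rows) {
            // Each row holds two elements: [0] encoded identifier, [1] content.
            if (row != null && row.length >= 2 && DELETED.equals(row[1])) {
                ids.add(row[0]);
            }
        }
        return ids;
    }

    /** Counts rows that carry record content rather than the deleted marker. */
    static int countLive(String[][] rows) {
        if (rows == null) {
            return 0;
        }
        int live = 0;
        for (String[] row : rows) {
            if (row != null && row.length >= 2 && !DELETED.equals(row[1])) {
                live++;
            }
        }
        return live;
    }
}
```

Guarding against a null result keeps the same code usable whether or not an output directory was given.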
-
kill
public void kill()
Gracefully kills the harvest after the current record is finished being harvested.
-
setNumRecordsForNotification
public void setNumRecordsForNotification(int numRecords)
Sets the number of records harvested before statusMessage notifications to the HarvestMessageHandler are made.
- Parameters:
numRecords - The new numRecordsForNotification value
-
getStartTime
public long getStartTime()
Gets the startTime when the harvest began, or 0 if it has not begun yet.
- Returns:
- The startTime, or 0 if not started yet.
-
getHarvestedRecordsDir
Gets the harvestedRecordsDir attribute of the Harvester object.
- Returns:
- The harvestedRecordsDir value
-
getHarvestUid
public long getHarvestUid()
Returns a unique ID for this harvest.
- Returns:
- The harvestId value
-
getEndTime
public long getEndTime()
Gets the time at which the harvest completed, either because of an error or at the end of a successful harvest. Returns 0 if the harvest is still in progress.
- Returns:
- The endTime, or 0 if the harvest is still in progress.
-
getNumRecordsHarvested
public int getNumRecordsHarvested()
Gets the current number of records that have been harvested by this harvester. This number increases as the harvest progresses.
- Returns:
- The numRecordsHarvested value
-
getNumResumptionTokensIssued
public int getNumResumptionTokensIssued()
Gets the number of resumption tokens that have currently been issued by the data provider. This number increases as the harvest progresses and gives a rough indication of the progression and duration of the harvest.
- Returns:
- The numResumptionTokensIssued value.
-
isRunning
public boolean isRunning()
Determines whether this Harvester is currently running or not.
- Returns:
- True if the harvest is in progress, false otherwise.
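The zero-value conventions of getStartTime and getEndTime can be combined into an elapsed-time calculation. This is a sketch under stated assumptions: `HarvestTiming` is a hypothetical helper, and it assumes both values are in the same unit as the supplied clock reading (for example milliseconds from `System.currentTimeMillis()`), which this page does not state.

```java
/** Hypothetical helper (not part of the Harvester API) for interpreting harvest times. */
final class HarvestTiming {
    private HarvestTiming() {}

    /**
     * Computes how long a harvest has run.
     *
     * @param startTime value from getStartTime(), 0 if the harvest has not begun
     * @param endTime   value from getEndTime(), 0 if the harvest is still in progress
     * @param now       current clock reading, assumed to use the same unit as the other two
     */
    static long elapsed(long startTime, long endTime, long now) {
        if (startTime == 0) {
            return 0;                 // not started yet
        }
        if (endTime == 0) {
            return now - startTime;   // still in progress
        }
        return endTime - startTime;   // completed, successfully or with an error
    }
}
```

A monitoring loop might call this alongside getNumRecordsHarvested and getNumResumptionTokensIssued while isRunning returns true.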
-
doHarvest
public String[][] doHarvest(String baseURL, String metadataPrefix, String setSpec, Date from, Date until, String outdir, boolean splitBySet, String zipName, String zDir, boolean writeHeaders, boolean harvestAll, boolean harvestAllIfNoDeletedRecord) throws Hexception, OAIErrorException
Performs the harvest. Note that this method is not safe for multiple harvests; a separate Harvester instance should be created for each harvest performed.
- Parameters:
baseURL - The baseURL of the data provider.
metadataPrefix - The metadataPrefix, e.g., "oai_dc", or null to harvest all formats.
setSpec - The set, e.g., "testset", or null for none.
from - The from date. May be null.
until - The until date. May be null.
outdir - The path of the output dir. If null or "", we return the String[][] array; if specified we return null.
splitBySet - True to split the output by set.
zipName - Name of the zip file to save to, or null for no zipping.
zDir - Directory of the zip file.
writeHeaders - True to have OAI headers written to file, false not to. The directory structure under outdir is:
outdir/set/subset/subset/metadataPrefix/oaiId_hdr.xml (OAI header)
outdir/set/subset/subset/metadataPrefix/oaiId_data.xml (OAI contents)
harvestAll - True to delete previous harvested records and harvest all records again from scratch.
harvestAllIfNoDeletedRecord - True to harvest all records from scratch if deleted records are not supported.
- Returns:
- If outdir is specified returns null; if outdir is null or "", returns
one row for each record harvested. Each row has two elements:
- identifier, encoded
- content xml record.
- Throws:
Hexception - If serious error.
OAIErrorException - If OAI error was returned by the data provider.
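The output layout described for writeHeaders can be illustrated by computing where a record's files would be expected. This is a rough sketch, not the harvester's own logic: `HarvestPaths` is a hypothetical helper, it assumes a colon-separated setSpec (the OAI convention for set hierarchies) maps to nested directories, and it assumes the identifier passed in is already encoded in a form safe for file names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of the Harvester API) that mirrors the documented layout. */
final class HarvestPaths {
    private HarvestPaths() {}

    /** Resolves outdir/set/subset/.../metadataPrefix for a colon-separated setSpec. */
    static Path recordDir(String outdir, String setSpec, String metadataPrefix) {
        Path dir = Paths.get(outdir);
        if (setSpec != null && !setSpec.isEmpty()) {
            // Assumption: each level of the set hierarchy becomes one directory.
            for (String part : setSpec.split(":")) {
                dir = dir.resolve(part);
            }
        }
        return dir.resolve(metadataPrefix);
    }

    /** Expected location of the OAI header file for an already-encoded identifier. */
    static Path headerFile(String outdir, String setSpec, String metadataPrefix, String encodedId) {
        return recordDir(outdir, setSpec, metadataPrefix).resolve(encodedId + "_hdr.xml");
    }

    /** Expected location of the OAI contents file for an already-encoded identifier. */
    static Path dataFile(String outdir, String setSpec, String metadataPrefix, String encodedId) {
        return recordDir(outdir, setSpec, metadataPrefix).resolve(encodedId + "_data.xml");
    }
}
```

Comparing these expected paths with what a real harvest writes is a quick way to confirm or correct the assumptions above.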
-
setDebug
public static void setDebug(boolean db)
Sets the debug attribute.
- Parameters:
db - The new debug value
-
fatalError
Handles fatal errors. Part of the ErrorHandler interface.
- Specified by:
fatalError in interface ErrorHandler
- Parameters:
exc - The Exception thrown
-
error
Handles errors. Part of the ErrorHandler interface.
- Specified by:
error in interface ErrorHandler
- Parameters:
exc - The Exception thrown
-
warning
Handles warnings. Part of the ErrorHandler interface.
- Specified by:
warning in interface ErrorHandler
- Parameters:
exc - The Exception thrown
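For readers unfamiliar with the SAX ErrorHandler callbacks (warning, error, fatalError), the following generic sketch shows how a parser reports problems through them. It uses only the standard JDK XML APIs; `RecordingErrorHandler` is a hypothetical class for illustration and says nothing about how Harvester itself reacts to each callback.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical handler that records each SAX callback instead of acting on it. */
final class RecordingErrorHandler implements ErrorHandler {
    final List<String> warnings = new ArrayList<>();
    final List<String> errors = new ArrayList<>();
    final List<String> fatalErrors = new ArrayList<>();

    @Override
    public void warning(SAXParseException exc) {
        warnings.add(describe(exc));
    }

    @Override
    public void error(SAXParseException exc) {
        errors.add(describe(exc));
    }

    @Override
    public void fatalError(SAXParseException exc) {
        fatalErrors.add(describe(exc));
    }

    private static String describe(SAXParseException exc) {
        return "line " + exc.getLineNumber() + ": " + exc.getMessage();
    }

    /** Parses the given XML text and returns the handler holding whatever was reported. */
    static RecordingErrorHandler parse(String xml) {
        RecordingErrorHandler handler = new RecordingErrorHandler();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // The parser stops after a fatal error; the handler has already recorded it.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

Parsing a document that is not well formed, such as `<a><b></a>`, should leave an entry in `fatalErrors`, while a well-formed document leaves all three lists empty.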
-