| Authors: | Seek You Too |
|---|---|
| Organization: | Seek You Too |
| Version: | 0.1 |
| Copyright: | 2008 by Seek You Too |
| License: | Attribution-Noncommercial-No Derivative Works 3.0 License |
| Document History: | |
This document describes the Example Application supplied with Meresco. It also traces the path of the data as it travels from the originating OAI repository to the end user. The Example Application consists of a server that provides:
- SRU Query support
- SRU Update support
- SRU Term Drilldown
- RSS support
The Example Application uses the output of the Meresco harvester as input for indexing records. Currently the Example Application can index OAI Dublin Core. The XML hierarchy of the OAI Dublin Core record is flattened so that it can be queried; in doing so, Meresco creates a one-to-one mapping between the elements of the original data and the fields that can be queried.
For example, the following OAI Dublin Core record:
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">The title</dc:title>
</oai_dc:dc>
can be queried, after indexing by Meresco, with the statement:
dc.title="The title"
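The flattening can be illustrated with a few lines of Python. This is a minimal sketch using only the standard library, not Meresco's own implementation. The helper names `local_name` and `flatten` are hypothetical, and the sketch assumes that field names are built by joining the namespace-free tag names with a dot, which matches the `dc.title` example.

```python
import xml.etree.ElementTree as ET

RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
  <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">The title</dc:title>
</oai_dc:dc>"""


def local_name(tag):
    # ElementTree writes qualified tags as '{namespace}name'; keep the name only.
    return tag.rsplit('}', 1)[-1]


def flatten(element, prefix=''):
    # Hypothetical helper: return (fieldname, value) pairs for every element
    # that carries text, using dotted paths of local tag names as fieldnames.
    name = prefix + local_name(element.tag)
    fields = []
    text = (element.text or '').strip()
    if text:
        fields.append((name, text))
    for child in element:
        fields.extend(flatten(child, name + '.'))
    return fields


fields = flatten(ET.fromstring(RECORD))
for name, value in fields:
    print('%s="%s"' % (name, value))  # prints: dc.title="The title"
```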
The basic structure of the Example Server is an ObservableHttpServer that is observed by three PathFilters. Each PathFilter decides, based on the path of the request, whether the call is let through.
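The idea of filtering on the request path can be sketched as follows. The classes below are hypothetical stand-ins, not the Meresco components themselves, and the paths `/sru`, `/update` and `/rss` are assumptions chosen for illustration; the sketch also assumes that a filter matches on the start of the path.

```python
class PathFilter:
    """Hypothetical stand-in: forwards a request only if its path matches."""

    def __init__(self, prefix, handler):
        self.prefix = prefix
        self.handler = handler

    def handle_request(self, path):
        if path.startswith(self.prefix):
            return self.handler(path)
        return None  # not our path: the call is not let through


class Server:
    """Hypothetical stand-in for the observed HTTP server."""

    def __init__(self, observers):
        self.observers = observers

    def handle_request(self, path):
        # Offer the request to every observer; the first one that
        # lets it through produces the response.
        for observer in self.observers:
            response = observer.handle_request(path)
            if response is not None:
                return response
        return '404 Not Found'


server = Server([
    PathFilter('/sru', lambda path: 'SRU query'),      # assumed path
    PathFilter('/update', lambda path: 'SRU update'),  # assumed path
    PathFilter('/rss', lambda path: 'RSS feed'),       # assumed path
])
```

With this setup, `server.handle_request('/sru?query=x')` is answered by the first filter only, and a path that matches none of the filters falls through to the fallback response.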
The support for SRU queries is provided by the following part of the Example Application DNA:
...
unqualifiedTermFields = [('dc', 1.0)]
DRILLDOWN_PREFIX = 'drilldown.'
drilldownFieldnames = ['drilldown.dc.subject']
drilldownComponent = Drilldown(drilldownFieldnames)
indexHelix = \
    (LuceneIndex(join(databasePath, 'index'), timer=reactor),
        (drilldownComponent,)
    )
...
(Sru(host=host, port=portNumber,
        defaultRecordSchema='oai_dc', defaultRecordPacking='xml'),
    (CQL2LuceneQuery(unqualifiedTermFields),
        indexHelix
    ),
    (storageComponent,),
    (SRUDrilldownAdapter(),
        (SRUTermDrilldown(),
            (DrilldownRequestFieldnameMap(
                    lambda field: DRILLDOWN_PREFIX + field,
                    lambda field: field[len(DRILLDOWN_PREFIX):]),
                (drilldownComponent,)
            )
        )
    )
)
This part of the Example Application DNA converts the arguments specified in the URL into an SRU query. Next, the query is converted from CQL to the native Lucene format, after which it is executed. The query yields only the identifiers of the matching records. The corresponding records are then fetched from the storage in the requested record schema, and the result is rendered using the specified record packing.
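The first step, turning URL arguments into SRU parameters, can be sketched with the standard library. This is only an illustration of the idea, not the code of the Sru component: the function name `sru_arguments` and the host, port and path in the URL are assumptions, while the two default values are the ones configured in the DNA above.

```python
from urllib.parse import urlparse, parse_qs

# Defaults as configured on the Sru component in the DNA.
DEFAULTS = {'recordSchema': 'oai_dc', 'recordPacking': 'xml'}


def sru_arguments(url):
    # Hypothetical helper: decode the query string and fall back to the
    # defaults for parameters the request does not specify.
    query = parse_qs(urlparse(url).query)
    arguments = dict(DEFAULTS)
    arguments.update((name, values[0]) for name, values in query.items())
    return arguments


# Hypothetical host, port and path.
url = ('http://localhost:8000/sru?version=1.1&operation=searchRetrieve'
       '&query=dc.title%3D%22The+title%22')
arguments = sru_arguments(url)
print(arguments['query'])  # prints: dc.title="The title"
```

The decoded `query` value is the CQL statement that is subsequently translated to a Lucene query.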
In addition to answering queries, this part of the Example Application also answers SRU Term Drilldown queries. During indexing, fields marked as suitable for drilldown are also indexed under a separate field name: the original field name with a 'drilldown.' prefix. An SRU Term Drilldown query contains the field name as the end user knows it, so the 'drilldown.' prefix has to be prepended. The DrilldownRequestFieldnameMap component does this before the field names are passed to the drilldown component.
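The two mapping functions from the DNA can be tried out in isolation. In the sketch below, `add_prefix` and `strip_prefix` are the two lambdas written as named functions; the `drilldown` function and the term counts are hypothetical sample data used only to show the round trip, and the sketch assumes the second function is applied to the field names in the result.

```python
DRILLDOWN_PREFIX = 'drilldown.'


def add_prefix(field):
    # From the name the end user knows to the name used in the index.
    return DRILLDOWN_PREFIX + field


def strip_prefix(field):
    # From the name used in the index back to the name the end user knows.
    return field[len(DRILLDOWN_PREFIX):]


# Hypothetical term counts, keyed by the prefixed index fieldname.
TERM_COUNTS = {'drilldown.dc.subject': [('history', 3), ('physics', 1)]}


def drilldown(requested_fields):
    # Hypothetical stand-in for the mapped drilldown call.
    results = []
    for field in requested_fields:
        index_field = add_prefix(field)
        terms = TERM_COUNTS.get(index_field, [])
        results.append((strip_prefix(index_field), terms))
    return results


result = drilldown(['dc.subject'])
```

The end user asks for `dc.subject`, the lookup happens on `drilldown.dc.subject`, and the result is reported under `dc.subject` again.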
The support for SRU Update is provided by the following part of the Example Application DNA:
...
fields2LuceneDocument = \
    (TransactionFactory(lambda tx:
            Fields2LuceneDocumentTx(tx, untokenized=drilldownFieldnames)),
        index
    )
indexingHelix = \
    (Transparant(),
        fields2LuceneDocument,
        (FilterField(lambda name:
                DRILLDOWN_PREFIX + name in drilldownFieldnames),
            (RenameField(lambda name: DRILLDOWN_PREFIX + name),
                fields2LuceneDocument
            )
        )
    )
...
(WebRequestServer(),
    (SRURecordUpdate(),
        (Amara2Lxml(),
            (TransactionScope(),
                (Venturi(
                        should=[
                            ('metadata', '/document:document/document:part[@name="metadata"]/text()')
                        ],
                        namespaceMap={
                            'document': 'http://meresco.com/namespace/harvester/document'}),
                    (XmlXPath(['/oai:metadata/oai_dc:dc']),
                        (XmlPrintLxml(),
                            (RewritePartname('oai_dc'),
                                (storageComponent,)
                            )
                        ),
                        (Xml2Fields(),
                            indexingHelix,
                            (RenameField(lambda name: "dc"),
                                indexingHelix
                            ),
                        ),
                    ),
                )
            )
        )
    )
)
The SRURecordUpdate component is one of the older components in Meresco and is therefore not completely up to date in how it works together with the other components. As a first step towards bringing it up to date, two wrapper components have been created so that, to the other components, SRURecordUpdate appears to work like any other component.
The record offered for indexing by the external harvesting source (in most cases the Meresco harvester) is accepted and the SRU Update protocol envelope is removed. The data contained within this envelope should have the following XML structure:
<document xmlns="http://meresco.com/namespace/harvester/document">
<part name="my_data1"><data1/></part>
<part name="my_data2"><data2/></part>
<part name="my_data3"><data3/></part>
...
</document>
The TransactionScope component starts a new transaction and passes on the data. Once the data has been processed without errors, the TransactionScope signals that the changes are to be committed. The Venturi component performs XPath queries on the data it receives. If an XPath query returns a result, that result is passed on to the observers. The name specified in front of the XPath query is passed along as the partname of the result.
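The way named XPath queries select parts from the document envelope can be approximated with the standard library. This is a rough sketch, not the Venturi component: `extract_parts` is a hypothetical helper, the sample envelope is made up, and it assumes that a part carries its payload as escaped text, as the `text()` step in the Venturi configuration suggests. ElementTree has no `text()` step, so the sketch reads the `.text` of the matching element instead.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {'document': 'http://meresco.com/namespace/harvester/document'}

# Made-up sample envelope with two parts holding escaped XML.
ENVELOPE = """<document xmlns="http://meresco.com/namespace/harvester/document">
  <part name="metadata">&lt;metadata/&gt;</part>
  <part name="my_data1">&lt;data1/&gt;</part>
</document>"""

# (partname, path) pairs, mirroring the 'should' list in the DNA.
# Paths are relative to the <document> root element.
SHOULD = [('metadata', "document:part[@name='metadata']")]


def extract_parts(xml_text, should):
    # Hypothetical helper: yield (partname, data) for every query that
    # returns a result; queries without a result are skipped.
    root = ET.fromstring(xml_text)
    for partname, path in should:
        node = root.find(path, NAMESPACES)
        if node is not None and node.text:
            yield partname, node.text


parts = list(extract_parts(ENVELOPE, SHOULD))
```

Only the part named `metadata` is passed on, together with its partname; the `my_data1` part is ignored because no query asks for it.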
Next, the XmlXPath component selects the OAI DC element, which is then stored in the storage under the name 'oai_dc'. The Xml2Fields component flattens the XML and creates the labels that are later indexed.
The observant reader will have noticed a variable called 'unqualifiedTermFields' in the previous chapter. It is a list of fields that are queried when the query does not specify a field. A good practice is to index data under the label of the root tag and to use that label as the default field. To do this with Meresco, the RenameField component adds an additional label named after the root tag and copies the contents to it.
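The effect of the renaming and filtering on the labels can be sketched as plain functions. This is one reading of the indexing DNA above, not Meresco code: `index_labels` and `all_labels` are hypothetical helpers that list the labels under which a single flattened field would end up in the index.

```python
DRILLDOWN_PREFIX = 'drilldown.'
drilldownFieldnames = ['drilldown.dc.subject']


def index_labels(name):
    # One pass through the indexing helix: the field is indexed under its
    # own name, and additionally under the prefixed name when that prefixed
    # name is listed as a drilldown field.
    labels = [name]
    if DRILLDOWN_PREFIX + name in drilldownFieldnames:
        labels.append(DRILLDOWN_PREFIX + name)
    return labels


def all_labels(name):
    # The field passes through the helix twice: once as is, and once
    # renamed to the root tag 'dc', the default field for unqualified terms.
    return index_labels(name) + index_labels('dc')


print(all_labels('dc.subject'))  # prints: ['dc.subject', 'drilldown.dc.subject', 'dc']
print(all_labels('dc.title'))    # prints: ['dc.title', 'dc']
```

Because every value is also indexed under `dc`, a query without a field name can be answered by searching that single label.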
To do.
To do.