Tuesday, March 17, 2009

OAI-PMH in ePrints

I was chatting with Peter Sefton today about OAI-PMH and resource type names so I thought I'd get back to documenting some ePrints work.

Configuring OAI-PMH for eprints is straightforward. Within your archive's folder you'll find cfg/cfg.d/oai.pl. There you can make some basic alterations such as defining sets and setting up your metadata policy.

Actually deciding how to represent your PMH data is important. For repositories in Australia, the NLA's Arrow Discovery Search provides an interesting angle on this. Just because your repository can be harvested doesn't mean it's interoperable. You should try and see if there's a naming scheme for resource types that others around you are using. Check out Peter Sefton's blog post for a more in-depth commentary on this.

So, what do you do if internally you want to define a Resource Type "Article" but the outside world wants to know it as "Journal Article"? Well, for one, you can use the phrase files to call it whatever you want for users of the system. So if eprints internally calls it "Paper", your phrase file can call it "Article" to make data submission less confusing.

But, if you want to change the way that the metadata is presented to systems such as OAI-PMH, you need to look at how ePrints deals with Dublin Core. Basically, the OAI-PMH data contains the metadata in DC. Each eprint page also puts the DC into the page's meta tags.

Under perl_lib/EPrints/Plugin/Export you'll find DC.pm. Now, this is an important file if you define your own resource type (with non-inbuilt fields) as eprints won't know about them and you'll have to make sure you adapt files such as DC.pm to output your Resource Type's metadata correctly. Look at the convert_dataobj procedure and you'll see how fields are then put into DC format.

So, we created a qut_thesis type to indicate QUT-based theses for collection within the ADT. Now, eprints doesn't know anything about qut_thesis and we needed to edit DC.pm:

if ($eprint->get_value( "type" ) eq 'thesis' || $eprint->get_value( "type" ) eq 'qut_thesis') {
    # Theses, including QUT theses, use the institution as the publisher
    push @dcdata, [ "publisher", $eprint->get_value( "institution" ) ] if( $eprint->exists_and_set( "institution" ) );
} else {
    push @dcdata, [ "publisher", $eprint->get_value( "publisher" ) ] if( $eprint->exists_and_set( "publisher" ) );
}

This is a basic example - I just make sure that qut_thesis DC is the same as the thesis representation. But the MACAR resource types don't include qut_thesis (funnily enough) and we want to make sure that our DC is what external readers/harvesters expect. So, the DC.pm file changes the DC type to Thesis for qut_thesis types:

if( $eprint->exists_and_set( "type" ) )
{
    # We need to map the types to that of MACAR
    # But only if the text displayed in the eprint_types.xml
    # phrases file does not match MACAR
    my $type = $eprint->get_value( "type" );
    if ($type eq 'qut_thesis') {
        push @dcdata, [ "type", "Thesis" ];
    } else {
        push @dcdata, [ "type", EPrints::Utils::tree_to_utf8( $eprint->render_value( "type" ) ) ];
    }
}

So, if you visit a sample eprint, and select the Export Dublin Core link you'll see that the metadata indicates that the document is a Thesis. qut_thesis is only interesting to QUT so we keep it out of the Dublin Core.
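If you end up mapping more than one type, a lookup table is easier to maintain than a chain of if/else tests. The following is only a sketch in plain Perl, not code from DC.pm: the %EXTERNAL_TYPE hash and the dc_type_for function are names I've made up for illustration, and the 'article' key is an assumed internal type name. Only the qut_thesis to Thesis mapping is the one we actually use.

```perl
use strict;
use warnings;

# Hypothetical table mapping internal eprint types to the names
# that external harvesters expect. Only qut_thesis => Thesis is
# taken from our setup; the 'article' entry is illustrative.
my %EXTERNAL_TYPE = (
    qut_thesis => 'Thesis',
    article    => 'Journal Article',
);

# Return the external name for a type, or fall back to the label
# that would otherwise be rendered from the phrase file.
sub dc_type_for {
    my ( $type, $rendered_label ) = @_;
    return exists $EXTERNAL_TYPE{$type} ? $EXTERNAL_TYPE{$type} : $rendered_label;
}

# Build DC pairs in the same [ name, value ] shape that DC.pm uses.
my @dcdata;
push @dcdata, [ 'type', dc_type_for( 'qut_thesis', 'QUT Thesis' ) ];
push @dcdata, [ 'type', dc_type_for( 'book',       'Book' ) ];

print join( ': ', @$_ ), "\n" for @dcdata;
```

Inside DC.pm the second argument would be the tree_to_utf8 value shown earlier, so unmapped types keep whatever your phrase file calls them.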

Nearly there...

The problem is that qut_thesis represents theses from QUT, and these are what the ADT system harvests. To see how the harvester can pick them out, you have to look at how eprints delineates OAI-PMH sets. Basically, eprints does a search on various fields such as type, subject and ispublished, and only then formats the DC metadata for the matching records. The code above only changes the output at this latter step, so the set search still works on the internal type value. Check your archive's cfg/cfg.d/oai.pl to see the sets being defined.
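From the harvester's side, picking out a set is just a matter of adding a set argument to the request. Here's a rough sketch that builds a ListRecords URL in plain Perl. The verb, metadataPrefix and set arguments are standard OAI-PMH, but the base URL and the set spec below are placeholders, and list_records_url is a made-up helper. Use a ListSets request against your own repository to find the real set specs.

```perl
use strict;
use warnings;

# Build an OAI-PMH ListRecords request URL.
# The base URL and set spec are supplied by the caller; the values
# used at the bottom of this script are placeholders only.
sub list_records_url {
    my ( $base_url, $set_spec ) = @_;

    my %args = ( verb => 'ListRecords', metadataPrefix => 'oai_dc' );
    $args{set} = $set_spec if defined $set_spec;

    my @pairs;
    for my $key ( sort keys %args ) {
        # Percent-encode everything outside the URL unreserved characters.
        ( my $value = $args{$key} ) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
        push @pairs, "$key=$value";
    }
    return $base_url . '?' . join( '&', @pairs );
}

# Placeholder repository address and set spec.
print list_records_url( 'http://repository.example.org/cgi/oai2', 'example-set' ), "\n";
```

Fetching that URL in a browser is a quick way to check that a set returns the records you expect and that the dc:type values look right.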

Check out those import and export plugin folders. If you define your own types and properties you'll need to make sure that the import/export matches your structures. Only enable the plugins that you know work.

There are better ways for EPrints to do this - e.g. provide an archive mapping system for import/export. However, it doesn't and the work isn't that difficult. Besides, that's part of the strength of a working open source product.

Other PMH Stuff

It's also important to make sure that you have a metadata policy so that harvesters know what they can do. OpenDOAR provides a policies tool to make this easy - it even exports an eprints-compliant file that you can then put in your archive's cfg/cfg.d/oai.pl file. For examples of this, see the QUT ePrints Policy and the one at UTas.

Once you're ready to go, check out the Open Archives Initiative site and register as a provider. If anything, it gives you a tick that your PMH output conforms. If you get the tick you'll be added to the list of registered providers. It's quite painless - except for getting people to decide on the actual policy.

* I'm often told "we can't do that - it's hacking the source code". I take this comment to mean that Open Source means $free or that the project has no future capacity to maintain the system. This might set off your alarm bell and you really need to read Just say no to maverick-manager jobs by Dorothea Salo. Consider this - software such as Peoplesoft and ResearchMaster are often "customised" - then again they also get a squadron of full-time staff. Why isn't your IR getting similar attention?
