Tuesday, March 17, 2009

Desktop Fascinator: File Synch

We've started work on the Desktop Fascinator and it's coming together well. Oliver has managed to grab a snapshot of the filesystem, put it into Fedora and index it with Solr. Hopefully we'll have something up soon for people to check out and comment on.

For now, though, I've turned my attention to the part of the system that reads the filesystem. The Fascinator uses harvesters to grab data from various places via various means and put the object/metadata into Fedora. For example, we have ORE and PMH harvesters to schlurp (technical term) up repository data. The current filesystem harvester basically takes a snapshot of the filesystem and loads metadata into Fedora. We don't make a copy of the file in Fedora as we're expecting the files to get quite large and don't want to replicate them within the desktop.

The main goal is to pick up what the user has in their directories and give them a more expansive (metadata/tags/etc) view of it. This means that the user can continue to use the filesystem and their preferred apps. It also means that we have to keep up with the filesystem state.

The first thought was to poll the filesystem but that is rather intensive. Luckily, one of our team members, Linda, has done a thesis that covers the alternatives and, with some quick research, I located some Python options for Windows/Linux. I'm not certain how this works on OS X so I'll have to get one of the developers to test it. Using Python on each platform gives us a common code base, and we can wrap the code up as a service that logs filesystem goings-on - the FS Watcher Service. Thinking further, The Fascinator may not be running whilst the FS Watcher is doing its thing, so we can push each event into a queue to be consumed once The Fascinator is running.
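To sketch the queueing idea in Python: platform watchers record events into a durable queue, and The Fascinator drains it whenever it happens to be running. The names here (EventQueue, record_event, drain) are illustrative, not the actual service's API.

```python
import json
import os
from collections import deque

class EventQueue:
    """Filesystem events persisted to disk so they survive the
    FS Watcher or The Fascinator not running at the same time."""

    def __init__(self, store_path):
        self.store_path = store_path
        self.events = deque()
        if os.path.exists(store_path):
            with open(store_path) as f:
                self.events = deque(json.load(f))

    def record_event(self, action, path):
        # A platform watcher (e.g. inotify on Linux) would call this
        # for each create/modify/delete it observes.
        self.events.append({"action": action, "path": path})
        self._save()

    def drain(self):
        # The Fascinator consumes all pending events once it is up.
        pending = list(self.events)
        self.events.clear()
        self._save()
        return pending

    def _save(self):
        with open(self.store_path, "w") as f:
            json.dump(list(self.events), f)
```

Because every event is flushed to disk, a restart of either side just picks up where the queue left off.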

The team laughs at my diagrams so I like to make sure I include one:

Diagram to match filesystem watcher description

There may still be an issue with staying current with the filesystem state. Events may be lost (if the service dies for some reason), so we might still need the scanning system in case a disconnect occurs between the filesystem and the repository. This would potentially be something the user can run when they're not finding their file in The Fascinator.
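That fallback rescan could be as simple as diffing a stored snapshot against the current directory tree. A minimal sketch (the real harvester records into Fedora; here a "snapshot" is just a dict of paths to modification times):

```python
import os

def take_snapshot(root):
    """Map each file path under root to its last-modified time."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            snapshot[path] = os.path.getmtime(path)
    return snapshot

def diff_snapshots(old, new):
    """Work out what changed between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, modified
```

Anything in added/removed/modified gets pushed through the same event queue, bringing the repository back in step with the filesystem.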

OAI-PMH in ePrints

I was chatting with Peter Sefton today about OAI-PMH and resource type names so I thought I'd get back to documenting some ePrints work.

So, configuring OAI-PMH for ePrints is straightforward. Within your archive's folder you'll find cfg/cfg.d/oai.pl. You can make some basic alterations such as defining sets and setting up your metadata policy.

Actually deciding how to represent your PMH data is important. For repositories in Australia, the NLA's Arrow Discovery Search provides an interesting angle on this. Just because your repository can be harvested doesn't mean it's interoperable. You should try to see if there's a naming scheme for resource types that others around you are using. Check out Peter Sefton's blog post for more in-depth commentary on this.

So, what do you do if internally you want to define a resource type "Article" but the outside world wants to know it as "Journal Article"? Well, for one, you can use the phrase files to call it whatever you want for users of the system. So if ePrints internally calls it "Paper", your phrase file can call it "Article" to make data submission less confusing.
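For illustration, such an override lives in your archive's language phrase files and looks roughly like this - the exact phrase id pattern (eprint_typename_*) is typical of ePrints 3, so check an existing phrase file in your install rather than copying this sketch verbatim:

```xml
<!-- In a phrase file under your archive's cfg/lang/en/phrases/ directory -->
<!-- (id pattern is a sketch; verify against your ePrints version) -->
<epp:phrase id="eprint_typename_paper">Article</epp:phrase>
```

Users then see "Article" in deposit forms while the stored type value stays "paper".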

But, if you want to change the way that the metadata is presented to systems such as OAI-PMH, you need to look at how ePrints deals with Dublin Core. Basically, the OAI-PMH data contains the metadata in DC. Each eprint page also puts the DC into the page's meta tags.
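To make the target concrete, the DC payload inside an OAI-PMH GetRecord response looks roughly like this (the field values here are invented for illustration):

```xml
<record>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>A Sample Thesis</dc:title>
      <dc:type>Thesis</dc:type>
      <dc:publisher>Queensland University of Technology</dc:publisher>
    </oai_dc:dc>
  </metadata>
</record>
```

It's the dc:type element that the DC.pm edits described here control.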

Under perl_lib/EPrints/Plugin/Export you'll find DC.pm. This is an important file if you define your own resource type (with non-inbuilt fields), as ePrints won't know about those fields and you'll have to adapt files such as DC.pm to output your resource type's metadata correctly. Look at the convert_dataobj procedure and you'll see how fields are put into DC format.

So, we created a qut_thesis type to indicate QUT-based theses for collection within the ADT. Now, ePrints doesn't know anything about qut_thesis, so we needed to edit DC.pm:

if ($eprint->get_value( "type" ) eq 'thesis' || $eprint->get_value( "type" ) eq 'qut_thesis')
{
    push @dcdata, [ "publisher", $eprint->get_value( "institution" ) ] if( $eprint->exists_and_set( "institution" ) );
}
else
{
    push @dcdata, [ "publisher", $eprint->get_value( "publisher" ) ] if( $eprint->exists_and_set( "publisher" ) );
}

This is a basic example - I just make sure that the qut_thesis DC is the same as the thesis representation. But the MACAR resource types don't include qut_thesis (funnily enough) and we want to make sure that our DC is what external readers/harvesters expect. So, DC.pm changes the DC type to Thesis for qut_thesis types:

if( $eprint->exists_and_set( "type" ) )
{
    # We need to map the types to that of MACAR
    # but only if the text displayed in the eprint_types.xml
    # phrases file does not match MACAR
    my $type = $eprint->get_value( "type" );
    if ($type eq 'qut_thesis')
    {
        push @dcdata, [ "type", "Thesis" ];
    }
    else
    {
        push @dcdata, [ "type", EPrints::Utils::tree_to_utf8( $eprint->render_value( "type" ) ) ];
    }
}

So, if you visit a sample eprint and select the Export Dublin Core link, you'll see that the metadata indicates the document is a Thesis. qut_thesis is only interesting to QUT, so we keep it out of the Dublin Core.

Nearly there...

The problem is, qut_thesis represents theses from QUT and these are harvested by the ADT system. To handle that, you have to see how ePrints delineates OAI-PMH sets. Basically, ePrints runs a search on various fields such as type, subject and ispublished, and then formats the DC metadata. The code above only changes the output at this latter step. Check your archive's cfg/cfg.d/oai.pl to see the sets being defined.

Check out those import and export plugin folders. If you define your own types and properties, you'll need to make sure that the import/export matches your structures. Only enable the plugins that you know work.

There are better ways for EPrints to do this - e.g. providing an archive mapping system for import/export. However, it doesn't, and the work isn't that difficult. Besides, that's part of the strength of a working open source product.

Other PMH Stuff

It's also important to make sure that you have a metadata policy so that harvesters know what they can do. OpenDOAR provides a policies tool to make this easy - it even exports an ePrints-compliant file that you can then put in your archive's cfg/cfg.d/oai.pl file. For examples of this, see the QUT ePrints Policy and the one at UTas.

Once you're ready to go, check out the Open Archives Initiative site and register as a provider. If anything, it gives you a tick that your PMH output conforms. If you get the tick, you'll be added to the list of registered providers. It's quite painless - except for getting people to decide on the actual policy.

* I'm often told "we can't do that - it's hacking the source code". I take this comment to mean that people think Open Source means $free, or that the project has no future capacity to maintain the system. This might set off your alarm bells, and you really need to read Just say no to maverick-manager jobs by Dorothea Salo. Consider this - software such as PeopleSoft and ResearchMaster is often "customised" - then again, they also get a squadron of full-time staff. Why isn't your IR getting similar attention?

Thursday, March 5, 2009

(e)Research monkey on my back

Many years ago I was studying an MTeach/MEd, having dropped out of full-time coding and into full-time study. I lived in a share house and owned a vintage Dell laptop. Having done my BIT I admit that I never used the Library and did almost no written assignments. Studying education called on me to access articles, read them and try to turn them into a paper. Sometimes I even had to undertake observations. Most of the "research" was qualitative.

So, armed with a stack of highlighted papers I drew up (paper) concept maps and started to flesh out a piece. I used emacs and LaTeX with BibTeX in that first year - it was fun but really relied heavily on my technical background to keep it afloat. For my MEd I had a newer PC so gave EndNote a go. By the time I'd set it up to understand that APA should mention that the article was on the web, I'd left that product.

Now, I don't pretend that my heady coursework days can match your 5 year research effort - nuh-uh. But what I can say is that, if I nearly threw my laptop across the room just trying to create 5000 words, I can only imagine how a doctorate feels.

There has to be a better way. Really.

It also has to be made better for people not hitting the whoa-o-meter with their project. That's people like educators and historians. In fact, anyone whose research requires them to collate data that amounts to something less than a large European dodgem circuit. It really seems that, if you're not munching petabytes, no-one wants to share your lunch. University ICT teams give you just enough storage to hold a picture of your cat, and many eResearch data people are looking for that bigger bang.

So you store all your data on your laptop and a few disks and roll the dice.
Investigating Data Management Practices in Australian Universities and The Next Generation of Academics really showed me some truths:
  • Researchers don't have time to play with their computers and eResearch tools: they just want them to work.
  • Researchers aren't cataloguers: they don't want to create comprehensive metadata for everything they're reading/watching/creating.
  • Researchers don't run data centres: they want institutional storage and backup so that they don't have to think about it.
  • Researchers (often) work in teams: let them share.
  • If your eResearch idea will create more administrative work for researchers, go back to the drawing board. I hope repository admins are listening.
  • It's not about the software - it's about the research getting done.
So, whilst at QUT, we worked really hard to integrate our ePrints data collection with the Research Master (HERDC) data collection. This essentially sought to stop doubled-up administrivia: putting your data into ePrints also meant you were largely killing two birds with one stone. I really hope this work has been effective. I also really hope that the stand-alone job of submitting to an IR will become as forgotten as the night cart.

So there's a lot of work to be done to create software that helps rather than hinders and workflows that flow, rather than fail.

Leaving the ramble here - ready to hone these ideas into the Desktop eResearch Revolution.