Tuesday, December 1, 2009

London Trip (Part 1)

I headed off to London on Saturday. The first flight was an uneventful 14 hours from Brisbane to Dubai in an Emirates A340 - uneventful is a good thing.

The next leg was the 8 hours from Dubai to London and my first flight in an Emirates A380. It's a big plane (you knew that) and had 3 bridges for boarding. I was at the back of the plane on the first floor and it was obvious that these flying beasts take a long time to board.

I was always skeptical of some of the airline promos going on - you know, with the idea of bars/lounges, meeting areas and indoor cricket. In reality, the seats and in-seat entertainment screens were slightly wider than those on the A340. Food also took a while to serve - on a longer flight I imagine that the rear seats must be getting breakfast as lunch is being served at the front.

The underground trip from Heathrow to Gloucester Road started off well, with a lady giving me a day-pass as she didn't need it anymore. The hotel proved to be close to the station and the room is quite pleasant. Too easy!

So, with only spasmodic naps under my belt from the past 36 hours, I headed out for a quick trip to the London Science Museum. It was a cold and rainy Sunday and, you guessed it, the place was full of kids. I don't mind kids (especially those that are stunned by trains and atmospheric engines) and the museum is now free (unlike the last time) so it was all fine. I got a few shots of Puffing Billy, Rocket and the Energy Hall before heading off.

Dinner was at Garfunkel's. I'd never tried it and won't try it again.

Next morning was an early awakening - 4am. After moderating myself at the breakfast buffet I headed out. Yup, out into the wind and rain - it reminded me of London :) I only got so far before stopping in for a coffee and a quick dry...


My main target was the London Transport Museum to see the displays about the early Underground. It's a nice museum and worth the visit. I was happy to see some stuff on Marc Brunel's tunnelling shield as well as a fair bit on the early underground steam trains. Frankly, I have no desire to have experienced a form of travel that would feel like sucking on the household chimney. Funny thing is, they banned smoking from the underground whilst steam was still in use...

I topped the day with a roast beef dinner and pint at the pub down the road. Happy to see yorkshire pud but wondered if I'd bought the cask's last pint of London Pride. It was about as lively as my old long-term unemployed flat mate. I'll have to find someplace with more than 3 taps...

No pictures sorry - I didn't bring the cable for my camera. D'oh.

PS Love the UK's Yesterday channel's close signal: "Yesterday will return at 6am"

Friday, November 20, 2009

Fascinator 0.3.0 and 2009 in review

Well, Linda has just released Version 0.3.0 of The Fascinator and I thought I'd take the opportunity to sit back and reflect on our work in 2009.

So, to keep it short, here's my Top 5 "big things" from our Fascinator work this year:

1. Complete Fascinator redesign
We took a look at The Fascinator as it had been built for the ARROW project and started to reconceptualise it as a desktop system. Our work has led us to take on a plugin notion throughout the system, giving us the flexibility to plug in different harvesters, storage layers and transformers.

Maybe calling it a desktop eResearch system is a misnomer. We are certainly working to ensure that it runs on the desktop in a friendly manner but it should also scale so that it can be the faculty eResearch system, the institutional eResearch system and so on. eResearch - isn't it all about data and collaboration?

2. Advancing our Agile approach
The team (esp. Linda) has been working to fulfill more of the Agile approach. We're getting a lot better at scoping specific releases and actually releasing them. We're also finding a good balance in our documentation and knowledge sharing, with the idea that all the developers can work across the system.

The hope now is to draw a line at some point near Easter and get The Fascinator 2.0 out there.

3. Embracing Maven

Oliver had utilised Maven in the original Fascinator and we continued this work. It's quite a learning curve but investment in time has meant that we can easily get instances of TF up and running and (hopefully) be more open to external developers wanting to take a crack at the code. Complementing this is our Nexus repository which allows us to manage dependencies without chunks of doco and frustration.

This aspect is still developing and we hope to have a Continuous Integration service (Hudson) up and running in the new year. This will allow us to release daily snapshots, keep a live Maven site constantly updated and allow for Darth Tater to be quickly relocated to anyone that breaks the build.

4. Working with RDF
This proved a steep learning curve. We picked up RDF2Go and RDFReactor on the back of implementing the Aperture system. As we started to develop new harvesters and indexing rules we found the need to read/write RDF. I even got in and developed an RDF Reactor plugin for Eclipse with the hope of easing the development of my long overdue feed reader plugin.

The area of RDF development still has a long way to go before it matches the abstraction provided to RDBMS developers.

We're also getting a grip on SPARQL and may even have a triple store running at the back of TF sometime in 2010.

5. Presenting Fascinator
Well, this was big for me - I presented at eResearch Australasia 2009 and had a room full of people.

I'm off to the UK for the DCC and IEEE eScience conferences in the first half of December. I'm looking forward to meeting a raft of new people and (hopefully) showing off our work over a pint.

Importantly, it's not just me or Peter working on this stuff - Bron, Linda, Oliver, Ron and Cynthia have been developing, designing, planning and dealing with our flights of fancy (including harried IMs from conferences). Thanks guys!

Monday, November 16, 2009

eResearch Australasia 2009

Well, it was a busy week last week as Peter Sefton and I attended eResearch Australasia 2009. This post represents my report on the main items of interest for our development work - it's not a blow by blow account.

I'm told that slides and video will be available online shortly.

Monday (Workshops)
Two workshops on Monday. The first was "Tools and Technologies for the Social Sciences and Humanities" and focussed on the ASSDA (Australian Social Science Data Archive). Main points of interest were:
  • ASSDA is working to incorporate more qualitative research artefacts
  • Work is being done to provide quantitative analysis tools via the Nesstar tool
  • The Historical Census and Colonial Data Archive discussed the difficulty of digitising older texts (in this case fiche) - a fair bit of manual work is involved, esp. for tabular data. This work was outsourced to India.
The afternoon saw me trying out R, a data mining package. It was interesting, if a little out of my normal mode of operation.

Tuesday (Conference)
The presentation on the Black Loyalist repository was an interesting look at a project that took historical documents and attempted to map the lives of little-known slaves in the US. Of interest to me was the user interface, which provides timeline, map and network visualisations that help you discover an individual's movements and relationships, all backed up through links to the original sources. Furthermore, the team is working to crowdsource the effort by allowing others to comment and contribute. Behind the scenes is the timeline from SIMILE Widgets; I'm not sure where the network map is from.

Mitchell Whitelaw's visualisations of archival datasets was very interesting. Of note was the A1 Explorer which provides a tag cloud that would be really interesting to see within the Fascinator. See http://visiblearchive.blogspot.com/

I presented The Fascinator and that seemed to go well. I really feel that we're working on "new" stuff here and was encouraged by people's interest in the project and the various technologies we've been utilising.

Wednesday (Conference)
I attended Anne Cregan's introduction to Linked Open Data in the morning and a BOF by Peter Sefton, Anna Gerber and Peter Murray-Rust on the same topic in the afternoon. Peter describes the BOF in his blog. I'll only add that the W3C's Media Fragments work was mentioned and this looks to provide a method for linking to video segments. I haven't looked into this standard (yet) and am interested in how it relates to SMIL.

Rob Chenrich's presentation on the Atlas of Australia was a good look at Danno, an RDF-based annotation server for text and images. It's completely browser-based and I'm really interested in setting this system up on my PC and annotating my local Fascinator. Now, if we could annotate media fragments....

Thursday (Conference)
The sessions were generally informative but not specifically related to our work. I did enjoy the text mining session by Calum Robertson. With the NLA putting newspapers online it would be interesting to mine old news to find emerging patterns. Specifically for The Fascinator, text mining could bring out patterns in content such as interviews.

Friday (Workshop)
This was an eResearch Project Management session that covered a lot of stuff in a few hours. It was a generally OK session but a lot of our work is of a size where I feel a weighty PM approach would slow us down. I'm a big fan of our work in the Maven space and our ongoing work to refine our development practice. I can see the need to scope our projects to a reasonably formal level but, beyond that, PM starts to dominate the actual work.

Wednesday, November 11, 2009

The Fascinator @ eResearch Australasia

Well, I gave my first substantial conference presentation today and, although I felt really nervous, I'm told it didn't show. The presentation ran for 15 minutes with 5 on top for questions.

I was asked 2 questions - one by Andrew Treloar on the design and another by Jim Richardson about tagging. Andrew asked about the code model and I thought I'd add some extra clarification: the whole system is a plugin model so already runs as components. What it lacks, however, is an asynchronous, parallel form of communication. So, when you harvest a file, the system will transform and index in serial and then look at the next file. What we want to do is move towards a message queue system (e.g. RabbitMQ or Apache ActiveMQ) that allows the system to break up and spread out work like transformation. This is very useful when you hit a 1Gig video that you want to transform to flv. Time, however, is always challenging us...
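To make the idea concrete, here's a minimal sketch of that queue-based decoupling using Python's in-process queue and worker threads as a stand-in for a real broker such as RabbitMQ. The file names and the fake "transform to flv" step are purely illustrative, not The Fascinator's actual API:

```python
import queue
import threading

jobs = queue.Queue()

def harvest(paths):
    """The harvester only enqueues work instead of transforming inline."""
    for path in paths:
        jobs.put(path)

def worker(results):
    """Consume files from the queue: 'transform' then 'index' each one."""
    while True:
        path = jobs.get()
        if path is None:           # sentinel: shut this worker down
            jobs.task_done()
            break
        rendition = f"{path}.flv"  # stand-in for a real transformation
        results.append(rendition)  # stand-in for indexing
        jobs.task_done()

results = []
workers = [threading.Thread(target=worker, args=(results,)) for _ in range(4)]
for w in workers:
    w.start()

harvest(["clip1.mov", "clip2.mov", "interview.wav"])
for _ in workers:
    jobs.put(None)   # one sentinel per worker
jobs.join()          # wait until every enqueued item is processed
```

With an external broker, the workers could sit on other machines, which is exactly what you want when a 1Gig video turns up.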

Jim's question was handy as I had forgotten to show off the tagging system. We're using the CommonTag schema (http://commontag.org/) so can point to endpoints. We're currently creating a user endpoint using their email as the URI but hope to have you linking to ontologies and places like dbPedia soon(ish).

On the tagging front, I'd like to see us build an ontology/taxonomy/thesaurus builder. This may be based on SKOS and will allow the user to create their own thesaurus. For example, in our current work, Leonie could create a list of participants for use in tags. Peter's also interested in hierarchical tagging (e.g. people/duncan) that doesn't require you to define anything formally. With this data we could create at least a basic SKOS for the user at publication time.
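The hierarchical-tag idea is simple enough to sketch. This is only an illustration of deriving broader/narrower relations from tags like people/duncan, not the actual SKOS or CommonTag vocabularies:

```python
def skos_from_tags(tags):
    """Derive SKOS-style broader/narrower pairs from hierarchical tags.

    A tag like 'people/duncan' implies a concept 'duncan' whose broader
    concept is 'people'. Returns a dict: concept -> set of broader concepts
    (empty set for top-level concepts).
    """
    broader = {}
    for tag in tags:
        parts = tag.split("/")
        for parent, child in zip(parts, parts[1:]):
            broader.setdefault(child, set()).add(parent)
        broader.setdefault(parts[0], set())  # roots have no broader concept
    return broader
```

At publication time, a structure like this could be serialised as skos:broader/skos:narrower statements without the user ever defining a thesaurus formally.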

At some point in the near future (once it's been cleared) you'll be able to check out the slides via USQ ePrints: http://eprints.usq.edu.au/6090.

Wednesday, October 14, 2009

Thoughts on the ANDS Roadshow

On October 29 and 30 I attended the Australian National Data Service (ANDS) Roadshow. The primary topics were:
  • Research data policy and the Australian Code for the Responsible Conduct of Research

  • ANDS software services
This post is a bit of my thinking about the various topics discussed rather than a didactic report back.

Identify my Data

The ANDS Identify My Data service allows you to persistently identify your data.
The primary "marketing" around handles is that it provides an identifier that points to the underlying resource. You can move this resource, update the handle and people using the handle will always be able to get to the document (or an explanation as to why it doesn't exist).

The W3C talks about "Cool URIs" (http://www.w3.org/Provider/Style/URI) and, in essence, a repository manager that tends their URL garden according to the notion of "cool" will be achieving this availability aspect of handles anyway.

Let's look at a USQ ePrint: http://eprints.usq.edu.au/5259/. This was written by my boss, Peter Sefton, and Oliver Lucido, a member of our coding team. The URI is pretty "cool" as it doesn't specify a file name/extension. The domain name may be argued as not very "cool" as the "eprints" part could be taken to refer to the ePrints software from Southampton. However, "eprints" here actually refers to the notion of a research document (see http://en.wikipedia.org/wiki/Eprint).

We don't use handles in USQ ePrints so what would happen if we moved to a new server or even a new software platform? Well, we'd dump out the data and move it over to the new instance. We know what the old URL is and any decent platform should indicate the new URL (or at least identifier). So we'd then write an Apache Rewrite rule or even a small script that maps to the new location.
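That "small script" really is small. Here's a hedged sketch as a Python WSGI app sitting in front of the new platform; the ID-to-URL map and target host are made up for illustration:

```python
# Map of old eprint paths to their new homes (invented examples).
MOVED = {
    "/5259/": "https://new.repository.example/view/5259",
}

def redirect_app(environ, start_response):
    """Minimal WSGI app: 301 to the new location, 404 if we don't know it."""
    target = MOVED.get(environ.get("PATH_INFO", ""))
    if target:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Gone, and no forwarding address."]
```

In practice you'd more likely generate an Apache RewriteMap from the migration data, but the logic is the same lookup either way.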

If we did have handles we'd just have to add something to our migration script that updated the handle at the same time.

Thinking from the user end of the process though, say we did have a handle for http://eprints.usq.edu.au/5259/. When Peter created his eprint he'd open the page (either via the eprint URL or a handle) and in the URL box he'd see....
http://eprints.usq.edu.au/5259/. So what do you think he'd email people, put on his powerpoints or display in his staff profiles? Chances are it's http://eprints.usq.edu.au/5259/.

So all this basically means is that http://eprints.usq.edu.au/5259/ should be "cool" and always resolve to either the metadata page or a page explaining why it has disappeared. In effect, any good repository manager is already making the decision to keep the URL operating for a good stretch of time. If they also had handles they would have to update those as well.

So I don't see that handles really change this aspect of data management. I see a more tangible role for them in the later discussion.

Register my Data

The ANDS Register My Data service allows you to register collections of research materials.
The current state of institutional repositories would indicate that people register metadata in their home institution rather than just defer to an entry hosted elsewhere. For example, a USQ researcher and a QUT researcher that write an article together will each submit a copy to their institutional repository. Why? for a variety of reasons - reporting, staff profiles, different policies etc. This was a concern echoed by QUT's Lance DeVine at the ANDS Roadshow.

Each repository is harvested by various sources (e.g. OAIster and the NLA's ARO service). The big international harvesters don't scrub the data so you'll often see repeated entries. The NLA does look for matches and only displays one entry for the publication. I checked this with Natasha over at the NLA and this matching is done on identifiers, not titles so using a handle in your local repository could provide an aggregator with the ability to match across multiple repositories.

But whose metadata wins? What if the researchers disagree on some of the metadata?
This may not seem to be an issue but start to think about 5, 10, 20 years into the future. So a good idea put forward by ANDS is that the research team decides who will manage the data from the outset. This person will register and manage the handle/identifier and take on the role of ensuring that data/metadata are available for a reasonable length of time. This person manages what I'll call the authoritative metadata.

But what we have seen in institutional repositories will, I suspect, re-occur in data repositories. People will start to put data (or at least metadata) into their own local repositories. This may be due to local policies (we want to have any data you've worked on), local access (2Tb is a large amount to download each time you have a new postgrad student) or even for peace of mind. This creates a concern when this non-authoritative metadata is harvested by ANDS - it creates duplication that can start to affect the usefulness of search results.

The easiest option is to only put authoritative metadata into any dataset repository being offered up to the ANDS harvester. This means you're not supplying potentially redundant information, but it also means you may need to run more than one repository.

Another option is to have 2 OAI-PMH sets - one for authoritative metadata and the other for "local" metadata. I think maybe the term canonical comes into play here.

A further option is to look at the data/metadata in a FRBR fashion. The notion of the dataset is a Work and all local instances are an Expression/Manifestation/Item - whichever makes sense. The Work could be denoted by a persistent identifier (e.g. a handle) that is referred to by each local instance. This would mean a change to RIF-CS, so an easier model could be the one described to me by StJohn Kettle, covered in the next section.

Handles as Works

If we create a handle we can do more than just create a simple link-through. A handle would effectively be the identifier for that Work (a la FRBR).

Let's say QUT and USQ start a collaborative research effort. We agree that QUT will manage the data and create a handle to point to their dataset's metadata. They take on responsibility to look after the metadata and the handle. USQ, however, might have a policy that they hold a local copy in their own data repository.

So, we might end up with ANDS harvesting the same data from QUT and USQ. However, with a "tweak" their harvester could say "for every handle I get I am going to determine the primary link and only index that".
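That "tweak" might look something like the following sketch. The record shape here (handle, source, primary flag) is invented for illustration; a real harvester would work from RIF-CS or OAI-PMH records:

```python
def dedupe_by_handle(records):
    """Keep one record per handle, preferring the one flagged as the
    primary location; the first harvested record wins otherwise."""
    by_handle = {}
    for rec in records:
        current = by_handle.get(rec["handle"])
        if current is None or (rec.get("primary") and not current.get("primary")):
            by_handle[rec["handle"]] = rec
    return list(by_handle.values())
```

So if QUT's record is marked primary, the USQ copy of the same handle simply never makes it into the index.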

The main problem is that the Handle points to the metadata location and not the OAI-PMH record for the metadata. However OAI-PMH identifiers can be any URI so, potentially, the identifier could be the handle URI. I need to think more on this one.

What we also get is a distributed data access system - if the handle system can't find the primary record it will indicate other possible locations.

Classify my data

Briefly, ANDS is working on a system to provide "end points" to useful thesauri such as the ANZSRC. This means that we can eliminate the problems faced by the NLA ARO system of inconsistent resource type and subject naming. This may also provide a platform for institutional repositories to normalise their labels.

I don't want to go into it here as it's all a work in progress and I don't want to send you down the wrong path. However, I will say that this work will be useful to watch with regards to repository software as the service and vocabs will provide a handy central reference. I'll also say that my enquiries about the service were quickly and comprehensively answered by ANDS team members.

Monday, June 22, 2009

Staying on Trac

I admit it - I am growing tired of Trac wiki. It's probably our fault - we're using an old version - but I get the feeling it's a bigger problem than that. I work with a University-based technology team that tries to keep the following things together:

  • Organisational news and info

  • Project information

  • Source code

  • Job Tickets

  • Server details

  • Documentation

  • Publications

  • Procedures and Policy

We've got a range of skills on our team but fluency in obscure wiki markup is not a precondition of employment. What's more, we've ended up with several sites running Trac or ICE, which makes learning where to put stuff rather onerous.

My thought is to start trying to bring this stuff together but the question is how. In a previous job I created a controlled vocabulary within Confluence Wiki to bring together reports and project info but source code didn't really come into the equation. XWiki might be a good open alternative but there'd be some coding to do.

I've also been picking up on Maven and see that it could provide a good basis for the coding side of things but that doesn't help non-technical staff.

For presenting the content we could use The Fascinator to harvest from all of our sources and present (mash) it in a variety of combinations (public, developer or manager). That still leaves us with lots of entry points.

So I have some leads but nothing solid (yet). Ideas welcome.

The Fascinator 2

The team found itself with a little bit of breathing space this past week or so and we focussed on developing The Fascinator Desktop. There was a fair bit of whiteboard time with Peter early on and the coding began. Call it agile or whatever, a team sharing design issues whilst developing components just seems to get their stuff together better than a highly pre-spec'd system.

So, what did we achieve? Well:

  • Linda got Watcher up and running - even despite a moving goal.

  • Ron and Oliver worked on creating a storage API to allow us to test against Fedora or CouchDB

  • Bron and I created components to consume the Watcher queue and extract metadata and full text via Aperture.

  • Linda created a transformation API to convert files into a variety of renditions.

This gives us a tool chain where we:

  1. Watch your system for file changes

  2. Extract the metadata and fulltext from the file

  3. Transform your various file types to renditions such as html and pdf

  4. Store the data in a repository

From this point, we can lay The Fascinator search engine over the top and give you a faceted search of your files. It's not all there yet - we need to finish off some of the storage work and get it all tied together - but here's hoping that the end of the week brings version 0.1 of The Fascinator Desktop!
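The four-step tool chain above can be sketched as plain functions. Everything here (names, the stubbed extraction, the dict standing in for a repository) is illustrative rather than The Fascinator's actual API:

```python
def extract(path):
    """Step 2: pull out metadata and full text (stubbed)."""
    return {"id": path, "fulltext": f"text of {path}"}

def transform(record):
    """Step 3: add renditions such as html and pdf (stubbed)."""
    record["renditions"] = [record["id"] + ".html", record["id"] + ".pdf"]
    return record

def store(record, repository):
    """Step 4: put the record into a repository (a dict stands in here)."""
    repository[record["id"]] = record

def on_file_changed(path, repository):
    """Step 1 is the Watcher; this is what it calls per changed file."""
    store(transform(extract(path)), repository)

repo = {}
on_file_changed("notes.odt", repo)
```

The faceted search then just indexes whatever lands in the repository.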

My admission from the week: I must integrate unit tests into my development approach.

Monday, June 15, 2009

RDF and mod_rewrite

I was reading the Best Practice Recipes for Publishing RDF Vocabularies and looking for an easy way to provide HTML and RDF on my site. At the moment I have (very) limited RDF but I wanted something that would give me cool(ish) URIs automatically. Basically, a system that works out whether to serve an HTML or RDF file depending on the request.

So, http://duncan.dickinson.name/card should give you:

  • card.html if you're a web browser

  • card.rdf if you want semantic data

This stretched my mod_rewrite skills but the following seems to work:

# Turn off MultiViews
Options -MultiViews -Indexes
DirectoryIndex card.html index.html index.htm index.php

# Directive to ensure *.rdf files served as appropriate content type,
# if not present in main apache config
AddType application/rdf+xml .rdf

# Rewrite engine setup
RewriteEngine On
RewriteBase /

#Check if an RDF page exists, and return it
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteCond %{REQUEST_FILENAME}.rdf -f
RewriteCond %{REQUEST_URI} !^/.*/$
RewriteRule (.*) $1.rdf [L,R=303]

#Provide a default RDF page
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteCond %{REQUEST_URI} ^/$
RewriteRule .* /card.rdf [L,R=303]

#Provide the HTML for the request
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteCond %{REQUEST_URI} !^/.*/$
RewriteRule (.*) $1.html [L,R=303]

#provide the PHP page for the request
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteCond %{REQUEST_URI} !^/.*/$
RewriteRule (.*) $1.php [L,R=303]

Using some help I got curl looking at my site's rdf:

curl -H "Accept: application/rdf+xml" http://duncan.dickinson.name/card

So now I get back a 303 redirect.
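For clarity, here's the same negotiation logic restated in plain Python (ignoring the directory and root-default cases). The `existing_files` set stands in for Apache's `-f` filesystem checks, so this sketch needs no real files:

```python
def negotiate(request_path, accept_header, existing_files):
    """Mirror the rewrite rules: serve the .rdf variant to clients that
    accept application/rdf+xml, else fall back to .html then .php.
    Returns (status, target) like the 303 redirects above."""
    if "application/rdf+xml" in accept_header and request_path + ".rdf" in existing_files:
        return (303, request_path + ".rdf")
    for ext in (".html", ".php"):
        if request_path + ext in existing_files:
            return (303, request_path + ext)
    return (404, None)
```

So a browser asking for /card lands on card.html while the curl call above gets redirected to card.rdf.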

Thursday, April 16, 2009

Easy semantic linking for authors

I've been playing with RDF a bit lately to see what I can make of it in terms of practical applications. The first hurdle is the rather long specs. Now, I won't pretend I'm someone that can pick up a spec and read it cover to cover. I like to play with some code as I read so that I can sort it in my head. So, as part of my reading I put together semanto - it's wrong in a couple of ways and generally basic but it's my live learning.

This got me thinking about people that don't want to read the W3 specs and hunt for schema that suits their needs. Peter Sefton discussed a method for authors to embed a triple into a document's link. Once the article is completed, the publisher can pass the document to a system that will turn these links into RDF/RDFa and output a webpage.

As he's my boss, I tend to agree with Peter. Actually, no, I tend to agree with the idea as it provides part of an "easy in" for authors.

Having played with the various RDF stuff out there, I can see that an essential part of the "easy in" is to remove the chase for RDF schemas. Basically, I want to author something and then have an easy to use UI for classifying the information. If that system can provide me standard predicates for my items then I don't really need to think too much about semantics.

To base my thoughts on this workflow:
  1. Do research
  2. Write article
  3. Indicate document predicates/objects
  4. (Maybe) Determine other predicates/objects
  5. Publish
Steps 1 & 2 are really in your court (though you may want to keep an eye on The Fascinator).

I pick up Peter's idea in step 3. You can go through your document and add links to useful information. For example, you can assert that "Jim Smith" is a dc:creator and the dc:title is "My Weekend" etc. In Peter's model, these all appear as hyperlinks. You could even highlight the abstract and create a dc:description link. It'd be ugly and (from my experience, unwieldy) but it is possible and it is cross-app. You could even do some fancy grouping *.

What Step 3 needs is a predefined set of terms for you to plug into. For example, we would cherry pick the various schema elements and provide those best suited to the work being produced. You could base this in an eprints-style workflow:

What sort of publication are you describing?

... then we present the usual

The following properties are available for an article:

From that session we could produce an RDF document for the article using Dublin Core and the Bibliographic Ontology. The user will get a generated RDF file that has all the info and no need for them to work out which namespaces/schemas are the most appropriate. This isn't new - it's a little like the FOAF-a-matic.
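A FOAF-a-matic-style sketch of that session's output: take the collected answers and emit Dublin Core statements as Turtle. The form fields and subject URI are invented, and a real version would also draw on the Bibliographic Ontology:

```python
def article_to_turtle(subject, fields):
    """Emit a tiny Turtle document for an article described by a form.

    `fields` is a dict of answers from the (hypothetical) UI session,
    e.g. {"title": ..., "creators": [...], "description": ...}.
    """
    lines = ["@prefix dc: <http://purl.org/dc/elements/1.1/> ."]
    lines.append(f"<{subject}>")
    props = []
    if "title" in fields:
        props.append(f'    dc:title "{fields["title"]}"')
    for creator in fields.get("creators", []):
        props.append(f'    dc:creator "{creator}"')
    if "description" in fields:
        props.append(f'    dc:description "{fields["description"]}"')
    lines.append(" ;\n".join(props) + " .")
    return "\n".join(lines)
```

The author just answers questions; the namespaces and serialisation fall out the other end.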

We could also provide an interface with something like

What are you describing?

The system can then spit out rdf triples or a link for Peter's word processor. What matters here is that, again, the author can be largely unaware of the underlying rdf complexities.

This last point leads to Step 4, in which we could throw the article at a system like OpenCalais to find content/metadata in the article that may be worth describing in RDF/RDFa. The author can select/deselect elements as they deem sensible and those that remain are either linked via RDFa or put into the associated RDF file.

Now, all I need is to find the time to try this out....

* Not being completely across the spec, RDFa does seem to be limited in terms of some aspects of academic publishing. The issue of author order comes to mind. Using the basic RDFa examples, I can link the authors but can't contain them a la rdf:Seq. This is discussed in RDFa Containers and is solvable - even in word processors, as they have (un)ordered lists....

Tuesday, April 14, 2009

Attempt 1: URLs with semantics

Having read Peter Sefton's Journal 2.0 post, I thought I'd have a play and create a basic URL encoder for such information. The result is the semant-o-matic and it's basic but a start.

Excuse the poor formatting - I just wanted to have a play with some code (I get only rare chances).

Monday, April 6, 2009

Accessing the Personal Knowledge Network

I get my RSS/ATOM feeds through Google Reader and can't always get to reading *everything*. This is where the search tool is such an excellent component. Like GMail, Reader allows those of us who don't tag everything to recall articles and posts that we glanced at but didn't tag/store/zotero etc.

Working on The Fascinator has made me start to think where my pool of "knowledge" comes from. Naturally, there are a few things in my head but I really rely on my various data sources to form my personal knowledge network.

Whilst The Fascinator desktop edition will scan sections of my drive for things that I've saved, I often don't save articles and posts to my drive. If I know I want to keep something, I put it into my poorly organised Zotero library. Otherwise, I might tag it via Delicious. If it's a blog post I sometimes tag or star it but usually I am happy to know that it's somewhere in that mess of posts.

So, based on this, I think that The Fascinator would benefit from allowing this personal knowledge network to be aggregated - even at only the search level. This would mean that we can allow users to access their full network and tag/comment/associate across it.

My initial targets are selfish ones - Zotero, Delicious and Google Reader.

Thursday, April 2, 2009

Scholary HTML and Article 2.0

I wanted to respond to Peter Sefton's blog about Scholarly HTML in light of the Article 2.0 competition winners so, instead of doing it here, I posted a long-winded response on Peter's blog.

Tuesday, March 17, 2009

Desktop Fascinator: File Synch

We've started work on the Desktop Fascinator and it's coming together well. Oliver has managed to grab a snapshot of the filesystem, put it into Fedora and index it with Solr. Hopefully we'll have something up soon for people to check out and comment on.

For now, though, I've turned my attention to the part of the system reading the filesystem. The Fascinator uses harvesters to grab data from various places via various means and put the object/metadata into Fedora. For example, we have ORE and PMH harvesters to schlurp (technical term) up repository data. The current filesystem harvester basically takes a snapshot of the filesystem and loads metadata into Fedora. We don't make a copy of the file in Fedora as we're expecting the files to get quite large and don't want to replicate that within the desktop.

The main goal is to pick up what the user has in their directories and give them a more expansive (metadata/tags/etc) view of it. This means that the user can continue to use the filesystem and their preferred apps. It also means that we have to keep up with the filesystem state.

The first thought was to poll the filesystem but that is rather intensive. Luckily, one of our team members, Linda, has done a thesis that covers the alternatives and, with some quick research, I located Python options for Windows and Linux. I'm not certain how this works in OS X so I'll have to get one of the developers to test it.
This gives us a common code base and we can wrap the code up as a service that logs filesystem going-ons - the FS Watcher Service. Thinking further, The Fascinator may not be running whilst the FS Watcher is doing its thing so we can push each event into a queue to be consumed once The Fascinator is running.
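The event queue could be as humble as an append-only file that the watcher service writes and The Fascinator drains the next time it runs. A sketch of that idea, with the file name and event shape invented here:

```python
import json
import os

QUEUE_FILE = "watcher-events.jsonl"

def record_event(path, action, queue_file=QUEUE_FILE):
    """Called by the watcher service for each create/modify/delete:
    append one JSON line per filesystem event."""
    with open(queue_file, "a") as fh:
        fh.write(json.dumps({"path": path, "action": action}) + "\n")

def drain_events(queue_file=QUEUE_FILE):
    """Called by The Fascinator at startup: consume and clear the backlog."""
    if not os.path.exists(queue_file):
        return []
    with open(queue_file) as fh:
        events = [json.loads(line) for line in fh]
    os.remove(queue_file)
    return events
```

A real service would want the write-then-remove dance to be crash-safe, which is exactly the "disconnect" scenario the fallback scan covers.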

The team laughs at my diagrams so I like to make sure I include one:

Diagram to match filesystem watcher description

There may still be an issue with staying current with the filesystem state. Something may be lost (if the service dies for some reason) so we might still need the scanning system in case a disconnect occurs between the filesystem and the repository. This would potentially be something that the user can run when they're not finding their file in The Fascinator.

OAI-PMH in ePrints

I was chatting with Peter Sefton today about OAI-PMH and resource type names so I thought I'd get back to documenting some ePrints work.

So, configuring OAI-PMH for eprints is straightforward. Within your archive's folder you'll find cfg/cfg.d/oai.pl. You can make some basic alterations such as defining sets and setting up your metadata policy.

Actually deciding how to represent your PMH data is important. For repositories in Australia, the NLA's Arrow Discovery Search provides an interesting angle on this. Just because your repository can be harvested doesn't mean it's interoperable. You should try and see if there's a naming scheme for resource types that others around you are using. Check out Peter Sefton's blog post for a more in-depth commentary on this.

So, what do you do if internally you want to define a Resource Type "Article" but the outside world wants to know it as "Journal Article"? Well, for one, you can use the phrase files to call it whatever you want for users of the system. So if eprints internally calls it "Paper", your phrase file can call it "Article" to make the data submission less confusing.

But, if you want to change the way that the metadata is presented to systems such as OAI-PMH, you need to look at how ePrints deals with Dublin Core. Basically, the OAI-PMH data contains the metadata in DC. Each eprint page also puts the DC into the page's meta tags.

Under perl_lib/EPrints/Plugin/Export you'll find DC.pm. Now, this is an important file if you define your own resource type (with non-inbuilt fields) as eprints won't know about them and you'll have to make sure you adapt files such as DC.pm to output your Resource Type's metadata correctly. Look at the convert_dataobj procedure and you'll see how fields are then put into DC format.

So, we created a qut_thesis type to indicate QUT-based theses for collection within the ADT. Now, eprints doesn't know anything about qut_thesis and we needed to edit DC.pm:

if( $eprint->get_value( "type" ) eq 'thesis' || $eprint->get_value( "type" ) eq 'qut_thesis' )
{
    push @dcdata, [ "publisher", $eprint->get_value( "institution" ) ] if( $eprint->exists_and_set( "institution" ) );
}
else
{
    push @dcdata, [ "publisher", $eprint->get_value( "publisher" ) ] if( $eprint->exists_and_set( "publisher" ) );
}


This is a basic example - I just make sure that qut_thesis DC is the same as the thesis representation. But the MACAR resource types don't include qut_thesis (funnily enough) and we want to make sure that our DC is what external readers/harvesters expect. So, the DC.pm file changes the DC type to Thesis for qut_thesis types:

if( $eprint->exists_and_set( "type" ) )
{
    # We need to map the types to that of MACAR
    # But only if the text displayed in the eprint_types.xml
    # phrases file does not match MACAR
    my $type = $eprint->get_value( "type" );
    if( $type eq 'qut_thesis' )
    {
        push @dcdata, [ "type", "Thesis" ];
    }
    else
    {
        push @dcdata, [ "type", EPrints::Utils::tree_to_utf8( $eprint->render_value( "type" ) ) ];
    }
}

So, if you visit a sample eprint, and select the Export Dublin Core link you'll see that the metadata indicates that the document is a Thesis. qut_thesis is only interesting to QUT so we keep it out of the Dublin Core.
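If you want to check the result programmatically rather than by eye, you can pull the dc:type out of the harvested oai_dc. A small sketch - the sample record below is made up, not real QUT data:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_types(record_xml):
    """Extract the dc:type values from an oai_dc metadata fragment."""
    root = ET.fromstring(record_xml)
    return [el.text for el in root.iter("{%s}type" % DC_NS)]

# A made-up fragment shaped like the oai_dc a harvester would see:
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A Sample Thesis</dc:title>
  <dc:type>Thesis</dc:type>
</oai_dc:dc>"""
```

A quick script like this against your live ListRecords output is a handy sanity check that the mapping in DC.pm is doing what you expect.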

Nearly there...

The problem is, qut_thesis represents theses from QUT and these are harvested by the ADT system. To handle this, you have to see how eprints delineates OAI-PMH sets. Basically, eprints runs a search on various fields such as type, subject and ispublished to build each set, and then formats the DC metadata for the matching records. The code above only changes the output at this latter step. Check your archive's cfg/cfg.d/oai.pl to see the sets being defined.

Check out those import and export plugin folders. If you define your own types and properties you'll need to make sure that the import/export matches your structures. Only enable the plugins that you know work.

There are better ways for EPrints to do this - e.g. provide an archive mapping system for import/export. However, it doesn't and the work isn't that difficult. Besides, that's part of the strength of a working open source product.

Other PMH Stuff

It's also important to make sure that you have a metadata policy so that harvesters know what they can do. OpenDOAR provides a policies tool to make this easy - it even exports an EPrints-compliant file that you can then put in your archive's cfg/cfg.d/oai.pl file. For examples of this, see the QUT ePrints Policy and the one at UTas.

Once you're ready to go, check out the Open Archives Initiative site and register as a provider. If anything, it gives you a tick that your PMH output conforms. If you get the tick you'll be added to the list of registered providers. It's quite painless - except for getting people to decide on the actual policy.

* I'm often told "we can't do that - it's hacking the source code". I take this comment to mean that Open Source means $free or that the project has no future capacity to maintain the system. This might set off your alarm bell and you really need to read Just say no to maverick-manager jobs by Dorothea Salo. Consider this - software such as Peoplesoft and ResearchMaster are often "customised" - then again they also get a squadron of full-time staff. Why isn't your IR getting similar attention?

Thursday, March 5, 2009

(e)Research monkey on my back

Many years ago I was studying an MTeach/MEd, having dropped out of full-time coding and into full-time study. I lived in a share house and owned a vintage Dell laptop. Having done my BIT I admit that I never used the Library and did almost no written assignments. Studying education called on me to access articles, read them and try to turn them into a paper. Sometimes I even had to undertake observations. Most of the "research" was qualitative.

So, armed with a stack of highlighted papers I drew up (paper) concept maps and started to flesh out a piece. I used emacs and LaTeX with BibTeX in that first year - it was fun but really relied heavily on my technical background to keep it afloat. For my MEd I had a newer PC so gave EndNote a go. By the time I'd set it up to understand that APA should mention that the article was on the web, I'd left that product.

Now, I don't pretend that my heady coursework days can match your 5 year research effort - nuh-uh. But what I can say is that, if I nearly threw my laptop across the room just trying to create 5000 words, I can only imagine how a doctorate feels.

There has to be a better way. Really.

It also has to be made better for people not hitting the whoa-o-meter with their project. That's people like educators and historians. In fact, anyone whose research requires them to collate data that amounts to something less than a large European dodgem circuit. It really seems that, if you're not munching petabytes, no-one wants to share your lunch. University ICT teams give you just enough storage to hold a picture of your cat and many eResearch data people are looking for that bigger bang.

So you store all your data on your laptop and a few disks and roll the dice.
Investigating Data Management Practices in Australian Universities and The Next Generation of Academics really showed me some truths:
  • Researchers don't have time to play with their computers and eResearch tools: they just want them to work
  • Researchers aren't catalogers: they don't want to create comprehensive metadata for everything they're reading/watching/creating.
  • Researchers don't run data centres: they want institutional storage and backup so that they don't have to think about it.
  • Researchers (often) work in teams: let them share
  • If your eResearch idea will create more administrative work for researchers, go back to the drawing board. I hope repository admins are listening.
  • It's not about the software - it's about the research getting done.
So, whilst at QUT we worked really hard to integrate our ePrints data collection into the Research Master (HERDC) data collection. This essentially sought to stop doubled-up administrivia. Putting your data into ePrints actually also meant you were largely killing two birds. I really hope this work has been effective. I also really hope that the stand-alone job of submitting to an IR will become as forgotten as the night cart.

So there's a lot of work to be done to create software that helps rather than hinders and workflows that flow, rather than fail.

Leaving the ramble here - ready to hone these ideas into the Desktop eResearch Revolution

Thursday, February 5, 2009

Repository Stats

For anyone that's implemented an institutional repository that has some level of researcher support, the issue of statistics is an unavoidable one. Just like a javascript counter on a web page, people like to see that their paper was downloaded n times. Now, we all know that this doesn't lead to funding in the same way that citations do but there's evidence that open access leads to increased citations.

So, as IR people you can put in a stats package that really just counts hits. Something like AWStats or the like. With a bucket of aspro you might even try IR Stats. You can provide your users with trend data that indicates if the link given in their conference presentation attracted hits. These are useful indicators and depositors like them.

But you might also sit there and ask if there isn't something more that the IR can give you. Well, consider what you've got in your IR. There's information about institutional staff, what they study and what they write. In this world of complex/chaotic problems your IR could be mined to provide possible research collaborations that teams hadn't even dreamt of.
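To make the idea concrete, here's a toy sketch of that kind of mining: suggesting author pairs who deposit on a shared subject but have never appeared on the same eprint. The record structure and names are invented - a real version would work off the IR's actual metadata:

```python
from itertools import combinations

def suggest_collaborations(records):
    """Suggest author pairs whose deposits share a subject but who
    have never co-authored an eprint together.

    `records` is a list of (authors, subjects) tuples - a stand-in
    for metadata mined from the IR.
    """
    by_subject = {}
    coauthors = set()
    for authors, subjects in records:
        # Remember who has already worked together
        for pair in combinations(sorted(authors), 2):
            coauthors.add(pair)
        # Group authors by the subjects they deposit on
        for s in subjects:
            by_subject.setdefault(s, set()).update(authors)
    suggestions = set()
    for authors in by_subject.values():
        for pair in combinations(sorted(authors), 2):
            if pair not in coauthors:
                suggestions.add(pair)
    return suggestions
```

Even this naive subject overlap surfaces pairings no individual researcher would necessarily spot; a serious attempt would weight by keywords, citations and so on.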

Expand this to a system like the NLA's ARROW Discovery Service and you extend beyond the campus. You have a national body of knowledge that can be mined to exploit possible research cross-overs.

I mentioned this some time ago to a colleague and their response was "most researchers know others in their field - they know who to contact". Yup, agreed. But that's within their field. What about the linkages that they'd never thought of? I K Brunel is often noted as someone who didn't excel at inventing but at combining: steam + iron + screw propeller = SS Great Britain. Would I be wrong to say that new "big" problems are now distributed and rather more complex? Where one person could be the innovator, we then looked at teams. But maybe the teams thing is dead - maybe it's a Fourth Blueprint world of networked organisations. Maybe the IR can be mined to discover links that we didn't think of.

Maybe it's also a good way to find a supervisor for that thesis you're wanting to write.

There's some work going on about visualisations and mining within repositories and it'll be interesting to see if that wave makes the beach.

Lies, damn lies, and statistics

Wednesday, January 28, 2009

EPrints: Random EPrint

Having an eprint of the day on your IR site is a useful tool. Authors like to see their work come up and it gives people something new to look at. I created a basic script to select a full-text eprint and either redirect the browser or create a citation. The redirection would be useful if you wanted a link along the lines of "Find a random eprint" but this maybe isn't overly useful. By outputting a citation you can embed the information into a web page. We needed two types of output: one creates a citation as a snippet of HTML for use in web pages; the other outputs the citation within the archive's template.

So, the code below lives in a file called "random" in the cgi folder. You have some options:

  • http://myeprints.org/cgi/random: Displays the random eprint within the archive template

  • http://myeprints.org/cgi/random?insert=1: Displays an HTML snippet

  • http://myeprints.org/cgi/random?redirect=1: Redirects the browser to the full abstract of a random eprint

You may also notice that, in the code, I search eprints for items with public full text.


Now, here's that code:

# Returns a random eprint

use EPrints;

use strict;

my $session = new EPrints::Session;
exit( 0 ) unless( defined $session );

# Load the archive dataset
my $ds = $session->get_repository->get_dataset( "archive" );

# Search for eprints with publicly available full text
my $searchexp = EPrints::Search->new(
    satisfy_all => 1,
    session => $session,
    dataset => $ds,
);
$searchexp->add_field( $ds->get_field( "full_text_status" ), 'public' );

my $results = $searchexp->perform_search;
my $offset = int( rand( $results->count ) );

my @ids = @{ $searchexp->get_ids };

if( $session->param( "redirect" ) )
{
    # Send the browser to the abstract page of the random eprint
    $session->redirect( "/" . $ids[$offset] );
    $session->terminate;
    exit;
}

# Prepare a citation string
my $ep = EPrints::DataObj::EPrint->new( $session, $ids[$offset] );
my $citation = $ep->render_citation_link( "default" );

if( $session->param( "insert" ) )
{
    # Output just the citation as an HTML snippet
    $session->send_http_header( content_type => "text/plain" );
    print $citation->toString;
}
else
{
    # Build a display page within the archive template
    my $title = $session->html_phrase( "cgi/random:title" );
    my $page = $session->make_doc_fragment();
    $page->appendChild( $citation );
    $session->build_page( $title, $page, "latest" );
    $session->send_page();
}

$session->terminate;


So, how can you use this? Well, we could have set up a cron job to wget a random citation snippet. One of the library pages accessed this via JSP and inserted it into their page.

However, we also wanted the eprint of the day on our eprints home page. After a few thoughts on the best way to do this, I settled on phrases. So, in the code below, I request a random eprint as an HTML snippet and do two things. Firstly (easily), I output this to a text file that can be grabbed over the web. Secondly, I create a phrase file with the citation in it. I can then use this phrase in any xpage with <epc:phrase ref="eprint_of_the_day" />

This is the script that does the job:

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new( env_proxy => 1, keep_alive => 1, timeout => 30 );

my $response = $ua->request(
    HTTP::Request->new( 'GET', 'http://my.eprints/cgi/random?insert=1' ) );

# The snippet goes into a static text file that can be grabbed over the web...
open( INCLUDE_FILE,
    ">/usr/local/eprints/archives/myeprints/cfg/lang/en/static/random.txt" )
    or die "Cannot open include file: $!";

# ...and into a phrase file so it can be used in any xpage.
# (The phrase file path was lost from the original listing - adjust
# this placeholder to suit your archive's cfg/lang/en/phrases folder.)
open( PHRASE_FILE, ">/path/to/your/phrases/eprint_of_the_day.xml" )
    or die "Cannot open phrase file: $!";

print PHRASE_FILE "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<!DOCTYPE phrases SYSTEM \"entities.dtd\">
<epp:phrases xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:epp=\"http://eprints.org/ep3/phrase\">
<epp:phrase id=\"eprint_of_the_day\">";

print PHRASE_FILE $response->content;
print INCLUDE_FILE $response->content;

print PHRASE_FILE "</epp:phrase></epp:phrases>";

close( PHRASE_FILE );
close( INCLUDE_FILE );

# Regenerate the static pages so the new phrase takes effect
`su - eprints -c '/usr/local/eprints/bin/generate_static myeprints'`;

print $response->content;


Tuesday, January 27, 2009

Eprints: Context sensitive labels and help

As a part of the QUT ePrints upgrade we held several meetings to discuss the submission workflow. As we were integrating the data with other systems we had to make sure that those undertaking the data entry (researchers, admin officers) wouldn't be thrown off by the language. One thing that came up was the need to have fields given a different label based on the resource type. For example, a book has an author but a painting may have an artist. You don't want to create an extra field because the data is the same, it's just the human interface that needs to be flexible. Likewise, the help text should change based on resource type.

So, the idea was to provide a sub-phrase within the phrases file. The system would default to a base phrase if none was found.

This would mean that we can have the following in our phrases file:

<epp:phrase id="eprint_fieldname_volume">Volume</epp:phrase>

<epp:phrase id="eprint_fieldname_volume#book">Series Volume</epp:phrase>

<epp:phrase id="eprint_fieldhelp_volume">
Enter the volume number of the journal or series in which your item appeared. Please just use the number, do not include text such as "vol".
</epp:phrase>

<epp:phrase id="eprint_fieldhelp_volume#book">
If this book is a part of a series, please provide the volume number here. Please just use the number, do not include text such as "vol".
</epp:phrase>

As I hope you can see, if you're entering the information for a book, you get a contextual label and help. Any other resource types that use the volume field will fall back to the default text.
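The fallback logic itself is tiny. Here's a sketch in Python, using a plain dict where EPrints actually uses phrase files:

```python
def resolve_phrase(phrases, phrase_id, resource_type=None):
    """Look up a phrase, preferring a type-specific variant.

    Mirrors the lookup described above: try "<id>#<type>" first
    and fall back to the plain "<id>" if no variant exists.
    """
    if resource_type is not None:
        specific = "%s#%s" % (phrase_id, resource_type)
        if specific in phrases:
            return phrases[specific]
    return phrases[phrase_id]
```

So a book gets "Series Volume" while every other type quietly falls back to "Volume" - no per-type configuration needed beyond the optional phrase.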

As usual, when the doco falls short, I hit the mailing list. This can go one of two ways. You can get a good answer or you can get ignored. My question got a discussion going and a solution was reached. My final message can be read here.

The biggest difficulty we faced was working out which objects were available from within the given piece of code...

So, in perl_lib/EPrints/MetaField.pm, I changed the render_name and render_help to check for a resource type specific phrase. I've used the hash (#) to denote the separation but you could change this.

sub render_name
{
    my( $self, $session ) = @_;

    if( defined $self->{title_xhtml} )
    {
        return $self->{title_xhtml};
    }

    my $phrasename = $self->{confid} . "_fieldname_" . $self->{name};

    # START: Changes made to provide context sensitive names
    if( defined $session->{query} )
    {
        my $eprintid = $session->{query}->{eprintid}->[0];
        if( $eprintid eq "" )
        {
            $eprintid = $session->{query}->{param}->{eprintid}->[0];
        }

        my $ep = EPrints::DataObj::EPrint->new( $session, $eprintid );
        if( $ep )
        {
            my $eptype = $ep->get_type;
            $phrasename .= "#$eptype"
                if $session->get_lang->has_phrase( "$phrasename#$eptype" );
        }
    }
    # END: Changes made to provide context sensitive names

    return $session->html_phrase( $phrasename );
}

sub render_help
{
    my( $self, $session ) = @_;

    if( defined $self->{help_xhtml} )
    {
        return $self->{help_xhtml};
    }

    my $phrasename = $self->{confid} . "_fieldhelp_" . $self->{name};

    # START: Changes made to provide context sensitive help
    my $eprintid = $session->{query}->{eprintid}->[0];
    if( $eprintid eq "" )
    {
        $eprintid = $session->{query}->{param}->{eprintid}->[0];
    }

    my $ep = EPrints::DataObj::EPrint->new( $session, $eprintid );
    if( $ep )
    {
        my $eptype = $ep->get_type;
        $phrasename .= "#$eptype"
            if $session->get_lang->has_phrase( "$phrasename#$eptype" );
    }
    # END: Changes made to provide context sensitive help

    return $session->html_phrase( $phrasename );
}


The code base modified was 3.1.1.

Naturally, no warranty is offered.

Thursday, January 22, 2009

Omeka Man

Just been checking out Omeka. Very interesting. Installing it now so will try and post my thoughts soonish.

Tuesday, January 13, 2009

Upgrading QUT ePrints

One of the main reasons I wanted to start this blog was to document the work I recently completed in upgrading the QUT ePrints system. I did not do this alone so will state outright that there was a team of Librarians, a couple of developers and other QUT staff. I won't name them here for privacy reasons.

So, this is my attempt to feed back to the eprints community with some code and discussion about the work.

What is QUT ePrints?
QUT operates its institutional repository out of the Library. There is high-level support for the repository - not just rhetoric. The institution also has some key people - one of whom lives and breathes this stuff - they're fun to work with :)

Chances are that if you don't know what ePrints is then you've already moved on from this page. QUT was a member of the now defunct ARROW group and bought into the VITAL software. However, due to a variety of reasons, we chose to upgrade from eprints v2 to v3. I won't go into the whys...

What we did
A fair bit... Whilst the eprints team provides a complete out of the box solution, the beauty of its open sourciness is that we could adapt it to QUT and Australian higher-ed requirements.

So the main items we produced were:
  • Integrate QUT's ESOE infrastructure for single sign on
  • Transfer the data from ePrints 2 to a new server running ePrints 3
  • Bring over the ADT theses to be served from our IR
  • Expand the metadata to capture data for the HERDC collection - this could reduce a fair bit of work for researchers
  • ... and some general user interface work
The data transfer gave us a chance to normalise our metadata - the system had grown "organically" over several years so needed a little landscaping.
What we didn't do
Well, there were a few things dropped along the way:
  • Oracle integration: We'd thought about using the university's corporate Oracle infrastructure but, after battling with the rather new database layer, killed the idea off so as to meet the deadline.
  • We'd really wanted to link into QUT's Identity system but it wasn't ready yet. This would give us a name authority that could ensure that systems with which we share data were all on the same page.
On that last point, I received a bit of flak as I was talking of using a QUT-local name authority. I tended to disagree with people on this one. For one, we actually had a name authority at QUT (though the SOAP interface was delayed). The NLA have the People Australia work that would meet our needs but it isn't there yet. There was talk of other projects but it all started to sound rather overly crowded in terms of scope. Secondly, the SOAP interface could easily provide any info needed once given that unique QUT ID - we all know that one ID won't be enough, especially for the public sector.
Moving on
Well, enough of a ramble. I'll start posting code and overviews shortly.

Obligatory first post

I guess there has to be a first post and this is it.

Well, ok then