Here is a wonderful offer for the Green Chain Reaction project at #solo2010.
Dan Hagon says:
// Hi Peter, sounds a really fun project. I’m happy to help out with some Java coding. Also I have a cloud-hosted virtual machine I’m not really making much use of right now which you’re welcome to use.//
This is exactly one of the skills we shall need for this project. If we are going to look at patents over many past years we are going to have to use a lot of humans, a lot of computing, or both.
Dan worked with us as a summer student and then moved on to RAL. He helped us get much of the automation into crystal structure repositories. So I know that he knows this contribution is possible and valuable.
I’ll explain in more detail what we are going to do, but this post is about how. We have written most of the tools (in Java) and we’ll be able to offer them so they can run standalone on any machine. This may require wrapping them as a WAR or other self-starting distributable. We’ll also need to make sure they run remotely. (Java is described as write-once-run-anywhere and parodied as write-once-debug-everywhere, so people who know what debugging looks like are highly valued.)
The main distributed tool will be natural-language-processing (NLP) for chemical documents and specifically reactions. I’ll describe this in detail in a later post. The overall strategy looks something like:
* Download N documents from remote site (e.g. patents, Acta Crystallographica E)
* Find all reactions in the document (can be hundreds in patents, only one in Acta)
* Carry out NLP on each reaction.
* Create a datafile from each
* Index each datafile (probably using RDF)
* Search for green concepts in the RDF repository
* Present the results
We’ve got code for steps 1-4 of the list above. We’ll need help and imagination with the later stages (5-7), especially since they may come slightly later than the initial parsing. But there will be many of you out there who have some experience of this sort of thing.
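To make the shape of steps 2-4 concrete, here is a minimal Java sketch of how one downloaded document might be turned into per-reaction records. It is not the project's actual code: the class `GreenChainPipeline`, the `ReactionData` holder and both helper methods are hypothetical names, splitting on blank lines stands in for real reaction finding, and whitespace tokenization stands in for the chemical NLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: every name here is hypothetical, not the project's real tools.
class GreenChainPipeline {

    /** One extracted reaction plus whatever the NLP step produced for it. */
    static final class ReactionData {
        public final String documentId;
        public final String reactionText;
        public final List<String> tokens;

        ReactionData(String documentId, String reactionText, List<String> tokens) {
            this.documentId = documentId;
            this.reactionText = reactionText;
            this.tokens = tokens;
        }
    }

    // Step 2 (stand-in): treat every non-blank paragraph as one reaction.
    static List<String> findReactions(String documentText) {
        List<String> reactions = new ArrayList<>();
        for (String paragraph : documentText.split("\\n\\s*\\n")) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                reactions.add(trimmed);
            }
        }
        return reactions;
    }

    // Step 3 (stand-in): whitespace tokenization in place of real chemical NLP.
    static List<String> parseReaction(String reactionText) {
        return Arrays.asList(reactionText.trim().split("\\s+"));
    }

    // Steps 2-4 for one already-downloaded document: find, parse, collect records.
    static List<ReactionData> processDocument(String documentId, String documentText) {
        List<ReactionData> records = new ArrayList<>();
        for (String reaction : findReactions(documentText)) {
            records.add(new ReactionData(documentId, reaction, parseReaction(reaction)));
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "A was added to B.\n\nC was heated.";
        for (ReactionData r : processDocument("doc-1", text)) {
            System.out.println(r.documentId + ": " + r.tokens.size() + " tokens");
        }
    }
}
```

The point of keeping each step behind its own method is that the later stages (indexing, searching, presenting) can be bolted on without touching the parsing.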
Note that the cloud is an ideal place to do this sort of work because it is embarrassingly parallel, or can be cast as a map-reduce job. For example, each volunteer could take a year of patents (many tens of thousands of reactions in each year).
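As a rough illustration of why this parallelises so easily, the sketch below gives each year of patents its own independent task (the "map"), then adds up the per-year results (the "reduce"). It is only a single-machine analogy using a thread pool; in practice each year would go to a different volunteer or cloud machine. The class `YearlyPatentJobs` and its methods are hypothetical, and counting blank-line-separated paragraphs stands in for the real reaction extraction.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: names and the counting logic are assumptions, not project code.
class YearlyPatentJobs {

    // "Map" step for one year: count paragraphs as a stand-in for finding reactions.
    static int mapYear(List<String> patentTexts) {
        int reactions = 0;
        for (String text : patentTexts) {
            for (String paragraph : text.split("\\n\\s*\\n")) {
                if (!paragraph.trim().isEmpty()) {
                    reactions++;
                }
            }
        }
        return reactions;
    }

    // Run one independent task per year; no task needs data from any other.
    static Map<Integer, Integer> runAll(Map<Integer, List<String>> patentsByYear, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            Map<Integer, Future<Integer>> pending = new TreeMap<>();
            for (Map.Entry<Integer, List<String>> entry : patentsByYear.entrySet()) {
                final List<String> texts = entry.getValue();
                Callable<Integer> task = () -> mapYear(texts);
                pending.put(entry.getKey(), pool.submit(task));
            }
            Map<Integer, Integer> counts = new TreeMap<>();
            for (Map.Entry<Integer, Future<Integer>> entry : pending.entrySet()) {
                counts.put(entry.getKey(), entry.getValue().get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("A per-year job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // "Reduce" step: combine the per-year counts into one total.
    static int total(Map<Integer, Integer> countsByYear) {
        int sum = 0;
        for (int count : countsByYear.values()) {
            sum += count;
        }
        return sum;
    }
}
```

Because the per-year tasks share nothing, a failed or slow year can simply be rerun on another machine without affecting the rest.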
So please volunteer for help with the computing – it should be fun.