PMR - I have blogged some of these.

1) Our ELN online for our open project here
2) Reaction Attempts, which includes the UsefulChem (Bradley) and Todd notebooks - click on the explore link
3) Chemspider Synthetic Pages
4) OrgPrepDaily procedures
5) Beilstein Journal of Organic Chemistry
6) Small-molecule papers in PLoS One, but there aren't many with synthetic schemes…
7) What about publicly available theses - do you want HTML for this, or PDF?

Initial comment: the sample size might be small, and hence the kinds of reactions looked at may skew the analysis a little. We're focussing on one main reaction type, and so is Jean-Claude; not exclusively, just a preference. Obviously, the wider the sample set, the better the analysis.

PMR: To this I will add:

8) About 7,000-9,000 syntheses from Acta Crystallographica E (some will be simply “we took this from a bottle” but most are actual preparations). Licence: CC-BY
9) Somewhere around 100,000 reactions per year in patents. We expect the textual quality of the historical material to be lower. Licence: PUBLIC DOMAIN
10) A number of donated theses in Cambridge
11) A number of theses in University repositories. Most licences CC-DONTKNOW, some CC-SA, some CC-BY. Smallish numbers (guess about 100-1000 theses if we work hard). A really good opportunity for collaboration.

PMR: The main aspects that will be important are:

* What are the explicit permissions on the site?

This is more important than anything else. We cannot use material that is visible on the web unless there is explicit permission, so this is an ideal opportunity to use the IsItOpenData? service. Of the sources above I shall assume 1 and 2 are Open; 3 is unknown until we hear from Chemspider; 4 is unknown (most blogs are CC-BY or CC-SA but we have to check). We cannot use CC-NC. 5 and 6 are fully CC-BY.
* What is the format? PDF is the worst, but probably usable for this project. XHTML is normally excellent. *.doc (theses) is also excellent.
* How is the information structured? If it’s in diagrams it’s very difficult. If it’s in running text it depends on the style. Formal reports of single compounds are often quite tractable. Highly detailed accounts are potentially much more valuable but harder to parse as there is less consistency.
* What information is given? Acta E and Patents do not normally give yields (a pity). Theses are usually very rich.
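
The permissions check above can be sketched as a small triage script. This is only an illustration, not project code: the `Source` class, the `triage` function and the set of licences treated as usable are all assumptions made for the example (in particular, counting CC-SA as usable is inferred from the remark that only CC-NC is ruled out). The licence labels are the ones assumed in this post, plus one hypothetical CC-NC entry to show the exclusion.

```python
from dataclasses import dataclass

# Licences treated as usable in this sketch (an assumption for illustration).
USABLE_LICENCES = {"OPEN", "CC-BY", "CC-SA", "PUBLIC DOMAIN"}


@dataclass
class Source:
    name: str
    licence: str  # "UNKNOWN" until explicit permission is confirmed


def triage(sources):
    """Sort sources into usable, ask-for-permission and excluded buckets."""
    buckets = {"usable": [], "ask": [], "excluded": []}
    for source in sources:
        licence = source.licence.upper()
        if "NC" in licence.split("-"):
            # Any non-commercial clause rules the source out.
            buckets["excluded"].append(source.name)
        elif licence in USABLE_LICENCES:
            buckets["usable"].append(source.name)
        else:
            # No explicit permission yet, so we have to ask.
            buckets["ask"].append(source.name)
    return buckets


SOURCES = [
    Source("Open ELN", "Open"),
    Source("Reaction Attempts", "Open"),
    Source("Chemspider Synthetic Pages", "Unknown"),
    Source("OrgPrepDaily procedures", "Unknown"),
    Source("Beilstein Journal of Organic Chemistry", "CC-BY"),
    Source("PLoS One", "CC-BY"),
    Source("Acta Crystallographica E", "CC-BY"),
    Source("Patents", "Public Domain"),
    Source("Hypothetical NC-licensed blog", "CC-BY-NC"),  # made-up example
]

result = triage(SOURCES)
for bucket, names in result.items():
    print(f"{bucket}: {', '.join(names) or '(none)'}")
```

Everything that lands in the "ask" bucket is a candidate for a formal permission request; nothing is treated as usable just because it is visible on the web.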

I’ll start liaising with Heather about asking for formal permission on IsItOpen?

