CMLRSS: A Peer to Peer Chemical Information System

1. Background to RSS and CMLRSS

RSS is a protocol for creating news "feeds" or "alerts", and programs implementing it can be thought of as sitting somewhere between a Web browser and email. The underlying syntax used is XML/RDF, and CMLRSS is a chemically aware extension (module). The key differences between this mechanism for delivering chemical content to a user and the conventional web page are:

  1. The system is designed for use by either humans or automatic software (agents)
  2. Meta-data (data about data) is a mandatory feature; its absence would be flagged as an error
  3. The chemical content (and its provenance) is specifically and uniquely identified
  4. Chemical content is presented in a consistent and declared manner suitable for further processing (e.g. Web services)
  5. Individual chemical "feeds" can be aggregated and/or "grouped"
  6. Such aggregates can then be sorted (filtered) according to chemical criteria, e.g. molecular formula
  7. Feed collections can be set up to update automatically; no (revisiting) action by the user is needed
  8. The system is set up for extension, into e.g. Maths, bio-science etc.
  9. The system is set up for archival into e.g. structured databases.
  10. The system is set up for (future) context-rich searching

More information can be found in the published articles:

  1. P. Murray-Rust and H. S. Rzepa, "Towards the Chemical Semantic Web. An introduction to RSS", Internet J. Chem., 2003, 6, article 4.
  2. P. Murray-Rust, H. S. Rzepa, M. J. Williamson and E. L. Willighagen, "Chemical Markup, XML and the Worldwide Web. Part 5. Applications of Chemical Metadata in RSS Aggregators", J. Chem. Inf. Comp. Sci., 2004, 44, 462-469.

2. Downloading the CMLRSS Distribution

A packaged Windows installer is available here along with a Wiki page for collaborative development. A single ZIP archive is available for download here (16,901,641 bytes) for manual and non-Windows installation.

3. Creating a CMLRSS Feed

This can be accomplished in either a static or a dynamic manner.

  1. A static feed is an XML document (example here with source) describing a feed channel and a list of items within that channel. To include molecular coordinates, you can use Open Babel to convert from legacy formats.

    To invoke Babel, use the command line (Windows users: run cmd, then cd to the directory which contains BOTH babel.exe and cygwin1.dll):

    ./babel -imdl  moleculefilename.mol -ocml moleculefilename.xml -x2an
    

    (This assumes the legacy format is Molfile; modify the input flag as needed for other formats.) The resulting XML file can then be pasted with a text editor into the item container of the RSS file.

    Check the end-of-line characters of the Molfile: if it was written with e.g. Mac line endings, it may not process on OS X until a text editor (e.g. BBEdit) has been used to convert the file to Unix line endings.

    Babel writes each element of the XML file as a single line. For large molecules, this means that e.g. the <cml:atomArray ... /> element can be very "wide". Some newsreaders appear to have an implicit limit on how many characters a single text line can hold, and you may have to use your text editor to introduce "hard" wraps into the file. Take care, because some text editors introduce "soft" wraps, which make the line appear wrapped without actually breaking it with explicit end-of-line characters.

    The size of <cml:atomArray ... /> has also caused problems with some RSS validators, which again appear to limit how many characters any one attribute (e.g. x3="coordinates") can have. For a molecule with e.g. 10,000 atoms, this is a lot of characters! One possible solution is to use -x2n as the Babel flag (rather than -x2an, the "a" signifying the problematic array), but this will increase the size of the CML/XML file. Although we have succeeded in getting molecules with e.g. 9000 atoms to display in cmlrss, we currently recommend that full coordinates for molecules with more than 1000 atoms be treated with caution, since they may cause problems (although we have observed no issues with the Jmol/JChemPaint systems).

  2. Dynamic feeds can be created by querying a suitable database. Thus
    http://www.ch.ic.ac.uk/csdemo/feed.php
    
    will query a MySQL database (using default values for the search variables, in this case returning the last ten items entered into the database).
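The line-ending and line-width caveats described for static feeds can be checked with a short script before the file is pasted into the feed. The following Python sketch is illustrative only: the function names `to_unix_eol` and `longest_line` are hypothetical, and the sample text is invented for the demonstration. No particular newsreader or validator limit is assumed; the script merely reports the widest line so that you can judge for yourself.

```python
def to_unix_eol(text: str) -> str:
    """Convert Windows (CRLF) and classic Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first so that its CR is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def longest_line(text: str) -> int:
    """Return the length, in characters, of the longest line in the text."""
    return max((len(line) for line in text.split("\n")), default=0)


if __name__ == "__main__":
    # Invented sample with classic Mac (CR-only) line endings.
    sample = "first line\rsecond, rather longer line\rM  END\r"
    unix_text = to_unix_eol(sample)
    print("CR characters remaining:", unix_text.count("\r"))
    print("Widest line (characters):", longest_line(unix_text))
```

Running the same `longest_line` check on the CML output of Babel gives a quick indication of whether an element such as <cml:atomArray ... /> is wide enough to be worth wrapping or regenerating with -x2n.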

4. Validating a CMLRSS Feed

http://feedvalidator.org/ and http://xml.mfd-consult.dk/syn-sub/ both offer validation (in the XML sense) services.
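Before submitting a feed to an online validator, a local well-formedness check can catch simple XML mistakes such as unclosed elements. The sketch below uses only the Python standard library; the function name `check_well_formed` is hypothetical. It tests well-formedness only, and does not check that the document is valid RSS or CML.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if the text parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    good = "<rss><channel><title>demo</title></channel></rss>"
    bad = "<rss><channel><title>demo</channel></rss>"  # <title> is never closed
    for label, text in (("good", good), ("bad", bad)):
        ok, message = check_well_formed(text)
        print(label, "->", "well-formed" if ok else "error: " + message)
```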

5. Viewing a CMLRSS Feed

  1. If the feed contains the following stylesheet declaration
    <?xml version="1.0" encoding="iso-8859-1"?>
    <?xml-stylesheet href="http://www.w3.org/2000/08/w3c-synd/style.css" type="text/css"?>
    
    then the RSS will display more or less sensibly in a Web browser.
  2. Generic RSS viewers can be used. A wide variety are available (an overview is available here). These viewers will honour the default RSS elements, and will gracefully ignore the chemistry specific components.
  3. Chemically aware RSS viewers include Jmol (3D coordinates) and JChemPaint (2D coordinates). We anticipate that extension to crystallographic coordinates and unit cells will shortly be available via Jmol.
  4. JChemPaint can be installed in a manner similar to Jmol. This allows display and editing of 2D structures (and allows 2D coordinates to be created from structures which only have 3D coordinates).
    [Figure: JChemPaint with RSS browser window]

6. Subscribing to a Feed

<a href="feed:www.ch.ic.ac.uk/motm/index.rss">Subscribe to motm</a> is a proposed method for subscribing to a feed via a Web browser. This works with some generic RSS clients (tested with NetNewsWire). A method which first validates the feed (useful for checking your XML syntax and well-formedness) is <a href="http://purl.org/net/syndication/subscribe/?rss=http://www.ch.ic.ac.uk/motm/index.rss" title="RSS Channel" target="new"><img src="rss.gif" alt="" /></a>. Currently the chemical RSS clients (Jmol, JChemPaint) do not support automatic subscription; subscriptions must still be hand-edited in the rssviewer.props file, an example of which is shown below. We anticipate that the syntax of this file will shortly change to adopt OPML (used by other RSS clients).

ChannelCount=8

Channel0=http://almost.cubic.uni-koeln.de/jetspeed/NmrshiftdbServlet?nmrshiftdbaction=rss
Channel0Title=NMRShiftDB (Cologne University BioInformatics Center)
Channel1=http://www.woc.sci.kun.nl/cgi-bin/rssfeed.rss
Channel1Title=Dutch Dictionary on Organic Chemistry
Channel2=http://www.ch.ic.ac.uk/csdemo/feed.php
Channel2Title=ChemStock Demo (Imperial College)
Channel3=http://www.ch.ic.ac.uk/motm/index.rss
Channel3Title=Molecules-of-the-Month (Imperial)
Channel4=http://www.bristol.ac.uk/Depts/Chemistry/MOTM/rss.xml
Channel4Title=Molecules-of-the-Month (Bristol)
Channel5=http://www.chem.ox.ac.uk/mom/index.rss
Channel5Title=Molecules-of-the-Month (Oxford)
Channel6=http://wwmm.ch.cam.ac.uk//cmlrss/index.rss
Channel6Title=WWMM@UCC (Cambridge)
Channel7=http://wwmm.ch.cam.ac.uk/cmlrss/cryst/index.rss
Channel7Title=Cryst@UCC (Cambridge)
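Because the file is edited by hand, it is easy to let ChannelCount drift out of step with the numbered entries. The Python sketch below reads text in the format shown above and reports a missing channel. It is an illustration, not part of the CMLRSS distribution: the function name `parse_channels` is hypothetical, and treating lines that begin with "#" as comments is an assumption borrowed from Java properties files.

```python
def parse_channels(props_text: str):
    """Parse rssviewer.props-style text into a list of (url, title) tuples."""
    props = {}
    for line in props_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment convention
            continue
        # Split at the first "=" only, since feed URLs may contain "=".
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()

    count = int(props.get("ChannelCount", "0"))
    channels = []
    for i in range(count):
        url = props.get(f"Channel{i}")
        if url is None:
            raise ValueError(f"ChannelCount={count} but Channel{i} is missing")
        channels.append((url, props.get(f"Channel{i}Title", "")))
    return channels


if __name__ == "__main__":
    sample = (
        "ChannelCount=2\n"
        "\n"
        "Channel0=http://www.ch.ic.ac.uk/csdemo/feed.php\n"
        "Channel0Title=ChemStock Demo (Imperial College)\n"
        "Channel1=http://www.ch.ic.ac.uk/motm/index.rss\n"
        "Channel1Title=Molecules-of-the-Month (Imperial)\n"
    )
    for url, title in parse_channels(sample):
        print(title, "->", url)
```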

7. How to create lists of RSS Feeds (aggregation)

OPML (Outline Processor Markup Language) is "a file format that can be used to exchange subscription lists between programs that read RSS files, such as feed readers and aggregators". Many RSS Feed clients allow import/export of these lists. An example collection of CMLRSS Feeds is here with the source shown below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- OPML generated by NetNewsWire -->
<opml version="1.1">
    <head>
        <title>mySubscriptions</title>
    </head>
    <body>
        <outline text="CML" description="CML: CML Highlights" title="CML" type="rss" version="RSS" htmlUrl="http://www.xml-cml.org/index.html" xmlUrl="http://www.xml-cml.org/cml.rss"/>
        <outline text="Molecules-of-the-month" description="Molecules-of-the-month: A Project started in December 1995 by Henry Rzepa (Imperial College) and Paul May (Bristol University)" title="Molecules-of-the-Month" type="rss" version="RSS" htmlUrl="http://www.ch.ic.ac.uk/motm/" xmlUrl="http://www.ch.ic.ac.uk/motm/index.rss"/>
    </body>
</opml>
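An OPML list of this kind is straightforward to process automatically. As a minimal sketch using the Python standard library, the following collects the title and xmlUrl of every outline element that declares a feed; the function name `feed_urls` is hypothetical, and the embedded sample is a shortened form of the listing above.

```python
import xml.etree.ElementTree as ET


def feed_urls(opml_text):
    """Return (title, xmlUrl) pairs for every <outline> that carries an xmlUrl."""
    root = ET.fromstring(opml_text)
    pairs = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:  # outlines used purely for grouping carry no xmlUrl
            title = outline.get("title") or outline.get("text") or ""
            pairs.append((title, url))
    return pairs


if __name__ == "__main__":
    sample = """<opml version="1.1">
    <head><title>mySubscriptions</title></head>
    <body>
        <outline text="CML" title="CML" type="rss" xmlUrl="http://www.xml-cml.org/cml.rss"/>
        <outline text="Molecules-of-the-month" title="Molecules-of-the-Month" type="rss" xmlUrl="http://www.ch.ic.ac.uk/motm/index.rss"/>
    </body>
</opml>"""
    for title, url in feed_urls(sample):
        print(title, "->", url)
```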

A syndication feature known as "share your OPML" allows groups to exchange their lists. This can be done server-side (http://minutillo.com/steve/feedonfeeds/). Also highly promising is Urchin, an open-source project to produce a customisable RSS aggregator and filter. For an article by Timo Hannay see here.


H. S. Rzepa, March 13th, 2004.