Keywords: RSS (RDF Site Summary), XML (eXtensible Markup Language), CML (Chemical Markup Language), Semantic Web, News alerts.
It is appropriate after almost exactly ten years of Web-based publishing to assess what this revolution has meant for the dissemination of scientific and chemical information. Well in excess of 2 billion documents, of which perhaps 5% have some scientific content, have been generated (authored is perhaps putting it a trifle too strongly). To this one must add a limitless number of "virtual" documents generated dynamically upon request. Even within a relatively specialised discipline such as chemistry, the effect upon a human can be overwhelming; keeping up with the literature has evolved from a weekly browse of the tables of contents of perhaps ten key paper-based journals to having to visit a much larger number of computer-based Web sites. It might be fair to say that many humans are not coping too well! Ironically of course, computers are much more efficient than humans at scouring a large number of information sources in an error-free, boredom-free and periodic manner; the challenge is merely in specifying what it is they should be on the lookout for. The first generation Web (circa 1993-present) turns out in retrospect to have been rather unsuited to this task. This is not actually the fault of the original design, merely a facet of how it was (or more accurately was not) implemented. The key omission was meta-data, this being a concise and structured declaration of the content model of the document.
Before this last topic is elaborated, it is worth noting the "Google phenomenon", an index of the entire Web which for most scientists renders it actually useful as a scientific information source.1 Thus the erstwhile need to constantly "bookmark" key locations found by a possibly accidental and probably not reproducible path is now productively replaced by being able to rely on Google to find information based on a few choice key words. Google augments this with additional meta-information derived from choices made by earlier searchers using the same key words; in effect a coarse but gradually accumulating peer review mechanism which naturally selects for the survival of the fittest information.
The original and still current Web, Google notwithstanding, in practice contains very little overtly structured information in the form of declared meta-data; this may be simply the title of a document and little else. The human authors of most documents have either cared little or known little about the relatively arcane mechanisms the original designers of e.g. HTML had put in place to capture meta-data. This is something of a chicken-and-egg issue; the average chemist has hitherto had no compelling reason to learn these arcane methods, preferring instead to focus on chemical applications and doing what they do best! The consequence has been that the Web has not fulfilled its anticipated potential for "serendipity", or the art of accidental and fortuitous discovery of unexpected information. This is still an act expected to be achieved by humans exercising their perceptive skills and not by computers.
At this point in the discourse we introduce RDF Site Summary or RSS (acronyms can indeed be successful at capturing the world's imagination, viz HTML!). This is a simple but powerful XML-based implementation of meta-data of which (like HTML) the arcane aspects can be hidden behind user-friendly software referred to variously as a news reader, aggregator or RSS client. RSS grew out of an idiosyncratic phenomenon known as Web logging. Web Logs (now known as "Blogs") are essentially personalized Web servers containing chronologically organised items of information. Four items of meta-data are implicit and indeed mandatory in a Web Log: the identity of the author, the date of each item, a brief description of it and how to link to it. The "blogging" community perceived the need to summarise Blogs using such meta-data and then to broadcast these summaries in a manner which allowed "aggregation" of themes into larger "channels" of such information, and thence into syndication; a procedure not very different from e.g. newspapers or broadcast television. This process in due course merged with another and rather more formal vision of evolving the homogeneous unstructured Web into a Semantic Web,3 where in addition to meta-data, information is carried which allows machines to process documents without the necessity of human intervention. This fusion has produced a formal XML-based specification known as RSS,2 which is supported by readily available RSS-aware software which can act upon the contents of such RSS files. The remainder of this article shows an example of how RSS is currently applied and used, and follows with a call for its application in chemistry and a scenario of how it might be applied in this subject in the future.
One way of regarding RSS is as relating to the management and description of hyperlinks in the same way that the earlier HTML (and more recently XML) markup languages relate to the management and declaration of content and data. RSS therefore augments a conventional Web site (or the smaller scale Web log), and can be added to a site without the need for any additional infrastructure or software. Many major news sites already do this, but it is still rare for chemically related sites. At the server end, deploying RSS requires nothing more than the addition to the contents of a Web site of two entries:
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://.../index.rss" />
In fact, the RSS document need not even be real; it can (and perhaps even should) be generated dynamically by an appropriate query of any content management system which is used to generate documents for the Web site. Like HTML, RSS can also be generated using simple authoring tools; a good one is to be found at http://rssxpress.ukoln.ac.uk/. An example of an RSS file is shown in Scheme 1:
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
  <channel rdf:about="http://www.xml-cml.org/">
    <title>Chemical Markup Language</title>
    <link>http://www.xml-cml.org/index.html</link>
    <description>CML Highlights</description>
    <webMaster>admin@cmlconsulting.com</webMaster>
    <image rdf:resource="http://ww.xml-cml.org/cml.gif" />
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.xml-cml.org/rss/" />
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml-cml.org/rss/">
    <title>This RSS file contains an embedded Molecule in CML</title>
    <link>http://www.xml-cml.org/cml.rss</link>
    <description>Currently, RSS clients are not capable of acting upon the cml namespace in the RSS descriptor, and so ignore this information. In future, one can anticipate that a CML-aware RSS client will not only extract the molecular information, but be capable of searching/filtering/transforming it into more useful forms (ultimately for example being able to detect particular sub structures or other specific items of interest to the reader). </description>
    <dc:creator>Henry Rzepa</dc:creator>
    <dc:date>2003-04-06T10:00:00-05:00</dc:date>
    <cml:molecule id="a1" title="water">
      <cml:formula conciseForm="H 2 O 1"/>
    </cml:molecule>
  </item>
</rdf:RDF>
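To give a feel for how little is needed for software to act upon such a file, the following is a minimal sketch (not the method of any particular RSS client) of reading the items of Scheme 1 using the ElementTree module of the Python standard library. The function name read_items and the cut-down inline copy of the feed are our own illustrative assumptions; the namespace URIs are those declared in Scheme 1.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the feed of Scheme 1.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cml": "http://www.xml-cml.org/schema/cml2/core",
}

# A cut-down copy of Scheme 1, kept inline so that the sketch is self-contained.
SAMPLE_FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
  <channel rdf:about="http://www.xml-cml.org/">
    <title>Chemical Markup Language</title>
    <link>http://www.xml-cml.org/index.html</link>
    <description>CML Highlights</description>
  </channel>
  <item rdf:about="http://www.xml-cml.org/rss/">
    <title>This RSS file contains an embedded Molecule in CML</title>
    <link>http://www.xml-cml.org/cml.rss</link>
    <dc:creator>Henry Rzepa</dc:creator>
    <dc:date>2003-04-06T10:00:00-05:00</dc:date>
    <cml:molecule id="a1" title="water">
      <cml:formula conciseForm="H 2 O 1"/>
    </cml:molecule>
  </item>
</rdf:RDF>"""


def read_items(feed_text):
    """Return one dictionary of meta-data per <item> in an RSS 1.0 feed."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.findall("rss:item", NS):
        # The CML payload is optional; most feeds will not carry one.
        formula_el = item.find("cml:molecule/cml:formula", NS)
        items.append({
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "link": item.findtext("rss:link", default="", namespaces=NS),
            "creator": item.findtext("dc:creator", default="", namespaces=NS),
            "date": item.findtext("dc:date", default="", namespaces=NS),
            "formula": None if formula_el is None else formula_el.get("conciseForm"),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["date"], entry["creator"], entry["title"], entry["formula"])
```

The four mandatory items of Web Log meta-data (author, date, description and link) map directly onto elements of the item, and the embedded molecule is reached simply by asking for the cml namespace rather than ignoring it.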
Finding RSS feeds to subscribe to can be done either by selecting from pre-determined lists provided with the RSS client (although "science" as a category is included with most clients, this currently contains little of overt chemical interest), by discovery at visited web sites (RSS feeds are normally indicated there by a displayed icon) or by searches at sites specializing in RSS discovery.5 We emphasize that numerous RSS channels already exist; it might be expected that in due course, syndication sites specializing in chemical content will arise.
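Discovery at a visited Web site can also be left to software, since a page may declare its feed with a link element of type application/rss+xml. The following is a minimal sketch of such a discovery step using the html.parser module of the Python standard library; the class and function names, the sample page and the example.org address are hypothetical and serve only as illustration.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect the href of every <link> element that advertises an RSS feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here by default.
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        mime_type = (attr.get("type") or "").lower()
        href = attr.get("href")
        if "alternate" in rel_values and mime_type == "application/rss+xml" and href:
            self.feeds.append(href)


def find_feeds(html_text):
    """Return the feed addresses declared in the given HTML text."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.feeds


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head>
<title>An example chemistry site</title>
<link rel="stylesheet" type="text/css" href="style.css" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.org/index.rss" />
</head><body><p>Content</p></body></html>"""

if __name__ == "__main__":
    print(find_feeds(SAMPLE_PAGE))
```

A real client would in addition have to fetch the page and resolve relative addresses against it; both steps are omitted here.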
The results of such requests could then be re-used in a different context, for example automated entry to a database, use by a synthesis robot, entry in a personal diary etc. One might imagine an example in the future where RSS channels from primary chemical journals will contain explicit chemical information encoded in e.g. CML,7 which is automatically screened by the user's software for the presence of say a particular molecular structure, or molecules with particular properties. Any found matching these criteria could be automatically submitted for e.g. quantum mechanical calculation of further properties.8 The human comes to their (probably virtual) desk in the morning to find not only that overnight the system has discovered a molecule of interest to them, but has arranged calculation of its properties, or even its synthesis in the laboratory!
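The screening step of this scenario might be sketched as follows. This is only an illustration under stated assumptions: it matches on the molecular formula rather than on structure or properties, it assumes the conciseForm attribute always alternates element symbols and counts (as in "H 2 O 1"), and the feed contents, function names and example.org addresses are hypothetical.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefixes for the RSS 1.0 and CML namespaces.
RSS = "{http://purl.org/rss/1.0/}"
CML = "{http://www.xml-cml.org/schema/cml2/core}"

# Hypothetical journal feed carrying one CML molecule per item.
SAMPLE_FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:cml="http://www.xml-cml.org/schema/cml2/core">
  <item rdf:about="http://example.org/articles/1">
    <title>An article about water</title>
    <link>http://example.org/articles/1</link>
    <cml:molecule id="m1" title="water">
      <cml:formula conciseForm="H 2 O 1"/>
    </cml:molecule>
  </item>
  <item rdf:about="http://example.org/articles/2">
    <title>An article about methanol</title>
    <link>http://example.org/articles/2</link>
    <cml:molecule id="m2" title="methanol">
      <cml:formula conciseForm="C 1 H 4 O 1"/>
    </cml:molecule>
  </item>
</rdf:RDF>"""


def parse_concise_formula(text):
    """Turn 'H 2 O 1' into {'H': 2, 'O': 1} (assumed symbol/count pairs)."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating element symbols and counts")
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}


def items_matching_formula(feed_text, wanted):
    """Return the links of all items carrying a molecule with the wanted formula."""
    wanted_counts = parse_concise_formula(wanted)
    root = ET.fromstring(feed_text)
    hits = []
    for item in root.iter(RSS + "item"):
        for formula in item.iter(CML + "formula"):
            concise = formula.get("conciseForm")
            # Comparing dictionaries makes the match independent of element order.
            if concise and parse_concise_formula(concise) == wanted_counts:
                hits.append(item.findtext(RSS + "link"))
                break
    return hits


if __name__ == "__main__":
    print(items_matching_formula(SAMPLE_FEED, "O 1 H 2"))
```

The links returned are what would then be handed on to a database, a calculation queue or a diary; a genuine sub-structure search would of course need a chemistry toolkit in place of the formula comparison used here.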
This in fact is a dynamically generated RSS feed, resulting from a query fed to the (MySQL) database of metalloproteins, with the output being formatted in RSS. Installing this will alert the user to any new (in this case zinc-containing) metalloproteins recently added.
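The general shape of such a dynamically generated feed can be sketched as below. We stress that this is not the code behind the metalloprotein service: an in-memory SQLite database stands in for MySQL so that the sketch is self-contained, and the table layout, protein names, function names and example.org addresses are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask the serialiser for the same prefixes as used in Scheme 1.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("", RSS_NS)
ET.register_namespace("dc", DC_NS)


def q(namespace, tag):
    """Build a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return "{%s}%s" % (namespace, tag)


def make_demo_db():
    """Create a small in-memory database (a hypothetical stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proteins (name TEXT, url TEXT, metal TEXT, added TEXT)")
    conn.executemany(
        "INSERT INTO proteins VALUES (?, ?, ?, ?)",
        [
            ("Hypothetical zinc protein A", "http://example.org/protein/1", "Zn", "2003-04-01"),
            ("Hypothetical iron protein B", "http://example.org/protein/2", "Fe", "2003-04-03"),
            ("Hypothetical zinc protein C", "http://example.org/protein/3", "Zn", "2003-04-05"),
        ],
    )
    return conn


def build_feed(conn, metal, channel_url, limit=5):
    """Query the most recent entries for one metal and format them as RSS 1.0."""
    rows = conn.execute(
        "SELECT name, url, added FROM proteins "
        "WHERE metal = ? ORDER BY added DESC LIMIT ?",
        (metal, limit),
    ).fetchall()

    root = ET.Element(q(RDF_NS, "RDF"))
    channel = ET.SubElement(root, q(RSS_NS, "channel"), {q(RDF_NS, "about"): channel_url})
    ET.SubElement(channel, q(RSS_NS, "title")).text = "New %s-containing metalloproteins" % metal
    ET.SubElement(channel, q(RSS_NS, "link")).text = channel_url
    ET.SubElement(channel, q(RSS_NS, "description")).text = "Most recent additions"
    seq = ET.SubElement(ET.SubElement(channel, q(RSS_NS, "items")), q(RDF_NS, "Seq"))

    for name, url, added in rows:
        # Each entry is listed in the channel and then described as an item.
        ET.SubElement(seq, q(RDF_NS, "li"), {q(RDF_NS, "resource"): url})
        item = ET.SubElement(root, q(RSS_NS, "item"), {q(RDF_NS, "about"): url})
        ET.SubElement(item, q(RSS_NS, "title")).text = name
        ET.SubElement(item, q(RSS_NS, "link")).text = url
        ET.SubElement(item, q(DC_NS, "date")).text = added
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_feed(make_demo_db(), "Zn", "http://example.org/zinc.rss"))
```

Because the feed is rebuilt from the query on every request, no RSS file need ever be stored on the server; the limit parameter caps the feed at the most recent few additions.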
This is a feed generated using a Java servlet application to again query a database and return an appropriately formatted RSS file.
An obvious and immediate application might be as an alerting service for the primary chemical journals. Consider the current somewhat haphazard procedure most chemists have come to adopt since the majority of science journals have gone on-line. One periodically finds a browser "bookmark" to a favorite journal and (assuming it is still functional) visits the "latest issue" section of the table of contents, which will reveal titles of the latest articles, and possibly an abstract that may impart more explicit chemical information (but only visually, since it is almost certainly a graphical image which requires a human for perception). Further analysis will require the download of e.g. an Acrobat file, which normally arrives on the user's disk in the "downloads" directory. It still requires much action by the user to organise this reprint into a bibliographic database (for example Endnote), and the process has to be repeated for each journal, with the added difficulty that each publisher's "user-interface" is different and has again to be learnt. Most of us probably realise that this procedure does not "scale", and are wondering how we will cope with this in the future. This procedure could be replaced by a much more structured one based on RSS metadata, derived (in the future) by automated processing of the original full (XML-based) article. In order to discover new and potentially interesting articles, the user subscribes to the RSS feeds of relevant publishers, and can e.g. simply search the latest items that appear automatically for key words of interest. The article download is still necessary, although it may be possible for the RSS client to automatically invoke e.g. bibliographic software (or alternatively such software could support RSS directly). When primary scientific publications become available directly in XML (rather than e.g. Acrobat) the possibilities for their re-use increase enormously; no longer is one limited merely to printing the article!
Another immediate application of RSS is as an alerting service for new additions to chemical databases (although here the sheer volume of new additions might require immediate mechanisms for filtering this down to a manageable quantity). In addition to the two examples noted above, we have easily implemented a simple extension to our php/MySQL-based ChemStock inventory system12 to alert users to e.g. the last five added entries at any given time. It is also apparent that a strength of the system is that separate alert streams from different communities (say chemistry and bio-informatics) might be semantically combined to create connections that are simply not currently being made due to the sheer overload of information. Perhaps the most significant aspect of an increasing deployment of RSS is that it could serve as a focal point for increasing awareness of the importance of creating properly structured information, which includes well defined meta-data, and of managing the links between such information (the hyperlinks) in a manner which allows software as well as humans to utilise these connections in a semantically meaningful manner. RSS does seem to be a tool which is bringing the ultimate vision of a chemical semantic web one step closer.3