Chemical Information Technology 2007-08

Definition of Chemoinformatics: The collection, representation and organisation of chemical data to create chemical information, to which theories and models can be applied to create chemical knowledge.

Objectives of these lectures: To introduce the background to the course, and the skills to be acquired during the course laboratories, including the use of computer workstations, computer software and network information resources and prioritising and organising the information obtained using these tools. To introduce the chemistry computer laboratory sessions and what you are expected to do during these sessions. The lecture summaries below contain numbered items. At the end of the course, you will be expected to be familiar with the techniques and concepts outlined in each of these 29 categories, and be able to apply them to researching a project.

This course does not deal with any aspects of data logging, analysis and mining (often called Chemometrics) e.g. Excel spreadsheets, Mathematica, MatLab etc.

Lecture 1: 5 Oct. Prologue: What you need to know about computers

Data: Managed by Operating Systems (OS) on Computers (Windows XP, Mac OS X, Redhat Linux), on Phones (e.g. Symbian, Win CE, OS X), Organisers (e.g. PalmOS), iPods (OS X)
Access: On computers by authentication against User names/passwords. Some users (root, admin) have special permissions.
Organisation: Data normally held in Files located in Hierarchical Folders (Directories). Directories referred to as Home or My documents have special status for each user.
- Files: naming convention uses 8.3 (DOS) vs 31 (MacOS 9) vs 256 (Modern) characters. Allowable filename characters based on ASCII set with some exclusions (space, $, /, :, ? etc). Some OS Filenames are case sensitive (Unix), others are not (Windows). Often the cause of much confusion!
- File Content/Data type: normally indicated by adding a 2-4 character extension after a period (.doc) to the name. The extension may or may not be visible. Special types, used by the OS, may be invisible by virtue of starting with a period. The (free text) file content may have been indexed and hence is searchable by the OS.
- File Metadata (Properties): Creation/Modification Dates, sizes, access permissions, "ownership", content, etc. are organised by the OS ("the ability to look at your hard drive through a metadata-filtered view").
- File Location: files are found in the hierarchy by searches using file metadata as criteria.
- File Size: In "bytes" (approximately, 1 character = 1 byte, sometimes 2 bytes). 10⁶ bytes = ~1 Mbyte, 10⁹ bytes = ~1 Gbyte, 10¹² bytes = ~1 Tbyte. Maximum size for any file normally 2 Gbyte (Windows) or very much larger (Linux, Mac OS X).
- Archives: A collection of Folders and Files which preserves the hierarchy and file metadata (.zip, .sit, .tar).
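
As a rough illustration of the file properties listed above (extension, size, modification date, "hidden" names), the following Python sketch asks the OS for them. The helper name `describe_file` and the file name `report.doc` are hypothetical, chosen only for this example.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def describe_file(path):
    """Return a few OS-managed properties of a file as a dictionary."""
    p = Path(path)
    info = p.stat()  # metadata kept by the operating system
    return {
        "name": p.name,
        "extension": p.suffix,                      # e.g. ".doc"; empty if none
        "hidden_on_unix": p.name.startswith("."),   # leading period = hidden
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime),
    }


if __name__ == "__main__":
    # Create a small throwaway file so the example is self-contained.
    with tempfile.TemporaryDirectory() as folder:
        example = Path(folder) / "report.doc"
        example.write_text("penicillin", encoding="ascii")  # 10 characters
        for key, value in describe_file(example).items():
            print(key, "=", value)
```

With ASCII text each character occupies one byte, so the reported size equals the number of characters written.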
Storage:
- Permanent Data Storage, as files on:
  - Local hard drives (capacity 40 Gbytes to 750 Gbytes)
  - Network Drives:
    1. Home directory (Desktop icon Home, also known as drive H:\, capacity ~200 Mbytes per user)
    2. Drive L: (Your "Home" on Linux systems)
    3. Drive N: (A data-silo)
    4. Drive R: (Where files from departmental NMR Spectrometers are placed)
  - Removable media (PenDrives, iPods, CD-RW/DVD, capacity 128 Mbyte - 16 Gbyte - 120 Gbyte)
- Temporary Data Storage, as
  - "clipboard" in "System Memory" (capacity not known by user, but probably < 10 Mbyte)
  - cache or temporary files, not normally seen by the user but can wreak havoc if corrupt!
File Usage: Data Files are created and exchanged using:
- Combinations of programs, typically a Word processor (Word), a chemical drawing program (ChemDraw) and Bibliographic database (EndNote).
- Data exchange between these programs using copy/paste via clipboards or via files (drag-n-drop, save/open).
File Data Structures: Internal structure of files can be hidden or exposed.
- Hidden (binary) file (or clipboard) formats are normally understood only by specific programs. Examples include .DOC, .RTF (Rich Text format), .GIF, .PNG, .JPEG (Graphics), .MPEG (audio, video), .PDF (Acrobat).
- Exposed structures include HTML (structured Hypertext markup language), SVG (Scalable Vector graphics), TXT (un- or semi-structured text)
- Chemical types include:
  - Molecule specifications, holding atoms, their connections and/or their co-ordinates, such as SMILES, PDB, Molfile
  - Spectral/analytical specifications such as JCAMP
  - Query specifications such as SD
- Data: Semantics (meaning) can be added to data structures to make it re-usable in different contexts: XML (eXtensible markup language) is the best known way of doing this.
  - Chemical Specifications include: CML (Chemical Markup Language) and CMLRSS (a chemical news alerting type).
- MetaData: Data should have descriptions to add context. HTML can have exposed metadata (e.g. this document). Acrobat has structure for metadata (XMP) but this is rarely used!
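
To show what "adding semantics" with XML can look like in practice, here is a minimal Python sketch that reads a simplified, CML-style description of water. The fragment is illustrative only (real CML documents normally also declare a namespace and carry more attributes), and the function names `formula_counts` and `bonded_pairs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A simplified, CML-style description of water (illustrative only).
WATER_XML = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
"""


def formula_counts(xml_text):
    """Count atoms of each element by reading the labelled data."""
    root = ET.fromstring(xml_text.strip())
    counts = {}
    for atom in root.iter("atom"):
        element = atom.get("elementType")
        counts[element] = counts.get(element, 0) + 1
    return counts


def bonded_pairs(xml_text):
    """Return the pairs of atom ids joined by each bond."""
    root = ET.fromstring(xml_text.strip())
    return [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]


if __name__ == "__main__":
    print(formula_counts(WATER_XML))
    print(bonded_pairs(WATER_XML))
```

Because every value is labelled (`elementType`, `atomRefs2`), a program that has never seen the file before can still pick out the atoms and bonds, which is the sense in which the data become re-usable in different contexts.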
Data Transport:
- Using Wires/Fibres
  - Local to computer: USB2 (480 Mbps), Firewire (800 Mbps), internal workings.
  - Between Computers: Ethernet (up to 1 Gbps)
- Wireless
  - Local to computer: Bluetooth (e.g. keyboards, mice, phone, ~1 Mbps)
  - Between Computers: WiFi (Chemistry library, labs, lecture theatres, ~40 Mbps)
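
The speeds quoted above are in bits per second, whereas file sizes are in bytes, so a factor of 8 is needed when estimating transfer times. A minimal sketch of that arithmetic follows, assuming the nominal speeds listed above; real transfers are slower because of protocol overheads and shared links, and the name `transfer_seconds` is hypothetical.

```python
# Nominal link speeds in megabits per second, as quoted in the list above.
LINK_SPEEDS_MBPS = {
    "USB2": 480,
    "Firewire": 800,
    "Ethernet": 1000,
    "Bluetooth": 1,
    "WiFi": 40,
}


def transfer_seconds(size_bytes, speed_mbps):
    """Ideal transfer time: bits to send divided by bits per second."""
    bits = size_bytes * 8
    return bits / (speed_mbps * 1_000_000)


if __name__ == "__main__":
    home_directory = 200 * 10**6  # ~200 Mbyte, the quoted home directory size
    for link, speed in LINK_SPEEDS_MBPS.items():
        print(f"{link:10s} {transfer_seconds(home_directory, speed):8.1f} s")
```

On these figures a full 200 Mbyte home directory needs at least 40 s over 40 Mbps WiFi, and under 2 s over 1 Gbps Ethernet.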
Data Exchange: Human/Computer Interactions (i.e. human specifies search query, computer responds with an answer)
- Session-centric (i.e. the context between the query and the answer is preserved during the session, 1977-present)
  1. Exchange of General Graphics using X-Windows (e.g. eXceed), Citrix (proprietary), Windows Remote Desktop (Windows).
  2. Exchange of Chemistry graphics: Beilstein Commander, SciFinder.
  3. Exchange of Programs and Services: .Net or Java
  4. Real-time: MOO/Chat/IRC/AIM programs, Whiteboards, Games, Realtime media Streaming/Broadcast, videoconferencing.
- Document-centric (1993-present) via Web Browsers via URLs, HTML and MIME/chemical MIME
- Information Object-centric (1997-present) Web Browsers, NewsFeeds/Podcasts using RSS/XML.
- Data Exchange: Computer<=>Computer/Human Interactions: The Semantic Web; A Trusted semantic web: Digital Certificates.
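
In document-centric exchange, the MIME type tells the receiving program what kind of data a URL has returned. The sketch below uses Python's standard `mimetypes` module to map a few file extensions to chemical MIME types. The particular extension-to-type table and the helper names are assumptions made for illustration, and `example.org` is a placeholder host.

```python
import mimetypes

# Extension -> chemical MIME type (a small illustrative selection).
CHEMICAL_TYPES = {
    ".mol": "chemical/x-mdl-molfile",
    ".pdb": "chemical/x-pdb",
    ".cml": "chemical/x-cml",
    ".jdx": "chemical/x-jcamp-dx",
}


def build_registry():
    """Start from Python's default table and add the chemical types."""
    registry = mimetypes.MimeTypes()
    for extension, mime in CHEMICAL_TYPES.items():
        registry.add_type(mime, extension)
    return registry


def content_type(registry, url):
    """Guess the MIME type of a URL from its file extension."""
    mime, _encoding = registry.guess_type(url)
    return mime or "application/octet-stream"  # generic fallback


if __name__ == "__main__":
    registry = build_registry()
    for url in ("http://example.org/water.mol", "http://example.org/index.html"):
        print(url, "->", content_type(registry, url))
```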
⇒ Coursework

Lecture 2: 8 Oct. Keyword-based General Bibliographic Searches (1-D)

Objectives of these lectures: The use of Bibliographic and library indices using Web-browser interfaces. Introduction to:

Boolean logical operators
Stemming characters
Grouping
Metadata-driven searches (fielded searches)

This section is centred around the search for the conversion of penicillin to cephalosporin and how to fine tune it. The EndNote bibliographic software will be introduced showing how it operates with Word.

SciFinder Scholar: A "natural language" search system, i.e. "conversion of penicillin to cephalosporin".

Robot based Internet Indices:

Search using: Google Scholar, SciRus, Microsoft Live Academic

BL OPAC (Online Public Access Catalogue): Boolean operators AND, OR, NOT, Truncation symbol ? = any number of wild characters, Grouping done with parentheses (...). Search qualified by metadata descriptors (title, author, etc), often also called a Field Search.
Unicorn (IC Site specific index of local resources): AND, OR, NOT, XOR (exclusive OR, which retrieves either term, but not both terms), $ = 1 Wild character
Try a search of the World's patent literature via the US Patent Office: [AND, OR, ANDNOT, $ (Truncation)] Example: ISD/1/$/2006 and (penicillin andnot cephalosporin)
WOS (Web-of-Science): Uses field tags and Booleans: AND, OR, NOT, SAME = Proximity operator, ? = 1 wild character, * = 1 or more wild characters, allowed in the middle or on the right of a term (SUL*UR, BIOLOG*) but not on the left (*NATAL), (...) for grouped expressions, e.g. A NOT (B OR C)
Beilstein Crossfire: AND, OR, NOT, PROXIMITY, NEAR, NEXT (first term always before second term), WildCards: ? = 1 character, ?? = 2 characters, * = any number. Wild cards can be used left, middle, right. Try "penicillin and cephalosporin" as a text search, then "MP 155-156 AND MF = C29H28N2O6S1 AND ORP 190-200" as a field search.
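
The Boolean operators, wild cards and grouping described above behave like set operations on lists of hits. The following Python sketch makes this concrete on a tiny invented index. The four titles and the function name `hits` are made up for illustration, and the wild-card symbols follow the `?` = 1 character, `*` = any number convention rather than any one database's exact syntax.

```python
from fnmatch import fnmatchcase

# A tiny invented "index": record number -> title.
TITLES = {
    1: "conversion of penicillin to cephalosporin",
    2: "penicillin resistance in bacteria",
    3: "cephalosporin antibiotics reviewed",
    4: "sulfur and sulphur spellings in chemistry",
}


def hits(pattern):
    """Record numbers whose title has a word matching the pattern.

    '?' matches exactly 1 character, '*' matches any number of characters.
    """
    pattern = pattern.lower()
    return {
        number
        for number, title in TITLES.items()
        if any(fnmatchcase(word, pattern) for word in title.lower().split())
    }


if __name__ == "__main__":
    a, b = hits("penicillin"), hits("cephalosporin")
    print("A AND B:", sorted(a & b))   # both terms present
    print("A OR B: ", sorted(a | b))   # either term present
    print("A NOT B:", sorted(a - b))   # first term without the second
    print("A XOR B:", sorted(a ^ b))   # either term, but not both
    print("sul*ur: ", sorted(hits("sul*ur")))  # middle truncation
    # Grouping with parentheses: A NOT (C OR D)
    print(sorted(a - (hits("bacteria") | hits("sul*ur"))))
```

Parentheses in the Python expression play the same role as `(...)` in a query: the grouped OR is evaluated first, then its result is excluded.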

Lectures 3,4: 9, 11 Oct. Chemical Connectivity and Structure Searches (2-D)

Objectives of these lectures: Searching for chemical structures, sub-structures and reactions using 2D molecule definitions, starting with text descriptors of molecular connectivity (SMILES strings) generated using ChemDraw, and moving to the use of proprietary programs for defining connectivity and searching for molecular properties and molecular reactions. Illustrated via the following databases:

- Searching the eMolecules database using a SMILES string generated from ChemDraw (O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N) for "similar" molecules.
- Searching the PubChem database using a SMILES string generated from ChemDraw (O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N) for "similar" molecules.
- CambridgeSoft Chemfinder System
- Organic syntheses for specific molecule queries.
- The on-line Corina service to convert a (1D) SMILES string to 3D molecular coordinates is an example of an "added-value" service (in this case a 1D to 3D conversion!).
- SciFinder sub-structure search and 3D coordinate display/export.
- Beilstein Crossfire molecule sub-structure, reaction searches and application of AUTONOM to naming drawn structures. Stereochemistry. Export of data.
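
A SMILES string is plain text, so its atoms and ring closures can be picked out with ordinary text tools. The sketch below tokenises the SMILES string used in the searches above and counts heavy atoms and rings. It is a deliberately simplified reader (it ignores two-digit `%nn` ring closures and does not check valences), not a full SMILES parser; for real work a cheminformatics toolkit would be the usual choice. All function names are hypothetical.

```python
import re
from collections import Counter

SMILES = "O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N"

# Bracket atoms such as [C@H], two-letter organic atoms (Cl, Br), then
# one-letter atoms in upper case (aliphatic) or lower case (aromatic).
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def element_of(token):
    """Return the element symbol of one atom token."""
    if token.startswith("["):
        # Skip any isotope digits, then read the element symbol.
        match = re.match(r"\[\d*([A-Z][a-z]?|[a-z])", token)
        return match.group(1).capitalize()
    return token.capitalize()


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms of each element."""
    return Counter(element_of(token) for token in ATOM.findall(smiles))


def ring_count(smiles):
    """Each ring is opened and closed by the same digit, so halve the digits."""
    outside_brackets = re.sub(r"\[[^\]]+\]", "", smiles)
    return len(re.findall(r"\d", outside_brackets)) // 2


if __name__ == "__main__":
    print(dict(heavy_atom_counts(SMILES)))
    print("rings:", ring_count(SMILES))
```

For this string the two ring-closure digits correspond to the fused four-membered and five-membered rings of the penicillin-type core.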

Lecture 5: 16 October. Chemical Structure, Property and Shape Based Searches (3-D), Integrated Collections (4-D!)

Sub-structure searching of the Cambridge crystal database of organic and organometallic molecules for specific molecules, and intermolecular interactions (e.g. unusual π-H-O hydrogen bonds).
A search of the NIST Chemistry WebBook for thermodynamic and spectral data and compound substructures. Export of spectral data.
Use of Jmol to display complex Protein Structures (also demo page). Brief overview of bio-informatics, Protein Databank (Keywords penicillin and tetrahedral) and Protein Explorer (direct entry).
Use of "added-value" sites such as ChemCalc for property calculations.
Survey of modern integrated electronic information systems: ACS (including Enhanced Web Objects), RSC and its Project Prospect, and Science Direct sites (direct jump from Molecule to Beilstein via Dymond: http://dx.doi.org/10.1016/j.tetlet.2005.09.104), and the Digital Object Identifier (DOI): http://dx.doi.org/publisher/article
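
The DOI pattern above (resolver, then publisher prefix, then article suffix) can be expressed in a few lines. This is a minimal sketch: the function names are hypothetical, the resolver address is the one quoted above, and no escaping is applied to unusual characters that some DOIs contain.

```python
def split_doi(doi):
    """Split a DOI into its publisher prefix and article suffix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, separator, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and separator and suffix):
        raise ValueError("not a DOI of the form 10.xxxx/suffix: " + repr(doi))
    return prefix, suffix


def doi_to_url(doi, resolver="http://dx.doi.org/"):
    """Build the resolver URL: resolver + publisher prefix + '/' + article."""
    prefix, suffix = split_doi(doi)
    return resolver + prefix + "/" + suffix


if __name__ == "__main__":
    print(split_doi("10.1016/j.tetlet.2005.09.104"))
    print(doi_to_url("doi:10.1016/j.tetlet.2005.09.104"))
```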

Lecture 6: October 18. Introduction to The ChemWiki Project

Introduction to the Wiki project, including objectives.
How to Create and Edit a Wiki page
Display of images. These will be obtained either from existing web pages or by "screen snapshots" (Screen Grab Pro or others)
Special characters: Use special box!

Note: displaying a molecule requires Java 1.4.

Display of 3D molecular structures using the JMol applet (for demos see http://jmol.sf.net/demo/). The following shows how to include a molecule in a Wiki page:

<jmol>
<jmolApplet>
<size>200</size>
<color>white</color>
<script>zoom 80; cpk on;frame 1; move 10 -20 10 0 0 0 0 0 3; delay 1;</script>
<inlineContents>water.mol
title1
title2
  3  2  0  0  0                 1 V2000
   -1.0400    1.5290    0.0000 O   0  0  0  0  0
   -1.0331    2.4710    0.0000 H   0  0  0  0  0
   -1.9535    1.2993    0.0000 H   0  0  0  0  0
  1  2  1  0  0  0
  1  3  1  0  0  0
M  END</inlineContents>
</jmolApplet>
</jmol>
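
The text between `<inlineContents>` and `</inlineContents>` above is a Molfile: three header lines, a counts line, one line per atom and one line per bond. As a rough sketch of how such a file can be read, the following Python code extracts the atoms and bonds of the water example and computes the bond lengths. It assumes the fixed-width V2000 column layout, handles only the atom and bond blocks, and uses hypothetical function names.

```python
import math

# The water Molfile from the wiki markup above, one string per line.
WATER_MOL = "\n".join([
    "water.mol",
    "title1",
    "title2",
    "  3  2  0  0  0                 1 V2000",
    "   -1.0400    1.5290    0.0000 O   0  0  0  0  0",
    "   -1.0331    2.4710    0.0000 H   0  0  0  0  0",
    "   -1.9535    1.2993    0.0000 H   0  0  0  0  0",
    "  1  2  1  0  0  0",
    "  1  3  1  0  0  0",
    "M  END",
])


def parse_molfile(text):
    """Return (atoms, bonds) from a V2000 Molfile.

    atoms: list of (symbol, (x, y, z)); bonds: list of (atom1, atom2, order)
    with atoms numbered from 1, as in the file.
    """
    lines = text.splitlines()
    counts = lines[3]                 # fourth line: atom and bond counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        xyz = (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        atoms.append((line[31:34].strip(), xyz))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


def bond_lengths(atoms, bonds):
    """Distance between the two atoms of each bond, in the file's units."""
    lengths = []
    for first, second, _order in bonds:
        symbol_a, xyz_a = atoms[first - 1]
        symbol_b, xyz_b = atoms[second - 1]
        lengths.append((symbol_a + "-" + symbol_b, math.dist(xyz_a, xyz_b)))
    return lengths


if __name__ == "__main__":
    atoms, bonds = parse_molfile(WATER_MOL)
    for label, length in bond_lengths(atoms, bonds):
        print(label, round(length, 3))
```

Both O-H distances come out close to 0.94 in the file's coordinate units, which is a quick sanity check that the atom and bond blocks were read from the right columns.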