As data repositories start to flourish, it is reasonable to ask questions such as what sort of chemistry can be found there and how can I find it? Here I give an updated[cite]10.1515/ci-2016-3-408[/cite] worked example of a digital repository search for chemical content and also pose an important issue for the chemistry domain.
Firstly, I should say this search is restricted just to those data repositories that submit indexing terms (metadata) to DataCite, which is the agency that will be used to conduct the searches. Each type of metadata is defined by a prefix or operator field (much in the same way that an advanced Google search can be prefixed with an operator, e.g. author:♥). I will use just two such DataCite field prefixes† here as exemplars (there are many more).
- media: This specifies the media type for the data being searched. For restriction to chemistry one takes advantage of the chemical/x- media type, as described previously.[cite]10.1021/ci9803233[/cite]
- SubjectScheme: This is a new declaration, as specified in the DataCite V4 metadata schema.[cite]10.5438/0012[/cite] The subject scheme in effect declares a subject-specific term, and is designed to be used by domains such as chemistry.
This latter is best illustrated by one specific example of a search which I will dissect here:
https://search.datacite.org/works?query=media:chemical\/x\-gaussian*+SubjectScheme:inchikey+subject:XZYDALXOGPZGNV-UHFFFAOYSA-M+media:chemical\/x\-mnpub*‡
- https://search.datacite.org/works?query= queries the DataCite MDS† (metadata store).
- media:chemical\/x\-gaussian* matches any media type beginning with the string chemical/x-gaussian, the * being a wild-card which allows any characters to follow this string. This selects entries from any data repository where Gaussian files have been deposited and assigned this media type.
- + represents a Boolean AND operator.
- SubjectScheme:inchikey restricts a subject search to a subjectScheme having the value inchikey, whilst
- subject:XZYDALXOGPZGNV-UHFFFAOYSA-M defines the value of the subject itself.
- media:chemical\/x\-mnpub* completes the search definition, requiring the additional presence of an Mpublish[cite]10.1186/s13321-017-0190-6[/cite] file indicating (spectroscopic, probably NMR) data readable by the MestReNova program.
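As a rough illustration of how the pieces dissected above fit together, here is a minimal Python sketch that assembles the same query string. The function name build_query is my own invention for this example; only the base URL and the four terms come from the search shown above, and the terms are assumed to be already escaped.

```python
# Base URL of the DataCite search service, as quoted in the post.
BASE_URL = "https://search.datacite.org/works?query="


def build_query(terms):
    """Join already-escaped field:value terms with '+', the Boolean AND
    described in the post. (Hypothetical helper, not a DataCite API.)"""
    return BASE_URL + "+".join(terms)


# Raw strings keep the backslash escapes exactly as they appear in the query.
terms = [
    r"media:chemical\/x\-gaussian*",        # Gaussian files
    "SubjectScheme:inchikey",               # subject scheme is InChIKey
    "subject:XZYDALXOGPZGNV-UHFFFAOYSA-M",  # the InChIKey value itself
    r"media:chemical\/x\-mnpub*",           # Mpublish file must also be present
]

url = build_query(terms)
print(url)
```

Swapping the subject term for a different InChIKey, or dropping the second media term, gives the obvious variations on this search.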
One hit with these restrictions has doi: 10.14469/HPC/2635 and clicking the button on the landing page for this object labelled metadata resolves to e.g.
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/2635,
and downloads the metadata record for this object. Part of this record looks a bit like:
This brings me to the important issue for the chemistry domain, which is to agree upon a core set of SubjectSchemes for implementation in data repositories with domain-specific chemical content. The two subjects above, the InChI and the InChIKey, seem obvious candidates for inclusion. But how the list is extended and how the SubjectScheme is specified are now matters for the community to discuss. Perhaps the IUPAC Gold Book is one starting point for the SubjectScheme URIs. Watch this space.
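For anyone wanting to see which subject schemes a given record declares, a minimal Python sketch along the following lines might serve. The XML fragment is a hand-written stand-in and not the actual record for the object above: it assumes the DataCite V4 kernel namespace and carries only the InChIKey used in the search. The function name subjects_by_scheme is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for DataCite V4 metadata records.
NS = {"dc": "http://datacite.org/schema/kernel-4"}

# Hand-written stand-in fragment (NOT the real record); it carries only the
# InChIKey subject that appears in the search query.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <subjects>
    <subject subjectScheme="inchikey">XZYDALXOGPZGNV-UHFFFAOYSA-M</subject>
  </subjects>
</resource>"""


def subjects_by_scheme(xml_text):
    """Group the text of each <subject> element by its subjectScheme attribute."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for element in root.findall(".//dc:subject", NS):
        scheme = element.get("subjectScheme", "")  # empty string if undeclared
        grouped.setdefault(scheme, []).append((element.text or "").strip())
    return grouped


found = subjects_by_scheme(SAMPLE)
print(found)
```

Run across many records, a tally of the keys returned would give one crude view of which SubjectSchemes repositories are actually using.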
‡The \ syntax indicates an “escaped” character. Thus in chemical\/x\-gaussian the \ ensures that the following / is treated as part of the search string, and not as part of the search syntax. Likewise \- ensures the minus character is part of the string and not a syntactic negation. The current list of characters requiring escaping is + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
† The documentation lists common fields, but there are far more specified in V4 of their schema. The ones you see used here are not (yet?) documented at https://search.datacite.org/help.html
♥ This Google page has a rich plethora of powerful searches, which I suggest almost no-one knows about!
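The escaping rule described in the ‡ footnote can be sketched in a few lines of Python. This is only an illustration under the assumption that the character list given there is complete; escape_term is a name I have made up. Note that any wild-card * has to be appended after escaping, since otherwise it would itself be escaped.

```python
# Characters listed in the footnote as requiring a backslash escape.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(value):
    """Prefix each special character in a search value with a backslash.
    (Hypothetical helper for illustration only.)"""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)


escaped = escape_term("chemical/x-gaussian")
# Append the wild-card afterwards so that it keeps its syntactic meaning.
term = "media:" + escaped + "*"
print(term)
```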
Yesterday, a webinar on various aspects of FAIR data was held. Participants were encouraged to leave issues and questions on the topic on a GitHub forum. You can see these at https://github.com/FAIR-Data-EG/consultation/issues.
There will be more such discussions, and if you are interested, do register to participate.