The conventional procedure for reporting analyses or new results in science is to compose an “article”, augment that perhaps with “supporting information” or “SI”, submit it to a journal which undertakes peer review, revise as necessary for acceptance, and finally publish. If errors in the original are later identified, a separate corrigendum can be submitted to the same journal, although this is relatively rare. Any new information which appears post-publication is then considered for a new article, and the cycle continues. Here I consider the possibilities for variations in this sequence of events.
The new disruptors in the processes of scientific communication are the “data”, which can now be given an existence (as FAIR data) separate from the article and its co-published “SI”. Nowadays both the “article+SI” and any separate “data” have another, mostly invisible component, the “metadata”. Few authors ever see this metadata. For the article, it is generated by the publisher (as part of the service to the authors) and sent to CrossRef, which acts as a global registration agency for this particular metadata. For the data, it is assembled when the data is submitted to a “data repository”, either by the authors providing the information manually, by automated workflows installed in the repository, or by a combination of both. It might also be assembled by the article publisher as part of a complete metadata package covering both article and data, rather than being separated from the article metadata. The metadata about data is then registered with the global agency DataCite (and occasionally with CrossRef, for historical reasons).‡ Few depositors ever inspect this metadata after it is registered; even fewer authors are involved in decisions about that metadata, or have any input to the processes involved in its creation.
Let me analyse a recent example.
- For the article[cite]10.1021/acsomega.8b03005[/cite] you can see the “landing page” for the associated metadata at https://search.crossref.org/?q=10.1021/acsomega.8b03005 and actually retrieve the metadata itself using https://api.crossref.org/v1/works/10.1021/acsomega.8b03005, albeit in a rather human-unfriendly form.† This may be because CrossRef considers metadata as something just for machines to process and not for humans to see!
- This metadata indicates “references-count”: 22, which is a bit odd since 37 references are actually cited in the article. It is not immediately obvious why there is a difference of 15 (I am querying this with the editor of the journal). None of the references themselves are included in the metadata record, because the publisher does not currently support their liberation as Open References, which makes it difficult to track the missing ones down.
- This last inference can be tested with metadata from another article,[cite]10.1039/C7SC03595K[/cite] using e.g.
https://api.crossref.org/v1/works/10.1039/C7SC03595K or
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1039/C7SC03595K
which reveals a full citation list, including explicit citations to data objects as per: https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/1620
- Of the 37 citations listed in the article itself,[cite]10.1021/acsomega.8b03005[/cite] #22, #24 and #37 differ from the rest in being citations to data sources. The first of these, #22, is an explicit reference to the data partner for the article.
- An alternative method of invoking a metadata record:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1021/acsomega.8b03005
retrieves a subset of the article metadata available via the CrossRef query,‡ but again with no included references and again nothing for the data citation #22.
- Citation #22 in the above does have its own metadata record, obtainable using:
https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4751
- This has an entry
<relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
which points back to the article.[cite]10.1021/acsomega.8b03005[/cite]
- To summarise, the article noted above[cite]10.1021/acsomega.8b03005[/cite] has a metadata record that does not include any information about the references/citations (apart from an ambiguous count). A human reading the article can, however, easily identify one citation pointing to the article’s data, which it turns out DOES have a metadata record that both human and machine can identify as pointing back to the article. Let us hope the publisher (the American Chemical Society) corrects this asymmetry in the future; it can be done, as shown here![cite]10.1039/C7SC03595K[/cite]
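The machine-readable back-link from data to article noted above can be verified programmatically. Below is a minimal sketch: the XML is abridged from the DataCite record for the data DOI (in practice one would fetch the full record from the data.datacite.org URL shown above), and the helper function `back_links` is my own illustrative name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Abridged DataCite record for 10.14469/hpc/4751, containing the
# relatedIdentifier element shown above. A real record would be fetched from
# https://data.datacite.org/application/vnd.datacite.datacite+xml/<DOI>
DATACITE_XML = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.14469/hpc/4751</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI"
        relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""

NS = {"dc": "http://datacite.org/schema/kernel-4"}

def back_links(xml_text, relation="IsReferencedBy"):
    """Return the DOIs this data record declares for the given relationType."""
    root = ET.fromstring(xml_text)
    return [el.text
            for el in root.iterfind(".//dc:relatedIdentifier", NS)
            if el.get("relationType") == relation]

print(back_links(DATACITE_XML))  # ['10.1021/acsomega.8b03005']
```

A machine given only the data DOI can thus recover the citing article, which is exactly the link that is missing, in the opposite direction, from the article’s own metadata record.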
For both types of metadata record, it is the publisher that retains the rights to modify them. Here, however, we encounter an interesting difference: the publishers of the data are, in this case, also the authors of the article! A modification to this record was made post-publication by this author, so as to include the journal article identifier once it had been received from the publisher,[cite]10.1021/acsomega.8b03005[/cite] as in 2 above. Subsequently, these topics were discussed at a workshop on FAIR data, during which further pertinent articles[cite]10.1002/mrc.4806[/cite], [cite]10.1006/jmre.1997.1214[/cite], [cite]10.1006/jmre.2000.2071[/cite] relating to the one discussed above[cite]10.1021/acsomega.8b03005[/cite] were shown in a slide by one of the speakers. Since this was deemed to add value to the context of the data for the original article, identifiers for these articles were also appended to the metadata record of the data.
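The kind of post-publication update just described amounts to a small XML transformation of the DataCite record. The sketch below assumes the kernel-4 schema namespace; the function name is illustrative, and actual re-registration would of course go through the repository or DataCite’s registration service, details of which are omitted here.

```python
import xml.etree.ElementTree as ET

DATACITE_NS = "http://datacite.org/schema/kernel-4"
ET.register_namespace("", DATACITE_NS)

def append_related_identifier(xml_text, doi, relation="IsReferencedBy"):
    """Append a relatedIdentifier (e.g. a newly received article DOI)
    to an existing DataCite record and return the updated XML string."""
    root = ET.fromstring(xml_text)
    container = root.find(f"{{{DATACITE_NS}}}relatedIdentifiers")
    if container is None:  # record had no related identifiers yet
        container = ET.SubElement(root, f"{{{DATACITE_NS}}}relatedIdentifiers")
    rel = ET.SubElement(container, f"{{{DATACITE_NS}}}relatedIdentifier",
                        relatedIdentifierType="DOI", relationType=relation)
    rel.text = doi
    return ET.tostring(root, encoding="unicode")

# A minimal record as it might look before the article DOI was known:
record = (f'<resource xmlns="{DATACITE_NS}">'
          f'<identifier identifierType="DOI">10.14469/hpc/4751</identifier>'
          f'</resource>')
updated = append_related_identifier(record, "10.1021/acsomega.8b03005")
print("10.1021/acsomega.8b03005" in updated)  # True
```

The interesting point is not the transformation itself, which is trivial, but who is entitled to perform it and when, which is precisely the question the next section raises.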
This now raises the following questions:
- Should a metadata record be considered a living object, capable of being updated to reflect new information received after its first publication?
- If metadata records are an intrinsic part of both a scientific article and any data associated with that article, should authors be fully aware of their contents (if only as part of due diligence to correct errors or to query omissions)?
- Should the referees of such works also be made aware of the metadata records? It is of course enough of a challenge to get referees to inspect data (whether as SI or as FAIR), never mind metadata! Put another way, should metadata records be considered as part of the materials reviewed by referees, or something independent of referees and the responsibility of their publishers?
- More generally, how would/should the peer-review system respond to living metadata records? Should there be guidelines regarding such records? Or ethical considerations?
I pose these questions because I am not aware of much discussion around these topics; I suggest there probably should be!
‡Actually CrossRef and DataCite exchange each other’s metadata. However, each uses a somewhat different schema, so some components may be lost in transit.
†JSON, which is not particularly human friendly.