In my previous post on the topic, I introduced the idea that data can come in several forms, most commonly as “raw” or primary data and as a “processed” version of that data with added value. In crystallography, the chemist is usually interested in the processed version, carried by a CIF file. However, on the rare occasions when a query arises about the processed component, it can, in principle at least, be resolved by looking at the original raw data, expressed as diffraction images. I established, with much-appreciated help from CCDC, that since 2016 around 65 datasets in the CSD (Cambridge Structural Database) have appeared with such associated raw data. The problem is how to reconcile the two sets of data easily (the raw data is not stored in the CSD), and one way of doing this is via the metadata associated with the datasets. In turn, if this metadata is suitably registered, one can query the metadata store for such associations, as was illustrated in the previous post. Here I explore the metadata records for five of these 65 sets, selected to illustrate the five data repositories that thus far host such data for compounds in the CSD.
Raw data repository | Raw data DOI | Raw data → CSD? | CSD → Raw data? | ⇐ Journal ⇒ |
---|---|---|---|---|
Zenodo | 10.5281/zenodo.4271549 | No | No | 10.1039/C6RA28567H |
Imperial College research data repository | 10.14469/hpc/2298 | Yes | Yes | 10.1021/acsomega.7b00482 |
RepoD, a Harvard Dataverse instance | 10.18150/repod.6628285 | No | No | 10.1021/acs.cgd.0c01252 |
Cambridge University repository | 10.17863/CAM.21968 | No | No | 10.1016/j.inoche.2018.08.024 |
ISIS Neutron and Muon Source Data Journal | 10.5286/ISIS.E.RB1620465 | No | No | 10.1039/D0CC02418J |
Ideally, one is looking for bidirectional links between the raw and the processed data, expressed in the metadata of both records. As you can see from the above, these links are present in only one of the five sets. More commonly, both the raw and the processed data contain links to the journal article where the data is discussed. Links from the journal article to the raw data are very much less common, although links from the journal to the processed data are slightly more likely to exist. If you click on the link in any of the last three columns, a copy of the metadata will download for you to inspect. There you can verify whether the assertions made above are correct.
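To make the idea of a bidirectional link concrete, here is a minimal sketch of how such a check might be done once the metadata records are in hand. It assumes the records follow the DataCite convention of a `relatedIdentifiers` list whose entries carry `relatedIdentifier`, `relatedIdentifierType` and `relationType` fields. The DOIs below are invented placeholders, not the real records from the table, and the helper names `linked_dois` and `link_status` are purely illustrative.

```python
# Hypothetical placeholder DOIs; these are NOT real identifiers.
RAW = "10.0000/raw-example"          # raw diffraction images
CSD = "10.0000/csd-example"          # processed (CIF) entry
ARTICLE = "10.0000/article-example"  # journal article

# Simplified, invented metadata records keyed by DOI. Only the field names
# are taken from the DataCite schema; the content is made up for illustration.
records = {
    RAW: {"relatedIdentifiers": [
        {"relatedIdentifier": CSD, "relatedIdentifierType": "DOI",
         "relationType": "IsSourceOf"},
        {"relatedIdentifier": ARTICLE, "relatedIdentifierType": "DOI",
         "relationType": "IsSupplementTo"},
    ]},
    CSD: {"relatedIdentifiers": [
        {"relatedIdentifier": RAW, "relatedIdentifierType": "DOI",
         "relationType": "IsDerivedFrom"},
        {"relatedIdentifier": ARTICLE, "relatedIdentifierType": "DOI",
         "relationType": "IsSupplementTo"},
    ]},
    ARTICLE: {"relatedIdentifiers": []},  # the article points at neither dataset
}


def linked_dois(record):
    """Return the set of DOIs (lower-cased) that a record declares links to."""
    return {
        item["relatedIdentifier"].lower()
        for item in record.get("relatedIdentifiers", [])
        if item.get("relatedIdentifierType") == "DOI"
    }


def link_status(records, a, b):
    """Return (a links to b, b links to a) as a pair of booleans."""
    a_to_b = b.lower() in linked_dois(records.get(a, {}))
    b_to_a = a.lower() in linked_dois(records.get(b, {}))
    return a_to_b, b_to_a


for a, b in [(RAW, CSD), (RAW, ARTICLE), (CSD, ARTICLE)]:
    print(a, b, link_status(records, a, b))
```

With these invented records, only the raw/CSD pair is linked in both directions; the two datasets point to the article, but the article does not point back. DOIs are compared in lower case because they are case-insensitive.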
What the metadata records above demonstrate is a very small-scale example of a so-called PID graph (DOI: 10.5438/jwvf-8a66), where each DOI is a node in the graph and a connection, if one exists, is shown by a line joining the nodes. The PID graph can be extended to include a third type of node, the journal article, and then it starts to get interesting! I will investigate whether I can generate the PID graph for the above, although be prepared: it will not (yet) contain very many lines between nodes!
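As a rough illustration of what such a graph looks like as a data structure, the sketch below builds a tiny directed graph from (source, relation, target) triples and then picks out the pairs of nodes that are joined in both directions. The DOIs are again invented placeholders and the function names `build_pid_graph` and `bidirectional_pairs` are hypothetical; this is one simple way of representing the idea, not the way any particular PID graph service implements it.

```python
# Hypothetical placeholder DOIs; these are NOT real identifiers.
RAW = "10.0000/raw-example"
CSD = "10.0000/csd-example"
ARTICLE = "10.0000/article-example"

# Invented (source DOI, relation type, target DOI) triples, of the kind that
# could be read out of the relatedIdentifiers section of each metadata record.
links = [
    (RAW, "IsSourceOf", CSD),
    (CSD, "IsDerivedFrom", RAW),
    (RAW, "IsSupplementTo", ARTICLE),
    (CSD, "IsSupplementTo", ARTICLE),
]


def build_pid_graph(links):
    """Build a directed graph as a dict mapping each DOI to the DOIs it points to."""
    graph = {}
    for source, _relation, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # register targets that declare no links
    return graph


def bidirectional_pairs(graph):
    """Return the sorted list of node pairs connected in both directions."""
    pairs = set()
    for node, targets in graph.items():
        for target in targets:
            if node in graph.get(target, set()):
                pairs.add(tuple(sorted((node, target))))
    return sorted(pairs)


graph = build_pid_graph(links)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
print("bidirectional:", bidirectional_pairs(graph))
```

In this toy graph the article node has incoming lines only, which mirrors the situation described above where the datasets cite the article but not the other way round.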
The news that some 992 crystal structures are being investigated with “expressions of concern” reinforces the need for crystal structures to come with a complete set of experimental data, and not just the refined results. See also this preprint and the CCDC notice.
The IUCr (International Union of Crystallography) is now well on board with FAIR crystallographic data, including “complete” or raw versions of it; see https://doi.org/10.1107/S2414314622008215.