
Identifiers in Research

In research, it is important to reference research objects such as datasets, organisms, publications, software, or chemicals precisely. We want everyone participating in research activities (for example through publications, datasets, or collaborations) to agree on exactly which research object is being referenced.

For example, to identify a publication correctly, we include bibliographic information in the citation so that everyone knows exactly which publication is meant. Unique references are so important that international committees have been created to standardize naming. Organisms are given unique, globally recognized names (e.g. Escherichia coli, Gallus gallus) and chemicals have formal names derived from explicit naming rules (e.g. IUPAC nomenclature).

Persistent Identifiers or PIDs

With digitization, scientists have continued to use unique and global identifiers to correctly and persistently identify research objects. We call these kinds of digital identifiers PIDs (persistent identifiers).

Common PIDs in research include DOIs (digital object identifiers) and ORCID iDs (PID yourself and get an ORCID iD!). At the University of Guelph, you can deposit datasets and other research objects (including schemas and OCA bundles generated by the Semantic Engine) in the Agri-Environmental Research Data Repository. When you deposit your data or other research object in this repository, your submission is given a DOI, a unique PID that you can use when referencing it.

Self-Addressing Identifiers

Another type of PID is the Self-Addressing Identifier (SAID), which uses hashing to create an identifier (also called a digest) that is calculated directly from the content of the thing being identified. You can think of digests as unique fingerprints derived directly from any type of digital object. Most familiar PIDs (such as DOIs) are assigned by a central authority. In contrast, SAIDs are not assigned by anyone; they are calculated directly from the content.

When you create a digest value, you take a digital object, plug it into a one-way formula (the hashing function), and generate another value: the digest. A hash function always gives the same digest for the same content, and if you change the digital object in any small way, even by a single character or space, the resulting digest will be completely different.
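The two properties described above (same content gives the same digest, a tiny change gives a different digest) can be seen in a minimal sketch. It assumes SHA-256 from Python's standard library purely for illustration; this is not necessarily the hash function used to produce OCA SAIDs, and the sample strings are made up.

```python
import hashlib


def digest(content: str) -> str:
    """Return the SHA-256 digest of the text as a hexadecimal string."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


original = "temperature,celsius"
same = "temperature,celsius"
changed = "temperature, celsius"  # differs by a single space

print(digest(original))
print(digest(changed))

# Identical content always produces an identical digest.
print(digest(original) == digest(same))
# A one-character change produces a different digest.
print(digest(original) == digest(changed))
```

The last two lines print True and then False. The digest has the same length whatever the size of the input, which is part of why it works well as a compact fingerprint.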

One important characteristic of hash functions is that they do not work in reverse. If you are given only a digest, you cannot work out what the original content was. This means that if your digital object contains sensitive information, that information cannot be recovered from the digest.

For the OCA schema, all parts of the schema and the schema bundle itself are given SAIDs. Each component of the schema bundle is hashed and each digest starts with an ‘E’. You can discover the digest by looking at each JSON filename of the OCA schema, by looking at the JSON schema bundle, or you can calculate it directly using the SAID hashing function.

OCA schema JSON bundle showing hashes for overlays and bundle

Benefits of using SAIDs for Research Objects

Self-Addressing Identifiers are very useful for tracking digital resources and they can be considered digital fingerprints. If you find two schemas in two different locations (perhaps one is a published standard and the other is published with a dataset), you can compare the SAIDs of each schema (i.e. digital fingerprints) and if the schema SAIDs are the same, then the schemas are identical.

Alternatively, if you only have the SAID of a schema, you can find the corresponding schema (or schema part) by searching for a schema that carries the identical SAID. If the SAIDs match, the documents are identical.

You can confirm that the SAID is legitimate by performing your own SAID hashing of the schema. By comparing the freshly generated SAID to the claimed SAID you can check if the document or SAID has been altered.
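The check described above can be sketched in a few lines. This is a simplified illustration of the self-addressing idea, not the actual SAID algorithm: it assumes SHA-256 and plain URL-safe Base64 from Python's standard library, sorts the JSON keys to get a stable serialization, and stores the identifier in a field named "d". The function names, the field name, and the sample schema are all hypothetical, and the identifiers it produces will not match real OCA SAIDs.

```python
import base64
import hashlib
import json

# Dummy value that fills the identifier field while the digest is computed.
# 44 characters is the length of a Base64-encoded 32-byte digest.
PLACEHOLDER = "#" * 44


def _serialize(doc: dict) -> bytes:
    # Stable serialization so the same content always yields the same bytes.
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def compute_id(doc: dict, field: str = "d") -> str:
    """Digest of the document with its identifier field set to the placeholder."""
    candidate = dict(doc)
    candidate[field] = PLACEHOLDER
    raw = hashlib.sha256(_serialize(candidate)).digest()
    return base64.urlsafe_b64encode(raw).decode("ascii")


def verify(doc: dict, field: str = "d") -> bool:
    """True if the embedded identifier matches a freshly computed one."""
    return doc.get(field) == compute_id(doc, field)


# Hypothetical schema fragment, used only for this example.
schema = {"d": "", "type": "example/schema", "attributes": {"age": "Numeric"}}
schema["d"] = compute_id(schema)
print(verify(schema))  # identifier matches the content

tampered = dict(schema, attributes={"age": "Text"})
print(verify(tampered))  # content changed, identifier no longer matches
```

The first check prints True and the second prints False. Because the identifier sits inside the document it identifies, the field is replaced with a fixed placeholder before hashing; otherwise the identifier would depend on itself.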