
Citable identifiers

katjabercic edited this page May 16, 2018 · 1 revision

A citable unique identifier is obtained by taking the first 12 hexadecimal digits of the hash, splitting them into groups of 4 characters, and prepending a letter Z:

123456789abcdef... -> Z1234-5678-9abc

Optionally, this may be followed by one or more letters denoting the type of the object. Finally, the name of the algorithm used to produce the identifier may be appended after a colon at the end. For example, we can obtain two different identifiers for the Petersen graph from two different canonical representations:

Zc74c-6028-a25a:bliss and ZGca5e-bcae-4138.

The first identifier specifies that it has been derived from the canonical representation of the graph computed using Bliss. The letter G in the second identifier denotes that it represents a graph; the algorithm used to obtain it, however, is not specified.
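The construction above can be sketched in a few lines of Python. This is only an illustration, not the project's actual implementation: the function name `citable_id` and its parameters `type_letters` and `algorithm` are hypothetical, and the input is assumed to be a hexadecimal digest that has already been computed from some canonical representation.

```python
def citable_id(hex_digest, type_letters="", algorithm=None):
    """Build a citable identifier from a hexadecimal hash digest.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Take the first 12 hexadecimal digits of the hash.
    digits = hex_digest[:12].lower()
    if len(digits) < 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("need at least 12 leading hexadecimal digits")

    # Split them into three groups of 4 characters.
    groups = "-".join(digits[i:i + 4] for i in range(0, 12, 4))

    # Prepend the letter Z, optionally followed by the type letters.
    identifier = "Z" + type_letters + groups

    # Optionally append the algorithm name after a colon.
    if algorithm:
        identifier += ":" + algorithm
    return identifier


print(citable_id("123456789abcdef"))                    # Z1234-5678-9abc
print(citable_id("c74c6028a25a", algorithm="bliss"))    # Zc74c-6028-a25a:bliss
print(citable_id("ca5ebcae4138", type_letters="G"))     # ZGca5e-bcae-4138
```

The last two calls reproduce the two Petersen graph identifiers above, using only the 12 digits that appear in them; the remaining digits of the underlying hashes are not given here.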

By shortening the hash from its original 64 hexadecimal characters to 12, the number of possible identifiers drops considerably, yet it still remains at 2^48 ≈ 2.8 × 10^14. Due to the birthday paradox, a collision (two objects with the same identifier) is expected to occur with probability 0.5 once the number of identifiers in the database grows to about 20 million, which could conceivably happen in the not-so-distant future. However, we do not see this as a problem: if a collision occurs, the affected identifiers may be extended with additional characters from the hash for disambiguation. Note that the Git version control system takes the same approach: although objects are identified by 40-character hashes, they are usually referred to by just their first 7 characters.
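The 20 million figure can be checked with the usual birthday-bound approximation, p ≈ 1 − exp(−n(n−1) / 2N), where N = 2^48 is the number of possible identifiers and n is the number of objects. The sketch below assumes that identifiers behave like uniformly random 48-bit values; the function names are hypothetical.

```python
import math


def collision_probability(n, bits=48):
    """Approximate probability that n random identifiers contain a collision."""
    space = 2 ** bits
    # Birthday approximation: p ~ 1 - exp(-n(n-1) / (2N)).
    return -math.expm1(-n * (n - 1) / (2 * space))


def objects_for_probability(p, bits=48):
    """Approximate number of objects at which the collision probability reaches p."""
    space = 2 ** bits
    # Invert the approximation: n ~ sqrt(2N * ln(1 / (1 - p))).
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


print(2 ** 48)                                  # 281474976710656, about 2.8e14
print(round(objects_for_probability(0.5)))      # a little under 20 million
print(collision_probability(20_000_000))        # slightly above 0.5
```

Under the same assumption, each additional hexadecimal character multiplies N by 16 and therefore multiplies the 50% threshold by 4, which is why extending an identifier by a few characters is enough to resolve collisions.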
