Generating Deterministic UUIDs in Python

UUID stands for Universally Unique Identifier. It is a standard way to generate unique IDs for elements in software systems. The latest versions of the UUID standard practically guarantee uniqueness. As a colleague put it:

"It is more probable that both our houses will be struck by lightning at the same time than to have repeated UUIDs"

A typical way to generate UUIDs in Python is to use the uuid package from the standard library. The usage is very simple:


import uuid

my_id = str(uuid.uuid4())

print(my_id)

Quite straightforward.

However, there are cases in which you want to generate the same UUIDs multiple times. The reasons can be:

1. Making it easier to write tests for functions that rely on UUIDs.

2. Reproducibility of data pipelines.

3. Creating collisions "on purpose" when you want to deduplicate entities based on specific fields.

Working in Data and ML, I have recently run into reasons 1 and 2 more than once. Reason 3 I encountered only recently, during a data extraction for news data in which I wanted to make sure that the same article (date + title + article body) is always assigned the same UUID.

I searched the web for how to do this and, although it is also simple, it is not as straightforward as just generating a UUID4. I found 2 approaches that worked quite well.

Update Jan 31, 2024: I have added a third way of doing this using UUID5 instead of UUID4.

1. Time + Namespace

The first approach for me was to make the UUID dependent on two things.

First, the execution date and time: this should be passed as a parameter to the generator. When orchestrating data pipelines with Airflow and other data orchestration tools, it is very common to have the execution date as a parameter. With this, you can ensure the reproducibility of past executions.

However, a timestamp alone can lead to unintended collisions with other pipelines, so it is important to include another parameter in the equation. That parameter is a namespace, which can be virtually anything. For my use case, we used the name of the pipeline in which the code runs as the namespace.

I created an IDGenerator class for this. It uses the execution time and the namespace to build a seed for Python's random module and draws 128 random bits from it. The UUID class accepts a 128-bit integer, so the resulting UUID is deterministic.

The code looks something like this:


import random
import uuid

from datetime import datetime


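class IDGeneratorV1:
    """Derives a UUID from a namespace and an execution date."""

    def __init__(self) -> None:
        # Private Random instance, so seeding it does not touch the global random module.
        self.random: random.Random = random.Random()

    def get(self, namespace: str, execution_date: datetime) -> str:
        # Reseed on every call: the same inputs always produce the same UUID.
        seed = self._create_seed(namespace, execution_date)
        self.random.seed(seed)
        # version=4 overwrites the version and variant bits of the 128-bit integer.
        return uuid.UUID(version=4, int=self.random.getrandbits(128)).hex

    def _create_seed(self, namespace: str, execution_date: datetime) -> int:
        # Concatenate the decimal value of each UTF-8 byte of the namespace...
        namespace_part: str = "".join(str(b) for b in namespace.encode("utf-8"))
        # ...with the Unix timestamp in whole seconds. For a naive datetime,
        # timestamp() assumes local time, so timezone-aware dates are safer.
        timestamp_part: str = str(int(execution_date.timestamp()))
        return int(namespace_part + timestamp_part)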

This approach is simple enough to help you write tests for your code and to generate the same IDs in pipelines where you know that the number of elements doesn't change between executions. However, it is blind to the actual contents of the data, so it will assign different UUIDs to duplicate entries if there are any. And if a data correction adds or removes elements at some point, the IDs assigned to the remaining elements can change, which is not ideal.
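Note that get in IDGeneratorV1 reseeds on every call, so repeated calls with the same namespace and execution date return the same ID. The behaviour described above (one ID per element, stable as long as the number and order of elements stay the same) fits a generator that is seeded once per run. A minimal sketch of such a variant, assuming a hypothetical SequentialIDGenerator class and a simplified string seed (neither is part of the original code), could look like this:

```python
import random
import uuid
from datetime import datetime, timezone


class SequentialIDGenerator:
    """Hypothetical variant: seed once per run, then draw one ID per element."""

    def __init__(self, namespace: str, execution_date: datetime) -> None:
        # Assumption: a plain string seed is good enough for this sketch.
        # random.Random converts str seeds to an int deterministically.
        seed = f"{namespace}|{int(execution_date.timestamp())}"
        self._random = random.Random(seed)

    def next_id(self) -> str:
        # Each call advances the generator, so the n-th call in a run
        # always returns the same ID for the same seed.
        return uuid.UUID(version=4, int=self._random.getrandbits(128)).hex


# Made-up pipeline name and date, timezone-aware to avoid local-time surprises.
run_date = datetime(2024, 1, 31, tzinfo=timezone.utc)
first_run = SequentialIDGenerator("news-pipeline", run_date)
second_run = SequentialIDGenerator("news-pipeline", run_date)

ids_first = [first_run.next_id() for _ in range(3)]
ids_second = [second_run.next_id() for _ in range(3)]

# Both runs produce the same sequence of three distinct IDs.
print(ids_first == ids_second)
```

With this variant the ID of an element depends only on its position in the run, which is exactly why adding or removing an element shifts the IDs of everything after it.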

2. Content-based

To overcome the drawbacks of the previous approach, I implemented a second version. This one is more general than the previous one: it takes free text as the seed for the random module. As the text can be arbitrarily long, we first hash it and then use the hash to form the seed.


import hashlib
import random
import uuid


class IDGeneratorV2:
    def __init__(self) -> None:
        self.random: random.Random = random.Random()

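    def get(self, text: str) -> str:
        # Hash the text so inputs of any length map to a fixed-size integer seed.
        h = hashlib.sha224(text.encode("utf-8"))
        self.random.seed(int(h.hexdigest(), 16))
        # The same text always yields the same seed, and therefore the same UUID.
        return uuid.UUID(version=4, int=self.random.getrandbits(128)).hex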
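To connect this with the news deduplication case mentioned earlier, here is a minimal sketch of how the content-based idea could be applied to an article. The article_id helper, the field separator, and the sample values are assumptions made for illustration, not part of the original pipeline:

```python
import hashlib
import random
import uuid


def article_id(date: str, title: str, body: str) -> str:
    """Hypothetical helper: build an ID from the fields that define a duplicate."""
    # Assumption: the ASCII unit separator does not occur in the fields, so
    # joining with it keeps ("ab", "c") and ("a", "bc") from colliding.
    text = "\x1f".join([date, title, body])
    digest = hashlib.sha224(text.encode("utf-8")).hexdigest()
    # A fresh Random per call keeps the helper free of shared state.
    rng = random.Random(int(digest, 16))
    return uuid.UUID(version=4, int=rng.getrandbits(128)).hex


a = article_id("2024-01-31", "Some title", "Some body text")
b = article_id("2024-01-31", "Some title", "Some body text")
c = article_id("2024-02-01", "Some title", "Some body text")

print(a == b)  # same fields, same ID
print(a == c)  # different date, different ID
```

Any normalisation of the fields (whitespace, casing, date format) would have to happen before hashing, since a one-character difference gives a completely different ID.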

3. (Update) Use UUID version 5

There is a UUID version that is designed to be deterministic. While UUID 4 is essentially random bits (122 of its 128 bits are random, the rest encode the version and variant), UUID 5 is derived from a SHA-1 hash of its input data. It needs two inputs:

namespace → another UUID.

name → any arbitrary string.

The idea of using another UUID as a namespace is to be able to define hierarchy relationships between UUIDs.
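As a rough illustration of such a hierarchy, the sketch below derives a child namespace from a parent and then derives IDs inside it. It only uses uuid.uuid5 and the predefined uuid.NAMESPACE_DNS from the standard library; the domain, pipeline names, and article key are made up:

```python
import uuid

# Assumption: all names below are invented for the example.
company_ns = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

# Each pipeline gets its own namespace derived from the company namespace.
news_ns = uuid.uuid5(company_ns, "news-pipeline")
other_ns = uuid.uuid5(company_ns, "other-pipeline")

# The same name gives different UUIDs under different namespaces...
news_article = uuid.uuid5(news_ns, "2024-01-31|Some title")
other_article = uuid.uuid5(other_ns, "2024-01-31|Some title")

# ...and the same UUID whenever the namespace and name are repeated.
news_article_again = uuid.uuid5(news_ns, "2024-01-31|Some title")

print(news_article == news_article_again)
print(news_article == other_article)
```

Because every level is itself a UUID 5, the whole tree can be rebuilt from the root namespace and the names alone.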

In practice, we can implement an IDGenerator that is instantiated per namespace, with a get method that receives a name:


import uuid


class IDGeneratorV3:

    null_uuid: uuid.UUID = uuid.UUID("00000000-0000-0000-0000-000000000000")

    def __init__(self, root_uuid: str | uuid.UUID | None = None) -> None:
        self.root_uuid: uuid.UUID = self.null_uuid
        if root_uuid is not None:
            if isinstance(root_uuid, str):
                self.root_uuid = uuid.UUID(root_uuid)
            else:
                self.root_uuid = root_uuid

    def get(self, name: str) -> str:
        return uuid.uuid5(namespace=self.root_uuid, name=name).hex

In this implementation, the __init__ method optionally receives the parent UUID, as a string or a UUID object, and falls back to the null UUID by default.

The advantages of using this method are:

Take advantage of an already implemented standard (don’t reinvent the wheel)

Performance increase


In [1]: from id_generator import IDGeneratorV1, IDGeneratorV2, IDGeneratorV3

In [2]: id_generator_v1 = IDGeneratorV1()
   ...: id_generator_v2 = IDGeneratorV2()
   ...: id_generator_v3 = IDGeneratorV3()

In [3]: %timeit id_generator_v1.get("my-name-string", datetime.now())
11.4 µs ± 40.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit id_generator_v2.get("my-name-string")
9.42 µs ± 451 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit id_generator_v3.get("my-name-string")
2.3 µs ± 9.54 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

We can see that this last version is the fastest of the three. Here, simplicity and the use of standards pay off!
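If you want to repeat a comparison like this outside IPython, the standard timeit module can be used directly. The following is only a sketch: the two helper functions are simplified stand-ins for the content-based and UUID 5 generators, the null UUID namespace mirrors the default used above, and the absolute numbers will differ from machine to machine:

```python
import hashlib
import random
import timeit
import uuid

_rng = random.Random()
_null_ns = uuid.UUID(int=0)  # same value as the all-zero UUID string


def seeded_uuid4(text: str) -> str:
    # Stand-in for the content-based approach: hash, seed, draw 128 bits.
    digest = hashlib.sha224(text.encode("utf-8")).hexdigest()
    _rng.seed(int(digest, 16))
    return uuid.UUID(version=4, int=_rng.getrandbits(128)).hex


def name_based_uuid5(text: str) -> str:
    # Stand-in for the UUID 5 approach.
    return uuid.uuid5(_null_ns, text).hex


n = 10_000  # number of calls per measurement
seeded_seconds = timeit.timeit(lambda: seeded_uuid4("my-name-string"), number=n)
uuid5_seconds = timeit.timeit(lambda: name_based_uuid5("my-name-string"), number=n)

print(f"seeded uuid4: {seeded_seconds / n * 1e6:.2f} µs per call")
print(f"uuid5:        {uuid5_seconds / n * 1e6:.2f} µs per call")
```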

Bonus: `Random()` objects

You might have noticed that in the first two implementations we create a random.Random() object as an instance attribute. This is a feature that I didn't know existed until recently.

The functions of the random module share a single hidden Random instance, so the module behaves like a singleton. But we can also create separate Random objects, each with its own state. This is great if you have multiple IDGenerator objects in different places in your code and you don't want one of them to overwrite the seed of another.
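A small sketch of that isolation, using only the standard random module (the seed values are arbitrary):

```python
import random

# Two independent generators with the same seed.
gen_a = random.Random(42)
gen_b = random.Random(42)

first_a = gen_a.getrandbits(128)

# Reseeding and using the module-level generator in between...
random.seed(0)
random.random()

# ...does not disturb gen_b, which still starts from its own seed.
first_b = gen_b.getrandbits(128)

print(first_a == first_b)
```

Had both generators relied on random.seed and the module-level functions instead, the call in the middle would have changed the value that the second one produces.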
