UUID stands for Universally Unique Identifier. It is a standard way to generate unique IDs for elements in software systems. The latest versions of the UUID almost guarantee uniqueness. As a colleague put it:
"It is more probable that both our houses will be struck by lightning at the same time than to have repeated UUIDs"
A typical way to generate UUIDs in Python is to use the standard library package uuid. The usage is very simple:

    import uuid

    my_id = str(uuid.uuid4())
    print(my_id)

Quite straightforward.
However, there are cases where you want to generate the same UUIDs multiple times. The reasons can be:
- Ease of creating tests for functions that rely on UUIDs.
- Reproducibility of data pipelines.
- Create collisions "on purpose" when you want to deduplicate entities based on specific fields.
Working in data and ML, I have recently run into the first two reasons more than once. The third one came up during a data extraction for news data, where I wanted to make sure the same article (date + title + article body) always gets the same UUID.
I searched the web for how to do this, and although it is simple, it is not as straightforward as just generating a UUID4. I found 2 approaches that worked quite well.
Update Jan 31, 2024: I have added a third way of doing this using UUID5 instead of UUID4.
1. Time + Namespace
The first approach for me was to make the UUID dependent on two things.
First, the execution date time: this should be passed as a parameter to the generator. When orchestrating data pipelines with Airflow and other data orchestration tools, it is very common to have the execution date as a parameter. With this, you can ensure the reproducibility of past executions.
However, the timestamp alone can lead to unintended collisions with other pipelines, so it is important to include another parameter in the equation. That parameter is a namespace, which can be virtually anything. For my use case, we used the name of the pipeline in which the code runs as the namespace.
I created a class for this, IDGenerator. We used the execution time and the namespace to generate a seed for Python's random module. Then, the UUID class accepts a 128-bit integer to generate the UUID deterministically. The code looks something like this:

    import random
    import uuid
    from datetime import datetime


    class IDGeneratorV1:
        def __init__(self) -> None:
            # Private generator, independent of the module-level random state.
            self.random: random.Random = random.Random()

        def get(self, namespace: str, execution_date: datetime) -> str:
            seed = self._create_seed(namespace, execution_date)
            self.random.seed(seed)
            # version=4 overwrites the version and variant bits of the 128-bit integer.
            return uuid.UUID(version=4, int=self.random.getrandbits(128)).hex

        def _create_seed(self, namespace: str, execution_date: datetime) -> int:
            # Concatenate the namespace's UTF-8 byte values with the Unix timestamp in seconds.
            namespace_part: str = "".join([str(n) for n in list(namespace.encode("utf8"))])
            timestamp_part: str = str(int(execution_date.timestamp()))
            seed: int = int(namespace_part + timestamp_part)
            return seed

This approach is simple enough to help you write tests for your code, and to generate the same IDs in pipelines in which you know that the number of elements doesn't change between executions. However, it is blind to the actual contents of the data, so it will assign different UUIDs to duplicate entries if there are any. And if a data correction adds or removes elements at some point, the IDs of those elements will change, which is not ideal.
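As a rough sketch of how this helps in a test: the snippet below uses seeded_uuid4, a hypothetical function-style variant of IDGeneratorV1 (it builds a fresh random.Random per call instead of re-seeding a shared one). The pipeline names and the date are made up for illustration.

```python
import random
import uuid
from datetime import datetime, timezone


def seeded_uuid4(namespace: str, execution_date: datetime) -> str:
    """Hypothetical helper: same seed recipe as IDGeneratorV1, as a plain function."""
    namespace_part = "".join(str(b) for b in namespace.encode("utf8"))
    timestamp_part = str(int(execution_date.timestamp()))
    rng = random.Random(int(namespace_part + timestamp_part))
    return uuid.UUID(version=4, int=rng.getrandbits(128)).hex


# Illustrative execution date; an orchestrator would normally provide it.
run_date = datetime(2024, 1, 31, tzinfo=timezone.utc)

first = seeded_uuid4("news_pipeline", run_date)
second = seeded_uuid4("news_pipeline", run_date)
other = seeded_uuid4("sports_pipeline", run_date)

print(first == second)  # same namespace and date
print(first == other)   # different namespace
```

The same namespace and execution date reproduce the same ID, so a test can pin it, while changing the namespace changes the seed and therefore the ID.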
2. Content-based
To overcome the drawbacks of the previous approach, I implemented a second version. This one is more general than the previous one, taking free text as the seed for the random module. As the text can be arbitrarily long, we first hash the text and then use the hash to form the seed.

    import hashlib
    import random
    import uuid


    class IDGeneratorV2:
        def __init__(self) -> None:
            self.random: random.Random = random.Random()

        def get(self, text: str) -> str:
            # Hash the text so that inputs of any length map to a fixed-size seed.
            h = hashlib.sha224(text.encode("utf-8"))
            self.random.seed(int(h.hexdigest(), 16))
            return uuid.UUID(version=4, int=self.random.getrandbits(128)).hex

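To connect this with the deduplication use case from the introduction, here is a minimal sketch. content_uuid is a hypothetical function-style version of the hash-then-seed idea, and the sample articles and the separator character are assumptions made for illustration.

```python
import hashlib
import random
import uuid


def content_uuid(text: str) -> str:
    """Hypothetical helper mirroring the hash-then-seed idea."""
    digest = hashlib.sha224(text.encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    return uuid.UUID(version=4, int=rng.getrandbits(128)).hex


# Made-up sample records; the second one is an exact duplicate of the first.
articles = [
    {"date": "2024-01-30", "title": "Rates hold", "body": "The bank kept rates unchanged."},
    {"date": "2024-01-30", "title": "Rates hold", "body": "The bank kept rates unchanged."},
    {"date": "2024-01-31", "title": "Rates hold", "body": "The bank kept rates unchanged."},
]

unique = {}
for article in articles:
    # A separator avoids ambiguous joins such as "ab" + "c" versus "a" + "bc".
    key = "\x1f".join([article["date"], article["title"], article["body"]])
    # Keep only the first article seen for each content-derived ID.
    unique.setdefault(content_uuid(key), article)

print(len(unique))  # 2: the exact duplicate collapses into one entry
```

Which fields go into the key decides what counts as a duplicate, so that choice deserves more care than the hashing itself.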
3. (Update) Use UUID version 5
There is a version of UUID that is actually intended to work in a deterministic way. While UUID 4 is essentially random bits (122 of its 128 bits are random; the rest encode the version and variant), UUID 5 is derived from a SHA-1 hash of its inputs. It needs two inputs:
- namespace → another UUID.
- name → any arbitrary string.
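A minimal sketch of those two inputs, using uuid.uuid5 from the standard library. uuid.NAMESPACE_DNS is one of the predefined namespaces that ships with the module; the domain, pipeline name, and article key below are made up for illustration. The result of one call is also used as the namespace of the next one.

```python
import uuid

# "pipelines.example.com", "news_pipeline" and the article key are illustrative names.
root = uuid.uuid5(uuid.NAMESPACE_DNS, "pipelines.example.com")
pipeline_ns = uuid.uuid5(root, "news_pipeline")
article_id = uuid.uuid5(pipeline_ns, "2024-01-31|Some title")

# The same namespace and name always give the same UUID.
same_again = uuid.uuid5(pipeline_ns, "2024-01-31|Some title")
# The same name under a different namespace gives a different UUID.
under_root = uuid.uuid5(root, "2024-01-31|Some title")

print(article_id == same_again)
print(article_id == under_root)
print(article_id.version)
```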
The idea of using another UUID as a namespace is to be able to define hierarchical relationships between UUIDs.

In practice, we can implement an IDGenerator that is instantiated per namespace and has a get method that receives a name:

    import uuid


    class IDGeneratorV3:
        # Default namespace when no root UUID is given.
        null_uuid: uuid.UUID = uuid.UUID("00000000-0000-0000-0000-000000000000")

        def __init__(self, root_uuid: str | uuid.UUID | None = None) -> None:
            self.root_uuid: uuid.UUID = self.null_uuid
            if root_uuid is not None:
                # Accept the parent UUID either as a string or as a UUID object.
                if isinstance(root_uuid, str):
                    self.root_uuid = uuid.UUID(root_uuid)
                else:
                    self.root_uuid = root_uuid

        def get(self, name: str) -> str:
            return uuid.uuid5(namespace=self.root_uuid, name=name).hex

In this implementation, the __init__ method has some logic to optionally receive the parent UUID, or fall back to the default null UUID.

The advantages of using this method are:
- Take advantage of an already implemented standard (don’t reinvent the wheel)
- Performance increase

    In [1]: from datetime import datetime
       ...: from id_generator import IDGeneratorV1, IDGeneratorV2, IDGeneratorV3

    In [2]: id_generator_v1 = IDGeneratorV1()
       ...: id_generator_v2 = IDGeneratorV2()
       ...: id_generator_v3 = IDGeneratorV3()

    In [3]: %timeit id_generator_v1.get("my-name-string", datetime.now())
    11.4 µs ± 40.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

    In [4]: %timeit id_generator_v2.get("my-name-string")
    9.42 µs ± 451 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

    In [5]: %timeit id_generator_v3.get("my-name-string")
    2.3 µs ± 9.54 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

We can see that this last version is the fastest of the three. Here the simplicity and usage of standards pays off!
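If you want to check the comparison outside IPython, a rough sketch with the standard timeit module could look like this. The two functions are hypothetical, simplified stand-ins for the second and third versions (the first one creates a new random.Random on every call, which is not exactly what the class does), and the numbers will differ from machine to machine.

```python
import hashlib
import random
import timeit
import uuid


def seeded_from_text(text: str) -> str:
    # Hash-then-seed approach (second version), written as a function.
    digest = hashlib.sha224(text.encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    return uuid.UUID(version=4, int=rng.getrandbits(128)).hex


def uuid5_from_text(text: str) -> str:
    # UUID 5 approach (third version) with the null UUID as namespace.
    return uuid.uuid5(uuid.UUID(int=0), text).hex


runs = 10_000
timings = {}
for func in (seeded_from_text, uuid5_from_text):
    seconds = timeit.timeit(lambda: func("my-name-string"), number=runs)
    timings[func.__name__] = seconds
    print(f"{func.__name__}: {seconds / runs * 1e6:.2f} µs per call")
```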
Bonus: Random() objects
You might have noticed that in the first two implementations we create a random.Random() object as an instance attribute. This is a feature that I didn't know existed until recently.

The module-level functions of Python's random module all share a single hidden Random instance, so the module behaves like a singleton. But we can create separate Random objects. This is great if you have multiple IDGenerator objects in different places in your code and you don't want the seed to be overwritten.
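A small sketch of that isolation, using only the standard library. The seed values are arbitrary; the point is that re-seeding or advancing the module-level generator does not affect a separate Random object.

```python
import random

# Reference value from a private generator with a fixed seed.
expected = random.Random(42).getrandbits(128)

private_rng = random.Random(42)
random.seed(0)           # re-seed the shared, module-level generator
random.getrandbits(128)  # and advance it
result = private_rng.getrandbits(128)

print(result == expected)  # the private generator is unaffected
```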