Visit all the articles in the series of blogs post about Best Practices for Data Developers
The work of data engineers and data scientist involves writing a lot of code. As this code grows, the cost of maintaining it increases exponentially. The software industry has developed patterns and best practices for decades to overcome or mitigate these costs. As the Data Developer role is very new, we must find a way to adopt these patterns in our daily development work, with the necessary adaptations for our use case.
In this series, I wanted to give a brief and fundamental approach to some of the principles that can help us develop better code. For this chapter, we will focus on how to use the Factory pattern in our code.
Polymorphism is powerful
One of the best things in OOP is the Polymorphism. This feature allows us to treat each child class as its parent in a homogeneous way. One can think for example of a transformation classes with a common interface and then chain the transformations like this:
from typing import List class Transform: def execute(x): raise NotImplementedError class Sum100(Transform): def execute(x): return x + 100 class Double(Transform): def execute(x): return x * 2 transformations: List[Transform] = [Double(), Sum100(), Double()] value = 10 for t in transformations: value = t.execute(value)
In this silly example, we can see how elegant and clean is to write code this way. This is because the downstream logic doesn’t need to know about the class implementation that is using, only cares that it implements the
execute
method.You can also see that there is a problem in this example. We are initializing manually the objects inside the array. This causes the class to be hardcoded in the list, which is not ideal as we lost that dynamic definition and homogeneity of objects. It will be best if we can find a way to instantiate the appropriate object in a dynamic way.
Enter the Factory Pattern
To solve the issue above we can create a “Factory” of objects. This is a new class that will return an object based on a set of conditions or parameters. There are multiple ways of implementing this in python. This is an example implementation I like to use:
from typing import Dict, List, Type class Transform: def execute(x): raise NotImplementedError class Sum100(Transform): def execute(x): return x + 100 class Double(Transform): def execute(x): return x * 2 class TransformationFactory: transformation_classes: Dict[Type[Transform]] = { 'sum_100': Sum100, 'double': Double, } @classmethod def get_transform(cls, transform_type: str) -> Transform: if transform_type not in cls.transformation_classes: raise ValueError return cls.transformation_classes[transform_type]()
With this new class, we can now define a list of transformations in a configuration file, or to be an input of the main program, and then run it like this:
transform_config = ['double', 'sum_100'] transformations: List[Transform] = [ TransformationFactory.get_transform(t_name) for t_name in transform_config ] value = 10 for t in transformations: value = t.execute(value)
As you can see, nothing is hardcoded and our code takes the most advantage of polymorphism.
Extending the Factory
With this type of implementation is also very easy to extend the transformation types without changing anything else from the main program flow. We just need to add another key value to the
TransformationFactory.transformation_classes
dictionaryclass Half(Transform): def execute(x): return x / 2 class TransformationFactory: transformation_classes: Dict[Type[Transform]] = { 'sum_100': Sum100, 'double': Double, 'half': Half, }
Conclusions
This is one of the most advanced topics we treated in this series. It will not apply to the major part of the code you will write as a data developer, but it is another useful tool in your arsenal. Use the Factory Pattern whenever you can see that Polymorphism gives an advantage and do not hardcode object initialization.