Factory Pattern for Data Developers


Visit all the articles in the blog series Best Practices for Data Developers:

Principles of Object Oriented Programming for Data Developers

MVC for Data Developers

Dependency Injection for Data Developers

Factory Pattern for Data Developers

The work of data engineers and data scientists involves writing a lot of code. As this code grows, the cost of maintaining it rises steeply. The software industry has spent decades developing patterns and best practices to overcome or mitigate these costs. Because the Data Developer role is still new, we must find a way to adopt these patterns in our daily development work, with the necessary adaptations for our use case.

In this series, I want to give a brief introduction to some of the fundamental principles that can help us write better code. In this chapter, we focus on how to use the Factory pattern in our code.

Polymorphism is powerful

One of the best features of OOP is polymorphism. It allows us to treat every child class as its parent, in a homogeneous way. Think, for example, of a set of transformation classes that share a common interface, which we can chain like this:


from typing import List

class Transform:
    # Common interface: every transformation implements execute().
    def execute(self, x):
        raise NotImplementedError

class Sum100(Transform):
    def execute(self, x):
        return x + 100

class Double(Transform):
    def execute(self, x):
        return x * 2

transformations: List[Transform] = [Double(), Sum100(), Double()]
value = 10
for t in transformations:
    value = t.execute(value)

Even in this silly example, we can see how elegant and clean it is to write code this way. The downstream logic doesn’t need to know which class implementation it is using; it only cares that the object implements the execute method.

You can also see a problem in this example: we are manually initializing the objects inside the list. This hardcodes the classes in the list, which is not ideal because we lose the dynamic definition and the homogeneity of the objects. It would be better to find a way to instantiate the appropriate objects dynamically.

Enter the Factory Pattern

To solve the issue above, we can create a “Factory” of objects: a new class that returns an object based on a set of conditions or parameters. There are multiple ways of implementing this in Python. This is an example implementation I like to use:


from typing import Dict, List, Type

class Transform:
    # Common interface: every transformation implements execute().
    def execute(self, x):
        raise NotImplementedError

class Sum100(Transform):
    def execute(self, x):
        return x + 100

class Double(Transform):
    def execute(self, x):
        return x * 2


class TransformationFactory:
    # Maps a configuration name to the class that implements it.
    transformation_classes: Dict[str, Type[Transform]] = {
        'sum_100': Sum100,
        'double': Double,
    }

    @classmethod
    def get_transform(cls, transform_type: str) -> Transform:
        if transform_type not in cls.transformation_classes:
            raise ValueError(f'Unknown transform type: {transform_type}')
        # Instantiate the class and return the object.
        return cls.transformation_classes[transform_type]()

The factory can return either an initialized object or the class itself. I normally prefer to return the object.

With this new class, we can define the list of transformations in a configuration file, or pass it as an input to the main program, and then run it like this:
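To make the difference between the two return styles concrete, here is a minimal sketch that offers both. The `Scale` class, its `factor` argument, and the `get_transform_class` method are hypothetical names introduced only for this illustration; they are not part of the implementation above.

```python
from typing import Dict, Type


class Transform:
    def execute(self, x):
        raise NotImplementedError


class Double(Transform):
    def execute(self, x):
        return x * 2


class Scale(Transform):
    # Hypothetical transform that takes a constructor argument.
    def __init__(self, factor=1):
        self.factor = factor

    def execute(self, x):
        return x * self.factor


class TransformationFactory:
    transformation_classes: Dict[str, Type[Transform]] = {
        'double': Double,
        'scale': Scale,
    }

    @classmethod
    def get_transform_class(cls, transform_type: str) -> Type[Transform]:
        # Return the class itself; the caller decides how to build it.
        if transform_type not in cls.transformation_classes:
            raise ValueError(f'Unknown transform type: {transform_type}')
        return cls.transformation_classes[transform_type]

    @classmethod
    def get_transform(cls, transform_type: str, **kwargs) -> Transform:
        # Return a ready-to-use object, forwarding constructor arguments.
        return cls.get_transform_class(transform_type)(**kwargs)


# Style 1: ask for the class and instantiate it ourselves.
scale_cls = TransformationFactory.get_transform_class('scale')
triple = scale_cls(factor=3)

# Style 2: ask for an initialized object directly.
double = TransformationFactory.get_transform('double')

result = double.execute(triple.execute(10))
```

Returning the class leaves construction to the caller, which can help when transforms need different arguments. Returning the object keeps the calling code shorter, which is likely why it is the more convenient default.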


transform_config = ['double', 'sum_100']
transformations: List[Transform] = [
    TransformationFactory.get_transform(t_name) for t_name in transform_config
]
value = 10
for t in transformations:
    value = t.execute(value)

As you can see, no class is hardcoded in the main flow, and our code takes full advantage of polymorphism.
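As a rough sketch of the configuration-file idea, the list of names could come from JSON. The `transformations` key and the inline `config_text` string are assumptions made for this example; the string stands in for the contents of a real file, which you would read with `open()` instead.

```python
import json
from typing import Dict, List, Type


class Transform:
    def execute(self, x):
        raise NotImplementedError


class Sum100(Transform):
    def execute(self, x):
        return x + 100


class Double(Transform):
    def execute(self, x):
        return x * 2


class TransformationFactory:
    transformation_classes: Dict[str, Type[Transform]] = {
        'sum_100': Sum100,
        'double': Double,
    }

    @classmethod
    def get_transform(cls, transform_type: str) -> Transform:
        if transform_type not in cls.transformation_classes:
            raise ValueError(f'Unknown transform type: {transform_type}')
        return cls.transformation_classes[transform_type]()


# Stand-in for the contents of a hypothetical JSON configuration file.
config_text = '{"transformations": ["double", "sum_100", "double"]}'
config = json.loads(config_text)

# Build the pipeline from names only; no class is referenced here.
transformations: List[Transform] = [
    TransformationFactory.get_transform(name)
    for name in config['transformations']
]

value = 10
for t in transformations:
    value = t.execute(value)
```

Changing the pipeline then only requires editing the configuration, not the program.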

Extending the Factory

With this type of implementation, it is also very easy to add new transformation types without changing anything else in the main program flow. We just need to add another key-value pair to the TransformationFactory.transformation_classes dictionary:


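class Half(Transform):
    def execute(self, x):
        return x / 2


class TransformationFactory:
    transformation_classes: Dict[str, Type[Transform]] = {
        'sum_100': Sum100,
        'double': Double,
        'half': Half,  # the only new line in the factory
    }

    # get_transform is unchanged from the previous definition.
    @classmethod
    def get_transform(cls, transform_type: str) -> Transform:
        if transform_type not in cls.transformation_classes:
            raise ValueError(f'Unknown transform type: {transform_type}')
        return cls.transformation_classes[transform_type]()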
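If editing the dictionary by hand becomes tedious, one possible variation is to let each class register itself. This is a sketch of a different technique, a registration decorator, and not the implementation used in this post. The `register` method is a hypothetical addition.

```python
from typing import Callable, Dict, Type


class Transform:
    def execute(self, x):
        raise NotImplementedError


class TransformationFactory:
    # Starts empty; classes add themselves through register().
    transformation_classes: Dict[str, Type[Transform]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type[Transform]], Type[Transform]]:
        # Hypothetical helper: returns a class decorator that stores the
        # decorated class under the given name.
        def decorator(transform_cls: Type[Transform]) -> Type[Transform]:
            cls.transformation_classes[name] = transform_cls
            return transform_cls
        return decorator

    @classmethod
    def get_transform(cls, transform_type: str) -> Transform:
        if transform_type not in cls.transformation_classes:
            raise ValueError(f'Unknown transform type: {transform_type}')
        return cls.transformation_classes[transform_type]()


@TransformationFactory.register('double')
class Double(Transform):
    def execute(self, x):
        return x * 2


@TransformationFactory.register('half')
class Half(Transform):
    def execute(self, x):
        return x / 2


value = 10
for name in ['double', 'half']:
    value = TransformationFactory.get_transform(name).execute(value)
```

The trade-off is that a class is only registered once the module defining it has been imported, so the explicit dictionary remains the easier option to read and debug.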

Conclusions

This is one of the most advanced topics we have covered in this series. It will not apply to most of the code you write as a data developer, but it is another useful tool in your arsenal. Use the Factory Pattern whenever polymorphism gives you an advantage, and avoid hardcoding object initialization.