Dependency Injection for Data Developers


Visit all the articles in this series of blog posts about Best Practices for Data Developers:

Principles of Object Oriented Programming for Data Developers

MVC for Data Developers

Dependency Injection for Data Developers

Factory Pattern for Data Developers

The work of data engineers and data scientists involves writing a lot of code. As this code grows, the cost of maintaining it rises steeply. The software industry has spent decades developing patterns and best practices to overcome or mitigate these costs. As the Data Developer role is still very new, we must find a way to adopt these patterns in our daily development work, with the necessary adaptations for our use case.

In this series, I want to give a brief introduction to some of the fundamental principles that can help us write better code. In this chapter, we will focus on how to use the Dependency Injection pattern in our code.

Dangers of dependencies

We call a dependency any external system, package, or class used by our code. Depending on other software is unavoidable: otherwise we would have to write everything from scratch (from the OS to our business logic) and would end up with a massive monolith.

Although dependencies on external packages can be problematic, the most common and troublesome dependencies, in my opinion, are connections to databases, REST APIs, and other storage systems.

For example: how can we test our business logic when it depends on a running PostgreSQL server with the appropriate tables and data for our pipeline? Dependency injection helps with these kinds of issues.

What is dependency injection?

From Wikipedia:

In software engineering, dependency injection is a design pattern in which an object receives other objects that it depends on. A form of inversion of control, dependency injection aims to separate the concerns of constructing objects and using them.

As you can see, this is closely related to my previous post on MVC for Data Developers. Let’s evolve the example we used in that post. Now our ResultWriter will send the data to an HTTP server for which we have a client package installed. How can we handle this?

A first approach might look like this:


import pandas as pd
from myhttp_client_library import Client

# A simple view class
class ResultWriter:
    def __init__(self, api_key: str, host: str):
        self.client = Client(api_key=api_key, host=host)

    def write_results(self, df: pd.DataFrame):
        self.client.post_report(payload=df.to_json())

Let’s ignore the details of the client library and assume that everything is set up correctly.

This example works and achieves its goal of sending the results. The problem is that we can no longer easily unit test this class in isolation: every test would require an HTTP server running somewhere safe (locally or in a development environment).

Let’s try with dependency injection.


import pandas as pd
from myhttp_client_library import Client

# A simple view class
class ResultWriter:
    def __init__(self, client: Client):
        self.client = client

    def write_results(self, df: pd.DataFrame):
        self.client.post_report(payload=df.to_json())

The code hasn’t actually changed much. The ResultWriter class now receives an already instantiated object instead of the pieces needed to create it. This small but fundamental difference is very valuable when writing our tests. Now we can replace that object with an instance of the Mock class from the unittest.mock module.


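from unittest.mock import Mock

from myhttp_client_library import Client

# test_report_df is assumed to be a pytest fixture returning a small DataFrame
def test_result_writer(test_report_df):
    # spec=Client makes the mock reject attributes the real client lacks
    mock_client = Mock(spec=Client)
    writer = ResultWriter(mock_client)
    writer.write_results(test_report_df)
    mock_client.post_report.assert_called_once()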

We just pass a Mock object in our tests. We can make other kinds of assertions depending on how our logic interacts with the client.
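As a rough sketch of what those other assertions could look like, the block below checks the exact payload and simulates a failing server. To keep it runnable without extra packages, `Client` is a local stand-in for the client library used in this post, and `FakeFrame` is a hypothetical stand-in for a DataFrame that only provides `to_json()`. The function names are illustrative too.

```python
from unittest.mock import Mock


class Client:
    """Stand-in for the client class of the HTTP client library (assumption)."""

    def post_report(self, payload: str) -> None:
        raise RuntimeError("a real client would send an HTTP request here")


class ResultWriter:
    def __init__(self, client: Client):
        self.client = client

    def write_results(self, df) -> None:
        self.client.post_report(payload=df.to_json())


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; only to_json() is needed here."""

    def to_json(self) -> str:
        return '{"total": {"0": 42}}'


def check_payload_is_forwarded() -> None:
    mock_client = Mock(spec=Client)
    ResultWriter(mock_client).write_results(FakeFrame())
    # Exactly one call, carrying the serialized frame as the payload.
    mock_client.post_report.assert_called_once_with(payload='{"total": {"0": 42}}')


def check_client_errors_propagate() -> bool:
    mock_client = Mock(spec=Client)
    # side_effect makes the mocked method raise, simulating an outage.
    mock_client.post_report.side_effect = ConnectionError("server unreachable")
    try:
        ResultWriter(mock_client).write_results(FakeFrame())
    except ConnectionError:
        return True
    return False
```

Passing `spec=Client` is a design choice worth considering: a misspelled method name on the mock raises an AttributeError instead of silently succeeding.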

This abstraction brings other qualitative changes to our class:

The class initialization signature is simpler.

Changes in how the client library is initialized do not affect our class, because the client is constructed outside it.

A client instance can be reused by several objects (this might be a good or a bad idea depending on the behavior and concurrency of the client).
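The points above can be sketched with a single function that builds the client and hands it to every object that needs it. This is only an illustration under assumptions: `Client` is a local stand-in that records payloads instead of sending them, and `SummaryWriter`, `build_writers`, and the settings keys are hypothetical names that do not come from any real library.

```python
class Client:
    """Stand-in for the HTTP client; it records payloads instead of sending them."""

    def __init__(self, api_key: str, host: str):
        self.api_key = api_key
        self.host = host
        self.sent = []

    def post_report(self, payload: str) -> None:
        self.sent.append(payload)  # a real client would issue an HTTP request


class ResultWriter:
    def __init__(self, client: Client):
        self.client = client

    def write_results(self, df) -> None:
        self.client.post_report(payload=df.to_json())


class SummaryWriter:
    """Hypothetical second consumer of the same client."""

    def __init__(self, client: Client):
        self.client = client

    def write_summary(self, text: str) -> None:
        self.client.post_report(payload=text)


def build_writers(settings: dict):
    # The only place that knows how a Client is constructed.
    client = Client(api_key=settings["api_key"], host=settings["host"])
    # Both objects receive the same instance.
    return ResultWriter(client), SummaryWriter(client)
```

If the client library changed its constructor arguments, only `build_writers` would need to be updated in this sketch.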

Common guidelines to apply it to data pipelines

Retrieve and write data outside the business logic class whenever you can.

Pass the object instance to other objects at initialization.

If your business logic is defined as SQL or as API queries, the previous point might not apply. In that case, try to encapsulate these queries inside other objects with appropriate data retrieval methods, and pass the connection objects to them at initialization.
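The last guideline could look roughly like the sketch below. It uses Python's built-in sqlite3 module only to stay self-contained. The `SalesRepository` class, the `sales` table, and the helper function are hypothetical names chosen for illustration. An in-memory SQLite database is also not a perfect substitute for a server such as PostgreSQL, because SQL dialects differ.

```python
import sqlite3


class SalesRepository:
    """Hypothetical data-access object: it owns the SQL, not the connection."""

    def __init__(self, connection: sqlite3.Connection):
        # The connection is injected, so tests can pass a throwaway database.
        self.connection = connection

    def total_by_region(self) -> dict:
        rows = self.connection.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        ).fetchall()
        return dict(rows)


def make_test_connection() -> sqlite3.Connection:
    # An in-memory database stands in for the production server in tests.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    connection.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("south", 5.0), ("north", 2.5)],
    )
    return connection
```

The business logic can then take a `SalesRepository` (or a Mock of it) at initialization, in the same way ResultWriter takes its client.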