Visit all the articles in the Best Practices for Data Developers blog post series.
The work of data engineers and data scientists involves writing a lot of code. As this code grows, the cost of maintaining it increases rapidly. The software industry has spent decades developing patterns and best practices to overcome or mitigate these costs. Since the Data Developer role is very new, we must find a way to adopt these patterns in our daily development work, with the necessary adaptations for our use case.
In this series, I want to give a brief, fundamentals-focused look at some of the principles that can help us develop better code. In this chapter, we will focus on a twist on the traditional MVC pattern that adapts it to our data pipeline jobs.
Disclaimer
Multiple software patterns use a similar structure that we can map to our use case, for example ECB, MVA, or ADR. I am using MVC because it is one of the oldest and best-known patterns, and its entities have easy-to-recognize names.
The most important thing is that we understand:
- Why do we use these patterns?
- What are the core principles?
- When should we apply them?
Edit: May 15, 2024
Revisiting this article after two years, I would probably have chosen a different name for the pattern and different analogies. It is probably more useful to relate these concepts to hexagonal architecture patterns (ports-and-adapters kinds of entities); see the sketch after this note.
My latest revision of these principles is reflected in The DataAccessLayer Class, which combines many software patterns and best practices in a real-life, curated class used in production.
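To make that analogy concrete, here is a minimal sketch of how the same entities could be framed as ports and adapters. The `StoragePort` protocol and `S3ParquetAdapter` names are illustrative assumptions, not part of the original article:

```python
from typing import Protocol

import pandas as pd


class StoragePort(Protocol):
    """Port: the storage interface the business logic depends on."""

    def read(self) -> pd.DataFrame: ...

    def write(self, df: pd.DataFrame) -> None: ...


class S3ParquetAdapter:
    """Adapter: a concrete implementation bound to S3 and Parquet."""

    def __init__(self, src: str, dst: str):
        self.src, self.dst = src, dst

    def read(self) -> pd.DataFrame:
        return pd.read_parquet(self.src)

    def write(self, df: pd.DataFrame) -> None:
        df.to_parquet(self.dst)
```

The business logic only sees the port, so swapping S3 for local files or a database only requires a new adapter.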
Traditional MVC framework
Source: Wikipedia
Model–view–controller (MVC) is a software design pattern commonly used for developing user interfaces that divide the related program logic into three interconnected elements. This is done to separate internal representations of information from the ways information is presented to and accepted by the user.
As you might be thinking, the typical data pipeline has nothing to do with user interfaces. However, we can recycle the main ideas and remove the specifics of web development.
The core principle is that the information, its representation, and the interactions with the outside world (users or other systems) are conceptually separated in the source code using different classes/objects.
Applications in data processing
The traditional MVC entities can be mapped as follows:
Controller
- Manages configuration parameters at job initialization, such as reading environment variables or parsing configuration files.
- Controls the main flow of the application.
Model
- Represents data transformations, data analysis, or ML inference.
- It is the main class that encapsulates the business logic.
- The configuration is set at class creation. The input is the data to process; the output is the processed data.
- It is initialized, configured, and launched by the Controller.
View
- Manages reading and writing of the input and output data.
- Abstracts the storage away from the business logic.
Depending on the complexity of our pipeline, the View object might not be needed and can be absorbed into a method of the Controller. However, the principle we must adhere to is: separate reading and writing from business logic.
A practical example
Let’s see the theory in action with a practical example. We will develop the code for a job that returns reporting metrics for a given dataset. The dataset is stored in an S3 bucket, and we must write the report to another S3 bucket. We will use Python and Pandas for this example.
import os import pandas as pd #Â Our Model class with a more meaninful name class ReportGenerator: def genrate_report(self, df: pd.DataFrame) -> pd.DataFrame: # Let's not complicate ourselves for the example and use a pandas native method return df.describe() # A simple view class class ResultWriter: def write_results(self, df: pd.DataFrame, dst: str): df.to_parquet(dst) # The main Controller class class Controller: def __init__(self): # On initialization we will parse the configuration and create the needed objects self.src = os.getenv("S3_SRC_URL") self.dst = os.getenv("S3_DST_URL") self.report_generator = ReportGenerator() self.result_writer = ResultWriter() def run(self): # The main method that controlls the flow of the job df = pd.read_parquet(self.src) report = self.report_generator.genrate_report(df) self.result_writer.write_results(report)
We can see that we can immediately create a small test for our business logic like this:
```python
import pandas as pd
import pytest

# In a real project, ReportGenerator would be imported from the job's module,
# e.g. `from report_job import ReportGenerator`.


class TestReportGenerator:
    @pytest.fixture
    def input_data(self) -> pd.DataFrame:
        return pd.DataFrame({'test_column': [1, 1, 1, 1]})

    @pytest.fixture
    def expected_output_data(self) -> pd.DataFrame:
        return pd.DataFrame(
            {
                'test_column': {
                    'count': 4.0,
                    'mean': 1.0,
                    'std': 0.0,
                    'min': 1.0,
                    '25%': 1.0,
                    '50%': 1.0,
                    '75%': 1.0,
                    'max': 1.0,
                }
            }
        )

    def test_basic(self, input_data: pd.DataFrame, expected_output_data: pd.DataFrame):
        report_generator = ReportGenerator()
        output = report_generator.generate_report(input_data)
        pd.testing.assert_frame_equal(output, expected_output_data)
```
This is a very small example, so some things look like overkill, such as wrapping the Pandas `to_parquet()` method in a class. In real life, you don’t need to follow these guidelines to the letter if they don’t make sense for your use case.

Why should we use this pattern?
OOP Advantages
As we mentioned earlier, this relates to the Single Responsibility Principle. We create the right classes for the right job, which allows us to take advantage of the basic OOP principles:
- Encapsulation is achieved by separating the classes and their roles
- Enables inheritance and polymorphism. For example, we can define a generic interface for ML inference as a parent class, and then the Controller can instantiate and launch each model based on the flow of the pipeline, as sketched below.
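Here is a minimal sketch of that idea. The `Model` base class, its `run()` method, and the two subclasses are illustrative assumptions, not a fixed API:

```python
from abc import ABC, abstractmethod

import pandas as pd


class Model(ABC):
    """Generic interface that every model in the pipeline implements."""

    @abstractmethod
    def run(self, df: pd.DataFrame) -> pd.DataFrame: ...


class DescribeModel(Model):
    """Produces summary statistics for the dataset."""

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.describe()


class OutlierModel(Model):
    """Keeps only rows more than 3 standard deviations from the column mean."""

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        zscores = (df - df.mean()) / df.std()
        return df[(zscores.abs() > 3).any(axis=1)]


# The Controller can instantiate either subclass and treat them uniformly
def run_step(model: Model, df: pd.DataFrame) -> pd.DataFrame:
    return model.run(df)
```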
Easier tests, more tests
Another big advantage is that with this separation our code becomes much more testable. The logic for reading and writing data is separated, as is the code that instantiates and configures the Model. Our business logic is fully encapsulated in the Model, so we can focus on creating very good unit tests for that class: we only need to define the input and the expected output.
Creating integration tests to validate the View and the Controller is also a good idea. As always, the integration tests will be more challenging.
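As a starting point, here is a minimal sketch of an integration test for the `ResultWriter`, using pytest’s built-in `tmp_path` fixture; writing to a local path instead of a real S3 bucket is an assumption made for the test:

```python
import pandas as pd

# Assumes ResultWriter is importable from the job's module.


def test_result_writer_roundtrip(tmp_path):
    writer = ResultWriter()
    df = pd.DataFrame({'test_column': [1, 2, 3]})
    dst = str(tmp_path / "report.parquet")

    writer.write_results(df, dst)

    # Reading the file back should return exactly what we wrote
    pd.testing.assert_frame_equal(pd.read_parquet(dst), df)
```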
Modularity And Extensibility
With these abstractions (and good tests) we can easily make changes to different parts of the pipeline. They also allow us to extend the functionality thanks to the OOP principles we mentioned above.
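For instance, supporting a new output format only requires a new View, while the Model and the Controller flow stay untouched. The `CsvResultWriter` below is a hypothetical extension, not part of the original example:

```python
import pandas as pd


class ResultWriter:
    """The base View from the example above."""

    def write_results(self, df: pd.DataFrame, dst: str):
        df.to_parquet(dst)


class CsvResultWriter(ResultWriter):
    """Alternative View: same interface, different storage format."""

    def write_results(self, df: pd.DataFrame, dst: str):
        df.to_csv(dst, index=False)
```

The Controller could then receive the writer as a constructor argument instead of hard-coding it, which keeps the main flow unchanged when the output format changes.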
Conclusions
Data Engineers and Data Scientists produce a lot of software, so expanding our arsenal with more software patterns will always be useful. However, the key thing is to know how to apply them in our day-to-day work.
In the following chapters, we will talk about other patterns that help us overcome problems the MVC pattern doesn’t solve.