MVC for Data Developers

👀

Visit all the articles in the series of blogs post about Best Practices for Data Developers

Principles of Object Oriented Programming for Data Developers

MVC for Data Developers

Dependency Injection for Data Developers

Factory Pattern for Data Developers

The work of data engineers and data scientist involves writing a lot of code. As this code grows, the cost of maintaining it increases exponentially. The software industry has developed patterns and best practices for decades to overcome or mitigate these costs. As the Data Developer role is very new, we must find a way to adopt these patterns in our daily development work, with the necessary adaptations for our use case.

In this series, I wanted to give a brief and fundamental approach to some of the principles that can help us develop better code. For this chapter, we will focus on a twist of the traditional MVC pattern to adapt it to our data pipeline jobs.

⚠️

Disclaimer

Multiple software patterns use a similar structure that we can map to our use case. For example the ECB, MVA, or ADR. I am using the MVC as it is one of the oldest and most well-known patterns, and its entities have easy-to-recognize names.

The most important thing is that we understand:

Why do we use these patterns?

What are the core principles?

When should we apply them?

💡

Edit: May 15, 2024

Revisiting this article after two years, I probably would have chosen another name for the pattern an analogies to use. It is probably more useful to relate this concepts with hexagonal architecture patterns (ports and adapters kind of entities)

My latest revision of these principles are reflected in the

The DataAccessLayer Class, which combines a lot of software patterns and best practices in a real life curated class used in production.

Tradicional MVC framework

Source Wikipedia

Model–view–controller (MVC) is a software design pattern commonly used for developing user interfaces that divide the related program logic into three interconnected elements. This is done to separate internal representations of information from the ways information is presented to and accepted by the user.

As you might be thinking, the typical data pipeline has nothing to do with user interfaces. However, we can recycle the main ideas, and remove the specifics of the web development.

The core principle is that the information, its representation, and the interactions with the work (users or other systems) are conceptually separated in the source code using different classes/objects.

Applications in data processing

The traditional MVC entities can be mapped like this

Controller

Manage configuration parameters on the job initialization. Like reading environment variables or parsing configuration files.

Controls the main flow of the application.

Model

Represent data transformations, data analysis, or ML inference.

It is the main class that encapsulates business logic.

The configuration is set at the class creation. The input is the data to process. The output is the processed data.

Initialized, configured, and launched by the controller.

View

Manage read and writing of the input and output data.

Abstract the storage for the business logic

Depending on the complexity of our pipeline, the View object might not be needed and can be absorbed into a method of the Controller. However, the principle we must adhere to is: separate reading and writing from business logic.

A practical example

Let’s see the theory in action with a practical example. Let’s develop the code for a job that will return reporting metrics of a given dataset. The dataset is stored in an S3 bucket, and we must return the report to another S3 bucket. We will use Python and Pandas for this example.


import os

import pandas as pd

# Our Model class with a more meaninful name
class ReportGenerator:
    def genrate_report(self, df: pd.DataFrame) -> pd.DataFrame:
        # Let's not complicate ourselves for the example and use a pandas native method
        return df.describe()

# A simple view class
class ResultWriter:
    def write_results(self, df: pd.DataFrame, dst: str):
        df.to_parquet(dst)

# The main Controller class
class Controller:
    def __init__(self):
        # On initialization we will parse the configuration and create the needed objects
        self.src = os.getenv("S3_SRC_URL")
        self.dst = os.getenv("S3_DST_URL")

        self.report_generator = ReportGenerator()
        self.result_writer = ResultWriter()
    
    def run(self):
        # The main method that controlls the flow of the job
        df = pd.read_parquet(self.src)
        report = self.report_generator.genrate_report(df)
        self.result_writer.write_results(report)

We can see that immediately we can create a small test for our business logic like this:


import pandas as pd
import pytest

class TestReportGenerator:
    @pytest.fixture
    def input_data() -> pd.DataFrame:
        return pd.DataFrame({'test_column': [1, 1, 1, 1]})
    
    @pytest.fixture
    def expected_output_data() -> pd.DataFrame:
        return pd.DataFrame(
            {
                'test_column': {
                    'count': 4.0, 
                    'mean': 1.0, 
                    'std': 0.0, 
                    'min': 1.0, 
                    '25%': 1.0, 
                    '50%': 1.0, 
                    '75%': 1.0, 
                    'max': 1.0,
                }
            }
        )
    
    def test_basic(self, input_data: pd.DataFrame, expected_output_data: pd.DataFrame):
        report_generator = ReportGenerator()
        output = report_generator.genrate_report(input_data)
        pd.testing.assert_frame_equal(output, expected_output_data)

This is a very small example, so some things look overkill. For example, wrapping up the Pandas to_parquet() method in a class like that. In real life, you don’t need to follow these guidelines to the letter if they don’t make sense for your use case.

Why should we use this pattern?

OOP Advantages

As we mentioned earlier, this relates to the Single Responsibility Principle. We create the right classes for the right job, and allow us to take advantage of the basic OOP principles:

Encapsulation is achieved by separating the classes and their roles

Enables inheritance and polymorphism. For example, we can define a generic interface or ML inference as a parent class, and then the controller can instantiate and launch each model based on the flow of the pipeline.

Easier tests, more tests

Another big advantage is that with this separation our code became much more testable. The logic to write and read the data is separated. As well as the code that instantiates and configures the Model. Our business logic is perfectly encapsulated in the Model, so we can focus on creating very good unit tests for that class. We only need to define the input and the expected output.

Creating integration tests to validate the View and the Controller is also a good idea. As always, the integration tests will still be challenging.

Modularity And Extensibility

With these abstractions (and good tests) we can perform changes easily in different parts of the pipeline. It also allows us to extend the functionality thanks to the OOP principles we mentioned above.

Conclusions

Data Engineers and Data Scientists produce a lot of software. So increasing our arsenal with more software patterns will always be useful. However, the key thing is to know how to apply them to our day-to-day lives.

In the following chapters, we will talk about other patterns that will help us to overcome more problems that the MVC pattern doesn’t solve.