Principles of Object Oriented Programming for Data Developers

👀

Visit all the articles in the series of blog posts about Best Practices for Data Developers

Principles of Object Oriented Programming for Data Developers

MVC for Data Developers

Dependency Injection for Data Developers

Factory Pattern for Data Developers

Data Scientists come from very different backgrounds. However, in the end, 90% of the output of a Data Scientist is code, most likely Python code in Jupyter notebooks. On the other hand, Data Scientists often lack the skills needed to produce maintainable code. This affects the delivery of the team and the cost of maintaining the code we have written. We can intuitively agree that the more maintainable our code is, the better for us, our team, and our clients.

In this series, I want to give a brief and fundamental introduction to some of the principles that can help us develop better code. Software engineering can be a rabbit hole of patterns, paradigms, and frameworks, so I want to strip it down to the basics. The idea is to introduce just enough information so you can make better design decisions from the start when developing your Data Science code.

The first chapter is about programming paradigms, and in particular about Object Oriented Programming.

Programming Paradigms

Let’s start with the basic definitions. What do we mean by a programming paradigm? Wikipedia gives the following definition:

Programming paradigms are a way to classify programming languages based on their features. Languages can be classified into multiple paradigms.

So it is just a classification criterion for programming languages. The main groups are:

Imperative in which the programmer instructs the machine on how to change its state.

Declarative in which the programmer merely declares properties of the desired result, but not how to compute it.

Imperative programming is probably the most common, especially for beginners as it is the most intuitive. You tell the program how to do things. For example, to sum the elements of a list:


hello = [1, 2, 3, 4]
result = 0
for element in hello:
	result += element

Declarative programming is less intuitive but very powerful in some contexts. To convert the same example to a declarative statement, we can use the built-in sum function in Python.


sum(hello)

This last example just “declares” the desired result and leaves it to the Python implementation to decide how best to compute it. Libraries like numpy and pandas also take this declarative approach for some functions, which lets them optimize the operations using C and Fortran libraries.

Wait a minute... So what paradigm is Python?

As you might have noticed, these categories are not mutually exclusive. Python is a multi-paradigm programming language that allows us to do things in many different ways. However, we are going to focus only on Object Oriented Programming (OOP) in this article.
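To make the multi-paradigm point concrete, here is a small sketch that computes the same sum in three styles. The names `values` and `Accumulator` are made up for this illustration; they are not part of any library.

```python
from functools import reduce
from operator import add

values = [1, 2, 3, 4]

# Imperative: spell out every step and mutate a running total.
total_imperative = 0
for value in values:
    total_imperative += value

# Functional: fold the list with a function, with no explicit mutation.
total_functional = reduce(add, values, 0)


# Object oriented: an object keeps the running total as its own state.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


accumulator = Accumulator()
for value in values:
    accumulator.add(value)

print(total_imperative, total_functional, accumulator.total)
```

All three variants produce the same total; what changes is where the state lives and who is responsible for updating it.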

Object Oriented Programming Basics

Object-oriented programming (OOP) is a programming paradigm based on the concept of "objects", which can contain data and code. - Wikipedia

OOP is a type of imperative programming. One of the reasons it became so popular is that it allows us to organize the code in ways that favor modularity, extensibility, and maintainability.

The main concepts we need to know at the start are:

Classes: define the template of the objects. The classes can contain the methods and the attributes.

Methods: are the functions that belong to a class. They usually have access to the object attributes to update or retrieve data. They perform the business logic of your implementation.

Attributes: are the data contained in an object.

Instances: are the materialization of the classes.

You are most likely familiar with these concepts, so let’s see a quick example:


# A class for a car
class Car:
    # The __init__ method initializes each new instance
    def __init__(self, model, horse_power):
        self.horse_power = horse_power
        self.model = model

    # Business logic method
    def transport(self, point_x1, point_x2):
        print(f"Moving from point {point_x1} to {point_x2}")

my_car = Car('ferrari', 1000)
my_car.horse_power  # instance attribute
my_car.transport(10, 20)  # calling method

This is a silly implementation, but as you can see here we have the first advantage of OOP: encapsulation. Let’s say we want to replicate this but without using classes. How do we do it?

We don’t have Car objects, so we don’t have Car attributes. The specific data for a car will need to live separately in different data structures.

The transport function will not have access to the car data, so we need to pass that data as a parameter, and passing the car data is complex because it is not encapsulated in an object...

In other words: more code, less intuitive, less maintainable.
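As a rough sketch of what that class-free version could look like, the car data below lives in plain dictionaries and is passed to a standalone function. The dictionary keys mirror the attributes of the Car class, and `other_car` is a made-up second car added only for illustration.

```python
# Car data lives in plain dictionaries instead of objects.
# The keys "model" and "horse_power" mirror the Car attributes.
my_car = {"model": "ferrari", "horse_power": 1000}
other_car = {"model": "fiat", "horse_power": 70}


def transport(car, point_x1, point_x2):
    # Nothing ties this function to car data: the caller has to pass
    # the right dictionary, with the right keys, on every call.
    return f"Moving {car['model']} from point {point_x1} to {point_x2}"


print(transport(my_car, 10, 20))
print(transport(other_car, 0, 5))
```

Nothing stops a caller from passing a dictionary with a missing or misspelled key, which is exactly the kind of mistake a class with a defined __init__ helps to prevent.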

Classes are common in Python packages you have probably used. For example, a pandas DataFrame is a class with its own data and methods. Likewise, a scikit-learn estimator holds the trained model and exposes the predict() method.

OOP Basic Principles: Encapsulation, Inheritance, Polymorphism

The basic structure for an OOP program is something easy to see and understand when it is already built. However, we need to understand the pillars of OOP to be able to build effective OOP programs ourselves.

Encapsulation

We have already seen this one. This principle tells us to group our data and methods in units that make sense. In our example, we created a class Car. This class encapsulates its data and its methods (business logic).

The basis of this principle is to search for the correct abstractions in our use case and create the proper classes with them.
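For a data-flavoured illustration of picking an abstraction, here is a minimal sketch of a scaler that keeps its learned statistics together with the logic that uses them. `SimpleScaler` and its `fit`/`transform` methods are hypothetical names chosen for this example, and the sketch assumes the input is not constant (otherwise the standard deviation would be zero).

```python
from statistics import mean, pstdev


class SimpleScaler:
    """Hypothetical scaler that keeps its learned statistics with its logic."""

    def __init__(self):
        self.mean_ = None
        self.std_ = None

    def fit(self, values):
        # Learn the statistics and store them on the instance.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Reuse the stored statistics; callers never pass them around.
        # Assumes fit() was called on non-constant data (std_ > 0).
        return [(v - self.mean_) / self.std_ for v in values]


scaler = SimpleScaler().fit([2.0, 4.0, 6.0])
scaled = scaler.transform([4.0, 8.0])
print(scaled)
```

The caller only deals with one object; the mean and standard deviation travel with the methods that need them.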

Inheritance

There can be situations in which we need to define common abstractions for multiple classes. Following our Car example, we might want to extend our program to accept bicycles, trucks, planes... If we try to do it right off the bat, we end up with this:


class Car:
    def __init__(self, model, horse_power, wheels=4):
        ...

    def transport(self, point_x1, point_x2):
        print(f"Moving from point {point_x1} to {point_x2} with a car")

class Truck:
    def __init__(self, model, horse_power, wheels=16):
        ...

    def transport(self, point_x1, point_x2):
        print(f"Moving from point {point_x1} to {point_x2} with a truck")

class Bicycle:
    def __init__(self, model, wheels=2):
        ...

    def transport(self, point_x1, point_x2):
        print(f"Moving from point {point_x1} to {point_x2} with a bicycle")

# ...

As you can see, each one of these forms of transportation has its own implementation of the transport method, but they can share similar methods and attributes. In fact, the initialization will be almost identical for each one of them. In the end, the purpose of these classes is the same, but the implementation details vary.

Inheritance helps us in these situations by defining a common abstraction for all the classes. It allows us to do something like this:


class Vehicle:
    def __init__(self, model, wheels):
        self.model = model
        self.wheels = wheels

    def transport(self, point_x1, point_x2):
        raise NotImplementedError


class Car(Vehicle):
    def __init__(self, model, wheels, horse_power):
        super().__init__(model, wheels)
        self.horse_power = horse_power

    def transport(self, point_x1, point_x2):
        print(f"Moving from point {point_x1} to {point_x2} by car")

# We can do the same for all vehicles.

my_car = Car('ferrari', horse_power=1000, wheels=4)
my_car.model  # Model attribute is correctly setup by the parent
my_car.transport(10, 20)  # calling method with the Car implementation

So there are several things happening here:

With the notation class Car(Vehicle): we tell Python that this class inherits from Vehicle.

The built-in super() function allows us to call methods from the parent class. In this simple example, we only use it to call the parent __init__ so we can set up the same attributes as the parent, but we can use it for any method.

The transport method is overridden in the child class with the proper logic for its implementation.

In this simple example, we are not even using all the possible advantages. We can have common methods defined once in the parent class that the child classes inherit without rewriting them.
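A minimal sketch of such a shared method, assuming a hypothetical `describe` method that is not part of the original example: it is written once in Vehicle and reused as-is by every child class.

```python
class Vehicle:
    def __init__(self, model, wheels):
        self.model = model
        self.wheels = wheels

    def describe(self):
        # Written once in the parent, inherited unchanged by every child.
        return f"{self.model} with {self.wheels} wheels"

    def transport(self, point_x1, point_x2):
        raise NotImplementedError


class Car(Vehicle):
    def transport(self, point_x1, point_x2):
        return f"Moving from point {point_x1} to {point_x2} by car"


class Bicycle(Vehicle):
    def transport(self, point_x1, point_x2):
        return f"Moving from point {point_x1} to {point_x2} by bicycle"


print(Car("ferrari", wheels=4).describe())
print(Bicycle("bh", wheels=2).describe())
```

Neither Car nor Bicycle defines describe or __init__; both come from Vehicle, so a fix or improvement made there reaches every vehicle at once.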

Polymorphism

Inheritance becomes more powerful when combined with polymorphism. This principle tells us that we can (and should) treat each child class the same way we treat the parent class.

Let’s see it with our silly example:


from typing import List

class Vehicle:
    def __init__(self, model):
        ...
    def transport(self, point_x1, point_x2):
        raise NotImplementedError

# Assume for brevity that we implement these classes
class Car(Vehicle):
    ...

class Truck(Vehicle):
    ...

class Bicycle(Vehicle):
    ...

# Let's use type hints
vehicle_list: List[Vehicle] = [
    Car('ferrari'),
    Truck('ford'),
    Bicycle('bh'),
]

for v in vehicle_list:
    v.transport(10, 20)

In Python, type hints work more as documentation than as enforcement. I highly recommend them, especially if you use an IDE with code completion, as they help the IDE know which kind of objects you are handling.
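The "documentation, not enforcement" point can be checked with a tiny sketch. The function `total_wheels` is a made-up helper for this illustration.

```python
from typing import List


def total_wheels(wheel_counts: List[int]) -> int:
    # The annotations document the intent; Python does not check them at runtime.
    return sum(wheel_counts)


# This call matches the hints.
matching = total_wheels([4, 16, 2])

# This call contradicts the hints (floats, not ints), yet it runs without error.
mismatching = total_wheels([4.5, 2.5])

print(matching, mismatching)
```

The interpreter accepts both calls. Catching the second one requires an external static type checker such as mypy, or the inspections built into your IDE.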

As you can see in the vehicle list example, when we traverse the list we don’t care which actual class each instance belongs to, as long as it is a vehicle that implements the transport method. This can be extremely powerful. For instance, we can easily extend our program with new vehicle classes:


class MotorBike(Vehicle):
    ...

vehicle_list: List[Vehicle] = [
    Car('ferrari'),
    Truck('ford'),
    Bicycle('bh'),
    MotorBike('yamaha'),
]

for v in vehicle_list:
    v.transport(10, 20)
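The child classes above are left unimplemented for brevity, so calling transport on them would raise NotImplementedError. Here is one possible complete version you can run. The message format and the `move_all` helper are assumptions made for this sketch.

```python
from typing import List


class Vehicle:
    def __init__(self, model):
        self.model = model

    def transport(self, point_x1, point_x2):
        raise NotImplementedError


class Car(Vehicle):
    def transport(self, point_x1, point_x2):
        return f"{self.model}: moving from {point_x1} to {point_x2} by car"


class Truck(Vehicle):
    def transport(self, point_x1, point_x2):
        return f"{self.model}: moving from {point_x1} to {point_x2} by truck"


class MotorBike(Vehicle):
    def transport(self, point_x1, point_x2):
        return f"{self.model}: moving from {point_x1} to {point_x2} by motorbike"


def move_all(vehicles: List[Vehicle], point_x1, point_x2) -> List[str]:
    # The loop relies only on the Vehicle interface, not on concrete classes.
    return [v.transport(point_x1, point_x2) for v in vehicles]


vehicle_list: List[Vehicle] = [Car("ferrari"), Truck("ford"), MotorBike("yamaha")]
for message in move_all(vehicle_list, 10, 20):
    print(message)
```

Adding MotorBike required no change to move_all, which is the practical payoff of polymorphism.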

Final Thoughts

OOP can be a powerful paradigm to increase the maintainability, usability, and code quality of your codebase. This is just an introduction to its basic principles. There are multiple ways of using them and almost unlimited ways of doing the same thing. As long as you have these principles in mind, you will start recognizing patterns in which you can apply them.

We must also remember to use the right tool for the job. OOP is one more tool in our arsenal and should not be our only choice for everything we do. However, it is probably a good and safe starting point.

OOP can be combined with well-established software design patterns to help manage the complexity of your software. Not all of these patterns are relevant to Data Developers. Go to the next articles to find out more about the ones I have found most useful in my career.
