Visit all the articles in the series of blogs post about Best Practices for Data Developers
Data Scientist comes from very different backgrounds. However, in the end, 90% of the output of a Data Scientist is code. Most likely, Python code in Jupyter notebooks. On the other hand, Data Scientist often lacks the necessary skills to produce maintainable code. This will impact the delivery of the team and the costs associated with maintaining the code we have written. We can intuitively agree that the more maintainable our code, the better for us, our team, and our clients.
In this series, I wanted to give a brief and fundamental approach to some of the principles that can help us develop better code. As software engineering can be a rabbit hole of patterns, paradigms, and frameworks... I want to strip down the basics. The idea is to just introduce enough information so you can make better design decisions from the start when developing your Data Science code.
The first chapter is about programming paradigms. In particular about the Object Oriented Programming.
Programming Paradigms
Let’s start with the basic definitions. What do we mean by Programming Paradigm? So by wikipedia page, we have the following definition:
Programming paradigms are a way to classify programming languages based on their features. Languages can be classified into multiple paradigms.
So it is just a classification criterion for programming languages. The main groups are:
- Imperative in which the programmer instructs the machine on how to change its state.
- Declarative in which the programmer merely declares properties of the desired result, but not how to compute it.
Imperative programming is probably the most common, especially for beginners as it is the most intuitive. You tell the program how to do things. For example, to sum the elements of a list:
hello = [1, 2, 3, 4] result = 0 for element in hello: result += element
Declarative programming is less intuitive but very powerful in some contexts. To convert the same example to a declarative statement we need to use the
sum
function in python.sum(hello)
This last example just “declares” the result, but leaves the python implementation to do the job as it is considered best. Libraries like
numpy
and pandas
also use this declarative approach for some functions, which allows them to optimize the operations using C and Fortran libraries.Wait a minute... So what paradigm is Python?
As you might have noticed, this kind of classification is not a disjoint set. Python is a multi-paradigm programming language that allows us to do things in many different ways. However, we are going to focus only on Object Oriented Programming (OOP) in this article.
Object Oriented Programming Basics
Object-oriented programming(OOP) is a programming paradigm based on the concept of "objects", which can contain data and code. - Wikipedia
OOP is a type of imperative programming. One of the reasons it became so popular is that it allows us to organize the code in ways that favor modularity, extensibility, and maintainability.
The main concepts we need to know at the start are:
- Classes: define the template of the objects. The classes can contain the methods and the attributes.
- Methods: are the function that belongs to a class. They usually have the object attributes available to update data or retrieve data. They basically perform the business logic of your implementation
- Attributes: are the data contained in an object.
- Instances: are the materialization of the classes.
You are most likely familiar with these concepts, so let’s see a quick example:
# A class for a car class Car: # Init method creates new instances def __init__(self, model, horse_power): self.horse_power = horse_power self.model = model # Bussiness logic method def transport(self, point_x1, point_x2): print(f"Moving from point {point_x1} to {point_x2}") my_car = Car('ferrari', 1000) my_car.horse_power # instance attrbute my_car.transport(10, 20) # calling method
This is a silly implementation, but as you can see here we have the first advantage of OOP: encapsulation. Let’s say we want to replicate this but without using classes. How do we do it?
- We don’t have
Car
objects, so we don’tCar
have attributes. The specific data for acar
will need to live separately in different data structures.
- The
transport
method will not have access to thecar
data, so we need to pass that data as a parameter, and passing thecar
data is complex because is not encapsulated into an object...
In other words: more code, less intuitive, less maintainable.
Classes are common in python for multiple packages you might have used. For example, a pandas DataFrame is a class with its data and its methods. As well a SkLearn estimator contained the trained model and the
predict()
method.OOP Basic Principles: Encapsulation, Inheritance, Polymorphism
The basic structure for an OOP program is something easy to see and understand when it is already built. However, we need to understand the pillars of OOP to be able to build effective OOP programs ourselves.
Encapsulation
We already have seen this one. This principle tells us to group our data and methods in units that make sense. In our example, we created a class
Car
. This class encapsulates its data and its methods (business logic)The basis of this principle is to search for the correct abstractions in our use case and create the proper classes with them.
Inheritance
There can be situations in which we need to define common abstractions for multiple classes. Following our
Car
example, we might want to extend our programs to accept bicycles, trucks, planes... If we try to do it just right out of the bat, we ended up with thisclass Car: def __init__(self, model, horse_power, wheels=4): ... def transport(self, point_x1, point_x2): print(f"Moving from point {point_x1} to {point_x2} with a car") class Truck: def __init__(self, model, horse_power, wheels=16): ... def transport(self, point_x1, point_x2): print(f"Moving from point {point_x1} to {point_x2} with a Truck") class Bicycle: def __init__(self, model, wheels=2): ... def transport(self, point_x1, point_x2): print(f"Moving from point {point_x1} to {point_x2} with a bicycle") ....
As you can see, each one of these forms of transportation has its own implementation of the transport method, but they can share similar methods and attributes. In fact, the initialization will be almost identical for each one of them. In the end, the purpose of these classes is the same, but the implementation details vary.
Inheritance helps us with these types of situations defining a common abstraction for all classes. It allows us to do something like this:
class Vehicle: def __init__(self, model, wheels): self.model = model self.wheels = wheels def transport(self, point_x1, point_x2): raise NotImplementedError class Car(Vehicle): def __init__(self, model, wheels, horse_power): super().__init__(model, wheels) self.horse_power = horse_power def transport(self, point_x1, point_x2): print(f"Moving from point {point_x1} to {point_x2} by car") # We can do the same for all vehicles. my_car = Car('ferrari', horse_power=1000, wheels=4) my_car.model # Model attribute is correctly setup by the parent my_car.transport(10, 20) # calling method with the Car implementation
So there are several things happening here:
- With the notation
class Car(Vehicle):
we told python that this class inherits fromVehicle
- The
super()
method allows us to call methods from the parent class. In this simple example, we only use it to call the parent__init__
so we can set up the same attributes as the parent, but we can use it for any method.
- The
transport
method is overwritten in the child class with the proper logic for its implementation
In this simple example, we are not event using all the advantages possible. We can have common methods in all the child classes that don’t need to be rewritten by each implementation.
Polymorphism
Inheritance is more powerful with the combination of Polymorphism. This principle tells us that we can (should) treat each child class the same way we treat the parent class.
Let’s see it with our silly example
from typing import List class Vehicle: def __init__(self, model): ... def transport(self, point_x1, point_x2): raise NotImplementedError # Assume for brevity that we implement these classes class Car(Vehicle): ... class Truck(Vehicle): ... class Bicycle(Vehicle): ... # Let's use type hints vehicle_list: List[Vehicle] = [ Car('ferrari'), Truck('ford'), Bicycle('bh'), ] for v in vehcle_list: v.transport(10, 20)
So as you can see in this example, when we transverse the vehicle list we don’t care which actual class instance is in it as long as it is a vehicle that implements the
transport
method. This can be extremely powerful. In this example, we can see how easily we can extend our program for new vehicles classesclass MotorBike(Vehicle): ... vehicle_list: List[Vehicle] = [ Car('ferrari'), Truck('ford'), Bicyle('bh'), MotorBike('yamaha'), ] for v in vehcle_list: v.transport(10, 20)
Final Thoughts
OOP can be a powerful paradigm to increase the maintainability, usability, and code quality of your codebase. This is just an introduction to its basic principles. There are multiple ways of using them and almost unlimited ways of doing the same thing. As long as you have these principles in mind, you will start recognizing patterns in which you can apply them.
We must also keep in mind to use the right tools for the right job. OOP is another tool in our arsenal and must not be our single choice for everything we do. However, it is probably a good and safe starting point.
OOP can be combined with very well-established Software Design patterns to help to manage the complexity of your software. Not all these patterns can be applied to Data Developers. Go to the next articles to find more about the ones I have found more useful in my career.
Visit all the articles in the series of blogs post about Best Practices for Data Developers