The Machine Learning Lifecycle

A Step-by-Step Guide

Introduction

By harnessing the power of algorithms and data, machine learning (ML) has fundamentally changed the way we solve complex problems and make predictions. The ML lifecycle consists of a series of well-defined phases that help us turn raw data into actionable insights. In this post, we will look at the major phases of the machine learning lifecycle, with concrete examples and explanations to build a solid understanding of each one.

Step 1: Define the Problem

The problem definition phase is the first stage of the ML lifecycle. It entails identifying the business or research question that needs to be answered and deciding whether machine learning is the right approach to answer it.

For example, a business might want to predict future sales based on historical data. In this case, the problem can be defined as a regression problem in ML.

Data is where a machine learning engineer's true adventure begins. Data is the key to AI and the essential fuel that ignites every AI algorithm. I think this description of AI is brilliant: "making learning machines for data, not for instructions" (the new AI coding paradigm). Knowing this, every ML engineer treats this first stage as modeling the customer's need: without it, no problem gets solved, or the wrong problem gets solved. Before beginning to gather the data needed to solve the right problem, this first step must be carefully planned and understood. Let's now move on to the next stage.

Step 2: Data Collection and Preprocessing

Data collection involves gathering data relevant to the problem. This could be from various sources like databases, APIs, web scraping, surveys, etc. For instance, to predict future sales, historical sales data along with other relevant information like marketing spend, seasonal trends, etc., would be collected.

Once the data is collected, it needs to be prepared for analysis. This includes handling missing values, normalizing numerical features, and encoding categorical variables. Tools such as pandas in Python or dplyr in R are commonly used for data-wrangling tasks. Here's an example of code that removes missing values from a dataset:

import pandas as pd
# Load the dataset
data = pd.read_csv('housing.csv')
# Remove rows with missing values
data = data.dropna()

This involves cleaning, formatting, and transforming your data so that it can be used by a machine-learning model.

There are some things you need to do during data preparation or collection, such as the following (a short code sketch follows the list):

  • Identifying and removing missing values

  • Dealing with outliers

  • Encoding categorical variables

  • Scaling your data
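
For instance, here is a minimal sketch of a few of these tasks with pandas and scikit-learn, assuming the data DataFrame loaded above, with a numerical price column and a hypothetical categorical city column:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# 'data' is the DataFrame loaded above; 'price' is numerical, 'city' is a hypothetical categorical column

# Remove outliers in 'price' using the interquartile range (IQR) rule
q1, q3 = data['price'].quantile([0.25, 0.75])
iqr = q3 - q1
data = data[(data['price'] >= q1 - 1.5 * iqr) & (data['price'] <= q3 + 1.5 * iqr)]

# One-hot encode the categorical 'city' column
data = pd.get_dummies(data, columns=['city'])

# Scale the numerical features (but not the target) to zero mean and unit variance
numeric_cols = data.select_dtypes(include='number').columns.drop('price')
data[numeric_cols] = StandardScaler().fit_transform(data[numeric_cols])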

Note: Exploratory Data Analysis (EDA) is a powerful technique that helps us understand the characteristics of our data. By visualizing and summarizing the data, we gain insights into its distribution, correlations, and potential outliers. This exploration aids feature selection and provides a foundation for building robust models.

For our housing price prediction example, we might plot the relationship between the number of bedrooms and the corresponding prices using a scatter plot. This could reveal any patterns or trends in the data that could be leveraged during model training.

import matplotlib.pyplot as plt
# Assuming df is your DataFrame and it has columns 'bedrooms' and 'prices'
plt.scatter(df['bedrooms'], df['prices'])
plt.xlabel('Number of Bedrooms')
plt.ylabel('Prices')
plt.title('Relationship between Number of Bedrooms and Prices')
plt.show()
# Don't forget to run first pip install matplotlib and pip install pandas 
# in your environment to ensure you have the necessary libraries installed

Techniques like correlation analysis and recursive feature elimination can aid in this process by identifying the attributes that matter most for your future model, given the specifications of your problem. Feature selection makes the future model simpler and more effective, and ML engineers call the broader practice feature engineering (a more complex concept than explained here; I will come back to it in a dedicated article soon). Just keep in mind that this is a critical step of the machine learning pipeline when it comes to improving model performance.
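
As an illustrative sketch (reusing the hypothetical df and 'prices' column from above; the choice of five features is arbitrary), correlation analysis and recursive feature elimination could look like this:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Correlation of each numerical feature with the target 'prices'
correlations = df.select_dtypes(include='number').corr()['prices'].sort_values(ascending=False)
print(correlations)

# Recursive feature elimination: keep the 5 most informative features
X_all = df.drop('prices', axis=1)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X_all, df['prices'])
print(X_all.columns[rfe.support_])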

In our example of house price prediction, we could create a new feature by combining the number of bedrooms and bathrooms to represent the total number of rooms.

# Assuming df is your DataFrame and it has columns 'bedrooms' and 'bathrooms'
df['total_rooms'] = df['bedrooms'] + df['bathrooms']

print(df.head())

Tools used in this step:

  • Data sources: Databases, APIs, files, and online datasets (Kaggle, Google Dataset search, ...)

  • Data cleaning and preprocessing tools: Python libraries like Pandas and NumPy, data cleaning tools like DataCleaner and Trifacta.

  • Feature engineering tools: dedicated libraries and SDKs such as Featuretools

Once your data is prepared, you can move on to the next step in the machine learning lifecycle.

Step 3: Model Training and Evaluation

Following the preparation of the data and the engineering of the features, we proceed to build our machine learning models. This stage entails choosing the best algorithm or model architecture for the given problem. There are many machine learning algorithms available, each with its strengths and weaknesses. The best algorithm for your project will depend on the problem you're trying to solve (a well-done first step) and the type of data you have (a well-done second step). ML models include, but are not limited to, neural networks, decision trees, and linear regression.

After selecting an algorithm, you have to train and validate your model on your data so it can discover trends and connections between the set of variables and the desired outcome. This involves feeding your data into the algorithm and letting it learn from it. To implement and train our models, we can use libraries like TensorFlow, Scikit-Learn, Keras, and PyTorch. (For more detail about this step with TensorFlow, look here.)

The training process can take some time, depending on the size of your data and the complexity of your algorithm. Here is a sample of code using Scikit-Learn illustrating how to train a linear regression model for our house price prediction problem:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data into features and target variable
X = data.drop('price', axis=1)
y = data['price']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

The model's performance is measured during the evaluation process. This is done using various metrics: accuracy, precision, recall, and F1 score for classification problems; Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) for regression problems such as our house price prediction problem.

import numpy as np
from sklearn.metrics import mean_squared_error
# Predict on the validation set and compute RMSE
predictions = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, predictions))
print(f'Root Mean Squared Error: {rmse}')

When a model is evaluated, its performance is tested on a separate dataset (one that wasn't used to train the model) and its parameters are adjusted to improve performance. Cross-validation techniques, such as k-fold cross-validation, help estimate the model's performance on different subsets of the data. This helps identify any issues with overfitting or underfitting (more about that in our last article: here).
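
For instance, here is a minimal sketch of k-fold cross-validation with scikit-learn, reusing the model, X, and y defined above (the choice of 5 folds is just illustrative):

import numpy as np
from sklearn.model_selection import cross_val_score

# Evaluate the linear regression model with 5-fold cross-validation
# (scikit-learn reports negative MSE, so we negate it before taking the square root)
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_per_fold = np.sqrt(-scores)
print('RMSE per fold:', rmse_per_fold)
print('Mean RMSE:', rmse_per_fold.mean())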

In machine learning, we love to say: "To improve something, you often need to be able to measure it or evaluate it." Evaluating the performance of our trained model, and tuning its hyperparameters, is crucial to ensure its reliability and generalizability. Metrics such as accuracy, precision, recall, and mean squared error provide insights into how well our model performs on unseen data. Once this is done, you can already test your predictions.

The best way to evaluate your model will depend on the specific problem you're trying to solve (first step remember).

Tools used in this step:

  • Model evaluation tools: Testing frameworks like Pytest, Scikit-learn's model selection tools, Grid search, Random search, Bayesian optimization and TensorBoard.
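
As a hypothetical illustration of hyperparameter tuning (Ridge regression and the alpha grid are assumptions, not part of the original workflow, since plain linear regression has nothing to tune), a grid search with scikit-learn's GridSearchCV could look like this:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search over a few values of the regularization strength alpha
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)
print('Best alpha:', grid.best_params_)
print('Best cross-validated RMSE:', -grid.best_score_)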

Up to this step, the result looks like this:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import TensorBoard
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Assuming df is your DataFrame and 'prices' is your target variable
features = df.drop('prices', axis=1)
target = df['prices']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Define the model architecture
model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Set up TensorBoard
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=1)

# Train the model
model.fit(X_train, y_train, epochs=10, callbacks=[tensorboard])

# Make predictions on the test set
predictions = model.predict(X_test)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f'Root Mean Squared Error: {rmse}')

At this point, a notebook solves the problem, with evaluation and validation done, but only on your own computer, so the real customer cannot yet benefit from your solution. This is where the software engineering part of my job begins.
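
To get the model out of the notebook, a first step is to save it to disk so a separate service can load it. A minimal sketch, assuming the Keras model trained above and a reasonably recent TensorFlow version (the file name is just an illustration):

# Save the trained Keras model to a single file (hypothetical file name)
model.save('house_price_model.keras')

# In the serving environment, it can be reloaded with:
# from tensorflow.keras.models import load_model
# model = load_model('house_price_model.keras')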

Step 4: Model Deployment, Monitoring and Maintenance

This step involves deploying the trained model into a production environment where it can be used to make predictions. Once the model has proven to be effective, it's time to put it into a live environment. For consumers to use your model to generate predictions, you must make it accessible to them via:

  • Web service

  • Mobile app

  • Embedded system

We will talk more about it in a dedicated series of articles.

The best way to deploy your model will depend on your specific needs (Remind yourself of the first step).
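
As a purely illustrative sketch of the web-service option (Flask, the route, and the payload format are assumptions, not a prescribed setup), serving the model saved in the previous step could look like this:

from flask import Flask, request, jsonify
import numpy as np
from tensorflow.keras.models import load_model

app = Flask(__name__)
# Hypothetical path: the model saved at the end of the previous step
model = load_model('house_price_model.keras')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"features": [3, 2, 5]} matching the training columns
    features = np.array(request.json['features'], dtype=float).reshape(1, -1)
    prediction = model.predict(features)
    return jsonify({'predicted_price': float(prediction[0][0])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

A client could then send a POST request with the feature values to /predict and receive the predicted price back as JSON; containerizing this small app with Docker is a common next step.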

Note: To guarantee that the model's performance remains optimal over time, proper monitoring and maintenance are crucial. Monitoring the model's performance after deployment is essential for making any necessary adjustments or improvements later on. This ensures that the model keeps working well as more data becomes available.

Tools used in this step:

  • Cloud Platforms: Amazon SageMaker, Microsoft Azure Machine Learning, Google Cloud AI Platform, IBM Watson ...

  • Containerization tools: Docker, Kubernetes

  • Model Deployment and Serving Tools: TensorFlow Serving, PyTorch Lightning, ONNX Runtime, Cortex, TensorFlow Lite, TensorFlow.js, MediaPipe, ...

  • Model maintenance tools: MLflow, ModelDB, Kubeflow, ...

  • Model monitoring tools: Prometheus, Grafana, TensorBoard, Neptune.ai, ...

  • AutoML Platforms: H2O.ai, DataRobot, Google Cloud AutoML, Amazon SageMaker Autopilot

  • Big Data Processing: PySpark, Apache Hadoop, Apache Spark, Apache Kafka ...

Conclusion

As this article shows, machine learning has increased the need for experts with a variety of skill sets. Data Scientist, Data Engineer, Data Analyst, and Machine Learning Engineer have recently become the most frequently encountered job titles in the industry.

Just to be clear: Data Analysts are the designers of the field. Data Engineers are similar to back-end developers (production, maintenance, and monitoring are their focus, with strong cloud and coding skills). Data Scientists have a more complete profile, with strong math and problem-solving skills (going from the real problem to the solution in notebooks, as in step 3, like a front-end developer). Machine Learning Engineers are the tip of the iceberg, a full-stack role. Although Machine Learning Engineers are well paid, Data Scientist is the position with the highest demand for experienced candidates. Over the last three years I switched from a Machine Learning Engineer position to a Data Scientist one, but in the near future I want to step up to a decision-making and management position such as Chief Data Scientist or Chief Data Officer (CDS or CDO); another position I am targeting is Engineering Manager. Those positions give you a broader overview of the heavy work of AI and data professionals, because, together with the rest of the managing board, they are in most cases the end users of our work for sound and efficient decision-making.

Anyone looking to work in this field must be familiar with the machine learning lifecycle: it offers a methodical way to create powerful ML models. This guide explains the steps of the lifecycle in general terms, but for those who want to learn more about this fascinating subject, there are many resources available. I recommend this one to you:

Resources

Remember, building machine learning models is like building a tower of blocks. It takes time and practice, but it can be lots of fun too! 😊

If you like this content, please like it (ten times if you can!), share it as widely as possible, and leave a comment or some feedback.

@#PeaceAndLove

@Copyright_by_Kaz’Art

@ArthurStarks