Machine learning is a powerful tool that has revolutionized industries by enabling computers to learn and make decisions from data. But how does it all come together? In this article, we’ll take a comprehensive journey through the machine learning process.
This article dives deeper than the foundational concepts covered in the previous two. Here, we touch on everything from data collection to real-world deployment. You might encounter some new terms and techniques, but don’t worry: this is just the surface. Each of these concepts will be explored in greater depth when we go through the roadmap and start to learn ML.
Whether you’re an absolute beginner or a pro in this field, this article aims to give you an understanding of how machine learning works in practice, setting the stage for hands-on learning and application. Let’s dive in and uncover the magic behind the machine learning process!
We can divide this machine learning process into 6 main parts. They are:
- Data collection
- Data preparation
- Train a model
- Analysis and Evaluation
- Deploying a model
- Retraining and maintaining a model
1. Data Collection
We know machines learn from the data we give them. So it is obvious that the first step in the machine learning process is to collect data so that the machine learning model can find patterns and relationships. Make sure you use data from a reliable source, because it will directly affect the output of the machine learning model. If you have incorrect or outdated data, the chances of getting wrong predictions are high. The quality of the data we provide determines how efficient and accurate our machine learning model is.
Questions to ask:
- What kind of machine learning problems are we going to solve?
- What data sources exist?
- What privacy concerns are there?
- Is the data public or private?
- Where should we store data?
Types of data
Structured data
Data that appears in rows and columns, like what we find in an Excel spreadsheet. There are different types of structured data (a small DataFrame example follows this list).
- Categorical data – One thing or another; order does not matter. (For car sales, color is a category. A car may be blue but not white.)
- Numerical data – Any continuous values where the difference between them matters. (When selling cars, $100,000 is more than $35,000.)
- Ordinal data – Data which has order but the distance between values is unknown (rating the car seller from 1 to 5)
- Time series – Data across time (sale values of cars between 2000 – 2024)
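To make these types concrete, here is a minimal sketch of how they might look as columns of a pandas DataFrame (the car-sales columns and values are made up for illustration):

```python
# Hypothetical car-sales table illustrating the structured data types above.
import pandas as pd

cars = pd.DataFrame({
    "color": pd.Categorical(["blue", "red", "blue"]),           # categorical: order doesn't matter
    "price": [100_000, 35_000, 42_000],                         # numerical: differences matter
    "seller_rating": pd.Categorical([5, 3, 4], ordered=True),   # ordinal: order matters, distance unknown
    "sold_on": pd.to_datetime(["2022-01-10", "2023-06-02", "2024-03-15"]),  # time series: data across time
})

print(cars.dtypes)
```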
Unstructured data
Data with no solid structure at all.
Examples: Images, videos, speech

2. Data Preparation
After gathering the data, it is best practice to check it properly and convert it into the desired format so that the machine learning model can easily find patterns in it. There are several parts to data preparation.
Exploratory Data Analysis (EDA)
In EDA, we learn about the data we are working with. We can do this by asking questions like the following (a minimal code example follows the list):
- What are the feature variables and the target variable (input and output)? – For preparing a chicken soup, the feature variables are chicken, vegetables, spices, etc., and the target will be a hot, yummy chicken soup.
- What kind of data do you have? – Structured, unstructured, categorical, numerical (create a data dictionary).
- Are there missing values, should we remove them or fill them with feature imputation? (Explained below)
- Where are the outliers? How many? Why are they there? Are they out by much?
- Are there questions we could ask a domain expert about the data? – Would an experienced chef help us with our dataset for preparing a chicken soup?
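As a taste of what EDA looks like in code, here is a minimal pandas sketch; the file name and the price column are assumptions made for illustration:

```python
# Minimal EDA pass over a hypothetical car-sales dataset.
import pandas as pd

cars = pd.read_csv("car_sales.csv")          # hypothetical file name

print(cars.head())                           # peek at the first few rows
cars.info()                                  # column types and non-null counts (a quick data dictionary)
print(cars.describe())                       # summary statistics for the numerical features
print(cars.isna().sum())                     # how many missing values per column
print(cars["price"].quantile([0.01, 0.99]))  # a rough check for extreme values (outliers)
```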
Data preprocessing
This is a crucial step in the machine learning process. There are many aspects of data preprocessing. We will discuss only the most frequently used aspects.
Feature imputation
If relevant values are missing, the machine learning model cannot learn from them. We can fill in the missing values with feature imputation. There are several ways to do feature imputation:
- Single imputation: Fill with the mean or median of the column.
- Multiple imputation: Fill the missing values with estimates produced by a model.
- KNN (K-Nearest Neighbors): Fill with a value from another example that is similar.
These aren’t the only options; there are many other ways, such as last observation carried forward, moving window and random imputation. As we only cover the essentials to get an idea of how the machine learning process works, we will discuss these concepts in much more depth when we start learning machine learning. A small scikit-learn example follows.
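Here is a minimal sketch of single imputation and KNN imputation using scikit-learn’s imputers on a tiny made-up array:

```python
# Filling missing values (np.nan) with single imputation and KNN imputation.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Single imputation: replace each missing value with that column's mean (median also works).
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: replace a missing value using the most similar rows (nearest neighbors).
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)
print(knn_filled)
```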
Feature encoding
We all know machines only understand numbers, so all the data we collected needs to be converted into numbers. For that we use feature encoding. There are many ways to do encoding (a short example follows the list):
- One-Hot Encoding – Turn all the unique values into lists of 0s and 1s, where the target value is 1 and the rest are 0. (For traffic lights, red could be represented as [0, 1, 0] and green as [1, 0, 0].)
- Label Encoding – Turn labels into distinct numerical values; for example, if the target variables are dogs, cats and rats, they become 0, 1 and 2 respectively.
- Embedding encoding – Learn a representation amongst all the different data points. Widely used with structured data. One example is a language model, which learns a representation of how different words relate to each other.
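A minimal sketch of one-hot and label encoding on the traffic-light example, using pandas and scikit-learn:

```python
# One-hot and label encoding of a tiny traffic-light column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

lights = pd.Series(["green", "red", "yellow"])

# One-hot encoding: each unique value becomes its own 0/1 column.
one_hot = pd.get_dummies(lights)
print(one_hot)

# Label encoding: each unique label becomes a distinct integer (0, 1, 2, ...).
labels = LabelEncoder().fit_transform(lights)
print(labels)
```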
Feature normalization (scaling) or standardization
Some machine learning models don’t perform well when the collected numerical variables are on very different scales. For example, some values may be in the range of 1 to 10 while others range from 5,000 to 10,000. In these situations, we can use scaling or standardization.
Feature normalization
This is also called scaling. Scaling shifts your values so they always fall between 0 and 1. The simple equation for this is to subtract the minimum value and divide by (maximum – minimum). For example, values from 1 to 10 will be mapped to values between 0.0 and 1.0.
Feature standardization
This standardizes all the values so they have a mean of 0 and unit variance. The simple equation for this is to subtract the mean and divide by the standard deviation of the relevant feature. The resulting values do not necessarily end up between 0 and 1, and standardization is more robust to outliers than feature normalization. Both are sketched in the example below.
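A minimal sketch of both techniques with scikit-learn, on a made-up column whose values sit on very different scales:

```python
# Normalization (min-max scaling) vs. standardization on one feature.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [7000.0]])     # values on very different scales

normalized = MinMaxScaler().fit_transform(X)       # squashed into the 0-1 range
standardized = StandardScaler().fit_transform(X)   # mean 0, unit variance

print(normalized.ravel())
print(standardized.ravel())
```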
Feature engineering
In feature engineering we transform the data into more meaningful representations by adding our own knowledge about the domain. There are several techniques used in feature engineering, such as the following (a small pandas example follows the list):
- Decompose – Change a date like 2024-12-25 to is-holiday, day-of-the-week, etc.
- Discretization – Turning larger groups into smaller groups, such as turning numerical variables into buckets like under-10, 10-20 and over-100, and categorical variables like colors into combined colors (dark-green, light-green, lime-green).
- Crossing and interaction – Combining two or more features, such as car-last-sold and current-date to get time-on-market.
- Indicator features – Using other parts of the data to flag something potentially significant. For example, if you were analyzing car sales data: does the car have less than 1,000 km, is it automatic and is it under 10 years old? From experience you know these cars are worth much more, so you could make a special feature called under-1000-auto-under-10y.
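Here is a small pandas sketch of these ideas on a made-up car-sales table (all column names and thresholds are invented for illustration):

```python
# Decomposition, discretization and an indicator/crossed feature on hypothetical car data.
import pandas as pd

cars = pd.DataFrame({
    "last_sold": pd.to_datetime(["2024-12-25", "2024-07-01"]),
    "km": [800, 150_000],
    "transmission": ["auto", "manual"],
    "age_years": [4, 12],
})

# Decompose: break a date into more meaningful parts.
cars["day_of_week"] = cars["last_sold"].dt.day_name()
cars["is_holiday"] = cars["last_sold"].dt.strftime("%m-%d").eq("12-25")

# Discretization: turn a numerical variable into buckets.
cars["km_bucket"] = pd.cut(cars["km"],
                           bins=[0, 1_000, 100_000, 1_000_000],
                           labels=["under-1000", "1k-100k", "over-100k"])

# Indicator / crossed feature: combine several columns into one flag.
cars["under_1000_auto_under_10y"] = (
    (cars["km"] < 1_000) & (cars["transmission"] == "auto") & (cars["age_years"] < 10)
)

print(cars)
```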
Feature selection
Feature selection helps us pick the most valuable features of the dataset for the model, so we train on less overall and less redundant data, which can improve accuracy. There are a few ways to do feature selection, such as the following (sketched in code after the list):
- Dimensionality reduction
- Wrapper methods such as genetic algorithms and recursive feature elimination involve creating subsets of features and then removing the unnecessary ones
- Feature importance – Fit a model to the dataset, then inspect which features are most needed and which are not. This is usually done after modeling.
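A minimal sketch of a wrapper method (recursive feature elimination) and model-based feature importance with scikit-learn, on synthetic data:

```python
# Recursive feature elimination (wrapper method) and feature importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Wrapper method: RFE repeatedly fits a model and drops the least useful features.
selector = RFE(RandomForestClassifier(random_state=0), n_features_to_select=4).fit(X, y)
print(selector.support_)           # which features were kept

# Feature importance: fit a model, then inspect which features it relied on most.
forest = RandomForestClassifier(random_state=0).fit(X, y)
print(forest.feature_importances_)
```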
Dealing with imbalances
If your data has 100 examples of one class but only 5 examples of another, we will have to deal with the imbalance. The best practice is to collect more data if you can. Alternatively, we can use tools such as the scikit-learn-contrib imbalanced-learn package and SMOTE, which we will discuss in depth when we jump into the roadmap and learn ML.
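As a preview, here is a minimal sketch of oversampling the minority class with SMOTE from the imbalanced-learn package (it assumes the package is installed and uses synthetic data):

```python
# Oversampling the minority class with SMOTE.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 95% of one class, 5% of the other.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                 # before: heavily imbalanced

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_resampled))       # after: the classes are balanced
```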
Data Splitting
This is the process of dividing the dataset into three main subsets to ensure the model’s performance is evaluated properly. The three main subsets are:
- Training set – Model learns on this (70-80% of data)
- Validation set – Model hyperparameters are tuned on this (10-15% of data)
- Test set – The model’s final performance is evaluated on this. We do not use this set to tune the model.
Example: If you are studying for an exam, the training set will be your learning materials, the validation set will be your practice assignments, and the test set is your final exam. A quick code example of this split follows.
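A quick sketch of a roughly 70/15/15 split using scikit-learn’s train_test_split called twice:

```python
# Splitting data into training, validation and test sets (roughly 70/15/15).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 30% for validation + test, then split that 30% in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 700, 150, 150
```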

3. Train a model
Training a model in machine learning involves using a dataset to teach an algorithm to recognize patterns, make predictions, or classify data.
Choose an Algorithm
Often the hardest part of solving a machine learning problem is finding the right algorithm. Different algorithms are better suited to different types of data and different problems. In machine learning there are two main families of algorithms: supervised and unsupervised. Now let’s discuss each of them in a little more depth.
Supervised algorithms
In supervised algorithms, the machine is trained on labeled data, which means each input-output pair is known. The machine then learns to map inputs to the correct outputs (understanding the relationship between data and labels). Let’s look at the most commonly used supervised algorithms (a short training example follows the list):
- Linear Regression – Draw a line that best fits the data scattered on a graph. Predicts continuous variables like height, weight, etc.
- Logistic Regression – Predicts a binary outcome based on a series of independent variables, such as predicting whether someone has heart disease based on their health measurements.
- K-Nearest Neighbors – Find the ‘k’ examples that are most similar to a new sample, then predict based on which of those examples the new sample most closely aligns with.
- Support Vector Machines (SVMs) – Can be used for classification and regression. Try to find the best way to separate data points using separating planes (hyperplanes).
- Decision Trees and Random Forests – Can be used for classification and regression; mostly suited to structured data. Decision trees split data based on criteria, eventually getting to a point where the data cannot be split anymore. Random forests are a combination of many decision trees, effectively leveraging and combining the choices of many models.
- AdaBoost / Gradient Boosting Machines – Can be used for classification and regression. Asks the question: can a series of weak learners be combined into a strong learner? (XGBoost, CatBoost, LightGBM)
- Neural Networks (Deep Learning) – Can be used for classification and regression. Takes a series of inputs and manipulates them with linear and non-linear functions (straight and non-straight lines) multiple times. Neural networks can use this combination to estimate almost anything. There are three main types of neural networks:
- Convolutional neural networks (CNN) – used for computer vision
- Recurrent neural networks (RNN) – used for sequence modeling
- Transformer networks – used for vision and text, starting to replace RNNs
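To show what “training a supervised model” looks like in code, here is a minimal scikit-learn sketch using one of the algorithms above (a random forest) on a built-in labeled dataset:

```python
# Training a supervised model: learn a mapping from labeled inputs to outputs.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)                 # labeled data: inputs and known outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)                                # learn the input-to-label mapping
print(model.score(X_test, y_test))                         # accuracy on unseen data
```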
Unsupervised algorithms
In unsupervised algorithms, the machine is trained on unlabeled data, which means there is no fixed or known output. The machine learns from the data, discovers relationships and patterns, and produces an output. We will now discuss the most commonly used unsupervised algorithms.
Clustering
K-Means clustering: choose k, the number of clusters. Each cluster receives a center node (called a centroid) at random, and with each iteration the centroids move toward the middle of the samples closest to them. Once the centroids stop moving, each sample is assigned to the cluster of its nearest centroid.
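A minimal K-Means sketch with scikit-learn on synthetic data that has three obvious groups:

```python
# K-Means clustering on unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data, 3 natural groups

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.labels_[:10])       # cluster assigned to each of the first 10 samples
```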
Visualization and dimensionality reduction
- Principal Component Analysis (PCA): reduce data from higher dimensions to lower dimensions whilst attempting to preserve the variance.
- Autoencoders: Learn a lower dimensional encoding of data. For example, compress an image of 100 pixels into 50 pixels representing (roughly) the same information as the 100 pixels
- t-Distributed Stochastic Neighbor Embedding (t-SNE): good for visualizing high-dimensional data in a 2D or 3D space.
Anomaly detection
- Autoencoder: Use an autoencoder to reduce the dimensionality of the inputs of a system and then try to recreate those inputs within some threshold. If the recreations aren’t able to match the threshold, there could be some sort of outlier.
- One-class classification: train a model on only one class, for example the normal events of computer network traffic, which are usually in abundance; if anything lies outside of this class, it may be an anomaly. Algorithms for doing so include one-class K-Means, one-class SVM (support vector machine), isolation forest and local outlier factor (a small isolation forest example follows).
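A minimal sketch of one of these ideas: an isolation forest trained on “normal” data and asked to judge a couple of obvious outliers (all numbers are made up):

```python
# Anomaly detection with an isolation forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_traffic = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # the abundant "normal" class
suspicious = np.array([[8.0, 8.0], [-9.0, 7.5]])                 # obvious outliers

detector = IsolationForest(random_state=0).fit(normal_traffic)
print(detector.predict(suspicious))            # -1 marks a predicted anomaly, 1 marks normal
print(detector.predict(normal_traffic[:5]))    # mostly 1s
```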
Types of Learning
Machine Learning involves different types of learning based on how the model is trained and the nature of the data. The most frequently used types are:
- Batch Learning – All of your data exists in a big static warehouse and you train a model on it. Learning may take a while and isn’t done often. Runs in production without learning.
- Online Learning – Your data is constantly being updated and you constantly train new models on it. Each learning step is usually fast and cheap. Runs in production and learns continuously.
- Transfer learning – Take the knowledge one model has learned and use it with your own.
- Active learning – Also referred to as “human in the loop” learning. A human expert interacts with a model and provides updates to labels for samples which the model is most uncertain about.
- Ensembling – Not really a form of learning, more combining algorithms which have already learned in some way to get better results.
Underfitting and Overfitting
Underfitting – This happens when your model does not perform as well as you would like on your data. The solution is to try training for longer or using a more advanced model.
Overfitting – This happens when your model’s validation loss starts to increase, or, if you don’t have a validation set, when there is a large gap between training and test performance (for example, 99% accuracy on the training set but 55% accuracy on the test set). This is called overfitting, and it is an issue we need to fix.
Regularization
A collection of techniques to prevent or reduce overfitting. Some of them are listed below, with a small L1/L2 example after the list:
- L1 (lasso) and L2 (ridge) regularization – L1 regularization sets unneeded feature coefficients to 0 while L2 won’t set them to 0.
- Dropout – randomly remove parts of your model so the rest of it has to become better.
- Early stopping – stop your model from training before the validation loss starts to increase too much or more generally, any other metric has stopped improving.
- Data augmentation – manipulate your dataset in artificial ways to make it ‘harder’ to learn. For example, if you’re dealing with images, randomly rotate, skew, flip and adjust the height of your images. This forces your model to learn similar patterns across different styles of the same image (harder).
- Batch normalization: standardize inputs as well as adding two learnable parameters before they go into the next layer. This often results in faster training speeds since the optimizer has fewer parameters to update. May be a replacement for dropout in some networks.
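As a small taste, here is a sketch of L1 (lasso) versus L2 (ridge) regularization with scikit-learn on synthetic data; notice how lasso tends to push unneeded coefficients to exactly 0:

```python
# L1 (lasso) vs. L2 (ridge) regularization on a toy regression problem.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1: unneeded coefficients tend to become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: coefficients shrink but usually stay non-zero

print(lasso.coef_.round(2))
print(ridge.coef_.round(2))
```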
Hyperparameter Tuning
This means running a bunch of experiments with different model settings and seeing which works best. Some of the hyperparameters you can tune are listed below, followed by a grid-search example:
- Setting a learning rate – Generally a high learning rate means the algorithm adapts rapidly to new data, while a low learning rate means it adapts more slowly, as in transfer learning.
- Number of layers (deep learning networks)
- Batch size – How many examples of data your model sees at once. Use the largest batch size you can fit in your GPU memory.
- Number of trees (decision tree algorithms)
- Number of iterations – How many times your model goes through the data.
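Here is a minimal hyperparameter tuning sketch using a grid search over two random forest settings (the specific values are arbitrary examples):

```python
# Hyperparameter tuning with a simple grid search and cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],   # number of trees
    "max_depth": [3, 5, None],        # how deep each tree may grow
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)   # the settings that worked best
```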
4. Analysis and Evaluation
Analyzing and evaluating a machine learning (ML) model involves assessing its performance and ensuring it generalizes well to unseen data. This process is crucial for measuring the model’s accuracy, reliability, and ability to solve the target problem. There are several ways to do the evaluation and analysis; some of the most commonly used methods are listed below, followed by a short example in code:
- Evaluation Metrics (We have already talked about this in the article of “What is machine learning?”)
- Feature importance – Which features contributed most to the model? Should some features be removed? Useful for model explainability.
- Training / inference time / cost – How long does a model take to train? How long does inference take? Is it suitable for production? Is this feasible?
- What-if tool – How does my model compare to other models? What if I changed something in the data? How does this affect the outcome?
- Least confident examples – What does the model get wrong?
- Bias/Variance tradeoff – High bias results in underfitting while high variance results in overfitting.
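A minimal evaluation sketch with scikit-learn, computing a confusion matrix, the usual classification metrics and feature importances for a trained model:

```python
# Evaluating a trained classifier on the held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

print(confusion_matrix(y_test, predictions))            # where the model is right vs. wrong
print(classification_report(y_test, predictions))       # accuracy, precision, recall, F1
print(model.feature_importances_.argsort()[::-1][:5])   # indices of the five most important features
```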

5. Deploying The Model
In this step we put the model into production in the real world and see how it goes. Evaluation and analysis are great, but until the model is in real-world production, you won’t know how it truly performs.
Machine Learning Operations (MLOps) comes into play when deploying the model. This is where software engineering meets machine learning: essentially all the technology required around a machine learning model to have it working in production.
You can use tools like TensorFlow, PyTorch, Google AI Platform and SageMaker to deploy a machine learning model, which we will discuss broadly in the next article. The simplest idea behind deployment is sketched below.
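Those platforms each have their own workflow, but the simplest possible deployment idea is to persist a trained model and load it inside a serving script. A minimal sketch with joblib (the file name is arbitrary):

```python
# Persist a trained model, then load it where predictions are served.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

joblib.dump(model, "model.joblib")            # save the trained model to disk

# In the production service, load the saved model and make predictions.
loaded_model = joblib.load("model.joblib")
print(loaded_model.predict(X[:1]))
```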
6. Retraining and Maintaining the Model
In this last step we see how the model performs after deployment (or prior to deployment) based on various evaluation metrics, and revisit the above steps as required. You will also find that your models start to become outdated, or drift, as data sources change or are upgraded. This is when you will want to retrain the model.
Conclusion
The machine learning process is the backbone of turning raw data into actionable insights and intelligent systems. From data collection to deploying a model, each step plays a crucial role in building robust, reliable, and impactful machine learning solutions. In this article, we’ve mapped out the essential steps and provided a bird’s-eye view of how machine learning works, bridging the gap between theory and real-world applications.
While this article lays the groundwork, the next steps involve diving into the tools and libraries that make machine learning more accessible and manageable for everyone. From TensorFlow and PyTorch to Scikit-Learn and beyond, our next article will guide you through the tools that every aspiring machine learning practitioner should know.
Stay curious, keep learning, and remember: mastering machine learning is a journey, not a destination. See you in the next article, where we’ll explore the incredible ecosystem of tools that empower you to bring machine learning ideas to life!