Introduction
Knowing why customers leave is critical for any business. One important indicator with a direct impact on revenue is customer churn, the loss of customers or subscribers. In this project, I use exploratory data analysis (EDA) to investigate customer churn: to find hidden patterns, clean up the data, and get it ready for machine learning modeling.
From configuring a Python environment with Anaconda to creating a final clean dataset, this article takes you step-by-step through the entire EDA process for a churn dataset. Every significant step will be covered, including how to deal with missing values, deal with outliers, bin features, encode categorical variables, and scale numerical ones.
If you are new to exploratory data analysis (EDA) or want to explore it further, I highly recommend reading this article, where I cover all the concepts I needed for this project and more.
By the end of this article, you’ll not only see how raw data is transformed into a model-ready form, but also understand the “why” behind each transformation.
This project will provide you with useful, hands-on insights into efficiently preparing your data, regardless of your level of experience with EDA or churn prediction models.
Environment Setup with Anaconda
The first step in data analysis is to set up a clean, isolated Python environment. Doing so ensures that all the necessary libraries are installed and prevents conflicts with other projects.
Step 1: Install Anaconda (if not already installed)
Anaconda is a Python distribution that comes preloaded with most data science tools like Jupyter, Pandas, NumPy, Matplotlib, Seaborn and more.
- Download Anaconda: https://www.anaconda.com/products/distribution
- Install it by following the prompts for your OS.
Once installed, open the Anaconda Prompt or terminal and proceed with the steps below.
Step 2: Create a New Conda Environment
We will create a new environment named churn-eda with Python 3.9. You can use any name you like in place of churn-eda, and for the Python version, make sure you specify a version available on your machine (3.9 or later is recommended).

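A minimal sketch of the command, assuming the environment name churn-eda and Python 3.9:

```
conda create -n churn-eda python=3.9
```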
The terminal will then show a few confirmation steps; accept everything and proceed. After that, we have to activate the environment we created as follows:

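For example, with the environment name used above:

```
conda activate churn-eda
```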
Step 3: Clone or Open Your Project Folder
You might still be in the default system location, so navigate to your project directory using:

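Something along these lines, where the path is a placeholder for wherever your project lives:

```
cd path/to/your/churn-eda-project
```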
Step 4: Open your project in VS Code
Next, we will open the project folder in VS Code by typing "code ." in the terminal or Anaconda Prompt. This will automatically open your project in Visual Studio Code. After that, we are good to go with the project.
Arranging Project Folder Structure
Before diving into the coding part, it is helpful to know what the project folder and file structure will look like, so you can proceed in a more organized way. You can follow any folder structure you like; below is the structure I used for my customer churn EDA project.

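A rough sketch of the layout, inferred from the folders and file paths referenced later in this article; the notebook names other than 0_handle_missing_values are my own placeholders:

```
project-root/
├── EDA/
│   └── data/
│       ├── raw/          # original dataset (e.g., CEHHbInToW.csv)
│       └── processed/    # cleaned outputs from each EDA step
├── 0_handle_missing_values.ipynb
├── ...                   # one notebook per later step (outliers, binning, encoding, scaling)
├── requirements.txt
└── .env                  # holds the GROQ_API_KEY
```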
I created two folders for the datasets. The raw folder contains only the raw, unprocessed dataset I used for exploratory data analysis, and the processed folder contains the optimized, fully processed datasets from step 0 to the final step. You will get a better understanding of this as we move on.
The raw customer dataset I used for this project is uploaded to my GitHub repo; you can find it here. Drag and drop (or simply copy) that dataset into the raw folder. After that, you can start analyzing the dataset to prepare it for modeling.
Requirements and .env file
We will need a bunch of Python libraries and APIs to do the EDA, and it is a burden to install them one by one. Therefore, I listed all the prerequisites in a text file (requirements.txt). Later, I can use this file to install all the libraries and APIs at once.

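A minimal sketch of what requirements.txt might contain, based on the libraries used later in this article (your exact list and versions may differ):

```
pandas
numpy
matplotlib
seaborn
scikit-learn
python-dotenv
groq
pydantic
```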
In this project, I integrated a GROQ-powered large language model (LLM) to assist with automated text generation, explanations, or even suggestions based on EDA insights. To interact with the model securely, I used an API key provided by Groq Cloud. You can also create your GROQ API key for free here.
Instead of hardcoding the API key in the notebook, I stored it safely in a .env file. By connecting to Groq with the key from .env, I could programmatically prompt the model to assist in my analysis or article generation.
The .env file is a plain text file that holds sensitive environment variables, such as: GROQ_API_KEY=your-secret-api-key
This file should be kept private and added to .gitignore so it is never committed to GitHub or shared accidentally.
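A minimal sketch of loading that key inside a notebook, assuming the python-dotenv package and the GROQ_API_KEY variable name shown above:

```python
import os

import groq
from dotenv import load_dotenv

load_dotenv()  # reads GROQ_API_KEY from the .env file into the environment
client = groq.Groq(api_key=os.environ["GROQ_API_KEY"])
```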
Handling Missing Values
Missing values are one of the most common problems in real-world datasets. If not handled properly, they can lead to misleading results or errors during modeling. In this section, I’ll:
- Identify missing values in the dataset
- Understand which features are affected
- Decide on appropriate deletion and imputation strategies
I created a Jupyter notebook (.ipynb) named 0_handle_missing_values to handle the missing values. After creating this notebook, make sure you select the kernel in the top-right corner as the conda environment you created (churn-eda).
In these notebooks, markdown cells are used for text and documentation, basically to explain what the code is all about. Code is written in code cells, and you can run each cell separately to check whether it works; there is no need to run the whole program at once.
Standard setup process
Before diving into any analysis, every data science notebook typically starts with a standard setup process. This foundational step ensures your environment is ready and your data is accessible. It usually involves two key tasks:
Install libraries and Import dependencies
I had already listed the libraries and APIs I needed in the requirements.txt file. Now it is easy to install them all at once by typing the following command in a separate code cell and running it. This ensures all required packages for data manipulation, visualization, and modeling are available.
!pip install -r requirements.txt
After installation, I imported all the necessary libraries at the top of the notebook for clean organization and easy access:

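A sketch of the imports this notebook relies on, based on the libraries used throughout the article:

```python
import os

import groq
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
```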
You need import os anytime you want to access system-level settings or environment variables in Python. By calling os.getcwd() you can get the current working directory.
Loading the Dataset
The next step is to load the dataset you want to analyze.

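A minimal sketch of the loading step, using the relative path described below:

```python
# Load the raw churn dataset into a DataFrame and preview it
df = pd.read_csv('EDA/data/raw/CEHHbInToW.csv')
df.head()
```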
- pd.read_csv() is a Pandas function used to read CSV (Comma-Separated Values) files. It loads the contents of the file into a DataFrame, a 2D tabular structure (like a spreadsheet) that makes data analysis easy in Python.
- 'EDA/data/raw/CEHHbInToW.csv' is the relative path to the dataset file.
- Once loaded, you typically preview the first few rows with df.head(). This helps confirm the dataset is correctly loaded and gives a quick look at the structure and contents.
Checking for Missing Values
One of the first and most important steps in data exploration is identifying missing values in the dataset. In Python, this can be done with a simple but powerful line of code:

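A minimal sketch of that line:

```python
# Count the missing values in each column
df.isnull().sum()
```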
- df.isnull() creates a Boolean DataFrame where True indicates a missing value (NaN) and False indicates a valid (non-missing) value.
- .sum(), applied to that Boolean DataFrame, counts how many True (missing) values there are in each column.
Deleting missing data
Sometimes the simplest way to deal with missing values is to remove them. Pandas makes this easy using the .dropna() method.

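A sketch of this approach; the variable name df_dropped is my own placeholder:

```python
# Work on a copy so the raw data stays intact, then drop rows with any nulls
df_dropped = df.copy()
df_dropped = df_dropped.dropna(axis=0)
df_dropped.isnull().sum()
```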
- df.copy() creates a duplicate of the original DataFrame so we don't lose or alter the raw data. It's a best practice before performing destructive operations.
- dropna(axis=0): axis=0 checks each row (axis=1 checks each column) and drops any row that has at least one null value.
- This is useful when only a few rows contain missing values or when losing them won't affect overall data quality.
Imputation: Filling Missing Values
Instead of deleting rows with missing values, another common approach is imputation: filling in the missing data with a suitable value. We impute when deleting data would lead to information loss, especially when missing values are widespread; imputation preserves the dataset size while still letting the model work with complete data.
Possible Imputation Strategies:
- Numerical Values:
  - Use the mean or median
  - Use special placeholders like 0 or -999 (if you want the model to recognize missingness)
- Categorical Values:
  - Use the mode
  - Or assign a new label like 'Unknown' or 'Missing'
Age Imputation
I saw that Age had the largest number of missing values (600), so I used imputation to fill the missing ages with the mean value:

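A sketch consistent with the explanation below:

```python
# Compare the mean and median of the non-null Age values
mean = round(df['Age'].mean(), 2)
median = df['Age'].median()
print(f"Mean Age: {mean} | Median Age: {median}")
```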
- df['Age'].mean() calculates the average (mean) of all non-null values in the Age column.
- df['Age'].median() calculates the median (the middle value when sorted) of the Age column.
- round(mean, 2) rounds the mean to 2 decimal places for cleaner output.
- print(...) displays the mean and median in a formatted way so both values can be compared side by side.
This helped me decide which central tendency (mean or median) is more appropriate for imputation based on the data distribution.
Then I plotted a histogram using the Seaborn library to spot skewness. If the data is symmetrical, the mean works well; if it is skewed, the median is more robust.

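A minimal sketch of the plot:

```python
# Histogram of Age with 30 bins to inspect skewness
sns.histplot(df['Age'], bins=30)
plt.show()
```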
- sns.histplot(...) plots a histogram, a graphical representation of the distribution of numerical data.
- df['Age'] specifies the column to plot (in this case, Age).
- bins=30 divides the age range into 30 equal-width intervals (bins) to show more granularity in the distribution.
As you can see above, the distribution is fairly symmetrical, so I decided to go with the mean. Here is how I imputed the missing age values using the mean.

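A sketch consistent with the explanation below, reusing the mean computed earlier:

```python
# Copy the DataFrame, then fill missing Age values with the computed mean
df_impute = df.copy()
df_impute.loc[:, 'Age'] = df_impute['Age'].fillna(mean)
df_impute.head()
```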
- df_impute is a copy of the original DataFrame used to apply changes without modifying the raw dataset.
- df_impute['Age'].fillna(mean) fills all missing values (NaN) in the Age column with the previously calculated mean.
- .loc[:, 'Age'] ensures the assignment explicitly targets the entire 'Age' column (all rows, single column).
- df_impute.head() displays the first 5 rows of the updated DataFrame to verify the imputation.
You can now run df_impute.isnull().sum() and see that the Age column has 0 null rows, which means the imputation succeeded.
Gender Imputation
Step 1 : Predicting Gender from Names using LLM

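A sketch of this step, consistent with the point-by-point explanation below; the exact prompt wording and client setup are my own approximations:

```python
from enum import Enum

from pydantic import BaseModel


class Gender(str, Enum):
    MALE = "Male"
    FEMALE = "Female"


class GenderPrediction(BaseModel):
    firstname: str
    lastname: str
    pred_gender: Gender


client = groq.Groq(api_key=os.environ["GROQ_API_KEY"])


def predict_gender(firstname: str, lastname: str) -> str:
    # Ask the LLM for the most likely gender, restricted to "Male" or "Female"
    prompt = (
        f"Predict the most likely gender for the name {firstname} {lastname}. "
        "Answer with exactly one word: Male or Female."
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    pred = response.choices[0].message.content.strip()
    # Validate the result against the Pydantic model before returning it
    validated = GenderPrediction(firstname=firstname, lastname=lastname, pred_gender=pred)
    return validated.pred_gender.value
```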
- To handle missing values in the Gender column, we use a unique approach: predicting gender from first and last names using a large language model (LLM) via the Groq API.
- A Gender enum class is defined to represent valid gender values, "Male" and "Female", ensuring only acceptable outputs are allowed.
- A GenderPrediction Pydantic model is created to structure and validate input/output with three fields: firstname, lastname, and pred_gender.
- The predict_gender() function takes a person's firstname and lastname as inputs.
- A formatted natural language prompt is created that asks the model to predict the most likely gender based on the given names.
- The groq.Groq().chat.completions.create() function is called to send the prompt to the Groq API using the "llama-3.3-70b-versatile" model. You can use any model listed in the Groq API documentation.
- The response is received in chat format, and the predicted gender is extracted from response.choices[0].message.content, then cleaned using .strip().
- Finally, the function returns the predicted gender as a string: either "Male" or "Female".
Step 2: Loop Through Missing Genders and Predict Using LLM

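A sketch of the loop, consistent with the description below:

```python
# Indices of rows where Gender is still missing
missing_gender_idx = df_impute[df_impute['Gender'].isnull()].index

for idx in missing_gender_idx:
    firstname = df_impute.loc[idx, 'Firstname']
    lastname = df_impute.loc[idx, 'Lastname']

    pred = predict_gender(firstname, lastname).strip()

    if pred in ("Male", "Female"):
        df_impute.loc[idx, 'Gender'] = pred
        print(f"{firstname} {lastname} -> predicted gender: {pred}")
    else:
        print(f"No gender detected for {firstname} {lastname}")
```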
- The code begins by identifying all rows in the dataset (df_impute) where the Gender column is null, using a Boolean mask.
- It then loops through each index where the gender is missing, and for every such row, it extracts the corresponding Firstname and Lastname.
- These names are passed into a function predict_gender(), which is powered by a large language model (LLM) like LLaMA 3. The function constructs a natural language prompt asking the model to return either "Male" or "Female" based on the name.
- If a valid gender is returned, it is assigned back into the missing cell of the DataFrame for that person. A print statement also displays the full name along with the predicted gender.
- If the model fails to return a prediction, the gender field remains unchanged, and a message is printed stating that no gender was detected.
- This loop continues until all rows with missing genders have either been filled or logged as undetectable, effectively imputing missing gender values using intelligent predictions.
You can now run df_impute.isnull().sum() and see that the Gender column also has 0 null rows, which means the imputation succeeded.
Removing Irrelevant Columns for Data Cleaning

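A sketch consistent with the description below:

```python
# Drop identifier columns that add no predictive value
df_impute = df_impute.drop(columns=['RowNumber', 'CustomerId', 'Firstname', 'Lastname'])
df_impute.head()
```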
In this step, I cleaned up the dataset by removing several columns that are not necessary for our analysis or modeling. The code uses the .drop() method on the DataFrame df_impute to delete specific columns. These columns are:
- RowNumber: usually just an index or serial number for each row, which doesn't carry meaningful information for prediction.
- CustomerId: a unique identifier for each customer that does not contribute to any patterns or relationships in the data.
- Firstname and Lastname: personal identifiers that are irrelevant for our analysis and may raise privacy concerns.
You can now run df_impute.isnull().sum() and confirm that no columns have null values and that only the relevant columns remain in the dataset.
Saving the Cleaned Dataset
Once the data cleaning and preprocessing steps are complete, it is crucial to save the updated dataset. This ensures that the cleaned data can be easily accessed later without repeating the entire process. The following code snippet demonstrates how to save the processed DataFrame to a CSV file:
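A minimal sketch of that line, using the path described below:

```python
# Save the cleaned dataset for the next stage of the analysis
df_impute.to_csv('EDA/data/processed/MissingValuesHandled.csv', index=False)
```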
- The to_csv() function exports the DataFrame df_impute to a CSV file.
- The file path 'EDA/data/processed/MissingValuesHandled.csv' specifies where the file will be saved.
- Setting index=False prevents saving the DataFrame's row indices as a separate column in the CSV file.
- Saving the data in a structured folder (data/processed) helps maintain organization and clarity in the project files.
- The saved CSV file can now be used for further analysis, modeling, or sharing with others.
Handling Outliers
Outliers are data points that deviate significantly from the rest of the dataset. They can distort statistical analyses and machine learning models by skewing the results or introducing noise. Therefore, detecting and handling outliers is a critical step in data preprocessing to improve model accuracy and reliability. In this section, we will explore methods to identify and manage outliers effectively.
Standard setup process
As usual, this step involves installing the libraries and importing dependencies into the Jupyter notebook, and then loading the dataset. Instead of using the raw dataset, we now load the dataset where missing values have already been handled.
This ensures we’re working with a clean base before detecting and handling outliers.

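A minimal sketch, assuming the file saved at the end of the previous notebook:

```python
# Start from the dataset with missing values already handled
df = pd.read_csv('EDA/data/processed/MissingValuesHandled.csv')
df.head()
```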
Outlier Detection Technique 1 : Distribution Plots
KDE Plotting (Kernel Density Estimate)
Used to visualize the distribution of numerical features and detect skewness, multiple modes (peaks), and potential outliers.

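A sketch of the plotting code, consistent with the description below; the exact column list is my assumption, based on the box plot section:

```python
numerical_columns = ['CreditScore', 'Tenure', 'Age',
                     'Balance', 'EstimatedSalary', 'NumOfProducts']

# 2x3 grid of KDE plots, one per numerical feature
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.flatten(), numerical_columns):
    sns.kdeplot(df[col], ax=ax)
    ax.set_title(col)
    ax.set_xlabel('')
plt.tight_layout()
plt.show()
```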
- Creates a 2×3 grid of subplots to display distribution plots for numerical columns.
- Uses sns.kdeplot() to draw smooth curves representing the data distribution for each feature.
- Titles are set for each subplot using the column name for clarity.
- X-axis labels are removed for a cleaner look.
- tight_layout() adjusts spacing to prevent overlaps.
- Helps identify the shape of the data, skewness, and possible outliers visually.

Pie Chart for Categorical Features
Used to understand the distribution of categories and detect possible data quality issues or imbalances.

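A sketch of the pie charts; the exact set of categorical columns is my assumption, so adjust it to your dataset:

```python
categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']

# One pie chart per categorical feature, arranged in a single row
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, col in zip(axes, categorical_columns):
    df[col].value_counts().plot(kind='pie', ax=ax, autopct='%1.1f%%')
    ax.set_title(col)
    ax.set_ylabel('')
plt.tight_layout()
plt.show()
```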
- Creates a horizontal row of 5 subplots to visualize the distribution of categorical columns.
- For each column in categorical_columns, a pie chart is plotted using .value_counts().plot(kind='pie').
- Each pie chart shows the proportion of each category (e.g., Male vs Female, Yes vs No).
- Titles are set based on the column name for clarity.
- Y-axis labels are removed to reduce clutter.
- tight_layout() ensures the plots are neatly arranged without overlapping.
- Helps understand class imbalances or dominant categories in each feature.

Outlier Detection Technique 2 : Box Plots
Used to visualize the spread and detect outliers or extreme values in numerical data distributions.

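A sketch of the box plots, consistent with the layout described below:

```python
# 2x3 grid of box plots for the numerical features
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes[0], ['CreditScore', 'Tenure', 'Age']):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
for ax, col in zip(axes[1], ['Balance', 'EstimatedSalary', 'NumOfProducts']):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```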
- A grid of 2 rows and 3 columns is created to display six box plots for numerical features.
- Each sns.boxplot() call plots the distribution of a single numerical column: CreditScore, Tenure, Age in the top row; Balance, EstimatedSalary, NumOfProducts in the bottom row.
Box plots show:
- The median (central line in the box).
- The interquartile range (IQR) — middle 50% of the data.
- Whiskers indicating variability outside the upper/lower quartiles.
- Dots outside whiskers mark potential outliers.
- plt.tight_layout() ensures spacing between plots is clean and non-overlapping.
- This method provides a quick visual way to detect outliers in each numerical column.

Outlier Detection Technique 3 : Empirical Rule (Three Sigma Rule)
This technique helps detect outliers in a numerical feature by identifying data points that fall more than 3 standard deviations away from the mean, based on the Empirical Rule (also known as the 68-95-99.7 Rule).
The find_anomaly(data) function

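A sketch of the function, following the bounds described below:

```python
def find_anomaly(data):
    # Empirical (three-sigma) rule: flag values more than 3 standard deviations from the mean
    mean = data.mean()
    std = data.std()
    upper_bound = mean + (3 * std)
    lower_bound = mean - (3 * std)
    return (data > upper_bound) | (data < lower_bound)
```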
- Takes a single numeric column (data) as input.
- Calculates the mean and standard deviation (std) of that column.
- Computes upper and lower bounds:
  upper_bound = mean + (3 * std)
  lower_bound = mean - (3 * std)
- Returns a Boolean Series identifying where the values exceed these bounds (i.e., are considered outliers).
Outlier Detection Loop

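A sketch of the loop, consistent with the description below:

```python
numerical_columns = ['Age', 'Tenure', 'CreditScore', 'Balance', 'EstimatedSalary']

for column in numerical_columns:
    anomalies = find_anomaly(df[column])
    n_outliers = anomalies.sum()
    n_outlier_percentage = n_outliers / len(df) * 100
    print(f"{column}: {n_outliers} outliers ({round(n_outlier_percentage, 3)}%)")
```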
- A list of important numerical_columns is defined (Age, Tenure, CreditScore, Balance, EstimatedSalary).
- The find_anomaly() function is called to detect which values are outliers.
- The number of outliers (n_outliers) is counted using .sum() on the returned Boolean Series.
- The percentage of outliers is calculated by dividing n_outliers by the total number of rows in the dataset.
A message is printed for each column showing:
- The total number of outliers
- The percentage of outliers (rounded to 3 decimal places)
Outlier Detection Technique 4 : Interquartile Range (IQR) Method
The IQR method detects outliers by measuring the spread of the middle 50% of the data and identifying values that fall significantly outside that range.
The find_anomaly(data) function accepts a single numeric column as input.

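A sketch of the IQR version of the function, following the bounds described below:

```python
def find_anomaly(data):
    # IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (data < lower_bound) | (data > upper_bound)
```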
Calculates:
- Q1: 25th percentile (lower quartile)
- Q3: 75th percentile (upper quartile)
- IQR: interquartile range = Q3 - Q1
Defines:
- lower_bound = Q1 - 1.5 * IQR
- upper_bound = Q3 + 1.5 * IQR
Returns a Boolean Series indicating whether each value is an outlier (i.e., falls outside the IQR bounds).
First Loop – Outlier Detection per Column

- Iterates over numerical_columns (Age, Tenure, CreditScore, Balance, EstimatedSalary).
- For each column, applies find_anomaly() to get outlier flags, calculates n_outliers (the total number of True values) and n_outlier_percentage (the share of outliers relative to the dataset size), and prints a summary of outliers in that column.
- Helps quantify how many extreme values exist in each feature.
Outlier Combination Analysis:
Initializes an empty DataFrame outlier_mask_log to track outlier status.

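A sketch of the combination analysis; the variable two_feature_outliers is my own placeholder for the filtered rows:

```python
# Track, per row, which features flag it as an outlier
outlier_mask_log = pd.DataFrame()
for column in numerical_columns:
    outlier_mask_log[column] = find_anomaly(df[column])

# Number of features in which each row is an outlier
outlier_mask_log['Outlier Count'] = outlier_mask_log[numerical_columns].sum(axis=1)

# Rows that are outliers in exactly two features
two_feature_outliers = df[outlier_mask_log['Outlier Count'] == 2]
two_feature_outliers
```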
For each numerical column:
- Adds a Boolean column to outlier_mask_log showing the outlier flags for that feature.
Calculates:
- Outlier Count: a row-wise sum of the outlier flags across all selected columns, which tells how many features a row is an outlier in.
Filters:
- Keeps only rows where Outlier Count == 2 (i.e., records that are outliers in exactly two features).
- Displays those rows for further inspection.
Saving the Cleaned Dataset
Once outliers have been identified using statistical techniques like the Empirical Rule and the Interquartile Range (IQR) Method, it is important to save the flagged results for review and future reference.
The following snippet demonstrates how to export the detected outlier records for further analysis:

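A minimal sketch, reusing the placeholder variable from the combination-analysis sketch above:

```python
# Export the flagged records for stakeholder review (nothing is dropped here)
two_feature_outliers.to_csv('EDA/data/processed/Outliers_Handled.csv', index=False)
```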
- The code saves the subset of rows identified as outliers in exactly two numerical columns.
- This allows domain experts and business management to review the context of those records before deciding on removal or correction.
- The file path 'EDA/data/processed/Outliers_Handled.csv' ensures the results are saved in an organized structure under the processed folder.
- index=False ensures that DataFrame row indices are not included as a separate column in the saved file.
Important: No outliers have been dropped or modified at this stage.
These records have only been flagged. A final decision should be made in consultation with business stakeholders to determine whether these values are genuine anomalies or important edge cases.
Feature Binning
Feature Binning (also known as discretization) is the process of converting continuous numerical variables into categorical bins or intervals.
Why Use Feature Binning?
- Helps in handling outliers by grouping extreme values into broader categories.
- Makes the data more interpretable, especially for decision trees or rule-based models.
- Useful for visualization and detecting patterns or trends across ranges.
Standard Setup process
The initial step involves setting up the environment by installing and importing all necessary libraries. Additionally, I loaded the dataset that has already undergone missing value handling and outlier treatment.
Feature Binning using Credit Score
In this step, we apply feature binning (also called discretization) to convert the continuous CreditScore variable into meaningful categories. This improves interpretability and can sometimes enhance model performance, especially for tree-based models.

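A sketch consistent with the description below; the exact validation check and DataFrame name are my assumptions:

```python
def custom_binning_credit_score(score):
    # Map a raw credit score to a standard category label (max valid score: 850)
    assert score <= 850, "CreditScore should not exceed 850"
    if score < 580:
        return 'Poor'
    elif score <= 669:
        return 'Fair'
    elif score <= 739:
        return 'Good'
    elif score <= 799:
        return 'Very Good'
    else:
        return 'Excellent'


# Apply the binning and replace the continuous column with the categorical one
df['CreditScoreBins'] = df['CreditScore'].apply(custom_binning_credit_score)
df = df.drop(columns=['CreditScore'])
df.head()
```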
- This step applies feature binning to convert the continuous CreditScore into categorical labels based on standard credit score ranges.
- A custom function custom_binning_credit_score(score) is defined to assign categories: Poor (<580), Fair (580–669), Good (670–739), Very Good (740–799), and Excellent (800–850).
- The function ensures the score does not exceed 850 by including a validation check.
- The .apply() method is used to apply this function to each value in the CreditScore column.
- A new column CreditScoreBins is created to store the resulting categories, and the original CreditScore column is removed to prevent duplication.
- This transformation improves interpretability by grouping credit scores into meaningful and recognizable categories.
Then, as in the previous steps, I saved this new processed dataset as 'EDA/data/processed/FeautureBinning_Applied' and moved on to the next step of exploratory data analysis.
Feature Encoding
Feature encoding converts categorical data into numerical form so that machine learning models can process it. A common method is one-hot encoding, which creates binary columns for each category. This step is essential to ensure the model can understand and use categorical features effectively.

Standard Setup process
The initial step involves setting up the environment by installing and importing all necessary libraries. Additionally, this time I loaded the dataset that has already undergone missing value handling, outlier handling, and feature binning.
Encoding Nominal variables

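A sketch of the encoding step, consistent with the explanation below:

```python
# One-hot encode the nominal columns Geography and Gender
geography_dummies = pd.get_dummies(df['Geography'], prefix='Geography')
gender_dummies = pd.get_dummies(df['Gender'], prefix='Gender')

# Attach the dummy columns, then remove the original nominal columns
df_encoded = pd.concat([df, geography_dummies, gender_dummies], axis=1)
del df_encoded['Geography']
del df_encoded['Gender']
df_encoded.head()
```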
- This code applies one-hot encoding to the categorical columns Geography and Gender.
- pd.get_dummies() creates new binary columns (0 or 1) for each category within these variables, with appropriate prefixes.
- The new dummy columns are then concatenated to the original DataFrame using pd.concat(), and the original Geography and Gender columns are removed using del.
- The result is a DataFrame (df_encoded) where the nominal variables are now represented numerically, making them suitable for modeling.
Encoding Ordinal Variables
Defined a dictionary encode_dict_creditscore to map the categorical credit score labels to numerical values:
- 'Poor' mapped to 0
- 'Fair' mapped to 1
- 'Good' mapped to 2
- 'Very Good' mapped to 3
- 'Excellent' mapped to 4

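A sketch consistent with the mapping and steps described here:

```python
# Ordinal encoding: preserve the natural order of the credit score bins
encode_dict_creditscore = {
    'Poor': 0,
    'Fair': 1,
    'Good': 2,
    'Very Good': 3,
    'Excellent': 4,
}

df_encoded['CreditScoreBins'] = df_encoded['CreditScoreBins'].map(encode_dict_creditscore)
df_encoded.head(10)
```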
- Used the .map() function on the CreditScoreBins column of the DataFrame df_encoded to replace the categorical credit score categories with their corresponding numerical codes from the dictionary.
- Stored the transformed numerical values back into the CreditScoreBins column of df_encoded.
- Displayed the first 10 rows of the updated DataFrame using .head(10) to verify the encoding.
Then, as in the previous steps, I saved this new processed dataset as 'EDA/data/processed/FeautureEncoding_Applied' and moved on to the final step of exploratory data analysis.
Feature Scaling
Feature scaling is the process of adjusting numerical data to a common scale, ensuring no feature dominates others due to its range. It helps models like KNN, SVM, and neural networks perform better and train faster by improving balance and convergence.
Normalization : Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
Standardization: Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.

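A sketch of the scaling loop, consistent with the explanation below; I use reshape(-1, 1) instead of a hard-coded row count and flatten the result before storing it back:

```python
from sklearn.preprocessing import StandardScaler  # or MinMaxScaler for normalization

Columns_need_to_be_scaled = ['Age', 'Tenure', 'Balance', 'EstimatedSalary']

for col in Columns_need_to_be_scaled:
    scaler = StandardScaler()
    # scaler = MinMaxScaler()  # optional alternative: rescale to the [0, 1] range
    scaled = scaler.fit_transform(df[col].values.reshape(-1, 1))  # scaler expects a 2D array
    df[col] = scaled.flatten()  # back to a 1-D column

df.head()
```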
- Created a list Columns_need_to_be_scaled containing the numerical columns to be scaled: 'Age', 'Tenure', 'Balance', and 'EstimatedSalary'.
- Used a for loop to iterate through each column in the list.
Inside the loop:
- Initialized a StandardScaler() object from sklearn.preprocessing to scale the data (an optional line for MinMaxScaler() is left commented out).
- Applied fit_transform() to each column, reshaped with .reshape(10000, 1) to convert it into the 2D array required by the scaler.
- Stored the scaled values back into the original DataFrame df for each respective column.
Result: All selected columns are now standardized (mean = 0, standard deviation = 1), ready for model training.
Finally, I saved this new processed dataset as 'EDA/data/processed/ChurnModeling_Final'. This means the raw dataset I started with was finally ready for modeling after multiple steps of exploratory data analysis.
Conclusion
I’ve successfully navigated the essential stages of data preprocessing, from setting up my environment to meticulously cleaning, transforming, and scaling my features. What truly elevates this project is how I went beyond traditional Exploratory Data Analysis (EDA) by leveraging the Groq API with a Large Language Model (LLM) for tasks like predicting gender from names.
This innovative approach allowed me to impute missing values with greater accuracy and insight, showcasing a powerful synergy between conventional data science techniques and the cutting-edge capabilities of LLMs.
By diligently handling missing values, identifying and addressing outliers, and applying strategic feature engineering techniques like binning and encoding, I’ve not only prepared my dataset for robust model training but also gained invaluable insights into its underlying structure. Remember, the quality of my data directly impacts the performance of my machine learning models.
A well-preprocessed dataset leads to more accurate predictions, more reliable insights, and ultimately, more impactful solutions. As I move forward, this comprehensive approach to data preparation, now enhanced with LLM integration, will serve as a cornerstone for building high-performing and trustworthy machine learning applications. I'll keep experimenting, keep refining, and keep building! Stay tuned with Techwiseaid for more hands-on experiences like this.