
Beyond EDA: Intelligent Data Preprocessing with LLM-Powered Imputation

Introduction

Knowing why customers leave is critical for any business. One important indicator with a direct impact on revenue is customer churn: the loss of customers or subscribers. In this project, I use exploratory data analysis (EDA) to investigate customer churn, uncover hidden patterns, clean up the data, and get it ready for machine learning modeling.

From configuring a Python environment with Anaconda to creating a final clean dataset, this article takes you step by step through the entire EDA process for a churn dataset. Every significant step will be covered, including handling missing values, treating outliers, binning features, encoding categorical variables, and scaling numerical ones.

If you are new to exploratory data analysis (EDA) or want to explore it further, I highly recommend reading this article, where I cover all the concepts I needed for this project and more.

By the end of this article, you’ll not only see how raw data is transformed into a model-ready form, but also understand the “why” behind each transformation.

This project will provide you with useful, hands-on insights into efficiently preparing your data, regardless of your level of experience with EDA or churn prediction models.

Environment Setup with Anaconda

The first step in data analysis is to set up a clean, isolated Python environment. Doing so guarantees that all the necessary libraries are installed and prevents dependency conflicts with other projects.

Step 1: Install Anaconda (if not already installed)

Anaconda is a Python distribution that comes preloaded with most data science tools like Jupyter, Pandas, NumPy, Matplotlib, Seaborn and more.

Once installed, open the Anaconda Prompt or terminal and proceed with the steps below.

Step 2: Create a New Conda Environment

We will create a new environment named churn-eda with Python 3.9. You can use whatever name you like in place of churn-eda, and for the Python version, use the latest version available on your machine (3.9+ is recommended).

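The command looks like this (substitute your own environment name and Python version if they differ):

conda create -n churn-eda python=3.9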

Then you will see some prompts in the terminal; accept everything and proceed. After that, we have to activate the environment we created as follows:

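conda activate churn-eda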

Step 3: Clone or Open Your Project Folder

You might still be in the default system location, so you need to navigate to your project directory using:

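For example (the path below is a placeholder; use your own project location):

cd path/to/your/project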

Step 4: Open your project in VS Code

Next, we will open the project folder in VS Code by typing “code .” in the terminal or Anaconda Prompt. This will automatically open your project in Visual Studio Code. After that, we are good to go with the project.

Arranging Project Folder Structure

Before diving into the coding part, it is essential to know what the project folder and file structure will look like. This helps you proceed in a more organized way. You can follow any folder structure you like; below is the structure I used for my customer churn EDA project.

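Reconstructed from the file paths used later in the article, the layout looks roughly like this (the project root name and the notebook names beyond the first are illustrative):

churn-eda-project/
├── EDA/
│   └── data/
│       ├── raw/          # original dataset (e.g. CEHHbInToW.csv)
│       └── processed/    # output of each step (MissingValuesHandled.csv, ...)
├── 0_handle_missing_values.ipynb
├── ...                   # one notebook per step (outliers, binning, encoding, scaling)
├── requirements.txt
└── .env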

I created two folders for the datasets. The raw folder contains only the raw, unprocessed dataset that I used for exploratory data analysis, and the processed folder contains the optimized, fully processed datasets from step 0 through the final step. You will get a better understanding of this as we move on.

The raw customer dataset that I used for this project is uploaded in my GitHub repo; you can find it here. Drag and drop (or simply copy) that dataset into the raw folder. After that, you can start analyzing the dataset to prepare it for modeling.

Requirements and .env file

We will need a bunch of Python libraries and APIs to do the EDA, and it is a burden to install them one by one. Therefore, I listed all the prerequisites in a text file (requirements.txt). Later, I can use this file to install all the libraries and APIs at once.

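The exact file isn’t reproduced here, but based on the libraries used throughout this project it would look roughly like this:

pandas
numpy
matplotlib
seaborn
scikit-learn
groq
pydantic
python-dotenv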

In this project, I integrated a GROQ-powered large language model (LLM) to assist with automated text generation, explanations, or even suggestions based on EDA insights. To interact with the model securely, I used an API key provided by Groq Cloud. You can also create your GROQ API key for free here.

Instead of hardcoding the API key in the notebook, I stored it safely in a .env file. By connecting to Groq using the .env key, I could programmatically prompt the model to assist in my analysis or article generation.

The .env file is a plain text file that holds sensitive environment variables, such as: GROQ_API_KEY=your-secret-api-key

This file should be kept private and added to .gitignore so it’s never committed to GitHub or shared accidentally.
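To load the key inside a notebook, one common pattern is python-dotenv together with os.getenv (a sketch, assuming python-dotenv is listed in requirements.txt):

import os
from dotenv import load_dotenv

load_dotenv()  # reads GROQ_API_KEY from the .env file in the project root
groq_api_key = os.getenv("GROQ_API_KEY")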

Handling Missing Values

Missing values are one of the most common problems in real-world datasets. If not handled properly, they can lead to misleading results or errors during modeling. In this section, I’ll:

  • Identify missing values in the dataset
  • Understand which features are affected
  • Decide on appropriate deletion and imputation strategies

I created a .ipynb (IPython notebook / Jupyter notebook) named 0_handle_missing_values to handle the missing values. After creating this notebook, make sure you select the kernel in the top-right corner as the conda environment you created (churn-eda).

In these notebooks, markdown cells are used for text and documentation, basically to explain what the code is all about. Code is written in code cells, and you can run each cell separately to see whether your code is working or not; there is no need to compile the whole program.

Standard setup process

Before diving into any analysis, every data science notebook typically starts with a standard setup process. This foundational step ensures your environment is ready and your data is accessible. It usually involves two key tasks:

Install libraries and Import dependencies

I had already listed the libraries and APIs I needed in a requirements.txt file. Now it is easy to install them all at once by typing the following command in a separate code cell and running it. This ensures all required packages for data manipulation, visualization, and modeling are available.

!pip install -r requirements.txt

After installation, I imported all the necessary libraries at the top of the notebook for clean organization and easy access:

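The import cell looks roughly like this, based on the libraries used later in the notebook:

import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns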

You need to import os anytime you want to access system-level settings or environment variables in Python. By calling os.getcwd() you can get the current working directory.

Loading the Dataset

The next step is to load the dataset you want to analyze.

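Reconstructed from the points below, the cell looks like this:

df = pd.read_csv('EDA/data/raw/CEHHbInToW.csv')
df.head()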
  • pd.read_csv() is a Pandas function used to read CSV (Comma-Separated Values) files.
  • It loads the contents of the file into a Data Frame, which is a 2D tabular structure (like a spreadsheet) that makes data analysis easy in Python.
  • 'EDA/data/raw/CEHHbInToW.csv' is the relative path to the dataset file.
  • Once loaded, you typically preview the first few rows by df.head(). This helps confirm the dataset is correctly loaded and gives a quick look at the structure and contents.

Checking for Missing Values

One of the first and most important steps in data exploration is identifying missing values in the dataset. In Python, this can be done with a simple but powerful line of code:

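df.isnull().sum()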
  • df.isnull()
    Creates a Boolean Data Frame where:
    • True indicates a missing value (NaN)
    • False indicates a valid (non-missing) value
  • .sum()
    When applied to the Boolean Data Frame, it counts how many True (missing) values there are in each column.

Deleting missing data

Sometimes the simplest way to deal with missing values is to remove them. Pandas makes this easy using the .dropna() method.

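A sketch of that cell (the variable name df_dropped is illustrative):

df_dropped = df.copy()                   # keep the raw data intact
df_dropped = df_dropped.dropna(axis=0)   # drop every row with at least one null
df_dropped.head()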
  • df.copy()
    Creates a duplicate of the original Data Frame so we don’t lose or alter the raw data. It’s a best practice before performing destructive operations.
  • dropna(axis=0)
    • axis=0 checks each row. (axis=1 checks each column)
    • It drops any row that has at least one null value.

This is useful when only a few rows contain missing values, or when losing them won’t affect overall data quality.

Imputation: Filling Missing Values

Instead of deleting rows with missing values, another common approach is imputation: filling in the missing data with a suitable value. We use imputation when deleting data would lead to information loss, especially when missing values are widespread; imputation preserves the dataset size while still giving the model complete data to work with.

Possible Imputation Strategies:

  • Numerical Values:
    • Use the mean or median
    • Use special placeholders like 0 or -999 (if you want the model to recognize missingness)
  • Categorical Values:
    • Use the mode
    • Or assign a new label like 'Unknown' or 'Missing'

Age Imputation

I saw that Age had the highest number of missing values (600), so I used imputation to fill in the missing ages using the mean value:

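A sketch of that comparison (the exact print format is illustrative):

mean = df['Age'].mean()
median = df['Age'].median()
print(f"Mean Age: {round(mean, 2)} | Median Age: {median}")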
  • df['Age'].mean()
    Calculates the average (mean) of all non-null values in the Age column.
  • df['Age'].median()
    Calculates the median (middle value when sorted) of the Age column.
  • round(mean, 2)
    Rounds the mean value to 2 decimal places for cleaner output.
  • print(...)
    Displays the mean and median in a formatted way to compare both values side by side.

This helped me decide which central tendency (mean or median) is more appropriate for imputation based on the data distribution.

Then I plotted a histogram using the Seaborn library to spot skewness. If the data is symmetrical, the mean works well. If skewed, the median might be more robust.

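Something like:

sns.histplot(df['Age'], bins=30)
plt.show()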
  • sns.histplot(...)
    Plots a histogram — a graphical representation of the distribution of numerical data.
  • df['Age']
    Specifies the column to plot (in this case, Age).
  • bins=30
    Divides the age range into 30 equal-width intervals (bins) to show more granularity in the distribution.

As you can see above, the distribution is fairly symmetrical, so I decided to go with the mean. Here is how I imputed the missing age values using the mean.

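Reconstructed from the points below:

df_impute = df.copy()
df_impute.loc[:, 'Age'] = df_impute['Age'].fillna(mean)
df_impute.head()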
  • df_impute
    A copy of the original DataFrame used to apply changes without modifying the raw dataset.
  • df_impute['Age'].fillna(mean)
    Fills all missing values (NaN) in the Age column with the previously calculated mean.
  • .loc[:, 'Age']
    Ensures the assignment targets the entire 'Age' column explicitly (all rows, single column).
  • df_impute.head()
    Displays the first 5 rows of the updated DataFrame to verify the imputation.

You can now run df_impute.isnull().sum() and see that the Age column has 0 null values, which means the imputation succeeded.

Gender Imputation

Step 1: Predicting Gender from Names using LLM
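A sketch of what that cell might contain, pieced together from the bullet points below; the prompt wording and the client setup are assumptions, and the API key comes from the .env file:

import os
from enum import Enum

import groq
from dotenv import load_dotenv
from pydantic import BaseModel

load_dotenv()  # makes GROQ_API_KEY from the .env file available via os.getenv


class Gender(str, Enum):
    MALE = "Male"
    FEMALE = "Female"


class GenderPrediction(BaseModel):
    # Structures and validates the names and the predicted gender
    firstname: str
    lastname: str
    pred_gender: Gender


client = groq.Groq(api_key=os.getenv("GROQ_API_KEY"))


def predict_gender(firstname: str, lastname: str) -> str:
    # Ask the LLM for the most likely gender, constrained to "Male" or "Female"
    prompt = (
        f"Based on the first name '{firstname}' and last name '{lastname}', "
        "predict the most likely gender. Respond with exactly one word: Male or Female."
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()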
  • To handle missing values in the Gender column, we use a unique approach: predicting gender from first and last names using a large language model (LLM) via Groq API.
  • Gender enum class is defined to represent valid gender values: "Male" and "Female", ensuring only acceptable outputs are allowed.
  • A GenderPrediction Pydantic model is created to structure and validate input/output with three fields: firstname, lastname, and pred_gender.
  • The predict_gender() function takes a person’s firstname and lastname as inputs.
  • A formatted natural language prompt is created that asks the model to predict the most likely gender based on the given names.
  • The groq.Groq().chat.completions.create() function is called to send the prompt to the Groq API using the "llama-3.3-70b-versatile" model. You can use any model that are in the Groq API documentation.
  • The response is received in chat format, and the predicted gender is extracted from response.choices[0].message.content, then cleaned using .strip().
  • Finally, the function returns the predicted gender as a string: either "Male" or "Female".
Step 2: Loop Through Missing Genders and Predict Using LLM
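A sketch of that loop, based on the description below:

# Rows where Gender is still missing
missing_gender_index = df_impute[df_impute['Gender'].isnull()].index

for idx in missing_gender_index:
    firstname = df_impute.loc[idx, 'Firstname']
    lastname = df_impute.loc[idx, 'Lastname']

    pred_gender = predict_gender(firstname, lastname).strip()

    if pred_gender:
        df_impute.loc[idx, 'Gender'] = pred_gender
        print(f"{firstname} {lastname} -> predicted gender: {pred_gender}")
    else:
        print(f"No gender detected for {firstname} {lastname}")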
  • The code begins by identifying all rows in the dataset (df_impute) where the Gender column is null, using a Boolean mask.
  • It then loops through each index where the gender is missing, and for every such row, it extracts the corresponding Firstname and Lastname.
  • These names are passed into a function predict_gender() which is powered by a large language model (LLM) like LLaMA 3. The function constructs a natural language prompt asking the model to return either “Male” or “Female” based on the name.
  • The LLM responds with a predicted gender (as a string), and this prediction is stripped of any extra whitespace and stored.
  • If a valid gender is returned, it is assigned back into the missing cell of the DataFrame for that person. A print statement also displays the full name along with the predicted gender.
  • If the model fails to return a prediction, the gender field remains unchanged, and a message is printed stating that no gender was detected.
  • This loop continues until all rows with missing genders have either been filled or logged as undetectable, effectively imputing missing gender values using intelligent predictions.

You can now run df_impute.isnull().sum() and see that the Gender column also has 0 null values, which means the imputation succeeded.

Removing Irrelevant Columns for Data Cleaning

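A sketch of that cell, reconstructed from the description below:

df_impute = df_impute.drop(columns=['RowNumber', 'CustomerId', 'Firstname', 'Lastname'])
df_impute.head()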

In this step, I cleaned up the dataset by removing several columns that are not necessary for our analysis or modeling. The code uses the .drop() method on the DataFrame df_impute to delete specific columns. These columns are:

  • RowNumber: Usually just an index or serial number for each row, which doesn’t carry meaningful information for prediction.
  • CustomerId: A unique identifier for each customer that does not contribute to any patterns or relationships in the data.
  • Firstname and Lastname: Personal identifiers that are irrelevant for our analysis and may raise privacy concerns.

You can now run df_impute.isnull().sum() and confirm that no columns have null values and that only the relevant columns remain in the dataset.

Saving the Cleaned Dataset

Once the data cleaning and preprocessing steps are complete, it is crucial to save the updated dataset. This ensures that the cleaned data can be easily accessed later without repeating the entire process. The following code snippet demonstrates how to save the processed DataFrame to a CSV file:
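A sketch of that call, reconstructed from the points below:

df_impute.to_csv('EDA/data/processed/MissingValuesHandled.csv', index=False)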

  • The to_csv() function exports the Data Frame df_impute to a CSV file.
  • The file path 'EDA/data/processed/MissingValuesHandled.csv' specifies where the file will be saved.
  • Setting index=False prevents saving the Data Frame’s row indices as a separate column in the CSV file.
  • Saving the data in a structured folder (data/processed) helps maintain organization and clarity in the project files.
  • The saved CSV file can now be used for further analysis, modeling, or sharing with others.

Handling Outliers

Outliers are data points that deviate significantly from the rest of the dataset. They can distort statistical analyses and machine learning models by skewing the results or introducing noise. Therefore, detecting and handling outliers is a critical step in data preprocessing to improve model accuracy and reliability. In this section, we will explore methods to identify and manage outliers effectively.

Standard setup process

As usual, the first step is to install all the libraries and import the dependencies into the Jupyter notebook. We also have to load the dataset. Instead of using the raw dataset, we now load the dataset where missing values have already been handled. This ensures we’re working with a clean base before detecting and handling outliers.

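For example (assuming the DataFrame is again named df):

df = pd.read_csv('EDA/data/processed/MissingValuesHandled.csv')
df.head()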

Outlier Detection Technique 1 : Distribution Plots

KDE Plotting (Kernel Density Estimate)

Used to visualize the distribution of numerical features and detect skewness, multiple modes (peaks), and potential outliers.

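A sketch of that plotting cell; the list of numerical columns is inferred from the box-plot step later in this section:

numerical_columns = ['CreditScore', 'Tenure', 'Age', 'Balance', 'EstimatedSalary', 'NumOfProducts']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, column in zip(axes.flatten(), numerical_columns):
    sns.kdeplot(df[column], ax=ax)   # smooth curve of the feature's distribution
    ax.set_title(column)
    ax.set_xlabel('')
plt.tight_layout()
plt.show()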
  • Creates a 2×3 grid of subplots to display distribution plots for numerical columns.
  • Uses sns.kdeplot() to draw smooth curves representing the data distribution for each feature.
  • Titles are set for each subplot using the column name for clarity.
  • X-axis labels are removed for a cleaner look.
  • tight_layout() adjusts spacing to prevent overlaps.
  • Helps identify the shape of data, skewness, and possible outliers visually.
Pie Chart for Categorical Features

Used to understand the distribution of categories and detect possible data quality issues or imbalances.

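A sketch, assuming the five categorical columns are Geography, Gender, HasCrCard, IsActiveMember, and Exited (only the first two are named explicitly elsewhere in the article):

categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']

fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, column in zip(axes, categorical_columns):
    df[column].value_counts().plot(kind='pie', ax=ax)
    ax.set_title(column)
    ax.set_ylabel('')
plt.tight_layout()
plt.show()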
  • Creates a horizontal row of 5 subplots to visualize the distribution of categorical columns.
  • For each column in categorical_columns, a pie chart is plotted using .value_counts().plot(kind='pie').
  • Each pie chart shows the proportion of each category (e.g., Male vs Female, Yes vs No).
  • Titles are set based on the column name for clarity.
  • Y-axis labels are removed to reduce clutter.
  • tight_layout() ensures the plots are neatly arranged without overlapping.
  • Helps understand class imbalances or dominant categories in each feature.

Outlier Detection Technique 2 : Box Plots

Used to visualize the spread and detect outliers or extreme values in numerical data distributions.

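Reconstructed from the description below:

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

sns.boxplot(y=df['CreditScore'], ax=axes[0, 0])
sns.boxplot(y=df['Tenure'], ax=axes[0, 1])
sns.boxplot(y=df['Age'], ax=axes[0, 2])
sns.boxplot(y=df['Balance'], ax=axes[1, 0])
sns.boxplot(y=df['EstimatedSalary'], ax=axes[1, 1])
sns.boxplot(y=df['NumOfProducts'], ax=axes[1, 2])

plt.tight_layout()
plt.show()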
  • A grid of 2 rows and 3 columns is created to display six box plots for numerical features.
  • Each sns.boxplot() call plots the distribution of a single numerical column:
  • CreditScore, Tenure, Age in the top row.
  • Balance, EstimatedSalary, NumOfProducts in the bottom row.

Box plots show:

  • The median (central line in the box).
  • The interquartile range (IQR) — middle 50% of the data.
  • Whiskers indicating variability outside the upper/lower quartiles.
  • Dots outside whiskers mark potential outliers.
  • plt.tight_layout() ensures spacing between plots is clean and non-overlapping.
  • This method provides a quick visual way to detect outliers in each numerical column.

Outlier Detection Technique 3 : Empirical Rule (Three Sigma Rule)

This technique helps detect outliers in a numerical feature by identifying data points that fall more than 3 standard deviations away from the mean, based on the Empirical Rule (also known as the 68-95-99.7 Rule).

find_anomaly(data) function
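A sketch of the function, following the bounds described below:

def find_anomaly(data):
    # Empirical rule: flag values more than 3 standard deviations from the mean
    mean = data.mean()
    std = data.std()
    upper_bound = mean + (3 * std)
    lower_bound = mean - (3 * std)
    return (data > upper_bound) | (data < lower_bound)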
  • Takes a single numeric column (data) as input.
  • Calculates the mean and standard deviation (std) of that column.
  • Computes upper and lower bounds:
  • upper_bound = mean + (3 * std)
  • lower_bound = mean - (3 * std)

Returns a Boolean Series identifying where the values exceed these bounds (i.e., are considered outliers).

Outlier Detection Loop
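A sketch of that loop (the exact wording of the printed message is illustrative):

numerical_columns = ['Age', 'Tenure', 'CreditScore', 'Balance', 'EstimatedSalary']

for column in numerical_columns:
    outliers = find_anomaly(df[column])
    n_outliers = outliers.sum()
    n_outlier_percentage = n_outliers / len(df)
    print(f"{column}: {n_outliers} outliers ({round(n_outlier_percentage * 100, 3)}%)")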
  • A list of important numerical_columns is defined (Age, Tenure, CreditScore, Balance, EstimatedSalary).
  • The find_anomaly() function is called to detect which values are outliers.
  • The number of outliers (n_outliers) is counted using .sum() on the returned Boolean Series.
  • The percentage of outliers is calculated by dividing n_outliers by the total number of rows in the dataset.

A message is printed for each column showing:

  • The total number of outliers
  • The percentage of outliers (rounded to 3 decimal places)

Outlier Detection Technique 4 : Interquartile Range (IQR) Method

The IQR method detects outliers by measuring the spread of the middle 50% of the data and identifying values that fall significantly outside that range.

find_anomaly(data) function

Accepts a single numeric column as input.

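A sketch of the IQR version of find_anomaly:

def find_anomaly(data):
    # IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return (data < lower_bound) | (data > upper_bound)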

Calculates:

  • Q1: 25th percentile (lower quartile)
  • Q3: 75th percentile (upper quartile)
  • IQR: Interquartile Range = Q3 - Q1

Defines:

  • lower_bound = Q1 - 1.5 * IQR
  • upper_bound = Q3 + 1.5 * IQR

Returns a Boolean Series indicating whether each value is an outlier (i.e., falls outside the IQR bounds).

First Loop – Outlier Detection per Column
  • Explanation (the loop itself is the same as in the Empirical Rule section, now using the IQR-based find_anomaly):
    • Iterates over numerical_columns (Age, Tenure, CreditScore, Balance, EstimatedSalary).
    • For each column:
      • Applies find_anomaly() to get outlier flags.
      • Calculates:
        • n_outliers: Total number of outliers (True values)
        • n_outlier_percentage: Share of outliers relative to dataset size
      • Prints a summary of outliers in each column.
    • Helps quantify how many extreme values exist in each feature.
Outlier Combination Analysis:

Initializes an empty DataFrame outlier_mask_log to track outlier status.

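A sketch of that analysis (the variable name two_feature_outliers is illustrative):

# One Boolean column per feature: is this row an outlier in that feature?
outlier_mask_log = pd.DataFrame()
for column in numerical_columns:
    outlier_mask_log[column] = find_anomaly(df[column])

# In how many features is each row an outlier?
outlier_mask_log['Outlier Count'] = outlier_mask_log[numerical_columns].sum(axis=1)

# Keep only the rows that are outliers in exactly two features
two_feature_outliers = df[outlier_mask_log['Outlier Count'] == 2]
two_feature_outliers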

For each numerical column:

  • Adds a Boolean column to outlier_mask_log showing outlier flags for that feature.

Calculates:

  • Outlier Count: Row-wise sum of outlier flags across all selected columns, which tells how many features a row is an outlier in.

Filters:

  • Keeps only rows where Outlier Count == 2 (i.e., records that are outliers in exactly two features).
  • Displays those rows for further inspection.

Saving the Cleaned Dataset

Once outliers have been identified using statistical techniques like the Empirical Rule and the Interquartile Range (IQR) Method, it is important to save the flagged results for review and future reference.

The following snippet demonstrates how to export the detected outlier records for further analysis:

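A sketch of that export (two_feature_outliers refers to the flagged rows from the previous step):

two_feature_outliers.to_csv('EDA/data/processed/Outliers_Handled.csv', index=False)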
  • The code saves the subset of rows identified as outliers in exactly two numerical columns.
  • This allows domain experts and business management to review the context of those records before deciding on removal or correction.
  • The file path 'EDA/data/processed/Outliers_Handled.csv' ensures the results are saved in an organized structure under the processed folder.
  • index=False ensures that DataFrame row indices are not included as a separate column in the saved file.

Important: No outliers have been dropped or modified at this stage.
These records have only been flagged. A final decision should be made in consultation with business stakeholders to determine whether these values are genuine anomalies or important edge cases.

Feature Binning

Feature Binning (also known as discretization) is the process of converting continuous numerical variables into categorical bins or intervals.

Why Use Feature Binning?

  • Helps in handling outliers by grouping extreme values into broader categories.
  • Makes the data more interpretable, especially for decision trees or rule-based models.
  • Useful for visualization and detecting patterns or trends across ranges.

Standard Setup process

The initial step involves setting up the environment by installing and importing all necessary libraries. Additionally, I loaded the dataset that has already undergone missing value handling and outlier treatment.

Feature Binning using Credit Score

In this step, we apply feature binning (also called discretization) to convert the continuous CreditScore variable into meaningful categories. This improves interpretability and can sometimes enhance model performance, especially for tree-based models.

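A sketch of the binning cell, following the ranges and steps listed below (the assert is one way to implement the validation check):

def custom_binning_credit_score(score):
    # Standard credit score ranges; scores above 850 are treated as invalid
    assert score <= 850, "Credit score should not exceed 850"
    if score < 580:
        return 'Poor'
    elif score <= 669:
        return 'Fair'
    elif score <= 739:
        return 'Good'
    elif score <= 799:
        return 'Very Good'
    else:
        return 'Excellent'

df['CreditScoreBins'] = df['CreditScore'].apply(custom_binning_credit_score)
df = df.drop(columns=['CreditScore'])   # remove the original column to avoid duplication
df.head()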
  • This step applies feature binning to convert the continuous CreditScore into categorical labels based on standard credit score ranges.
  • A custom function custom_binning_credit_score(score) is defined to assign categories: Poor (<580), Fair (580–669), Good (670–739), Very Good (740–799), and Excellent (800–850).
  • The function ensures the score does not exceed 850 by including a validation check.
  • The .apply() method is used to apply this function to each value in the CreditScore column.
  • A new column CreditScoreBins is created to store the resulting categories, and the original CreditScore column is removed to prevent duplication.
  • This transformation improves interpretability by grouping credit scores into meaningful and recognizable categories.

Then, as in previous steps, I saved this new processed dataset as 'EDA/data/processed/FeautureBinning_Applied' and moved on to the next step of exploratory data analysis.

Feature Encoding

Feature encoding converts categorical data into numerical form so that machine learning models can process it. A common method is one-hot encoding, which creates binary columns for each category. This step is essential to ensure the model can understand and use categorical features effectively.


Standard Setup process

The initial step involves setting up the environment by installing and importing all necessary libraries. Additionally, this time I loaded the dataset that has already undergone missing value handling, outlier handling, and feature binning.

Encoding Nominal variables

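Reconstructed from the description below:

geography_dummies = pd.get_dummies(df['Geography'], prefix='Geography')
gender_dummies = pd.get_dummies(df['Gender'], prefix='Gender')

df_encoded = pd.concat([df, geography_dummies, gender_dummies], axis=1)
del df_encoded['Geography']
del df_encoded['Gender']
df_encoded.head()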
  • This code applies one-hot encoding to the categorical columns Geography and Gender.
  • pd.get_dummies() creates new binary columns (0 or 1) for each category within these variables, with appropriate prefixes.
  • The new dummy columns are then concatenated to the original DataFrame using pd.concat(), and the original Geography and Gender columns are removed using del.
  • The result is a DataFrame (df_encoded) where the nominal variables are now represented numerically, making them suitable for modeling.

Encoding Ordinal Variables

Defined a dictionary encode_dict_creditscore to map categorical credit score labels to numerical values:

  • 'Poor' mapped to 0
  • 'Fair' mapped to 1
  • 'Good' mapped to 2
  • 'Very Good' mapped to 3
  • 'Excellent' mapped to 4
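Put together, the mapping step looks roughly like this:

encode_dict_creditscore = {
    'Poor': 0,
    'Fair': 1,
    'Good': 2,
    'Very Good': 3,
    'Excellent': 4,
}

df_encoded['CreditScoreBins'] = df_encoded['CreditScoreBins'].map(encode_dict_creditscore)
df_encoded.head(10)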
  • Used the .map() function on the CreditScoreBins column of the dataframe df_encoded to replace the categorical credit score categories with their corresponding numerical codes from the dictionary.
  • Stored the transformed numerical values back into the CreditScoreBins column of df_encoded.
  • Displayed the first 10 rows of the updated dataframe using .head(10) to verify the encoding.

Then, as in previous steps, I saved this new processed dataset as 'EDA/data/processed/FeautureEncoding_Applied' and moved on to the final step of exploratory data analysis.

Feature Scaling

Feature scaling is the process of adjusting numerical data to a common scale, ensuring no feature dominates others due to its range. It helps models like KNN, SVM, and neural networks perform better and train faster by improving balance and convergence.

Normalization : Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

Standardization: Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.

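A sketch of that scaling loop (the notebook reshapes explicitly to (10000, 1); reshape(-1, 1) lets NumPy infer the row count):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

Columns_need_to_be_scaled = ['Age', 'Tenure', 'Balance', 'EstimatedSalary']

for column in Columns_need_to_be_scaled:
    scaler = StandardScaler()
    # scaler = MinMaxScaler()   # optional: min-max scaling to the [0, 1] range
    scaled = scaler.fit_transform(df[column].values.reshape(-1, 1))
    df[column] = scaled.ravel()

df.head()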
  • Created a list Columns_need_to_be_scaled containing numerical columns to be scaled: 'Age', 'Tenure', 'Balance', and 'EstimatedSalary'.
  • Used a for loop to iterate through each column in the list.

Inside the loop:

  • Initialized a StandardScaler() object from sklearn.preprocessing to scale data (optional line for MinMaxScaler() is commented out).
  • Applied fit_transform() to each column, reshaped using .reshape(10000, 1) to convert it into a 2D array as required by the scaler.
  • Stored the scaled values back into the original DataFrame df for each respective column.

Result: All selected columns are now standardized (mean = 0, standard deviation = 1), ready for model training.

Finally, I saved this new processed dataset as 'EDA/data/processed/ChurnModeling_Final'. This means the raw dataset I started with was finally ready for modeling after multiple steps of exploratory data analysis.

Conclusion

I’ve successfully navigated the essential stages of data preprocessing, from setting up my environment to meticulously cleaning, transforming, and scaling my features. What truly elevates this project is how I went beyond traditional Exploratory Data Analysis (EDA) by leveraging the Groq API with a Large Language Model (LLM) for tasks like predicting gender from names.

This innovative approach allowed me to impute missing values with greater accuracy and insight, showcasing a powerful synergy between conventional data science techniques and the cutting-edge capabilities of LLMs.

By diligently handling missing values, identifying and addressing outliers, and applying strategic feature engineering techniques like binning and encoding, I’ve not only prepared my dataset for robust model training but also gained invaluable insights into its underlying structure. Remember, the quality of my data directly impacts the performance of my machine learning models.

A well-preprocessed dataset leads to more accurate predictions, more reliable insights, and ultimately, more impactful solutions. As I move forward, this comprehensive approach to data preparation, now enhanced with LLM integration, will serve as a cornerstone for building high-performing and trustworthy machine learning applications. I’ll keep experimenting, keep refining, and keep building! Stay tuned with Techwiseaid for more hands-on experiences like this.
