In this project, we explore predictive analytics in the context of loan approvals. Our goal is to understand the factors influencing approval decisions and assess how accurately we can predict them. We begin by examining the dataset and uncovering patterns through targeted visualizations. These insights then guide us into building a logistic regression model, allowing us to quantify relationships and evaluate the model’s predictive power.
The dataset was sourced from Kaggle: https://www.kaggle.com/datasets/abhishekmishra08/loan-approval-datasets?resource=download&select=loan_data.csv
Let's import the libraries and read the source file. There are 24,000 rows in the file.
Now, let's explore the dataset and perform preprocessing if needed. We'll check for and remove any duplicate rows or missing values to ensure data quality before moving into analysis.
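The loading and cleaning steps above can be sketched as follows. This is a minimal sketch that uses an inline stand-in for `loan_data.csv`; the column names (`Income`, `Loan_Amount`, `DTI_Ratio`, `Employment_Status`, `Approval`) are assumptions inferred from the narrative, not confirmed from the actual file.

```python
import io
import pandas as pd

# Inline stand-in for the Kaggle file (the real loan_data.csv has 24,000
# rows); column names are assumptions based on the narrative.
csv_data = io.StringIO(
    "Income,Loan_Amount,DTI_Ratio,Employment_Status,Approval\n"
    "52000,15000,0.28,employed,Approved\n"
    "52000,15000,0.28,employed,Approved\n"   # exact duplicate row
    "31000,22000,0.61,unemployed,Rejected\n"
    "47000,,0.35,employed,Rejected\n"        # missing Loan_Amount
)
df = pd.read_csv(csv_data)

print(df.duplicated().sum())   # number of exact duplicate rows
print(df.isna().sum().sum())   # total missing values

# Drop duplicates and any rows with missing values
df = df.drop_duplicates().dropna()
print(df.shape)
```

Running the same `drop_duplicates().dropna()` chain on the real file would shrink it only by however many duplicate or incomplete rows it actually contains.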
Let's see what story each of the visuals above is telling us.
This plot shows loan approval outcomes across employment categories. All approved loans belong to employed individuals, while unemployed applicants receive no approvals at all. It highlights employment status as a strong categorical predictor.
This scatter plot reveals a proportional relationship between income and loan amount. Approved loans tend to cluster within a balanced range, while rejected ones show more dispersion. It suggests financial scale and proportionality influence approval decisions.
This visualization compares the distribution of DTI ratios for approved and rejected loans. Rejected applications skew toward higher DTI values, while approved ones concentrate at lower ratios. It indicates DTI is a key differentiator in approval outcomes.
The boxplot shows that the median DTI ratios for approved and rejected loans are quite similar. However, rejected loans exhibit greater dispersion with numerous outliers, indicating higher variability. This pattern reinforces the significance of DTI as a key factor influencing loan approval decisions.
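The four visuals described above can be reproduced with a sketch like the one below. The data here is synthetic and the column names are assumptions from the narrative; on the real dataset you would pass the cleaned DataFrame instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data; column names are assumptions from the narrative
df = pd.DataFrame({
    "Income": rng.normal(50_000, 12_000, n).clip(10_000),
    "DTI_Ratio": rng.beta(2, 5, n),
    "Employment_Status": rng.choice(["employed", "unemployed"], n, p=[0.8, 0.2]),
})
df["Loan_Amount"] = df["Income"] * rng.uniform(0.2, 0.6, n)
df["Approval"] = np.where(
    (df["Employment_Status"] == "employed") & (df["DTI_Ratio"] < 0.4),
    "Approved", "Rejected",
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1) Approval counts by employment status
counts = df.groupby(["Employment_Status", "Approval"]).size().unstack(fill_value=0)
counts.plot(kind="bar", ax=axes[0, 0], title="Approval by Employment Status")

# 2) Income vs. loan amount, colored by outcome
for outcome, grp in df.groupby("Approval"):
    axes[0, 1].scatter(grp["Income"], grp["Loan_Amount"], s=10, label=outcome)
axes[0, 1].set(title="Income vs. Loan Amount", xlabel="Income", ylabel="Loan Amount")
axes[0, 1].legend()

# 3) DTI distributions for each outcome
for outcome, grp in df.groupby("Approval"):
    axes[1, 0].hist(grp["DTI_Ratio"], bins=20, alpha=0.6, label=outcome)
axes[1, 0].set(title="DTI Ratio Distribution", xlabel="DTI Ratio")
axes[1, 0].legend()

# 4) DTI boxplot by outcome
df.boxplot(column="DTI_Ratio", by="Approval", ax=axes[1, 1])

fig.tight_layout()
fig.savefig("loan_eda.png")
```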
Modeling Approach
Our target variable, loan approval, is binary—either "Approved" or "Rejected"—making logistic regression an ideal choice.
While the dataset includes a Text column describing loan purpose, it contains unstructured data that cannot be meaningfully quantified without natural language processing techniques such as tokenization or sentiment analysis. Including it without proper preprocessing would introduce noise and risk degrading model performance. Logistic regression also requires all input features to be numeric. To meet this requirement, we’ll drop the Text column and apply encoding techniques to convert categorical variables—specifically Approval and Employment_Status—into numerical form. This ensures compatibility with the model.
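The drop-and-encode step can be sketched on a toy frame as follows. The schema and category labels are assumptions from the narrative; since `Approval` and `Employment_Status` are both binary here, a simple `map` to 0/1 suffices (with more categories, one-hot encoding via `pd.get_dummies` would be the usual choice).

```python
import pandas as pd

# Toy frame mirroring the assumed schema of the loan dataset
df = pd.DataFrame({
    "Income": [52_000, 31_000, 47_000, 68_000],
    "DTI_Ratio": [0.28, 0.61, 0.35, 0.22],
    "Employment_Status": ["employed", "unemployed", "employed", "employed"],
    "Text": ["home renovation", "debt consolidation", "car purchase", "medical bills"],
    "Approval": ["Approved", "Rejected", "Rejected", "Approved"],
})

# Drop the unstructured free-text column
df = df.drop(columns=["Text"])

# Binary-encode the target and the employment flag
df["Approval"] = df["Approval"].map({"Rejected": 0, "Approved": 1})
df["Employment_Status"] = df["Employment_Status"].map({"unemployed": 0, "employed": 1})

print(df.dtypes)  # every remaining column is now numeric
```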
To assess how well our logistic regression model generalizes to unseen data, we’ll begin by splitting the dataset into training and test sets. The training set will be used to fit the model, while the test set provides an unbiased evaluation of its predictive performance. Once the model is trained, we’ll generate predictions on the test data and evaluate them using key classification metrics. These include accuracy, precision, recall, and the confusion matrix—all of which help us understand how effectively the model distinguishes between approved and rejected loan applications.
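The split-fit-evaluate workflow described above looks like this in scikit-learn. The features below are synthetic stand-ins for the encoded loan data, so the printed metrics will not match the real model's numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

rng = np.random.default_rng(42)
n = 1_000
# Synthetic features standing in for the encoded loan data
X = np.column_stack([
    rng.normal(50_000, 12_000, n),   # Income
    rng.beta(2, 5, n),               # DTI_Ratio
    rng.integers(0, 2, n),           # Employment_Status (0/1)
])
# Target loosely tied to the features, mimicking the patterns seen in EDA
y = ((X[:, 2] == 1) & (X[:, 1] < 0.4)).astype(int)

# Hold out 20% of the data for an unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1_000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

`stratify=y` keeps the approved/rejected ratio the same in both splits, which matters when the classes are imbalanced.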
Interpretation
Accuracy: The model correctly predicted ~93% of all loan approval outcomes, showing strong overall performance.
Precision: When the model predicts “Approved,” it is correct ~78% of the time, meaning it makes relatively few false approvals (false positives).
Recall: The model captures around 77% of all actual approvals, though it misses some genuine ones (false negatives).
F1 Score: The F1 score of 0.78 shows a good balance between precision and recall, which is important when approval and rejection cases are not equally represented.
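To make the four metrics concrete, here is how each derives from the confusion matrix. The counts below are hypothetical, chosen only to land near the reported figures; they are not the actual model's results.

```python
# Hypothetical confusion-matrix counts (NOT the actual model's output)
tn, fp = 850, 35   # rejected loans: correctly rejected / wrongly approved
fn, tp = 38, 127   # approved loans: missed / correctly identified

total = tn + fp + fn + tp
accuracy = (tp + tn) / total            # share of all correct predictions
precision = tp / (tp + fp)              # of predicted approvals, how many were real
recall = tp / (tp + fn)                 # of real approvals, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both reasonably high, which is why it is the preferred single number under class imbalance.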
Conclusion
The model demonstrates strong overall performance, accurately predicting about 93% of loan approval outcomes. It effectively distinguishes between approved and rejected cases, with a particularly high ability to identify non-approved loans (True Negatives).
While precision (78%) and recall (77%) indicate balanced performance, the few missed approvals (False Negatives) suggest that the model could be slightly improved to better capture borderline approval cases.
Overall, the results show that the model is reliable, well-balanced, and practical for deployment in predicting loan approval decisions, with room for fine-tuning to further enhance recall.