Threshold moving for Logistic Regression to handle imbalanced test data
Current Scenario
- Test data and training data have the same proportion of imbalance, i.e. far fewer default accounts than non-default accounts (0.01/0.99).
- SMOTE handles this imbalance before model training by resampling to equalize the proportion of positive (default) vs. negative (non-default) classes (0.5/0.5).
- But when we test the best estimator over the test data-set (still imbalanced), we get poor precision and lots of false positives: a low-precision, high-recall model.
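A minimal sketch of this phenomenon on synthetic data (all names and numbers here are illustrative, not from the real accounts dataset); random duplication of minority samples stands in for SMOTE's interpolation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical data with ~1% defaults, mirroring the 0.01/0.99 split above.
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# Stand-in for SMOTE: random oversampling of the minority class to 0.5/0.5.
# (imblearn's SMOTE interpolates new samples; duplication is the simplest proxy.)
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
resampled = rng.choice(minority, size=len(majority), replace=True)
idx = np.concatenate([majority, resampled])
X_res, y_res = X_train[idx], y_train[idx]

# Train on the balanced set, evaluate on the still-imbalanced test set.
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
y_pred = model.predict(X_test)  # uses the default 0.5 threshold
p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
print(f"precision={p:.2f} recall={r:.2f}")
```

On data like this, recall comes out much higher than precision: the balanced training set makes the model far too eager to flag defaults at the 0.5 threshold.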
Improvement
- To make the model more robust to imbalanced data, we can do one of two things:
  - (1) Don't use SMOTE.
  - (2) Use SMOTE, but change the default threshold (0.5):
    - Use the ROC curve to find the optimal threshold.
    - Use the precision-recall curve to find the optimal threshold.
    - Adjust the 0.5 threshold to $\frac{0.5}{k}$, where $k$ is:
\begin{align}
k & = \frac{\text{Default accounts percentage after SMOTE}}{\text{Default accounts percentage before SMOTE (Training Set)}}
\end{align}
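A worked example with the proportions assumed in this scenario (1% defaults before SMOTE, 50% after full equalization):

```python
# Worked example of the k adjustment with this scenario's proportions.
p_after = 0.5     # default-class share after SMOTE equalizes to 0.5/0.5
p_before = 0.01   # default-class share in the original training set
k = p_after / p_before    # = 50
new_threshold = 0.5 / k   # = 0.01: predict "default" when P(default) >= 0.01
print(k, new_threshold)
```

Intuitively, SMOTE inflated the prior probability of default by a factor of k, so the decision threshold is deflated by the same factor to compensate.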
- Implement a custom transformer that finds the resampled proportion of the default class after SMOTE is applied by the `imblearn` pipeline, then compute the new threshold.
  Note: The custom transformer is no longer required, as this is handled mathematically; the resampled proportion is computed from the sampling strategy. See Adjusted Threshold at a6ab7c6e.
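The mathematical shortcut can be sketched as below; it assumes `imblearn`'s float convention, where `sampling_strategy=s` makes the minority count `s` times the majority count after resampling:

```python
def minority_share_after(s: float) -> float:
    """Minority-class share after resampling with sampling_strategy=s.

    With N_maj majority samples, oversampling yields s * N_maj minority
    samples, so the minority share is s*N_maj / (N_maj + s*N_maj) = s/(1+s).
    """
    return s / (1.0 + s)

print(minority_share_after(1.0))  # full balancing -> 0.5
print(minority_share_after(0.5))  # half as many minority as majority -> 1/3
```

This gives the numerator of k directly from the pipeline configuration, with no need to inspect the resampled arrays at fit time.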
- Use the newly computed optimal decision threshold for the logistic regression decision boundary.
  - `GridSearchCV` does not handle this directly. Instead of using the class labels given by `score`, find the probability of default for each account, compare it with the new threshold, and reassign class labels.
- Compute the confusion matrix and check whether this improves the performance of the model.
- Do a comparative analysis of all the methods for choosing the optimal threshold and select one. Justify your decision.
- Add a note on the scoring parameter for model performance measurement.
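The re-thresholding and comparison steps above can be sketched end-to-end. This is a sketch on synthetic data: class weighting stands in for the SMOTE pipeline, and thresholds are searched on the test split only for brevity (in practice, pick them on a separate validation split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_recall_curve, roc_curve)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the accounts dataset.
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000,
                           class_weight="balanced").fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # P(default) for each account

def labels_at(threshold):
    # Reassign class labels from probabilities instead of model.predict().
    return (proba >= threshold).astype(int)

# (1) ROC curve: maximize Youden's J = TPR - FPR.
fpr, tpr, roc_th = roc_curve(y_te, proba)
t_roc = roc_th[np.argmax(tpr - fpr)]

# (2) Precision-recall curve: maximize F1 over candidate thresholds.
prec, rec, pr_th = precision_recall_curve(y_te, proba)
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
t_pr = pr_th[np.argmax(f1)]

# (3) Resampling-ratio adjustment: 0.5 / k with k = 0.5 / 0.01.
t_adj = 0.5 / (0.5 / 0.01)

# Comparative analysis: confusion matrix and F1 for each candidate.
for name, t in [("default 0.5", 0.5), ("ROC", t_roc),
                ("PR", t_pr), ("adjusted", t_adj)]:
    cm = confusion_matrix(y_te, labels_at(t))
    print(f"{name:12s} t={t:.3f} F1={f1_score(y_te, labels_at(t)):.2f}")
    print(cm)
```

The same `labels_at` step is what replaces `GridSearchCV`'s built-in scoring: thresholded labels are recomputed from `predict_proba` and only then fed to the metric of choice.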
Edited by Amoli Rajgor