Threshold moving for Logistic Regression to handle imbalanced test data

Current Scenario

  • The test data and training data have the same proportion of imbalance, i.e., far fewer default accounts than non-default accounts (0.01/0.99).
  • SMOTE handles this imbalance before training by resampling to equalize the proportions (0.5/0.5) of the positive (default) and negative (non-default) classes.
  • But when we evaluate the best estimator on the test set (which is still imbalanced), we get poor precision and many false positives (a low-precision, high-recall model).

Improvements

  1. To make the model more robust to the imbalanced data, we can do one of two things:
    • (1) Don't use SMOTE.
    • (2) Use SMOTE, but change the default threshold (0.5):
      • Use the ROC curve to find the optimal threshold.
      • Use the precision-recall curve to find the optimal threshold.
      • Adjust the 0.5 threshold to $\frac{0.5}{k}$, where k is:
\begin{align}
	k & = \frac{\text{Default accounts percentage after SMOTE}}{\text{Default accounts percentage before SMOTE (Training Set)}} 
\end{align}
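The two curve-based choices above can be sketched as follows. This is a minimal illustration on synthetic data standing in for the default/non-default accounts (the dataset, model, and split are assumptions, not the project's actual pipeline): the ROC-based threshold maximizes Youden's J statistic (TPR − FPR), and the PR-based threshold maximizes F1 over the curve's candidate thresholds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data (~1% positives) standing in for the accounts data.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the default (positive) class

# ROC-based choice: threshold that maximizes Youden's J = TPR - FPR.
fpr, tpr, roc_thresh = roc_curve(y_te, proba)
t_roc = roc_thresh[np.argmax(tpr - fpr)]

# PR-based choice: threshold that maximizes F1 along the curve.
prec, rec, pr_thresh = precision_recall_curve(y_te, proba)
f1 = 2 * prec[:-1] * rec[:-1] / np.maximum(prec[:-1] + rec[:-1], 1e-12)
t_pr = pr_thresh[np.argmax(f1)]
```

Note that `precision_recall_curve` returns one more precision/recall value than thresholds, hence the `[:-1]` slicing when pairing them with `pr_thresh`.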
  2. Implement a custom transformer that finds the resampled proportion of the default class after SMOTE is applied by the `imblearn` pipeline, then compute the new threshold.

Note: The custom transformer is no longer required, as this is handled mathematically: the resampled proportion is computed from the sampling strategy. See Adjusted Threshold at a6ab7c6e.
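The mathematical shortcut mentioned in the note can be worked through directly. The sketch below assumes the training set has 1% defaults and that SMOTE is run with `sampling_strategy=1.0` (minority resampled up to parity with the majority); both values are illustrative stand-ins for the project's actual settings.

```python
# Derive the adjusted threshold 0.5/k without a custom transformer.
p_before = 0.01            # default-account proportion before SMOTE (training set)
sampling_strategy = 1.0    # imblearn SMOTE ratio of minority to majority after resampling

# Proportion of defaults after SMOTE, from the sampling strategy alone:
# the minority class grows to sampling_strategy * n_majority samples.
p_after = sampling_strategy / (1 + sampling_strategy)  # 0.5 for full balancing

k = p_after / p_before          # 0.5 / 0.01 = 50.0
adjusted_threshold = 0.5 / k    # 0.5 / 50.0 = 0.01
print(k, adjusted_threshold)    # 50.0 0.01
```

With full balancing, the adjusted threshold simply recovers the original minority proportion, which is why the custom transformer became unnecessary.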

  3. Use the newly computed optimal decision threshold for the logistic regression decision boundary.
    • GridSearchCV does not handle this directly. Instead of using the class labels predicted by the estimator, find the probability of default for each account, compare it with the new threshold, and reassign the class labels.
  4. Compute the confusion matrix and check whether this improves the performance of the model.
  5. Do a comparative analysis of all the methods for choosing the optimal threshold and select one. Justify your decision.
  6. Add a note on the scoring parameter used for measuring model performance.
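The reassignment step and the confusion-matrix check can be sketched together. This is a minimal example on synthetic data; in the project the fitted `GridSearchCV` best estimator and the held-out imbalanced test set would take the place of the stand-in model and data, and `new_threshold` would be whichever of the candidate thresholds was selected (0.3 here is a placeholder, not a recommendation).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and model (~1% positives).
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

new_threshold = 0.3  # placeholder for the chosen optimal threshold

# predict() always cuts at 0.5, so reassign labels from predict_proba manually.
proba = model.predict_proba(X_te)[:, 1]
y_pred = (proba >= new_threshold).astype(int)

print(confusion_matrix(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred, zero_division=0))
print("recall:", recall_score(y_te, y_pred, zero_division=0))
```

Lowering the threshold below 0.5 trades precision for recall; comparing the confusion matrices at the candidate thresholds is exactly the comparative analysis called for in step 5.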
Edited by Amoli Rajgor