Failure-Aware ML System
A production-validated risk engine built around high-recall, failure-aware classification, designed for regulated domains such as credit risk and fraud detection.
System Recall
98.7%
Automation
67.2%
Review Reduction
68%
Scale
1.3M records
Problem & Constraints
Standard Kaggle solutions maximize ROC-AUC but ignore the operational bottleneck: manual review volume. In regulated environments (fintech, fraud), missing a high-risk case (False Negative) is catastrophic. Meanwhile, flagging too many cases for human review makes the system economically unviable.
The challenge: minimize False Negatives while keeping manual review load manageable—solving for both safety and operational cost.
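This two-sided objective can be made concrete as a cost function. The unit costs below are purely illustrative assumptions (the project does not state dollar figures); the point is that a missed high-risk case dominates the cost of a manual review.

```python
# Illustrative cost model (hypothetical unit costs, not from the project):
# a missed high-risk case (False Negative) costs far more than a manual review.
FN_COST = 500.0      # assumed loss per missed high-risk case
REVIEW_COST = 5.0    # assumed cost per manual review

def operating_cost(n_false_negatives: int, n_reviews: int) -> float:
    """Total cost the system is tuned to minimize: safety plus operations."""
    return FN_COST * n_false_negatives + REVIEW_COST * n_reviews

# Missing 10 cases while reviewing only 1,000 records:
aggressive = operating_cost(10, 1_000)    # 5000 + 5000 = 10000.0
# Missing 2 cases at the price of reviewing 10,000 records:
cautious = operating_cost(2, 10_000)      # 1000 + 50000 = 51000.0
```

Tuning thresholds against a cost function like this, rather than ROC-AUC alone, is what keeps the system both safe and economically viable.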
Approach: Cascade Architecture
I realized a single model was trying to learn "easy" and "hard" patterns simultaneously. The solution was a two-stage pipeline:
- Stage 1 (Gatekeeper): A high-recall, explainable Logistic Regression that aggressively filters "Obvious Safe" cases.
  - Lending Club result: filtered 65.6% of volume immediately.
  - IEEE-CIS Fraud result: filtered 96.4% of volume immediately.
- Stage 2 (Specialist): A calibrated XGBoost model trained only on the residual "Hard Cases," which improved the model's ability to separate gray-zone risk.
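The decision flow above can be sketched as follows. The thresholds and the stand-in scoring functions are illustrative placeholders, not the project's fitted models; the point is the routing: cheap Gatekeeper first, expensive Specialist only for hard cases.

```python
import numpy as np

# Illustrative cascade thresholds (not the project's calibrated values).
PASS_THRESHOLD = 0.05   # Stage 1: below this risk score, auto-approve
FLAG_THRESHOLD = 0.60   # Stage 2: above this risk score, auto-flag

def gatekeeper(x: np.ndarray) -> float:
    """Stand-in for the high-recall Logistic Regression (one dot product)."""
    w = np.array([0.8, 0.2])
    return float(1.0 / (1.0 + np.exp(-(x @ w - 2.0))))

def specialist(x: np.ndarray) -> float:
    """Stand-in for the calibrated XGBoost model, run on hard cases only."""
    return float(min(1.0, 0.5 * x.sum()))

def decide(x: np.ndarray) -> str:
    p1 = gatekeeper(x)
    if p1 < PASS_THRESHOLD:
        return "auto-approve"       # cheap path: matrix multiply only
    p2 = specialist(x)              # expensive path: tree traversal
    if p2 >= FLAG_THRESHOLD:
        return "auto-flag"
    return "human-review"           # residual gray zone
```

Because most traffic exits at Stage 1, the heavy model runs on only the residual fraction of records, which is also where the latency win comes from.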
Dynamic Safety Valves
Fraud attacks come in bursts. Static thresholds fail when the environment changes. I implemented Rolling Quantile Thresholding:
- Stable Environment (Lending Club): System detected 0.29% drift. Thresholds remained stable, maximizing throughput.
- Attack Scenario (IEEE-CIS): System detected a 2.05% confidence drop. Dynamic logic automatically tightened the Pass Threshold, prioritizing safety over automation until the shift stabilized.
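A minimal sketch of this safety valve, assuming the valve watches a rolling quantile of recent Gatekeeper scores and tightens (lowers) the Pass Threshold when that quantile drifts past a tolerance relative to the training baseline. All constants here are illustrative, not the project's calibrated values.

```python
from collections import deque

class SafetyValve:
    """Dynamic Pass Threshold driven by rolling-quantile drift detection."""

    def __init__(self, base_threshold=0.05, baseline_q90=0.30,
                 drift_tol=0.02, window=1000, tighten_factor=0.5):
        self.base_threshold = base_threshold
        self.baseline_q90 = baseline_q90      # 90th pct of scores at train time
        self.drift_tol = drift_tol
        self.tighten_factor = tighten_factor
        self.scores = deque(maxlen=window)    # rolling window of recent scores

    def pass_threshold(self, new_score: float) -> float:
        """Record a score, return the Pass Threshold currently in force."""
        self.scores.append(new_score)
        xs = sorted(self.scores)
        q90 = xs[int(0.9 * (len(xs) - 1))]    # rolling 90th-percentile score
        if q90 - self.baseline_q90 > self.drift_tol:
            # Environment shifted toward risk: halve the auto-approve cutoff,
            # trading automation for safety until the shift stabilizes.
            return self.base_threshold * self.tighten_factor
        return self.base_threshold
```

In a stable environment (the Lending Club case) the rolling quantile never breaches the tolerance and throughput is untouched; in a burst (the IEEE-CIS case) the cutoff tightens automatically without any redeploy.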
Performance by Domain
| Dataset | Challenge | Automation | System Recall |
|---|---|---|---|
| Lending Club | Big Data (1.3M rows) | 67.2% | 98.7% |
| IEEE-CIS | Fraud (3.5% Target) | 90.0% | 98.4% |
| Home Credit | Complex (Sub-prime) | 75.5% | 96.6% |
| UCI Credit | Noisy (22% Default) | 29.9% | 97.7% |
Postmortem: What I Learned
The Review Bottleneck
Finding: In the Home Credit dataset, a single model flagged 76% of users for review—operationally unacceptable.
Fix: The Cascade approach reduced this to 24.1% by trusting the Gatekeeper for low-risk applicants. This turns a "broken" system into a deployable one.
The Noisy Dataset Limit
Finding: On UCI Credit, despite best efforts, the review rate remained high (57.7%).
Root Cause: When the base default rate is 22% and features are weak, there is an "Irreducible Error." No model can safely auto-approve more than 30% without taking massive risks. The system correctly identified this limit and refused to guess.
Engineering Highlights
Inference Latency: The architecture processed 1.3 million rows with minimal latency because 65% of inferences used only the lightweight Gatekeeper model (Matrix Multiplication) rather than the heavy XGBoost (Tree Traversal).
Feature Signal: Ratio-based feature engineering (e.g., RATIO_DTI_UTILIZATION) was critical for the Home Credit model to separate sub-prime risk and reach >96% System Recall.
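A hypothetical reconstruction of such a ratio feature is shown below. The project names RATIO_DTI_UTILIZATION but does not publish its formula, so the column names and the definition (debt-to-income interacted with credit utilization) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed input columns; real Home Credit field names differ.
df = pd.DataFrame({
    "monthly_debt":   [500.0, 2000.0, 800.0],
    "monthly_income": [4000.0, 3000.0, 0.0],    # zero income must not divide
    "credit_used":    [1000.0, 9000.0, 500.0],
    "credit_limit":   [10000.0, 10000.0, 1000.0],
})

# Guard the denominator: zero income becomes NaN instead of raising/inf.
dti = df["monthly_debt"] / df["monthly_income"].replace(0, np.nan)
utilization = df["credit_used"] / df["credit_limit"]

# Interaction of two stress signals: high debt load AND high card utilization.
df["RATIO_DTI_UTILIZATION"] = dti * utilization
```

Ratios like this normalize away absolute income scale, which is what lets a model separate genuinely over-extended sub-prime applicants from merely low-income ones.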
Production Roadmap
While this project validates the core decisioning logic on historical data, a live production deployment would require the following infrastructure:
- Global State Management: To handle distributed fraud attacks across multiple regions (e.g., US-East vs. EU-West), I would implement CRDTs (G-Counters) for rate limiting. This ensures eventual consistency for velocity checks without incurring cross-region locking latency.
- Feature Serving Layer: To eliminate training-serving skew, I would migrate the RATIO_ feature definitions to a Feature Store (e.g., Feast). This guarantees that the ratios calculated during batch training match the real-time inference inputs exactly.
- Adaptive Feedback Loop: The current architecture uses batch retraining. In production, I would add an Online Learning sidecar (using River or Vowpal Wabbit) to ingest label feedback from human reviewers immediately, allowing the system to adapt to new fraud vectors within minutes rather than weeks.
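The G-Counter mentioned above is the simplest CRDT and is easy to sketch. This is a minimal illustration of why it suits lock-free cross-region velocity counts; the region names are placeholders.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # Each region only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Element-wise max makes merging commutative, associative, and
        # idempotent: replicas converge regardless of message order or
        # duplication, with no cross-region locks.
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

us_east = GCounter("us-east")
eu_west = GCounter("eu-west")
us_east.increment(3)      # 3 attempts observed in US-East
eu_west.increment(2)      # 2 attempts observed in EU-West
us_east.merge(eu_west)    # gossip merge: eventually consistent global view
print(us_east.value())    # prints 5
```

For velocity checks, eventual consistency is acceptable because a briefly stale count only delays (never loses) a rate-limit trip, which is the trade the roadmap makes to avoid cross-region locking latency.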