Failure-Aware ML System
A production-validated risk engine built around high-recall, failure-aware classification, designed for regulated domains such as credit risk and fraud detection.
System Recall
98.7%
Automation
67.2%
Review Reduction
68%
Scale
1.3M records
Problem & Constraints
Standard Kaggle solutions maximize ROC-AUC but ignore the operational bottleneck: manual review volume. In regulated environments (fintech, fraud), missing a high-risk case (False Negative) is catastrophic. Meanwhile, flagging too many cases for human review makes the system economically unviable.
The challenge: minimize False Negatives while keeping manual review load manageable—solving for both safety and operational cost.
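This two-sided objective can be made concrete as a cost function. The unit costs below are purely illustrative assumptions (the project does not state dollar figures); the point is that a missed high-risk case dominates the cost of a manual review.

```python
# Illustrative cost model (hypothetical unit costs, not from the project):
# a missed high-risk case (False Negative) costs far more than a manual review.
FN_COST = 500.0      # assumed loss per missed high-risk case
REVIEW_COST = 5.0    # assumed cost per manual review

def operating_cost(n_false_negatives: int, n_reviews: int) -> float:
    """Total cost the system is tuned to minimize: safety plus operations."""
    return FN_COST * n_false_negatives + REVIEW_COST * n_reviews

# Missing 10 cases while reviewing only 1,000 records:
aggressive = operating_cost(10, 1_000)    # 5000 + 5000 = 10000.0
# Missing 2 cases at the price of reviewing 10,000 records:
cautious = operating_cost(2, 10_000)      # 1000 + 50000 = 51000.0
```

Tuning thresholds against a cost function like this, rather than ROC-AUC alone, is what keeps the system both safe and economically viable.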
Approach: Cascade Architecture
I realized a single model was trying to learn "easy" and "hard" patterns simultaneously. The solution was a two-stage pipeline:
- Stage 1 (Gatekeeper): A high-recall, explainable Logistic Regression that aggressively filters "Obvious Safe" cases.
  - Lending Club result: filtered 65.6% of volume immediately.
  - IEEE-CIS Fraud result: filtered 96.4% of volume immediately.
- Stage 2 (Specialist): A calibrated XGBoost model trained only on the residual "Hard Cases," which improved the model's ability to separate gray-zone risk.
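The decision flow above can be sketched as follows. The thresholds and the stand-in scoring functions are illustrative placeholders, not the project's fitted models; the point is the routing: cheap Gatekeeper first, expensive Specialist only for hard cases.

```python
import numpy as np

# Illustrative cascade thresholds (not the project's calibrated values).
PASS_THRESHOLD = 0.05   # Stage 1: below this risk score, auto-approve
FLAG_THRESHOLD = 0.60   # Stage 2: above this risk score, auto-flag

def gatekeeper(x: np.ndarray) -> float:
    """Stand-in for the high-recall Logistic Regression (one dot product)."""
    w = np.array([0.8, 0.2])
    return float(1.0 / (1.0 + np.exp(-(x @ w - 2.0))))

def specialist(x: np.ndarray) -> float:
    """Stand-in for the calibrated XGBoost model, run on hard cases only."""
    return float(min(1.0, 0.5 * x.sum()))

def decide(x: np.ndarray) -> str:
    p1 = gatekeeper(x)
    if p1 < PASS_THRESHOLD:
        return "auto-approve"       # cheap path: matrix multiply only
    p2 = specialist(x)              # expensive path: tree traversal
    if p2 >= FLAG_THRESHOLD:
        return "auto-flag"
    return "human-review"           # residual gray zone
```

Because most traffic exits at Stage 1, the heavy model runs on only the residual fraction of records, which is also where the latency win comes from.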
Dynamic Safety Valves
Fraud attacks come in bursts. Static thresholds fail when the environment changes. I implemented Rolling Quantile Thresholding:
- Stable Environment (Lending Club): System detected 0.29% drift. Thresholds remained stable, maximizing throughput.
- Attack Scenario (IEEE-CIS): System detected a 2.05% confidence drop. Dynamic logic automatically tightened the Pass Threshold, prioritizing safety over automation until the shift stabilized.
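A minimal sketch of this safety valve, assuming the valve watches a rolling quantile of recent Gatekeeper scores and tightens (lowers) the Pass Threshold when that quantile drifts past a tolerance relative to the training baseline. All constants here are illustrative, not the project's calibrated values.

```python
from collections import deque

class SafetyValve:
    """Dynamic Pass Threshold driven by rolling-quantile drift detection."""

    def __init__(self, base_threshold=0.05, baseline_q90=0.30,
                 drift_tol=0.02, window=1000, tighten_factor=0.5):
        self.base_threshold = base_threshold
        self.baseline_q90 = baseline_q90      # 90th pct of scores at train time
        self.drift_tol = drift_tol
        self.tighten_factor = tighten_factor
        self.scores = deque(maxlen=window)    # rolling window of recent scores

    def pass_threshold(self, new_score: float) -> float:
        """Record a score, return the Pass Threshold currently in force."""
        self.scores.append(new_score)
        xs = sorted(self.scores)
        q90 = xs[int(0.9 * (len(xs) - 1))]    # rolling 90th-percentile score
        if q90 - self.baseline_q90 > self.drift_tol:
            # Environment shifted toward risk: halve the auto-approve cutoff,
            # trading automation for safety until the shift stabilizes.
            return self.base_threshold * self.tighten_factor
        return self.base_threshold
```

In a stable environment (the Lending Club case) the rolling quantile never breaches the tolerance and throughput is untouched; in a burst (the IEEE-CIS case) the cutoff tightens automatically without any redeploy.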
Performance by Domain
| Dataset | Challenge | Automation | System Recall |
|---|---|---|---|
| Lending Club | Big Data (1.3M rows) | 67.2% | 98.7% |
| IEEE-CIS | Fraud (3.5% Target) | 90.0% | 98.4% |
| Home Credit | Complex (Sub-prime) | 75.5% | 96.6% |
| UCI Credit | Noisy (22% Default) | 29.9% | 97.7% |
Postmortem: What I Learned
The Review Bottleneck
Finding: In the Home Credit dataset, a single model flagged 76% of users for review—operationally unacceptable.
Fix: The Cascade approach reduced this to 24.1% by trusting the Gatekeeper for low-risk applicants. This turns a "broken" system into a deployable one.
The Noisy Dataset Limit
Finding: On UCI Credit, despite best efforts, the review rate remained high (57.7%).
Root Cause: When the base default rate is 22% and features are weak, there is an "Irreducible Error." No model can safely auto-approve more than 30% without taking massive risks. The system correctly identified this limit and refused to guess.
Engineering Highlights
Inference Latency: The architecture processed 1.3 million rows with minimal latency because 65% of inferences used only the lightweight Gatekeeper model (Matrix Multiplication) rather than the heavy XGBoost (Tree Traversal).
Feature Signal: Ratio-based feature engineering (e.g., RATIO_DTI_UTILIZATION) was critical for the Home Credit model to separate sub-prime risk and reach >96% System Recall.
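A hypothetical reconstruction of such a ratio feature is shown below. The project names RATIO_DTI_UTILIZATION but does not publish its formula, so the column names and the definition (debt-to-income interacted with credit utilization) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed input columns; real Home Credit field names differ.
df = pd.DataFrame({
    "monthly_debt":   [500.0, 2000.0, 800.0],
    "monthly_income": [4000.0, 3000.0, 0.0],    # zero income must not divide
    "credit_used":    [1000.0, 9000.0, 500.0],
    "credit_limit":   [10000.0, 10000.0, 1000.0],
})

# Guard the denominator: zero income becomes NaN instead of raising/inf.
dti = df["monthly_debt"] / df["monthly_income"].replace(0, np.nan)
utilization = df["credit_used"] / df["credit_limit"]

# Interaction of two stress signals: high debt load AND high card utilization.
df["RATIO_DTI_UTILIZATION"] = dti * utilization
```

Ratios like this normalize away absolute income scale, which is what lets a model separate genuinely over-extended sub-prime applicants from merely low-income ones.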
Production Roadmap
While this project validates the core decisioning logic on historical data, a live production deployment would require the following infrastructure:
- Global State Management: To handle distributed fraud attacks across multiple regions (e.g., US-East vs. EU-West), I would implement CRDTs (G-Counters) for rate limiting. This ensures eventual consistency for velocity checks without incurring cross-region locking latency.
- Feature Serving Layer: To eliminate training-serving skew, I would migrate the RATIO_ feature definitions to a Feature Store (e.g., Feast). This guarantees that the ratios calculated during batch training match the real-time inference inputs exactly.
- Adaptive Feedback Loop: The current architecture uses batch retraining. In production, I would add an Online Learning sidecar (using River or Vowpal Wabbit) to ingest label feedback from human reviewers immediately, allowing the system to adapt to new fraud vectors within minutes rather than weeks.
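The G-Counter mentioned above is the simplest CRDT and is easy to sketch. This is a minimal illustration of why it suits lock-free cross-region velocity counts; the region names are placeholders.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # Each region only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Element-wise max makes merging commutative, associative, and
        # idempotent: replicas converge regardless of message order or
        # duplication, with no cross-region locks.
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

us_east = GCounter("us-east")
eu_west = GCounter("eu-west")
us_east.increment(3)      # 3 attempts observed in US-East
eu_west.increment(2)      # 2 attempts observed in EU-West
us_east.merge(eu_west)    # gossip merge: eventually consistent global view
print(us_east.value())    # prints 5
```

For velocity checks, eventual consistency is acceptable because a briefly stale count only delays (never loses) a rate-limit trip, which is the trade the roadmap makes to avoid cross-region locking latency.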