Constrained LLM Extraction Pipeline
Fail-closed incident normalization: deterministic-first routing, schema-validated Structured Outputs, bounded recovery, and escalation on invalid outputs.
Tests: 65 passing | Contract: JSON Schema | Mode: Fail-closed | Path: Deterministic-first
Key Principle
"The safest LLM output is the one you never request."
This project demonstrates senior engineering judgment by designing around LLM failure modes rather than assuming them away.
Trust Boundary
A record is considered data only after it passes Pydantic strict validation; otherwise it is untrusted text.
Use Case: Security Incident Normalization
Input: Free-form incident reports, chat logs, system alerts
Output: Strict JSON schema with incident type, severity, affected systems, IOCs, and confidence score.
Architecture
Raw Input (Incident Report, Chat Log, Alert)
↓
┌────────────────────────────────────────┐
│ Pre-Classifier (Regex + Keywords) │
└────────────────────────────────────────┘
↓
┌────────┴────────┐
↓ ↓
┌──────────┐ ┌─────────────┐
│ HIGH │ │ AMBIGUOUS │
│ CERTAINTY│ │ (LLM Path) │
└──────────┘ └─────────────┘
↓ ↓
┌──────────┐ ┌─────────────┐
│ Rule- │ │ LLM Extract │
│ Based │ │ → Schema │
│ Extractor│ │ Validate │
└──────────┘ └─────────────┘
↓ ↓
│ ┌─────┴─────┐
│ ↓ ↓
│ ┌─────────┐ ┌──────────┐
│ │ VALID │ │ INVALID │
│ └─────────┘ └──────────┘
│ ↓ ↓
│ │ ┌──────────┐
│ │ │ Determin-│
│ │ │ istic │
│ │ │ Repair │
│ │ └──────────┘
│ │ ↓
│ │ ┌─────┴─────┐
│ │ ↓ ↓
│ │ ┌──────┐ ┌───────────┐
│ │ │VALID │ │ ESCALATE/ │
│ │ └──────┘ │ FLAG/DROP │
│ │ ↓ └───────────┘
└───────────┴─────┴───────→ Output

Core Design Principles
1. Deterministic First, LLM Second
Regex + keyword detection is:
- Cheaper — zero API cost
- Faster — sub-millisecond latency
- Auditable — fully deterministic
- Zero hallucination risk
Example: If the input contains "SSN", "password reset", or "OTP" → classify as credential_theft without calling the LLM.
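The routing rule above can be sketched as follows. The pattern list, threshold, and function names are illustrative assumptions, not the project's actual rule set:

```python
import re

# Hypothetical keyword patterns for one incident class; the real
# pre_classifier.py would carry a broader, curated rule set.
CREDENTIAL_THEFT_PATTERNS = [
    re.compile(r"\bSSN\b", re.IGNORECASE),
    re.compile(r"\bpassword reset\b", re.IGNORECASE),
    re.compile(r"\bOTP\b"),
]

def pre_classify(text: str) -> str:
    """Return 'credential_theft' on a keyword hit, else route to the LLM path."""
    hits = sum(1 for p in CREDENTIAL_THEFT_PATTERNS if p.search(text))
    return "credential_theft" if hits >= 1 else "ambiguous"
```

Because the patterns are compiled once and run in-process, this path stays sub-millisecond and fully auditable.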
2. Strict Schema Enforcement
LLM output is treated as untrusted text, not data.
```python
from pydantic import BaseModel, ConfigDict

class SecurityIncident(BaseModel):
    model_config = ConfigDict(
        strict=True,      # No type coercion
        extra="forbid",   # Reject extra fields
    )
```

Rejection criteria:
- Missing required fields
- Invalid enum values ("very high" instead of "high")
- Type mismatches (string "0.5" ≠ float 0.5)
- Extra keys (LLM-invented fields)
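The project enforces these criteria with Pydantic v2 strict mode; the same trust boundary can be sketched dependency-free. Field names and the severity enum here are illustrative assumptions:

```python
# Minimal stand-in for the trust boundary; mirrors the rejection rules
# that Pydantic's strict=True / extra="forbid" config enforces.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed enum
REQUIRED = {"incident_type", "severity", "confidence"}      # assumed fields

def validate_incident(payload: dict) -> bool:
    if set(payload) != REQUIRED:                        # missing or extra keys
        return False
    if payload["severity"] not in ALLOWED_SEVERITIES:   # invalid enum value
        return False
    if not isinstance(payload["confidence"], float):    # no "0.5" -> 0.5 coercion
        return False
    return True
```

Anything that fails this check remains untrusted text and never enters the data path.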
3. Deterministic Recovery Paths
Instead of "retry with a better prompt":
- Strip markdown fences
- Extract first valid JSON block
- Fix trailing commas
- Normalize enum case
Only one retry allowed → prevents infinite cost loops.
4. Confidence Scoring
The model is not permitted to output confidence; confidence is computed deterministically from routing, repairs, and rule hits.
Computed from:
| Signal | Impact |
|---|---|
| Deterministic rule hits | +0.15 per hit (max +0.30) |
| LLM bypass (rule-only) | +0.20 |
| All required fields present | +0.10 |
| Each repair applied | -0.10 |
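The table translates directly into a pure function. The 0.5 base score and the final clamp to [0, 1] are assumptions not stated above:

```python
def compute_confidence(rule_hits: int, llm_bypassed: bool,
                       all_fields_present: bool, repairs_applied: int,
                       base: float = 0.5) -> float:
    """Deterministic confidence: never emitted by the model itself."""
    score = base
    score += min(rule_hits * 0.15, 0.30)  # +0.15 per rule hit, capped at +0.30
    if llm_bypassed:
        score += 0.20                     # rule-only path, no LLM involved
    if all_fields_present:
        score += 0.10
    score -= repairs_applied * 0.10       # each repair lowers trust
    return max(0.0, min(1.0, round(score, 2)))
```

Because every input signal is observable from the pipeline itself, the score is auditable and cannot be inflated by the model.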
Test Results
65 tests passing across all modules:
| Test Suite | What It Validates |
|---|---|
| `test_validator.py` | Strict mode rejects type coercion, extra fields, invalid enums |
| `test_repair.py` | Markdown fences stripped, trailing commas fixed, enum case normalized |
| `test_pre_classifier.py` | Credential theft, malware, phishing patterns route to HIGH_CERTAINTY |
| `test_pipeline.py` | Deterministic path bypasses LLM, repair recovers malformed JSON |
| `test_prompt_injection.py` | Adversarial inputs produce valid schema or escalate |
Interview FAQ
"Why not just improve the prompt?"
Prompts don't create guarantees. Schema validation does.
"How do you prevent hallucinations?"
You don't—you detect and reject them.
"What happens when extraction fails?"
We fail closed: drop, flag, or escalate—never fabricate.
"What would you do at scale?"
Cache deterministic results, move regex to Rust/Go, batch LLM calls, tighten schema over time.
Project Structure
llm-extraction-pipeline/
├── pyproject.toml
├── README.md
├── src/extraction_pipeline/
│ ├── __init__.py
│ ├── schema.py # Pydantic v2 strict models
│ ├── pre_classifier.py # Deterministic routing
│ ├── rule_extractor.py # High-certainty extraction
│ ├── llm_extractor.py # LLM wrapper + mock
│ ├── validator.py # Trust boundary
│ ├── json_repair.py # Deterministic recovery
│ ├── confidence.py # Signal-based scoring
│ └── pipeline.py # Orchestrator
├── tests/
│ ├── conftest.py
│ ├── test_validator.py
│ ├── test_repair.py
│ ├── test_pre_classifier.py
│ ├── test_pipeline.py
│ └── adversarial/
│ └── test_prompt_injection.py
├── data/
│ └── synthetic_incidents.py
└── evaluation/
├── harness.py
    └── baseline.py

Known Limitations
- Single-node Only: Pre-classifier regex runs in-process. Production would use a compiled regex engine or Rust/Go sidecar.
- No Learning Loop: System does not adapt based on human corrections. Would add online learning for threshold tuning.
- Mock Evaluation: Baseline comparisons use mock LLM for repeatability. Real-world hallucination rates may differ.