
Constrained LLM Extraction Pipeline

Fail-closed incident normalization: deterministic-first routing, schema-validated Structured Outputs, bounded recovery, and escalation on invalid outputs.

  • Tests: 65 passing
  • Contract: JSON Schema
  • Mode: Fail-closed
  • Path: Deterministic

Key Principle

"The safest LLM output is the one you never request."

This project demonstrates senior engineering judgment by designing around LLM failure modes rather than assuming them away.

Trust Boundary

A record is considered data only after it passes Pydantic strict validation; otherwise it is untrusted text.

Use Case: Security Incident Normalization

Input: Free-form incident reports, chat logs, system alerts

Output: Strict JSON schema with incident type, severity, affected systems, IOCs, and confidence score.
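A normalized record might look like the following; the field names and values here are illustrative, not the project's actual schema:

```json
{
  "incident_type": "credential_theft",
  "severity": "high",
  "affected_systems": ["sso-gateway"],
  "iocs": ["198.51.100.7"],
  "confidence": 0.85
}
```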

Architecture

Raw Input (Incident Report, Chat Log, Alert)
        ↓
┌────────────────────────────────────────┐
│  Pre-Classifier (Regex + Keywords)     │
└────────────────────────────────────────┘
        ↓
   ┌────────┴────────┐
   ↓                 ↓
┌──────────┐   ┌─────────────┐
│ HIGH     │   │ AMBIGUOUS   │
│ CERTAINTY│   │ (LLM Path)  │
└──────────┘   └─────────────┘
   ↓                 ↓
┌──────────┐   ┌─────────────┐
│ Rule-    │   │ LLM Extract │
│ Based    │   │ → Schema    │
│ Extractor│   │   Validate  │
└──────────┘   └─────────────┘
   ↓                 ↓
   │           ┌─────┴─────┐
   │           ↓           ↓
   │     ┌─────────┐ ┌──────────┐
   │     │ VALID   │ │ INVALID  │
   │     └─────────┘ └──────────┘
   │           ↓           ↓
   │           │     ┌──────────┐
   │           │     │ Determin-│
   │           │     │ istic    │
   │           │     │ Repair   │
   │           │     └──────────┘
   │           │           ↓
   │           │     ┌─────┴─────┐
   │           │     ↓           ↓
   │           │ ┌──────┐ ┌───────────┐
   │           │ │VALID │ │ ESCALATE/ │
   │           │ └──────┘ │ FLAG/DROP │
   │           │     ↓    └───────────┘
   └───────────┴─────┴───────→ Output

Core Design Principles

1. Deterministic First, LLM Second

Regex + keyword detection is:

  • Cheaper — zero API cost
  • Faster — sub-millisecond latency
  • Auditable — fully deterministic
  • Zero hallucination risk

Example: If input contains SSN, password reset, OTP → classify as credential_theft without LLM.
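This routing rule can be sketched as follows; the pattern table and function names are illustrative, not the project's actual API:

```python
import re

# Hypothetical pattern table: any hit routes to HIGH_CERTAINTY without an LLM call.
CREDENTIAL_THEFT_PATTERNS = [
    re.compile(r"\bSSN\b", re.IGNORECASE),
    re.compile(r"password\s+reset", re.IGNORECASE),
    re.compile(r"\bOTP\b"),
]

def pre_classify(text: str) -> str:
    """Return 'credential_theft' on a high-certainty pattern hit, else 'ambiguous'."""
    hits = sum(1 for pattern in CREDENTIAL_THEFT_PATTERNS if pattern.search(text))
    return "credential_theft" if hits >= 1 else "ambiguous"
```

Only the "ambiguous" branch ever reaches the LLM; everything else is resolved in-process at regex speed.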

2. Strict Schema Enforcement

LLM output is treated as untrusted text, not data.

from pydantic import BaseModel, ConfigDict

class SecurityIncident(BaseModel):
    model_config = ConfigDict(
        strict=True,       # No type coercion
        extra="forbid",    # Reject extra fields
    )

Rejection criteria:

  • Missing required fields
  • Invalid enum values ("very high" instead of "high")
  • Type mismatches (string "0.5" ≠ float 0.5)
  • Extra keys (LLM invented fields)

3. Deterministic Recovery Paths

Instead of "retry with a better prompt":

  • Strip markdown fences
  • Extract first valid JSON block
  • Fix trailing commas
  • Normalize enum case

Only one retry allowed → prevents infinite cost loops.
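These repairs can be sketched as one bounded, deterministic function (`repair_json` is an illustrative name; enum-case normalization would happen per field after parsing):

```python
import re

def repair_json(raw: str) -> str:
    """Deterministic text repairs before re-validation. No LLM re-ask."""
    s = raw.strip()
    # 1. Strip markdown fences (```json ... ```) wrapping the payload.
    s = re.sub(r"^```(?:json)?\s*|\s*```$", "", s)
    # 2. Keep only the first {...} block, dropping surrounding prose.
    match = re.search(r"\{.*\}", s, re.DOTALL)
    if match:
        s = match.group(0)
    # 3. Remove trailing commas before } or ].
    s = re.sub(r",\s*([}\]])", r"\1", s)
    return s
```

The output is then re-validated against the same strict schema; if it still fails, the record escalates rather than looping.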

4. Confidence Scoring

The model is not permitted to output confidence; confidence is computed deterministically from routing, repairs, and rule hits.

Computed from:

  Signal                        Impact
  Deterministic rule hits       +0.15 per hit (max +0.30)
  LLM bypass (rule-only)        +0.20
  All required fields present   +0.10
  Each repair applied           -0.10
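A sketch of the scoring function these signals imply; the 0.5 base value and the clamp to [0, 1] are assumptions not stated in the table:

```python
def compute_confidence(rule_hits: int, llm_bypassed: bool,
                       all_fields_present: bool, repairs_applied: int) -> float:
    """Deterministic score from pipeline signals; the model never reports its own."""
    score = 0.5  # hypothetical base value
    score += min(0.15 * rule_hits, 0.30)   # deterministic rule hits, capped
    if llm_bypassed:
        score += 0.20                      # rule-only path, no LLM involved
    if all_fields_present:
        score += 0.10
    score -= 0.10 * repairs_applied        # each repair lowers trust
    return max(0.0, min(1.0, score))
```

Because every input is a routing or validation fact, the score is reproducible and auditable, unlike a self-reported model confidence.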

Test Results

65 tests passing across all modules:

  Test Suite                  What It Validates
  test_validator.py           Strict mode rejects type coercion, extra fields, invalid enums
  test_repair.py              Markdown fences stripped, trailing commas fixed, enum case normalized
  test_pre_classifier.py      Credential theft, malware, phishing patterns route to HIGH_CERTAINTY
  test_pipeline.py            Deterministic path bypasses LLM, repair recovers malformed JSON
  test_prompt_injection.py    Adversarial inputs produce valid schema or escalate

Interview FAQ

"Why not just improve the prompt?"

Prompts don't create guarantees. Schema validation does.

"How do you prevent hallucinations?"

You don't—you detect and reject them.

"What happens when extraction fails?"

We fail closed: drop, flag, or escalate—never fabricate.

"What would you do at scale?"

Cache deterministic results, move regex to Rust/Go, batch LLM calls, tighten schema over time.

Project Structure

llm-extraction-pipeline/
├── pyproject.toml
├── README.md
├── src/extraction_pipeline/
│   ├── __init__.py
│   ├── schema.py              # Pydantic v2 strict models
│   ├── pre_classifier.py      # Deterministic routing
│   ├── rule_extractor.py      # High-certainty extraction
│   ├── llm_extractor.py       # LLM wrapper + mock
│   ├── validator.py           # Trust boundary
│   ├── json_repair.py         # Deterministic recovery
│   ├── confidence.py          # Signal-based scoring
│   └── pipeline.py            # Orchestrator
├── tests/
│   ├── conftest.py
│   ├── test_validator.py
│   ├── test_repair.py
│   ├── test_pre_classifier.py
│   ├── test_pipeline.py
│   └── adversarial/
│       └── test_prompt_injection.py
├── data/
│   └── synthetic_incidents.py
└── evaluation/
    ├── harness.py
    └── baseline.py

Known Limitations

  • Single-node Only: Pre-classifier regex runs in-process. Production would use a compiled regex engine or Rust/Go sidecar.
  • No Learning Loop: System does not adapt based on human corrections. Would add online learning for threshold tuning.
  • Mock Evaluation: Baseline comparisons use mock LLM for repeatability. Real-world hallucination rates may differ.