Structured Data Extraction with LLMs: How to Build a Pipeline
Build a reliable, production-ready structured data extraction pipeline using LLMs, LangChain, and OpenAI function calling: JSON schemas, deterministic outputs, and no hallucinated fields.
High-quality structured data unlocks downstream analytics, automation, and search. Here's the thing though - if you're dealing with messy text and need clean JSON, you don't need training data or brittle regex patterns anymore. Let me show you exactly how I built a deterministic extraction pipeline using LLMs, OpenAI function calling, LangChain, and Pydantic. This approach works reliably, and you can run it right in Colab.
Actually, before we dive in, I should mention something that bit me a few times: if you're dealing with unpredictable input, understanding tokenization pitfalls and invisible characters can save you hours of debugging. These subtle extraction bugs drove me absolutely crazy until I figured out what was happening.

Why This Approach Works
Function Calling Enforces Structure
OpenAI function calling is honestly a game-changer. I was skeptical at first, but it really does force the model to return JSON that matches your exact schema - no free-form text, no invented fields. The model outputs only the fields you define, period.
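To make that concrete, here's roughly what a function definition looks like for the event schema we'll build later in this post. This is a simplified, hand-written sketch of the legacy "functions" format, not the exact spec the tooling generates (the real one also carries your field descriptions):

# Simplified sketch of an OpenAI function definition (legacy "functions" format).
# The spec we generate later in the post also includes field descriptions.
event_extraction_spec = {
    "name": "Extracted",
    "description": "Wrapper model for a list of extracted events.",
    "parameters": {
        "type": "object",
        "properties": {
            "events": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "date": {"type": "string"},
                        "outcome": {"type": "string"},
                    },
                    "required": ["name"],
                },
            },
        },
    },
}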
Pydantic Validates At The Boundary
I've come to really appreciate Pydantic models over the past year or so. They enforce types, required fields, and constraints at runtime. When something's wrong, you get clear error messages immediately - not three steps down your pipeline where everything explodes mysteriously. Your downstream systems receive clean data, or they don't receive anything at all. And honestly? That's exactly what you want.
LangChain Orchestrates Composable Pipelines
Here's what makes LangChain particularly useful in practice: you can combine prompts, models, and parsers into testable, reusable pipelines. Need to swap models? Extend schemas? Add retry logic? You can do all that without touching your core extraction logic. I learned this the hard way after building my first extraction system without it.
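To make that composability concrete, here's a minimal sketch that assumes the prompt, function spec, and parser we define later in this post. Swapping the model or bolting on retries is a one-line change, and the schema and extraction logic stay untouched:

from langchain_openai import ChatOpenAI

# Same prompt, same function spec, same parser - only the model changes.
alt_model = ChatOpenAI(model="gpt-4o", temperature=0).bind(functions=functions)
alt_chain = prompt | alt_model | events_only_parser

# Retries are a wrapper around the chain, not a rewrite of it.
resilient_chain = alt_chain.with_retry(stop_after_attempt=3)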
Deterministic Output With Temperature Zero
Setting temperature to zero removes sampling randomness, so the same input gives you the same output in practice (OpenAI doesn't promise bit-for-bit reproducibility, but at temperature zero drift is rare). This makes extraction predictable and, more importantly, testable. No more "it worked yesterday" mysteries.
How It Works
Define Schema: Use Pydantic models to specify your data structure (event name, date, outcome - whatever you need).
Convert To Function Spec: Transform that Pydantic model into an OpenAI function definition.
Bind Function To Model: Attach the function spec to the LLM so it knows to return structured JSON.
Create Prompt: Write a strict system prompt that tells the model to extract only explicit information.
Build Chain: Compose everything - prompt, model, and output parser - into a LangChain pipeline.
Invoke And Validate: Run the chain on your input text and validate the output with Pydantic.
Setup & Installation
First thing - run this cell at the top of your Colab notebook to install all the dependencies you'll need:
!pip install -U "langchain>=0.2" "langchain-openai>=0.1" "langchain-community>=0.2" "langchain-text-splitters>=0.0.1" pydantic python-dotenv beautifulsoup4 html2text
And don't forget to set your OpenAI API key as an environment variable before running anything. I can't tell you how many times I've forgotten this step:
import os

required_keys = ["OPENAI_API_KEY"]
missing = [k for k in required_keys if not os.getenv(k)]
if missing:
    raise EnvironmentError(
        f"Missing required environment variables: {', '.join(missing)}\n"
        "Please set them before running the notebook. Example:\n"
        "  export OPENAI_API_KEY='your-key-here'"
    )
print("All required API keys found.")
Step-by-Step Implementation
Step 1: Initialize The LLM
Let's start by loading environment variables and initializing the OpenAI model. Notice we're setting temperature to zero for deterministic output - this is crucial:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Pull OPENAI_API_KEY from a local .env file if one is present
load_dotenv()
# Use gpt-4o-mini for cost-effective, fast extraction with function calling support
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
print("LLM ready:", llm)
Step 2: Define Pydantic Models
Now we create Pydantic models to define our extracted event structure. The field descriptions are actually important here - they guide the LLM more than you might think:
from typing import List, Optional
from pydantic import BaseModel, Field

class Event(BaseModel):
    """
    Represents a single event extracted from text.
    """
    name: str = Field(..., description="The explicit event name or title extracted verbatim from the text.")
    date: Optional[str] = Field(None, description="The explicit date as written in the text (ISO if present, else raw).")
    outcome: Optional[str] = Field(None, description="The explicit outcome/result stated in the text, if any.")

class Extracted(BaseModel):
    """
    Wrapper model for a list of extracted events.
    """
    events: List[Event] = Field(default_factory=list, description="All events explicitly mentioned in the text.")
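Before wiring these models into a chain, it's worth seeing the boundary in action: hand Extracted a payload that's missing a required field and Pydantic rejects it on the spot. A quick sanity check, not part of the pipeline itself:

from pydantic import ValidationError

# A well-formed payload validates cleanly.
ok = Extracted.model_validate({"events": [{"name": "Launch Day"}]})
print(ok.events[0].name)

# "name" is required, so this payload is rejected at the boundary.
try:
    Extracted.model_validate({"events": [{"date": "June 15th"}]})
except ValidationError as err:
    print(err)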
Step 3: Convert Schema To OpenAI Function Spec
This step transforms your Pydantic model into something OpenAI understands - a function definition that tells the model to return structured JSON:
from langchain_core.utils.function_calling import convert_to_openai_function
extract_fn = convert_to_openai_function(Extracted)
functions = [extract_fn]
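It's worth printing the generated spec once to see what the model actually receives - the field descriptions from your Pydantic models end up in here, which is why they matter:

import json

# Inspect the JSON schema that will be sent to the model
print(json.dumps(extract_fn, indent=2))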
Step 4: Create A Strict System Prompt
The prompt is crucial - and I mean really crucial. You need to be explicit about extracting only what's actually in the text. No hallucination, no creative interpretation:
from langchain_core.prompts import ChatPromptTemplate
SYSTEM_PROMPT = """You are a precise information extractor.
- Extract only information explicitly present in the text.
- Do not infer, guess, or add missing details.
- If a field is not explicitly present, set it to null.
- If no events are present, return an empty list.
- Preserve original wording where reasonable."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "{text}"),
    ]
)
Step 5: Bind Function To Model And Set Up Parsers
Here we bind the function spec to the LLM and configure our output parsers to extract structured data:
from langchain_core.output_parsers.openai_functions import (
    JsonKeyOutputFunctionsParser,
    JsonOutputFunctionsParser,
)
# Bind the function spec to the LLM
model_with_fn = llm.bind(functions=functions)
# Parser that returns only the "events" key from the function arguments
events_only_parser = JsonKeyOutputFunctionsParser(key_name="events")
# Parser that returns the entire function arguments payload
full_args_parser = JsonOutputFunctionsParser()
Step 6: Build LangChain Pipelines
Now let's compose everything into reusable chains. This is where it all comes together:
# Chain that returns only the list of events
events_chain = prompt | model_with_fn | events_only_parser
# Chain that returns the full structured payload for validation
full_chain = prompt | model_with_fn | full_args_parser
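One debugging tip before we run anything: if you ever need to see what the model returns before parsing, invoke the prompt and bound model without a parser. With the legacy functions API, the raw arguments arrive as a JSON string inside the message's additional_kwargs:

# Peek at the raw function call the model produced (pre-parsing)
raw_msg = (prompt | model_with_fn).invoke({"text": "I attended a music festival on June 15th."})
print(raw_msg.additional_kwargs.get("function_call"))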
Run And Validate
Basic Extraction
Let's test the pipeline on something simple first. I always start with the easiest case:
text = "I attended a music festival on June 15th and a tech conference on July 20th."
events = events_chain.invoke({"text": text})
print(events)
Validate With Pydantic
Use the full chain to extract and validate with Pydantic:
payload = full_chain.invoke({"text": text})
validated = Extracted.model_validate(payload)
print(validated)
for ev in validated.events:
    print(ev.name, ev.date, ev.outcome)
Edge Case: No Events
Always verify your extractor returns an empty list when there's nothing to extract. This one caught me off guard in a previous project:
text_irrelevant = "This is irrelevant text with no events."
empty = events_chain.invoke({"text": text_irrelevant})
print(empty)
assert empty == [], "Extractor should not hallucinate events."
Edge Case: Missing Fields
What happens when some fields are missing? Let's test that:
text_partial = "Our team hosted Launch Day and later Demo Night."
partial = events_chain.invoke({"text": text_partial})
print(partial)
Determinism Check
This is important - run the same input multiple times and make sure you get identical outputs. If you don't, something's wrong:
same1 = events_chain.invoke({"text": text})
same2 = events_chain.invoke({"text": text})
assert same1 == same2, "Outputs should be identical with temperature=0."
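One honest caveat: temperature zero removes sampling randomness, but OpenAI doesn't guarantee bit-for-bit reproducibility across calls. If you want an extra nudge in that direction, ChatOpenAI also accepts a seed parameter, which the API honors on a best-effort basis:

from langchain_openai import ChatOpenAI

# Best-effort reproducibility: temperature 0 plus a fixed seed
llm_seeded = ChatOpenAI(model="gpt-4o-mini", temperature=0, seed=42)
seeded_chain = prompt | llm_seeded.bind(functions=functions) | events_only_parser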
Real-World Data Extraction
Now for something more interesting - let's load a Wikipedia page and extract events from actual content. This is where things get fun:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Apollo_program")
docs = loader.load()
page_text = docs[0].page_content[:10000]
real_events = events_chain.invoke({"text": page_text})
print(f"Extracted {len(real_events)} events")
for e in real_events[:5]:
    print(e)
Chunking Long Documents
For longer documents, you'll need to split text into overlapping chunks, extract from each chunk, then merge and deduplicate events. One aside before we get to the code: if you notice models missing details or hallucinating as context grows, our deep dive on context rot and LLM memory limits explains exactly why this happens and how to work around it. Here's the basic approach:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=400)
chunks = splitter.split_text(docs[0].page_content)

all_events = []
for ch in chunks:
    all_events.extend(events_chain.invoke({"text": ch}))

# Deduplicate by (name, date) tuple
unique = {(e["name"], e.get("date")): e for e in all_events}
merged_events = list(unique.values())
print(f"Merged events: {len(merged_events)}")
Constraints And Performance
Token Limits: gpt-4o-mini supports up to 128k tokens input. But here's the thing - for documents over 4k characters, you should chunk the text anyway. I learned this after watching extraction quality degrade on longer documents.
Cost: gpt-4o-mini runs about $0.15 per 1M input tokens and $0.60 per 1M output tokens. A typical extraction call uses somewhere between 500 and 2,000 tokens. Not bad, honestly. (A rough estimator sketch follows below.)
Latency: Expect 1-3 seconds per extraction call. This varies based on input size and API load. Sometimes it's faster, sometimes slower - plan accordingly.
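If you want a rough cost estimate before kicking off a large job, you can count tokens with tiktoken (not in the install cell above, so pip install tiktoken first) and multiply by the rates quoted above. A back-of-the-envelope sketch, not exact billing, and the 300-token output figure is just an assumption:

import tiktoken

# o200k_base is the tokenizer used by the GPT-4o model family
enc = tiktoken.get_encoding("o200k_base")

def estimate_cost_usd(text: str, assumed_output_tokens: int = 300) -> float:
    """Rough per-call cost: prompt tokens at $0.15/1M plus assumed output at $0.60/1M."""
    input_tokens = len(enc.encode(SYSTEM_PROMPT)) + len(enc.encode(text))
    return input_tokens / 1e6 * 0.15 + assumed_output_tokens / 1e6 * 0.60

print(f"Estimated cost per call: ~${estimate_cost_usd(page_text):.5f}")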
For high-volume jobs, you'll want to control prompt size, reduce chunk overlap, and honestly, just stick with cheaper models like gpt-4o-mini for extraction. They work great for this use case. If you're trying to figure out which model best fits your pipeline's speed, cost, and reliability needs, check out our practical guide on how to choose an LLM for your application.
Conclusion
You've built a deterministic, validated extraction pipeline that converts raw text into structured JSON using OpenAI function calling, Pydantic, and LangChain. The system enforces schema compliance, keeps hallucinated fields out, and - this is key - produces repeatable results.
Key Design Choices:
Function calling forces the model to return only schema-compliant JSON
Pydantic validation catches invalid payloads at runtime
LangChain orchestration makes the pipeline composable, testable, and extensible
Temperature zero ensures deterministic output
Next Steps:
Add retry logic with exponential backoff for production reliability (trust me, you'll need this)
Extend the schema with new fields like location or confidence scores
Deploy the pipeline as a REST API using FastAPI
Parallelize extraction across multiple documents with asyncio and rate limiting (see the sketch after this list)
Add observability with structured logging and input/output hashing to track performance. And avoid logging PII - that's important. Really important.
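For the parallelization item, you don't have to hand-roll asyncio to get started - LangChain's Runnable interface already batches with bounded concurrency. A minimal sketch, with max_concurrency standing in for real rate limiting:

# Re-run the chunked extraction from earlier, but with bounded concurrency
inputs = [{"text": ch} for ch in chunks]
results = events_chain.batch(inputs, config={"max_concurrency": 5})
# There's an async twin too: await events_chain.abatch(inputs, config={"max_concurrency": 5})
merged = [event for batch in results for event in batch]
print(f"Extracted {len(merged)} events across {len(chunks)} chunks")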
The more I think about it, this approach has saved me so much time compared to the old regex-based extraction systems I used to build. Once you get the hang of it, you'll wonder how you ever lived without structured extraction.