Why Structured Extraction Matters
Most real-world data is unstructured — emails, reports, reviews, documents, chat logs. Converting this into structured formats (JSON, CSV, database rows) is one of the highest-value applications of prompt engineering.
Manual data extraction is slow and error-prone. AI can process hundreds of documents per minute with consistent accuracy, extracting exactly the fields you need in the format your systems expect.
Common use cases include: extracting contact info from emails, parsing invoices into line items, converting meeting notes into action items with owners and deadlines, pulling product specs from marketing descriptions, and normalizing addresses or names from messy input data.
The Extraction Prompt Pattern
Every extraction prompt needs three things: the schema (what fields to extract), the source text (what to extract from), and the output format (how to structure the result).
Define your schema precisely. Instead of "extract the important information," specify: "Extract these fields: company_name (string), revenue (number in USD), year (YYYY format), growth_rate (percentage as decimal). If a field is not found, set it to null."
The null handling instruction is critical. Without it, models will guess or fabricate values for missing fields. With it, you get honest nulls that your code can handle appropriately.
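The schema-plus-source-plus-format pattern can be sketched as a small prompt builder. This is a minimal illustration, not a specific library API; the `build_extraction_prompt` helper and the example fields are hypothetical:

```python
def build_extraction_prompt(schema: dict, source_text: str) -> str:
    """Assemble an extraction prompt from a field schema and source text.

    schema maps field names to type descriptions,
    e.g. {"revenue": "number in USD"}.
    """
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        "Extract these fields from the text below into JSON:\n"
        f"{field_lines}\n"
        "If a field is not found, set it to null.\n\n"
        f'Text:\n"""\n{source_text}\n"""'
    )

prompt = build_extraction_prompt(
    {"company_name": "string", "revenue": "number in USD", "year": "YYYY format"},
    "Acme reported $4.2M in revenue for 2023.",
)
```

Keeping the schema as data rather than hard-coding it in the prompt string makes it easy to reuse one builder across many document types.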
Prompt:
Extract the following fields from this invoice text into JSON:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD format)
- line_items: array of {description: string, quantity: number, unit_price: number, total: number}
- subtotal: number
- tax: number
- total: number

If any field is not found, use null.

Invoice text:
"""
INVOICE #2024-0847
From: Acme Supplies LLC
Date: March 15, 2024

2x Widget Pro @ $49.99 = $99.98
5x Gadget Mini @ $12.50 = $62.50

Subtotal: $162.48
Tax (8%): $13.00
Total Due: $175.48
"""
Output:
{
  "vendor_name": "Acme Supplies LLC",
  "invoice_number": "2024-0847",
  "date": "2024-03-15",
  "line_items": [
    {"description": "Widget Pro", "quantity": 2, "unit_price": 49.99, "total": 99.98},
    {"description": "Gadget Mini", "quantity": 5, "unit_price": 12.50, "total": 62.50}
  ],
  "subtotal": 162.48,
  "tax": 13.00,
  "total": 175.48
}
Few-Shot Extraction for Consistency
For production pipelines processing many documents, few-shot examples are essential. They lock in the exact output format and handle ambiguous cases consistently.
Provide 2-3 examples covering: a normal case, an edge case (missing data), and a tricky case (ambiguous formatting). The model learns your normalization rules from the examples rather than needing explicit instructions for every edge case.
This is particularly important for addresses (different formats across countries), dates (MM/DD vs DD/MM), names (first-last order varies by culture), and currency (symbol placement, thousands separators).
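A few-shot prompt of this kind can be assembled programmatically. The sketch below is illustrative: the example texts, the US-style month-first date reading, and the name/phone/date schema are all assumptions for demonstration, and the second example shows the missing-data edge case returning nulls:

```python
import json

# Each example pairs raw input with the exact JSON expected back,
# including nulls for missing data. Texts and schema are illustrative.
EXAMPLES = [
    ("Call Maria Lopez, 555-0142, before 4/2/2024.",
     {"name": "Maria Lopez", "phone": "555-0142", "date": "2024-04-02"}),
    ("Ping the vendor about the renewal.",  # edge case: nothing to extract
     {"name": None, "phone": None, "date": None}),
]

def few_shot_prompt(new_text: str) -> str:
    parts = ["Extract name, phone, and date (YYYY-MM-DD) as JSON. "
             "Use null for any missing field."]
    for text, expected in EXAMPLES:
        parts.append(f"Input: {text}\nOutput: {json.dumps(expected)}")
    parts.append(f"Input: {new_text}\nOutput:")
    return "\n\n".join(parts)
```

Note how the first example silently teaches the date normalization rule (4/2/2024 becomes 2024-04-02) without any explicit instruction.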
Batch Processing and Chunking
When processing many items, you can extract data from multiple inputs in a single prompt. Format your input as a numbered list and ask for a numbered JSON array in response.
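A minimal sketch of that batching step (the `{name, email}` schema and `batch_prompt` helper are assumptions for illustration):

```python
def batch_prompt(texts: list[str]) -> str:
    """Combine several inputs into one numbered extraction prompt."""
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(texts, 1))
    return (
        "For each numbered input below, extract {name, email} as JSON.\n"
        "Return a JSON array where item i corresponds to input i. "
        "Use null for missing fields.\n\n"
        + numbered
    )
```

Asking for positional correspondence ("item i corresponds to input i") makes it easy to re-associate results with inputs when parsing the response.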
For large documents that exceed the context window, chunk the text into overlapping segments and process each chunk separately. Use a second prompt to merge and deduplicate the extracted data from all chunks.
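The chunking step can be sketched as follows; the sizes are illustrative and should be tuned to your model's context window. Overlap ensures an entity straddling a chunk boundary appears whole in at least one chunk:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks for separate extraction passes."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reaches the end of the text
        start += chunk_size - overlap
    return chunks
```

Because of the overlap, the same record can be extracted from two adjacent chunks, which is why the merge step must deduplicate.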
Always validate extracted data programmatically after the AI returns it. Check types, ranges, required fields, and format constraints. AI extraction is highly accurate but not perfect — a validation layer catches the occasional error before it enters your database.
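A validation layer for the invoice schema shown earlier might look like this. It is a minimal sketch: the required-field set, tolerance, and cross-field rule are illustrative choices, not a complete policy:

```python
def validate_invoice(record: dict) -> list[str]:
    """Return a list of validation errors for an extracted invoice record."""
    errors = []
    required = {"vendor_name": str, "total": (int, float)}
    for field, typ in required.items():
        if record.get(field) is None:
            errors.append(f"missing required field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"{field} has wrong type: {type(record[field]).__name__}")
    # Cross-field check: line item totals should sum to the subtotal
    # (within a small rounding tolerance).
    items = record.get("line_items") or []
    if items and record.get("subtotal") is not None:
        if abs(sum(i["total"] for i in items) - record["subtotal"]) > 0.01:
            errors.append("line_items do not sum to subtotal")
    return errors
```

Records with a non-empty error list can be routed to a retry prompt or a human review queue instead of being written to the database.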