Chapter 8 of 10

Structured Data Extraction

Extract structured data from unstructured text using AI. Master JSON output, table parsing, and entity extraction techniques.


Why Structured Extraction Matters

Most real-world data is unstructured — emails, reports, reviews, documents, chat logs. Converting this into structured formats (JSON, CSV, database rows) is one of the highest-value applications of prompt engineering.

Manual data extraction is slow and error-prone. A model can work through large batches of documents far faster than a person and apply the same rules to each one, extracting exactly the fields you need in the format your systems expect.

Common use cases include: extracting contact info from emails, parsing invoices into line items, converting meeting notes into action items with owners and deadlines, pulling product specs from marketing descriptions, and normalizing addresses or names from messy input data.

The Extraction Prompt Pattern

Every extraction prompt needs three things: the schema (what fields to extract), the source text (what to extract from), and the output format (how to structure the result).

Define your schema precisely. Instead of "extract the important information," specify: "Extract these fields: company_name (string), revenue (number in USD), year (YYYY format), growth_rate (percentage as decimal). If a field is not found, set it to null."

The null handling instruction is critical. Without it, models tend to guess or fabricate values for missing fields. With it, the model has an explicit way to report that a value is absent, and your code can handle the resulting nulls appropriately.
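The three parts of the pattern can be assembled in code. The following is a minimal sketch, not a definitive implementation: the `SCHEMA` dictionary and the `build_extraction_prompt` helper are hypothetical names introduced for illustration, and the field descriptions are taken from the example above.

```python
# Hypothetical schema: field name -> type description shown to the model.
SCHEMA = {
    "company_name": "string",
    "revenue": "number in USD",
    "year": "YYYY format",
    "growth_rate": "percentage as decimal",
}


def build_extraction_prompt(schema: dict[str, str], source_text: str) -> str:
    """Combine schema, null-handling rule, output format, and source text."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        "Extract these fields from the text below and return a single JSON object.\n"
        f"{field_lines}\n"
        "If a field is not found, set it to null.\n\n"
        f'Text:\n"""\n{source_text}\n"""'
    )


if __name__ == "__main__":
    # Illustrative source text only.
    print(build_extraction_prompt(SCHEMA, "Acme grew quickly in 2023."))
```

Keeping the schema in one data structure means the same definition can drive both the prompt and any later validation step.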

Invoice Data Extraction (Claude 4.6)

Prompt:

Extract the following fields from this invoice text into JSON:
- vendor_name: string
- invoice_number: string
- date: string (YYYY-MM-DD format)
- line_items: array of {description: string, quantity: number, unit_price: number, total: number}
- subtotal: number
- tax: number
- total: number

If any field is not found, use null.

Invoice text:
"""
INVOICE #2024-0847
From: Acme Supplies LLC
Date: March 15, 2024

2x Widget Pro @ $49.99 = $99.98
5x Gadget Mini @ $12.50 = $62.50

Subtotal: $162.48
Tax (8%): $13.00
Total Due: $175.48
"""

Output:

{
  "vendor_name": "Acme Supplies LLC",
  "invoice_number": "2024-0847",
  "date": "2024-03-15",
  "line_items": [
    {"description": "Widget Pro", "quantity": 2, "unit_price": 49.99, "total": 99.98},
    {"description": "Gadget Mini", "quantity": 5, "unit_price": 12.50, "total": 62.50}
  ],
  "subtotal": 162.48,
  "tax": 13.00,
  "total": 175.48
}

Few-Shot Extraction for Consistency

For production pipelines processing many documents, few-shot examples are essential. They lock in the exact output format and handle ambiguous cases consistently.

Provide 2-3 examples covering: a normal case, an edge case (missing data), and a tricky case (ambiguous formatting). The model learns your normalization rules from the examples rather than needing explicit instructions for every edge case.
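One way to assemble such a prompt is sketched below. The contact records, the field names, and the `build_few_shot_prompt` helper are all hypothetical and exist only to show the shape: one normal case, one case with missing data, and one with unusual formatting.

```python
import json

# Hypothetical examples as (input text, expected output) pairs.
EXAMPLES = [
    # Normal case
    ("Contact: Jane Doe, jane@example.com, joined 2023-04-01",
     {"name": "Jane Doe", "email": "jane@example.com", "joined": "2023-04-01"}),
    # Edge case: missing data becomes null
    ("Contact: Sam Lee (no email on file)",
     {"name": "Sam Lee", "email": None, "joined": None}),
    # Tricky case: surname first, non-ISO date
    ("DOE, Jane <jane@example.com> joined 1 Apr 2023",
     {"name": "Jane Doe", "email": "jane@example.com", "joined": "2023-04-01"}),
]


def build_few_shot_prompt(instruction: str, examples: list, new_input: str) -> str:
    """Render each example as an Input/Output pair, then append the new input."""
    parts = [instruction, ""]
    for text, expected in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {json.dumps(expected)}")
        parts.append("")
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)
```

Serializing the expected outputs with `json.dumps` keeps the examples in exactly the format the pipeline will later parse, including `null` for missing values.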

This is particularly important for addresses (different formats across countries), dates (MM/DD vs DD/MM), names (first-last order varies by culture), and currency (symbol placement, thousands separators).
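The date case is easy to demonstrate. In this small sketch (the input string is made up for illustration), the same text parses to two different dates depending on the assumed convention, which is why the examples need to pin one down.

```python
from datetime import datetime

raw = "03/04/2024"  # illustrative input with no indication of convention

as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2024-03-04
print(as_day_first.isoformat())    # 2024-04-03
```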

Batch Processing and Chunking

When processing many items, you can extract data from multiple inputs in a single prompt. Format your input as a numbered list and ask for a JSON array in response, with one object per input in the same order, each carrying the number of the input it came from.
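A rough sketch of the batch pattern follows. The helper names and the `index` field are assumptions made for this example; the idea is to number the inputs on the way in and confirm on the way out that every input came back exactly once.

```python
import json


def build_batch_prompt(instruction: str, items: list[str]) -> str:
    """Number the inputs and request one JSON object per input."""
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return (
        f"{instruction}\n"
        "Return a JSON array with one object per input, in the same order. "
        'Include an "index" field matching the input number.\n\n'
        f"{numbered}"
    )


def parse_batch_response(raw: str, expected_count: int) -> list[dict]:
    """Parse the model's reply and return its rows ordered by input number."""
    data = json.loads(raw)
    if not isinstance(data, list) or len(data) != expected_count:
        raise ValueError(f"expected a JSON array of {expected_count} objects")
    by_index = {}
    for row in data:
        if not isinstance(row, dict) or "index" not in row:
            raise ValueError("every element needs an 'index' field")
        by_index[row["index"]] = row
    missing = [i for i in range(1, expected_count + 1) if i not in by_index]
    if missing:
        raise ValueError(f"no result for inputs: {missing}")
    return [by_index[i] for i in range(1, expected_count + 1)]
```

Checking the count and the indices guards against a reply that silently skips or merges items, which is the main risk of batching.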

For large documents that exceed the context window, chunk the text into overlapping segments and process each chunk separately. Use a second prompt to merge and deduplicate the extracted data from all chunks.
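The chunking step might look like the sketch below. It makes two simplifying assumptions: chunks are measured in characters (a real pipeline would more likely count tokens), and `dedupe_records` only drops exact duplicates. That second helper is a cheap programmatic pre-pass, not a substitute for the merge prompt described above, which can also reconcile near-duplicates.

```python
import json


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def dedupe_records(records: list[dict]) -> list[dict]:
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for record in records:
        key = json.dumps(record, sort_keys=True)  # key order does not matter
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

The overlap exists so that an entity straddling a chunk boundary appears whole in at least one chunk; the price is that the same entity may be extracted twice, which is why a deduplication step follows.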

Always validate extracted data programmatically after the AI returns it. Check types, ranges, required fields, and format constraints. AI extraction is highly accurate but not perfect — a validation layer catches the occasional error before it enters your database.
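As an illustration, here is one possible validation layer for the invoice fields used in the earlier example. It is a sketch under stated assumptions: the `validate_invoice` and `is_number` helpers are hypothetical, every field is treated as required, and a tolerance of one cent is assumed for the arithmetic checks.

```python
import math
import re
from datetime import datetime


def is_number(value) -> bool:
    """True for int or float, but not bool (which is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_invoice(data: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passed."""
    errors = []
    for field in ("vendor_name", "invoice_number", "date"):
        if not isinstance(data.get(field), str):
            errors.append(f"{field}: expected string")
    for field in ("subtotal", "tax", "total"):
        if not is_number(data.get(field)):
            errors.append(f"{field}: expected number")

    date = data.get("date")
    if isinstance(date, str):
        try:
            # Check the shape first, then that it is a real calendar date.
            if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
                raise ValueError("wrong shape")
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            errors.append("date: expected a valid YYYY-MM-DD date")

    items = data.get("line_items")
    if not isinstance(items, list):
        errors.append("line_items: expected array")
        items = []
    for i, item in enumerate(items):
        values = [item.get(k) for k in ("quantity", "unit_price", "total")] \
            if isinstance(item, dict) else [None]
        if not all(is_number(v) for v in values):
            errors.append(f"line_items[{i}]: expected numeric quantity, unit_price, total")
        elif not math.isclose(values[0] * values[1], values[2], abs_tol=0.01):
            errors.append(f"line_items[{i}]: total does not match quantity * unit_price")

    if all(is_number(data.get(k)) for k in ("subtotal", "tax", "total")):
        if not math.isclose(data["subtotal"] + data["tax"], data["total"], abs_tol=0.01):
            errors.append("total: does not equal subtotal + tax")
    return errors
```

Returning a list of messages rather than raising on the first failure makes it easier to log every problem with a record and to decide whether to retry, repair, or route it for manual review.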

Key Takeaways

  • Define your extraction schema precisely — field names, types, and null handling
  • Use few-shot examples for production pipelines to ensure consistent output
  • Handle missing data explicitly with null instructions
  • Validate AI-extracted data programmatically before using it
  • Batch processing and chunking handle large volumes and long documents

Try It Yourself

Find an email or document with mixed information. Write an extraction prompt to pull specific fields into JSON. Verify the output matches the source.
