The Challenge

The problem isn't unique to any one industry. Whether you're in property, finance, logistics, legal, or infrastructure — the same issue appears: structured data is present in your documents, but it's locked inside PDFs in a layout that computers can't reliably parse without help. Reference numbers, dates, financial figures, named parties — all there, all inaccessible at scale.

The complexity compounds when documents arrive from multiple sources. Bulk document sets typically come in two formats:

  • Text-based PDFs — digitally created documents with selectable, machine-readable text
  • Scanned PDFs — physical documents photographed or photocopied, with no embedded text layer at all

Standard PDF parsing tools fail completely on scanned documents. Any viable solution needs to handle both formats automatically — no human classification step, no separate queues for different file types.

Beyond format handling, the extraction logic needs to be layout-agnostic. Documents from different issuers share the same fields, but spacing, line breaks, and structure vary. A hardcoded approach breaks the moment a document deviates from the template — which, at scale, is inevitable.

The Solution

We built the entire solution as a Python script using Claude Code — working directly in the terminal, iterating on the extraction logic, prompt design, and OCR routing until each piece worked reliably end to end. The result is a single command: python extract.py.

On launch, an interactive menu lets the operator select the document type and tick which attributes to extract — no code changes required. The script then connects to Google Drive, Airtable, and the Anthropic API, lists all PDFs in the target folder, and begins processing. A --new-only flag cross-references Airtable to skip documents already processed, making every run incremental by default.
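The `--new-only` cross-reference described above amounts to a simple set difference between the Drive listing and the filenames already recorded in Airtable. A minimal sketch of that check, assuming illustrative names (`filter_new_files` and the `"name"` key are hypothetical, not necessarily the script's actual identifiers):

```python
def filter_new_files(drive_files, processed_names):
    """Keep only Drive files not yet recorded in Airtable.

    drive_files: list of dicts, each with a "name" key (as a Drive
    files.list response would provide); processed_names: set of filenames
    already present in the Airtable base.
    """
    return [f for f in drive_files if f["name"] not in processed_names]
```

Because the comparison runs against the system of record rather than local state, re-running the command on the same folder is always safe: already-processed documents are skipped, new arrivals are picked up.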

"The most powerful part isn't the extraction itself — it's that the system handles text and scanned documents identically. From any PDF, in any layout, the output is always the same: clean, structured, database-ready records."

Sibusiso Mabaso, Founder & CEO
How the Pipeline Works
  • Terminal — python extract.py launches the run
  • Google Drive — list PDFs, skip duplicates
  • Text PDF → PyMuPDF extraction, or Scanned PDF → Google Vision OCR
  • Claude API — layout-agnostic extraction
  • Airtable — structured records

For text-based documents, PyMuPDF extracts the raw text layer. For scanned documents with no embedded text, Google Vision OCR reads the image first. Both paths then send their output to Claude's API — not as a rigid template match, but as a plain-English request to locate and return specific fields regardless of how the document is laid out. Claude returns structured JSON, which the script parses and writes directly into Airtable.
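Two pieces of this step can be sketched in isolation: the routing decision (does the PDF have a usable text layer, or does it need OCR?) and the parsing of Claude's reply. Both helpers below are illustrative sketches with assumed names, not the script's actual code; `min_chars` is an arbitrary threshold, and the fence-stripping handles the common case where a model wraps its JSON in a markdown code block:

```python
import json

def needs_ocr(page_texts, min_chars=20):
    """Heuristic routing: treat the PDF as scanned if the embedded text
    layer (e.g. from PyMuPDF's page.get_text()) is effectively empty."""
    return sum(len(t.strip()) for t in page_texts) < min_chars

def parse_model_json(reply):
    """Parse the JSON object the model returns, tolerating an optional
    markdown code fence around the payload."""
    text = reply.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]   # drop opening fence + language tag
        text = text.rsplit("```", 1)[0]  # drop closing fence
    return json.loads(text)
```

The key design point survives either path: by the time text reaches the model, it no longer matters whether it came from an embedded text layer or from OCR, so a single prompt and a single parser serve both.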

What Gets Extracted

Because the Claude API handles field detection — not hardcoded patterns — the attributes extracted are defined at run time via the interactive menu. The same script works across different document types without touching the code. Typical categories include:

  • Document identifiers — reference numbers, agreement names, and the parties involved
  • Classification data — regions, categories, site codes, or any organisational grouping present in the document
  • Date fields — commencement dates, expiry dates, renewal windows, and any scheduled milestones
  • Financial terms — amounts, rates, frequencies, and any escalation or adjustment clauses
  • Line-item entries — each itemised charge or sub-record with its description and parsed numeric value
  • Party and contact details — supplier names, payees, signatories, and counterparty information
  • Metadata and flags — document type, contract or reference number, status fields, and operational notes

Where a document contains multiple line items, the normalisation layer produces one Airtable row per item — with all document-level fields repeated — keeping the data fully relational and ready for filtering, aggregation, or reporting.
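That normalisation step can be sketched as a small pure function (a minimal illustration, assuming a hypothetical `normalise_rows` name and dict-shaped records rather than the script's actual implementation):

```python
def normalise_rows(doc_fields, line_items):
    """Flatten one extracted document into Airtable-ready rows.

    Each line item becomes its own row, with every document-level field
    (reference, parties, dates, ...) repeated, keeping the table fully
    relational. A document with no line items still yields one row.
    """
    if not line_items:
        return [dict(doc_fields)]
    return [{**doc_fields, **item} for item in line_items]
```

Repeating the document-level fields on every row trades a little storage for a lot of convenience: any Airtable filter, grouping, or rollup works directly on the flat table without joins.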

Results

When deployed against a backlog of 1,700 documents, the pipeline completed the full run in under 100 minutes — fully unattended. The manual equivalent, at a conservative 15 minutes per document, would have required approximately 425 hours of data entry — over ten weeks of full-time work. The same pipeline runs again on any new batch at the push of a command.

1,700
Documents processed in a single run
100 min
Total automated runtime, start to finish
425 hrs
Manual equivalent avoided (at 15 min/doc)
0
Manual steps required

What This Demonstrates

The conventional approach to PDF extraction — hardcoded field patterns, rigid templates, separate logic per document type — breaks the moment a document deviates from the expected layout. This approach works differently: instead of teaching a machine the exact structure of every document, it uses Claude to interpret a plain-English request and locate the relevant information wherever it appears in the document.

Because the extraction logic lives in a prompt rather than in code, updating what gets captured is a configuration change — not a development sprint. Add a new field to the interactive menu and the next run captures it. Change document types entirely and the same script adapts without modification.

The architecture applies directly to any document-heavy workflow: supplier contracts, insurance policies, planning approvals, compliance certificates, financial statements, employment agreements, property schedules. Wherever a team is manually copying data from PDFs into a system of record, this approach replaces that process — built with Claude Code, run from a single terminal command, and scaling to thousands of documents without adding headcount.

Have documents your team still processes by hand?

Schedule a Call →