The Challenge

The problem isn't unique to any one industry. Whether you're in property, finance, logistics, legal, or infrastructure — the same issue appears: structured data is present in your documents, but it's locked inside PDFs in a layout that computers can't reliably parse without help. Reference numbers, dates, financial figures, named parties — all there, all inaccessible at scale.

The complexity compounds when documents arrive from multiple sources. Bulk document sets typically come in two formats:

  • Text-based PDFs — digitally created documents with selectable, machine-readable text
  • Scanned PDFs — physical documents photographed or photocopied, with no embedded text layer at all

Standard PDF parsing tools fail completely on scanned documents. Any viable solution needs to handle both formats automatically — no human classification step, no separate queues for different file types.

Beyond format handling, the extraction logic needs to be layout-agnostic. Documents from different issuers share the same fields, but spacing, line breaks, and structure vary. A hardcoded approach breaks the moment a document deviates from the template — which, at scale, is inevitable.

The Solution

We built the entire solution as a Python script using Claude Code — working directly in the terminal, iterating on the extraction logic, prompt design, and OCR routing until each piece worked reliably end to end. The result is a single command: python extract.py.

On launch, an interactive menu lets the operator select the document type and tick which attributes to extract — no code changes required. The script then connects to Google Drive, Airtable, and the Anthropic API, lists all PDFs in the target folder, and begins processing. A --new-only flag cross-references Airtable to skip documents already processed, making every run incremental by default.
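The `--new-only` cross-reference described above amounts to a simple set difference between the Drive listing and the filenames already recorded in Airtable. A minimal sketch of that check, assuming illustrative names (`filter_new_files` and the `"name"` key are hypothetical, not necessarily the script's actual identifiers):

```python
def filter_new_files(drive_files, processed_names):
    """Keep only Drive files not yet recorded in Airtable.

    drive_files: list of dicts, each with a "name" key (as a Drive
    files.list response would provide); processed_names: set of filenames
    already present in the Airtable base.
    """
    return [f for f in drive_files if f["name"] not in processed_names]
```

Because the comparison runs against the system of record rather than local state, re-running the command on the same folder is always safe: already-processed documents are skipped, new arrivals are picked up.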

"The most powerful part isn't the extraction itself — it's that the system handles text and scanned documents identically. From any PDF, in any layout, the output is always the same: clean, structured, database-ready records."

Sibusiso Mabaso, Founder & CEO
How the Pipeline Works
  • Terminal — python extract.py launches the run
  • Google Drive — list PDFs, skip duplicates
  • Text PDF → PyMuPDF extraction, or Scanned PDF → Google Vision OCR
  • Claude API — layout-agnostic extraction
  • Airtable — structured records

For text-based documents, PyMuPDF extracts the raw text layer. For scanned documents with no embedded text, Google Vision OCR reads the image first. Both paths then send their output to Claude's API — not as a rigid template match, but as a plain-English request to locate and return specific fields regardless of how the document is laid out. Claude returns structured JSON, which the script parses and writes directly into Airtable.
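Two pieces of this step can be sketched in isolation: the routing decision (does the PDF have a usable text layer, or does it need OCR?) and the parsing of Claude's reply. Both helpers below are illustrative sketches with assumed names, not the script's actual code; `min_chars` is an arbitrary threshold, and the fence-stripping handles the common case where a model wraps its JSON in a markdown code block:

```python
import json

def needs_ocr(page_texts, min_chars=20):
    """Heuristic routing: treat the PDF as scanned if the embedded text
    layer (e.g. from PyMuPDF's page.get_text()) is effectively empty."""
    return sum(len(t.strip()) for t in page_texts) < min_chars

def parse_model_json(reply):
    """Parse the JSON object the model returns, tolerating an optional
    markdown code fence around the payload."""
    text = reply.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]   # drop opening fence + language tag
        text = text.rsplit("```", 1)[0]  # drop closing fence
    return json.loads(text)
```

The key design point survives either path: by the time text reaches the model, it no longer matters whether it came from an embedded text layer or from OCR, so a single prompt and a single parser serve both.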

What Gets Extracted

Because the Claude API handles field detection — not hardcoded patterns — the attributes extracted are defined at run time via the interactive menu. The same script works across different document types without touching the code. Typical categories include:

  • Document identifiers — reference numbers, agreement names, and the parties involved
  • Classification data — regions, categories, site codes, or any organisational grouping present in the document
  • Date fields — commencement dates, expiry dates, renewal windows, and any scheduled milestones
  • Financial terms — amounts, rates, frequencies, and any escalation or adjustment clauses
  • Line-item entries — each itemised charge or sub-record with its description and parsed numeric value
  • Party and contact details — supplier names, payees, signatories, and counterparty information
  • Metadata and flags — document type, contract or reference number, status fields, and operational notes

Where a document contains multiple line items, the normalisation layer produces one Airtable row per item — with all document-level fields repeated — keeping the data fully relational and ready for filtering, aggregation, or reporting.
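That normalisation step can be sketched as a small pure function (a minimal illustration, assuming a hypothetical `normalise_rows` name and dict-shaped records rather than the script's actual implementation):

```python
def normalise_rows(doc_fields, line_items):
    """Flatten one extracted document into Airtable-ready rows.

    Each line item becomes its own row, with every document-level field
    (reference, parties, dates, ...) repeated, keeping the table fully
    relational. A document with no line items still yields one row.
    """
    if not line_items:
        return [dict(doc_fields)]
    return [{**doc_fields, **item} for item in line_items]
```

Repeating the document-level fields on every row trades a little storage for a lot of convenience: any Airtable filter, grouping, or rollup works directly on the flat table without joins.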

Results

When deployed against a backlog of 1,700 documents, the pipeline completed the full run in under 100 minutes — fully unattended. The manual equivalent, at a conservative 15 minutes per document, would have required approximately 425 hours of data entry — over ten weeks of full-time work. The same pipeline runs again on any new batch at the push of a command.

1,700
Documents processed in a single run
100 min
Total automated runtime, start to finish
425 hrs
Manual equivalent avoided (at 15 min/doc)
0
Manual steps required

What This Demonstrates

The conventional approach to PDF extraction — hardcoded field patterns, rigid templates, separate logic per document type — breaks the moment a document deviates from the expected layout. This approach works differently: instead of teaching a machine the exact structure of every document, it uses Claude to interpret a plain-English request and locate the relevant information wherever it appears in the document.

Because the extraction logic lives in a prompt rather than in code, updating what gets captured is a configuration change — not a development sprint. Add a new field to the interactive menu and the next run captures it. Change document types entirely and the same script adapts without modification.

The architecture applies directly to any document-heavy workflow: supplier contracts, insurance policies, planning approvals, compliance certificates, financial statements, employment agreements, property schedules. Wherever a team is manually copying data from PDFs into a system of record, this approach replaces that process — built with Claude Code, run from a single terminal command, and scaling to thousands of documents without adding headcount.

Have documents your team still processes by hand?

Schedule a Call →