// case study · 2024

AI-powered reservation parsing for multilingual PDFs

An LLM-backed pipeline that extracts structured booking data from multilingual reservation PDFs (English + Arabic), with vendor-specific parsing rules and JSON Schema-validated outputs.

Client: Direct Hotels
Sector: AI / Automation
Role: Tech Lead
Engagement: Project · 8 weeks
Team: 2 (1 lead, 1 backend)
Stack: Laravel · OpenAI GPT-4V · Anthropic Claude · JSON Schema
Year: 2024

The brief

Direct Hotels was losing hours every day to manual data entry: bookings arrive as PDFs from a long tail of suppliers, in two languages, in formats that change without notice. We built a structured extraction pipeline that reads each PDF once, validates the output, and posts it straight into the reservation system.

What we built

  • Dual-model architecture: OpenAI GPT-4V for image-heavy PDFs, Anthropic Claude for text-dense ones. Both sit behind a model-agnostic adapter, so new models can be swapped in without rewriting the pipeline.
  • Vendor-specific parsing strategies: suppliers use wildly different layouts, so the pipeline picks a strategy per source and falls back to a generic prompt when no vendor-specific one exists.
  • Bilingual (English + Arabic) extraction: including right-to-left layouts, mixed-direction lines, and Arabic-formatted dates.
  • JSON Schema validation + self-healing: every model response is validated in PHP against a JSON Schema, the role Zod plays in TypeScript. An invalid response triggers a re-prompt that includes the validator’s complaint.
  • Cost ceiling per document: a hard cap on tokens per PDF, monitored in a dashboard. It protects the team from a runaway bill if a malformed PDF triggers a retry loop.
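
To make the strategy fallback, the validate-and-re-prompt loop, and the per-document token cap concrete, here is a minimal sketch in Python rather than the project's PHP. It is an illustration under stated assumptions, not the production code: the field list, vendor key, prompts, `ModelReply`, `extract`, and the hand-rolled `validate` function are all hypothetical stand-ins (the real pipeline validates against a full JSON Schema), and the model is reduced to a plain function so no vendor SDK is assumed.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical field list standing in for the real JSON Schema.
REQUIRED_FIELDS = {"guest_name": str, "check_in": str, "check_out": str, "rooms": int}

# Hypothetical prompts; vendors without an entry fall back to the generic one.
GENERIC_PROMPT = "Extract the booking as a JSON object."
VENDOR_PROMPTS = {
    "acme-travel": "Extract the booking as a JSON object. Dates appear as DD/MM/YYYY.",
}


class TokenBudgetExceeded(Exception):
    """Raised when retries would push a document past its token cap."""


@dataclass
class ModelReply:
    text: str
    tokens_used: int


# Model-agnostic adapter: any function from prompt to reply will do.
ModelAdapter = Callable[[str], ModelReply]


def validate(payload: str) -> list[str]:
    """Return the validator's complaints; an empty list means valid."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field '{name}'")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            errors.append(f"field '{name}' must be of type {expected.__name__}")
    return errors


def extract(pdf_text: str, vendor: str, model: ModelAdapter,
            max_tokens: int = 4000, max_attempts: int = 3) -> dict:
    """Run one document through the model, re-prompting on invalid output."""
    base_prompt = VENDOR_PROMPTS.get(vendor, GENERIC_PROMPT)
    prompt = f"{base_prompt}\n\n{pdf_text}"
    spent = 0
    for _ in range(max_attempts):
        reply = model(prompt)
        spent += reply.tokens_used
        errors = validate(reply.text)
        if not errors:
            return json.loads(reply.text)
        # The cap is checked before another retry, so a malformed PDF
        # cannot keep the loop spending indefinitely.
        if spent >= max_tokens:
            raise TokenBudgetExceeded(f"spent {spent} of {max_tokens} tokens")
        # Self-healing step: feed the validator's complaint back to the model.
        prompt = (f"{base_prompt}\n\n{pdf_text}\n\n"
                  f"Your previous answer was rejected: {'; '.join(errors)}")
    raise ValueError("no valid response within the attempt limit")
```

In this sketch a document that never validates ends in an exception, either `TokenBudgetExceeded` or `ValueError`. A caller could catch both and route the PDF to a manual exception queue.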

Outcome

Manual reservation entry shrank from a multi-hour daily task to clearing a near-empty exception queue. Pipeline accuracy held above 95% on the production set, and the structured outputs flow directly into the booking CRM.
