AI-powered reservation parsing for multilingual PDFs

The brief

Direct Hotels was burning hours every day on manual data entry — bookings come in as PDFs from a long tail of suppliers, in two languages, in formats that change without notice. We built a structured extraction pipeline that reads the PDFs once, validates the output, and posts straight into the reservation system.

What we built

Dual-model architecture — OpenAI GPT-4V for image-heavy PDFs, Anthropic Claude for text-dense ones. Both are abstracted behind a model-agnostic adapter so we can swap in new entrants without rewriting the pipeline.
Vendor-specific parsing strategies — different suppliers have wildly different layouts; the pipeline picks a strategy per source with a fallback to the generic prompt.
Bilingual (English + Arabic) extraction — including right-to-left layouts, mixed-direction lines, and Arabic-formatted dates.
JSON Schema validation + self-healing — every model response is validated against a Zod-equivalent schema in PHP. When a response fails validation, the pipeline re-prompts the model with the validator’s error message.
Cost ceiling per document — a hard cap on tokens per PDF, monitored in a dashboard. Saves the team from a runaway bill if a malformed PDF triggers a retry loop.
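
To make the last few points concrete, here is a minimal sketch of how strategy selection, validation, re-prompting, and the token cap could fit together. It is an illustration in Python rather than the production PHP code, and everything named in it is an assumption: the field list in REQUIRED_FIELDS, the vendor key "acme-travel", the prompts, the default limits, and the call_model callable that stands in for the model adapter. The hand-rolled validate function is a stand-in for a real JSON Schema validator.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical fields and types; the real schema is not given in the text.
REQUIRED_FIELDS = {"guest_name": str, "check_in": str, "check_out": str, "rooms": int}

# Hypothetical per-supplier prompts, with a generic fallback.
GENERIC_PROMPT = "Extract the reservation as JSON."
VENDOR_PROMPTS = {"acme-travel": "Extract the reservation from an Acme voucher as JSON."}


class TokenBudgetExceeded(RuntimeError):
    """Raised when a document has used up its token cap."""


@dataclass
class ModelReply:
    text: str         # raw model output, expected to be JSON
    tokens_used: int  # tokens consumed by this call


def validate(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (data, None) when valid, or (None, complaint) when not."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Response is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "Top-level value must be an object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            return None, f"Missing field: {name}"
        if not isinstance(data[name], expected):
            return None, f"Field {name} must be of type {expected.__name__}"
    return data, None


def extract(
    pdf_text: str,
    vendor: str,
    call_model: Callable[[str], ModelReply],
    max_tokens: int = 4000,
    max_attempts: int = 3,
) -> dict:
    """Extract one reservation, re-prompting on invalid output within a budget."""
    # Pick the vendor-specific prompt, falling back to the generic one.
    prompt = VENDOR_PROMPTS.get(vendor, GENERIC_PROMPT)
    spent = 0
    complaint = None
    for _ in range(max_attempts):
        # Refuse to start another call once the cap has been reached.
        if spent >= max_tokens:
            raise TokenBudgetExceeded(f"Used {spent} of {max_tokens} tokens")
        message = f"{prompt}\n\n{pdf_text}"
        if complaint:
            # Feed the validator's complaint back to the model.
            message += f"\n\nYour previous answer was rejected: {complaint}"
        reply = call_model(message)
        spent += reply.tokens_used
        data, complaint = validate(reply.text)
        if data is not None:
            return data
    raise ValueError(f"No valid response after {max_attempts} attempts: {complaint}")
```

In this sketch the budget is checked before each call, so a malformed PDF can cost at most one call beyond the cap, and the attempt limit bounds the retry loop even when every call is cheap.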

Outcome

Manual reservation entry dropped from a multi-hour daily task to a near-zero exception queue. Pipeline accuracy held above 95% on the production set, and the structured outputs flow directly into the booking CRM.
