AI-powered reservation parsing for multilingual PDFs
An LLM-backed pipeline that extracts structured booking data from multilingual reservation PDFs (English + Arabic), with vendor-specific parsing rules and JSON Schema-validated outputs.
The brief
Direct Hotels was burning hours every day on manual data entry — bookings come in as PDFs from a long tail of suppliers, in two languages, in formats that change without notice. We built a structured extraction pipeline that reads the PDFs once, validates the output, and posts straight into the reservation system.
What we built
- Dual-model architecture: OpenAI GPT-4V for image-heavy PDFs, Anthropic Claude for text-dense ones. Both sit behind a model-agnostic adapter, so new models can be swapped in without rewriting the pipeline.
- Vendor-specific parsing strategies: suppliers use very different layouts, so the pipeline picks a strategy per source and falls back to a generic prompt when no vendor rule matches.
- Bilingual (English + Arabic) extraction: including right-to-left layouts, mixed-direction lines, and Arabic-formatted dates.
- JSON Schema validation + self-healing: every model response is validated against a Zod-style schema implemented in PHP. When a response fails validation, the pipeline re-prompts the model with the validator's error message.
- Cost ceiling per document: a hard cap on tokens per PDF, monitored in a dashboard. It protects the team from a runaway bill if a malformed PDF triggers a retry loop.
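To make the validate, re-prompt and cap loop concrete, here is a minimal sketch in Python. It is an illustration, not the production PHP code. Every name in it is hypothetical: `extract`, `validate`, `VENDOR_PROMPTS`, the field list, and the token numbers. The model adapter is assumed to be a plain callable that returns the response text and the tokens it consumed, and a small hand-rolled check stands in for a real schema validator.

```python
import json
from typing import Callable

# Hypothetical adapter signature: prompt in, (raw_text, tokens_used) out.
ModelFn = Callable[[str], tuple[str, int]]

# Hypothetical field list standing in for the real booking schema.
REQUIRED_FIELDS = {"guest_name": str, "check_in": str, "check_out": str, "rooms": int}

# Hypothetical vendor prompts; unknown vendors fall back to the generic one.
GENERIC_PROMPT = "Extract the booking as JSON."
VENDOR_PROMPTS = {"acme_travel": "Extract the booking as JSON. Dates are DD/MM/YYYY."}


class TokenBudgetExceeded(Exception):
    """Raised when a document has used up its token cap."""


def validate(raw: str) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field '{name}'")
        elif type(data[name]) is not expected:
            errors.append(f"field '{name}' must be {expected.__name__}")
    return errors


def extract(pdf_text: str, vendor: str, model: ModelFn,
            max_tokens: int = 8000, max_attempts: int = 3) -> dict:
    """Call the model, validate, and re-prompt with the errors until valid."""
    base = VENDOR_PROMPTS.get(vendor, GENERIC_PROMPT)  # per-vendor strategy with fallback
    prompt = f"{base}\n\n{pdf_text}"
    spent = 0
    for _ in range(max_attempts):
        raw, used = model(prompt)
        spent += used
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # The cap is checked between attempts, so one call can overshoot it,
        # but a retry loop cannot keep spending once the cap is reached.
        if spent >= max_tokens:
            raise TokenBudgetExceeded(f"spent {spent} tokens, cap is {max_tokens}")
        prompt = (f"{base}\n\n{pdf_text}\n\n"
                  f"Your previous answer was rejected: {'; '.join(errors)}")
    raise ValueError(f"no valid response after {max_attempts} attempts")
```

In this sketch a document that never validates ends in either `TokenBudgetExceeded` or `ValueError`. Both are natural places to route the PDF to a manual exception queue rather than retry indefinitely.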
Outcome
Manual reservation entry dropped from a multi-hour daily task to a near-zero exception queue. Pipeline accuracy held above 95% on the production set, and the structured outputs flow directly into the booking CRM.