Arabic Document Extraction

Built a production Arabic document extraction system that converts photos of government IDs into structured, machine-readable data for identity verification workflows. The system is designed for real-world capture conditions, where image quality is inconsistent: blur, glare, low light, and off-angle shots. The pipeline combines document detection, Arabic OCR, and layout-aware field parsing to extract key identity fields from a range of GCC documents. A major focus was right-to-left (RTL) text flow, so that each Arabic label is paired with its value based on where the two sit relative to each other in the image.

The hard part

Arabic OCR introduces challenges not present in Latin scripts: right-to-left reading order, context-dependent letter shapes, and visually similar characters. The system accounts for RTL layout by using Arabic field labels as positional anchors and mapping the expected value region relative to each one. In an RTL layout, a value on the same line typically sits to the left of its label, not to the right. Anchoring on labels instead of fixed coordinates significantly improved consistency across different ID designs and capture conditions.
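A minimal sketch of the label-anchor idea, not the production code. It assumes the OCR step returns word boxes with pixel coordinates; the `Box` type, the `value_left_of` helper, and the `max_gap` and line-tolerance values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One OCR word with its bounding box in pixels (hypothetical structure)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    @property
    def right(self) -> float:
        return self.x + self.w

    @property
    def cy(self) -> float:
        return self.y + self.h / 2


def value_left_of(label: Box, boxes: list[Box], max_gap: float = 60.0) -> str:
    """Collect the words to the left of `label` on the same line, in RTL order.

    Assumption: the value follows its label on the same line, so in an RTL
    layout it lies to the label's left.
    """
    # Keep boxes on roughly the same text line that end before the label starts.
    candidates = [
        b for b in boxes
        if b is not label
        and abs(b.cy - label.cy) <= 0.6 * label.h
        and b.right <= label.x
    ]
    # Rightmost first: that is the reading order for RTL text.
    candidates.sort(key=lambda b: b.right, reverse=True)

    words, edge = [], label.x
    for b in candidates:
        # Stop at the first large horizontal gap: likely a different field.
        if edge - b.right > max_gap:
            break
        words.append(b.text)
        edge = b.x
    return " ".join(words)
```

Walking leftward from the label and stopping at the first large gap keeps a neighbouring field on the same line from being merged into the value. The thresholds would need tuning per document type and image resolution.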

What I did

Designed and implemented the Arabic document extraction system end-to-end, from OCR integration to structured field parsing. Built RTL-aware logic to associate Arabic labels with their values, and developed flexible matching and validation rules to handle variations in spelling, layout, and document format. Improved robustness through preprocessing, orientation handling, and continuous iteration on real-world samples until the system performed reliably across diverse capture conditions and ID designs.
