Computer Vision

Arabic Document Extraction

Arabic OCR system for extracting structured data from documents

Company: Nashid
Year: 2025-2026
Status: Production
Accuracy: 96.9% (field accuracy across key ID fields)
Processing: ~7–12s per document under real-world conditions
Coverage: GCC support, multiple country formats
Volume: 300-400 verifications per day in production

Built a production Arabic document extraction system that converts photos of government IDs into structured, machine-readable data for identity verification workflows. The system is designed for real-world capture conditions, where image quality is inconsistent: blur, glare, low lighting, and off-angle shots. The pipeline combines document detection, Arabic OCR, and layout-aware field parsing to extract key identity fields from diverse GCC documents. A major focus was handling right-to-left (RTL) text flow so that Arabic labels and their associated values are paired correctly based on their spatial relationships in the image.

The hard part

Arabic OCR introduces challenges not present in Latin scripts: right-to-left reading order, context-dependent letter shapes, and visually similar characters. The system accounts for RTL layout by using Arabic field labels as positional anchors and mapping expected value regions relative to them. Anchoring on labels in this way significantly improved consistency across different ID designs and capture conditions.
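To make the label-anchoring idea concrete, here is a minimal sketch, not the production code. It assumes the OCR step yields text boxes with pixel coordinates, and that on a given ID the value sits to the left of its label on the same line, as is typical for RTL layouts. The `OcrBox` type, the `value_left_of_label` helper, and the half-line-height tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OcrBox:
    """One recognized text region (hypothetical shape for OCR output)."""
    text: str
    x0: float  # left edge, pixels
    y0: float  # top edge, pixels
    x1: float  # right edge, pixels
    y1: float  # bottom edge, pixels

    @property
    def y_center(self) -> float:
        return (self.y0 + self.y1) / 2

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def value_left_of_label(boxes: List[OcrBox], label: str) -> Optional[str]:
    """Return the text of the nearest box to the left of the label, on the same line.

    In an RTL layout the label is read first (on the right) and its value
    follows on the left, so the search runs leftward from the anchor.
    """
    # Use the first box containing the label text as the positional anchor.
    anchor = next((b for b in boxes if label in b.text), None)
    if anchor is None:
        return None

    # Assumed tolerance: vertical centers within half the label's height.
    tolerance = 0.5 * anchor.height
    candidates = [
        b for b in boxes
        if b is not anchor
        and b.x1 <= anchor.x0  # lies entirely to the left of the label
        and abs(b.y_center - anchor.y_center) <= tolerance
    ]
    if not candidates:
        return None

    # The closest candidate is the one whose right edge is nearest the label.
    return max(candidates, key=lambda b: b.x1).text


# Example with made-up coordinates: the name label on the right, its value on the left.
boxes = [
    OcrBox("الاسم", 400, 50, 480, 80),
    OcrBox("محمد أحمد", 200, 52, 380, 82),
]
name_value = value_left_of_label(boxes, "الاسم")
```

A real document would likely need per-format rules as well (values below the label, multi-line values), which this sketch does not cover.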

What I did

Designed and implemented the Arabic document extraction system end-to-end, from OCR integration to structured field parsing. Built RTL-aware logic to correctly associate Arabic labels and values, and developed flexible matching and validation rules to handle variations in spelling, layout, and document formats. Improved robustness through preprocessing, orientation handling, and continuous iteration on real-world samples, refining the system to perform reliably across diverse capture conditions and ID designs.
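The flexible matching mentioned above can be illustrated with a small sketch, under stated assumptions. It normalizes common Arabic spelling variants (alef forms, ta marbuta, diacritics, tatweel) and then fuzzy-matches OCR output against a table of known labels using the standard library's `difflib`. The label table, the field names, the `normalize_arabic` and `match_label` helpers, and the 0.75 cutoff are illustrative assumptions, not the system's actual rules.

```python
import difflib
import re
import unicodedata
from typing import Dict, Optional

# Harakat (U+064B to U+0652), superscript alef (U+0670), and tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Fold letter variants that OCR and human spelling often interchange.
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def normalize_arabic(text: str) -> str:
    """Reduce Arabic text to a canonical form for comparison."""
    text = unicodedata.normalize("NFKC", text)  # folds presentation forms
    text = _MARKS.sub("", text)
    text = text.translate(_LETTER_MAP)
    return " ".join(text.split())  # collapse whitespace


def match_label(
    ocr_text: str, known_labels: Dict[str, str], cutoff: float = 0.75
) -> Optional[str]:
    """Map noisy OCR text to a field name, or None if nothing is close enough."""
    canonical = {normalize_arabic(k): v for k, v in known_labels.items()}
    hits = difflib.get_close_matches(
        normalize_arabic(ocr_text), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[hits[0]] if hits else None


# Hypothetical label table for illustration.
KNOWN_LABELS = {
    "الاسم": "name",
    "الجنسية": "nationality",
    "تاريخ الميلاد": "date_of_birth",
}

# A label missing its last character still resolves to the same field.
field = match_label("تاريخ الميلاد"[:-1], KNOWN_LABELS)
```

Normalizing before matching keeps the fuzzy threshold for genuine OCR noise rather than spending it on predictable spelling variants.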

Tech

PaddleOCR, PyTorch, OpenCV, ONNX Runtime, Python