Towards Khmer Scene Document Layout Detection

arXiv:2603.00707v1

Abstract: While document layout analysis for Latin scripts has advanced significantly with the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained by the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Because of the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly in dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset built specifically for Khmer scene layouts; (2) an open-source document augmentation tool that synthesizes realistic scene documents to scale up training data; and (3) layout detection baselines that use YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in a gated repository (in review).
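To illustrate why oriented bounding boxes suit tilted scene text, here is a minimal sketch, not taken from the paper. It assumes a common OBB parameterization (center, width, height, rotation angle in radians); the function names `obb_to_corners` and `axis_aligned_hull` are hypothetical, and the abstract does not state which box format the authors use.

```python
import math


def obb_to_corners(cx, cy, w, h, angle_rad):
    """Return the four corners of a rotated box as (x, y) tuples.

    Assumed parameterization (not confirmed by the source): center (cx, cy),
    side lengths w and h, and a rotation of angle_rad about the center.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_w, half_h = w / 2.0, h / 2.0
    # Corner offsets in the box's own frame, walking around the rectangle.
    offsets = [(-half_w, -half_h), (half_w, -half_h),
               (half_w, half_h), (-half_w, half_h)]
    # Rotate each offset by the angle, then translate to the center.
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in offsets]


def axis_aligned_hull(corners):
    """Return (x_min, y_min, x_max, y_max) enclosing the given corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # A 4 x 2 text line rotated by 45 degrees.
    corners = obb_to_corners(0.0, 0.0, 4.0, 2.0, math.radians(45))
    x0, y0, x1, y1 = axis_aligned_hull(corners)
    print("oriented box area:", 4.0 * 2.0)
    print("axis-aligned area:", (x1 - x0) * (y1 - y0))
```

For a rotated text line, the axis-aligned box that encloses it covers noticeably more area than the oriented box, so it tends to swallow neighboring layout units in dense regions.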
