The 19th International Conference on Document Analysis and Recognition
September 16-21, 2025, Wuhan, Hubei, China
M3RD: Multi-Modal Mathematical Reasoning in Documents
The AI community has placed significant emphasis on mathematical reasoning as a means to probe the intelligence of large language models (LLMs) and multi-modal large language models (MLLMs), such as OpenAI o1 and DeepSeek R1. As a common information medium, documents combine text, images, tables, diagrams, charts, mathematical notation, and more. By leveraging these multiple elements, multi-modal mathematical reasoning focuses on enabling machines to solve, interpret, and reason about mathematical problems. It combines image and text analysis, symbolic manipulation, numerical computation, and logical inference to address challenges ranging from basic arithmetic to advanced problem-solving in algebra, calculus, and beyond, making it an area of growing importance in document intelligence.
Recent advances in (multi-modal) large language models have driven progress in multi-modal mathematical reasoning, including chart reasoning, table reasoning, and geometry problem solving. Many real-world documents, such as academic papers, technical manuals, and financial reports, involve symbolic mathematics and logical reasoning. Addressing these requires algorithms that combine the precision of symbolic methods with the flexibility of modern AI. This workshop aims to bring together researchers from industry, science, and academia to exchange ideas and discuss ongoing research on multi-modal mathematical reasoning in documents.
Workshop on Visual Text Generation and Text Image Processing
Visual text generation and text image preprocessing are two fundamental areas that play crucial roles in modern visual text analysis systems and directly impact the performance of downstream tasks such as OCR, information extraction, and visual text understanding. Visual text generation addresses the critical challenge of data scarcity by creating diverse, high-quality synthetic datasets. This not only reduces the cost and time of manual data collection but also enables the creation of comprehensive training sets that cover various visual text types and edge cases. Text image preprocessing, the foundation of reliable visual text analysis pipelines, tackles real-world challenges such as low resolution, uneven illumination, and geometric distortion. These techniques are essential for handling diverse visual text conditions, from documents to scene images. The synergy between these two topics is becoming increasingly important to visual text analysis: visual text generation creates better training data, while advanced text image preprocessing improves text quality under real-world conditions, facilitating recognition. This workshop aims to bring together researchers and practitioners working on these two topics and foster innovations in visual text analysis.
Advancing Multimodal Document Understanding: Challenges and Opportunities
The rapid development of AI technologies has made multimodal understanding a central research area. Large multimodal models, capable of processing and integrating text, images, video, and speech, have already led to significant progress in document analysis. Documents are among the most complex and widespread data forms, offering a rich source of multimodal information such as text, images, handwritten content, tables, and even speech. Moreover, documents range from simple receipts and invoices to more complex structures such as academic papers, legal contracts, and medical records. This diversity highlights the need for intelligent systems that can extract, reason about, and interpret information from multiple modalities.
Despite advances in fields like Large Language Models (LLMs) and Computer Vision (CV), integrating these modalities to understand documents remains challenging. Unlike unimodal text or images, documents require reasoning about how different modalities interact. The layout of a document offers valuable structural clues. Tables and charts need visual processing to yield meaningful insights. Handwritten content requires advanced optical character recognition (OCR). Scanned and photographed documents of varying quality complicate tasks such as information extraction and semantic understanding. Combining textual, visual, and structural features is essential for many real-world applications, including invoice parsing for financial automation, contract analysis for legal decision-making, patient record summarization in healthcare, and document-based question answering in education and research. Solving these problems requires innovative models and robust benchmarks for assessing real-world performance.
This workshop aims to bring together experts from document analysis, NLP, computer vision, multimodal machine learning, and industry applications. It will serve as a platform to discuss solutions in multimodal document understanding and explore future research directions. The focus will be on integrating modalities, designing novel models and frameworks, and developing practical tools and evaluation benchmarks. By encouraging collaboration across academia and industry, we aim to unlock the potential of multimodal understanding to tackle real-world document processing challenges. This research can lead to impactful applications across both industry and society.
The Fifth ICDAR International Workshop on Machine Learning
Machine learning has gained enormous popularity since 2010, the year the annual ImageNet competition was launched, in which research teams submit programs that classify and detect objects. Today, machine learning, and deep learning in particular, is remarkably powerful at making predictions from large amounts of available data, with applications across computer vision and pattern recognition, including document analysis and medical image analysis. To facilitate innovative collaboration and engagement between the document analysis community and neighboring research communities such as computer vision and image analysis, we plan to organize this machine learning workshop after the ICDAR main conference.
The 16th IAPR International Workshop on Graphics Recognition (GREC 2025)
GREC workshops provide an excellent opportunity for researchers and practitioners at all levels of experience to meet colleagues and to share new ideas and knowledge about graphics recognition methods. Graphics Recognition is a subfield of document image analysis that deals with graphical entities in engineering drawings, comics, musical scores, sketches, maps, architectural plans, mathematical notation, tables, diagrams, etc.
The aim of this workshop is to foster a very high level of interaction and creative discussion among participants, preserving a "workshop" spirit rather than drifting toward a "mini-conference" model.
ICDAR 2025 Workshop on Document Analysis of Low-resource Languages
The importance of low-resource document analysis is multifaceted, spanning cultural preservation, data scarcity, linguistic research, and technological applications. Firstly, low-resource languages often embody unique cultural and historical contexts. Document analysis facilitates the digitization and preservation of these linguistic materials, providing crucial resources for understanding human history and cultural evolution. For instance, vast amounts of scanned documents exist for many endangered languages and can be analyzed to build valuable linguistic and cultural repositories. Secondly, low-resource languages typically lack large-scale annotated datasets, which poses challenges for training machine learning models. Document analysis techniques, such as Optical Character Recognition (OCR) and document layout analysis, enable the extraction and structuring of data from existing documents, thereby mitigating data scarcity. Moreover, document analysis plays a pivotal role in enhancing machine translation: monolingual data extracted through OCR can be used to improve machine translation for low-resource languages, which is particularly critical for languages with limited parallel corpora. Additionally, document analysis supports linguistic research by enabling the study of language variation and historical documentation, shedding light on the evolution and unique features of these languages. Finally, document analysis enhances the accessibility and usability of low-resource language documents: advances in OCR systems for non-Latin scripts allow researchers to extract text more efficiently from scanned documents, enabling applications such as content summarization and information retrieval. In summary, low-resource document analysis is not only a vital tool for cultural preservation but also a key driver of language technology development and academic research.
More information about the workshops will be available soon.