The 19th International Conference on
Document Analysis and
Recognition
September 16-21, 2025 Wuhan, Hubei, China
Prof Yu Qiao
Lead scientist at Shanghai AI Laboratory
Vice Dean of Shanghai Innovation Institute
Large Multi-modal Models for Document Understanding: Advances and Challenges
In recent years, breakthrough advancements in large language and multimodal models, represented by ChatGPT and Gemini, have paved the way toward artificial general intelligence (AGI). Guided by the Scaling Law, these large models exhibit general and unprecedented capabilities in various tasks. Text recognition and document understanding are long-standing core challenges in AI, as well as hallmark successes in practical applications. On one hand, OCR and document technologies, driven by the ICDAR community for decades, have played a pivotal role in supplying high-quality training corpora for language and foundation models. On the other hand, large models have transcended the boundaries between vision, language, and structured data, significantly advancing the state of the art in OCR and document understanding. This talk will present the progress and future trends of multimodal large models, with a focus on underlying technological innovations, exemplified by InternVL, and discuss how to achieve emergent multimodal intelligence. It will also analyze the progress and challenges in using large models to enhance complex text and document understanding tasks.
Yu Qiao is a lead scientist and professor at the Shanghai Artificial Intelligence Laboratory and vice dean of the Shanghai Innovation Institute. His research interests include general large models, computer vision, deep learning, and related fields. He led the development of the general vision large model InternImage, which achieved SOTA results on various vision tasks, and of the well-known large multi-modal model InternVL, a top-performing open-source LMM. He has published over 300 research papers, cited over 90,000 times cumulatively, with an H-index of over 130. He received the AAAI 2021 Distinguished Paper Award, the CVPR 2023 Best Paper Award, and the ACL 2024 Outstanding Paper Award.
Prof Josep Lladós
Director of the Computer Vision Center
Computer Sciences Department of the Universitat Autònoma de Barcelona
The Dual Syntax of Documents. Structural Reasoning in Document AI
Foundation models have marked a significant advancement in document processing, offering robust solutions that enable businesses to automate and streamline their document workflows. Large Language Models (LLMs) have propelled the field of Document Intelligence (DI), delivering substantial impact across domains such as fintech, legaltech, and insurtech by automating the reading and understanding of document content.
To fully leverage the potential of foundation models in DI, it is essential to consider the structural language inherent in documents. As communication artifacts, documents follow compositional rules that govern both their linguistic and structural syntax. On one hand, documents contain textual elements that adhere to linguistic patterns—an area where LLMs have demonstrated remarkable progress. On the other hand, documents exhibit structural syntax, where layout and formatting define the spatial and semantic relationships between elements.
Achieving a comprehensive understanding of documents requires learning representations that integrate both linguistic and structural dimensions. Relational reasoning in document parsing involves manipulating structured representations of semantically meaningful components—such as titles, tables, and figures—based on compositional rules. Modern AI approaches incorporate a relational inductive bias, which enforces constraints among entities during training. This is a foundational principle in tasks such as semantic segmentation, document classification, object recognition, and visual question answering.
This keynote highlights the importance of structural language models in learning document representations. In particular, we will explore graph representation learning as a powerful method for modeling the structural knowledge (layout) of documents, enabling a dual-level interpretation that encompasses both linguistic and syntactic aspects. The concept of a document language model is introduced as a compositional framework that describes documents in terms of their constituent objects, relationships, and properties.
We will also examine document knowledge representations that integrate both declarative and procedural knowledge, facilitating advanced reasoning capabilities. The proposed approaches will be illustrated through applications in semantic segmentation, classification, document object recognition, and link discovery.
Josep Lladós received his degree in Computer Sciences in 1991 from the Universitat Politècnica de Catalunya and his PhD degree in Computer Sciences in 1997 from the Universitat Autònoma de Barcelona (Spain) and the Université Paris 8 (France). He is currently a Professor at the Computer Sciences Department of the Universitat Autònoma de Barcelona and director of the Computer Vision Center. He is an associate researcher of the IDAKS Lab of the Osaka Prefecture University (Japan). He is chair holder of Knowledge Transfer of the UAB Research Park and Santander Bank. His current research fields are document intelligence and graph-based learning. He has headed a number of Computer Vision R&D projects and published more than 260 papers in national and international conferences and journals. He has supervised 18 PhD theses. He is a member of the IAPR, where he is currently the secretary of the Executive Committee. He has chaired several committees: IAPR-EC (Education Committee), IAPR-ILC (Industrial Liaison Committee) and the IAPR TC-10 (Technical Committee on Graphics Recognition). He has served on the editorial board of several journals and international conferences (co-editor in chief of IJDAR, general chair of ICDAR2009). He was involved in the committees for the definition of the Spanish and Catalan Artificial Intelligence Strategy. He was the recipient of the IAPR-ICDAR Young Investigator Award in 2007. He received the Certificate of Appreciation of the IAPR in 2022. Josep Lladós also has experience in technology transfer, and in 2002 he created the company ICAR Vision Systems, in the area of Document Image Analysis. His current h-index (Google Scholar) is 47.
Koichi Kise
Professor of Graduate School of Informatics, Osaka Metropolitan University
From AI to AI - Why Document Analysis and Recognition Stands Out: A Personal Perspective
The research field of Document Analysis and Recognition possesses unique aspects that set it apart from other disciplines. One of the most important is that characters reside at the intersection of signal and symbol, and documents convey knowledge in both signal and symbolic forms. In this keynote talk, I will reflect on the distinctive nature of this field, as I have experienced it since attending the 2nd ICDAR (1993, Tsukuba, Japan), and discuss these features in relation to my own research journey. Through this talk, I hope to provide the audience with insights that may serve as a compass for their future research endeavors.
Koichi Kise received his BE, ME, and Dr. Eng. degrees in Communication Engineering from Osaka University, Japan, in 1986, 1988, and 1991, respectively. From 2000 to 2001, he was a visiting researcher at the German Research Center for Artificial Intelligence (DFKI), Germany. He is currently a Professor in the Department of Core Informatics at the Graduate School of Informatics, Osaka Metropolitan University, Japan.
In 2008, in collaboration with Professor Andreas Dengel of DFKI, he co-founded the Institute of Document Analysis and Knowledge Science (IDAKS) at Osaka Prefecture University (now Osaka Metropolitan University), where he currently serves as Director. He also serves as Director of the DFKI Lab Japan, established in 2022 at Osaka Metropolitan University as DFKI’s first overseas laboratory.
Prof. Kise served as the General Chair of ICDAR 2017 and as a Program Chair of ICDAR 2013, 2023, and 2026. He has been one of the Editors-in-Chief of the International Journal on Document Analysis and Recognition (IJDAR) since 2013.
His research interests include document analysis, human behavior understanding, learning augmentation, and AI applications in medicine.