Docugami | AI Document Engineering Blog

Docugami to Showcase Small Agentic Reasoning Models that Outperform ChatGPT-grade LLMs at Prestigious ICDAR Conference

Written by Docugami | September 13, 2025 at 2:32 AM

At the world’s preeminent conference on document analysis later this month, Docugami will demonstrate its small agentic reasoning models that outperform much larger and more expensive proprietary LLMs.

Docugami’s novel approach uses small language models with limited human supervision to achieve high-quality reasoning results that outperform virtually all state-of-the-art LLMs, such as ChatGPT.

Docugami’s Science Team will present the company’s small agentic reasoning model work at the prestigious International Conference on Document Analysis and Recognition (ICDAR) on September 18 in Wuhan, China.

“Our small agentic reasoning models achieve exceptional results at lower cost, with greater security, and without the need for costly proprietary LLMs or extensive human prompting,” said Docugami CEO Jean Paoli, a renowned expert on document technologies and co-creator of the W3C XML standard.

Recently, the company’s work was evaluated in several industry and academic benchmarks, showing that Docugami’s small agentic reasoning models outperform all open-source reasoning models of comparable size and virtually all GPT-4-based frameworks.

For example, on the FinQA dataset, which requires reasoning over complex financial information, the benchmarking results show that Docugami’s small agentic model (called Docugami MATATA, for “MAthematical Tool-Assisted reasoning for Tabular Applications”) outperforms all existing models, including large closed-source models based on GPT-4 and ChatGPT as well as open-source models trained on human expert annotations.

| # | Framework | Closed-source / Open-source | Model | Method | FinQA Accuracy |
|---|---|---|---|---|---|
| 1 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.59 |
| 2 | TAT-LLM | Open-source | Llama2-70B | FT | 76.81 |
| 3 | EEDP | Closed-source | GPT-4 | PE | 76.05 |
| 4 | TAT-LLM | Open-source | Llama2-13B | FT | 71.93 |
| 5 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 70.10 |
| 6 | PoT | Closed-source | PoT-SC-Codex | PE | 68.1 |
| 7 | TAT-LLM | Open-source | Llama2-7B | FT | 65.13 |
| 8 | EEDP | Closed-source | ChatGPT | PE | 61.88 |
| 9 | FinQANet | Open-source | RoBERTa | FT | 61.24 |

Table 1. FinQA Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the FinQA dataset for various proprietary and open-source approaches.
PE: Prompt Engineering; FT: Fine-Tuning.

In recent years, as the mathematical reasoning capabilities of AI have increased rapidly, most attention has focused on approaches that pair Large Language Models (LLMs) with intensive human prompt engineering to tackle complex reasoning problems.

However, these approaches have important limitations and dependencies. Prompt engineering is labor-intensive, often requiring hand-crafted prompts, and may not capture the full complexity and diversity of the data. Proprietary Large Language Models raise data privacy and security concerns, especially with sensitive business documents, and present scalability issues due to their high cost and latency.

Our work in this area stems from Docugami’s roots in building business document foundation models, focused on the unique business documents of individual organizations. Docugami’s world-class Science team is working to advance AI approaches that can tackle complex mathematical reasoning problems over business documents with extraordinary precision, lower costs, greater efficiency, and improved data privacy and security for sensitive information.

We have developed a novel cost-effective method for training AI agents for tabular data problem-solving through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, Docugami’s small agentic reasoning model requires only 3.8B/8B Small Language Models (SLMs), making it especially well-suited for local hosting and sensitive business contexts where data privacy is vital.
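To make that training loop concrete, here is a minimal sketch in Python. It assumes a hypothetical AgentModel interface (generate_trace and fine_tune are placeholder names, not Docugami’s actual API): the agent plans and calls a reusable calculator tool, and only traces whose final answer matches the gold label, a weak signal that requires no human step-by-step annotation, are kept for the next fine-tuning round.

```python
# Minimal sketch of iterative weak supervision for a tool-using SLM agent.
# AgentModel, generate_trace, and fine_tune are hypothetical placeholders
# for illustration; they are not Docugami's actual implementation.
import ast
import operator
from dataclasses import dataclass
from typing import Callable, Protocol

def calculator(expression: str) -> float:
    """Reusable tool: safely evaluate arithmetic like '(1410 - 1250) / 1250'."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.USub: operator.neg}
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant):
            return float(node.value)
        if isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](ev(node.operand))
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

@dataclass
class Trace:
    steps: list[str]      # the agent's plan and tool calls, reused as training data
    final_answer: float

class AgentModel(Protocol):
    """Hypothetical agent interface (an assumption, not the real API)."""
    def generate_trace(self, question: str, table: dict,
                       tools: dict[str, Callable]) -> Trace: ...
    def fine_tune(self, traces: list[Trace]) -> "AgentModel": ...

def self_improve(model: AgentModel,
                 train_set: list[tuple[str, dict, float]],
                 rounds: int = 3, tol: float = 1e-3) -> AgentModel:
    """Progressive self-improvement: keep only traces whose final answer
    matches the gold answer (weak supervision, no human step annotations),
    then fine-tune on them and repeat."""
    for _ in range(rounds):
        accepted: list[Trace] = []
        for question, table, gold in train_set:
            trace = model.generate_trace(question, table, tools={"calc": calculator})
            if abs(trace.final_answer - gold) <= tol:
                accepted.append(trace)   # answer match is the only supervision signal
        model = model.fine_tune(accepted)  # each round starts from the improved model
    return model
```

Because the filter checks only the final answer, the loop can bootstrap from unannotated question-answer pairs, which is what keeps the human supervision limited.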

Using flexible and reusable tools across different datasets, Docugami’s small agentic model achieves exceptional performance at radically lower cost, with excellent scalability across a variety of shared tasks.

We recently evaluated Docugami’s small agentic model using several of the most widely used datasets for benchmarking mathematical problem-solving. The FinQA and TAT-QA datasets represent complex real-world reasoning scenarios that combine tabular and text data and require reasoning over financial information in intricate contexts. The TabMWP dataset involves mathematical word problems over tabular data. All three datasets require multi-step reasoning, information extraction, data manipulation, tabular and contextual understanding, and numerical calculations, making them useful evaluators for AI intended for use with lengthy business documents and complex business scenarios.
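To give a sense of what these benchmarks ask of a model, here is an invented FinQA-style question (illustrative only, not drawn from the dataset) together with the multi-step computation it requires:

```python
# Invented FinQA-style example (illustrative only, not from the dataset).
# Table excerpt, revenue in $ millions:
revenue = {"2022": 1250.0, "2023": 1410.0}

# Question: "What was the percentage growth in revenue from 2022 to 2023?"
# Multi-step reasoning: locate both cells, take the difference,
# divide by the base-year value, and convert to a percentage.
growth_pct = (revenue["2023"] - revenue["2022"]) / revenue["2022"] * 100
print(f"{growth_pct:.2f}%")   # -> 12.80%
```

Answering correctly requires chaining extraction, arithmetic, and unit conversion; an error at any single step produces a wrong final answer, which is why exact-match accuracy is the metric of record.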

For the TAT-QA dataset, the experimental results demonstrate that Docugami’s small agentic models achieve outstanding results, outperforming virtually all other approaches, including many with much larger and more expensive model sizes.


| # | Framework | Closed-source / Open-source | Model | Method | TAT-QA Accuracy |
|---|---|---|---|---|---|
| 1 | TAT-LLM | Open-source | Llama2-70B | FT | 81.4 |
| 2 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.6 |
| 3 | TAT-LLM | Open-source | Llama2-13B | FT | 77.5 |
| 4 | Code Generation for Table-Text Question | n/a | n/a | n/a | 76.8 |
| 5 | TAT-LLM | Open-source | Llama2-7B | FT | 76.4 |
| 6 | AeNER | n/a | n/a | n/a | 75.0 |
| 7 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 74.2 |
| 8 | Code Generation for Table-Text Question | n/a | n/a | n/a | 73.7 |
| 9 | Encore | Open-source | BART-Large | FT | 71.8 |
| 10 | KFEX-N | Open-source | DeBERTa-V3-Large | FT | 71.0 |

Table 2. TAT-QA Dataset Exact Match Accuracy for Proprietary and Open-Source Models. Leaderboard of exact match accuracy scores for mathematical reasoning with the TAT-QA dataset for various proprietary and open-source approaches. Source: https://nextplusplus.github.io/TAT-QA/
PE: Prompt Engineering; FT: Fine-Tuning; n/a: not available.


Similarly, for the TabMWP dataset, the experimental results demonstrate that Docugami’s small agentic model outperforms all existing open-source models, even those that rely on much larger LLMs or on training data annotated by GPT-4. Our approach also outperforms all but one closed-source model, nearly matching the performance of the much larger and much more expensive GPT-4 model, with 98.13 percent accuracy compared to 98.78 percent for GPT-4.

| # | Framework | Closed-source / Open-source | Model | Method | TabMWP Accuracy |
|---|---|---|---|---|---|
| 1 | Chameleon | Closed-source | GPT-4 | PE | 98.78 |
| 2 | Docugami MATATA | Open-source | Ministral-8B | FT | 98.13 |
| 3 | PoT GPT-4 | Closed-source | GPT-4 | PE | 96.93 |
| 4 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 96.66 |
| 5 | CREATOR | Closed-source | ChatGPT | PE | 94.7 |
| 6 | Chameleon | Closed-source | ChatGPT | PE | 93.28 |
| 7 | TaCo | Open-source | TAPEX-large | FT | 92.91 |
| 8 | PoT ChatGPT + Doc | Closed-source | PoT-SC-Codex | PE | 92.69 |
| 9 | CoT GPT-4 | Closed-source | GPT-4 | PE | 90.81 |
| 10 | CoS-Planning | Closed-source | ChatGPT | PE | 90.00 |
| 11 | PoT ChatGPT | Closed-source | ChatGPT | PE | 89.49 |

Table 3. TabMWP Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the TabMWP dataset for various proprietary and open-source approaches. Source: https://promptpg.github.io/leaderboard.html
PE: Prompt Engineering; FT: Fine-Tuning.
Docugami is focused on transforming businesses of all sizes and across all sectors by unlocking the data and insights contained in complex business documents. Our work in this area aims to reduce reliance on proprietary Large Language Models and human prompt engineering, delivering comparable or better performance with lower costs, faster local results, and increased data privacy. This will be particularly important in business settings where cost-effectiveness, operational efficiency and flexibility, and data privacy and security are key considerations.

We’re excited to be presenting Docugami’s reasoning advances at ICDAR, the world’s foremost conference on document analysis and recognition, and we see enormous potential in our novel approach to unlock complex document data and advance mathematical reasoning. Results from academic and industry benchmarks demonstrate that Docugami’s agentic approach, using tool-augmented Small Language Models with weak supervision, can meet or exceed the performance of existing reasoning methods that rely on costly LLMs like GPT-4 and ChatGPT and on extensive human prompt engineering. Our dedicated Science team will continue to accelerate progress in the important area of agentic models.


Notes: You can read an early paper on our work here: https://arxiv.org/abs/2411.18915. This work was supported in part by the National Science Foundation SBIR Phase II grant 2233508 “Authoring Assistance via Contextual Semantic Labeling” and by the Mathematics of Information Technology and Complex Systems (MITACS) Accelerate grant IT41737. This work used Jetstream2 at Indiana University through allocation CIS230177: Assessing Next Generation LLMs for Contextual Semantic Labeling from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. Ministral-8B was used under the “The Mistral AI Non-Production License” (link), for research purposes only; Phi3-mini was used under the MIT License (link). Funding for this work was provided in part by Mitacs, the Canadian innovation organization.