Docugami | AI Document Engineering Blog

Docugami to Showcase Small Agentic Reasoning Models that Outperform ChatGPT-grade LLMs at Prestigious ICDAR Conference

Written by Docugami | September 13, 2025 at 2:32 AM

At the world’s preeminent conference on document analysis later this month, Docugami will demonstrate its small agentic reasoning models that outperform much larger and more expensive proprietary LLMs.

Docugami’s novel approach uses small language models with limited human supervision to achieve high-quality reasoning results that outperform virtually all state-of-the-art LLMs, such as ChatGPT.

Docugami’s Science Team will present the company’s small agentic reasoning model work at the prestigious International Conference on Document Analysis and Recognition (ICDAR) on September 18 in Wuhan, China.

“Our small agentic reasoning models achieve exceptional results at lower cost, with greater security, and without the need for costly proprietary LLMs or extensive human prompting,” said Docugami CEO Jean Paoli, a renowned expert on document technologies and co-creator of the W3C XML standard.

Recently, the company’s work was evaluated in several industry and academic benchmarks, showing that Docugami’s small agentic reasoning models outperform all open-source reasoning models of comparable size and virtually all GPT-4-based frameworks.

For example, on the FinQA dataset, which requires reasoning over complex financial information, the benchmarking results show that Docugami’s small agentic model (called Docugami MATATA, for “MAthematical Tool-Assisted reasoning for Tabular Applications”) outperforms all existing models, including large closed-source models based on GPT-4 and ChatGPT as well as open-source models trained on human expert annotations.

| # | Framework | Closed-source / Open-source | Model | Method | FinQA Accuracy |
|---|---|---|---|---|---|
| 1 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.59 |
| 2 | TAT-LLM | Open-source | Llama2-70B | FT | 76.81 |
| 3 | EEDP | Closed-source | GPT-4 | PE | 76.05 |
| 4 | TAT-LLM | Open-source | Llama2-13B | FT | 71.93 |
| 5 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 70.10 |
| 6 | PoT | Closed-source | PoT-SC-Codex | PE | 68.1 |
| 7 | TAT-LLM | Open-source | Llama2-7B | FT | 65.13 |
| 8 | EEDP | Closed-source | ChatGPT | PE | 61.88 |
| 9 | FinQANet | Open-source | RoBERTa | FT | 61.24 |

Table 1. FinQA Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the FinQA dataset for various proprietary and open-source approaches.
PE: Prompt Engineering; FT: Fine-Tuning.

In recent years, as the mathematical reasoning capabilities of AI have increased rapidly, most attention has focused on approaches that pair Large Language Models (LLMs) with intensive human prompt engineering to tackle complex reasoning problems.

However, these approaches have important limitations and dependencies. Prompt engineering is labor-intensive, often requiring hand-crafted prompts, and may not capture the full complexity and diversity of the data. Proprietary Large Language Models raise data privacy and security concerns, especially with sensitive business documents, and present scalability issues due to their high cost and latency.

Our work in this area stems from Docugami’s roots in building business document foundation models, focused on the unique business documents of individual organizations. Docugami’s world-class Science team is working to advance AI approaches that can tackle complex mathematical reasoning problems over business documents with extraordinary precision, lower costs, greater efficiency, and improved data privacy and security for sensitive information.

We have developed a novel cost-effective method for training AI agents for tabular data problem-solving through reasoning, planning, and tool use. With a progressive self-improvement paradigm and iterative weak supervision, Docugami’s small agentic reasoning model requires only 3.8B/8B Small Language Models (SLMs), making it especially well-suited for local hosting and sensitive business contexts where data privacy is vital.
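To make that training loop concrete, here is a minimal sketch in Python. It assumes a hypothetical AgentModel interface (generate_trace and fine_tune are placeholder names, not Docugami’s actual API): the agent plans and calls a reusable calculator tool, and only traces whose final answer matches the gold label, a weak signal that requires no human step-by-step annotation, are kept for the next fine-tuning round.

```python
# Minimal sketch of iterative weak supervision for a tool-using SLM agent.
# AgentModel, generate_trace, and fine_tune are hypothetical placeholders
# for illustration; they are not Docugami's actual implementation.
import ast
import operator
from dataclasses import dataclass
from typing import Callable, Protocol

def calculator(expression: str) -> float:
    """Reusable tool: safely evaluate arithmetic like '(1410 - 1250) / 1250'."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.USub: operator.neg}
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant):
            return float(node.value)
        if isinstance(node, ast.UnaryOp):
            return ops[type(node.op)](ev(node.operand))
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

@dataclass
class Trace:
    steps: list[str]      # the agent's plan and tool calls, reused as training data
    final_answer: float

class AgentModel(Protocol):
    """Hypothetical agent interface (an assumption, not the real API)."""
    def generate_trace(self, question: str, table: dict,
                       tools: dict[str, Callable]) -> Trace: ...
    def fine_tune(self, traces: list[Trace]) -> "AgentModel": ...

def self_improve(model: AgentModel,
                 train_set: list[tuple[str, dict, float]],
                 rounds: int = 3, tol: float = 1e-3) -> AgentModel:
    """Progressive self-improvement: keep only traces whose final answer
    matches the gold answer (weak supervision, no human step annotations),
    then fine-tune on them and repeat."""
    for _ in range(rounds):
        accepted: list[Trace] = []
        for question, table, gold in train_set:
            trace = model.generate_trace(question, table, tools={"calc": calculator})
            if abs(trace.final_answer - gold) <= tol:
                accepted.append(trace)   # answer match is the only supervision signal
        model = model.fine_tune(accepted)  # each round starts from the improved model
    return model
```

Because the filter checks only the final answer, the loop can bootstrap from unannotated question-answer pairs, which is what keeps the human supervision limited.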

Using flexible and reusable tools across different datasets, Docugami’s small agentic model achieves exceptional performance at radically lower cost, with excellent scalability across a variety of shared tasks.

We recently evaluated Docugami’s small agentic model using several of the most widely used datasets for benchmarking mathematical problem-solving. The FinQA and TAT-QA datasets represent complex real-world reasoning scenarios that combine tabular and text data and require reasoning over financial information in intricate contexts. The TabMWP dataset involves mathematical word problems over tabular data. All three datasets require multi-step reasoning, information extraction, data manipulation, tabular and contextual understanding, and numerical calculations, making them useful evaluators for AI intended for use with lengthy business documents and complex business scenarios.
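To give a sense of what these benchmarks ask of a model, here is an invented FinQA-style question (illustrative only, not drawn from the dataset) together with the multi-step computation it requires:

```python
# Invented FinQA-style example (illustrative only, not from the dataset).
# Table excerpt, revenue in $ millions:
revenue = {"2022": 1250.0, "2023": 1410.0}

# Question: "What was the percentage growth in revenue from 2022 to 2023?"
# Multi-step reasoning: locate both cells, take the difference,
# divide by the base-year value, and convert to a percentage.
growth_pct = (revenue["2023"] - revenue["2022"]) / revenue["2022"] * 100
print(f"{growth_pct:.2f}%")   # -> 12.80%
```

Answering correctly requires chaining extraction, arithmetic, and unit conversion; an error at any single step produces a wrong final answer, which is why exact-match accuracy is the metric of record.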

For the TAT-QA dataset, the experimental results demonstrate that Docugami’s small agentic models achieve outstanding results, outperforming virtually all other approaches, including many with much larger and more expensive model sizes.


| # | Framework | Closed-source / Open-source | Model | Method | TAT-QA Accuracy |
|---|---|---|---|---|---|
| 1 | TAT-LLM | Open-source | Llama2-70B | FT | 81.4 |
| 2 | Docugami MATATA | Open-source | Ministral-8B | FT | 77.6 |
| 3 | TAT-LLM | Open-source | Llama2-13B | FT | 77.5 |
| 4 | Code Generation for Table-Text Question | n/a | n/a | n/a | 76.8 |
| 5 | TAT-LLM | Open-source | Llama2-7B | FT | 76.4 |
| 6 | AeNER | n/a | n/a | n/a | 75.0 |
| 7 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 74.2 |
| 8 | Code Generation for Table-Text Question | n/a | n/a | n/a | 73.7 |
| 9 | Encore | Open-source | BART-Large | FT | 71.8 |
| 10 | KFEX-N | Open-source | DeBERTa-V3-Large | FT | 71.0 |

Table 2. TAT-QA Dataset Exact Match Accuracy for Proprietary and Open-Source Models. Leaderboard of exact match accuracy scores for mathematical reasoning with the TAT-QA dataset for various proprietary and open-source approaches. Source: https://nextplusplus.github.io/TAT-QA/
PE: Prompt Engineering; FT: Fine-Tuning; n/a: not available.


Similarly, for the TabMWP dataset, the experimental results demonstrate that Docugami’s small agentic model outperforms all existing open-source models, even those that rely on much larger LLMs or on training data annotated by GPT-4. Our approach also outperforms all but one closed-source model, nearly matching the performance of the much larger and much more expensive GPT-4 model, with 98.13 percent accuracy compared to 98.78 percent for GPT-4.

| # | Framework | Closed-source / Open-source | Model | Method | TabMWP Accuracy |
|---|---|---|---|---|---|
| 1 | Chameleon | Closed-source | GPT-4 | PE | 98.78 |
| 2 | Docugami MATATA | Open-source | Ministral-8B | FT | 98.13 |
| 3 | PoT GPT-4 | Closed-source | GPT-4 | PE | 96.93 |
| 4 | Docugami MATATA | Open-source | Phi3-mini-3.8B | FT | 96.66 |
| 5 | CREATOR | Closed-source | ChatGPT | PE | 94.7 |
| 6 | Chameleon | Closed-source | ChatGPT | PE | 93.28 |
| 7 | TaCo | Open-source | TAPEX-large | FT | 92.91 |
| 8 | PoT ChatGPT + Doc | Closed-source | PoT-SC-Codex | PE | 92.69 |
| 9 | CoT GPT-4 | Closed-source | GPT-4 | PE | 90.81 |
| 10 | CoS-Planning | Closed-source | ChatGPT | PE | 90.00 |
| 11 | PoT ChatGPT | Closed-source | ChatGPT | PE | 89.49 |

Table 3. TabMWP Dataset Accuracy for Proprietary and Open-Source Models. Leaderboard of accuracy scores for mathematical reasoning with the TabMWP dataset for various proprietary and open-source approaches. Source: https://promptpg.github.io/leaderboard.html
PE: Prompt Engineering; FT: Fine-Tuning.
Docugami is focused on transforming businesses of all sizes and across all sectors by unlocking the data and insights contained in complex business documents. Our work in this area aims to reduce reliance on proprietary Large Language Models and human prompt engineering, delivering comparable or better performance with lower costs, faster local results, and increased data privacy. This will be particularly important in business settings where cost-effectiveness, operational efficiency and flexibility, and data privacy and security are key considerations.

We’re excited to be presenting Docugami’s reasoning advances at ICDAR, the world’s foremost conference on document analysis and recognition, and we see enormous potential in our novel approach to unlock complex document data and advance mathematical reasoning. Results from academic and industry benchmarks demonstrate that Docugami’s agentic approach, using tool-augmented Small Language Models with weak supervision, can meet or exceed the performance of existing reasoning methods that rely on costly LLMs like GPT-4 and ChatGPT and on extensive human prompt engineering. Our dedicated Science team will continue to accelerate progress in the important area of agentic models.


Notes: You can read an early paper on our work here: https://arxiv.org/abs/2411.18915. This work was supported in part by the National Science Foundation SBIR Phase II grant 2233508 “Authoring Assistance via Contextual Semantic Labeling” and by the Mathematics of Information Technology and Complex Systems (MITACS) Accelerate grant IT41737. This work used Jetstream2 at Indiana University through allocation CIS230177: Assessing Next Generation LLMs for Contextual Semantic Labeling from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. Ministral-8B was used under the “The Mistral AI Non-Production License” (link), for research purposes only; Phi3-mini was used under the MIT License (link). Funding for this work was provided in part by Mitacs, the Canadian innovation organization.