This is the next in a series of articles examining the work of UiPath Research, a growing team of AI scientists, researchers, and engineers expanding the AI capabilities of the UiPath Platform™.
As the leader of this team, I'm pleased to announce that UiPath DocPath, our large language model (LLM) for information extraction from documents, is entering general availability for the U.S. region (other regions to follow). DocPath already powers our public endpoints, including more than 40 out-of-the-box models. It is now available for fine-tuning within Modern Projects in Document Understanding for the U.S. region.
This post will detail our measurement methodology and our findings on DocPath's performance under real-world workloads.
Why DocPath?
There are various information extraction approaches available today, including supervised models as well as multi-modal frontier models like GPT-4o that are increasingly able to do zero-shot extraction. We created DocPath to provide our customers with the following capabilities:
- High accuracy for semi-structured documents and forms, e.g., invoices, receipts, purchase orders, W-4s, ACORD forms, and other document types similar to our out-of-the-box models.
- Efficient model training with active learning within UiPath Document Understanding.
- Confidence scores in model output, facilitating workflows with selective human-in-the-loop intervention for lower-confidence inferences (a minimal routing sketch follows this list).
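To make the confidence-score workflow concrete, here is a minimal Python sketch of threshold-based routing. The threshold value, field names, and `extraction` structure are hypothetical illustrations, not the UiPath Document Understanding API:

```python
# Minimal sketch of selective human-in-the-loop routing based on
# per-field confidence scores. Threshold and data layout are hypothetical.

CONFIDENCE_THRESHOLD = 0.90  # tune per field and per business risk

def route_fields(extraction: dict[str, dict]) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and human-review buckets."""
    auto_accepted, needs_review = {}, {}
    for field, result in extraction.items():
        if result["confidence"] >= CONFIDENCE_THRESHOLD:
            auto_accepted[field] = result["value"]
        else:
            needs_review[field] = result  # queue for a human validator
    return auto_accepted, needs_review

# Example model output for one invoice:
extraction = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.98},
    "total_amount":   {"value": "1,980.00", "confidence": 0.71},
}
accepted, review = route_fields(extraction)
print(accepted)  # {'invoice_number': 'INV-1042'}
print(review)    # low-confidence 'total_amount' goes to review
```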
Measurement Methodology
We employed two primary methods to evaluate DocPath's performance:
- Automated Testing: Since its announcement, we've been automatically testing fine-tuning for DocPath alongside our production LayoutLMv3 model. This approach allowed us to:
  - Assess DocPath's performance under real-world conditions across a wide range of document types and extraction tasks.
  - Compare results directly with our previous specialized information extraction model (LayoutLMv3) without impacting customer operations.
- Ground Truth Evaluations: We conducted controlled evaluations using curated datasets representing various document types. This method provided insights into DocPath's performance with high-quality, verified data (a sketch of this style of evaluation follows this list).
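As an illustration of what a field-level ground-truth evaluation can look like, here is a small Python sketch. The exact-string matching and the counting convention (a wrong value counts as a false positive, a missing value as a false negative) are simplifying assumptions; real evaluations typically normalize values such as dates and amounts before comparing:

```python
from typing import Optional

def evaluate(predictions: dict[str, Optional[str]],
             ground_truth: dict[str, Optional[str]]) -> dict[str, float]:
    """Score one document: exact match per field (a simplification)."""
    tp = fp = fn = 0
    for field, truth in ground_truth.items():
        pred = predictions.get(field)
        if truth is not None and pred == truth:
            tp += 1   # correct value extracted
        elif pred is not None:
            fp += 1   # wrong (or spurious) value produced
        elif truth is not None:
            fn += 1   # field missed entirely
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the model misses 'due_date' and gets 'total' wrong.
preds = {"invoice_number": "INV-1042", "total": "1,980.00", "due_date": None}
truth = {"invoice_number": "INV-1042", "total": "1,890.00", "due_date": "2024-07-01"}
print(evaluate(preds, truth))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```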
Results
DocPath achieved a better F1 score than LayoutLMv3 on 90% of extraction tasks in automated testing. In the few situations where DocPath had a lower F1 score than LayoutLMv3, we found that the training data was extremely limited or mislabeled, causing statistical uncertainty in the results.
In addition, our ground truth evaluations revealed the following results:
| Metric | LayoutLMv3 | DocPath | Improvement |
| --- | --- | --- | --- |
| False Positive Rate | 3.9% | 3.3% | ~15% reduction |
| False Negative Rate | 4.1% | 3.4% | ~17% reduction |
| F1 Score* | 95.21 | 96.51 | +1.30 |

F1 is a commonly used metric in the field of information retrieval, balancing false positives and false negatives; small improvements are significant as the score gets higher.
We also performed ground-truth evaluations of DocPath (with fine-tuning) against zero-shot extraction using GPT-4o. The gap is significant, which is to be expected: fine-tuning lets DocPath capture nuances that are difficult to express in an extraction prompt.
| Metric | GPT-4o | DocPath | Improvement |
| --- | --- | --- | --- |
| False Positive Rate | 17.0% | 3.3% | ~5x (81% reduction) |
| False Negative Rate | 16.4% | 3.6% | ~4x (78% reduction) |
| F1 Score | 83.19 | 96.51 | +13.32 |
Analysis
Our testing in private preview indicates several key benefits of DocPath:
- Increased Accuracy: A higher F1 score, particularly against already high baselines, translates to a meaningful improvement in overall accuracy.
- Reduced False Positives: The significant reduction in false positive rates is likely to result in higher automation rates. In a typical invoice-automation use case, human review is requested only when the model produces no output for a field. Under that policy, false positives (where the model produces an incorrect value rather than nothing) become silent errors, and automation developers must otherwise default to always requiring human review to prevent them; see the back-of-the-envelope sketch after this list.
- Improved Accuracy (after fine-tuning) vs. Frontier Models (zero-shot): The comparison against zero-shot extraction with frontier models demonstrates DocPath's effectiveness on specialized document understanding tasks once it has been fine-tuned for them.
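To illustrate the false-positive argument above, here is a back-of-the-envelope sketch. The per-field false positive rates come from the first table; the 10-field document size and the assumption that fields err independently are hypothetical:

```python
# Under a "review only when the model returns nothing" policy, a false
# positive (wrong value produced) is never inspected by a human.
# Per-field FP rates are from the table above; document size and
# field independence are illustrative assumptions, not measurements.

def p_silent_error(fp_rate: float, fields: int = 10) -> float:
    """Probability that at least one unreviewed wrong value ships per document."""
    return 1.0 - (1.0 - fp_rate) ** fields

print(f"LayoutLMv3: {p_silent_error(0.039):.1%}")  # ~32.8% of documents
print(f"DocPath:    {p_silent_error(0.033):.1%}")  # ~28.5% of documents
```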
Limitations and Future Work
While these results are promising, it is important to note that:
- The comparison against frontier models is based on zero-shot performance. We are actively working to compare against fine-tuned performance of these models.
- Early exploratory testing indicates that DocPath requires less training data to achieve F1 scores comparable to LayoutLMv3's. We are continuing to gather more data to validate this preliminary result.
As we move into general availability, we aim to gather more diverse real-world data to further validate and improve DocPath's performance.
We welcome feedback from users, as we continue to refine and iterate on this model as well as other advanced information extraction techniques.
*Note: The F-score is a metric used to evaluate the accuracy of systems in binary classification and information retrieval. It is calculated from the precision and recall of the test: precision is the number of true positive results divided by the number of all samples predicted positive (including those incorrectly identified), and recall is the number of true positive results divided by the number of all samples that should have been identified as positive. The F1 score is the harmonic mean of precision and recall, combining both in a single metric.
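As a worked example of this footnote, the following snippet computes F1 from made-up counts and shows that it is the harmonic, rather than arithmetic, mean of precision and recall:

```python
# Worked example of the footnote above; the counts are illustrative.
tp, fp, fn = 90, 10, 5            # true positives, false positives, false negatives
precision = tp / (tp + fp)        # 0.900
recall = tp / (tp + fn)           # ~0.947
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")  # F1≈0.923
```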