The Hidden Challenge of LLMs

By Cory Jaeger

The Rise of LLMs and the Explosion of Unstructured Data

The more businesses and individuals rely on large language models (LLMs) like ChatGPT, Claude, and others, the more unstructured data we create. Every conversation, every generated document, every report adds to a growing mountain of information. Today, over 80% of all data is already unstructured, and analysts estimate it’s growing at 50–60% per year, a trend accelerated by AI itself. While this wave of data opens enormous opportunities, it also introduces inconsistencies and quality issues that demand new solutions.

Why LLM Results Can’t Be Fully Trusted Yet

Although LLMs are powerful, their ability to extract and synthesize information is not consistent enough to rely on without human review. Accuracy varies widely by source: extraction from structured, machine-readable tables can reach 90–98% accuracy, but it drops to 75% for tables in PDFs and averages 60–70% for free text. Images and footnotes are no better, often below 70% accuracy. Results also vary from run to run: the same question posed to different LLMs, or to the same LLM in different sessions, can deliver very different answers. Without validation, businesses risk basing key decisions on data riddled with errors.
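One simple way to see this inconsistency is to ask the same question several times and measure how often the answers agree. The sketch below is illustrative only: the sample answers and the 0.9 review threshold are assumptions, not figures from any real system.

```python
from collections import Counter

def agreement(answers):
    """Return the most common answer and the share of runs that gave it."""
    # Light normalization so casing and stray spaces do not count as disagreement.
    normalized = [a.strip().lower() for a in answers]
    value, count = Counter(normalized).most_common(1)[0]
    return value, count / len(normalized)

# Hypothetical answers from five runs of the same extraction prompt.
runs = ["$4.2M", "$4.2M", "$4.2m ", "$4.8M", "$4.2M"]

value, share = agreement(runs)
needs_review = share < 0.9  # assumed threshold for routing to a human reviewer
```

Here four of the five runs agree, so the answer would be flagged for human review under the assumed threshold.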

What’s Missing: Validation and Consistency

The core issue isn’t that LLMs can’t find insights; it’s that they lack the validation, normalization, and audit trails needed for consistent, reliable outputs. Problems like OCR errors, numeric miscalculations, or hallucinated values are common. Footnotes and disclosures often get lost, which can materially change results. Without schema enforcement, provenance tracking, or rules-based checks, unstructured data becomes a liability rather than an asset. For investors and corporations alike, this creates inefficiencies, wasted resources, and growing mistrust of AI-driven analysis.
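To make schema enforcement, provenance tracking, and rules-based checks concrete, here is a minimal sketch in Python. It is not how any particular product works: the field names, the schema, the rules, and the sample file name are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    name: str      # field name, e.g. "revenue"
    value: object  # raw extracted value
    source: str    # provenance: document the value came from
    page: int      # provenance: page number in that document

# Assumed schema: field name -> expected Python type.
SCHEMA = {"revenue": float, "net_income": float, "fiscal_year": int}

def check(dp):
    """Return a list of rule violations for one datapoint (empty if clean)."""
    problems = []
    expected = SCHEMA.get(dp.name)
    if expected is None:
        problems.append("unknown field")
    elif not isinstance(dp.value, expected):
        problems.append(
            f"expected {expected.__name__}, got {type(dp.value).__name__}"
        )
    if not dp.source or dp.page < 1:
        problems.append("missing provenance")
    return problems

def audit(datapoints):
    """Build an audit trail: one record per datapoint with source and violations."""
    return [
        {"name": dp.name, "source": dp.source, "page": dp.page,
         "problems": check(dp)}
        for dp in datapoints
    ]

# Hypothetical extraction results, including an OCR error and a lost source.
extracted = [
    DataPoint("revenue", 1250.0, "annual_report.pdf", 12),
    DataPoint("fiscal_year", "2O23", "annual_report.pdf", 1),  # letter O, not zero
    DataPoint("net_income", 310.5, "", 0),
]
trail = audit(extracted)
```

In this sketch the first datapoint passes, the second fails the type rule because the OCR error turned the year into a string, and the third is flagged because nothing records where it came from.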

How Businesses Are Losing Money Exponentially

Unstructured data isn’t going away. In fact, it’s multiplying. Businesses that fail to address its complexity risk making decisions on shaky ground while wasting millions on bad data every year. Gartner estimates that poor data quality costs companies $12.9 million per year on average, and a widely cited IBM estimate puts the cost to the U.S. economy at $3.1 trillion annually. Meanwhile, investors who effectively leverage validated data strategies are seeing up to 20% annual outperformance, and companies that put unstructured data to work are 23x more likely to acquire customers and 7x more likely to retain them. The gap between leaders and laggards is widening.

Using Technology to Turn Chaos into Clarity

This problem is why we built PhyTech. Our end-to-end data solution is purpose-built to address the inefficiencies of unstructured data, letting businesses collect and use it while ensuring quality. Trained to ingest data from any file, report, or even personal notes, PhyTech autonomously refines unstructured data into clean, reliable datasets. Every datapoint undergoes 50+ rules-based quality checks for accuracy and is then normalized and structured for immediate use. Whether plugged directly into Physis’s chatbot, ImpactChat, for analysis and reporting, or integrated into existing client systems, PhyTech transforms messy, unreliable information into actionable intelligence, cutting costs, saving time, and unlocking real competitive advantage.

Learn more about PhyTech.
