The Data Infrastructure Nobody Wants to Build
Everyone wants to talk to their data. That's why there's a new "AI Data Analyst" launching every week. The pitch is always the same: connect your database, ask questions in English, get answers instantly. It's ChatGPT for your data. What could go wrong?
Everything, it turns out.
But here's the thing: the failures are predictable. They follow a pattern. And once you see the pattern, you can't unsee it.

The Problem That Seemed Solved
The promise was so compelling. Natural language would finally conquer SQL. You wouldn't need data engineers anymore. Business users could get their own answers. The demos looked perfect.
Then companies tried it in production.
"What's our customer churn rate?" Simple question. The AI writes flawless SQL to count customers. But the answer is wrong. Not because the SQL is wrong, but because the AI doesn't know that your company defines churn differently for enterprise accounts versus self-serve. It doesn't know you exclude seasonal buyers. It doesn't know that finance measures from contract date while product measures from last activity.
This knowledge doesn't exist in your database schema. It lives in your Head of Customer Success's brain, sits scattered across Google Docs, and hides in Excel formulas your CFO guards like state secrets.
The AI can read your tables. It can't read your company's mind.
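To make the gap concrete, here's a minimal sketch of what writing that knowledge down might look like. Everything in it is hypothetical (the segment names, fields, and rules are invented for illustration), but encoding definitions this way is exactly the work the plug-and-play pitch skips:

```python
from dataclasses import dataclass, field

# Hypothetical encoding of company-specific churn rules. None of this
# lives in the database schema; it lives in people's heads until
# someone writes it down in an executable form.

@dataclass
class ChurnDefinition:
    segment: str            # which customers this rule applies to
    measured_from: str      # the date field that anchors the calculation
    inactive_days: int      # how long before a customer counts as churned
    exclusions: list = field(default_factory=list)

CHURN_RULES = [
    ChurnDefinition(
        segment="enterprise",
        measured_from="contract_end_date",   # finance measures from contract
        inactive_days=90,
    ),
    ChurnDefinition(
        segment="self_serve",
        measured_from="last_activity_date",  # product measures from activity
        inactive_days=30,
        exclusions=["seasonal_buyer"],       # seasonal buyers don't count
    ),
]
```

An AI reading the schema alone cannot recover any of these rules. Someone has to write them down.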
What Your Data Actually Contains
Think of your database as containing three layers of information, but only one is actually stored there:
- Layer 1 - The Facts: Customer 12345 made a $99 purchase on January 3rd. This is what's in your database.
- Layer 2 - The Meaning: That was a renewal, not a new purchase. It counts toward net revenue retention. The $99 reflects a loyalty discount. This is what your team knows.
- Layer 3 - The Implications: This customer fits the expansion cohort pattern. They need the enterprise sequence, not standard renewal. This is what drives action.
Current AI tools read Layer 1 perfectly. They guess at Layer 2 and usually guess wrong. They're oblivious to Layer 3.
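In code, the three layers might look like this. A sketch with invented field names; the point is that only the first dictionary is actually stored anywhere:

```python
# Layer 1 - the facts: the only layer your database stores.
fact = {"customer_id": 12345, "amount": 99.00, "date": "2024-01-03"}

# Layer 2 - the meaning: supplied by business rules, not the schema.
meaning = {
    "transaction_type": "renewal",    # not a new purchase
    "counts_toward_nrr": True,        # net revenue retention
    "discount": "loyalty",            # why it's $99, not list price
}

# Layer 3 - the implications: derived from the first two, drives action.
implication = {
    "cohort": "expansion",
    "next_step": "enterprise_sequence",  # not the standard renewal
}
```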
This is the gap nobody wants to acknowledge. Because acknowledging it means accepting that the real work can't be skipped.
The Physics Problem Nobody Mentions
Here's something vendors won't tell you: every question has a computational cost, and you can either pay it once or pay it forever.
Take a simple metric: weekly active users. The AI can count unique users from the last seven days. Easy. But that means scanning millions of rows every time someone asks. Every. Single. Time.
Your CEO checks the dashboard five times a day? That's five full table scans. Your board meeting has twenty people looking at metrics? Twenty scans. Same computation, repeated endlessly.
The alternative is pre-computing these metrics once and storing the results. But that requires infrastructure: orchestration, scheduling, storage, invalidation logic. Real engineering work.
Guess which approach every "plug-and-play" solution uses? The wasteful one. Because the efficient one requires admitting that infrastructure matters.
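Here's a minimal sketch of the efficient approach, using SQLite so it runs anywhere. The table names and schema are assumptions; what matters is that the expensive scan happens once per refresh, and every dashboard load becomes a single-row lookup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_date TEXT);
    CREATE TABLE wau_rollup (as_of_date TEXT PRIMARY KEY, wau INTEGER);
""")

def refresh_wau(as_of_date: str) -> None:
    """Run once per day by a scheduler: one scan, stored result."""
    conn.execute(
        """
        INSERT OR REPLACE INTO wau_rollup (as_of_date, wau)
        SELECT ?, COUNT(DISTINCT user_id)
        FROM events
        WHERE event_date > date(?, '-7 days') AND event_date <= ?
        """,
        (as_of_date, as_of_date, as_of_date),
    )

def get_wau(as_of_date: str) -> int:
    """What the dashboard calls: a key lookup, no table scan."""
    row = conn.execute(
        "SELECT wau FROM wau_rollup WHERE as_of_date = ?", (as_of_date,)
    ).fetchone()
    return row[0] if row else 0

refresh_wau("2024-06-01")
print(get_wau("2024-06-01"))  # the CEO's five checks now hit the rollup
```

The hard parts this sketch waves away (scheduling the refresh, invalidating stale rollups, backfilling history) are exactly the infrastructure work described above.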
Why Business Questions Aren't Queries
"Why did revenue drop?" isn't asking for a number. It's launching an investigation.
Real analysis follows a workflow:
- Define the drop (versus what baseline?)
- Segment the problem (which products? regions? segments?)
- Find anomalies (what changed?)
- Test hypotheses (seasonality? competition? our changes?)
- Validate findings (statistical significance? confounding factors?)
This is detective work, not database queries. It requires maintaining context, building on previous findings, remembering what you've already eliminated.
Current AI analysts treat every question as isolated. They can't build investigative threads. They're like a detective with amnesia, starting fresh with every clue.
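Here's a sketch of what an investigative thread could look like instead: a shared context object that each step reads from and writes to, so later steps build on earlier findings rather than starting fresh. All the names and numbers are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Investigation:
    """Shared memory for a multi-step analysis: the detective's notebook."""
    question: str
    findings: dict = field(default_factory=dict)
    eliminated: list = field(default_factory=list)  # hypotheses ruled out

def define_drop(inv: Investigation) -> None:
    # Step 1: pin down the baseline before anything else.
    inv.findings["baseline"] = "trailing 4-week average"
    inv.findings["drop_pct"] = 12.0

def segment_problem(inv: Investigation) -> None:
    # Step 2: builds on step 1's output instead of re-deriving it.
    if inv.findings["drop_pct"] > 10:
        inv.findings["worst_segment"] = "self_serve / EMEA"

def test_hypotheses(inv: Investigation) -> None:
    # Step 3: records what has been ruled out so no step repeats it.
    inv.eliminated.append("seasonality")

inv = Investigation(question="Why did revenue drop?")
for step in (define_drop, segment_problem, test_hypotheses):
    step(inv)

print(inv.findings, inv.eliminated)
```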
The Infrastructure Everyone Actually Needs
Here's what works in production:
The Semantic Layer: Where you encode what data means. Customer lifetime value isn't a query, it's a formula with assumptions. This layer makes those assumptions explicit and executable.
The Computation Layer: Pre-computed aggregations for common questions. Materialized views for complex joins. Incremental processing so you're not recalculating history nightly.
The Orchestration Layer: Workflows that mirror how analysis actually happens. Not single queries but multi-step processes that build toward answers.
The Validation Layer: Sanity checks, reconciliation, lineage tracking. Because wrong numbers that look right are worse than obvious errors.
The Access Layer: Yes, natural language, but built on top of everything else, not instead of it.
Skip any layer and the system fails in predictable ways. No semantic layer? Inconsistent definitions. No pre-computation? Unusable performance. No orchestration? Can't answer real questions. No validation? Silent failures.
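The validation layer is the easiest to make concrete. A minimal sketch, with an assumed tolerance: reconcile the pre-computed number against a fresh recomputation and fail loudly when they diverge:

```python
def reconcile(precomputed: float, recomputed: float,
              tolerance: float = 0.005) -> None:
    """Raise when the stored rollup drifts from the source of truth.

    Silent drift is the 'wrong number that looks right' failure mode;
    a loud exception is the obvious error you actually want instead.
    """
    if recomputed == 0:
        if precomputed != 0:
            raise ValueError("rollup is nonzero but source is empty")
        return
    drift = abs(precomputed - recomputed) / abs(recomputed)
    if drift > tolerance:
        raise ValueError(
            f"rollup drifted {drift:.1%} from source "
            f"(tolerance {tolerance:.1%})"
        )

# Run on a schedule against a sample, not on every dashboard load.
reconcile(precomputed=10_120, recomputed=10_150)  # ~0.3% drift: passes
```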

The Uncomfortable Truth
The work is irreducible. You can't automate away the need to understand your business. You can't skip encoding that understanding into systems. You can't pretend that complex questions have simple answers.
This isn't what people want to hear. They want the magic solution, the plug-and-play miracle. They want to skip the months of semantic modeling, the careful orchestration design, the tedious validation rules.
But here's what I've learned from watching hundreds of data projects: the companies that accept this truth and do the work get massive advantages. Their data actually works. Their decisions are actually informed. Their AI assistants actually assist.
The companies that keep looking for shortcuts keep buying new tools, keep running POCs, keep wondering why nothing quite works.
The Real Innovation Opportunity
The breakthrough isn't better AI. GPT-5 won't solve this. Claude won't figure out your business logic by reading your database.
The breakthrough is making the irreducible work more tractable. It's building semantic layers that can be shared across similar companies. It's learning patterns from hundreds of implementations. It's encoding business logic in ways that are both human-maintainable and machine-executable.
Most importantly, it's accepting that infrastructure is the product. The semantic layer isn't overhead, it's the core value. The pre-computation isn't optimization, it's what makes the system usable. The validation isn't paranoia, it's what makes the system trustworthy.
What This Means
If you're evaluating AI data tools, ask these questions:
- Where does business logic live? If the answer is "the AI figures it out," run.
- How are common metrics pre-computed? If they aren't, prepare for terrible performance.
- How do multi-step investigations work? If they don't, you can't answer real questions.
- How is accuracy validated? If it isn't, you'll get convincing nonsense.
If you're building data infrastructure, accept these truths:
- The semantic layer is mandatory, not optional
- Pre-computation is physics, not preference
- Workflows are the unit of analysis, not queries
- Validation is essential, not nice-to-have
The market will eventually learn these lessons. The question is whether you learn them now, while they're still competitive advantages, or later, after you've wasted years on solutions that can't work.
The semantic layer is the product. The AI is just the interface.
That's what nobody wants to build. It's also what everybody needs.