AI Token Cost and Accuracy Benchmark
Evaluating Strategy Mosaic Against Direct Text-to-SQL Execution
Benchmark Evidence on Query Accuracy, Coverage, and AI Token Cost Reduction
The Brief
Strategy Mosaic improved AI analytics accuracy and reduced costs. In testing, Mosaic answered all analytics questions correctly, while direct text-to-SQL approaches produced incorrect results on complex queries.
The semantic layer gives AI the business context it needs. By defining metrics, relationships, and business logic once, Mosaic helps ensure AI, dashboards, and users all work from the same trusted data definitions.
Organizations can scale AI more efficiently. Mosaic reduced AI token usage by up to 50%, while Strategy AI Agents reduced it by up to 98%, helping lower costs and improve performance as AI adoption grows.
Executive Summary
As organizations deploy AI agents to query enterprise data, the accuracy and cost of those queries depend on one thing: what context the AI receives. AI doesn't know your business. It doesn't know what your metrics mean, how your data is structured, or which rules apply, so it guesses. The answer looks right but often isn't. Strategy Software's Mosaic is the context layer that fixes this, the semantic layer that gives AI the business context it needs to be reliable, governed, and effective.
Mosaic was benchmarked against direct PostgreSQL querying across 17 insurance claims analytics questions on a real-world insurance data model with 28 tables. The findings are clear on both dimensions:
Accuracy: Mosaic answered all 17 questions correctly. PostgreSQL answered 15. The 2 failures were not random, they were structurally predictable wrong answers on complex multi-table queries, returning financially inflated numbers that looked valid but were not.
Cost: Mosaic significantly reduces AI token consumption and per-query costs. The same semantic model that prevents wrong answers also reduces what organizations pay to run AI analytics at scale.
Incorrect Results from Direct SQL
Query | Direct SQL Returned | Correct Answer | Error |
|---|---|---|---|
Policy 336: coverage limit | $40,688,000 | $5,086,000 | 8× inflated |
Policy 336: catastrophe payout | $23,000 | $13,600 | 5× inflated |
Policy 337: catastrophe payout | $9,000 | $0 | 100% misattributed from Policy 336 |
Each query executed successfully and returned a number. Only Mosaic returned correct answers in all three cases.
The sections that follow explain the root causes of both failure modes, the benchmark methodology, and the architectural reasons Mosaic addresses them. For organizations evaluating AI analytics infrastructure, the evidence points to one conclusion: the semantic layer is what makes AI reliable on enterprise data.
1. Token Cost and Wrong Answers in AI-Driven Analytics
The dominant pattern for AI analytics today is text-to-SQL: the AI model receives raw schema information, generates a SQL query, and executes it against the database. Sequeda et al. (2023) establish text-to-SQL as the standard deployment baseline, and it is widely used in production environments today. It has two distinct failure modes that become more costly at scale.
1.1 Token Cost
Every text-to-SQL query requires the AI model to first receive the full schema: table names, column names, data types, and foreign key relationships. For a 28-table schema this adds approximately 4,428 tokens of schema context per query. For production enterprise schemas with 100–500+ tables, this grows to 100,000+ tokens per query.
At current LLM pricing of $3.00 per million input tokens, a ten-person analytics team running 200 queries per day against a 200-table schema incurs over $15,000/month in schema context overhead alone, before any SQL query is executed against the data. This cost is paid fresh on every query; schema context is not cached across stateless LLM calls.
1.2 Wrong Answers from Complex Queries
When an AI model generates SQL against a raw schema, it must independently reason about join paths, bridge tables, and aggregation grain. That business logic is entirely absent from the schema definitions. For simple single-table queries this works. For queries spanning multiple fact tables, the model frequently selects the wrong join path and returns numbers that are materially wrong but structurally valid. There is no error, no warning, and no way for the user to know the answer is wrong without already knowing the correct answer.
In this benchmark, two multi-fact-table questions returned results from PostgreSQL that were 5–10× inflated or attributed to the wrong policy. Neither failure triggered a SQL error. Both results would have passed routine review by an analyst who did not already know the correct answer.
1.3 Join Complexity and Business Logic
Analytical schemas in real enterprises are layered: star schemas, snowflake schemas, ACORD insurance models, Kimball dimensional designs, all involving multiple fact tables joined through carefully chosen bridge tables. The correct join path is not derivable from column names alone; it is encoded in business rules.
Text-to-SQL generation requires the AI to infer those business rules from table names on every query. The wider the schema, the more plausible join paths exist, and the more likely the AI selects the wrong one. The semantic layer encodes the correct path once and applies it consistently.
2. How the Semantic Layer Addresses These Failures
A semantic layer sits between the AI and the database. Rather than exposing raw table structures, it presents business-oriented concepts such as "total premium," "claim payout," and "loss ratio," backed by validated join logic. Mosaic's implementation provides three architectural protections against the failure modes described above.
2.1 Fan-Out Protection
Fan-out, in plain terms, is what happens when a join multiplies rows before an aggregation gets to sum them. A 5-year policy generates 5 rows in policy_coverage_detail. When the AI joins that table to a claims table through a shared key, each claim row matches all 5 coverage rows, and SUM() counts every dollar five times. The query returns a valid number. The number is wrong by a factor of 5.
Mosaic isolates fact tables at the semantic model level. When a query requires both coverage limits and claim payments, Mosaic computes each aggregate independently, then joins the pre-aggregated results at the policy grain. The multiplicative error never has the opportunity to form.
2.2 Canonical Bridge Routing
A canonical path is the single authoritative route through a schema to connect two entities. In the benchmark dataset, claims can be connected to policies via two paths: through insurable_object_identifier (shorter, but ambiguous) or through the claim_coverage bridge table (canonical, which enforces correct policy-year attribution). Without a semantic model, an AI will frequently take the shorter path because it looks correct from the column names alone. That path produces plausible results for simple scenarios and systematically wrong results when multiple policy years or shared insurable objects are involved.
Mosaic encodes the canonical path once in the semantic model. Every subsequent query, regardless of analyst, AI model, or interface, follows the same validated route.
2.3 Compressed Semantic Context
Rather than sending the full raw schema, Mosaic exposes a semantic API (get_semantics) that returns business-friendly field names, pre-computed relationships, and filtered context relevant to the question. This compresses schema context from 4,428 tokens to 2,073 tokens, a 53% reduction, while simultaneously making that context more precise and less prone to misinterpretation.
The SQL generated against this context is correspondingly simpler. Analysts write GROUP BY queries instead of multi-table JOIN chains. Average SQL token count drops from 174 to 132 per query, a 24% reduction. Both effects compound directly into per-query cost.
2.4 Simple vs. Complex Queries: Why the Gap Widens
For straightforward questions, the performance difference between Mosaic and raw SQL is primarily economic: Mosaic is 37% cheaper. For multi-fact-table questions, the difference is categorical: Mosaic returns the correct answer; PostgreSQL does not. This distinction matters more over time. Enterprise data models are growing in complexity as more data sources, stricter governance requirements, expanding AI initiatives, and distributed analytics all push schemas toward the multi-fact-table patterns where raw SQL fails.
Organizations often pilot AI analytics on simple questions and see comparable results from both approaches. They then deploy at scale, and the complex, high-stakes queries begin to fail.
Query Type | Example | PostgreSQL | Mosaic |
|---|---|---|---|
Simple — single table | Premium per policy | Correct | Correct, 37% cheaper |
Moderate — 2-table join with clear FK | Claims by location | Correct | Correct, 37% cheaper |
Complex — multi-fact-table aggregation | Coverage limit + total claims per policy | Wrong (8–10× fan-out inflation) | Correct, 37% cheaper |
Complex — bridge-table attribution | Catastrophe payout by policy | Wrong (5× inflation + misattribution) | Correct, 37% cheaper |
3. Benchmark Methodology
All results in this paper derive from live execution against real data. No metrics are theoretical or estimated; where direct measurement was not possible (due to MCP protocol limitations on server-side timing), data points are clearly labeled with their derivation method.
3.1 Dataset
The dataset follows an ACORD-aligned (U.S. insurance industry standard) insurance data model with 28 tables capturing policies, coverage details, claims, financial amounts, and party relationships. The schema includes the multi-row-per-policy patterns created by multi-year policy terms, exactly the structural condition that triggers fan-out errors in naive SQL.
- Source: Chat-With-Your-Data benchmark (data.world/cwd-benchmark-data)
- Semantic model: Mosaic Library (linked in original benchmark)
3.2 Test Scope
- 17 questions spanning single-table lookups, multi-table aggregations, ratio calculations, exposure analysis, and cross-domain attribution
- Live execution against both PostgreSQL and Mosaic MCP connections, with no synthetic or replayed data
- SQL generated by Claude Code MCP for both systems, ensuring equivalent AI generation quality
- Token counts measured from actual API responses (schema context) and query character lengths (÷4, standard approximation)
- PostgreSQL execution times measured via EXPLAIN ANALYZE: 0.2ms–6.7ms, negligible relative to LLM inference time of ~13–14s for both tools
3.3 Mosaic Configuration
Three one-time configurations were required to encode business intent not inferable from column names:
Configuration | Problem | Business Impact |
|---|---|---|
Party ID unification | person_identifier and organization_identifier map to the same semantic key as agreement_party_role.party_identifier, but names differ and no FK exists | Accurate party-to-policy traversal; without this, party queries misjoin or return empty results |
Case + type normalization | party_identifier vs Party_Identifier (case mismatch); bigint vs integer (type mismatch) prevent auto-join | Consistent join routing across party tables without runtime cast errors |
Policy number parsing | agreement_name holds 'Policy 31003000336'; policy.policy_number holds '31003000336'. Direct equality join fails due to prefix. | Premium-to-claim analysis at policy level. Computed field strips prefix via SPLIT_PART(agreement_name, ' ', 2). |
These configurations are one-time setup. Once encoded, they apply automatically to every subsequent query, regardless of user, interface, or AI model.
4. Accuracy Results: Where Semantics Prevent Failure
Of 17 questions, both systems answered 15 correctly. On the remaining 2, PostgreSQL returned plausible but materially wrong numbers; Mosaic returned correct answers in both cases. These failures are not isolated to this dataset. Any query that joins two fact tables through a shared dimension key, without semantic isolation, will produce the same inflation. They are predictable structural failures of naive multi-fact-table SQL, and they would affect every analyst querying this data without a semantic model.
Metric | Mosaic | Without Mosaic | Difference |
|---|---|---|---|
Correct answers (17 questions) | 17 / 17 = 100% | 15 / 17 = 88.2% | −11.8 % |
First-attempt success rate | 100% | 94.1% | −5.9 % |
Wrong answer rate | 0% | 11.8% | +11.8 % |
4.1 Question 16: Coverage Limits and Total Claims per Policy
Question: "What is the total coverage limit and total claims per policy?"
Policy 31003000336 | Result vs Correct | |
|---|---|---|
PostgreSQL | Limit: $40,688,000 Claims: $136,000 | 8–10× inflated (fan-out) |
Mosaic | Limit: $5,086,000 Claims: $13,600 | Correct |
Root Cause
policy_coverage_detail contains one row per policy per coverage year. A 5-year policy produces 5 PCD rows. When the AI joins policy_amount (fact table) to claim_amount (fact table) through this shared key, each claim row joins to all 5 coverage year rows, multiplying claim amounts by 5 before SUM() aggregates. The query shape (one output row per policy) looks correct. The inflation is invisible without prior knowledge of the correct answer.
Why Mosaic Is Correct
The semantic model isolates fact tables. Coverage and claim metrics are computed via independent canonical paths and combined only as pre-aggregated results at the policy grain. The multiplicative error cannot form.
4.2 Question 17: Total Catastrophe Payout per Policy
Question: "What is the total catastrophe payout per policy?"
Policy 31003000336 | Policy 31003000337 | |
|---|---|---|
PostgreSQL | $23,000 (5× inflated) | $9,000 (misattributed from Policy 336) |
Mosaic | $13,600 (correct) | Not returned (correct — no catastrophe payouts) |
Root Causes (Two Compounding Errors)
Fan-out: insurable_object_identifier appears across multiple PCD rows, multiplying claim_amount rows before aggregation.
Bridge misselection: Connecting claims to policies via insurable_object_identifier bypasses the claim_coverage bridge, which contains the authoritative policy-year linkage. Claims belonging to Policy 336 leak into Policy 337 because both policies share an insurable object.
Why Mosaic Is Correct
The semantic model enforces the path: claim → claim_coverage → policy_coverage_detail → policy. This is the only valid attribution path. Both overcounting and cross-policy leakage are structurally prevented.
5. Token Economics
Without business context, AI has to rediscover how your data works on every query. It loads the schema, reasons about relationships, and generates SQL from scratch each time. That burns tokens before any actual analysis begins. Mosaic addresses this directly because business logic is pre-defined in the semantic layer, AI receives only what it needs.
5.1 Mosaic MCP (Semantic Layer via Model Context Protocol)
Mosaic MCP gives the LLM the business context it needs, not the full raw schema. Mosaic provides the governed definitions, relationships, and join logic so the SQL is accurate and precise. The LLM produces the query, but Mosaic defines what that query should look like, abstracting JOIN complexity into simple GROUP BY queries.
Token Component | Mosaic MCP | PostgreSQL | Reduction |
|---|---|---|---|
Schema context / query | 2,073 tokens | 4,428 tokens | 53% |
SQL tokens (avg) | 132 tokens | 174 tokens | 24% |
Total input tokens / query | ~2,405 tokens | ~4,802 tokens | 50% |
Cost / query | $0.01246 | $0.01966 | 37% |
Total cost — 17 questions | $0.2119 | $0.3342 | $0.12 saved |
Note: Cost calculated at $3.00/M input tokens + $15.00/M output tokens. Schema context measured from live get_semantics API responses and DDL character counts ÷4.
5.2 Strategy AI Agents
Strategy AI Agents extend Mosaic's semantic layer into pre-built automations. They are purpose-built agents that answer business questions by routing directly through Mosaic's governed business context, with near-zero token overhead. Rather than optimizing what the LLM receives, they remove the LLM from query generation entirelym Mosaic handles schema lookup, SQL generation, and execution, and the LLM receives only the natural language question.
Token Component | Strategy AI Agents | Mosaic MCP | PostgreSQL |
|---|---|---|---|
Schema context | ~0 tokens | 2,073 | 4,428 |
SQL tokens | ~0 tokens | 132 | 174 |
Question text only | ~50–100 tokens | 200 | 200 |
Total / query | ~50–100 tokens | ~2,405 | ~4,802 |
vs PostgreSQL baseline | ~98% reduction | 50% reduction | Baseline |
Strategy AI Agents deliver approximately 98% token reduction vs direct PostgreSQL and 96% reduction vs Mosaic MCP. AI token consumption is virtually zero; cost is deflected from the LLM provider to the Mosaic software platform. This architecture scales linearly: adding users, questions, or larger schemas does not increase token costs, because no schema context ever touches the LLM.
6. Where Savings Compound in Production
The benchmark numbers above are conservative by design. They were measured against a 28-table demo dataset, without caching, without multi-turn conversation history, and at only 17 questions. Each constraint compresses the measured savings relative to production reality.
Factor | Benchmark Conditions | Production Reality |
|---|---|---|
Schema size | 28 tables, ~7K chars | 100–500+ tables → 100K+ chars; PostgreSQL overhead grows linearly with table count |
Mosaic caching | Disabled (fresh queries) | Enabled → schema tokens drop to $0.00 on repeat queries within session |
Query volume | 17 questions, one session | Hundreds per analyst per day; 10-person team = thousands of daily queries |
Conversation history | Not included | Multi-turn sessions re-send full schema + history each turn (PostgreSQL); Mosaic optimizes context |
6.1 Production Estimates
With Mosaic MCP: Conservative 50–70% cost reduction in production, rising toward the upper bound as schema size increases and caching compounds across sessions.
With Strategy AI Agents: 95–98% cost reduction regardless of schema complexity, query volume, or team size. Schema size is irrelevant because no schema context ever reaches the LLM. This is a fundamentally schema-invariant cost architecture.
Production Scale Illustration
An enterprise with 200 tables, 20 analysts running 200 queries/day, at $3.00/M input tokens: PostgreSQL baseline is approximately $15,500/month in schema context overhead. Strategy AI Agents reduce this to under $450/month in question-text tokens. Annual saving: approximately $181,000, before accounting for reduced error correction and re-query costs.
7. Mosaic in the Broader Semantic Layer Landscape
Semantic layers as a category have been independently validated by multiple vendors and academic researchers. Recent benchmarks from other providers, run against the same ACME Insurance dataset used in this study, confirm that well-implemented semantic layers achieve near-100% accuracy for queries within their modeled scope, dramatically outperforming direct text-to-SQL approaches. This is the right conclusion: the category works.
But not all semantic layer architectures are equivalent. The distinctions that matter in production are coverage, cost, and what happens at the boundaries of the model.
7.1 The Coverage Gap: What Happens When the Semantic Layer Can't Answer
Some semantic layer implementations rely on a metric engine that enforces strict entity traversal rules. When a query requires too many joins, or joins across entity types that haven't been explicitly modeled, those implementations return an error rather than a wrong answer. This is far better than a wrong answer. But it is also not an answer.
Published benchmarks on this same dataset show that certain metric-layer architectures require new intermediate models before complex multi-hop queries can be answered at all. Mosaic's semantic model is designed to encode relationships between existing tables without requiring new intermediate models to be written, tested, and deployed. For organizations with complex, normalized schemas, this distinction has meaningful implications for the time between deployment and full query coverage.
Mosaic works with your schema as it exists
Most semantic layer implementations require new intermediate models before complex queries can be answered. Mosaic encodes relationships between existing tables through configuration with no restructuring, no new model development, no deployment cycle. In complex, normalized enterprise schemas, that difference is measured in months.
7.2 Token Costs: What Competitive Benchmarks Don't Measure
Published accuracy benchmarks from other semantic layer vendors focus exclusively on correctness. None have addressed token cost reduction at the level Mosaic's testing does, and none have published anything comparable to Strategy AI Agents' external agent architecture.
Dimension | Semantic Layer (industry) | Mosaic MCP | Strategy AI Agents |
|---|---|---|---|
Accuracy vs. raw SQL | Near-100% for covered queries | 100% (17/17) | 100% (17/17) |
Coverage of complex joins | Varies — may require new modeling | Full coverage via config | Full coverage via config |
Token cost vs. raw SQL | Not published | 37% reduction per query | ~98% reduction — near zero |
Schema-invariant cost model | No | Partial (caching helps) | Yes — schema never reaches LLM |
AI cost scales with user count | Yes | Yes (minimized) | No — question text only |
7.3 Strategy AI Agents: A Different Architectural Category
Strategy AI Agents represent a fundamentally different architectural approach. Rather than optimizing how much schema context is sent to the LLM, they eliminate schema context from the LLM loop entirely. The LLM receives only the natural language question, 50 to 100 tokens, and Mosaic's MCP server handles the rest: schema lookup, query planning, execution, and result formatting.
This architecture has three properties that no other published semantic layer approach has matched simultaneously:
- Schema-invariant cost: A 500-table enterprise schema costs exactly the same as a 28-table demo schema in LLM tokens: zero. Accuracy gains from the semantic model are preserved while token costs are eliminated.
- User-count-invariant cost: Adding 100 more analysts does not increase AI token consumption. Each analyst's question costs approximately 75 tokens regardless of how many people are asking.
- Zero wrong answers by construction: The LLM is never asked to write SQL. It cannot produce a wrong join because it never writes a join. The Mosaic MCP produces deterministic, semantically-validated queries every time.
8. ROI Framework
Use the following to calculate projected savings based on your organization's deployment parameters.
8.1 Mosaic MCP Savings
Input Variable | Value / Formula |
|---|---|
Monthly AI database queries (all users) | _______________ |
Avg tokens per query (PostgreSQL baseline) | ~4,800 tokens |
Mosaic MCP token reduction | 50% |
Monthly token savings | [queries] × 2,400 tokens × [AI provider $/M rate] |
Annual LLM cost savings | [monthly] x 12 |
Add: Analyst time saved correcting wrong queries | [# wrong queries/month] × [avg correction time] × [hourly cost] |
8.2 Strategy AI Agents Savings (Maximum Cost Deflection)
Input Variable | Value / Formula |
|---|---|
Monthly AI database queries (all users) | _______________ |
Strategy AI Agents token reduction | ~98% |
Monthly token savings | [queries] × 4,700 tokens × [AI provider $/M rate] |
Annual LLM cost savings | [monthly] x 12 |
Schema-size independence | 100-500+ table schemas cost the same as 28-table schemas: zero schema tokens |
User-count independence | AI token cost does not scale with number of users |
8.3 Choosing the Right Pathway
Profile | Recommended | Why |
|---|---|---|
Existing LLM integrations needing optimization | Mosaic MCP | 37% immediate savings, drop-in improvement |
High query volume (BI/data teams, 100+ queries/day) | Strategy AI Agents | Maximum token deflection at scale |
Complex enterprise schemas (100+ tables) | Strategy AI Agents | Schema complexity doesn't affect token costs |
New AI initiatives with budget constraints | Strategy AI Agents | Start at near-zero AI infrastructure cost |
Multi-tenant SaaS applications | Strategy AI Agents | AI cost does not scale with user count |
9. Conclusion
Connecting AI agents directly to raw databases is not merely expensive; it is structurally unreliable for the complex multi-table queries that drive enterprise analytics decisions.
Mosaic's semantic layer addresses both dimensions simultaneously. By encoding business logic (canonical join paths, fan-out protections, party-identity unification) once in the semantic model, every subsequent query inherits those guarantees regardless of which AI model, analyst, or interface issues it. The accuracy improvement is not a tuning artifact; it is a structural property of the architecture.
The cost savings are equally structural. Mosaic MCP reduces schema context overhead by 53% and SQL complexity by 24%, delivering 37% cost reduction in the benchmark and 50–70% at production scale with caching. Strategy AI Agents remove the AI from the query execution path entirely, achieving 95–98% cost reduction regardless of schema complexity, an architecture that makes AI query costs effectively schema-invariant and user-count-invariant.
For organizations evaluating AI analytics infrastructure, the question is no longer whether to use a semantic layer. The benchmark data demonstrates its necessity for accurate results on multi-fact-table queries.
Technical Proof Points
100% accuracy | 17/17 questions answered correctly across all complexity levels and query types |
50% token reduction | Mosaic MCP, measured in live testing against 28-table ACORD-aligned schema |
~98% token reduction | Strategy AI Agents: external agent architecture eliminates schema and SQL tokens from LLM context entirely |
0% wrong answers | vs 11.8% with direct PostgreSQL; both PostgreSQL failures involved 5–10× financial inflation or policy misattribution |
37% cost savings | Per query in benchmark (Mosaic MCP); rises to 50–70% at production scale with caching enabled |
Source: Insurance Claims Analytics Benchmark: 17 questions, live execution, measured token counts
Data model: ACORD-aligned insurance schema (data.world/cwd-benchmark-data)
SQL queries generated and executed via Claude Code MCP. Token counts measured from live API responses and query character lengths.
Original benchmark authored by Sushma Ganapuram, Strategy Software.