15-second responses vs. hours
safety & governance score
5-point lift in decision usefulness with GPT-5.4-mini upgrade
The challenge
The client is a global media SaaS company that helps large enterprise clients manage communications, reputation, and strategic decision-making. It analyzes hundreds of millions of sources daily, from public news, social, and broadcast to proprietary analyst-curated databases. Its competitive advantage is the layer on top of publicly available data: in-house human editorial teams, proprietary scoring and analytics frameworks, and years of analyst judgment refined into decision-grade intelligence. When a crisis signal is building or a competitor’s narrative is gaining traction, speed and accuracy matter enormously. Historically, getting an answer meant waiting for a human analyst to manually aggregate across those sources: a process measured in hours, not seconds.
The company’s AI team set out to make that synthesis conversational and instant. The hard part was encoding the institutional expertise that makes their output decision-grade and informs communications and strategic decisions that can run into tens or hundreds of millions of dollars.
The solution
Snorkel designed and built a multi-agent conversational intelligence system that orchestrates specialized agents across the company’s data sources, returning grounded, decision-ready answers in seconds. Snorkel also built a custom evaluation harness around the client team’s own institutional knowledge: what made an answer useful for decision-makers, what counted as properly grounded, where the process needed to be reliable, and which safety and governance boundaries mattered for their use cases.
With the harness in place, Snorkel could directly measure the impact of upgrading from GPT-4.1-mini to GPT-5.4-mini. The harness showed a 5-point lift in decision usefulness, a 100% pass rate on safety-critical refusal checks, and an improvement from 82.6% to 98.6% on broader governance checks for avoiding internal jargon and keeping system details out of responses. This provided a clear, data-backed case for upgrading to GPT-5.4-mini.
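As a rough illustration of how a model comparison like this can be scored, the sketch below computes per-category pass rates for two models and the difference between them. It is a minimal sketch under stated assumptions, not Snorkel's harness: the `CheckResult` record, the category names, the model labels, and the toy records are all hypothetical and do not reflect the client's data or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    model: str      # hypothetical model label, e.g. "baseline" or "candidate"
    category: str   # hypothetical check category, e.g. "safety_refusal"
    passed: bool    # whether the response passed this check


def pass_rates(results):
    """Return {model: {category: pass rate in [0, 1]}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passed, total]
    for r in results:
        bucket = counts[r.model][r.category]
        bucket[0] += int(r.passed)
        bucket[1] += 1
    return {
        model: {cat: p / t for cat, (p, t) in cats.items()}
        for model, cats in counts.items()
    }


def compare(results, baseline, candidate):
    """Return {category: candidate rate - baseline rate} for shared categories."""
    rates = pass_rates(results)
    base, cand = rates.get(baseline, {}), rates.get(candidate, {})
    return {cat: cand[cat] - base[cat] for cat in base if cat in cand}


# Toy records for illustration only; not the client's data.
toy = [
    CheckResult("baseline", "governance", True),
    CheckResult("baseline", "governance", False),
    CheckResult("candidate", "governance", True),
    CheckResult("candidate", "governance", True),
    CheckResult("baseline", "safety_refusal", True),
    CheckResult("candidate", "safety_refusal", True),
]
deltas = compare(toy, "baseline", "candidate")
```

Keeping the check results separate from the agent itself is what makes this kind of comparison cheap: the same checks can be rerun against any candidate model and the deltas read off per category.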
The outcome
The agent replaces a process that used to take hours and delivers answers in an average of 15 seconds, with safety scores high enough to clear enterprise launch requirements. As models continue to evolve, the eval-first foundation lets the client test, compare, and swap models without rebuilding the agent or losing the expert judgment that makes it trustworthy.