How Rox achieved 99% accuracy with Snorkel Evaluate

Impact

99%

Achieved accuracy with specialized evaluators

+24

Point improvement in shipped critical outbound email feature

The challenge

Rox’s ability to ensure outbound emails are fully accurate and aligned with each customer’s brand and objectives is a key differentiator. However, when developing its models, Rox found that off-the-shelf evaluation approaches could not deliver the quality required for critical custom evaluation tasks. Initially, Rox wrote its own LLM-as-a-judge. While the model seemed to score well, the Rox team wanted higher confidence before deploying to production.

The solution

Using the Snorkel Evaluation Suite, Rox scored the judge against human experts and found it aligned only around 75% of the time. The team used Snorkel to iterate on the judge and increase alignment. The aligned judge then surfaced an issue with the prototype outbound model, which used the wrong recipient name around 11% of the time, enabling Rox to correct the model’s behavior.
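To make the alignment figure concrete, here is a minimal sketch of how agreement between a judge and human experts could be measured. It is not Snorkel's API or Rox's implementation: the `LabeledEmail` type, the `alignment_rate` and `disagreements` helpers, the "pass"/"fail" labels, and the sample data are all hypothetical, and simple agreement rate is assumed as the metric.

```python
from dataclasses import dataclass


@dataclass
class LabeledEmail:
    """One evaluated email (hypothetical schema for illustration)."""
    email_id: str
    human_label: str  # expert verdict, e.g. "pass" or "fail"
    judge_label: str  # LLM-as-a-judge verdict on the same email


def alignment_rate(examples: list[LabeledEmail]) -> float:
    """Fraction of emails where the judge agrees with the human expert."""
    if not examples:
        raise ValueError("need at least one labeled example")
    matches = sum(1 for e in examples if e.human_label == e.judge_label)
    return matches / len(examples)


def disagreements(examples: list[LabeledEmail]) -> list[LabeledEmail]:
    """Emails to review when iterating on the judge's prompt or criteria."""
    return [e for e in examples if e.human_label != e.judge_label]


# Made-up sample: the judge agrees with the expert on 3 of 4 emails.
sample = [
    LabeledEmail("e1", "pass", "pass"),
    LabeledEmail("e2", "fail", "fail"),
    LabeledEmail("e3", "fail", "pass"),  # judge missed a problem the expert caught
    LabeledEmail("e4", "pass", "pass"),
]

print(f"alignment: {alignment_rate(sample):.0%}")
print("review:", [e.email_id for e in disagreements(sample)])
```

Reviewing the disagreements, rather than only the headline rate, is what lets a team see where the judge's criteria diverge from expert judgment and revise them.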

The outcome

Rox achieved 99%+ accuracy with specialized evaluators, giving the team sufficient trust to ship a critical outbound email feature.

Rox is redefining the revenue stack with our AI-powered sales platform. Off-the-shelf models aren’t capable of delivering the quality we need to ensure our agents are accurately personalizing outbound emails. With Snorkel Evaluate we have been able to confidently assess our outbound email agent, then identify and fix issues to achieve human-level accuracy. The level of visibility and control Snorkel delivers is a huge advantage as we build trustworthy, agentic AI at scale.

Shriram Sridharan, co-founder, Rox

Enterprises facing aggressive revenue targets without more headcount are turning to agentic AI innovator Rox. Rox is redefining the revenue stack with its AI-powered sales productivity platform, starting with the Rox sales agent swarm, which provides agents that can perform at the level of top sales reps.


