Manufacturing

Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control.

100 test cases available OTS

19 agent tools

Domain agentic intelligence index

We test models on private, non-contaminated tasks.
Here's what we found.

Composite pass^5 score (%)

Last updated: July 21

Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals.
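
The methodology states that confidence intervals come from bootstrap resampling with 1000 iterations. As a minimal sketch of how such an interval could be computed, the code below resamples per-task scores with replacement and takes percentiles of the resampled means. Resampling at the task level, the percentile method, and the name `bootstrap_ci` are assumptions made for illustration, not details confirmed by the page.

```python
import random


def bootstrap_ci(task_scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores.

    task_scores: one number per task (e.g. 1.0 if the task passed all runs).
    Hypothetical helper; the leaderboard's exact procedure is not published here.
    """
    if not task_scores:
        raise ValueError("task_scores must not be empty")
    rng = random.Random(seed)
    n = len(task_scores)
    means = []
    for _ in range(iterations):
        # Draw n tasks with replacement and record the mean of the resample.
        resample = [task_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * iterations)]
    upper = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper


if __name__ == "__main__":
    # Toy data: 30 of 100 tasks passed in all runs.
    scores = [1.0] * 30 + [0.0] * 70
    low, high = bootstrap_ci(scores)
    print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

With a fixed seed the interval is reproducible; changing `alpha` to 0.10 would give a 90% interval instead.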

Scaling curve

k = 1…5 runs

pass^k: consistency

% tasks passed in every one of k runs.
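
To make the metric concrete, here is a sketch of how pass^k and its counterpart pass@k (at least one of k runs succeeds) could be estimated when each task has n recorded runs. It uses the standard combinatorial estimators: the chance that k runs drawn without replacement from the n recorded runs all succeed, or that at least one does. Whether the leaderboard uses exactly these estimators is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k: fraction of tasks where k sampled runs ALL succeed.

    results: list of per-task lists of booleans, one entry per run.
    """
    total = 0.0
    for runs in results:
        n, c = len(runs), sum(runs)
        if k > n:
            raise ValueError("k cannot exceed the number of runs per task")
        # Probability that k runs drawn without replacement are all successes.
        total += comb(c, k) / comb(n, k)
    return total / len(results)


def pass_at_k(results, k):
    """Estimate pass@k: fraction of tasks where at least one of k runs succeeds."""
    total = 0.0
    for runs in results:
        n, c = len(runs), sum(runs)
        if k > n:
            raise ValueError("k cannot exceed the number of runs per task")
        # One minus the probability that all k drawn runs are failures.
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(results)


if __name__ == "__main__":
    # Toy data: three tasks, five runs each.
    results = [
        [True, True, True, True, True],
        [True, True, True, False, False],
        [False, False, False, False, False],
    ]
    for k in range(1, 6):
        print(k, round(pass_hat_k(results, k), 4), round(pass_at_k(results, k), 4))
```

With k equal to the number of runs, `pass_hat_k` reduces to the plain definition: the share of tasks that passed every run. The two metrics coincide at k = 1 and diverge as k grows, which is why pass^k is read as a consistency measure.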

Task difficulty distribution

Tasks bucketed by aggregate success rate

Buckets show difficulty tiers based on the aggregate of all models' results on the benchmarking subset.
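
As a rough illustration of this bucketing, the sketch below assigns each task to a tier by its aggregate success rate across models. The tier names and cut-offs (0.33 and 0.66) are hypothetical placeholders; the page does not state the actual thresholds.

```python
from collections import Counter

# Hypothetical tiers: (name, inclusive upper bound on aggregate success rate).
DEFAULT_TIERS = (("hard", 0.33), ("medium", 0.66), ("easy", 1.0))


def bucket_tasks(success_rates, tiers=DEFAULT_TIERS):
    """Count tasks per difficulty tier.

    success_rates: dict mapping task id -> aggregate success rate in [0, 1].
    """
    counts = Counter()
    for task_id, rate in success_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"success rate out of range for {task_id}: {rate}")
        # Pick the first tier whose upper bound covers this rate.
        for name, upper in tiers:
            if rate <= upper:
                counts[name] += 1
                break
    return counts


if __name__ == "__main__":
    rates = {"t1": 0.1, "t2": 0.5, "t3": 0.9, "t4": 1.0}
    print(bucket_tasks(rates))
```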

Example task

User Request

Correct Agent Solution

What Is Tested

Methodology

Built on Sierra's TAU-Bench. 17 tools, 9 JSON databases, verified by manufacturing subject matter experts. Golden trajectories are scored via database state hash comparison. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds. Confidence intervals are computed via bootstrap resampling (1000 iterations).
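
A minimal sketch of state hash comparison follows: the final contents of the JSON databases are serialized canonically and hashed, and a run passes when its hash matches the hash of the state produced by the golden trajectory. The canonicalization (sorted keys), the choice of SHA-256, and the names `state_hash` and `run_passes` are assumptions for illustration and may differ from the actual scoring code.

```python
import hashlib
import json


def state_hash(databases):
    """Hash a dict of JSON-serializable databases in a key-order-independent way."""
    # Sorted keys and fixed separators give one canonical string per state.
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_passes(agent_final_state, golden_final_state):
    """A run passes when the agent's final state hashes to the golden state's hash."""
    return state_hash(agent_final_state) == state_hash(golden_final_state)


if __name__ == "__main__":
    # Hypothetical database contents, invented for this example.
    golden = {"lots": {"L-1": {"qty": 5, "status": "released"}}}
    same_state = {"lots": {"L-1": {"status": "released", "qty": 5}}}
    wrong_state = {"lots": {"L-1": {"qty": 4, "status": "released"}}}
    print(run_passes(same_state, golden))
    print(run_passes(wrong_state, golden))
```

Comparing end states rather than tool-call sequences means an agent is not penalized for taking a different but equivalent route, as long as the databases end up identical.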

Trusted by Leading AI Teams

TAU manufacturing dataset available for purchase

License the TAU manufacturing RL Gym — 90 tasks, 19 tools, 9 databases with golden trajectories. Use for training, RLHF, or internal benchmarking.