Manufacturing
Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control.
90
Test cases
19
Agent tools
Domain agentic intelligence index
We test models on private, non-contaminated tasks.
Here's what we found.
Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals.
Scaling curve
k = 1…5 runs
pass^k — Consistency
Percentage of tasks passed in every one of k runs.
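As a rough illustration of how a pass^k curve for k = 1…5 might be computed from per-task run outcomes, here is a minimal sketch. It assumes the combinatorial estimator C(c, k) / C(n, k) described in the original TAU-Bench paper, where c is the number of successful runs out of n for a task; whether this page uses exactly that estimator is not stated. The function name and the sample data are hypothetical.

```python
from math import comb

def pass_hat_k(outcomes, k):
    """Estimate pass^k from per-task run outcomes.

    outcomes: list of per-task lists of booleans, one entry per run.
    Assumption: uses the C(c, k) / C(n, k) estimator, i.e. the chance
    that k runs drawn without replacement from the n recorded runs
    all succeeded, averaged over tasks.
    """
    total = 0.0
    for runs in outcomes:
        n = len(runs)
        c = sum(runs)  # number of successful runs for this task
        if k > n:
            raise ValueError("k cannot exceed the number of recorded runs")
        total += comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c
    return total / len(outcomes)

# Hypothetical outcomes for three tasks, five runs each.
sample = [
    [True, True, True, True, True],       # always passes
    [True, True, True, False, False],     # passes 3 of 5 runs
    [False, False, False, False, False],  # never passes
]

# pass^k for k = 1..5; the value can only stay flat or fall as k grows.
curve = [pass_hat_k(sample, k) for k in range(1, 6)]
```

With k equal to the number of recorded runs, the estimator reduces to the plain fraction of tasks that passed every run.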
Task difficulty distribution
Tasks bucketed by aggregate success rate
Buckets show difficulty tiers based on the aggregate results of all models.
100%: 2 of 90 tasks (2%)
75%+: 6 of 90 tasks (7%)
50%+: 12 of 90 tasks (13%)
25%+: 62 of 90 tasks (69%)
0%: 8 of 90 tasks (9%)
Example task
User Request
Correct Agent Solution
What Is Tested
Methodology
Built on Sierra's TAU-Bench. 19 tools and 9 JSON databases, verified by manufacturing subject-matter experts. Runs are scored against golden trajectories via database state hash comparison. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds. Confidence intervals are computed via bootstrap resampling (1,000 iterations).
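The scoring steps described here can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the benchmark's actual implementation: it assumes the final database state is hashed as canonical JSON with SHA-256, that confidence intervals use a percentile bootstrap over tasks, and all function names and sample records are hypothetical.

```python
import hashlib
import json
import random

def state_hash(db):
    """Hash a JSON-serializable database state (assumed canonical form)."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def run_succeeded(final_db, golden_db):
    """A run passes if its final state hash matches the golden state hash."""
    return state_hash(final_db) == state_hash(golden_db)

def pass_hat_k(outcomes):
    """pass^k with k = runs per task: fraction of tasks where all runs pass."""
    return sum(all(runs) for runs in outcomes) / len(outcomes)

def pass_at_k(outcomes):
    """pass@k with k = runs per task: fraction where at least one run passes."""
    return sum(any(runs) for runs in outcomes) / len(outcomes)

def bootstrap_ci(outcomes, metric, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling tasks with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        metric([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    lo_idx = int((1 - level) / 2 * iterations)
    hi_idx = int((1 + level) / 2 * iterations) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical golden and final states: key order does not affect the hash.
golden = {"lots": {"L-1": {"status": "released"}}, "orders": {}}
final = {"orders": {}, "lots": {"L-1": {"status": "released"}}}
matches = run_succeeded(final, golden)

# Hypothetical outcomes for four tasks, five runs each.
outcomes = [
    [True] * 5,
    [True, False, True, True, True],
    [False] * 5,
    [True] * 5,
]
low, high = bootstrap_ci(outcomes, pass_hat_k)
```

Comparing hashes of the final state rather than the sequence of tool calls means any trajectory that reaches the same end state counts as a pass, which is one plausible reason for this design.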
TAU manufacturing dataset available for purchase
License the TAU manufacturing RL Gym — 90 tasks, 19 tools, 9 databases with golden trajectories. Use for training, RLHF, or internal benchmarking.