Manufacturing

Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control.

Test cases: 91
Agent tools: 19

Domain agentic intelligence index

We test models on private, non-contaminated tasks.
Here's what we found.

Composite pass^5 score (%)
Last updated: April 6

Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals.
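
As a rough illustration of how such error bars can be produced, here is a minimal sketch of a percentile bootstrap over per-task scores. It assumes the 1000-iteration bootstrap resampling named in the methodology; the percentile method, the function name `bootstrap_ci`, the seed, and the sample data are illustrative assumptions, not the benchmark's actual code.

```python
import random

def bootstrap_ci(task_scores, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(task_scores)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(task_scores, k=n)) / n for _ in range(iterations)
    )
    # Cut the same number of resampled means from each tail.
    tail = round((1 - level) / 2 * iterations)
    return means[tail], means[iterations - 1 - tail]

# Hypothetical data: 1 if a task passed all 5 runs, else 0 (90 tasks).
scores = [1] * 20 + [0] * 70
low, high = bootstrap_ci(scores)
print(f"pass^5 = {sum(scores) / len(scores):.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

Resampling whole tasks rather than individual runs keeps the runs of one task together, which matters because pass^5 is defined per task.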

Scaling curve

K = 1…5 runs

pass^k — Consistency

Percentage of tasks passed in every one of k runs.
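
A minimal sketch of how a pass^k curve can be computed from per-task success counts. It assumes the combinatorial estimator from the τ-bench paper (the chance that k runs drawn without replacement from the n recorded runs all succeed, averaged over tasks); whether this page uses exactly that estimator is an assumption, and the function name and sample counts are hypothetical.

```python
from math import comb

def pass_hat_k(successes_per_task, n_runs, k):
    """Average over tasks of C(c, k) / C(n, k), where c is the task's success count."""
    if not 1 <= k <= n_runs:
        raise ValueError("k must be between 1 and n_runs")
    # comb(c, k) is 0 when c < k, so a task with too few successes contributes 0.
    per_task = [comb(c, k) / comb(n_runs, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical data: successes out of 5 runs for four tasks.
successes = [5, 4, 0, 2]
curve = {k: pass_hat_k(successes, n_runs=5, k=k) for k in range(1, 6)}
for k, value in curve.items():
    print(f"k={k}: {value:.1%}")
```

At k equal to the number of recorded runs, this reduces to the share of tasks that passed every run, which is why the curve can only fall as k grows.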

[Chart: pass^k (%) against k = 1 to 5, y-axis from 0% to 40%, with one line each for GPT-5.4, Sonnet 4.6, Gemini 3 Flash, Minimax, and Kimi K2.5.]

Task difficulty distribution

Tasks bucketed by aggregate success rate

Buckets show difficulty tiers based on the aggregate of all models' results.

100%: 2 of 90 tasks (2%)
75%+: 6 of 90 tasks (7%)
50%+: 12 of 90 tasks (13%)
25%+: 62 of 90 tasks (69%)
0%: 8 of 90 tasks (9%)
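
The tiers above can be sketched as a simple threshold lookup on each task's aggregate success rate. This is an illustrative sketch only: it assumes each task lands in the highest tier whose threshold it meets and that the lowest tier collects everything below 25%, which the page does not state. The task names and rates are hypothetical.

```python
from collections import Counter

# Tier labels from the page, paired with assumed lower bounds.
TIERS = [("100%", 1.00), ("75%+", 0.75), ("50%+", 0.50), ("25%+", 0.25)]

def difficulty_tier(success_rate):
    """Return the highest tier whose threshold the aggregate rate meets."""
    for label, threshold in TIERS:
        if success_rate >= threshold:
            return label
    return "0%"  # assumption: everything below 25% falls here

# Hypothetical aggregate success rates across all models and runs.
rates = {"task_a": 1.0, "task_b": 0.8, "task_c": 0.3, "task_d": 0.28, "task_e": 0.0}
histogram = Counter(difficulty_tier(rate) for rate in rates.values())
for label in ["100%", "75%+", "50%+", "25%+", "0%"]:
    print(f"{label}: {histogram[label]} of {len(rates)} tasks")
```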

Methodology

Built on Sierra's TAU-Bench. 17 tools, 9 JSON databases, verified by manufacturing subject matter experts. Golden trajectories are scored by comparing database state hashes. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds. Confidence intervals are computed via bootstrap resampling (1000 iterations).
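
To make the scoring step concrete, here is a minimal sketch of comparing a run's final database state against the golden trajectory's final state by hash. The canonical JSON serialization, the choice of SHA-256, and the database and field names are all assumptions for illustration; the page does not specify how the hash is computed.

```python
import hashlib
import json

def state_hash(databases):
    """Hash a dict of JSON databases using a canonical, key-sorted serialization."""
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def run_passes(final_state, golden_state):
    """A run passes only if its end state hashes identically to the golden end state."""
    return state_hash(final_state) == state_hash(golden_state)

# Hypothetical end states: the same records with keys in a different order.
golden = {"lots": {"LOT-1": {"status": "released", "qty": 40}}}
agent = {"lots": {"LOT-1": {"qty": 40, "status": "released"}}}
wrong = {"lots": {"LOT-1": {"qty": 38, "status": "released"}}}

print(run_passes(agent, golden))
print(run_passes(wrong, golden))
```

Sorting keys before hashing means two states that differ only in key order compare equal, while any changed value produces a different hash.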

TAU manufacturing dataset available for purchase

License the TAU manufacturing RL Gym — 90 tasks, 19 tools, 9 databases with golden trajectories. Use for training, RLHF, or internal benchmarking.