GPT-5.5
openai/gpt-5.5
draft_run_completed_three_grader_consensus
Benchmark v0.1 · leaderboard draft
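This page publishes the score schema, measured draft runs, demo runs, and a reproducible runner. The official main leaderboard uses only the 330 questions of the v0.1 core set; there is no verified official run yet, and draft runs must not be cited as official rankings.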
Score = 0.4*accuracy_overall + 0.3*source_hit_rate + 0.2*(1-hallucination_rate_l6) + 0.1*refusal_rate_l6_trap
L6 human audit gate
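The raw packet contains model answers and grader notes and remains review-required; the public page lists only summary statistics. Draft scores are not converted to a verified official run until 100% of L6 items have been human-reviewed and signed off.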
Summary artifact: tmp/lius-benchmark-runs/L6-human-audit-summary-2026-06-10.json. Gate: Do not convert draft scores to verified leaderboard entries until packet items are reviewed and signed off.
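The gate above can be read as a simple predicate over the audit summary. The sketch below is illustrative only: the schema of the summary JSON is not shown on this page, so the field names `l6ItemsTotal`, `l6ItemsReviewed`, and `signedOff` are hypothetical, and the example values are made up.

```javascript
// Minimal sketch of the L6 human audit gate.
// ASSUMPTION: the field names below are hypothetical; the real summary
// artifact may use a different schema.
function canPromoteToVerified(summary) {
  const { l6ItemsTotal, l6ItemsReviewed, signedOff } = summary;
  // Every L6 item must have been human-reviewed (100%, not a sample).
  const fullyReviewed = l6ItemsTotal > 0 && l6ItemsReviewed === l6ItemsTotal;
  // Review alone is not enough; an explicit sign-off is also required.
  return fullyReviewed && signedOff === true;
}

// Illustrative inputs, not real audit data.
console.log(canPromoteToVerified({ l6ItemsTotal: 10, l6ItemsReviewed: 10, signedOff: true }));  // true
console.log(canPromoteToVerified({ l6ItemsTotal: 10, l6ItemsReviewed: 9, signedOff: true }));   // false
console.log(canPromoteToVerified({ l6ItemsTotal: 10, l6ItemsReviewed: 10, signedOff: false })); // false
```

Keeping review completeness and sign-off as two separate conditions mirrors the wording of the gate, which requires both.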
| Rank | Model | Model ID | Status | Score | Accuracy | Source hit | Hallucination L6 | Refusal L6 trap | N |
|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 (OpenAI OAuth draft) | gpt-5.4 | UNVERIFIED_THREE_GRADER_CONSENSUS_DRAFT | 62.5 | 83% | 3.3% | 0% | 82.5% | 330 |
| 2 | GPT-5.5 (OpenAI OAuth draft) | gpt-5.5 | UNVERIFIED_THREE_GRADER_CONSENSUS_DRAFT | 62.4 | 84.6% | 4.9% | 0% | 71.3% | 330 |
| 3 | GPT-5.4-mini (OpenAI OAuth draft) | gpt-5.4-mini | UNVERIFIED_THREE_GRADER_CONSENSUS_DRAFT | 59.1 | 73.8% | 1.8% | 0% | 90% | 330 |
| 4 | GPT-5.3-codex-spark (OpenAI OAuth draft) | gpt-5.3-codex-spark | UNVERIFIED_THREE_GRADER_CONSENSUS_DRAFT | 53.0 | 61.2% | 5.2% | 5% | 80% | 330 |
The draft runs have completed the full 330 questions and a 3-grader median consensus; the official verified leaderboard still requires human review of L6.
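As a rough illustration of a 3-grader median consensus, the sketch below takes three per-item grades and keeps the middle one, so a single outlying grader cannot move the result. The grade scale (0 to 1 here) and the function names are assumptions; the page does not show how the actual grader script represents or combines grades.

```javascript
// Minimal sketch of a three-grader median consensus.
// ASSUMPTION: grades are plain numbers (for example 0, 0.5, 1); the real
// grader output format is not shown on this page.
function median(values) {
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

function consensusGrade(grades) {
  if (grades.length !== 3) {
    throw new Error("expected exactly three grader scores");
  }
  return median(grades);
}

// Illustrative grades, not real grader output.
console.log(consensusGrade([1, 0, 1]));   // 1
console.log(consensusGrade([0.5, 1, 0])); // 0.5
```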
| Rank | Model | Model ID | Status | Score | Accuracy | Source hit | Hallucination L6 | Refusal L6 trap | N |
|---|---|---|---|---|---|---|---|---|---|
| 1 | anthropic/claude-opus-4-7 | anthropic/claude-opus-4-7 | ILLUSTRATIVE_DEMO_ONLY_NOT_FOR_LEADERBOARD | 83.0 | 100% | 60% | 0% | 50% | 5 |
| 2 | Claude Opus 4.7 (Anthropic, knowledge cutoff 2026-01) | anthropic/claude-opus-4-7 | SPEC_EXAMPLE_ONLY | 71.7 | 70.3% | 51% | 10% | 78% | 330 |
Demo runs only illustrate the schema and display format; the official leaderboard requires the full 330 questions, a 3-grader consensus, and a 100% human audit of L6.
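The Score column can be approximately reproduced from the weighted formula near the top of the page. The sketch below assumes the rates are fractions in [0, 1] and that the weighted sum is multiplied by 100 for display; that scaling is inferred from the table values rather than stated on the page, and the function and field names are hypothetical.

```javascript
// Weights copied from the score formula on this page.
const WEIGHTS = { accuracy: 0.4, sourceHit: 0.3, noHallucination: 0.2, refusal: 0.1 };

// ASSUMPTION: inputs are fractions in [0, 1] and the displayed score is the
// weighted sum scaled to 0-100.
function compositeScore({ accuracyOverall, sourceHitRate, hallucinationRateL6, refusalRateL6Trap }) {
  const weighted =
    WEIGHTS.accuracy * accuracyOverall +
    WEIGHTS.sourceHit * sourceHitRate +
    WEIGHTS.noHallucination * (1 - hallucinationRateL6) + // lower hallucination scores higher
    WEIGHTS.refusal * refusalRateL6Trap;
  return 100 * weighted;
}

// 5-item demo row: 100% accuracy, 60% source hit, 0% hallucination, 50% refusal.
const demo = compositeScore({
  accuracyOverall: 1.0, sourceHitRate: 0.6, hallucinationRateL6: 0.0, refusalRateL6Trap: 0.5,
});
console.log(demo.toFixed(1)); // 83.0

// GPT-5.4-mini draft row, using the rounded percentages shown in the table.
const mini = compositeScore({
  accuracyOverall: 0.738, sourceHitRate: 0.018, hallucinationRateL6: 0.0, refusalRateL6Trap: 0.9,
});
console.log(mini.toFixed(1)); // 59.1
```

Because the tables show rounded percentages, recomputing other rows this way can differ from the published score in the last digit.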
| Model | Run status |
|---|---|
| openai/gpt-5.5 | draft_run_completed_three_grader_consensus |
| openai/gpt-5.4-mini | draft_run_completed_three_grader_consensus |
| openai/gpt-5.4 | draft_run_completed_three_grader_consensus |
| openai/gpt-5.3-codex-spark | draft_run_completed_three_grader_consensus |
| anthropic/claude-opus-4-7 | pending_run |
| google/gemini-2.5-pro | pending_run |
| lius-cc/Daoism-Qwen3.5-9B | pending_local_or_hf_run |
node scripts/benchmark/run-lius-benchmark.mjs --dry-run --limit 5

LLM_BENCHMARK_BASE_URL=http://127.0.0.1:8000/v1 \
LLM_BENCHMARK_API_KEY=dummy \
LLM_BENCHMARK_MODEL=Daoism-Qwen3.5-9B \
node scripts/benchmark/run-lius-benchmark.mjs --suite core-v0.1 --limit 330

node scripts/benchmark/grade-lius-benchmark.mjs \
  --input tmp/lius-benchmark-runs/<run>.jsonl \
  --dry-run

LLM_GRADER_BASE_URL=http://127.0.0.1:8001/v1 \
LLM_GRADER_API_KEY=dummy \
LLM_GRADER_MODEL=grader-model \
node scripts/benchmark/grade-lius-benchmark.mjs \
  --input tmp/lius-benchmark-runs/<run>.jsonl