鼎稔道學館

Benchmark v0.1 · leaderboard draft

Daoism LLM Leaderboard Draft

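This page publishes the score schema, measured draft runs, demo runs, and a reproducible runner. The official main leaderboard uses only the 330 v0.1 core items. No verified official run exists yet, and draft runs must not be cited as official rankings.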

Verified official runs: 0
Measured draft runs: 4
Demo runs: 2
Main leaderboard items: 330
Public items: 400

Score formula

0.4*accuracy_overall + 0.3*source_hit_rate + 0.2*(1-hallucination_rate_l6) + 0.1*refusal_rate_l6_trap
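
As a worked example, the formula can be evaluated directly. The sketch below assumes the four inputs are fractions in [0, 1] and that the displayed score is the weighted sum on a 0-100 scale, rounded to one decimal. The function name `leaderboardScore` is hypothetical and not taken from the runner; the field names simply mirror the terms of the formula.

```javascript
// Minimal sketch of the score formula; inputs are assumed to be fractions in [0, 1].
function leaderboardScore({
  accuracy_overall,
  source_hit_rate,
  hallucination_rate_l6,
  refusal_rate_l6_trap,
}) {
  const raw =
    0.4 * accuracy_overall +
    0.3 * source_hit_rate +
    0.2 * (1 - hallucination_rate_l6) +
    0.1 * refusal_rate_l6_trap;
  // Assumption: the page shows the score on a 0-100 scale with one decimal.
  return Math.round(raw * 1000) / 10;
}

// Rounded figures from the GPT-5.5 draft row on this page.
const example = leaderboardScore({
  accuracy_overall: 0.846,
  source_hit_rate: 0.049,
  hallucination_rate_l6: 0,
  refusal_rate_l6_trap: 0.713,
});
console.log(example); // 62.4
```

Because the tables show rounded percentages, a score recomputed from the displayed values may differ from the listed score by a tenth of a point.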

L6 human audit gate

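Human-review packet generated, not yet signed off

The raw packet contains model answers and grader notes and remains review-required; the public page lists only summary statistics. Draft scores are not converted to verified official runs until 100% of L6 items have been human-reviewed and signed off.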

Status: pending_human_review
L6 audit items: 320
Critical redline: 7
Grader disagreement: 86
Zero / partial: 13
Spot check: 214

Summary artifact: tmp/lius-benchmark-runs/L6-human-audit-summary-2026-06-10.json. Gate: Do not convert draft scores to verified leaderboard entries until packet items are reviewed and signed off.
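
The gate can be expressed as a small check. This is only a sketch: the object shape, the `reviewed_items` counter, the `signed_off` flag, and the function name `canPromoteToVerified` are all hypothetical, since the actual fields of the summary JSON are not shown on this page. The bucket counts are the ones listed above.

```javascript
// Hypothetical summary shape; the real JSON artifact may use different field names.
const summary = {
  status: "pending_human_review",
  l6_audit_items: 320,
  buckets: {
    critical_redline: 7,
    grader_disagreement: 86,
    zero_partial: 13,
    spot_check: 214,
  },
  reviewed_items: 0, // assumed counter, not shown on this page
  signed_off: false, // assumed flag, not shown on this page
};

// Sanity check: the four buckets should account for every audit item.
const bucketTotal = Object.values(summary.buckets).reduce((a, b) => a + b, 0);

// Returns true only when every L6 audit item is reviewed and the packet is signed off.
function canPromoteToVerified(s) {
  const fullyReviewed =
    s.l6_audit_items > 0 && s.reviewed_items === s.l6_audit_items;
  return fullyReviewed && s.signed_off === true;
}

console.log(bucketTotal === summary.l6_audit_items); // true
console.log(canPromoteToVerified(summary)); // false
```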

Measured draft runs

Rank  Model                                     Model ID             Score  Acc    S-Hit  Halluc L6  Refuse L6  N
1     GPT-5.4 (OpenAI OAuth draft)              gpt-5.4              62.5   83%    3.3%   0%         82.5%      330
2     GPT-5.5 (OpenAI OAuth draft)              gpt-5.5              62.4   84.6%  4.9%   0%         71.3%      330
3     GPT-5.4-mini (OpenAI OAuth draft)         gpt-5.4-mini         59.1   73.8%  1.8%   0%         90%        330
4     GPT-5.3-codex-spark (OpenAI OAuth draft)  gpt-5.3-codex-spark  53.0   61.2%  5.2%   5%         80%        330

Status (all rows): UNVERIFIED_THREE_GRADER_CONSENSUS_DRAFT

The draft runs have completed the full 330 items and 3-grader median consensus; the official verified leaderboard still requires human review of L6.
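
A 3-grader median consensus can be sketched as follows. The per-item record shape, the 0 / 0.5 / 1 grading scale, and the function names are assumptions made for illustration; the grader script's actual output format is not shown on this page.

```javascript
// Median of exactly three numeric grader scores for one item.
function medianOfThree(a, b, c) {
  return [a, b, c].sort((x, y) => x - y)[1];
}

// Assumed scale: 1 = correct, 0.5 = partial, 0 = wrong.
// Each record is a hypothetical { id, grades: [g1, g2, g3] } object.
function consensusScores(gradesByItem) {
  return gradesByItem.map(({ id, grades }) => {
    if (grades.length !== 3) {
      throw new Error(`item ${id}: expected 3 grades, got ${grades.length}`);
    }
    return {
      id,
      score: medianOfThree(...grades),
      // Flag items where the graders did not all agree.
      disagreement: new Set(grades).size > 1,
    };
  });
}

const consensus = consensusScores([
  { id: "q1", grades: [1, 1, 0.5] },
  { id: "q2", grades: [0, 0.5, 1] },
  { id: "q3", grades: [0, 0, 0] },
]);
console.log(consensus.map((r) => r.score)); // [ 1, 0.5, 0 ]
```

Taking the median rather than the mean means a single outlying grader cannot move an item's score on its own.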

Demo runs

Rank  Model                                                  Model ID                   Status                                      Score  Acc    S-Hit  Halluc L6  Refuse L6  N
1     anthropic/claude-opus-4-7                              anthropic/claude-opus-4-7  ILLUSTRATIVE_DEMO_ONLY_NOT_FOR_LEADERBOARD  83.0   100%   60%    0%         50%        5
2     Claude Opus 4.7 (Anthropic, knowledge cutoff 2026-01)  anthropic/claude-opus-4-7  SPEC_EXAMPLE_ONLY                           71.7   70.3%  51%    10%        78%        330

Demo runs only illustrate the schema and display format; an official leaderboard entry requires the full 330 items, 3-grader consensus, and 100% human review of L6 items.
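
The three requirements can be sketched as a filter over run records. The field names `n`, `grader_count`, and `l6_human_review_fraction`, and the function name `isOfficialCandidate`, are hypothetical; only the status labels and the 330-item count come from this page.

```javascript
const CORE_ITEMS = 330; // v0.1 core item count stated on this page

// Status labels on this page that mark a run as illustration only.
const NON_LEADERBOARD = new Set([
  "ILLUSTRATIVE_DEMO_ONLY_NOT_FOR_LEADERBOARD",
  "SPEC_EXAMPLE_ONLY",
]);

// Hypothetical run record: { status, n, grader_count, l6_human_review_fraction }.
function isOfficialCandidate(run) {
  return (
    !NON_LEADERBOARD.has(run.status) &&
    run.n === CORE_ITEMS &&
    run.grader_count >= 3 &&
    run.l6_human_review_fraction === 1
  );
}

const demoRun = {
  status: "ILLUSTRATIVE_DEMO_ONLY_NOT_FOR_LEADERBOARD",
  n: 5,
  grader_count: 1, // assumed value for illustration
  l6_human_review_fraction: 0, // assumed value for illustration
};
console.log(isOfficialCandidate(demoRun)); // false
```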

Models to run

Provider      Model                Model ID                    Status
OpenAI        GPT-5.5              openai/gpt-5.5              draft_run_completed_three_grader_consensus
OpenAI        GPT-5.4-mini         openai/gpt-5.4-mini         draft_run_completed_three_grader_consensus
OpenAI        GPT-5.4              openai/gpt-5.4              draft_run_completed_three_grader_consensus
OpenAI OAuth  GPT-5.3 Codex Spark  openai/gpt-5.3-codex-spark  draft_run_completed_three_grader_consensus
Anthropic     Claude Opus 4.7      anthropic/claude-opus-4-7   pending_run
Google        Gemini 2.5 Pro       google/gemini-2.5-pro       pending_run
lius.cc       Daoism-Qwen3.5-9B    lius-cc/Daoism-Qwen3.5-9B   pending_local_or_hf_run

Runner / Grader

node scripts/benchmark/run-lius-benchmark.mjs --dry-run --limit 5

LLM_BENCHMARK_BASE_URL=http://127.0.0.1:8000/v1 \
LLM_BENCHMARK_API_KEY=dummy \
LLM_BENCHMARK_MODEL=Daoism-Qwen3.5-9B \
  node scripts/benchmark/run-lius-benchmark.mjs --suite core-v0.1 --limit 330

node scripts/benchmark/grade-lius-benchmark.mjs \
  --input tmp/lius-benchmark-runs/<run>.jsonl \
  --dry-run

LLM_GRADER_BASE_URL=http://127.0.0.1:8001/v1 \
LLM_GRADER_API_KEY=dummy \
LLM_GRADER_MODEL=grader-model \
  node scripts/benchmark/grade-lius-benchmark.mjs \
    --input tmp/lius-benchmark-runs/<run>.jsonl