Why K-Bench is Needed

Increasingly, individuals turn to LLMs for support in situations involving serious mental health risks, including suicidal ideation, self-harm, domestic violence, and substance use. In these contexts, it is critical that models respond with appropriate validation, risk recognition, and escalation behaviors.

However, existing benchmarks typically assess isolated tasks and fail to reflect the complexity of real-world presentations, where risks are often overlapping and comorbid.

K-Bench is designed to close this gap by evaluating models against clinically grounded, high-fidelity scenarios that capture these interactions. It combines rich synthetic vignettes informed by real patient material, a rigorously defined rubric developed with a stakeholder panel, and clinician-derived ground-truth ratings, providing a more realistic and safety-relevant assessment of model behavior.

Model rankings

Risk focuses on D1 clinical judgement and D2 risk exploration, which are the safety-critical dimensions. Overall combines all rubric dimensions into one headline ranking.

Rank	Model	Risk Net Improvement	Overall Net Improvement

Risk performance

This chart shows each selected model's risk-domain profile on a shared score scale. Combined risk pools D1 and D2: D1 is clinical judgement and risk awareness, while D2 is risk exploration, including follow-up questions, protective factors, coping, and immediate safety planning.

Rubric dimension profile

Scores on the seven rubric dimensions (D1–D7) for the selected models, plotted on a score scale fitted to the selected models and capped at 100.

Dimension detail table

Exact aggregate values for each published dimension.

Demographic Stratification

Select a demographic or disclosure axis to inspect anonymized transcript-score distributions. Each dot is one judged transcript score for that subgroup and model/prompt combination, shown relative to that model's overall average. Axis labels indicate whether the subgroup is patient-surfaced or vignette-derived.
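
As a rough illustration of what "relative to that model's overall average" could mean, the sketch below subtracts each model's mean transcript score from every one of its scores. It assumes the overall average is a plain unweighted mean over that model's judged transcripts, which this page does not confirm; the function name, the row layout, and the model and subgroup labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def deviations_from_model_mean(rows):
    """Return (model, subgroup, score - model_mean) for each input row.

    rows: iterable of (model, subgroup, score) tuples (hypothetical layout).
    Assumption: a model's overall average is the unweighted mean of all
    of its transcript scores.
    """
    rows = list(rows)

    # Collect every score per model so the per-model mean can be computed.
    scores_by_model = defaultdict(list)
    for model, _subgroup, score in rows:
        scores_by_model[model].append(score)

    model_mean = {m: mean(s) for m, s in scores_by_model.items()}

    # Each dot on the chart would then be the score minus its model's mean.
    return [(m, g, s - model_mean[m]) for m, g, s in rows]


# Illustrative, made-up scores only.
example_rows = [
    ("model-a", "group-x", 90.0),
    ("model-a", "group-y", 94.0),
    ("model-b", "group-x", 80.0),
]
example_deviations = deviations_from_model_mean(example_rows)
```

A positive value means the transcript scored above that model's own average, so subgroups can be compared within a model without the model's overall level dominating the picture.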

Severity split

Performance as cases become more acute, shown on a 90–100 score axis.

Comorbidity index

How quickly performance degrades as active risks stack up, shown on a 90–100 score axis.
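
One simple way to put a number on "how quickly performance degrades" is the least-squares slope of score against the count of active risks. The sketch below is only that: an ordinary least-squares slope computed by hand. It is not claimed to be how the comorbidity index on this page is calculated, and the function name and the sample points are hypothetical.

```python
def degradation_slope(points):
    """Ordinary least-squares slope of score against number of active risks.

    points: list of (n_active_risks, score) pairs (hypothetical layout).
    A negative slope means scores fall as risks stack up.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")

    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Sum of squared deviations of x; zero means every case has the same risk count.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share the same number of active risks")

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Illustrative, made-up scores only: one point lost per additional active risk.
example_slope = degradation_slope([(1, 98.0), (2, 97.0), (3, 96.0)])
```

Under this reading, a slope closer to zero would indicate a model whose scores hold up better as comorbid risks accumulate.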

Run summary

Current published snapshot from the latest public leaderboard run.