LLM safety, benchmarked.


Why K-Bench is Needed

Increasingly, individuals turn to LLMs for support in situations involving serious mental health risks, including suicidal ideation, self-harm, domestic violence, and substance use. In these contexts, it is critical that models respond with appropriate validation, risk recognition, and escalation behaviors.

However, existing benchmarks typically assess isolated tasks and fail to reflect the complexity of real-world presentations, where risks are often overlapping and comorbid.

K-Bench is designed to close this gap by evaluating models against clinically grounded, high-fidelity scenarios that capture these interactions. It combines rich synthetic vignettes informed by real patient material, a rigorously defined rubric developed with a stakeholder panel, and clinician-derived ground truth ratings, providing a more realistic and safety-relevant assessment of model behavior.

Model providers

Model rankings

The Risk ranking focuses on D1 (clinical judgement) and D2 (risk exploration), which are the safety-critical dimensions. The Overall ranking combines all rubric dimensions into one headline ranking.
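To illustrate how the two rankings can diverge, the sketch below aggregates per-dimension scores into a Risk score (D1 and D2 only) and an Overall score (D1 through D7). The unweighted mean, the function names, and the example scores are assumptions made for illustration; this page does not state how K-Bench weights or pools the dimensions.

```python
from statistics import mean

# D1 and D2 are the safety-critical dimensions named on this page.
RISK_DIMENSIONS = ("D1", "D2")
ALL_DIMENSIONS = tuple(f"D{i}" for i in range(1, 8))  # D1..D7


def risk_score(scores):
    """Assumed aggregation: unweighted mean of D1 and D2."""
    return mean(scores[d] for d in RISK_DIMENSIONS)


def overall_score(scores):
    """Assumed aggregation: unweighted mean of all seven dimensions."""
    return mean(scores[d] for d in ALL_DIMENSIONS)


def rank_models(models, key):
    """Return model names sorted from best to worst under the given score."""
    return sorted(models, key=lambda name: key(models[name]), reverse=True)


# Hypothetical scores, not real leaderboard data.
example = {
    "model_a": {"D1": 90, "D2": 80, "D3": 70, "D4": 70, "D5": 70, "D6": 70, "D7": 70},
    "model_b": {"D1": 70, "D2": 70, "D3": 90, "D4": 90, "D5": 90, "D6": 90, "D7": 90},
}

risk_order = rank_models(example, risk_score)
overall_order = rank_models(example, overall_score)
```

With these made-up numbers, model_a leads on Risk while model_b leads on Overall, which is why the two views are worth reading separately.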

Columns: Rank, Model, and scores shown relative to the average displayed model performance.


Risk performance

This chart shows each selected model's risk-domain profile on a shared score scale. Combined risk pools D1 and D2: D1 is clinical judgement and risk awareness, while D2 is risk exploration, including follow-up questions, protective factors, coping, and immediate safety planning.

Rubric dimension profile

Scores on the seven rubric dimensions (D1–D7) for the selected models, plotted on a scale fitted to the selected models and capped at 100.

Dimension detail table

Exact aggregate values for each published dimension.


Demographic Stratification

Select a demographic or disclosure axis to inspect anonymized transcript-score distributions. Each dot is one judged transcript score for that subgroup and model/prompt combination, shown relative to that model's overall average. Axis labels indicate whether the subgroup is patient-surfaced or vignette-derived.
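The "relative to that model's overall average" view can be sketched as a simple offset calculation. This is a minimal illustration that assumes the overall average is the unweighted mean of all judged transcript scores for one model/prompt combination; the record layout, the function name, and the sample scores are hypothetical.

```python
from statistics import mean


def subgroup_offsets(records, subgroup):
    """Offsets of one subgroup's transcript scores from the model's overall mean.

    records: list of (subgroup_label, score) pairs for a single
    model/prompt combination (hypothetical layout).
    """
    baseline = mean(score for _, score in records)
    return [score - baseline for label, score in records if label == subgroup]


# Hypothetical transcript scores, not real benchmark data.
records = [("group_a", 90), ("group_a", 94), ("group_b", 98), ("group_b", 98)]

offsets_a = subgroup_offsets(records, "group_a")
offsets_b = subgroup_offsets(records, "group_b")
```

Each returned value corresponds to one dot in the chart: negative offsets sit below the model's average, positive offsets above it.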

Severity split

Performance as cases become more acute, shown on a 90–100 score axis.

Comorbidity index

How quickly performance degrades as active risks stack up, shown on a 90–100 score axis.
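One way to summarise "how quickly performance degrades" is the least-squares slope of score against the number of active risks. The sketch below is an assumption about how such a summary could be computed, not the method used by K-Bench; the function name and data points are hypothetical.

```python
def degradation_slope(points):
    """Least-squares slope of score versus number of active risks.

    points: list of (active_risk_count, score) pairs.
    A negative slope means scores fall as risks stack up.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    if variance == 0:
        raise ValueError("need at least two distinct risk counts")
    return covariance / variance


# Hypothetical scores at 1, 2, and 3 active risks.
slope = degradation_slope([(1, 98.0), (2, 97.0), (3, 96.0)])
```

Here the slope is -1.0, meaning one point lost per additional active risk in this made-up example.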

Run summary

Current published snapshot from the latest public leaderboard run.
