Metrics and Ranking

Algorithms are evaluated along two main dimensions: dose accuracy and computational efficiency, using a two-level evaluation scheme that is identical across all four tasks (photon/proton, CT/MRI).

All dose metrics are computed on a common reference dose grid. Participants must resample their output to this reference grid before submission if their method uses a different internal resolution.


Evaluation Levels

Level 1 — Single-Beam / Segment Dose Evaluation

Level 1 assesses the accuracy of individual beams. For each patient, the set of beams/segments from the treatment plan is used (photon VMAT segments or proton pencil beams), covering a wide range of incident angles, paths through heterogeneous anatomy, and energy settings.

1.1 Masked Mean Absolute Error (MAE) per Beam

For each beam, a high-dose region is defined as all voxels receiving at least 10% of that beam's maximum ground-truth dose. Within this region, the mean absolute difference between the predicted and ground-truth dose is computed and normalized by that beam's maximum ground-truth dose. This metric focuses evaluation on the clinically relevant part of the beam path while ignoring low-dose background noise.
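The masked MAE described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the official evaluation code: the function name `masked_beam_mae` and the `threshold` parameter are hypothetical, and both dose arrays are assumed to be already resampled to the common reference grid.

```python
import numpy as np

def masked_beam_mae(pred, truth, threshold=0.10):
    """Masked MAE for one beam, normalized by the beam's maximum ground-truth dose."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    d_max = truth.max()
    if d_max <= 0:
        raise ValueError("ground-truth beam dose has no positive voxel")
    # The mask is derived from the ground truth only, never from the prediction.
    mask = truth >= threshold * d_max
    return float(np.abs(pred[mask] - truth[mask]).mean() / d_max)

# Toy beam: the first two voxels fall below 10% of the maximum and are ignored.
truth = np.array([0.0, 0.05, 0.5, 1.0])
pred = np.array([0.3, 0.05, 0.6, 0.9])
print(masked_beam_mae(pred, truth))
```

In the toy example, the large error in the first voxel does not affect the result because that voxel lies outside the masked region.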

1.2 Integrated Depth-Dose (IDD) Curve Distance

To characterize longitudinal dose behaviour along the beam direction, IDD curves are computed for both photon and proton beams:

  • The IDD curve represents accumulated dose as a function of geometric depth from the patient surface along the beam axis, computed for both ground truth and prediction.
  • A curve distance metric (root-mean-square difference along the depth coordinate), normalized by the peak value of the ground-truth IDD curve, summarizes the overall difference between the predicted and ground-truth IDD curves.

This IDD-based metric is applicable to all four tasks and is particularly sensitive to dose build-up, fall-off, and, for protons, range accuracy.
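As an illustration, the IDD curve distance could be computed as follows. The sketch assumes that each beam dose has already been resampled into a beam's-eye-view grid whose first axis is depth from the patient surface and whose voxels have uniform size; that resampling step, and the names `idd_curve` and `idd_distance`, are assumptions not specified in the text above.

```python
import numpy as np

def idd_curve(dose, depth_axis=0):
    """Sum the dose over the lateral axes, leaving one value per depth step."""
    dose = np.asarray(dose, dtype=float)
    lateral_axes = tuple(ax for ax in range(dose.ndim) if ax != depth_axis)
    return dose.sum(axis=lateral_axes)

def idd_distance(pred, truth, depth_axis=0):
    """RMS difference between IDD curves, normalized by the ground-truth IDD peak."""
    idd_pred = idd_curve(pred, depth_axis)
    idd_true = idd_curve(truth, depth_axis)
    peak = idd_true.max()
    if peak <= 0:
        raise ValueError("ground-truth IDD curve has no positive value")
    rms = np.sqrt(np.mean((idd_pred - idd_true) ** 2))
    return float(rms / peak)

# Toy beam with 4 depth steps and a 2 x 2 lateral cross-section.
truth = np.ones((4, 2, 2)) * np.array([1.0, 2.0, 4.0, 1.0]).reshape(4, 1, 1)
pred = truth.copy()
pred[2] = 3.5  # under-predict the peak plane
print(idd_distance(pred, truth))
```

Because the lateral integration removes any lateral spread, this metric responds mainly to errors along the beam direction.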


Level 2 — Full-Plan Evaluation

Level 2 assesses the accuracy of the complete reconstructed treatment plan, using the same beam set as Level 1, but combining beams with their clinical weights (VMAT monitor unit weights for photons; pencil beam weights for protons).

2.1 Stratified Plan-Level MAE

MAE is computed in three dose strata defined from the ground-truth plan dose, with each stratum normalized by the prescription dose:

Stratum      Definition
High-dose    Voxels receiving ≥ 80% of the prescription dose
Mid-dose     Voxels receiving 30%–80% of the prescription dose
Low-dose     Voxels receiving 10%–30% of the prescription dose

The combined plan-level MAE is the unweighted average of these three stratified MAEs, giving equal weight to each dose region.
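A minimal sketch of the stratified MAE follows. Two points are assumptions made for illustration only: the strata are treated as half-open intervals (a voxel at exactly 30% or 80% of the prescription falls into the higher stratum), and an empty stratum raises an error. The names `STRATA` and `stratified_plan_mae` are hypothetical.

```python
import numpy as np

# Bounds as fractions of the prescription dose: [lower, upper).
STRATA = {"low": (0.10, 0.30), "mid": (0.30, 0.80), "high": (0.80, np.inf)}

def stratified_plan_mae(pred, truth, prescription):
    """Return (per-stratum MAE dict, combined MAE), normalized by the prescription."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rel_truth = truth / prescription            # strata come from the ground truth
    abs_err = np.abs(pred - truth) / prescription
    per_stratum = {}
    for name, (lower, upper) in STRATA.items():
        mask = (rel_truth >= lower) & (rel_truth < upper)
        if not mask.any():
            raise ValueError(f"stratum '{name}' contains no voxels")
        per_stratum[name] = float(abs_err[mask].mean())
    combined = sum(per_stratum.values()) / len(per_stratum)  # unweighted average
    return per_stratum, combined

truth = np.array([0.5, 2.0, 5.0, 9.0])   # 5%, 20%, 50%, 90% of a 10 Gy prescription
pred = np.array([3.0, 2.5, 5.0, 8.0])
print(stratified_plan_mae(pred, truth, prescription=10.0))
```

Averaging the three strata without volume weighting means a small high-dose region counts as much as a much larger low-dose region.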

2.2 3D Local Gamma Index (1% / 1 mm)

A three-dimensional local gamma pass rate is computed with strict criteria: 1% dose difference and 1 mm distance-to-agreement, using the ground-truth plan as reference.

  • A voxel passes (γ ≤ 1) if there exists at least one reference voxel satisfying the combined 1 mm distance-to-agreement and 1% local dose-difference criterion.
  • The gamma pass rate is defined as the percentage of evaluated voxels satisfying this criterion.
  • Evaluation is restricted to voxels receiving at least 10% of the prescription dose.
  • Local normalization (rather than global maximum) is applied, consistent with current clinical patient-specific QA practice.

This metric jointly captures both spatial misalignments and dosimetric discrepancies.
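The pass/fail logic can be illustrated with a brute-force search on the voxel grid. This is a simplified sketch under stated assumptions, not a reference implementation: it searches only whole-voxel offsets (clinical gamma tools normally interpolate the dose between voxels), it normalizes the dose difference by the reference dose at the candidate voxel, and the function name and parameters are hypothetical.

```python
import itertools
import numpy as np

def local_gamma_pass_rate(pred, truth, spacing_mm, prescription,
                          dose_crit=0.01, dta_mm=1.0, cutoff=0.10):
    """Percentage of evaluated voxels with local gamma <= 1 (no interpolation)."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    evaluated = truth >= cutoff * prescription
    if not evaluated.any():
        raise ValueError("no voxel reaches the evaluation cutoff")

    # Gamma <= 1 needs a reference voxel within dta_mm, so the search stops there.
    reach = [int(np.floor(dta_mm / s)) for s in spacing_mm]
    padded = np.pad(truth, [(r, r) for r in reach], constant_values=np.nan)

    gamma_sq = np.full(truth.shape, np.inf)
    for offset in itertools.product(*(range(-r, r + 1) for r in reach)):
        dist_sq = sum((o * s) ** 2 for o, s in zip(offset, spacing_mm))
        if dist_sq > dta_mm ** 2:
            continue
        window = tuple(slice(r + o, r + o + n)
                       for r, o, n in zip(reach, offset, truth.shape))
        ref = padded[window]  # reference dose at the shifted position (NaN off-grid)
        with np.errstate(divide="ignore", invalid="ignore"):
            dose_term = ((pred - ref) / (dose_crit * ref)) ** 2
        candidate = dist_sq / dta_mm ** 2 + dose_term
        gamma_sq = np.fmin(gamma_sq, candidate)  # fmin skips NaN candidates

    return float(100.0 * (gamma_sq[evaluated] <= 1.0).mean())

uniform = np.ones((3, 3, 3))
print(local_gamma_pass_rate(uniform * 1.005, uniform, (1.0, 1.0, 1.0), 1.0))
```

Without interpolation the sketch tends to under-estimate the pass rate on grids coarser than the 1 mm distance criterion, because no off-centre candidate is ever tested.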

2.3 DVH-Based Clinical Score

A standardized dose–volume histogram (DVH) score is computed for each plan, using:

  • One target structure (PTV)
  • The three closest organs at risk (OARs) to the target

The following DVH quantities are extracted:

Structure       DVH Metric
Target (PTV)    D98% (near-minimum dose)
Target (PTV)    V95% (volume receiving ≥ 95% of prescription)
Each OAR (×3)   D2% (near-maximum dose)
Each OAR (×3)   Dmean (mean dose)

For each DVH quantity, the absolute relative difference between the predicted and ground-truth plan values is computed. The DVH score is the weighted average of these relative differences, with equal contribution from target metrics and OAR metrics.
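One possible reading of the DVH score is sketched below. The exact weights are not stated above, so the sketch assumes that the mean relative difference over the two target quantities and the mean over the six OAR quantities each contribute 50%. Dx% is taken as the (100 − x)th percentile of the voxel doses in the structure, all voxels are assumed to have equal volume, and ground-truth quantities equal to zero are not handled. All function names are hypothetical.

```python
import numpy as np

def dose_at_volume(dose, mask, volume_pct):
    """D_x%: the dose level reached or exceeded by x% of the structure volume."""
    return float(np.percentile(dose[mask], 100.0 - volume_pct))

def dvh_quantities(dose, ptv_mask, oar_masks, prescription):
    target = [dose_at_volume(dose, ptv_mask, 98.0),                    # D98%
              float((dose[ptv_mask] >= 0.95 * prescription).mean())]   # V95%
    oars = []
    for mask in oar_masks:
        oars += [dose_at_volume(dose, mask, 2.0),                      # D2%
                 float(dose[mask].mean())]                             # Dmean
    return np.array(target), np.array(oars)

def dvh_score(pred, truth, ptv_mask, oar_masks, prescription):
    t_pred, o_pred = dvh_quantities(pred, ptv_mask, oar_masks, prescription)
    t_true, o_true = dvh_quantities(truth, ptv_mask, oar_masks, prescription)
    rel_target = np.abs(t_pred - t_true) / np.abs(t_true)
    rel_oar = np.abs(o_pred - o_true) / np.abs(o_true)
    # Assumed weighting: target group and OAR group contribute equally.
    return float(0.5 * rel_target.mean() + 0.5 * rel_oar.mean())

idx = np.arange(8)
truth = np.array([10.0, 10.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
ptv = idx < 2
oars = [idx // 2 == k for k in (1, 2, 3)]
print(dvh_score(0.9 * truth, truth, ptv, oars, prescription=10.0))
```

In this toy case a uniform 10% under-dose drops V95% from 1 to 0, which dominates the score and shows how sensitive threshold-based quantities can be.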


Efficiency — Runtime Metric

Real-time dose calculation is a core requirement of this challenge. A runtime metric measures the wall-clock time to process a fixed set of beams (identical for all teams).

The full testing set of beams will be sub-batched for each execution of the submitted algorithms:

  • Participants may choose their batching strategy and internal implementation to minimize total runtime.
  • The efficiency metric is the average runtime per beam = total time (over all batches) ÷ total number of beams in the runtime scenario.

⚠️ Hard limit: Algorithms exceeding an average of 1 second per beam (including data loading and model initialization) will be excluded from the official ranking for that task.
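The runtime arithmetic is simple enough to check by hand. The helper names below are hypothetical, and the per-batch wall-clock times are assumed to already include data loading and model initialization.

```python
def runtime_per_beam(batch_times_s, batch_sizes):
    """Average wall-clock seconds per beam over all batches."""
    total_beams = sum(batch_sizes)
    if total_beams == 0:
        raise ValueError("runtime scenario contains no beams")
    return sum(batch_times_s) / total_beams

def within_hard_limit(avg_seconds_per_beam, limit_s=1.0):
    """True if the submission stays eligible for the official ranking."""
    return avg_seconds_per_beam <= limit_s

# Three batches of 20, 20 and 8 beams taking 12 s, 9 s and 3 s in total.
avg = runtime_per_beam([12.0, 9.0, 3.0], [20, 20, 8])
print(avg, within_hard_limit(avg))
```

Because the average is taken over all beams rather than over batches, a slow first batch (for example, one that absorbs model loading) can be offset by faster later batches.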


Ranking Method

Rankings are computed independently for each of the four tasks.

Step 1 — Patient-Level Computation and Submission-Level Aggregation

  • Level 1 metrics (beam-level MAE, IDD curve distance): first averaged across all beams of each patient, and then averaged across all test patients.
  • Level 2 metrics (combined plan-level MAE, gamma pass rate, DVH score): computed directly on the reconstructed plan for each patient, then averaged across all test patients.
  • Runtime: a single submission-level value obtained from the standardized runtime scenario.
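The two-stage averaging for Level 1 metrics differs from pooling all beams together when patients have different numbers of beams. A small sketch, with hypothetical function names and made-up values:

```python
from statistics import fmean

def aggregate_beam_metric(per_patient_beam_values):
    """Level 1: mean over each patient's beams, then mean over patients."""
    patient_means = [fmean(values) for values in per_patient_beam_values.values()]
    return fmean(patient_means)

def aggregate_plan_metric(per_patient_values):
    """Level 2: one value per patient, averaged over patients."""
    return fmean(per_patient_values.values())

beam_mae = {"patient_1": [0.02, 0.04], "patient_2": [0.10]}
print(aggregate_beam_metric(beam_mae))                     # patient-balanced mean
print(fmean(v for vs in beam_mae.values() for v in vs))    # pooled mean, for contrast
```

With the patient-level step, every patient carries the same weight regardless of how many beams or segments the plan contains.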

Step 2 — Per-Metric Ranking

All valid submissions are ranked per metric (rank 1 = best):

Direction          Metrics
Lower is better    Beam-level masked MAE, IDD curve distance, combined plan-level MAE, DVH score, runtime per beam
Higher is better   Local 1%/1 mm gamma pass rate
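Per-metric ranking with a direction flag might look like the sketch below. How tied metric values are ranked is not stated above; the sketch assumes tied submissions share the average of the ranks they span. The function name and team labels are hypothetical.

```python
def rank_submissions(values, higher_is_better=False):
    """Map each team to its rank on one metric (1 = best); ties share the average rank."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of teams with the same metric value.
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for team in ordered[i:j + 1]:
            ranks[team] = shared_rank
        i = j + 1
    return ranks

plan_mae = {"team_a": 0.02, "team_b": 0.05, "team_c": 0.03}
gamma = {"team_a": 95.0, "team_b": 98.0, "team_c": 95.0}
print(rank_submissions(plan_mae))
print(rank_submissions(gamma, higher_is_better=True))
```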

Step 3 — Final Ranking: RankThenMean with Double Weight on Efficiency

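The final score is the weighted average of per-metric ranks, with weight 2 for runtime and weight 1 for each of the five accuracy metrics:

  Final score = (R_beamMAE + R_IDD + R_planMAE + R_gamma + R_DVH + 2 × R_runtime) / 7

where R_m denotes the submission's rank on metric m. Submissions are ordered from lowest to highest final score (lower = better).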

The runtime metric receives double weight to ensure computational efficiency is not overshadowed by the larger number of accuracy metrics. The resulting 2:5 weighting ratio (runtime weight 2 against five accuracy metrics of weight 1 each) reflects that both speed and accuracy are critical, with a slight emphasis on accuracy given the clinical consequences of dose miscalculation.
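The weighted rank average can be written out directly from the weights described above (2 for runtime, 1 for each of the five accuracy metrics). The metric keys in `WEIGHTS` are hypothetical labels chosen for this sketch.

```python
WEIGHTS = {
    "beam_mae": 1,
    "idd_distance": 1,
    "plan_mae": 1,
    "gamma_pass_rate": 1,
    "dvh_score": 1,
    "runtime": 2,  # double weight on efficiency
}

def final_score(ranks):
    """Weighted mean of per-metric ranks; lower is better."""
    total_weight = sum(WEIGHTS.values())  # 7
    return sum(WEIGHTS[m] * ranks[m] for m in WEIGHTS) / total_weight

accurate_but_slow = {m: 1 for m in WEIGHTS} | {"runtime": 3}
fast_but_less_accurate = {m: 2 for m in WEIGHTS} | {"runtime": 1}
print(final_score(accurate_but_slow), final_score(fast_but_less_accurate))
```

In this made-up example the submission that ranks first on every accuracy metric but third on runtime still finishes ahead (11/7 against 12/7), which illustrates how the accuracy ranks collectively outweigh the runtime rank.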

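Tie-Breaking Rules

If two or more submissions have the same final score, the tie is broken by applying the following criteria in order:
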
  1. Lower runtime per beam
  2. Lower combined plan-level MAE
  3. Lower DVH score
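The tie-breaking order above maps naturally onto a tuple sort key, since Python compares tuples element by element. The dictionary keys and team names are hypothetical, and the sketch assumes all four values are available for every submission.

```python
def order_submissions(entries):
    """Sort by final score, then runtime, then plan-level MAE, then DVH score."""
    return sorted(
        entries,
        key=lambda e: (e["final_score"], e["runtime_per_beam"],
                       e["plan_mae"], e["dvh_score"]),
    )

entries = [
    {"team": "team_a", "final_score": 2.0, "runtime_per_beam": 0.4,
     "plan_mae": 0.02, "dvh_score": 0.05},
    {"team": "team_b", "final_score": 2.0, "runtime_per_beam": 0.2,
     "plan_mae": 0.03, "dvh_score": 0.04},
    {"team": "team_c", "final_score": 1.5, "runtime_per_beam": 0.9,
     "plan_mae": 0.04, "dvh_score": 0.06},
]
print([e["team"] for e in order_submissions(entries)])
# ['team_c', 'team_b', 'team_a']
```

Here team_a and team_b tie on the final score, and team_b is placed ahead because of its lower runtime per beam.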