Validating AI Segmentation Without a Gold Standard: A New FDA Method and What It Means for Developers

At SPIE Medical Imaging 2026, a team of FDA researchers (Tingting Hu, Berkman Sahiner, Shuyue Guan, Mike Mikailov, Kenny Cha, Frank Samuelson, and Nicholas Petrick) presented a statistical method that addresses one of the most stubborn problems in evaluating AI-based image segmentation: how do you prove a device performs well when there is no objective "correct answer" to measure it against?

For any company building software that automatically outlines lesions or organs, this is not an academic question. It sits at the center of how you design your validation study, how you size it, and how you frame your performance claims to FDA. Below is what the method does, how it works, and what it means if you are developing one of these devices.

The Problem With How Segmentation Is Validated Today

Most AI segmentation devices are validated the same way. You collect expert annotations, aggregate them into a single "reference standard" or ground truth contour, then compare your model's output to that contour using a similarity metric. The two most common are the Dice Similarity Coefficient (DSC), which measures how much two regions overlap, and the Hausdorff Distance (HD), which measures how far apart two boundaries are at their worst-matched point.
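
To make the two metrics concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from the FDA paper: the function names, the representation of a mask as a set of (row, col) pixels, and the toy squares are all assumptions made for this example.

```python
import math

def dice(a, b):
    """Dice Similarity Coefficient between two sets of (row, col) pixels."""
    if not a and not b:
        return 1.0  # assumed convention: two empty masks count as perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))

def hausdorff(p, q):
    """Symmetric Hausdorff Distance between two non-empty point sets."""
    def directed(src, dst):
        # Worst case over src of the distance to the nearest point in dst.
        return max(min(math.dist(s, d) for d in dst) for s in src)
    return max(directed(p, q), directed(q, p))

def square(top, left, size):
    """Hypothetical helper: a filled square mask as a set of pixels."""
    return {(r, c) for r in range(top, top + size) for c in range(left, left + size)}

reference = square(0, 0, 4)   # 4x4 region
prediction = square(0, 1, 4)  # same region shifted one column to the right

dsc = dice(reference, prediction)      # 12 shared pixels out of 16 + 16
hd = hausdorff(reference, prediction)  # the unmatched columns are one pixel away
print(dsc, hd)  # 0.75 1.0
```

In practice HD is usually computed on contour points rather than on every pixel of the mask; the function above simply works on whatever point sets it is given.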

There are two problems with this approach, and both are well known to anyone who has run one of these studies.

First, there is no definitive gold standard for segmentation. Two qualified radiologists asked to outline the same lesion will produce two different contours. Aggregating multiple expert annotations into one "truth" does not eliminate that disagreement; it just buries it. The reference standard you validate against is itself an estimate with real variability baked in.

Second, even if you accept an aggregated reference, it is genuinely hard to define what score is good enough. Is a Dice of 0.85 acceptable? 0.90? The answer depends on the clinical task, and a single threshold rarely captures what actually matters.

What the FDA Team Proposes

The method reframes the question. Instead of asking how close the AI is to a manufactured ground truth, it asks whether the AI's disagreement with the expert panel is any larger than the experts' disagreement with one another.

Put plainly: if your device differs from the human readers by about as much as the human readers differ from each other, then for practical purposes the device is interchangeable with a human reader. That is a claim you can actually defend, and it does not require constructing a reference standard at all.

This builds directly on the team's 2025 paper in the Journal of Medical Imaging, which established the framework for overlap-based metrics. The SPIE 2026 work extends it to distance-based metrics, so the same approach now works whether you care about region overlap, boundary accuracy, or both.

How the Method Works

The approach adapts the "individual equivalence index" introduced by Obuchowski and colleagues for comparing imaging tests. The team defines a segmentation interchangeability metric, written as delta, as the difference between two quantities: the average dissimilarity between the device and each human reader, minus the average dissimilarity observed within the human panel itself.
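
The definition translates almost directly into code. The sketch below is one possible reading of it, not the authors' implementation: it assumes delta is computed per image and then averaged across images, uses 1 minus Dice as the dissimilarity, and all function names and toy contours are hypothetical.

```python
from itertools import combinations
from statistics import mean

def dice_dissimilarity(a, b):
    """1 - Dice for two non-empty pixel sets (0 means identical)."""
    return 1 - 2 * len(a & b) / (len(a) + len(b))

def case_delta(device, readers, dissimilarity):
    """Delta for one image: device-vs-reader minus reader-vs-reader dissimilarity.

    Needs at least two readers so that a within-panel pair exists.
    """
    device_vs_readers = mean(dissimilarity(device, r) for r in readers)
    within_panel = mean(dissimilarity(a, b) for a, b in combinations(readers, 2))
    return device_vs_readers - within_panel

def study_delta(cases, dissimilarity):
    """Assumed aggregation: the mean of per-image deltas over (device, readers) pairs."""
    return mean(case_delta(dev, readers, dissimilarity) for dev, readers in cases)

# Toy 1D "contours": three readers whose outlines are shifted by one pixel each.
readers = [set(range(0, 10)), set(range(1, 11)), set(range(2, 12))]
device = set(range(1, 11))  # the device matches the middle reader

delta = case_delta(device, readers, dice_dissimilarity)
print(round(delta, 4))  # -0.0667
```

Here the device is, on average, closer to the readers than the readers are to each other, so delta comes out slightly negative.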

A small delta means the device sits comfortably inside the spread of the human experts. A large delta means it is an outlier. Because the metric is built on a user-specified dissimilarity measure, it works with Dice (expressed as a dissimilarity, for example 1 minus Dice), with Hausdorff Distance, and in principle with other segmentation metrics as well. Confidence intervals are obtained through bootstrapping.
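
The post does not spell out the bootstrap scheme, so the following is only a generic sketch of how a bootstrap confidence interval for delta could be obtained. It assumes one delta value per image, resamples images with replacement, and takes a percentile interval. The function name, its parameters, and the synthetic per-image values are hypothetical and are not results from the paper.

```python
import random
from statistics import mean

def bootstrap_ci(case_deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-image delta values."""
    rng = random.Random(seed)
    n = len(case_deltas)
    # Resample images with replacement and recompute the mean each time.
    boot_means = sorted(mean(rng.choices(case_deltas, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic per-image deltas for illustration only.
data_rng = random.Random(1)
case_deltas = [data_rng.gauss(0.02, 0.05) for _ in range(200)]

point_estimate = mean(case_deltas)
lower, upper = bootstrap_ci(case_deltas)
print(f"delta = {point_estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Only images are resampled in this sketch and the reader panel is left as is, which is a simplifying choice; a real analysis should follow the resampling scheme described in the paper.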

To validate the method, the team ran simulation studies using the Medical Image Segmentation Synthesis (MISS) tool, which generates realistic contour variations through affine, Fourier, and spike transformations. They tested two kinds of scenarios. In "transformation-agreeable" cases, the device and the experts varied in the same ways, representing genuine interchangeability. In "transformation-disagreeable" cases, the device varied differently from the experts, representing a device that should be flagged.

The Results

The method performed the way you would want a statistical test to perform. In the agreement scenarios, it held the Type I error rate near the target of 0.05, meaning it rarely flagged an interchangeable device as different. In the disagreement scenarios, Type II error was low across most settings, meaning it reliably caught devices that genuinely differed from the panel. These results held across reader panels of 2 to 9 experts and datasets of 100 to 500 images, using both Dice and Hausdorff Distance.

Two findings are particularly useful in practice. Statistical power improved as either the number of images or the number of expert readers increased, which gives you two levers when planning a study. And in several cases where one metric struggled to detect a real difference, the other metric caught it, which underscores that overlap-based and distance-based metrics are complementary rather than interchangeable.

What This Means If You Are Developing an AI Segmentation Device

This is FDA research, authored by reviewers and scientists in the agency's device center. It is not guidance, and it does not change any requirement on its own. But it is a clear signal of how the agency is thinking about segmentation performance, and that is worth paying attention to before your next pre-submission.

A few concrete implications for product and regulatory teams.

Interchangeability is becoming a credible validation framing. If you have struggled to justify a reference standard or to defend a fixed acceptance threshold, demonstrating that your device falls within the range of expert variability is an alternative worth evaluating with your statistician and your regulatory lead.

Plan your sample size across both cases and readers. Traditional segmentation studies focus on the number of images. This method draws power from both images and readers, so your study design and budgeting should account for recruiting an adequate expert panel, not just collecting enough cases.

Report more than one class of metric. The complementary behavior of Dice and Hausdorff Distance is a recurring theme in this work. Presenting both an overlap-based and a distance-based metric gives FDA a fuller picture and reduces the risk that a real performance gap hides behind a single favorable number.

Curate a strong expert panel. The framework assumes the readers are highly experienced experts who establish a high-performance benchmark. This is different from a typical multi-reader multi-case study where readers represent end users of varying experience. Document the qualifications of your panel accordingly.

Caveats Worth Noting

The authors are direct about the limitations. The current framework models reader effects as fixed rather than random, an approach that is reasonable for a small panel of experts but that constrains how broadly the results generalize. The validation to date is based on simulated 2D contours with a single region of interest per image, so real-world clinical performance across complex anatomy still needs to be demonstrated. And the method assumes the expert panel is functioning as a high-quality benchmark, which places weight on how you select and qualify your readers.

The Bigger Picture

Segmentation is one of the fastest-growing capabilities in AI-enabled medical imaging, and the field has long lacked a clean answer to the "compared to what?" problem. This line of work, from the 2025 Journal of Medical Imaging paper to the SPIE 2026 extension, is the FDA building out a principled, reference-free way to answer it. For developers, the takeaway is that the agency is actively developing the statistical machinery it will use to reason about these devices. The earlier you align your validation strategy with that thinking, the smoother your review is likely to be.

For more on FDA tools for evaluating imaging AI, see our related post on the MIDRC-MetricTree decision tool.

How Cosm Can Help

Cosm specializes in FDA regulatory and quality strategy for AI/ML-enabled medical devices and Software as a Medical Device. If you are designing a validation study for an image segmentation device, planning a pre-submission, or deciding how to frame your standalone performance claims, contact us or visit www.cosmhq.com to discuss how we can support your submission strategy.

Disclaimer - https://www.cosmhq.com/disclaimer