World-first platform sets new standard for fair and transparent AI evaluation in diabetic eye screening
A UK-led research team has developed the first real-world, head-to-head testing platform capable of independently evaluating commercial AI systems for diabetic eye disease — offering what experts describe as a major breakthrough in ensuring safe, equitable, and trustworthy deployment of AI in healthcare.
Published in The Lancet Digital Health, the study demonstrates how the new platform can rigorously compare commercial AI algorithms under identical conditions, free from vendor influence. The goal, researchers say, is to place all companies on a level playing field while giving the NHS a realistic view of how these tools perform across diverse patient populations.
Although the NHS already considers cost-effectiveness and accuracy when selecting AI tools for the UK healthcare system, major gaps remain — particularly around algorithmic fairness, digital infrastructure, and real-world validation. Until now, commercial medical AI systems have rarely been tested at scale across different ethnicities and socioeconomic groups, despite mounting evidence that algorithms can reproduce or magnify existing health inequities.
Led by Professor Alicja Rudnicka (City St George’s, University of London) and Adnan Tufail (Moorfields Eye Hospital), the team partnered with Kingston University and Homerton Healthcare NHS Trust to build a secure research environment for assessment. Twenty-five companies with CE-marked AI systems were invited to participate, with eight accepting.
Each algorithm was run on 1.2 million retinal images from the North East London Diabetic Eye Screening Programme — one of the NHS’s largest and most ethnically diverse cohorts. Human graders, using the current NHS protocol, served as the reference standard. Crucially, vendors had no access to patient data or ground-truth grading.
Across 202,886 screening visits, algorithm performance ranged from 83.7% to 98.7% for detecting diabetic eye disease potentially requiring clinical intervention. Accuracy exceeded 96% for moderate-to-severe disease and 95% for proliferative disease, matching or surpassing previously published human grading benchmarks. Importantly, the algorithms performed consistently across ethnic groups — a first in large-scale UK evaluation.
Processing times varied dramatically: AI systems delivered results in 240 milliseconds to 45 seconds per patient, compared to up to 20 minutes for trained human graders.
Professor Rudnicka commented that the platform “delivers the world’s first fair, equitable and transparent evaluation of AI systems to detect sight-threatening diabetic eye disease,” adding that its depth of scrutiny “is far higher than that ever given to human performance.”
Co-principal investigator Adnan Tufail added: “There are more than 4 million patients with diabetes in the UK who need regular eye checks. This groundbreaking study sets a new benchmark by rigorously testing AI systems to detect sight threatening diabetic eye disease before potential mass rollout. The approach we have developed paves the way for safer, smarter AI adoption across many healthcare applications.”
The team are now aiming for national-scale deployment of the platform, with approved algorithms centrally hosted and screening centres submitting images for rapid AI analysis integrated directly into patient records. They hope the model will serve as a blueprint for evaluating AI across other chronic diseases, helping to build patient trust and accelerate safe, equitable adoption of AI in routine care across the UK.