AI Stone Predictions Work—Until You Pool Them

Can AI and Predictive Models Accurately Predict Stone-Free Status? A Systematic Review and Meta-Analysis

Only 5 retrospective cohorts met inclusion criteria; heterogeneity was so extreme the authors declared their own pooled estimates "clinically uninterpretable"

Journal: The Canadian Journal of Urology | Published: 2026-04-15 | Type: Systematic Review, Meta-Analysis | PMID: 42086349
Authors: Ghazwani Y et al. (King Saud bin Abdulaziz University for Health Sciences / Ministry of National Guard Health Affairs, Saudi Arabia)
Funding/COI: Funding not listed; authors declare no conflicts of interest

Summary

This systematic review searched six databases through September 2025 to assess whether AI and predictive models can reliably forecast stone-free status (SFS) after ureteroscopy. Five retrospective cohorts made the cut, covering approaches from logistic regression to gradient boosting and radiomics ensembles. Individual models showed acceptable-to-excellent discrimination, but heterogeneity across studies was so severe the authors concluded their own pooled estimates are "clinically uninterpretable."

Claims

Pooled RR for stone size per 1 mm increase: 1.26 (95% CI 0.91–1.76; I² = 94.6%; prediction interval 0.03–49.45) — not statistically significant
Pooled RR for moderate-severe vs. mild/no hydronephrosis: 2.72 (95% CI 0.96–7.72; I² = 96.9%; prediction interval 0.03–249.87), not statistically significant; the prediction interval spans nearly four orders of magnitude and is compatible with anything from a strongly protective to a strongly harmful effect
Stone size as continuous variable: SMD 1.36 (95% CI 0.85–1.86; I² = 72.9%; prediction interval −3.77 to 6.48) — larger stones in the non-stone-free group
Stone density (Hounsfield units): SMD 0.64 (95% CI 0.39–0.90; I² = 0%; prediction interval −0.99 to 2.27) — the only predictor with low heterogeneity
Models integrating radiomics with anatomic and clinical features showed highest performance per narrative synthesis
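The "not statistically significant" labels on the two pooled RRs can be sanity-checked from the reported numbers alone. The following is a minimal sketch, assuming the 95% CIs are normal-based (Wald) intervals on the log scale; the review does not say how the intervals were computed, so a different method (for example a t-based adjustment) would shift the results. The function name `wald_check` is made up for illustration.

```python
import math

def wald_check(rr, lo, hi, z_crit=1.959964):
    """Back out the log-scale SE from a 95% CI; return (se, z, two-sided p)."""
    se = (math.log(hi) - math.log(lo)) / (2 * z_crit)
    z = math.log(rr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under a normal approximation
    return se, z, p

# Point estimates and CIs exactly as reported in the Claims list.
for label, rr, lo, hi in [
    ("stone size per 1 mm", 1.26, 0.91, 1.76),
    ("moderate-severe hydronephrosis", 2.72, 0.96, 7.72),
]:
    se, z, p = wald_check(rr, lo, hi)
    print(f"{label}: SE(log RR)={se:.3f}, z={z:.2f}, p={p:.2f}")
```

Under that assumption both p-values land above 0.05, consistent with both CIs crossing 1.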

Study Quality

Five retrospective cohorts is a thin foundation for a meta-analysis. The authors applied QUADAS-AI for risk-of-bias assessment and used dual independent screening and extraction — appropriate. SFS definitions varied substantially: from <2 mm residual fragments at day 1 to ≤5 mm at one month, assessed by plain radiography, ultrasound, and/or CT. That's not a minor technical difference; it means the outcome being predicted is not the same thing across studies.

The heterogeneity statistics are the paper's main finding. I² values of 94.6% and 96.9% for the two primary binary outcomes mean that nearly all of the observed variation in effect sizes reflects real differences between studies rather than sampling error. With prediction intervals spanning three to four orders of magnitude, the pooled RRs say almost nothing about what a new study or setting would show. The authors explicitly say so rather than papering over it, which is the right call.
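How wide are those prediction intervals in practice? A quick sketch of the arithmetic, using only the reported bounds: the width of a ratio-scale interval is naturally expressed as log10 of upper over lower. The helper name `orders_of_magnitude` is illustrative, and because the lower bound of 0.03 is rounded to one significant figure, the results are approximate.

```python
import math

def orders_of_magnitude(lo, hi):
    """log10 of the ratio between the upper and lower bounds of an interval."""
    return math.log10(hi / lo)

# Prediction intervals as reported for the two pooled RRs.
intervals = {
    "stone size RR": (0.03, 49.45),
    "hydronephrosis RR": (0.03, 249.87),
}
for name, (lo, hi) in intervals.items():
    print(f"{name}: about {orders_of_magnitude(lo, hi):.1f} orders of magnitude")
```

That works out to a little over three orders of magnitude for stone size and just under four for hydronephrosis.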

Red Flags

Only 5 included studies — too few for stable meta-analytic estimates
I² = 96.9% for the hydronephrosis analysis; these studies are not interchangeable
Prediction intervals for stone size (0.03–49.45) and hydronephrosis (0.03–249.87) are too wide to carry any clinical meaning
All included studies are retrospective — no prospective validation cohorts
SFS definitions are inconsistent across studies on both threshold and timing
Funding source for the review not disclosed
No external validation reported for any included model

Strengths

Six-database search through September 2025
QUADAS-AI applied — appropriate bias tool for AI/predictive model studies
Dual independent screening and data extraction
Authors explicitly state pooled estimates are "clinically uninterpretable" — unusually honest for a meta-analysis
Hounsfield units (stone density) emerged as a low-heterogeneity predictor (I² = 0%), the one consistent signal in the dataset
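One caveat on the stone density signal: I² = 0% does not by itself produce a narrow prediction interval, and the reported one (−0.99 to 2.27) still crosses zero. A rough sketch of why, assuming the commonly used formula mu ± t(k−2) × sqrt(tau² + SE²), a normal-based CI for backing out the SE, and a hypothetical number of contributing studies k. The per-analysis study count is not given here, so every k below is an assumption, as is the function name `prediction_interval`.

```python
import math

# Two-sided 95% critical values of Student's t for small df (standard table values).
T_975 = {1: 12.706, 2: 4.303, 3: 3.182}

def prediction_interval(mu, ci_lo, ci_hi, k, tau2=0.0):
    """Approximate 95% prediction interval: mu +/- t_{k-2} * sqrt(tau2 + SE^2).

    SE is recovered from the reported 95% CI, assuming a normal-based interval.
    """
    se = (ci_hi - ci_lo) / (2 * 1.959964)
    half_width = T_975[k - 2] * math.sqrt(tau2 + se ** 2)
    return mu - half_width, mu + half_width

# k is HYPOTHETICAL; tau2 = 0 corresponds to the reported I² = 0%.
for k in (3, 4, 5):
    lo, hi = prediction_interval(0.64, 0.39, 0.90, k=k)
    print(f"k={k}: {lo:.2f} to {hi:.2f}")
```

Under these assumptions k = 3 gives an interval close to the reported one, which would suggest its width reflects how few studies contributed rather than disagreement among them. That is an inference from the arithmetic, not something the review states.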

Verdict

A meta-analysis that calls its own pooled results clinically uninterpretable is doing something right. The underlying individual models may have genuine predictive value — individual studies report acceptable-to-excellent discrimination — but five retrospective cohorts with incompatible outcome definitions do not support meaningful pooling. The paper's real contribution is cataloguing what a valid evidence base would require: standardized SFS definitions, consistent imaging protocols, and prospective validation. Stone density (HU) stands out as the one predictor that holds across studies. Read this for the honest methodology, not for actionable estimates.