How Accurate is Your AI?
Surgeons have always approached new technology with healthy skepticism. From image-guided sinus surgery to transoral robotic surgery, adoption has historically followed a familiar pattern: review the evidence, assess accuracy and safety, and determine whether outcomes justify integration into daily practice. Artificial intelligence (AI) should be no different.
Yet, unlike procedural innovations, many AI tools are entering surgical practices quietly—embedded within electronic health records (EHRs), referral workflows, and scheduling systems—often without specialty-specific validation. As AI increasingly influences how patients are triaged, scheduled, and routed, surgeons must understand not only what these tools do, but how well they do it.
Accuracy Is Not a Binary Concept
In surgery, we evaluate technology using defined performance metrics: sensitivity, specificity, complication rates, and learning curves. AI systems should be evaluated with similar rigor. Accuracy alone is insufficient; false positives and false negatives have very different operational and clinical consequences.
Recent validation work examining AI-driven referral detection in surgery provides a useful framework. We recently conducted a controlled comparison, using synthetic fax data, of an AI-powered fax-sorting solution released by ModMed. The study compared human review, the commercial EHR referral-labeling tool, and a custom domain-trained AI model. Results showed meaningful differences that mirror challenges surgeons already recognize in clinical decision-making.
Human reviewers—often experienced staff performing first-pass triage—achieved approximately 91% overall accuracy but missed roughly 1 in 5 true referrals, typically when language was subtle or embedded in nonstandard formats. This error rate is familiar to any surgeon who has seen delayed, missed, or misrouted referrals despite diligent staff effort.
By contrast, the EHR-integrated AI tool demonstrated perfect sensitivity—it did not miss a single true referral. However, this came at the cost of specificity: nearly 30% of non-referral documents were incorrectly labeled as referrals. Operationally, this translates into increased staff workload, unnecessary follow-up, and the illusion of automation without meaningful efficiency gains.
A third approach—a custom AI model built for surgical referral data—initially performed similarly to the generalized tools. After training on specialty-specific examples, however, it achieved 100% accuracy, eliminating both missed referrals and false positives. The lesson is clear: AI performance is not inherent to the technology itself; it depends on how—and on what—the model is trained.
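The study's actual evaluation pipeline is not reproduced here, but the structure of such a head-to-head comparison is straightforward to sketch: each approach's labels for the same set of documents are scored against a gold standard established by expert review. The snippet below is a minimal illustration in Python using the scikit-learn library; the document counts, labels, and predictions are made-up placeholders standing in for the synthetic fax data, not the study's results.

```python
# Minimal sketch (illustrative only): scoring several referral-detection
# approaches against the same gold-standard labels.
# Labels: 1 = true referral, 0 = non-referral document.
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical gold-standard labels for ten synthetic fax documents.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Hypothetical outputs from each approach (not the study's actual data).
predictions = {
    "human_review":      [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],  # one missed referral
    "ehr_labeling_tool": [1, 1, 1, 1, 0, 1, 1, 0, 1, 0],  # two false positives
    "custom_model":      [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],  # matches gold standard
}

for name, pred in predictions.items():
    tn, fp, fn, tp = confusion_matrix(gold, pred).ravel()
    print(f"{name}: accuracy={accuracy_score(gold, pred):.2f}, "
          f"missed referrals={fn}, false positives={fp}")
```

The point of the sketch is that a single accuracy figure hides the asymmetry between the two error types; the confusion matrix makes missed referrals and false positives visible separately.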
The Surgical Parallel: Generalized vs. Specialty-Specific Tools
Surgeons intuitively understand that tools designed for general use often underperform in specialized settings. A generic retractor does not replace a custom-designed instrument, and a surgical robot does not eliminate the need for procedure-specific training. AI is no different.
Generalized AI systems embedded within EHRs are trained on heterogeneous clinical data across multiple specialties. While this enables broad deployment, it also increases the risk of contextual misinterpretation. In referral management, this may mean over-triaging benign communications or misclassifying nuanced referral language; these errors compound at scale.
Importantly, these inaccuracies are not benign. High false-positive rates increase clerical burden, undermine staff trust in automation, and may worsen burnout, an issue already well documented among physicians and staff navigating EHR-heavy workflows. Conversely, false negatives risk delayed care and lost patients, particularly in high-demand procedural subspecialties.
| Model | Precision | Recall | F1 Score | Cohen’s κ |
|---|---|---|---|---|
| Human Reviewer | 1.00 | 0.80 | 0.89 | 0.82 |
| ModMed EHR | 0.61 | 1.00 | 0.76 | 0.43 |
| Untrained Custom AI | 0.71 | 1.00 | 0.83 | 0.59 |
| Trained Custom AI | 1.00 | 1.00 | 1.00 | 1.00 |
Surgeons evaluating AI tools should apply the same performance standards used for diagnostic tests and procedural technologies. Four metrics from a recent otolaryngology referral-classification study, summarized in the table above, help explain why accuracy alone is insufficient.
Precision reflects how often the AI is correct when it acts. The commercial EHR algorithm demonstrated a precision of 0.61, meaning nearly 40% of documents flagged as referrals were false positives, increasing downstream staff workload. In contrast, the trained custom AI achieved 100% precision, eliminating unnecessary triage.
Recall measures sensitivity—how often true referrals are captured. The EHR algorithm and both custom AI models achieved 100% recall, whereas human review missed approximately 20% of referrals on first pass. As in surgical screening, high sensitivity is essential—but not at the expense of excessive false positives.
The F1 score balances precision and recall, providing a single measure of reliability. The EHR algorithm’s F1 score (0.76) was limited by poor specificity, while the trained custom AI achieved a perfect 1.00, reflecting balanced, dependable performance.
Cohen’s kappa assesses agreement beyond chance. Despite perfect recall, the EHR algorithm demonstrated only moderate agreement (κ = 0.43) with expert review, whereas the trained custom AI achieved perfect concordance (κ = 1.00).
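For readers who want to compute these four metrics on their own practice's documents, the sketch below shows how precision, recall, F1, and Cohen's kappa can be derived from a set of manually verified labels and a tool's output. It uses the scikit-learn library, and the label arrays are placeholders rather than the study's data.

```python
# Minimal sketch: computing the four metrics discussed above for any
# referral-detection tool, given manually verified labels and the tool's output.
# Labels: 1 = true referral, 0 = non-referral document.
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# Placeholder arrays -- replace with labels from your own audited documents.
gold = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # ground truth from manual review
pred = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]   # what the tool flagged

print(f"Precision:     {precision_score(gold, pred):.2f}")  # correct when it flags
print(f"Recall:        {recall_score(gold, pred):.2f}")     # true referrals captured
print(f"F1 score:      {f1_score(gold, pred):.2f}")         # balance of the two
print(f"Cohen's kappa: {cohen_kappa_score(gold, pred):.2f}")  # agreement beyond chance
```

Reporting all four numbers together, rather than a single accuracy figure, is what exposes the trade-off described above between perfect sensitivity and excessive false positives.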
Clinical implication: AI systems embedded in EHRs may appear effective based on sensitivity alone, yet still generate inefficiencies and staff burden. As with surgical technology, AI adoption should be guided by validated, specialty-specific performance metrics—not availability or vendor inclusion.
Adoption Should Follow Evidence, Not Availability
The rapid availability of AI tools—particularly those bundled by EHR vendors—makes it easy to adopt them without validation. But surgeons would never implant a device or adopt a technique solely because it was opened on the back table. AI deserves the same scrutiny.
Key questions surgeons should ask include:
- Has this AI tool been validated using specialty-specific data?
- What are its false-positive and false-negative rates? (A simple local audit, sketched after this list, can estimate both.)
- How does it perform compared with human review in real workflows?
- Does it reduce meaningful work, or simply redistribute it?
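A practice does not need a formal study to answer the false-positive and false-negative question: staff can manually label a sample of recent documents, record what the tool flagged, and tally the disagreements. The arithmetic below is a minimal sketch; all counts are made-up placeholders, not data from the study above.

```python
# Minimal sketch: estimating false-positive and false-negative rates from a
# local audit (all counts below are made-up placeholders).
true_referrals = 80        # documents staff confirmed as real referrals
non_referrals = 320        # documents staff confirmed as non-referrals
missed_by_tool = 4         # real referrals the tool failed to flag
wrongly_flagged = 60       # non-referrals the tool flagged as referrals

false_negative_rate = missed_by_tool / true_referrals   # referrals at risk of delay
false_positive_rate = wrongly_flagged / non_referrals   # extra documents staff must re-review

print(f"False-negative rate: {false_negative_rate:.0%}")
print(f"False-positive rate: {false_positive_rate:.0%}")
```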
Without answers grounded in data, AI risks becoming another layer of complexity rather than a solution.
A Scientific Path Forward
AI holds real promise for surgery—not by replacing clinical judgment, but by improving front-end workflows that determine access to care. When rigorously trained and validated, AI can reduce administrative burden, accelerate scheduling, and allow staff and surgeons to focus on higher-value clinical interactions.
The path forward should mirror how surgeons adopt any new technology: evaluate the evidence, understand limitations, and measure outcomes after implementation. AI should not be used for AI’s sake, nor adopted simply because it arrives pre-installed in an EHR. Instead, it should be deployed selectively, scientifically, and transparently—only when it demonstrably improves the lives of surgeons, the teams they work with, and the patients they treat.
In surgery, as in life, precision matters. AI is no exception.