ScienceSoft’s Head of AI Releases Preprint Comparing LLMs and Purpose-Built SLMs for ED Triage
Vadim Belski, ScienceSoft’s Head of AI, has released a research preprint evaluating small language model (SLM) pipelines for Emergency Department (ED) triage support. The paper compares commercial large language models (LLMs) with purpose-built clinical AI pipelines designed to assist with Emergency Severity Index (ESI) classification, a time-sensitive process in which triage nurses assign each patient one of five priority levels within minutes of arrival.
The preprint, "Solving Emergency Department Triage with Small Language Models: Why Large Commercial Models Fail and How Specialized Training Achieves Clinical Accuracy," investigates whether specialized architectures can provide more accurate, traceable, and deployment-ready triage support than off-the-shelf large language models. On ScienceSoft’s team, Vadim Belski led data curation, model training, evaluation, and error analysis. Kate Lukina, MD and business analyst, validated the ESI logic, reviewed model errors for clinical plausibility, and advised on the risks of under-triage and over-triage.
The paper evaluates three technical approaches to automating ESI classification:
- Off-the-shelf large language models, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and MedGemma.
- A small language model pipeline based on BiomedBERT, where the model extracts clinical facts, and a deterministic ESI v4 rules engine applies the triage logic.
- A fine-tuned Qwen3.5-9B model, a smaller open-weight LLM trained for the triage task and tested as an alternative to commercial frontier LLMs.
Unlike general-purpose LLMs that generate free-form answers, the BiomedBERT pipeline separates clinical fact extraction from ESI decision logic. The model extracts features such as symptoms, expected resource needs, clinical flags, arrival mode, and pain score, then passes them to a deterministic ESI decision engine. This makes the recommendation easier to audit: clinicians can review which ESI logic was triggered and which extracted features informed the result. The BiomedBERT + ESI engine pipeline is also designed for on-premises deployment, with inference latency under 50 ms and no external API fees after launch. This architecture may be relevant for healthcare organizations that need low-latency clinical decision support while maintaining tighter control over sensitive health data.
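The separation between extraction and decision logic can be illustrated with a minimal Python sketch. It is not the engine described in the preprint: the field names, the `assign_esi` function, and the pain threshold of 7 are hypothetical, and the rules are a simplified rendering of the public ESI decision points, not suitable for clinical use. The point is only to show how a deterministic engine can return both a level and the list of rules that fired.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExtractedFeatures:
    """Hypothetical output of the extraction model (all field names are assumptions)."""
    needs_lifesaving_intervention: bool = False
    high_risk_flags: List[str] = field(default_factory=list)
    pain_score: int = 0            # 0-10 self-reported scale
    expected_resources: int = 0    # number of distinct resource types expected
    danger_zone_vitals: bool = False


def assign_esi(f: ExtractedFeatures) -> Tuple[int, List[str]]:
    """Return (ESI level, audit trail) using simplified, illustrative rules."""
    trail: List[str] = []

    # Decision point A: immediate life-saving intervention.
    if f.needs_lifesaving_intervention:
        trail.append("A: life-saving intervention required")
        return 1, trail

    # Decision point B: high-risk situation or severe pain.
    if f.high_risk_flags:
        trail.append("B: high-risk flags: " + ", ".join(f.high_risk_flags))
        return 2, trail
    if f.pain_score >= 7:  # assumed threshold, for illustration only
        trail.append(f"B: severe pain (score {f.pain_score})")
        return 2, trail

    # Decision point C: expected resource needs.
    trail.append(f"C: {f.expected_resources} expected resource(s)")
    if f.expected_resources == 0:
        return 5, trail
    if f.expected_resources == 1:
        return 4, trail

    # Decision point D: vital signs for patients needing many resources.
    if f.danger_zone_vitals:
        trail.append("D: danger-zone vital signs")
        return 2, trail
    trail.append("D: vital signs within limits")
    return 3, trail


if __name__ == "__main__":
    level, trail = assign_esi(
        ExtractedFeatures(expected_resources=2, danger_zone_vitals=True)
    )
    print(level, trail)
```

Because the engine is a plain function of the extracted features, the same input always produces the same level, and the returned trail gives a reviewer the rule path behind each recommendation.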
For narrative triage cases, the paper uses the fine-tuned Qwen3.5-9B model to test whether a smaller, domain-trained LLM can reason through more complex clinical notes using structured chain-of-thought supervision.
According to the preprint, general-purpose and medical LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and MedGemma, achieved 45–55% exact ESI accuracy on the evaluation task. The BiomedBERT + ESI engine pipeline achieved 88.9% exact accuracy and 97.2% adjacent accuracy on a preliminary 50-case expert-labeled evaluation set, which the paper describes as approaching nurse inter-rater agreement. The fine-tuned Qwen3.5-9B model achieved 75.0% exact and 97.2% adjacent accuracy on a 36-case narrative evaluation. The paper also reports ongoing GRPO training with a clinically asymmetric reward function and 2,776 ESI-1 narrative training cases.
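For readers unfamiliar with the metrics, the following Python sketch shows one common way to compute them. It assumes that "adjacent accuracy" means a prediction within one ESI level of the expert label, which is the usual convention but is not spelled out in this release. The `asymmetric_reward` function and its penalty weights are purely hypothetical. They illustrate what a clinically asymmetric reward could look like (under-triage penalized more heavily than over-triage) and are not the reward used in the paper.

```python
from typing import Sequence


def exact_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Share of cases where the predicted ESI level equals the expert label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def adjacent_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Share of cases where the prediction is within one ESI level of the label."""
    return sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / len(gold)


def asymmetric_reward(pred: int, gold: int,
                      under_penalty: float = 2.0,
                      over_penalty: float = 1.0) -> float:
    """Hypothetical reward: ESI 1 is most urgent, so pred > gold is under-triage."""
    diff = pred - gold
    if diff == 0:
        return 1.0
    if diff > 0:
        # Under-triage: patient rated less urgent than the expert label.
        return -under_penalty * diff
    # Over-triage: patient rated more urgent than the expert label.
    return -over_penalty * (-diff)


if __name__ == "__main__":
    gold = [1, 2, 4, 2]
    pred = [1, 2, 3, 4]
    print(exact_accuracy(pred, gold), adjacent_accuracy(pred, gold))
    print(asymmetric_reward(3, 2), asymmetric_reward(2, 3))
```

As a check on the arithmetic, 27 exact matches out of 36 cases gives 75.0%, and 35 within-one-level predictions out of 36 gives 97.2%, which is consistent with the Qwen3.5-9B figures reported for the 36-case narrative set.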
Beyond benchmark results, the preprint documents 37+ BERT experiments, multiple LLM training cycles, systematic data quality audits, and engineering decisions that shaped the final approach. One audit found that 71% of altered mental status training labels were false positives, underscoring the importance of data quality controls in medical AI development.
To support reproducibility, the paper links to publicly available models and interactive demos on Hugging Face, including the BiomedBERT triage model, a BERT pipeline demo, a Qwen3.5-9B DPO model, and an LLM triage demo.
The preprint is publicly available on ResearchGate and has also been submitted to arXiv. As a preprint, the findings are preliminary and have not yet undergone peer review.
About ScienceSoft
ScienceSoft is an AI transformation and software engineering company with more than 150 successful healthcare IT projects. The company has worked in the medical IT field since 2005 and has built hands-on expertise in clinical AI, healthcare data engineering, and HIPAA-sensitive software architectures. With ISO 9001, ISO 27001, and ISO 13485 certifications, ScienceSoft helps healthcare organizations and healthtech companies design, build, and modernize secure software systems for diagnostics, care delivery, operations, administration, and patient engagement.
Media Contact
ScienceSoft’s AI and healthcare software experts are available for comment on clinical AI architectures, healthcare data engineering, healthcare data quality, and HIPAA-sensitive AI deployment. For interviews, expert comments, or opinion pieces, please follow the link.