Systematic Feature Ablation and SHAP Interpretability Reveal a Four-Gene Transcriptomic Host-Response Signature for Mortality Prediction in Sepsis: A Two-Cohort Machine Learning Study
SHAP Ablation Reveals Four-Gene Sepsis Mortality Signature
DOI: https://doi.org/10.64949/zj0wmb67
Keywords: sepsis, transcriptomics, machine learning, XGBoost, SHAP, feature ablation, mortality, host response
Abstract
This article is a preprint and has not yet been peer-reviewed. Not for clinical use.
Background and Aims
Transcriptomic machine learning models for sepsis mortality prediction often incorporate clinical covariates and select features using statistical thresholds alone, limiting generalisability and biological interpretability. Systematic SHAP-guided feature ablation was applied to derive and externally validate a parsimonious transcriptomic mortality signature without clinical covariates. The resulting model was used to generate testable biological hypotheses.
Methods
An XGBoost classifier was trained on whole-blood transcriptomic data from the GSE65682 cohort (n = 479; 365 survivors, 114 non-survivors). A 50-gene candidate pool was identified by differential expression analysis (Benjamini-Hochberg correction) and screened by SHAP contribution analysis. Sequential feature ablation guided by cross-validated AUC and AUPRC was applied to identify the optimal feature set. The final model was externally validated in the independent GSE95233 cohort (n = 98; 68 survivors, 30 non-survivors) without retraining. Performance was assessed using ROC-AUC with bootstrapped 95% confidence intervals, area under the precision-recall curve (AUPRC), sensitivity, specificity, negative predictive value, F1 score, and decision curve analysis. This study adheres to the TRIPOD reporting guidelines for prediction model development and validation.
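As an illustration of the "ROC-AUC with bootstrapped 95% confidence intervals" step, the following is a minimal sketch of a percentile bootstrap. It is not the study's analysis code (which is not published here); the function names, the number of resamples, the seed, and the toy labels and scores are all assumptions made for the example.

```python
import random


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROC-AUC (hypothetical helper).

    Patients are resampled with replacement; resamples that contain
    only one class are redrawn because AUC is undefined for them.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return roc_auc(y_true, scores), lo, hi


# Toy data (assumed, not from either cohort): 1 = non-survivor.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
auc, ci_low, ci_high = bootstrap_auc_ci(y, p, n_boot=200)
print(auc, ci_low, ci_high)
```

With cohorts of this size (n = 479 and n = 98), the pairwise AUC above is fast enough; a rank-based implementation would be the usual choice for much larger samples.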
Results
Systematic ablation demonstrated that removing clinical covariates (age, sex) and three genes (CX3CR1, TGFB1, SPON2) progressively improved cross-validated AUC from 0.763 to 0.796, identifying a four-gene model (TUBG2, TRDC, CXCL8, ELANE) as the optimal configuration. The final model achieved a training AUC of 0.69 (95% CI 0.56-0.80) and an external validation AUC of 0.67 (95% CI 0.56-0.79), giving an AUC generalisation gap of 0.02. The validation AUPRC (0.46) exceeded the training AUPRC (0.43), and each exceeded the baseline prevalence of its own cohort (0.31 in validation, 0.24 in training). Decision curve analysis demonstrated positive net benefit above treat-none across probability thresholds from approximately 0.10 to 0.42 in both cohorts. SHAP directionality was consistent across cohorts: TUBG2 and CXCL8 were risk-promoting, whereas TRDC was protective. ELANE displayed a bimodal SHAP distribution replicated in both cohorts, consistent with a biologically distinct non-neutrophilic subgroup.
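Two quantities in these results can be checked or illustrated with simple arithmetic. The sketch below recomputes the baseline prevalences from the cohort counts given in the Methods and shows the standard net-benefit formula used in decision curve analysis. The function names and the toy predictions are assumptions for illustration; the study's own predictions are not reproduced here.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk is at or
    above `threshold`: TP/n - (FP/n) * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def treat_all_net_benefit(prevalence, threshold):
    """Net benefit of the treat-all strategy at a given threshold.
    Treat-none is the reference strategy with net benefit 0."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Baseline prevalence (the AUPRC of an uninformative classifier),
# from the survivor / non-survivor counts reported in the Methods.
train_prev = 114 / 479   # GSE65682
valid_prev = 30 / 98     # GSE95233
print(round(train_prev, 2), round(valid_prev, 2))  # 0.24 0.31

# Toy predictions (assumed): two true positives and one false
# positive are flagged at a threshold of 0.5.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7]
print(net_benefit(y, p, 0.5))  # 0.25
```

A model adds value over treat-none at a given threshold when its net benefit is positive, and over treat-all when it exceeds `treat_all_net_benefit` at the same threshold; the treat-all curve crosses zero where the threshold equals the cohort prevalence.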
Conclusion
SHAP-guided ablation produces a more generalisable transcriptomic model than threshold-based selection, with the removal of clinical covariates improving external performance rather than reducing it. The resulting four-gene signature identifies a reproducible host-response framework implicating cellular stress (TUBG2), immune surveillance failure (TRDC), and neutrophil activation (CXCL8, ELANE) as determinants of sepsis mortality. The bimodal ELANE distribution and dominant role of TUBG2 constitute two specific testable hypotheses for prospective experimental investigation.
Data Availability Statement
All gene expression data are publicly available through the NCBI Gene Expression Omnibus under accession numbers GSE65682 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65682) and GSE95233 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95233). Analysis code is available from the corresponding author upon reasonable request.
License
Copyright (c) 2026 Simbarashe G. Magwenzi

This work is licensed under a Creative Commons Attribution 4.0 International License.
