Title: Application of machine-learning algorithms to identify the key determinants of risk for HIV, hepatitis C and hepatitis B in primary care settings
Abstract:
Background: Testing for blood-borne viruses (BBVs) such as the human immunodeficiency virus (HIV), hepatitis C virus (HCV) and hepatitis B virus (HBV) is generally focused on specialist settings; however, people with undiagnosed infections are also present within the general population. We explore whether machine-learning algorithms (MLAs) can identify people at heightened risk of individual or multiple BBVs in primary care settings.
Methods: From de-identified electronic health record data from 165 general practices in North East London, we extracted risk factors for HIV, HCV and HBV and used them to train (75% of the data) and test (25% of the data) three MLAs: Logistic Regression (LR), AdaBoost with random undersampling (RUSBoost) and the Balanced Random Forest classifier (BRFC). Model performance was quantified using ROC curves, ROC AUC, sensitivity and specificity. The key features for individual and multiple BBV positivity were identified across the models.
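As an illustration of the performance metrics named above, the following is a minimal sketch of how sensitivity, specificity and ROC AUC can be computed from true labels and model scores. It is not the study's code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for the example, and the AUC uses the standard pairwise-ranking formulation rather than any particular library.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: share of true positives detected; specificity: share of true negatives cleared.
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy, invented data (NOT from the study): an imbalanced sample with few positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.5, 0.9]

# Assumed decision threshold of 0.5; the choice of threshold trades sensitivity for specificity.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} auc={auc:.3f}")
```

Because sensitivity and specificity depend on the chosen threshold while ROC AUC does not, the three metrics can rank the same set of models differently.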
Results: A total of 1,987,954 patients were included in the study, from whose records 75 predictive features were selected for HIV, 24 for HCV, 37 for HBV and 88 for all three BBVs. The optimal model for classifying positivity for each individual BBV depended on the accuracy metric. As a single infection, HCV was predicted most accurately across models and accuracy metrics. When targeting multiple BBVs, LR had the highest AUC value, BRFC was the most sensitive model and RUSBoost was the most specific model. The key features identified were similar across models, with age being the key predictive feature for both individual and combined BBV positivity. A number of features were important for two of the BBV-positive groups: Black African ethnicity (HIV and HBV), liver disease (HBV and HCV) and opiate and cocaine use (HBV and HCV). A number of other features were each important for positivity for a single BBV.
Conclusions: Our findings illustrate that combining digital technology with routinely available general practice data shows promise for improving case-finding for targeted BBV testing. There are, however, challenges in identifying the optimal MLAs and accuracy metrics for predicting multiple HIV/HCV/HBV positivity. This underscores the importance of evaluating different models and applying a broad set of accuracy criteria when utilising digital technology for precision medicine.
Keywords: HIV; HCV; HBV; Machine-learning algorithms; Prediction of blood-borne viruses diagnosis.

