Statistical methods to evaluate disease outcome diagnostic accuracy of multiple biomarkers with application to HIV and TB research.

One challenge in clinical medicine is the correct diagnosis of disease. Medical researchers invest considerable time and effort in improving the accuracy of disease diagnosis, and diagnostic tests are consequently important components of modern medical practice. The receiver operating characteristic (ROC) curve is a statistical tool commonly used to describe the discriminatory accuracy and performance of a diagnostic test. A popular summary index of discriminatory accuracy is the area under the ROC curve (AUC). In medical research, scientists often evaluate hundreds of biomarkers simultaneously. A critical challenge is the combination of biomarkers into models that give insight into disease. In infectious disease, biomarkers are often evaluated in the host as well as in the microorganism or virus causing infection, adding more complexity to the analysis. In addition to providing an improved understanding of factors associated with infection and disease development, combinations of relevant markers are important to the diagnosis and treatment of disease. Taken together, this extends the role of the statistical analyst and presents many novel and major challenges. This thesis discusses some of the strategies and issues in using statistical data analysis to address the diagnosis problem of selecting and combining multiple markers to estimate the predictive accuracy of test results. We also consider different methodologies to address missing data and to improve predictive accuracy in the presence of incomplete data. The thesis is divided into five parts. The first part is an introduction to the theory behind the methods used in this work. The second part places emphasis on the so-called classic ROC analysis, which is applied to cross-sectional data.
The main aim of this part is to address the problem of how to select and combine multiple markers, and to evaluate the appropriateness of certain techniques used in estimating the area under the ROC curve (AUC). Logistic regression models offer a simple method for combining markers. We applied resampling methods to adjust for the over-fitting associated with model selection. We simulated several multivariate models to evaluate the performance of the resampling approaches in this setting. We applied these methods to data collected from a study of tuberculosis immune reconstitution inflammatory syndrome (TB-IRIS) in Cape Town, South Africa. Baseline levels of five biomarkers were evaluated, and we used this dataset to assess whether a combination of these biomarkers could accurately discriminate between TB-IRIS and non-TB-IRIS patients, by applying AUC analysis and resampling methods. The third part is concerned with time-dependent ROC analysis with an event-time outcome and a comparative analysis of techniques applied to incomplete covariates. Three different methods are assessed, namely mean imputation, nearest-neighbor hot-deck imputation and multivariate imputation by chained equations (MICE). These methods were used together with the bootstrap and cross-validation to estimate the time-dependent AUC using a non-parametric approach and a Cox model. We simulated several models to evaluate the performance of the resampling approaches and imputation methods. We applied the above methods to a real data set. The fourth part is concerned with applying more advanced variable selection methods to predict the survival of patients using time-dependent ROC analysis. The least absolute shrinkage and selection operator (LASSO) Cox model is applied to estimate the bootstrap cross-validated, .632 and .632+ bootstrap AUCs for a TBM/HIV data set from KwaZulu-Natal in South Africa.
We also suggest the use of ridge-Cox regression to estimate the AUC and two-level bootstrapping to estimate the variance of the AUC, and we evaluate these suggested methods. The last part of the research is an application study using genetic HIV data from rural KwaZulu-Natal to evaluate the sequence of ambiguities as a biomarker to predict recent infection in HIV patients.
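
To make the AUC mentioned in the abstract concrete, the following is a minimal sketch, not the thesis's own code. It computes the empirical AUC of a single marker as the Mann-Whitney statistic (the proportion of case-control pairs in which the case has the higher marker value, with ties counted as one half) and draws simple bootstrap replicates by resampling cases and controls separately. The function names, the marker values and the number of replicates are all hypothetical and chosen only for illustration.

```python
import random


def empirical_auc(cases, controls):
    """Mann-Whitney estimate of the AUC: P(case > control) + 0.5 * P(tie)."""
    score = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                score += 1.0
            elif x == y:
                score += 0.5
    return score / (len(cases) * len(controls))


def bootstrap_auc(cases, controls, n_boot=500, seed=0):
    """Return AUC replicates from resampling cases and controls with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    replicates = []
    for _ in range(n_boot):
        boot_cases = [rng.choice(cases) for _ in cases]
        boot_controls = [rng.choice(controls) for _ in controls]
        replicates.append(empirical_auc(boot_cases, boot_controls))
    return replicates


# Hypothetical marker values (illustrative only, not data from the thesis)
cases = [2.1, 3.4, 2.8, 4.0, 3.1]
controls = [1.0, 2.5, 1.8, 2.9, 1.2]

auc = empirical_auc(cases, controls)
replicates = sorted(bootstrap_auc(cases, controls))

# Percentile interval from the sorted bootstrap replicates
lower = replicates[int(0.025 * len(replicates))]
upper = replicates[int(0.975 * len(replicates)) - 1]
print(f"AUC = {auc:.2f}, bootstrap 95% interval = ({lower:.2f}, {upper:.2f})")
```

This sketch covers only the apparent AUC of one marker and its sampling variability. The over-fitting adjustments described in the abstract (cross-validation and the .632 and .632+ bootstrap) additionally require refitting the marker-combination model within each resample and evaluating it on observations left out of that resample.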


Doctor of Philosophy in Statistics. University of KwaZulu-Natal, Pietermaritzburg 2015.


Tuberculosis -- Diagnosis., HIV antibodies -- Diagnosis., Biochemical markers., Diagnostic imaging -- Patients., Diagnostic errors -- Statistics., Patients -- Medical examinations -- Statistics., Theses -- Statistics.