Statistical methods to evaluate disease outcome diagnostic accuracy of multiple biomarkers with application to HIV and TB research.
Date
2015
Abstract
One challenge in clinical medicine is the correct diagnosis of disease. Medical researchers
invest considerable time and effort in improving the accuracy of disease diagnosis, and
diagnostic tests are therefore important components of modern medical practice. The receiver
operating characteristic (ROC) curve is a statistical tool commonly used to describe the
discriminatory accuracy and performance of a diagnostic test. A popular summary index of
discriminatory accuracy is the area under the ROC curve (AUC). In medical research, scientists
often evaluate hundreds of biomarkers simultaneously. A critical challenge is the combination
of biomarkers into models that give insight into disease. In infectious disease, biomarkers
are often evaluated in the host as well as in the microorganism or virus causing infection,
adding more complexity to the analysis. In addition to providing an improved understanding of
factors associated with infection and disease development, combinations of relevant markers
are important to the diagnosis and treatment of disease. Taken together, this extends the role
of the statistical analyst and presents many novel and major challenges. This thesis discusses
some of the strategies and issues in using statistical data analysis to address the diagnosis
problem of selecting and combining multiple markers to estimate the predictive accuracy of
test results. We also consider different methodologies to address missing data and to improve
the predictive accuracy in the presence of incomplete data.
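As a point of reference for the AUC mentioned above, the following is a minimal sketch of the empirical (nonparametric) AUC for a single marker, computed as the proportion of case-control pairs in which the case has the higher marker value, with ties counted as one half. The function name `empirical_auc` and the example values are illustrative assumptions and are not taken from the thesis.

```python
def empirical_auc(cases, controls):
    """Empirical AUC: P(case marker > control marker) + 0.5 * P(tie).

    `cases` and `controls` are lists of marker values for diseased and
    non-diseased subjects (hypothetical inputs for illustration).
    """
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0   # case ranked above control
            elif x == y:
                wins += 0.5   # tie counts as half
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    # Made-up marker values, only to show the call.
    print(empirical_auc([3, 4, 5], [1, 2, 3]))
```

An AUC of 0.5 corresponds to a marker that does no better than chance, and an AUC of 1 to perfect separation of cases and controls.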
The thesis is divided into five parts. The first part is an introduction to the theory behind
the methods that we used in this work. The second part places emphasis on the so-called
classic ROC analysis, which is applied to cross-sectional data. The main aim of this part
is to address the problem of how to select and combine multiple markers and evaluate
the appropriateness of certain techniques used in estimating the area under the ROC curve
(AUC). Logistic regression models offer a simple method for combining markers. We applied
resampling methods to adjust for over-fitting associated with model selection. We simulated
several multivariate models to evaluate the performance of the resampling approaches in this
setting. We applied these methods to data collected from a study of tuberculosis immune
reconstitution inflammatory syndrome (TB-IRIS) in Cape Town, South Africa. Baseline levels
of five biomarkers were evaluated and we used this dataset to evaluate whether a combination
of these biomarkers could accurately discriminate between TB-IRIS and non-TB-IRIS patients,
by applying AUC analysis and resampling methods.
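One common resampling adjustment for over-fitting is the bootstrap optimism correction, sketched below for a logistic regression combination of markers. This is an illustration under stated assumptions, not the procedure used in the thesis: the data are simulated, the logistic model is fitted by plain gradient ascent, and all function names (`simulate`, `fit_logistic`, `optimism_corrected_auc`) are hypothetical.

```python
import math
import random


def empirical_auc(scores, labels):
    """Proportion of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def fit_logistic(X, y, steps=200, lr=0.1):
    """Gradient ascent on the mean log-likelihood; returns [intercept, b1, ...]."""
    beta = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            eta = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            eta = max(-30.0, min(30.0, eta))      # guard against overflow
            resid = yi - 1.0 / (1.0 + math.exp(-eta))
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def linear_score(beta, X):
    """Combined marker score: the fitted linear predictor."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], xi)) for xi in X]


def optimism_corrected_auc(X, y, n_boot=20, seed=0):
    """Return (apparent AUC, apparent AUC minus mean bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(X)
    apparent = empirical_auc(linear_score(fit_logistic(X, y), X), y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:                      # need both classes to fit
            continue
        Xb = [X[i] for i in idx]
        beta_b = fit_logistic(Xb, yb)
        auc_boot = empirical_auc(linear_score(beta_b, Xb), yb)  # on resample
        auc_orig = empirical_auc(linear_score(beta_b, X), y)    # on original
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - sum(optimism) / n_boot


def simulate(n=80, seed=1):
    """Simulated data: marker 1 is informative, marker 2 is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.4 else 0
        X.append([rng.gauss(0.8 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    apparent, corrected = optimism_corrected_auc(*simulate())
    print(apparent, corrected)
```

The apparent AUC is computed on the same data used to fit the model, so it tends to be optimistic; the correction subtracts the average gap between each bootstrap model's AUC on its own resample and its AUC on the original data.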
The third part is concerned with time-dependent ROC analysis with an event-time outcome
and with a comparative analysis of techniques for handling incomplete covariates. Three
methods are assessed, namely mean imputation, nearest neighbor hot deck imputation and
multivariate imputation by chained equations (MICE). These methods were used together with
the bootstrap and cross-validation to estimate the time-dependent AUC using a
non-parametric approach and a Cox model. We simulated several models to evaluate the
performance of the resampling approaches and imputation methods. We applied the above
methods to a real data set.
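The two simplest imputation methods named above can be sketched as follows. This is a minimal illustration, assuming a single incomplete covariate, complete remaining covariates, missing values coded as `None`, and squared Euclidean distance for choosing the donor; the function names and the toy data are hypothetical and are not taken from the thesis.

```python
def mean_impute(rows, col):
    """Replace missing values in column `col` with the mean of observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    out = []
    for r in rows:
        filled = list(r)
        if filled[col] is None:
            filled[col] = mean
        out.append(filled)
    return out


def hot_deck_impute(rows, col):
    """Nearest neighbor hot deck: copy the value from the closest complete record.

    Closeness is squared Euclidean distance on the other covariates,
    which are assumed to be fully observed.
    """
    donors = [r for r in rows if r[col] is not None]
    out = []
    for r in rows:
        filled = list(r)
        if filled[col] is None:
            donor = min(
                donors,
                key=lambda d: sum((d[j] - r[j]) ** 2
                                  for j in range(len(r)) if j != col),
            )
            filled[col] = donor[col]   # borrow the donor's observed value
        out.append(filled)
    return out


if __name__ == "__main__":
    # Toy data: the second covariate is missing for the third subject.
    data = [[1.0, 10.0], [2.0, 20.0], [2.1, None]]
    print(mean_impute(data, 1))
    print(hot_deck_impute(data, 1))
```

Mean imputation fills every gap with the same value and so shrinks the covariate's variance, whereas hot deck imputation borrows a value actually observed in a similar subject.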
The fourth part is concerned with applying more advanced variable selection methods to predict
the survival of patients using time-dependent ROC analysis. The least absolute shrinkage and
selection operator (LASSO) Cox model is applied to estimate the bootstrap cross-validated, .632
and .632+ bootstrap AUCs for a TBM/HIV data set from KwaZulu-Natal in South Africa. We
also suggest the use of ridge-Cox regression to estimate the AUC and two-level bootstrapping
to estimate the variance of the AUC, and we evaluate these suggested methods.
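For orientation, the .632 and .632+ bootstrap estimators combine the apparent (resubstitution) estimate with the bootstrap cross-validated estimate. The sketch below adapts the usual error-rate definitions to the AUC by taking 0.5 as the no-information value; this adaptation, the truncation rules and the function names are assumptions made for illustration and may differ from the exact definitions used in the thesis.

```python
def auc_632(apparent, bcv):
    """.632 estimator: fixed-weight blend of apparent and bootstrap CV AUC."""
    return 0.368 * apparent + 0.632 * bcv


def auc_632_plus(apparent, bcv, no_info=0.5):
    """.632+ estimator: the weight on the bootstrap CV AUC grows with over-fitting.

    `no_info` is the AUC of an uninformative marker (assumed to be 0.5).
    """
    bcv_c = max(bcv, no_info)                 # do not go below chance level
    if apparent > bcv_c and apparent > no_info:
        # Relative over-fitting rate, between 0 (none) and 1 (all of the
        # apparent performance above chance is due to over-fitting).
        r = (apparent - bcv_c) / (apparent - no_info)
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)             # ranges from 0.632 to 1
    return (1.0 - w) * apparent + w * bcv_c


if __name__ == "__main__":
    # Made-up AUC values, only to show the calls.
    print(auc_632(0.9, 0.8))
    print(auc_632_plus(0.9, 0.8))
```

With no over-fitting the two estimators coincide; as the gap between the apparent and cross-validated AUC grows, the .632+ estimator moves toward the cross-validated value.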
The last part of the research is an application study using genetic HIV data from rural
KwaZulu-Natal to evaluate the sequence of ambiguities as a biomarker to predict recent infection
in HIV patients.
Description
Doctor of Philosophy in Statistics. University of KwaZulu-Natal, Pietermaritzburg 2015.
Keywords
Tuberculosis -- Diagnosis; HIV antibodies -- Diagnosis; Biochemical markers; Diagnostic imaging -- Patients; Diagnostic errors -- Statistics; Patients -- Medical examinations -- Statistics; Theses -- Statistics.