## A frequentist and a Bayesian approach to estimating HIV prevalence accounting for non-response using population-based survey data.

##### Abstract

This study explores and applies enhanced and novel frequentist and Bayesian approaches
to estimating disease measures such as HIV prevalence from population-based complex
survey data, drawing on recent advances in statistical computing software. In
particular, design-consistent estimates of HIV prevalence are computed, and logistic
regression models for HIV prevalence are fitted, under each approach.

Practical survey data are rarely obtained by simple random sampling; instead, complex
sampling designs that reflect complex underlying population structures are employed.
These designs usually involve stratification, multistage sampling and unequal selection
probabilities of sampling units, giving rise to data that are hierarchical (multilevel),
clustered and hence correlated. This is particularly true of large-scale population-based
surveys. Units within a cluster are therefore correlated and there are multiple sources
of variability, which renders standard statistical methods based on the assumption of
independent units inappropriate.
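
As a rough illustration of what a design-consistent prevalence estimate involves, the sketch below computes a weighted (Hájek-type) proportion with a Taylor-linearised standard error that treats clusters as primary sampling units sampled with replacement within strata. It is written in Python for illustration only; the study's own analyses were done in R, and the function name `weighted_prevalence` and the record layout `(stratum, cluster, weight, y)` are assumptions made here, not taken from the study.

```python
from collections import defaultdict
import math


def weighted_prevalence(records):
    """Weighted prevalence and linearised standard error.

    records: iterable of (stratum, cluster, weight, y) with y in {0, 1}.
    Assumes clusters are the primary sampling units and are treated as
    sampled with replacement within strata (a common approximation).
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    p_hat = sum(w * y for _, _, w, y in records) / total_w

    # Sum the linearised scores w * (y - p_hat) / total_w within each cluster.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in records:
        psu_totals[stratum][cluster] += w * (y - p_hat) / total_w

    # Between-cluster variance of the scores, accumulated over strata.
    variance = 0.0
    for clusters in psu_totals.values():
        n_h = len(clusters)
        if n_h < 2:
            raise ValueError("each stratum needs at least two clusters")
        mean_h = sum(clusters.values()) / n_h
        variance += n_h / (n_h - 1) * sum(
            (z - mean_h) ** 2 for z in clusters.values()
        )
    return p_hat, math.sqrt(variance)
```

Because the variance is built from cluster totals rather than individual observations, within-cluster correlation inflates the standard error relative to a simple random sampling formula, which is the point made in the paragraph above.
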

Survey logistic regression models built from a generalized linear modelling framework
were used to explain the variation in HIV prevalence while accounting for the
non-independence of the units. In addition, a hierarchical logistic regression model
built from a generalized linear mixed modelling framework was used to capture the
variability and correlation of units within clusters and to determine how the different
layers interact and affect the response variable. In particular, logistic regression
models for HIV prevalence on demographic, behavioural and socio-economic variables
were developed from both a frequentist and a Bayesian perspective.
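
For orientation, a hierarchical logistic regression of this kind can be written in its simplest form as a two-level random-intercept model. The abstract does not specify the exact structure used, so the notation below (individual $i$ in cluster $j$, covariate vector $\mathbf{x}_{ij}$, cluster effect $u_j$) is an assumed, minimal version:

$$
y_{ij} \mid \pi_{ij} \sim \text{Bernoulli}(\pi_{ij}), \qquad
\log\!\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j, \qquad
u_j \sim N(0, \sigma_u^2).
$$

Here $y_{ij}$ is the HIV status indicator, $\boldsymbol{\beta}$ holds the fixed effects of the demographic, behavioural and socio-economic variables, and $\sigma_u^2$ measures between-cluster variability. Setting $u_j = 0$ for all $j$ recovers an ordinary logistic regression.
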

Statistical methods that incorporate prior information about unknown parameters are
vital in most scientific and biological research, especially in studies where replicated
experimental investigations are not possible. The Bayesian statistical paradigm offers
a framework in which a prior distribution for a parameter is combined with the
likelihood of the observed data to obtain a posterior distribution that explains the
stochastic variation in a response variable. Computer-intensive simulation-based
algorithms, namely Markov chain Monte Carlo (MCMC) methods, were used to draw samples
from the posterior distribution for inference. A Bayesian logistic regression model for
HIV prevalence on demographic and socio-economic variables was fitted within a
generalized linear modelling framework using these MCMC algorithms.
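
To make the prior-times-likelihood idea concrete, here is a minimal random-walk Metropolis sampler for a Bayesian logistic regression, written in plain Python for illustration. The abstract does not say which MCMC algorithm, priors or software routine were used, so the independent normal priors (`prior_sd`), the proposal scale (`step`) and the function names are all assumptions of this sketch.

```python
import math
import random


def log_posterior(beta, X, y, prior_sd=10.0):
    """Log posterior up to a constant: Bernoulli-logit likelihood
    plus independent N(0, prior_sd^2) priors (an assumed prior)."""
    lp = -sum(b * b for b in beta) / (2.0 * prior_sd ** 2)
    for row, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        # y * eta - log(1 + exp(eta)), evaluated in a numerically stable way
        lp += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return lp


def metropolis(X, y, n_iter=5000, step=0.3, seed=1):
    """Random-walk Metropolis: returns a list of n_iter coefficient draws."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    current = log_posterior(beta, X, y)
    draws = []
    for _ in range(n_iter):
        proposal = [b + rng.gauss(0.0, step) for b in beta]
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
        draws.append(beta)
    return draws
```

In use, the early draws would be discarded as burn-in and the remainder summarised (posterior means, credible intervals). A production analysis would also check convergence and tune the proposal, which this sketch omits.
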

Furthermore, practical complex survey data are often characterized by missing
observations due to non-response, and the data used in the current research are no
exception. Analyses of such data often take a complete-case approach, that is, list-wise
deletion of all cases with missing observations, assuming that missing values are
missing completely at random (MCAR). In the current research, we instead generate
multiple plausible values for each missing observation under a multiple imputation
method that accounts for the structure of the data. This produces rectangular complete
data sets, and the uncertainty induced by the imputation process itself is accounted
for.
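
One standard way to account for imputation uncertainty is to analyse each completed data set separately and then pool the results with Rubin's rules. The abstract does not name the pooling rule used, so the sketch below assumes Rubin's rules; the function name `pool_rubin` is hypothetical and the code is in Python for illustration only.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: point estimates, one per completed data set
    variances: their squared standard errors, in the same order
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = sum(estimates) / m                       # pooled point estimate
    within = sum(variances) / m                      # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between       # total variance
    return q_bar, math.sqrt(total)
```

The between-imputation term is what a single imputation would miss: it grows when the imputed data sets disagree, so the pooled standard error reflects the extra uncertainty from not having observed the missing values.
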

The study uses complex survey data (multi-layered and clustered, with missing values)
obtained from the 2010-11 Zimbabwe Demographic and Health Survey (2010-11 ZDHS). The
results show that HIV prevalence varies considerably across subgroups of the
population. All analyses were done using R statistical software packages.