Statistics
Permanent URI for this community: https://hdl.handle.net/10413/6771
Browsing Statistics by Title
Now showing 1 - 20 of 158
Item Adjusting the effect of integrating antiretroviral therapy and tuberculosis treatment on mortality for non-compliance : an instrumental variables analysis using a time-varying exposure.(2018) Yende-Zuma, Fortunate Nonhlanhla.; Mwambi, Henry Godwell.; Vansteelandt, Stijn. In South Africa and elsewhere, research has shown that the integration of antiretroviral therapy (ART) and tuberculosis (TB) treatment saves lives. The randomised controlled trials (RCTs) which provided this compelling evidence used an intent-to-treat (ITT) strategy as part of their primary analysis. Although ITT is protected against selection bias caused by both measured and unmeasured confounders, it can draw results towards the null and underestimate the effectiveness of treatment if there is too much non-compliance. To adjust for non-compliance, "as-treated" and "per-protocol" comparisons are commonly made. These contrast study participants according to their received treatment, regardless of the treatment arm to which they were assigned, or limit the analysis to participants who followed the protocol. Such analyses are generally biased because the subgroups which they compare often lack comparability. In view of the shortcomings of the "as-treated" and "per-protocol" analyses, our objective was to account for non-compliance by using instrumental variables (IV) analysis to estimate the effect of ART initiation during TB treatment on mortality. Furthermore, to capture the full complexity of compliance behaviour outside the TB treatment duration, we developed a novel IV methodology for a time-varying measure of compliance to ART. This is an important contribution to the IV literature, since IV methodology for the effect of a time-varying exposure on a time-to-event endpoint is currently lacking. In RCTs, IV analysis enables us to make use of the comparability offered by randomisation and thereby adjust for both measured and unmeasured confounders; it has the further advantage of yielding results that are less sensitive to random measurement error in the exposure. In order to carry out an IV analysis, one needs to identify a variable called an instrument, which must satisfy three important assumptions. To apply the IV methodology, we used data from the Starting Antiretroviral Therapy at Three Points in Tuberculosis (SAPiT) trial, which was conducted by the Centre for the AIDS Programme of Research in South Africa. This trial enrolled HIV and TB co-infected patients who were assigned to start ART either early or late during TB treatment or after TB treatment completion. The results from the IV analysis demonstrate that the survival benefit of fully integrating TB treatment and ART is even higher than what has been reported in the ITT analysis, since non-compliance has been accounted for.
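As a simple illustration of the general IV idea described above (not the time-varying, time-to-event methodology developed in the thesis), randomised assignment can be used as an instrument for the treatment actually received in a two-stage least squares fit. The sketch below uses R's AER package with simulated, hypothetical variables (z = assigned arm, d = treatment received, y = outcome):

    # Hedged sketch: 2SLS with the randomised arm as the instrument for received
    # treatment; toy data only, not the SAPiT analysis or the thesis's method.
    library(AER)

    set.seed(1)
    n <- 500
    z <- rbinom(n, 1, 0.5)                          # randomised assignment (instrument)
    u <- rnorm(n)                                   # unmeasured confounder
    d <- rbinom(n, 1, plogis(-1 + 2 * z + u))       # treatment received: depends on z and u
    x <- rnorm(n)                                   # measured baseline covariate
    y <- 1 + 0.5 * d + 0.3 * x + u + rnorm(n)       # outcome confounded by u
    dat <- data.frame(y, d, x, z)

    # Instrument z enters only through d; x is kept as an exogenous covariate
    fit_iv <- ivreg(y ~ d + x | z + x, data = dat)
    summary(fit_iv, diagnostics = TRUE)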
Item Age, period and cohort analysis of young adult mortality due to HIV and TB in South Africa: 1997-2015.(2019) Nkwenika, Tshifhiwa Mildred.; Mwambi, Henry Godwell.; Manda, Samuel. Young adult mortality is of major concern in South Africa, given the impact of Human Immunodeficiency Virus/Acquired Immune Deficiency Syndrome (HIV/AIDS), tuberculosis (TB), injuries and emerging non-communicable diseases (NCDs). Investigation of temporal trends in adult mortality associated with TB and HIV has often been based on age, gender, period and birth cohort separately. The overall aim of this study was to estimate the age effect across period and birth cohort, the period effect across age and birth cohort, and the birth cohort effect across age and period on TB- and HIV-related mortality. Mortality data and mid-year population estimates were obtained from Statistics South Africa for the period 1997 to 2015. Observed HIV/AIDS deaths were adjusted for under-reporting, and adjustments for the misclassification of AIDS deaths and for the proportion of ill-defined natural causes were made. Three-year age, period and birth cohort intervals for 15-64 years, 1997-2015 and 1934-2000, respectively, were used. Age-Period-Cohort (APC) analysis using the Poisson distribution was used to compute the effects of age, period and cohort on mortality due to TB and HIV. A total of 5,825,502 adult deaths were recorded for the period 1997 to 2015, of which 910,731 (15.6%) were TB deaths and 252,101 (4.3%) were HIV deaths. A concave-down association between TB mortality and period was observed, while an upward trend was observed for HIV-related mortality. The estimated TB relative mortality showed a concave-down association with age, with a peak at 36-38 years. There was a concave-down relationship between TB relative risk and period between 1997 and 2015. Findings showed a general downward trend in TB mortality with birth cohort, with the 1934 cohort having the highest mortality rates. There was a flatter inverted U-shaped association between age and HIV-related mortality, most pronounced at 30-32 years. An inverted U-shaped relationship between HIV-related mortality and period from 1997 to 2015 was estimated, and an inverted V-shaped relationship between birth cohort and HIV-related mortality. In summary, the study found an inverted U-shaped association between TB-related mortality and both age and period, and a general downward trend with birth cohort, for deaths reported between 1997 and 2015; for HIV-related mortality it found a concave-down relationship with age and period, and an inverted V-shaped relationship with birth cohort. The association between HIV-related mortality and period, with adjustment, differs from the officially reported trend, which shows an upward progression. Our findings are based on a more advanced statistical model, the Age-Period-Cohort model. Using APC analysis, we found secular trends in TB- and HIV-related mortality rates which could provide useful clues for long-term planning, monitoring and evaluation.
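The Poisson-based APC analysis mentioned above can be sketched with a standard GLM and a population offset. The data below are simulated and purely illustrative; note the well-known APC identification problem (cohort = period - age), which R signals by reporting one aliased coefficient and which, in practice, is handled with additional constraints or smoothing:

    # Minimal Age-Period-Cohort sketch with a Poisson GLM (hypothetical data).
    # Deaths are modelled with a population offset; the exact linear dependency
    # cohort = period - age means one factor level will be aliased (NA).
    apc <- expand.grid(age    = seq(15, 63, by = 3),
                       period = seq(1997, 2015, by = 3))
    apc$cohort <- apc$period - apc$age
    apc$pop    <- 1e5
    set.seed(2)
    apc$deaths <- rpois(nrow(apc), lambda = apc$pop * 1e-3)

    fit_apc <- glm(deaths ~ factor(age) + factor(period) + factor(cohort),
                   offset = log(pop), family = poisson, data = apc)
    summary(fit_apc)
    exp(coef(fit_apc))   # rate ratios relative to the reference levels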
Item Analysis of a binary response : an application to entrepreneurship success in South Sudan.(2012) Lugga, James Lemi John Stephen.; Zewotir, Temesgen Tenaw. Just over half (50.6%) of the population of South Sudan lives on less than one US dollar a day, and three quarters of the population live below the poverty line (NBS, Poverty Report, 2010). Generally, effective government policy to reduce unemployment and eradicate poverty focuses on stimulating new businesses. Micro and small enterprises (MSEs) are the major source of employment and income for many in under-developed countries. The objective of this study is to identify factors that determine business success and failure in South Sudan. To achieve this objective, generalized linear models, survey logistic models, generalized linear mixed models and multiple correspondence analysis are used. The data used in this study are generated from the business survey conducted in 2010. The response variable, business success or failure, was measured by profit or loss in businesses. Fourteen explanatory variables were identified as factors contributing to business success and failure. A main-effects model consisting of the fourteen explanatory variables and three interaction effects was fitted to the data. In order to account for the complexity of the survey design, survey logistic and generalized linear mixed models were refitted to the same variables as in the main-effects model. To confirm the results from the models, we used multiple correspondence analysis.
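A design-based logistic fit of the kind described (a survey logistic model) can be sketched with R's survey package; all variable, cluster and weight names below are hypothetical stand-ins for the 2010 business survey design:

    # Minimal sketch of a design-based (survey) logistic model for a binary
    # success/failure outcome; simulated data and hypothetical design variables.
    library(survey)

    set.seed(3)
    dat <- data.frame(
      success       = rbinom(200, 1, 0.6),
      education     = factor(sample(c("none", "primary", "secondary"), 200, TRUE)),
      sector        = factor(sample(c("trade", "services", "manufacturing"), 200, TRUE)),
      start_capital = rlnorm(200, 6, 1),
      cluster       = rep(1:20, each = 10),     # primary sampling units
      stratum       = rep(1:4,  each = 50),
      weight        = runif(200, 1, 5))

    des <- svydesign(ids = ~cluster, strata = ~stratum, weights = ~weight,
                     data = dat, nest = TRUE)
    fit_svy <- svyglm(success ~ education + sector + start_capital,
                      design = des, family = quasibinomial())
    summary(fit_svy)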
Item Analysis of discrete time competing risks data with missing failure causes and cured subjects.(2023) Ndlovu, Bonginkosi Duncan.; Zewotir, Temesgen Tenaw.; Melesse, Sileshi Fanta. This thesis is motivated by the limitations of the existing discrete time competing risks models vis-à-vis the treatment of data that come with missing failure causes or a sizable proportion of cured subjects. The discrete time models that have been suggested to date (Davis and Lawrance, 1989; Tutz and Schmid, 2016; Ambrogi et al., 2009; Lee et al., 2018) are cause-specific-hazard denominated. This fact summarily disqualifies these models from consideration if data come with missing failure causes. It is also well documented that naive application of cause-specific hazards to data with a sizable proportion of cured subjects may produce downward-biased estimates of these quantities. The existing models can be considered within the multiple imputation framework (Rubin, 1987) for handling missing failure causes, but the prospects of scaling them up to handle cured subjects are minimal, if not nil. In this thesis we address these issues concerning the treatment of missing failure causes and cured subjects in discrete time settings. Towards that end, we focus on the mixture model (Larson and Dinse, 1985) and the vertical model (Nicolaie et al., 2010), because these models possess certain properties which dovetail with the objectives of this thesis. The mixture model has been upgraded into a model that can handle cured subjects. Nicolaie et al. (2015) have demonstrated that the vertical model can also handle missing failure causes as is, and Nicolaie et al. (2018) have extended the vertical model to deal with cured subjects. Our strategy in this thesis is to exploit both the mixture model and the vertical model as a launching pad to advance discrete time models for handling data that come with missing failure causes or cured subjects.

Item Analysis of longitudinal binary data : an application to a disease process.(2008) Ramroop, Shaun.; Mwambi, Henry Godwell. The analysis of longitudinal binary data can be undertaken using any of three families of models, namely marginal, random effects and conditional models. Each family of models has its own respective merits and demerits. The models are applied in the analysis of binary longitudinal data for a childhood disease, namely the Respiratory Syncytial Virus (RSV) data collected from a study in Kilifi, coastal Kenya. The marginal model was fitted using generalized estimating equations (GEE). The random effects models were fitted using 'Proc GLIMMIX' and 'NLMIXED' in SAS, and then again in Genstat. Because the data are of a state-transition type with the Markovian property, the conditional model was used to capture the dependence of the current response on the previous response, which is known as the history. The data set has two main complicating issues. Firstly, there is the question of developing a stochastically based probability model for the disease process. In the current work we use direct likelihood and generalized linear modelling (GLM) approaches to estimate important disease parameters. The force of infection and the recovery rate are the key parameters of interest. The findings of the current work are consistent and in agreement with those in White et al. (2003). The aspect of time dependence of the RSV disease is also highlighted in the thesis by fitting monthly piecewise models for both parameters. Secondly, there is the issue of incomplete data in the analysis of longitudinal data. Commonly used methods to analyze incomplete longitudinal data include the well-known available case analysis (AC) and last observation carried forward (LOCF). However, these methods rely on strong assumptions, such as missing completely at random (MCAR) for AC analysis and an unchanging profile after dropout for LOCF analysis. Such assumptions are too strong to hold in general. In recent years, methods of analyzing incomplete longitudinal data under weaker assumptions, such as missing at random (MAR), have become available. Thus we make use of multiple imputation via chained equations, which requires the MAR assumption, and maximum likelihood methods under which the missing data mechanism becomes ignorable as soon as it is MAR. We are therefore faced with the problem of incomplete repeated non-normal data, suggesting the use of at least a Generalized Linear Mixed Model (GLMM) to account for natural individual heterogeneity. The comparison of the parameter estimates obtained using the different methods to handle the dropout is strongly emphasized, in order to evaluate the advantages of the different methods and approaches. A survival analysis approach was also utilized to model the data, due to the presence of multiple events per subject and the time between these events.
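The marginal (GEE) analysis referred to above can be sketched in R with the geepack package; the repeated binary infection indicator and covariates below are simulated, hypothetical stand-ins for the Kilifi RSV data:

    # Minimal GEE sketch for repeated binary infection outcomes (toy data),
    # analogous to the marginal-model analysis described in the abstract.
    library(geepack)

    set.seed(4)
    long <- data.frame(id    = rep(1:100, each = 6),
                       visit = rep(1:6, times = 100),
                       age   = rep(runif(100, 0, 3), each = 6))
    long$infected <- rbinom(nrow(long), 1, plogis(-1 + 0.2 * long$age))

    # Exchangeable working correlation within child; intermittent missingness
    # could additionally be handled with multiple imputation (e.g. the mice package)
    fit_gee <- geeglm(infected ~ age + visit, id = id, data = long,
                      family = binomial, corstr = "exchangeable")
    summary(fit_gee)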
Item Analysis of time-to-event data including frailty modeling.(2006) Phipson, Belinda.; Mwambi, Henry Godwell. There are several methods of analysing time-to-event data. These include non-parametric approaches such as Kaplan-Meier estimation and parametric approaches such as regression modeling. Parametric regression modeling involves specifying the distribution of the survival time of the individuals, which is commonly chosen to be exponential, Weibull, log-normal, log-logistic or gamma. Another well-known model, which does not require assumptions about the hazard function, is the Cox proportional hazards model. However, there may be deviations from proportional hazards which may be explained by unaccounted random heterogeneity. In the early 1980s, a series of studies raised concern about the possible bias in the estimated treatment effect when important covariates are omitted. Other problems may be encountered with the traditional proportional hazards model when the data may be correlated, for instance when there is clustering. A method of handling these types of problems is frailty modeling. Frailty modeling is a method whereby a random effect is incorporated in the Cox proportional hazards model. While this concept is fairly simple to understand, the estimation of the fixed and random effects becomes complicated. Various methods have been explored by several authors, including the Expectation-Maximisation (EM) algorithm, the penalized partial likelihood approach, Markov Chain Monte Carlo (MCMC) methods, the Monte Carlo EM approach and different methods using the Laplace approximation. The lack of available software is problematic for fitting frailty models. These models are usually computationally intensive and may have long processing times. However, frailty modeling is an important aspect to consider, particularly if the Cox proportional hazards model does not adequately describe the distribution of survival time.
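A shared-frailty Cox model of the kind described, with a gamma-distributed random effect added to the proportional hazards model, can be sketched with R's survival package (simulated clustered data, hypothetical variable names):

    # Minimal sketch of a shared-frailty Cox model: a gamma frailty per cluster
    # is added to the proportional hazards model via frailty().
    library(survival)

    set.seed(5)
    d <- data.frame(cluster = rep(1:50, each = 4),
                    x       = rnorm(200))
    d$time   <- rexp(200, rate = exp(0.5 * d$x) * 0.1)
    d$status <- rbinom(200, 1, 0.8)          # 1 = event observed, 0 = censored

    fit_frail <- coxph(Surv(time, status) ~ x + frailty(cluster, distribution = "gamma"),
                       data = d)
    summary(fit_frail)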
Item The application of classification techniques in modelling credit risk.(2014) Mushava, Jonah.; Murray, Michael. The aim of this dissertation is to examine the use of classification techniques to model credit risk through a practice known as credit scoring. In particular, the focus is on one parametric class of classification techniques and one non-parametric class. Since the goal of credit scoring is to improve the quality of the decisions made in evaluating a loan application, advanced methods that improve upon the performance of linear discriminant analysis (LDA) and classification and regression trees (CART) are explored. For LDA these methods include quadratic discriminant analysis (QDA), flexible discriminant analysis (FDA) and mixture discriminant analysis (MDA). Multivariate adaptive regression splines (MARS) are used in the FDA procedure. An Expectation-Maximization (EM) algorithm that estimates the model parameters in MDA is developed thereof. Techniques that help to improve the performance of CART, such as bagging, random forests and boosting, are also discussed at length. A real-life dataset was used to illustrate how these credit-scoring models can be used to classify a new applicant. The dataset is split into a 'learning sample' and a 'testing sample'. The learning sample is used to develop the credit-scoring model (also known as a scorecard), whilst the testing sample is used to test the predictive capability of the scorecard that has been constructed. The predictive performance of the scorecards is assessed using four measures: the classification error rate, a sensitivity measure, a specificity measure and the area under the ROC curve (AUC). Based on these four model performance measures, the empirical results reveal that there is no single ideal scorecard for modelling credit risk, because such a conclusion depends on the aims and objectives of the lender, the details of the problem and the data structure.
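A minimal credit-scoring sketch in this spirit, comparing a classification tree and a random forest on a learning/testing split by AUC, can be written with the rpart, randomForest and pROC packages; the applicant data below are simulated and hypothetical:

    # Minimal credit-scoring sketch: classification tree vs random forest,
    # compared by AUC on a held-out testing sample (toy applicant data).
    library(rpart)
    library(randomForest)
    library(pROC)

    set.seed(6)
    n <- 1000
    credit <- data.frame(income = rlnorm(n, 10, 0.5),
                         debt   = rlnorm(n, 8, 0.7),
                         age    = round(runif(n, 20, 65)))
    credit$default <- factor(rbinom(n, 1, plogis(-2 + credit$debt / 4000 - credit$income / 40000)))

    train    <- sample(n, 0.7 * n)
    fit_tree <- rpart(default ~ ., data = credit[train, ], method = "class")
    fit_rf   <- randomForest(default ~ ., data = credit[train, ], ntree = 500)

    p_tree <- predict(fit_tree, credit[-train, ], type = "prob")[, 2]
    p_rf   <- predict(fit_rf,   credit[-train, ], type = "prob")[, 2]
    auc(roc(credit$default[-train], p_tree))
    auc(roc(credit$default[-train], p_rf))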
Item Application of mixed model and spatial analysis methods in multi-environmental and agricultural field trials.(2015) Negash, Asnake Worku.; Mwambi, Henry Godwell.; Zewotir, Temesgen Tenaw. Agricultural experimentation involves the selection of experimental materials, selection of experimental units, planning of experiments, collection of relevant information, and analysis and interpretation of the results. The overall work of this thesis concerns the importance, improvement and efficiency of variety contrasts using linear mixed models with spatial variance-covariance structures, compared to the usual ANOVA methods of analysis. It also considers issues around the now widely used biplot analysis of genotype plus genotype-by-environment interaction (GGE) in the analysis of multi-environmental crop trials, applies parametric bootstrap methods for testing and selecting multiplicative terms in GGE and AMMI models, and presents statistical methods for handling missing data using multiple imputation, principal components and other deterministic approaches. Multi-environment agricultural experiments are unbalanced because several genotypes are not tested in some environments, or measurements are missing from some plots during the experimental stage. Imputation of the missing values is therefore sometimes necessary. Multiple imputation of missing data using the cross-validation-by-eigenvector method and PCA methods is applied. These methods have the advantages of easy computational implementation, no distributional or structural assumptions, and no restrictions regarding the pattern or mechanism of missing data in experiments. Genotype-by-environment (G×E) interaction is associated with the differential performance of genotypes tested at different locations and in different years, and influences the selection and recommendation of cultivars. Wheat genotypes were evaluated in six environments to determine the G×E interactions and the stability of the genotypes. Additive main effects and multiplicative interaction (AMMI) analysis was conducted for grain yield in both years, and it showed that grain yield variation due to environments, genotypes and G×E was highly significant. Stability for grain yield was determined using genotype plus genotype-by-environment interaction (GGE) biplot analysis. The first two principal components (PC1 and PC2) were used to create a two-dimensional GGE biplot. The which-won-where pattern was based on six locations in the first year and five locations in the second year for all twenty genotypes. The resulting pattern is one realization among many possible outcomes; its repeatability in the second year was different, and its form in a future year is quite unknown. Repeatability of the which-won-where pattern over years is the necessary and sufficient condition for mega-environment delineation and genotype recommendation. The thesis also examines the advantages of mixed models with spatial variance-covariance structures, and the direct implications of model choice for inference on varietal performance, ranking and testing, based on two multi-environmental data sets from realistic national trials. A model comparison with a χ²-test for the trials in the two data sets (wheat and barley data) suggested that the selected spatial variance-covariance structures fitted the data significantly better than the ANOVA model. The forms of the optimally fitted spatial variance-covariance structure, the ranking and the consistency ratio test were not the same from one trial (location) to the other. Linear mixed models with single-stage analysis, including a spatial variance-covariance structure with a grouping factor of location in the random model, also improved the estimation of real genotype effects and their ranking. The model improved varietal performance estimation because of its capacity to handle additional sources of variation, namely location and genotype-by-location (environment) interaction variation, and to accommodate local stationary trend. Knowledge and understanding of statistical methods for the analysis of multi-environmental data are particularly important for plant breeders and those working on the improvement of plant varieties, for proper selection and decision making at the next level of improvement for the country's agricultural development.

Item The application of multistate Markov models to HIV disease progression.(2011) Reddy, Tarylee.; Mwambi, Henry Godwell. Survival analysis is a well-developed area which explores time to a single event. In some cases, however, such methods may not adequately capture the disease process, as disease progression may involve intermediate events of interest. Multistate models incorporate multiple events or states. This thesis proposes to demystify the theory of multistate models through an application-based approach. We present the key components of multistate models, relevant derivations, model diagnostics and techniques for modeling the effect of covariates on transition intensities. The methods developed in the thesis are applied to HIV and TB data partly sourced from CAPRISA and the HPP programmes at the University of KwaZulu-Natal. HIV progression is investigated through the application of a five-state Markov model with reversible transitions, where state 1: CD4 count ≥ 500, state 2: 350 ≤ CD4 count < 500, state 3: 200 ≤ CD4 count < 350, state 4: CD4 count < 200 and state 5: ARV initiation. The mean sojourn time in each state and the transition probabilities are presented, as well as the effects of the covariates age, gender and baseline CD4 count on the transition rates. A key finding, consistent with previous research, is that the rate of decline in CD4 count tends to decrease at lower levels of the marker. Further, patients enrolling with a CD4 count less than 350 had a far lower chance of immune recovery and a substantially higher chance of immune deterioration compared to patients with a higher CD4 count. We noted that older patients tend to progress more rapidly through the disease than younger patients.
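A continuous-time multistate Markov model of the kind described can be fitted with R's msm package. The thesis's five-state CD4/ART model would be specified analogously; purely so that the sketch is runnable, the package's built-in cav heart-transplant data (states 1-4, with state 4 absorbing) are used here:

    # Minimal multistate Markov sketch with the msm package and its built-in
    # cav data; the Q matrix flags which instantaneous transitions are allowed.
    library(msm)

    Q <- rbind(c(0,     0.25, 0,     0.25),
               c(0.166, 0,    0.166, 0.166),
               c(0,     0.25, 0,     0.25),
               c(0,     0,    0,     0))

    fit_msm <- msm(state ~ years, subject = PTNUM, data = cav,
                   qmatrix = Q, deathexact = 4, covariates = ~ sex)
    pmatrix.msm(fit_msm, t = 1)     # one-year transition probabilities
    sojourn.msm(fit_msm)            # mean sojourn times in the transient states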
Item An application of some inventory control techniques.(1992) Samuels, Carol Anne.; Moolman, W. H.; Ryan, K. C. No abstract available.

Item Application of statistical multivariate techniques to wood quality data.(2010) Negash, Asnake Worku.; Mwambi, Henry Godwell.; Zewotir, Temesgen Tenaw. Sappi is one of the leading producers and suppliers of Eucalyptus pulp to the world market. It is also a major contributor to the South African economy, providing employment opportunities to rural people through its large plantations and earning export revenue. Pulp mills' production of quality wood pulp is mainly affected by the supply of non-uniform raw material, namely Eucalyptus trees from various plantations. Improvement in the quality of the pulp depends directly on improvement in the quality of the raw materials. Knowing which factors affect pulp quality is important for tree breeders. Thus, the main objective of this research is, first, to determine which of the anatomical, chemical and pulp properties of wood are significant factors that affect the pulp properties, namely viscosity, brightness and yield. Secondly, the study investigates the effect of differences in plantation location, site quality, tree age and species type on the viscosity, brightness and yield of wood pulp. In order to meet the above-mentioned objectives, data for this research were obtained from Sappi's P186 trial and two other published reports from the Council for Scientific and Industrial Research (CSIR). Principal component analysis, cluster analysis, multiple regression analysis and multivariate linear regression analysis were used. These statistical methods were used to carry out mean comparisons of pulp quality measurements based on viscosity, brightness and yield for trees of different age, location, site quality and hybrid type, and the results indicate that these four factors (age, location, site quality and hybrid type) and some anatomical and chemical measurements (fibre lumen diameter, kappa number, total hemicelluloses and total lignin) have a significant effect on pulp quality measurements.

Item Application of survival analysis methods to study under-five child mortality in Uganda.(2013) Nasejje, Justine.; Achia, Thomas Noel Ochieng.; Mwambi, Henry G. Infant and child mortality rates are among the health indicators of a given community or country. The fourth Millennium Development Goal is that, by 2015, all United Nations member countries are expected to have reduced their infant and child mortality rates by two-thirds. Uganda is one of the countries in sub-Saharan Africa with high infant and child mortality rates and therefore needs to find out which factors are strongly associated with these high rates, in order to provide alternative interventions or maintain existing ones. The Uganda Demographic and Health Survey (UDHS), funded by USAID, UNFPA, UNICEF, Irish Aid and the United Kingdom government, provides a data set which is rich in information. This information has attracted many researchers, and some of it can be used to help Uganda monitor its infant and child mortality rates to achieve the fourth millennium goal. Survival analysis techniques and frailty modelling are well-developed statistical tools for analysing time-to-event data. These methods were adopted in this thesis to examine factors affecting under-five child mortality in Uganda, using the 2011 UDHS data and the R and STATA software. Results obtained by fitting the Cox proportional hazards model and frailty models, and drawing inference using both the frequentist and Bayesian approaches, showed that demographic factors (sex of the household head, sex of the child and number of births in the past one year) are strongly associated with high under-five child mortality rates. Heterogeneity, or unobserved covariates, was found to be significant at the household level but insignificant at the community level.

Item Applications of Levy processes in finance.(2009) Essay, Ahmed Rashid.; O'Hara, J. G. The option pricing theory set forth by Black and Scholes assumes that the underlying asset can be modeled by geometric Brownian motion, with the Brownian motion being the driving force of uncertainty. Recent empirical studies, Dotsis, Psychoyios & Skiadopolous (2007) [17], suggest that the use of Brownian motion alone is insufficient to accurately describe the evolution of the underlying asset. A more realistic description of the underlying asset's dynamics would include random jumps in addition to the Brownian motion. The concept of including jumps in the asset price model leads us naturally to the concept of a Lévy process. Lévy processes serve as building blocks for stochastic processes that include jumps in addition to Brownian motion. In this dissertation we first examine the structure and nature of an arbitrary Lévy process. We then introduce the stochastic integral for Lévy processes as well as the extended version of Itô's lemma, and we then identify exponential Lévy processes that can serve as Radon-Nikodým derivatives in defining new probability measures. Equipped with our knowledge of Lévy processes, we then implement them in a financial context, with the Lévy process serving as the driving source of uncertainty in a stock price model. In particular we look at jump-diffusion models such as Merton's (1976) [37] jump-diffusion model and the jump-diffusion model proposed by Kou and Wang (2004) [30]. As the Lévy processes we consider have more than one source of randomness, we are faced with the difficulty of pricing options in an incomplete market. The options we consider are mainly European in nature, where exercise can only occur at maturity. In addition to the vanilla calls and puts, we independently derive a closed-form solution for an exchange option under Merton's jump-diffusion model, making use of conditioning arguments and stochastic integral representations. We also examine some exotic options under the Kou and Wang model, such as barrier options and lookback options, where the option price is derived in terms of Laplace transforms. We then develop the Kou and Wang model to include only positive jumps; under this revised model we compute the value of a perpetual put option along with the optimal exercise point. Keywords: derivative pricing, Lévy processes, exchange options, stochastic integration.
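For the Merton (1976) jump-diffusion model mentioned above, the standard textbook series representation of a European call price (a Poisson-weighted sum of Black-Scholes prices) can be coded directly; this is an illustrative formula, not the exchange-option or Kou-Wang results derived in the dissertation:

    # Illustrative pricing of a European call under Merton's jump-diffusion model
    # with lognormal jumps, using the usual truncated series of Black-Scholes prices.
    bs_call <- function(S, K, T, r, sigma) {
      d1 <- (log(S / K) + (r + sigma^2 / 2) * T) / (sigma * sqrt(T))
      d2 <- d1 - sigma * sqrt(T)
      S * pnorm(d1) - K * exp(-r * T) * pnorm(d2)
    }

    merton_call <- function(S, K, T, r, sigma, lambda, muJ, deltaJ, n_max = 50) {
      kappa   <- exp(muJ + deltaJ^2 / 2) - 1            # mean relative jump size
      lambda2 <- lambda * (1 + kappa)
      n       <- 0:n_max
      sigma_n <- sqrt(sigma^2 + n * deltaJ^2 / T)       # jump-adjusted volatility
      r_n     <- r - lambda * kappa + n * log(1 + kappa) / T
      weights <- exp(-lambda2 * T) * (lambda2 * T)^n / factorial(n)
      sum(weights * bs_call(S, K, T, r_n, sigma_n))
    }

    merton_call(S = 100, K = 100, T = 1, r = 0.05,
                sigma = 0.2, lambda = 0.5, muJ = -0.1, deltaJ = 0.15)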
Item Appraising South African residential property and measuring price developments.(2022) Bax, Dane Gregory.; Zewotir, Temesgen Tenaw.; North, Delia Elizabeth. Housing wealth is well established as one of the most important sources of wealth for households and investors. However, owning a home is a fundamental human need, making the monitoring of residential property prices a social endeavour as well as an economic one, especially in times of economic uncertainty. Residential property prices also have a direct effect on the macroeconomy through wealth effects, where increased consumption by households follows gains in household balance sheets due to increased equity. Collecting correct and adequate data is vitally important in analysing property market movements and developments, particularly given globalization and the interlinked nature of financial markets. Although measuring residential property price developments is an important economic and social activity, matching properties over time is extremely difficult because the sale of homes is typically infrequent, characteristics vary, and homes are uniquely located in space. This thesis focuses on appraising several residential property types located throughout South Africa from January 2013 to August 2017, investigating different modelling approaches with the aim of developing a residential property price index. Various methods exist to create residential property price indices; however, hedonic models have proven useful as a quality-adjusted approach where pure price changes are measured, and not simply changes in the composition of samples over time. Before fitting any models to appraise homes, an autoencoder was built to detect anomalous data arising from human error at the data entry stage. The autoencoder identified improbable data, resulting in a final data set of 415 200 records once duplicate records were identified and removed. The study first investigated generalised linear models as a candidate approach to appraise homes in South Africa, which showed possible alternatives to the ubiquitous log-linear model. Relaxing functional form assumptions and considering the nested locational structure of homes, hierarchical generalised additive models were considered as the next candidate method. Partitioning around the medoids was applied to find additional spatial groupings, which were treated as random effects along with the suburb. The findings showed that the marginal utility of structural attributes was non-linear, and that smooth functions of covariates were an appropriate treatment. Furthermore, the use of random effects helped account for the spatial heterogeneity of homes through partial pooling. Finally, machine learning algorithms were investigated because of their minimal assumptions about the data-generating process and their ability to capture complex non-linear and interaction effects. Random forests, gradient boosted machines and neural networks were adopted to fit these appraisal functions. The gradient boosted machines had the best goodness of fit, showing non-linear relationships between the structural characteristics of homes and listing prices. Partial dependence plots were able to quantify the marginal utility over the distributions of the different structural characteristics. The results show that larger homes do not necessarily yield a premium, and a diminishing return is evident, similar to the results of the hierarchical generalised additive models. The variable importance plots showed that location was the most important predictor, followed by the number of bathrooms and the size of a home. The gradient boosted machines achieved the lowest out-of-sample error and were used to develop the residential property price index. A chained, dual-imputation Fisher index was applied to the gradient boosted machines, showing nominal and real price developments at country and provincial level. The chained, dual-imputation Fisher index provided less noisy estimates than a simple median mix-adjusted index. Although listing prices were used and not transaction prices, the trend was similar to the ABSA Global Property Guide. In order to make this research useful to property market participants, a web application was developed to show how the proposed methodology can be democratised by property portals and real estate agencies. The Listing Price Index Calculator was created to communicate the results easily through a front-end interface, showing how property portals and real estate agencies can leverage their data to aid sellers in determining listing prices to go to market with, help buyers obtain an average estimate of the home they wish to purchase, and guide property market participants on price developments.
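A gradient boosted hedonic model with a partial dependence plot, in the spirit of the approach described, can be sketched with R's gbm package; the listing data below are simulated and all variable names are hypothetical:

    # Minimal gradient boosted hedonic sketch for (log) listing prices,
    # with cross-validated tree selection and a partial dependence plot.
    library(gbm)

    set.seed(7)
    n <- 2000
    homes <- data.frame(size      = runif(n, 40, 400),
                        bedrooms  = sample(1:5, n, TRUE),
                        bathrooms = sample(1:4, n, TRUE),
                        suburb    = factor(sample(paste0("S", 1:20), n, TRUE)))
    homes$log_price <- 12 + 0.004 * homes$size + 0.05 * homes$bathrooms +
                       as.numeric(homes$suburb) / 50 + rnorm(n, 0, 0.2)

    fit_gbm <- gbm(log_price ~ size + bedrooms + bathrooms + suburb,
                   data = homes, distribution = "gaussian",
                   n.trees = 2000, interaction.depth = 3,
                   shrinkage = 0.01, cv.folds = 5)
    best_iter <- gbm.perf(fit_gbm, method = "cv")
    summary(fit_gbm, n.trees = best_iter)                  # relative influence (variable importance)
    plot(fit_gbm, i.var = "size", n.trees = best_iter)     # partial dependence on size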
Item Aspects of categorical data analysis.(1998) Govender, Yogarani.; Matthews, Glenda Beverley. The purpose of this study is to investigate and understand data which are grouped into categories. At the outset, the study presents a review of early research contributions and controversies surrounding categorical data analysis. The concept of sparseness in a contingency table refers to a table where many cells have small frequencies. Previous research findings showed that incorrect results were obtained in the analysis of sparse tables. Hence, attention is focused on the effect of sparseness on the modelling and analysis of categorical data in this dissertation. Cressie and Read (1984) suggested a versatile alternative, the power-divergence statistic, to the statistics proposed in the past. This study includes a detailed discussion of the power-divergence goodness-of-fit statistic, with areas of interest covering a review of the minimum power-divergence estimation method and the evaluation of model fit. The effects of sparseness are also investigated for the power-divergence statistic. Comparative reviews of the accuracy, efficiency and performance of the power-divergence family of statistics in large and small sample cases are presented. Statistical applications of the power-divergence statistic have been conducted in SAS (Statistical Analysis Software). Further findings on the effect of small expected frequencies on the accuracy of the X² test are presented from the studies of Tate and Hyer (1973) and Lawal and Upton (1976). Other goodness-of-fit statistics which bear relevance to the sparse multinomial case are discussed. They include Zelterman's (1987) D² goodness-of-fit statistic, Simonoff's (1982, 1983) goodness-of-fit statistics, as well as Koehler and Larntz's tests for log-linear models. In addressing contradictions for the sparse sample case under asymptotic conditions and an increase in sample size, discussions are provided on Simonoff's use of nonparametric techniques to find the variances, as well as his adoption of the jackknife and bootstrap techniques.
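The Cressie-Read power-divergence statistic itself is straightforward to compute directly; the small function below implements 2/(λ(λ+1)) Σ O[(O/E)^λ - 1], with λ = 1 recovering Pearson's X² and λ → 0 the likelihood-ratio statistic G²:

    # Direct implementation of the Cressie-Read power-divergence statistic.
    power_divergence <- function(observed, expected, lambda = 2/3) {
      if (abs(lambda) < 1e-8) {                 # limiting case: G^2
        return(2 * sum(observed * log(observed / expected)))
      }
      2 / (lambda * (lambda + 1)) * sum(observed * ((observed / expected)^lambda - 1))
    }

    # Example: goodness of fit to equal cell probabilities
    obs     <- c(18, 22, 29, 11, 20)
    e_cells <- rep(sum(obs) / length(obs), length(obs))
    power_divergence(obs, e_cells, lambda = 2/3)   # Cressie-Read recommended value
    power_divergence(obs, e_cells, lambda = 1)     # equals Pearson's chi-squared
    chisq.test(obs)$statistic                      # check against built-in Pearson X^2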
Item An assessment of modified systematic sampling designs in the presence of linear trend.(2017) Naidoo, Llewellyn Reeve.; North, Delia Elizabeth.; Zewotir, Temesgen Tenaw.; Arnab, Raghunath. Sampling is used to estimate population parameters, as it is usually impossible to study a whole population due to time and budget restrictions. There are various sampling designs to address this issue, and this thesis is concerned with a particular probability sampling design known as systematic sampling. Systematic sampling is operationally convenient and efficient and hence is used extensively in most practical situations. The shortcomings associated with systematic sampling include: (i) it is impossible to obtain an unbiased estimate of the sampling variance when conducting systematic sampling with a single random start; and (ii) if the population size is not a multiple of the sample size, then conducting conventional systematic sampling, also known as linear systematic sampling, may result in variable sample sizes. In this thesis, I provide a contribution to the current body of knowledge by proposing modifications to the systematic sampling design so as to address these shortcomings. Firstly, a discussion of the measures used to compare the various probability sampling designs is provided, before reviewing the general theory of systematic sampling. The performance of systematic sampling depends on the population structure. Hence, this thesis concentrates on a specific and common population structure, namely linear trend. A discussion of the performance of linear systematic sampling and all related modifications, including a newly proposed modification, is then presented under the assumption of linear trend among the population units. For each of the above-mentioned problems, a brief review of all the associated sampling designs from the existing literature, along with my proposed modified design, is then explored. Thereafter, I introduce a modified sampling design that addresses the above-mentioned problems in tandem, before providing a comprehensive report on the thesis. The aim of this thesis is to provide solutions to the above-mentioned disadvantages by proposing modified systematic sampling designs and/or estimators that compare favourably with their existing literature counterparts. Keywords: systematic sampling; super-population model; Horvitz-Thompson estimator; Yates' end corrections method; balanced modified systematic sampling; multiple-start balanced modified systematic sampling; remainder modified systematic sampling; balanced centered random sampling.
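Linear systematic sampling with a single random start, the baseline design whose shortcomings the thesis addresses, can be sketched in a few lines of R; the population below is simulated with a linear trend:

    # Linear systematic sampling with a single random start: population size N,
    # target sample size n, sampling interval k = floor(N / n). When N is not a
    # multiple of n the realised sample size varies, one of the shortcomings
    # discussed in the thesis.
    systematic_sample <- function(N, n) {
      k     <- floor(N / n)                 # sampling interval
      start <- sample(1:k, 1)               # single random start
      seq(start, N, by = k)
    }

    set.seed(8)
    y <- 10 + 0.5 * (1:103) + rnorm(103)    # population with a linear trend
    s <- systematic_sample(length(y), 10)
    s
    mean(y[s])                              # sample mean as an estimator of the population mean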
Item Bayesian data augmentation using MCMC: application to missing values imputation on cancer medication data.(2017) Ndlela, Thamsanqa Innocent.; Lougue, Siaka. Missing data is a very serious issue that negatively affects the inferences and findings of researchers in data science and statistics. Ignoring missing data, or deleting cases that contain missing observations, may reduce statistical power, cause loss of information, increase the standard errors of estimates and increase estimation bias in data analysis. One of the advantages of using imputation methods is that the full sample size is retained, which makes the results more precise. Amongst the missing data imputation techniques, data augmentation is not so popular in the literature, and very few articles mention the use of the technique to account for missing data problems. The data augmentation technique can be used for the imputation of missing data in both Bayesian and classical statistics. In the classical approach, data augmentation is implemented through the EM algorithm, which uses the likelihood function to impute missing values and estimate the unknown parameters of a model. The EM algorithm is a useful tool for likelihood-based inference when dealing with missing data problems. The Bayesian data augmentation approach is used when it is not possible to directly estimate the posterior distribution P(θ | x_ov) of the parameters given the observed data x_ov, due to the missing data in x. This study aims to contribute to a better understanding of Bayesian data augmentation and to improve the quality of estimates and the precision of the analysis of data with missing values. The General Household Survey [GHS 2015] is the main source of data in this study. All the analyses are made using the software R, and more precisely the package mix. In this study, we find that Bayesian data augmentation can solve the problem of missing data in cancer drug intake data. Bayesian data augmentation performs very well in improving the modelling of cancer drug intake data affected by missing values.
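The data augmentation idea can be illustrated with a small hand-coded sampler for a univariate normal variable with values missing at random: the I-step imputes the missing values from their predictive distribution and the P-step draws the parameters from the complete-data posterior. This toy sketch only illustrates the algorithm; it is not the multivariate routine of the mix package used in the thesis:

    # Toy Bayesian data augmentation for a normal variable with missing values,
    # using a flat prior on mu and a scaled inverse chi-squared posterior for sigma^2.
    set.seed(9)
    y_full <- rnorm(200, mean = 5, sd = 2)
    y      <- y_full
    y[sample(200, 40)] <- NA                    # 20% of values set to missing
    miss   <- is.na(y)
    n      <- length(y)

    mu     <- mean(y, na.rm = TRUE)
    sigma2 <- var(y, na.rm = TRUE)
    draws  <- matrix(NA, nrow = 1000, ncol = 2, dimnames = list(NULL, c("mu", "sigma2")))

    for (t in 1:1000) {
      # I-step: impute missing values given current parameters
      y[miss] <- rnorm(sum(miss), mu, sqrt(sigma2))
      # P-step: draw parameters from the complete-data posterior
      sigma2 <- (n - 1) * var(y) / rchisq(1, df = n - 1)
      mu     <- rnorm(1, mean(y), sqrt(sigma2 / n))
      draws[t, ] <- c(mu, sigma2)
    }
    colMeans(draws[-(1:200), ])                 # posterior means after burn-in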
Item Bayesian generalized linear mixed modeling of breast cancer data in Nigeria.(2017) Ogunsakin, Ropo Ebenezer.; Logue, Siaka. Breast cancer is the world's most prevalent type of cancer among women. Statistics indicate that breast cancer alone accounted for 37% of all the cases of cancer diagnosed in Nigeria in 2012. The data used in this study were extracted from patient records, commonly called hospital-based records, and identify key socio-demographic and biological risk factors of breast cancer. Researchers sometimes ignore the hierarchical structure of the data and the disease when analyzing data. Doing so may lead to biased parameter estimates and larger standard errors. That is why the analyses undertaken in this study included the multilevel structure of cancer diagnosis, type and medication through a Generalized Linear Mixed Model (GLMM), which considers both fixed and random effects (levels 1 and 2). In addition to the classical statistics approach, this study incorporates the Bayesian GLMM approach as well as some bootstrapping techniques. All the analyses are done using R or SAS for the classical approaches, and WinBUGS for the Bayesian approach. The Bayesian analyses were strengthened by checks of convergence, autocorrelation and other Markov chain assumptions, using the CODA and BOA packages. The findings reveal that the Bayesian techniques provide more comprehensive results, given that Bayesian analysis is a statistically stronger technique. The Bayesian methods appeared more robust than the classical and bootstrapping techniques in analyzing breast cancer data in Western Nigeria. The results identified age at diagnosis, educational status, tumor grade and breast cancer type as prognostic factors of breast cancer.

Item Bayesian hierarchical spatial and spatio-temporal modeling and mapping of tuberculosis in Kenya.(2013) Iddrisu, Abdul-Karim.; Mwambi, Henry G.; Achia, Thomas Noel Ochieng. The global spread of infectious disease threatens the well-being of human, domestic animal and wildlife health. A proper understanding of the global distribution of these diseases is an important part of disease management and policy making. However, data are subject to complexities caused by heterogeneity across host classes and space-time epidemic processes [Waller et al., 1997, Hosseini et al., 2006]. The use of frequentist methods in biostatistics and epidemiology is common, and such methods are extensively utilized in answering varied research questions. In this thesis, we propose a hierarchical Bayesian approach to study the spatial and spatio-temporal patterns of tuberculosis in Kenya [Knorr-Held et al., 1998, Knorr-Held, 1999, López-Quílez and Munoz, 2009, Waller et al., 1997, Julian Besag, 1991]. The space-time interaction of risk (ψ_ij) is an important factor considered in this thesis. Markov Chain Monte Carlo (MCMC) methods, via WinBUGS and R packages, were used for simulations [Ntzoufras, 2011, Congdon, 2010, David et al., 1995, Gimenez et al., 2009, Brian, 2003], and the Deviance Information Criterion (DIC), proposed by Spiegelhalter et al. [2002], was used for model comparison and selection. Variation in TB risk is observed among Kenyan counties, with clustering among counties with high TB relative risk (RR). HIV prevalence is identified as the dominant determinant of TB. We found clustering and heterogeneity of risk among high-rate counties, and the overall TB risk decreased slightly from 2002 to 2009. The interaction of TB relative risk in space and time is found to be increasing among rural counties that share boundaries with urban counties with high TB risk. This is a result of the ability of the models to borrow strength from neighbouring counties, such that nearby counties have similar risk. Although the approaches are less than ideal, we hope that our formulations provide a useful stepping stone in the development of spatial and spatio-temporal methodology for the statistical analysis of TB risk in Kenya.
Item Bayesian modelling of non-Gaussian time series of severe acute respiratory illness.(2019) Musyoka, Raymond Nyoka.; Mwambi, Henry.; Achia, Thomas Noel Ochieng.; Gichangi, Anthony Simon Runo. Respiratory syncytial virus (RSV), human metapneumovirus (HMPV) and influenza are some of the major causes of acute lower respiratory tract infections (ALRTI) in children. Children younger than 1 year are the most susceptible to these infections. RSV and influenza infections occur seasonally in temperate climate regions. In Chapter 2, we developed statistical models that were assessed and compared to predict the relationship between weather and RSV incidence. HMPV infections have similar symptoms to those caused by RSV; currently, only a few models satisfactorily capture the dynamics of time series data of these two viruses. In Chapter 3, we used a negative binomial model to investigate the relationship between RSV and HMPV while adjusting for climatic factors. In Chapter 4, we considered multiple viruses, incorporating the time-varying effects of these components. The occurrence of different diseases in time gives rise to multivariate time series data. In that chapter, we describe an approach to analyse multivariate time series of disease counts and to model the contemporaneous relationship between the pathogens, namely RSV, HMPV and influenza. The use of the models described in this study could help public health officials predict increases in the incidence of infection with each pathogen among children, and help them prepare for and respond more swiftly to increasing incidence in low-resource regions or communities. We conclude that preventing and controlling RSV infection subsequently reduces the incidence of HMPV. RSV is one of the major causes of ALRTI in children, and children younger than 1 year are the most susceptible to RSV infection. Based on RSV surveillance and climatic data, we developed statistical models that were assessed and compared to predict the relationship between weather and RSV incidence among refugee children younger than 5 years in the Dadaab refugee camp in Kenya. Most time-series analyses rely on the assumption of Gaussian-distributed data; however, surveillance data often do not have a Gaussian distribution. We used a generalised linear model (GLM) with a sinusoidal component over time to account for seasonal variation, and extended it to a generalised additive model (GAM) with smoothing cubic splines. Climatic factors were included as covariates in the models before and after timescale decompositions, and the results were compared. Models with decomposed covariates fitted the RSV incidence data better than those without. The Poisson GAM with decomposed covariates of climatic factors fitted the data well and had higher explanatory and predictive power than the GLM. The best model predicted the relationship between atmospheric conditions and RSV infection incidence among children younger than 5 years. The modes of transmission and the dynamics of the RSV and HMPV epidemics still remain poorly understood, and climatic factors have long been suspected to influence the number of cases in these epidemics. In this study, we used a negative binomial model to investigate the relationship between RSV and HMPV while adjusting for climatic factors. We specifically aimed at establishing the heterogeneity in the autoregressive effect, to account for the influence between these viruses. Our findings showed that RSV contributed to the severity of HMPV.
This was achieved through a comparison of 12 models of various structures, including those with and without interaction between the climatic cofactors. Most models do not consider multiple viruses, nor do they incorporate the time-varying effects of these components. Common ARI etiologies identified in developing countries include RSV, HMPV, influenza viruses, parainfluenza viruses (PIV) and rhinoviruses, with mixed co-infections in the respiratory tract, which makes the etiology of acute respiratory illness (ARI) complex. The occurrence of different diseases in time gives rise to multivariate time series data. In this work, the surveillance data are aggregated by month and are not available at an individual level; this may lead to over-dispersion, hence the use of the negative binomial distribution. We describe an approach to analyse multivariate time series of disease counts, extending a model previously used in the literature to address the dependence between two different disease pathogens. We model the contemporaneous relationship between the pathogens, namely RSV, HMPV and influenza, from surveillance data in a refugee camp (Dadaab) for children under 5 years, to investigate serial correlation. The models evaluate the presence of heterogeneity in the autoregressive effect for the different pathogens and whether, after adjusting for seasonality, an epidemic component can be isolated within or between the pathogens. The model helps to distinguish between an endemic and an epidemic component of the time series, allowing the separation of the regular pattern from irregularities and outbreaks. The use of the models described in this study can help public health officials predict increases in the incidence of infection with each pathogen among children, and help them prepare for and respond more effectively to increasing incidence in low-resource regions or communities. The study has improved our understanding of the dynamics of RSV and HMPV in relation to climatic cofactors, thereby setting a platform to devise better intervention measures to combat the epidemics. We conclude that preventing and controlling RSV infection subsequently reduces the incidence of HMPV.
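A negative binomial time-series regression of the kind described in Chapter 3, with a lagged RSV term and harmonic seasonal terms, can be sketched with MASS::glm.nb; the monthly surveillance counts below are simulated and hypothetical:

    # Minimal negative binomial sketch for monthly HMPV counts, with last month's
    # RSV count as an autoregressive-style covariate and annual harmonics for
    # seasonality (toy data).
    library(MASS)

    set.seed(10)
    months   <- 1:120
    rsv      <- rnbinom(120, mu = 30 + 20 * sin(2 * pi * months / 12), size = 5)
    rsv_lag1 <- c(NA, head(rsv, -1))
    hmpv     <- rnbinom(120, mu = 5 + 0.2 * ifelse(is.na(rsv_lag1), 30, rsv_lag1), size = 5)

    dat <- data.frame(hmpv, rsv_lag1,
                      sin12 = sin(2 * pi * months / 12),
                      cos12 = cos(2 * pi * months / 12))

    fit_nb <- glm.nb(hmpv ~ rsv_lag1 + sin12 + cos12, data = dat)
    summary(fit_nb)
    exp(coef(fit_nb))    # rate ratios; the rsv_lag1 term reflects dependence on RSV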