The influence of HIV-1 genomic target region selection and sequence length on the accuracy of inferred phylogenies and clustering outcomes.
Date
2017
Abstract
To improve the methodology of HIV-1 cluster analysis, we examined how the choice of genomic
target region and sequence length affects the outcome of viral clustering. The
extent of HIV clustering, tree certainty, subtype diversity ratio (SDR), subtype diversity
variance (SDV) and Shimodaira-Hasegawa (SH)-like support values were compared between
2881 HIV-1 full-genome sequences and their sub-genomic regions; 2567 of the sequences were
retrieved from the LANL HIV Database and 314 were sequenced from blood samples from a cohort in
KwaZulu-Natal. Sliding window analysis was based on 99 windows of 1000 bp, 45 windows of
2000 bp and 27 windows of 3000 bp. Clusters were enumerated for each window sequence
length, and the optimal sequence length for cluster identification was probed. Potential
associations between the extent of HIV clustering and sequence length were also evaluated. The
phylogeny based on the full-genome sequences showed the best tree accuracy; it ranked highest
with regard to both tree certainty and SH-like support. Product 4, a region associated with env,
had the best tree accuracy among the sub-genomic regions. Among the HIV-1 structural genes,
env had the best tree certainty, SH-like support, SDR score and the best SDV score overall. The
hierarchy of cluster phylotype enumeration mirrored the tree accuracy analysis, with the
full-genome phylogeny showing the highest extent of clustering, and the product 4 region being
second best. Among the structural genes, the highest number of phylotypes was enumerated
from the pol phylogeny, followed by env. The extent of HIV-1 clustering was slightly higher for
sliding windows of 3000 bp than for those of 2000 bp or 1000 bp; thus, 3000 bp was found to be the
optimal length for phylogenetic cluster analysis. We found a moderate association between the
length of sequences used and the proportion of HIV sequences in clusters; the influence of viral
sequence length may have been diminished by the substantial number of taxa. Full-genome
sequences could provide the most informative HIV cluster analysis. Selected sub-genomic
regions with the best combination of high extent of HIV clustering and high tree accuracy, such
as env, could also be considered as a second choice.
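
The sliding window analysis described above can be illustrated with a minimal sketch. The abstract reports only the window lengths (1000, 2000 and 3000 bp) and the number of windows; it does not state the step size or the alignment length, so the `step` and `alignment_length` values below are placeholder assumptions, and the function names `sliding_windows` and `slice_alignment` are hypothetical helpers, not the tooling used in the thesis.

```python
from typing import Dict, Iterator, Tuple


def sliding_windows(alignment_length: int, window: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) coordinates of every full-length window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # Only windows that fit entirely inside the alignment are produced.
    for start in range(0, alignment_length - window + 1, step):
        yield start, start + window


def slice_alignment(aligned: Dict[str, str], start: int, end: int) -> Dict[str, str]:
    """Cut the same column range out of every aligned sequence."""
    return {name: seq[start:end] for name, seq in aligned.items()}


if __name__ == "__main__":
    # Assumed values for illustration only; the source does not report them.
    alignment_length = 9000
    step = 100
    for window in (1000, 2000, 3000):
        n = sum(1 for _ in sliding_windows(alignment_length, window, step))
        print(f"{window} bp windows: {n}")
```

Each window's sub-alignment would then be passed to the tree-building and cluster-identification steps, so that the extent of clustering can be compared across window lengths.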
Description
Master's Degree. University of KwaZulu-Natal, Durban.