The influence of HIV-1 genomic target region selection and sequence length on the accuracy of inferred phylogenies and clustering outcomes.
To improve the methodology of HIV-1 cluster analysis, we examined how parameters such as genomic target region and sequence length affect the outcome of viral clustering. The extent of HIV clustering, tree certainty, subtype diversity ratio (SDR), subtype diversity variance (SDV) and Shimodaira-Hasegawa (SH)-like support values were compared between 2881 HIV-1 full-genome sequences and sub-genomic regions; of the 2881 sequences, 2567 were retrieved from the LANL HIV Database and 314 were sequenced from blood samples from a cohort in KwaZulu-Natal. Sliding window analysis was based on 99 windows of 1000 bp, 45 windows of 2000 bp and 27 windows of 3000 bp. Clusters were enumerated for each window length to identify the optimal sequence length for cluster identification. Potential associations between the extent of HIV clustering and sequence length were also evaluated. The phylogeny based on the full-genome sequences showed the best tree accuracy; it ranked highest with regard to both tree certainty and SH-like support. Product 4, a region associated with env, had the best tree accuracy among the sub-genomic regions. Among the HIV-1 structural genes, env had the best tree certainty, SH-like support and SDR score, and the best SDV score overall. The hierarchy of cluster phylotype enumeration mirrored the tree accuracy analysis, with the full-genome phylogeny showing the highest extent of clustering and the product 4 region being second best. Among the structural genes, the highest number of phylotypes was enumerated from the pol phylogeny, followed by env. The extent of HIV-1 clustering was slightly higher for sliding windows of 3000 bp than for those of 2000 bp and 1000 bp; 3000 bp was therefore found to be the optimal length for phylogenetic cluster analysis. We found a moderate association between the length of sequences used and the proportion of HIV sequences in clusters; the influence of viral sequence length may have been diminished by the substantial number of taxa.
Full-genome sequences could provide the most informative HIV cluster analysis. Selected sub-genomic regions with the best combination of high extent of HIV clustering and high tree accuracy, such as env, could also be considered as a second choice.
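As an illustration of the sliding window scheme mentioned above, the following minimal Python sketch generates window coordinates over an alignment. The abstract does not state the step size or the alignment length, so both are assumptions here: a step of one tenth of the window length and a hypothetical alignment length of 10,800 columns are illustrative values that happen to reproduce the reported window counts (99, 45 and 27), and they should not be read as the parameters actually used in the study. The function name `sliding_windows` is likewise hypothetical.

```python
def sliding_windows(alignment_length, window, step):
    """Return half-open (start, end) coordinates of all full-length windows."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # Only windows that fit entirely inside the alignment are kept.
    return [
        (start, start + window)
        for start in range(0, alignment_length - window + 1, step)
    ]


# Assumption: hypothetical alignment length, not given in the abstract.
ALIGNMENT_LENGTH = 10_800

# Assumption: step = window // 10, chosen only for illustration.
window_counts = {
    window: len(sliding_windows(ALIGNMENT_LENGTH, window, window // 10))
    for window in (1000, 2000, 3000)
}

print(window_counts)  # {1000: 99, 2000: 45, 3000: 27}
```

Each coordinate pair could then be used to slice the alignment columns before building a phylogeny for that window; other combinations of step size and alignment length may yield the same counts.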