To contain infectious diseases, it is crucial to determine the origin and transmission routes of the pathogen, as well as how it evolves. With the development of genome sequencing technology, genomic epidemiology has emerged as a powerful approach for investigating the source and transmission of pathogens.
We presented the rationale for genomic tracing of SARS-CoV-2 and the challenges we currently face. Identifying the reference sequence that is genetically most similar to the query sequence is a critical step in genomic tracing, typically achieved using either a phylogenetic tree or a sequence similarity search.
1.2 Data preparation
The genomic variants of open-access SARS-CoV-2 sequences and the corresponding metadata were downloaded from the UShER website (7 December 2022), and only sequences labeled as ‘high quality, complete, and human source’ were included in the study. Wuhan-Hu-1 (GenBank: MN908947.3) was used as the reference sequence for variant calling. Identical sequences were merged, and the resulting non-redundant sequences are referred to as genotypes in this study.
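The merging of identical sequences into genotypes can be illustrated with a minimal sketch. It assumes that each sequence is already represented by its list of variants relative to MN908947.3 and that two sequences are identical when their variant sets are equal; the function name, sequence IDs and input layout are hypothetical and are not taken from the Valkyrie code.

```python
from collections import defaultdict

def merge_into_genotypes(variants_by_sequence):
    """Group sequence IDs that share exactly the same set of variants.

    variants_by_sequence: dict mapping a sequence ID to an iterable of
    variant strings (e.g. "C241T") called against the reference.
    Returns a dict mapping each distinct variant set (a frozenset) to the
    list of sequence IDs that carry it. Each key is one genotype.
    """
    genotypes = defaultdict(list)
    for seq_id, variants in variants_by_sequence.items():
        # frozenset makes the key independent of variant order.
        genotypes[frozenset(variants)].append(seq_id)
    return dict(genotypes)

# Hypothetical toy input: seqA and seqB differ only in variant order.
example = {
    "seqA": ["C241T", "A23403G"],
    "seqB": ["A23403G", "C241T"],
    "seqC": ["C241T"],
}
genotypes = merge_into_genotypes(example)
```

In this toy input, three sequences collapse into two genotypes, because seqA and seqB carry the same variants.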
The impact of the new scoring scheme on genomic tracing is illustrated in Figure 1.
Figure 1 A demonstration of the performance and workflow of the new scoring system. (A) Scores calculated with the unweighted method, which assigns the same weight to all mutations. (B) Scores calculated with the weighted method, which considers the varying mutation incidences. (C) Scores calculated with the weighted method, which considers the clade-specific mutation incidence and the correlation between concurrent mutations. (D) The workflow of Valkyrie for data preprocessing, query and visualization.
The time to construct a phylogenetic tree is proportional to the number of sequences; constructing a single tree from 3 054 071 genotypes took more than three weeks. To save time, the genotypes were first split into evolutionary clusters, each containing at most 20 000 genotypes. Second, a maximum likelihood phylogenetic tree and the corresponding ancestral sequences were constructed for each cluster using IQ-TREE (v2.1.1, GTR + G substitution model). Notably, the reference sequence (MN908947.3) was included as the outgroup.
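The clustering and per-cluster tree construction can be sketched as follows. The text does not state how the evolutionary clusters were delimited, so the sketch assumes that each genotype already carries a clade label and simply caps each group at 20 000 members; the function names and input layout are hypothetical. The command line uses standard IQ-TREE 2 options (`-s`, `-m`, `-o`, `-asr`), but the exact invocation used in the study is not given and may have differed.

```python
from collections import defaultdict

MAX_CLUSTER_SIZE = 20_000  # upper bound per cluster stated in the text

def split_into_clusters(clade_by_genotype, max_size=MAX_CLUSTER_SIZE):
    """Group genotype IDs by clade label, then cut each group into chunks
    of at most max_size genotypes (assumed splitting rule)."""
    by_clade = defaultdict(list)
    for genotype_id, clade in clade_by_genotype.items():
        by_clade[clade].append(genotype_id)
    clusters = []
    for clade in sorted(by_clade):
        members = sorted(by_clade[clade])
        for start in range(0, len(members), max_size):
            clusters.append((clade, members[start:start + max_size]))
    return clusters

def iqtree_command(alignment_path, outgroup="MN908947.3"):
    """Build an IQ-TREE 2 command for one cluster: GTR+G model, the
    reference as outgroup, and ancestral state reconstruction (-asr).
    The alignment is assumed to contain the outgroup sequence."""
    return ["iqtree2", "-s", alignment_path, "-m", "GTR+G",
            "-o", outgroup, "-asr"]

# Hypothetical toy input with a small cap to show the chunking.
toy = {"g1": "A", "g2": "A", "g3": "A", "g4": "B"}
clusters = split_into_clusters(toy, max_size=2)
command = iqtree_command("cluster_0.fasta")
```

The command list can be passed to `subprocess.run` once per cluster, which also allows the clusters to be processed in parallel.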
The mutation incidence was defined as the absolute number of occurrences of the mutation on the merged phylogenetic tree. The mutation frequency was estimated by dividing the mutation incidence by the total number of nodes that descended from a node carrying the ancestral nucleotide. Similarly, the back-mutation frequency was estimated as the proportion of nodes carrying the ancestral nucleotide under a node where the mutation occurred. For this estimate, the total number of descendant genotypes was required to be >50 to avoid large deviations caused by small sample sizes.
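These definitions can be made concrete with a small sketch for a single genomic site. It rests on several assumptions that the text does not confirm: the tree is given as a child-to-parent map with a reconstructed nucleotide at every node, "nodes that descended from a node carrying the ancestral nucleotide" is read as nodes whose direct parent carries the ancestral nucleotide, and genotypes correspond to the leaves of the tree. All function and variable names are hypothetical.

```python
from collections import defaultdict

def mutation_frequency(parent, state, ancestral, derived):
    """Return (incidence, frequency) for ancestral -> derived at one site.

    parent: dict node -> parent node (the root maps to None).
    state:  dict node -> nucleotide at the site.
    """
    incidence = 0      # branches on which the mutation occurred
    opportunities = 0  # nodes whose parent carries the ancestral base
    for node, par in parent.items():
        if par is None or state[par] != ancestral:
            continue
        opportunities += 1
        if state[node] == derived:
            incidence += 1
    frequency = incidence / opportunities if opportunities else 0.0
    return incidence, frequency

def back_mutation_frequency(parent, state, mutation_node, ancestral,
                            min_genotypes=50):
    """Proportion of nodes under mutation_node that carry the ancestral
    base again. Returns None unless more than min_genotypes leaves
    (assumed to be genotypes) descend from mutation_node."""
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)
    stack, descendants = list(children.get(mutation_node, [])), []
    while stack:
        current = stack.pop()
        descendants.append(current)
        stack.extend(children.get(current, []))
    leaves = [n for n in descendants if not children.get(n)]
    if len(leaves) <= min_genotypes:
        return None
    reverted = sum(1 for n in descendants if state[n] == ancestral)
    return reverted / len(descendants)

# Hypothetical toy tree: C -> T occurs on the branch leading to "a",
# and "a2" reverts to C.
parent = {"root": None, "a": "root", "b": "root",
          "a1": "a", "a2": "a", "a3": "a", "b1": "b"}
state = {"root": "C", "a": "T", "b": "C",
         "a1": "T", "a2": "C", "a3": "T", "b1": "C"}
incidence, freq = mutation_frequency(parent, state, "C", "T")
back = back_mutation_frequency(parent, state, "a", "C", min_genotypes=2)
skipped = back_mutation_frequency(parent, state, "a", "C")
```

In the toy tree, the threshold is lowered to 2 so that an estimate is returned; with the default of 50, the node is skipped because too few genotypes descend from it.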