Predicting the trend of SARS-CoV-2 mutation frequencies using historical data

Date
2025
Language
American English
Found At
Oxford University Press
Abstract

Motivation: As the SARS-CoV-2 virus rapidly evolves, predicting the trajectory of viral mutations has become a critical yet complex task. A deep understanding of future mutation patterns, in particular the mutations that will prevail in the near future, is vital in steering diagnostics, therapeutics, and vaccine strategies for disease control.

Results: In this study, we developed a model to forecast future SARS-CoV-2 mutation surges in real time, using historical mutation frequency data from the USA. We transformed the temporal prediction problem into a supervised learning problem using a sliding window approach, which breaks the time series of mutation frequencies into very short segments. Considering the time-dependent nature of the data, we focused on modeling the first-order derivative of the mutation frequency. We predicted the final derivative in each segment from the preceding derivatives, employing various machine learning methods, including random forest, XGBoost, support vector machine, and neural network models. With this transformation strategy and the high capacity of machine learning models, we observed low prediction error, confined within 0.1% and 1% when predicting mutation rates 30 and 80 days ahead, respectively. The method also led to a notable increase in prediction accuracy compared to traditional time-series models, as evidenced by much lower MAE (Mean Absolute Error) and MSE (Mean Squared Error) for predictions made over different time horizons. To further assess the method's effectiveness and robustness in predicting patterns for unforeseen mutations, we first designed a synthetic case in which we categorized all mutations into three major patterns. The model accurately predicted unseen mutation patterns when trained on data from two pattern categories and tested on the third, showcasing its potential for forecasting a variety of mutation trajectories. We then applied our method to a recent time frame between 1 January 2025 and 10 June 2025, for both the USA and the UK, where the model was trained on frequency sequence data collected between 12 December 2019 and 26 January 2023 in the USA. The model demonstrated superior performance for both datasets.
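
The sliding window transformation described above can be illustrated with a minimal sketch. It is not the SWD package's API: the function names (make_windows, fit_linear, forecast), the window length of 5, the use of day-to-day differences as the first-order derivative, the clipping of frequencies to [0, 1], and the toy logistic series are all assumptions made for illustration. An ordinary least-squares linear model stands in for the random forest, XGBoost, support vector machine, and neural network models named in the abstract.

```python
import numpy as np

def make_windows(freq, window=5):
    """Split the differenced series into segments of length `window`.

    Returns (X, y): X holds the first window-1 differences of each segment,
    y holds the final difference, which is the prediction target.
    """
    deriv = np.diff(np.asarray(freq, dtype=float))  # discrete first-order derivative
    if len(deriv) < window:
        raise ValueError("series too short for the chosen window")
    segments = np.stack([deriv[i:i + window]
                         for i in range(len(deriv) - window + 1)])
    return segments[:, :-1], segments[:, -1]

def fit_linear(X, y):
    """Least-squares linear model with intercept (stand-in for the ML models)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def forecast(freq, coef, steps, window=5):
    """Predict the next difference repeatedly and accumulate it into frequencies."""
    freq = [float(v) for v in freq]
    deriv = list(np.diff(freq))
    for _ in range(steps):
        x = np.append(deriv[-(window - 1):], 1.0)  # latest differences plus intercept
        d_next = float(x @ coef)
        deriv.append(d_next)
        # Assumption: frequencies are proportions, so clip to [0, 1].
        freq.append(min(1.0, max(0.0, freq[-1] + d_next)))
    return np.array(freq[-steps:])

# Toy usage: a logistic-shaped frequency curve (synthetic, not real data).
days = np.arange(120)
toy_freq = 1.0 / (1.0 + np.exp(-(days - 60) / 10.0))
X, y = make_windows(toy_freq[:90])
coef = fit_linear(X, y)
predicted = forecast(toy_freq[:90], coef, steps=30)
```

Framing the target as the last difference in each segment means any regressor that maps a fixed-length vector to a scalar can be dropped in where fit_linear is used here.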

Availability and implementation: To enhance accessibility and utility, we built our methodology into a GitHub package (https://github.com/ZhouXY199502/SWD). Our method is potentially applicable to other infectious diseases and forecasting tasks, extending its relevance beyond the current COVID-19 pandemic.

Cite As
Zhou X, Yan Y, Hu K, et al. Predicting the trend of SARS-CoV-2 mutation frequencies using historical data. Bioinformatics. 2025;41(10):btaf508. doi:10.1093/bioinformatics/btaf508
Journal
Bioinformatics
Source
PMC
Type
Article
Version
Final published version