The origins of the coronavirus pandemic, first detected in Wuhan, China, remain unknown. Photo / AP
Detective work by a leading American scientist has revealed early sequences of the coronavirus genome were deleted from a key global database at the request of Chinese researchers.
The sequences, which have been recovered from cloud storage and published in a pre-print, have been described by experts as "the most important data" on the origins of Covid-19 in more than a year.
The recovered data does not favour either the "natural origins" or the "lab leak" theory of the pandemic's source, scientists say. However, it suggests the virus was circulating in Wuhan earlier than previously thought, and could perhaps point toward answers on the origins of Sars-CoV-2 - answers that could not only help end this pandemic but prevent the next one.
The emergence of the sequences also suggests there is more data from the early days of the epidemic that China is sitting on, and which may be recoverable by investigators.
The paper was published on the pre-print server bioRxiv by Professor Jesse Bloom, an influenza virus expert at the Fred Hutchinson Cancer Research Center in Seattle, in the United States.
While researching Sars-CoV-2, Professor Bloom found a project by Wuhan University that sequenced samples from 34 positive coronavirus cases from January 2020, and 16 further cases in early February.
The project looked into diagnosing Sars-CoV-2 infection by a technique called nanopore sequencing. Its results, published in March as a pre-print then in June after peer review, remain publicly available.
However, the genomic sequences obtained as part of the study - which were uploaded to the US-maintained Sequence Read Archive (SRA), part of the National Institutes of Health (NIH) - are not.
These sequences - maps of how viruses are built - are critical for scientists studying how the viral genome has changed over time.
But Professor Bloom found that searches on the SRA for this project - identified in other literature as PRJNA612766 - return messages saying it has been removed, something that happens only if SRA staff are asked to take the data down.
The NIH told Britain's Daily Telegraph it had "reviewed the submitting investigator's request to withdraw the data" in June 2020, and removed the data.
"The requestor indicated the sequence information had been updated, was being submitted to another database, and wanted the data removed from SRA to avoid version control issues," a spokesperson said. "Submitting investigators hold the rights to their data and can request withdrawal of the data."
The Telegraph contacted the authors of the Wuhan University study for comment, but they had not replied at the time of publication. They also did not respond to Professor Bloom.
In the paper, Professor Bloom said he could see "no plausible scientific reason for the deletion".
He added: "It therefore seems likely the sequences were deleted to obscure their existence ... This suggests a less than wholehearted effort to trace early spread of the epidemic."
However, the deleted data has not been lost. By searching in the Google and Amazon clouds used for storage by the SRA, the professor found some of the removed genomic sequences.
The deleted data could now help fill in a missing link for scientists puzzled by the evolution of Sars-CoV-2 in people.
Regardless of how it ultimately reached humans, experts agree that Sars-CoV-2 originated in bats. But until now, the earliest known samples from Wuhan - including those taken from the Huanan Seafood Market, originally suspected as the potential source of the outbreak - were genetically further away from the related bat coronaviruses than Sars-CoV-2 samples obtained in other parts of China and even other parts of the world.
The deleted sequences fill in that gap, as they are closer to the bat coronaviruses, and so could represent an earlier step in the evolution of the virus. They add further evidence that the virus was already circulating in humans in Wuhan as early as autumn 2019.
Experts said this was a major development.
Professor Rasmus Nielsen, a genomics expert at the University of California, Berkeley, tweeted: "These are the most important data that we have received regarding the origins of Covid-19 for more than a year."
However, there are also other important implications, as Professor Bloom explained in a lengthy Twitter thread.
"First, [the] fact this dataset was deleted should make us sceptical that all other relevant early Wuhan sequences have been shared," he said, pointing out that China ordered many labs to destroy early samples.
Chinese scientists were also under orders to get their publications on coronavirus checked and centrally approved - an order that came one day before the publication of the pre-print linked to the deleted samples.
"The second major implication is that it may be possible to obtain additional information about early spread of Sars-CoV-2 in Wuhan even if efforts for more on-the-ground investigations are stymied," said Professor Bloom, who is "cautiously optimistic" that there is more information in the cloud or other databases. He was a recent co-author of a letter to Science calling for more investigations into the pandemic's origins.
Other scientists agree.
Professor Stuart Neil, a virologist at King's College London, told the Telegraph: "This is prima facie evidence that there were certain authorities in China wanting to obfuscate some of the early evidence of what was going on in a way to maybe cloud the issue a little bit.
"And the big question is what were they trying to obfuscate. There is really important forensic molecular epidemiology that has to be done to try and trace the origins of this."