Explaining: "A Timeframe for SARS-CoV-2 Genomes: A Proof of Concept for Postmortem Interval Estimations"

Suppose you work as a coroner and find some tissue sample with SARS-CoV-2 virus in it. Could you assign a date to the sample just by knowing the sequence of the virus found? We tested the waters to check if it's possible at least in theory, and maybe in a real scenario.

Lets talk about our job in this paper, focusing on the programming part.

Starting point

A lot of SARS-CoV-2 virus sequences has been collected since the start of the pandemics in early 2020. Today we have huge databases with millions of sequences and each of them has known a date of collection. Those sequences can be arranged in evolutionary trees, like the one in Fig.1.

Fig. 1. A tree of similar species
Fig. 1. A tree of similar species

Instead of five leafs, the SARS-CoV-2 tree has millions, so we also have specialized software to insert new sequences in the tree, close to the most similar sequences.

With these tools we proceeded to test our hypothesis.

Is it even possible?

My first thoughts were that even if the idea seems cool, it could be too costly in terms ot computing time. It's time to roll the sleves up and give it a try.

I downloaded the SARS-CoV-2 database from the NCBI with about 2 million dated sequences, and selected a random sequence to play with UShER (the program to insert sequences in the tree) and check the output.

The first thing I need to do was to align this sequence with the SARS-CoV-2 reference sequence, using mafft. This gives a Fasta file with both sequences running along each other in sync (i.e. matching on similar bases), like shown in Fig.2

Fig. 2: A small part of an alignment
Fig. 2: A small part of an alignment

UShER doesn't accept Fasta files as input, but Vcf format. Transforming one into the other is easy and quick, but I had to do some back and forth tests to make UShER completely happy with the input.

The main use of UShER is to quickly place your sequence in a huge tree. The program can be tuned to give you a number of different output files, and we was interested in a subtree with only the n closest sequences. We tried 10, 20 and finally settled with 40 neighbors using the -k parameter:

$ usher -k 40

This gives you a file called 'subtree.nh', the nh extension referring to Newick format, used to store trees. In this file you can find the mectioned 40 neighbors with your query sequence among them. The next thing we need to do is to extract the identifiers of the 40 sequences and retrieve their collection date from the database. Within these 40 neighbors some are closer (less differences) and some further (more differences) from our queried sequence, but even a raw average of all the dates gives a pretty good aproximation to the collection date we are looking for.

The above three steps took around 30 seconds in a modest CPU, so it seemed affordable. Yay!

Scaling it a bit

Once the pieces were in place, I did the same 20.000 times more. The number is arbitrary, and comes from the following logic: 30 seconds per sample per CPU in a machine with 6 CPUs, it takes 100.000 seconds, 1.700 minutes, about 28 hours. So I could get a reasonable amout of data in a single day.

To do the parallelization I created two programs with Nim: one to select a sequence randomly from the database, align, transform into Vcf and let UShER put it in the tree to get the closest sequences, and another to launch the previous one in parallel in 6 CPUs. The later is incredible easy with Nim, thanks to execProcesses.

Launched it today, and we'll se tomorrow...

The next day

Our 20.000 Newick files with 40+1 sequences each were waiting for us, so the next step is to collect each sequence identifier and its collection date. In a bit of undo work, we had to rearrange each tree and its alignment, into dates with differences (mutations) between each one of the 40 sequences with our sequence:

query
neighbor_1 MUT1,MUT2,MUT3,MUT4
neighbor_2 MUT1,MUT5
...
neighbor_40 MUT1,MUT2,MUT4,MUT15

This table is now passed downstream to Jacobo, who statistically asigns the best date range for the query sequence. More details about this process in the paper.

Fig 3. Estimation errors between our datation and the real date, in days.
Fig 3. Estimation errors between our datation and the real date, in days.

Things I would do differently

Wrapping up

Now we know that hypotetically you could find a decaying corpse and check for SARS-CoV-2 presence. If it exists, a sequencing of the virus could give you a very reasonable date range for the death event. Although for now it's only on paper, the whole effort is worthy.

Remember this is a simplified explanation of only part of the paper. I encourage you to read the whole paper and send any questions or comments you might have to me (regarding the above text) or the corresponding author.