Software

SPSmart (SNPs for Population Studies) is a tool for accessing and combining large-scale genomic databases of single nucleotide polymorphisms (SNPs) in widespread use in human population genetics.

We have created a fast pipeline that creates and maintains a data mart from the most commonly accessed databases of genotypes containing population information: data are mined, summarized into the standard statistical reference indices, and stored in a relational database that currently handles as many as 4 × 10^9 genotypes and can be easily extended to new database initiatives. We have also built a web interface to the data mart that allows the underlying data to be browsed by population and populations to be combined, making comparison of population groups intuitive and straightforward. All the information served is optimized for web display, and most of the computations are pre-processed in the data mart to speed up data browsing and any computational treatment requested.

SPSmart allows populations to be combined into user-defined groups, while multiple databases can be accessed and compared in a few simple steps from a single query. It performs the queries rapidly and gives straightforward graphical summaries of SNP population variability through visual inspection of allele frequencies outlined in standard pie-chart format. In addition, full numerical description of the data is output in statistical results panels that include common population genetics metrics such as heterozygosity, Fst and In.
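As an illustration of the metrics mentioned above, the following minimal Python sketch computes expected heterozygosity and Fst for a single biallelic SNP from per-population allele frequencies. It uses the textbook definition Fst = (H_T - H_S) / H_T; the exact estimators used by SPSmart are not stated here, and the function names and frequencies are assumptions made for the example.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(freqs, sizes=None):
    """Textbook Fst = (H_T - H_S) / H_T for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    sizes: optional sample sizes used as weights (equal weights if omitted).
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    # Frequency in the pooled (total) population
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = expected_heterozygosity(p_bar)
    # Mean within-population heterozygosity
    h_s = sum(w * expected_heterozygosity(p) for w, p in zip(weights, freqs))
    if h_t == 0.0:
        return 0.0  # monomorphic SNP: no differentiation to measure
    return (h_t - h_s) / h_t


# Hypothetical frequencies for two user-defined population groups
print(fst([0.2, 0.8]))
```

Identical frequencies in all populations give Fst = 0, and populations fixed for different alleles give Fst = 1.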

This interface is a genetic variant site explorer able to retrieve data for single nucleotide variation (SNV) from different populations using data from the 1000 Genomes project. It is able to handle >7.3 billion genotypes and 28 million SNVs, and to derive summary statistics of interest for medical and population genetics applications.

The whole dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system allows the combination and comparison of each available population sample, while searching by rs-number list, chromosome region, or genes of interest. Frequency and FST filters are available to further refine queries, while results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as HapMap or Perlegen.

ENGINES is capable of accessing large-scale variation data repositories in a fast and comprehensive manner. It allows quick browsing of whole genome variation, while providing statistical information for each variant site such as allele frequency, heterozygosity or FST values for genetic differentiation.

ENGINES: click here to access the data mart generating scripts and the web interface.

mitPower is a tool for estimating the statistical power of case-control disease association studies. Several utilities are available: (i) a posteriori estimation of the statistical power, (ii) the sample size needed to reach a given statistical power, and (iii) the minimum deviation from the null hypothesis (of no association) detectable under a given statistical power (expressed as OR and haplogroup frequency in cases). It also offers two different calibration procedures, asymptotic and permutation. Note that mitPower is a generalized model that works with 2 x k tables; therefore, it can be used with uniparental data (Y-chromosome, mtDNA; either haplogroups or SNPs) or with autosomal diploid data (e.g. autosomal SNPs).
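To make the idea of power estimation concrete, here is a minimal Python sketch that estimates power by simulation for the simplest 2 x 2 case (carrier versus non-carrier of one haplogroup). It is not mitPower's implementation: the function names, the default parameters and the hard-coded asymptotic critical value (chi-square with 1 degree of freedom at alpha = 0.05) are assumptions made for the example.

```python
import random

# Assumption: asymptotic calibration at alpha = 0.05 with 1 degree of freedom
CHI2_CRIT_DF1_ALPHA05 = 3.841459


def chi_square_2xk(cases, controls):
    """Pearson chi-square statistic for a 2 x k table of category counts."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for a, b in zip(cases, controls):
        col = a + b
        if col == 0:
            continue  # an empty category contributes nothing
        exp_a = n_cases * col / n
        exp_b = n_controls * col / n
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat


def case_frequency(p_controls, odds_ratio):
    """Carrier frequency in cases implied by the control frequency and the OR."""
    odds = odds_ratio * p_controls / (1.0 - p_controls)
    return odds / (1.0 + odds)


def simulated_power(p_controls, odds_ratio, n_cases, n_controls,
                    n_sim=2000, crit=CHI2_CRIT_DF1_ALPHA05, seed=1):
    """Fraction of simulated studies whose chi-square statistic exceeds crit."""
    rng = random.Random(seed)
    p_cases = case_frequency(p_controls, odds_ratio)
    hits = 0
    for _ in range(n_sim):
        a = sum(rng.random() < p_cases for _ in range(n_cases))
        c = sum(rng.random() < p_controls for _ in range(n_controls))
        stat = chi_square_2xk([a, n_cases - a], [c, n_controls - c])
        if stat > crit:
            hits += 1
    return hits / n_sim


# Hypothetical study: haplogroup at 20% in controls, OR = 1.5, 300 cases and 300 controls
print(simulated_power(0.2, 1.5, 300, 300, n_sim=500))
```

With an odds ratio of 1 the same function returns an estimate of the type I error rate, which should be close to the nominal alpha.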

Y-BAT is a tool that infers the most likely biogeographic ancestral origin (BGA) of a Y-STR haplotype on a continental scale. It uses the reference database from Purps et al. (2014) for classification. It works on individual profiles or in batch mode (hundreds of profiles can be classified in seconds). It can be very useful to those interested in forensic applications (criminalistics, identification, etc.) or in a molecular anthropological context. BGA estimated at a population scale can be interpreted as the proportions of Y-chromosome ancestry in that population.

Inferring most likely geographical origin

BGA origin of a haplotype

Here we describe a procedure to infer the most likely biogeographical ancestral origin (BGA) of a haplotype. It was originally conceived for mtDNA profiles. We have, however, demonstrated that it is also useful when applied to Y-chromosome or autosomal haplotypes. See Appendix I: PCA-QDA approach in this paper:

Egeland T, Bøvelstad HM, Storvik GO, Salas A: Inferring the most likely geographical origin of mtDNA sequence profiles. Ann Hum Genet 2004, 68(Pt 5):461-471.

For an application to Y-chromosome haplotypes, see:

Toscanini U, Gaviria A, Pardo-Seco J, Gómez-Carballa A, Moscoso F, Vela M, Cobos S, Lupero A, Zambrano AK, Martinon-Torres F et al: The geographic mosaic of Ecuadorian Y-chromosome ancestry. Forensic Sci Int Genet 2018, 33:59-65.

For an application to autosomal haplotypes, see this paper on ichthyosis:

Esperón-Moldes US, Pardo-Seco J, Montalván-Suárez M, Fachal L, Ginarte M, Rodríguez-Pazos L, Gómez-Carballa A, Moscoso F, Ugalde-Noritz N, Ordónez-Ugalde A et al: Biogeographical origin and timing of the founder ichthyosis TGM1 c.1187G > A mutation in an isolated Ecuadorian population. Sci Rep 2019, 9(1):7175.

An R library that can be used to interpret mtDNA mixtures in forensic cases.

T. Egeland and A. Salas, A statistical framework for the interpretation of mtDNA mixtures: forensic and medical applications, PLoS One 6(10) (2011) e26723.

Mixtures of identical mtDNA profiles will not be informative, i.e., a mixture is unidentifiable. It is of interest to estimate the probability that a specific case will not lead to an informative mtDNA mixture.

This probability will obviously depend on the database. Assuming that there are k different profiles with frequencies p1, ..., pk, the probability that a mixture of a random sample of m > 1 profiles will be informative, in the sense that not all are identical, is:

$$1 - \sum_{i=1}^{k} p_i^{m}$$

This probability can also be estimated from simulations. Note that this probability strongly depends on the population group represented by the database and the range of sequence information targeted. For instance, sub-Saharan African lineages are generally more divergent and therefore more informative than European ones, and control region data may show little resolution in, for example, some Native American populations. The calculations can be performed using the R library unseen2, freely available from http://folk.uio.no/thoree/nhap/.
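The calculation can also be sketched in a few lines of Python. If m profiles are drawn independently with frequencies p1, ..., pk, the chance that all m are the same profile is the sum of p_i^m, so the chance of an informative mixture is one minus that sum. The sketch below computes this value exactly and by simulation; it is not the unseen2 code, and the function names and the frequencies are assumptions made for the example.

```python
import random


def prob_informative(freqs, m):
    """Exact probability that m randomly drawn profiles are not all identical."""
    return 1.0 - sum(p ** m for p in freqs)


def prob_informative_sim(freqs, m, n_sim=20000, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    profiles = range(len(freqs))
    informative = 0
    for _ in range(n_sim):
        # Draw m profiles with replacement according to the database frequencies
        draw = rng.choices(profiles, weights=freqs, k=m)
        if len(set(draw)) > 1:
            informative += 1
    return informative / n_sim


# Hypothetical database with three profiles
freqs = [0.5, 0.3, 0.2]
print(prob_informative(freqs, 2))
print(prob_informative_sim(freqs, 2))
```

A database dominated by one common profile gives a low probability, which matches the remark above that less diverse populations yield fewer informative mixtures.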

Haplogrep

Software to classify mitochondrial DNA haplotypes into haplogroups. It has been developed in collaboration with Hansi Weissensteiner, Schönherr and collaborators in Innsbruck and Prof. Hans-Jürgen Bandelt from Hamburg.

This is a highly cited paper, and Haplogrep is the reference software worldwide for classifying human haplotypes into haplogroups. See, e.g., Google Scholar.

H. Weissensteiner, D. Pacher, A. Kloss-Brandstätter, L. Forer, G. Specht, H.-J. Bandelt, F. Kronenberg, A. Salas, and S. Schönherr, HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing, Nucleic Acids Res. 44 (2016) W58-W63.

Mitochondrial DNA (mtDNA) profiles can be classified into phylogenetic clusters (haplogroups), which is of great relevance for evolutionary, forensic and medical genetics. With the extensive growth of the underlying phylogenetic tree summarizing the published mtDNA sequences, the manual process of haplogroup classification would be too time-consuming. The previously published classification tool HaploGrep provided an automatic way to address this issue. Here, we present the completely updated version HaploGrep 2 offering several advanced features, including a generic rule-based system for immediate quality control (QC). This allows detecting artificial recombinants and missing variants as well as annotating rare and phantom mutations. Furthermore, the handling of high-throughput data in the form of VCF files is now directly supported. For data output, several graphical reports are generated in real time, such as a multiple sequence alignment format, a VCF format and extended haplogroup QC reports, all viewable directly within the application. In addition, HaploGrep 2 generates a publication-ready phylogenetic tree of all input samples encoded relative to the revised Cambridge Reference Sequence. Finally, new distance measures and optimizations of the algorithm increase accuracy and speed up the application. HaploGrep 2 can be accessed freely and without any registration at http://haplogrep.uibk.ac.at.

Figure 1 in Weissensteiner et al. (2016)

Kinship testing using IBD parameters

There is a large number of applications where family relationships need to be determined from DNA data. In forensic science, competing ideas are in general verbally formulated as the two hypotheses of a test. For the most common paternity case, the null hypothesis states that the alleged father is the true father against the alternative hypothesis that the father is an unrelated man. A likelihood ratio is calculated to summarize the evidence. We propose an alternative framework whereby a model and the hypotheses are formulated in terms of parameters representing identity-by-descent probabilities. There are several advantages to this approach. Firstly, the alternative hypothesis can be completely general. Specifically, the alternative does not need to specify an unrelated man. Secondly, the parametric formulation corresponds to the approach used in most other applications of statistical hypothesis testing and so there is a large theory of classical statistics that can be applied. Theoretical properties of the test statistic under the null hypothesis are studied.

More information here: García-Magariños M, Egeland T, López-de-Ullibarri I, Hjort NL, Salas A: A parametric approach to kinship hypothesis testing using identity-by-descent parameters. Stat Appl Genet Mol Biol 2015, 14(5):465-479.
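To show what a formulation in terms of identity-by-descent probabilities looks like, the Python sketch below computes the likelihood of a pair of genotypes at one marker as a mixture over sharing 0, 1 or 2 alleles identical by descent. The notation kappa = (k0, k1, k2), the assumptions of Hardy-Weinberg proportions and no mutation, and all names and allele frequencies are choices made for this example; the sketch does not reproduce the test statistic studied in the paper.

```python
from itertools import product


def genotype_prob(g, freqs):
    """Hardy-Weinberg probability of an unordered genotype g = (a, b)."""
    a, b = g
    return freqs[a] ** 2 if a == b else 2.0 * freqs[a] * freqs[b]


def pair_likelihood(g1, g2, freqs, kappa):
    """P(g1, g2 | kappa), where kappa = (k0, k1, k2) are the probabilities
    that the two individuals share 0, 1 or 2 alleles identical by descent."""
    k0, k1, k2 = kappa
    s1, s2 = sorted(g1), sorted(g2)
    # IBD = 0: the two genotypes are independent draws
    p0 = genotype_prob(g1, freqs) * genotype_prob(g2, freqs)
    # IBD = 1: one shared allele x, plus independent alleles y and z
    p1 = 0.0
    for x, y, z in product(freqs, repeat=3):
        if sorted((x, y)) == s1 and sorted((x, z)) == s2:
            p1 += freqs[x] * freqs[y] * freqs[z]
    # IBD = 2: the genotypes must be identical
    p2 = genotype_prob(g1, freqs) if s1 == s2 else 0.0
    return k0 * p0 + k1 * p1 + k2 * p2


# Hypothetical marker with two alleles
freqs = {"A": 0.1, "B": 0.9}
child, alleged_father = ("A", "B"), ("A", "A")
paternity = pair_likelihood(child, alleged_father, freqs, (0.0, 1.0, 0.0))
unrelated = pair_likelihood(child, alleged_father, freqs, (1.0, 0.0, 0.0))
print(paternity / unrelated)  # likelihood ratio, parent-child versus unrelated
```

In this parameterization the classical paternity case corresponds to kappa = (0, 1, 0) and an unrelated man to kappa = (1, 0, 0), while other relationships, for example full siblings with kappa = (0.25, 0.5, 0.25), are simply other points in the same parameter space.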

CovidPhy

A tool for phylogeographic analysis of SARS-CoV-2 variation.

We have designed CovidPhy, a web interface that can process SARS-CoV-2 genome sequences in plain fasta text format or provided through identity codes from the Global Initiative on Sharing Avian Influenza Data (GISAID) or GenBank. CovidPhy aggregates information available on the large GISAID database (>1.49 M genomes). Sequences are first aligned against the reference sequence and the interface provides different sources of information, including automatic classification of genomes into a pre-computed phylogeny and phylogeographic information, haplogroup/lineage frequencies, and sequencing variation, indicating also if the genome contains known variants of concern (VOC). Additionally, CovidPhy allows searching for variants and haplotypes introduced by the user and includes a list of genomes that are good candidates for being responsible for large outbreaks worldwide, most likely mediated by important superspreading events, indicating their possible geographic epicenters and their relative impact as recorded in the GISAID database.

Bello X, Pardo-Seco J, Gomez-Carballa A, Weissensteiner H, Martinon-Torres F, Salas A. 2022. CovidPhy: A tool for phylogeographic analysis of SARS-CoV-2 variation. Environ Res 204: 111909.