Martin Steinegger: Metagenomic data analysis ‘on steroids’

Invitation

Sep 18, 2018, 13:00

Tuesday, September 18, 2018

in the New Auditorium (4012) at 3:00 p.m. (15:00h)

MARTIN STEINEGGER
MPI for Biophysical Chemistry, Göttingen, Germany

will give a seminar with the title:
Metagenomic data analysis ‘on steroids’

Sequencing costs have dropped much faster than Moore's law would predict over the past decade, and sensitive sequence searching has become the main bottleneck in the analysis of large metagenomic datasets. While previous search methods sacrificed sensitivity for speed, MMseqs2[1] is as sensitive as BLAST, more sensitive than PSI-BLAST, and 400 times faster. It can annotate 1.1 billion sequences in 8.3 hours on 28 cores and thus offers great potential to increase the fraction of annotatable (meta)genomic sequences.

Clustering protein sequences can considerably reduce the redundancy of sequence sets and the costs of downstream analysis and storage. We present Linclust[2], a method that can cluster sequences down to 50% pairwise sequence similarity and whose run time scales linearly with the input set size, rather than nearly quadratically as in conventional algorithms. We clustered 1.6 billion metagenomic sequence fragments to 50% sequence identity in 10 hours on a single server, more than 1000 times faster than previously possible.
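To give a feel for why linear scaling matters at this size, the following minimal sketch compares the number of pairwise comparisons in a naive all-against-all scheme with a scheme that performs a fixed number of comparisons per sequence. The constant ASSUMED_PER_SEQUENCE and both function names are hypothetical and chosen only for illustration; this is a counting exercise and does not describe how Linclust works internally.

```python
def all_pairs(n):
    """Comparisons needed if every sequence is compared with every other one."""
    return n * (n - 1) // 2


def linear_comparisons(n, per_sequence):
    """Comparisons needed if each sequence is compared with a fixed number of others."""
    return n * per_sequence


# Input size taken from the abstract: 1.6 billion sequence fragments.
N_SEQUENCES = 1_600_000_000

# Hypothetical constant, for illustration only (not a figure from the abstract).
ASSUMED_PER_SEQUENCE = 20

quadratic = all_pairs(N_SEQUENCES)
linear = linear_comparisons(N_SEQUENCES, ASSUMED_PER_SEQUENCE)

print(f"all-against-all comparisons: {quadratic:.3e}")
print(f"fixed-per-sequence comparisons: {linear:.3e}")
print(f"ratio: {quadratic / linear:.1e}")
```

Under these assumptions the two counts differ by several orders of magnitude, which is why a quadratic method becomes impractical long before a linear one does.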

Assembling short reads into longer contigs is critical for abundance analysis and for taxonomic and functional annotation. The open-source de novo Protein-Level ASSembler Plass[3] (https://plass.mmseqs.org) assembles six-frame-translated sequencing reads into protein sequences. It recovers 2 to 10 times more protein sequences from complex metagenomes and can assemble huge datasets. We assembled two redundancy-filtered reference protein catalogs: 2 billion sequences from 640 soil samples (SRC) and 292 million sequences from 775 marine eukaryotic metatranscriptomes (MERC), the largest free collections of protein sequences.
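As an illustration of what six-frame translation of a read means, here is a minimal sketch using the standard genetic code: three reading frames on the forward strand and three on the reverse complement. The function names are hypothetical, and the sketch makes no claim about how Plass itself extracts or filters the translated fragments.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in the order TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(read):
    """Return the six translations of a read: forward frames 0-2, then reverse frames 0-2."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


for frame in six_frame_translate("ATGGCCTAA"):
    print(frame)
```

Each read thus yields six candidate protein fragments, and stop codons appear as '*' within them.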

[1] Steinegger M. and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017)

[2] Steinegger M. and Soeding J. Clustering huge protein sequence sets in linear time. Nature Communications, doi: 10.1038/s41467-018-04964-5 (2018)

[3] Steinegger M., Mirdita M. and Soeding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. bioRxiv, doi: 10.1101/386110 (2018)