Research | Sumit Walia

Compressive Pangenomics using PanMANs

Nature Genetics (Accepted) | GitHub | Wiki | bioRxiv

Scientific motivation:
Genomics is moving from studying a single reference genome to analyzing entire pangenomes that represent thousands of individuals. These datasets are scientifically powerful, but they are extremely large and difficult to store and process efficiently.

What PanMAN enables:
PanMAN introduces a compact representation of large genomic populations using mutation-annotated network structures that preserve meaningful evolutionary and biological information while dramatically shrinking data size.

Overall contributions of this work:

Developed a novel data structure and file format to store shared mutational and evolutionary information at scale (millions of genome sequences).
Enabled storage and analysis of ultra-large genomic datasets by reducing memory and storage requirements by more than 600X compared to common formats.
Preserved biological interpretability while making large-scale pangenomic studies computationally practical.
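
The idea of storing shared mutations instead of full sequences can be illustrated with a toy sketch. This is not the PanMAN file format or API: the `Node` class, the `reconstruct` function, and the restriction to single-base substitutions on one tree are assumptions made only for illustration. Each node keeps only its differences from its parent, and a full sequence is recovered by replaying those differences from the root.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node that stores only differences from its parent."""
    name: str
    parent: Optional["Node"] = None
    # Substitutions relative to the parent: (0-based position, new base).
    mutations: List[Tuple[int, str]] = field(default_factory=list)


def reconstruct(node: Node, root_sequence: str) -> str:
    """Rebuild a node's sequence by replaying mutations from the root down."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    seq = list(root_sequence)
    for ancestor in reversed(path):  # root first, target node last
        for pos, base in ancestor.mutations:
            seq[pos] = base
    return "".join(seq)


# Toy population: B and C share the mutation stored once on their parent A.
ROOT_SEQ = "ACGTACGT"
root = Node("root")
a = Node("A", parent=root, mutations=[(2, "T")])
b = Node("B", parent=a, mutations=[(5, "G")])
c = Node("C", parent=a, mutations=[(0, "C")])

print(reconstruct(b, ROOT_SEQ))
print(reconstruct(c, ROOT_SEQ))
```

In this toy, three derived genomes are described by three substitutions plus one root sequence. Savings of this kind grow when many closely related genomes share ancestry.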

Ultrafast & Ultralarge Phylogenetic Tree Construction using DIPPER

Nature Computational Science (Under Review) | GitHub | Wiki | bioRxiv

Scientific motivation:
Phylogenetic trees help scientists understand how species, pathogens, and viral strains evolve. However, traditional tools do not scale well to the millions of genomes now commonly generated in large-scale studies.

What DIPPER enables:
DIPPER is designed to construct very large phylogenetic trees extremely quickly, supporting real-time biological discovery and global-scale genomic surveillance.

Overall contributions of this work:

Delivered a high-performance GPU-accelerated tool capable of scaling to ultra-large datasets (up to 10 million sequences).
Introduced memory-efficient strategies (such as divide-and-conquer and on-the-fly distance computation) that allow trees to be built without exceeding limited GPU memory.
Achieved up to 40X speedup while improving memory efficiency by up to 6X over state-of-the-art tools, expanding what is computationally feasible in evolutionary genomics.
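
On-the-fly distance computation can be sketched in a few lines. This is a CPU-only toy, not DIPPER's GPU implementation: the function names `p_distance` and `closest_pair`, and the choice of a simple mismatch fraction as the distance, are assumptions for illustration. The point is that each pairwise distance is computed when it is needed and then discarded, so the full N x N distance matrix is never held in memory.

```python
from typing import List, Tuple


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs: List[str]) -> Tuple[int, int, float]:
    """Find the closest pair without materializing a distance matrix.

    Only the best candidate seen so far is kept, so the extra memory used
    is constant no matter how many sequences are scanned.
    """
    best = (-1, -1, float("inf"))
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            d = p_distance(seqs[i], seqs[j])  # computed on demand, then dropped
            if d < best[2]:
                best = (i, j, d)
    return best


seqs = ["AAAA", "AAAT", "TTTT", "AATT"]
print(closest_pair(seqs))
```

The trade-off is recomputation: a distance that is needed twice is computed twice. That is often an acceptable price when the full matrix would not fit in device memory.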

Ultrafast & Ultralarge Multiple Sequence Alignment using TWILIGHT

ISMB'25 | Bioinformatics | GitHub | Wiki

Scientific motivation:
Multiple sequence alignment (MSA) is a cornerstone of genomic analysis, but aligning millions of sequences traditionally requires enormous compute time and resources.

What TWILIGHT enables:
TWILIGHT makes it possible to perform massive, high-quality multiple sequence alignments efficiently, supporting applications in evolution, disease tracking, and large-scale biological discovery.

Overall contributions of this work:

Introduced a heterogeneous CPU–GPU execution pipeline that scales alignment to previously impractical dataset sizes (up to millions of sequences).
Utilized parallel processing, asynchronous data transfer, and dynamic load balancing to maximize throughput.
Delivered over 50X speedup while maintaining high alignment accuracy, making large-scale MSA much more practical for real research workflows.

High-Performance Genome Sequence Alignment using TALCO

HPCA'24 | GitHub

Scientific motivation:
Genome sequence alignment is fundamental in bioinformatics, yet most approaches struggle to simultaneously achieve high accuracy, high speed, and energy efficiency.

What TALCO enables:
TALCO introduces a tiling-based alignment strategy leveraging the convergence of traceback pointers to deliver highly accurate alignments with exceptional computational efficiency.

Overall contributions of this work:

Designed a hardware accelerator for the TALCO alignment strategy and deployed it on AWS F1 cloud infrastructure.
Demonstrated approximately 2000X improvement in throughput-per-Watt compared to leading CPU/GPU methods.
Recognized as an HPCA Best Paper Nominee, underscoring its impact and significance.