Project With Lloyd Elliott

Fast Compression and Data Structures for Genetics

In the past decade, data from new genomics consortia involving half a million or more subjects have become available to researchers. The scale of these consortia greatly surpasses that of all previous studies. Advanced software such as bgenie and plink2 is designed to discover genome-wide associations quickly at this scale. However, inefficiencies in the file formats used by this software are amplified by the size of the data, and improvements to these file formats could greatly reduce the cost of these studies.

In this project, the student will research methods to improve random access and compression of genetic file formats. This will involve developing pre-seeded dictionaries for compression libraries such as zlib and Zstandard. The student will also design new database formats for compressed matrices that allow random access on both the rows and the columns of the matrix. Experience with C/C++, compression specifications, or genetic file formats such as bed or bgen would be useful.
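
As a rough illustration of the two ideas above, the following Python sketch stores a small genotype matrix as independently compressed square tiles, each compressed by zlib with a shared pre-seeded dictionary, so that a single row or column can be read by decompressing only the tiles it crosses. This is a minimal sketch under stated assumptions, not a proposed format: the tiling scheme, the tile size, the class name TiledMatrix, the helper names, and the contents of SEED_DICT are all hypothetical and chosen only for illustration. The only library calls used are zlib.compressobj and zlib.decompressobj with their zdict argument.

```python
import random
import zlib

# Hypothetical seed dictionary: runs of the genotype codes 0, 1 and 2.
# A real dictionary would be built from representative genotype data.
SEED_DICT = bytes([0] * 16 + [1] * 16 + [2] * 16)


def compress_with_dict(block: bytes, zdict: bytes) -> bytes:
    """Compress one block with zlib, priming the window with zdict."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(block) + comp.flush()


def decompress_with_dict(payload: bytes, zdict: bytes) -> bytes:
    """Decompress a block that was compressed with the same zdict."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(payload) + decomp.flush()


class TiledMatrix:
    """Matrix of small integers (0..255) stored as compressed tiles.

    Assumption: square tiles of edge length `tile`, each stored row-major
    and compressed independently, so rows and columns cost the same to read.
    """

    def __init__(self, rows, zdict: bytes, tile: int = 4):
        self.n_rows, self.n_cols = len(rows), len(rows[0])
        self.tile, self.zdict = tile, zdict
        self.tiles = {}
        for r0 in range(0, self.n_rows, tile):
            for c0 in range(0, self.n_cols, tile):
                raw = bytes(
                    rows[r][c]
                    for r in range(r0, min(r0 + tile, self.n_rows))
                    for c in range(c0, min(c0 + tile, self.n_cols))
                )
                self.tiles[(r0 // tile, c0 // tile)] = compress_with_dict(raw, zdict)

    def _width(self, tc: int) -> int:
        # Tiles on the right edge may be narrower than `tile`.
        return min(self.tile, self.n_cols - tc * self.tile)

    def row(self, r: int) -> list:
        """Read row r by decompressing only the tiles in one tile row."""
        tr, lr = divmod(r, self.tile)
        out = []
        for tc in range((self.n_cols + self.tile - 1) // self.tile):
            raw = decompress_with_dict(self.tiles[(tr, tc)], self.zdict)
            w = self._width(tc)
            out.extend(raw[lr * w:(lr + 1) * w])
        return out

    def col(self, c: int) -> list:
        """Read column c by decompressing only the tiles in one tile column."""
        tc, lc = divmod(c, self.tile)
        w = self._width(tc)
        out = []
        for tr in range((self.n_rows + self.tile - 1) // self.tile):
            raw = decompress_with_dict(self.tiles[(tr, tc)], self.zdict)
            out.extend(raw[lc::w])  # every w-th byte starting at the local column
        return out


# Small synthetic example: 10 subjects by 7 variants, genotypes in {0, 1, 2}.
rng = random.Random(0)
genotypes = [[rng.randrange(3) for _ in range(7)] for _ in range(10)]
matrix = TiledMatrix(genotypes, SEED_DICT)
third_row = matrix.row(2)
sixth_col = matrix.col(5)
```

Tile size trades compression ratio against access cost in this sketch: larger tiles give the compressor more context but force more unrelated data to be decompressed per row or column read, and a shared dictionary is one way to recover some of the ratio lost with small tiles.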