Infinidat Blog

Genomics Research and Cancer Treatment - Empowered by Infinidat NAS

How does a storage company help to fight cancer? The International Agency for Research on Cancer estimated that in 2018, there were 18.1 million new cases of cancer and 9.6 million deaths globally due to some form of cancer. Furthermore, statistics show 20% of men and 17% of women will be diagnosed with cancer at some point in their life. Unfortunately, we see it everywhere and know it may impact our families, our friends, or our co-workers.

Doctors and scientists at one of our customer’s research centers are engaged in state-of-the-art clinical and pre-clinical research that provides cancer patients with the most advanced diagnostic and treatment support. In addition to academic teaching and training, doctors collaborate with leading international research groups as well as major pharmaceutical and biotech companies. Numerous clinical studies are conducted for developing new anticancer drugs. Advanced technologies such as Gene Sequencing, Microarrays, Bioinformatics, Molecular Cytogenetics, Stem Cells, and others are employed and continuously improved. Several applied R&D programs are at various stages of development for new treatment modalities.

I recently had an opportunity to visit this customer, where I spoke with a Storage Engineer and one of their Bioinformatics Researchers. We talked about the challenges related to research computing infrastructure in general and data storage in particular. The team runs a combination of Intel- and Nvidia-powered servers with multiple InfiniBox systems to store and process petabytes of genomics data. Like other genomics research organizations, they work with a mix of FASTQ, BAM and VCF files.

Figure 1: DNA Sequencing 
Source: NIH

The human genome consists of three billion DNA base pairs. How does that translate to storage capacity? Well, it depends on what we’re counting. To store just the sequence, each position takes 2 bits to encode one of the four bases (A, C, G or T), so three billion positions need roughly 750 MB. However, when the genome data comes out of a sequencer, it consists of many short, overlapping reads of tiny genome segments, each with per-base accuracy estimates, and therefore requires much bigger files. Right off the genome sequencer, researchers may receive around 200 GB of data for a single human genome! This raw data is stored in a FASTQ file.
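The back-of-the-envelope arithmetic above can be checked with a short Python sketch. The names `packed_size_bytes` and `pack_bases` are made up for illustration and do not come from any genomics library, and the 2-bit encoding assumes only A, C, G and T appear (no ambiguous bases such as N).

```python
# Illustrative sketch: 2-bit packing of DNA bases (A, C, G, T only).
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def packed_size_bytes(num_bases: int) -> int:
    """Bytes needed to store num_bases at 2 bits per base, rounded up."""
    return (num_bases * 2 + 7) // 8


def pack_bases(sequence: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_CODES[base]
        # Left-align a short final chunk so the padding bits are zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


genome_bytes = packed_size_bytes(3_000_000_000)
print(f"{genome_bytes / 1e6:.0f} MB for three billion bases")
print(pack_bases("ACGT").hex())
```

Real FASTQ files are far larger than this lower bound because they hold many overlapping reads plus a quality character for every base.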

Storing thousands of these FASTQ files requires massive storage capacity - and a lot of focus on compression techniques. In the earliest days at Infinidat, we worked on a special host-side compression algorithm for FASTQ files. Many other compression techniques, both lossy and lossless, have since been developed to address this challenge.

Only a fraction of the genome differs between individuals, and these differences represent mutations. It is common to extract them into much smaller VCF files that take up hundreds of megabytes. Other file formats (SAM, BAM, etc.) are also used as the data passes through transformation processes.

Figure 2: Approximate file sizes and times to generate for NGS data formats
Source: NIH

Whatever the format, genomics data is always stored compressed - so any compression or deduplication implemented within the storage system adds little further reduction. Advances in sequencing allow genomic data files to be produced at an ever-increasing rate. Storing this data effectively as the number of sequenced genomes grows is critical to managing storage costs. While sequencing costs are falling, compression gains are limited, so the cost of storing the resulting output files keeps increasing.
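The point about already-compressed data can be illustrated with a small Python sketch. It is only a stand-in: the input is a synthetic random sequence rather than real genomic data, and the standard library's `zlib` plays the role of both the host-side compressor and the storage-side one.

```python
import random
import zlib

# Synthetic stand-in for sequence data; real genomes are not random.
rng = random.Random(42)  # fixed seed so the run is repeatable
raw = "".join(rng.choice("ACGT") for _ in range(200_000)).encode("ascii")

once = zlib.compress(raw, 9)    # compression applied before storing
twice = zlib.compress(once, 9)  # a second pass, as a storage array might try


def ratio(before: bytes, after: bytes) -> float:
    """Size after compression as a fraction of the size before."""
    return len(after) / len(before)


print(f"first pass:  {ratio(raw, once):.2f} of original size")
print(f"second pass: {ratio(once, twice):.2f} of first-pass size")
```

The first pass shrinks the data substantially, while the second pass leaves it at roughly the same size, which is why in-array data reduction contributes little for this kind of workload.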

It is not enough just to store these files efficiently during processing. The United States Food and Drug Administration (US FDA) recommends “to retain data files that maintain the complete features of the raw data... Whereas genomic samples may be destroyed upon participant request, destruction of data contradicts the principles of scientific integrity, particularly in the context of clinical studies.” This calls for storage solutions with enormous capacity, high reliability and support for cross-geography disaster recovery.

Processing large amounts of data is time-consuming. Research institutions deploy distributed compute capacity to run pipelines like GATK and others.

Figure 3: Example of a DNA sequencing workflow
Source: Broad Institute

Some tools within such pipelines, for example BWA and HaplotypeCaller, have high CPU utilization, but most steps are I/O bound. A mix of CPU-based and GPU-based systems (such as SQream DB) is typically used to transform and analyze genomics data. Faster storage may provide additional benefits in more complex scenarios, when many pipelines with different characteristics run at the same time over a large cluster, increasing the number of genomes processed in parallel.

The customer was looking for a high-performance, high-capacity system that could make it easy to support sequencing, analysis and data archiving while keeping storage costs, which are rising across the industry, under control. They found it when they deployed their first InfiniBox over four years ago.

They appreciate the performance, capacity and versatility of Infinidat NAS. The hospital started with InfiniBox for the genomics workloads because it met their requirements while also costing about one-third less per terabyte than competitive alternatives. Over several years, they have also started leveraging InfiniBox to address other needs. For example, when the storage admin was looking for temporary capacity for VMware datastores, he was able to borrow space from the InfiniBox to run this workload alongside the ongoing genomics research, without researchers noticing any impact on performance.

The customer also likes the reliability of the system. For as long as they’ve been a customer, they’ve never experienced any storage-related downtime with the InfiniBox. 

It is always great to meet another happy customer. It’s also great to find out how our scalable NAS solution helps to facilitate cancer research with the hope that it will help give us all a healthier future.

About Gregory Touretsky

Gregory Touretsky (@gregnsk) is a Senior Director, Product Management at INFINIDAT. He drives the company’s roadmap around NAS, cloud and containers. Before that, Gregory was a Solutions Architect with Intel, focused on distributed computing and storage solutions, data sharing and the cloud. He has over twenty years of practical experience with distributed computing and storage. Gregory has an M.S. in Computer Science from Novosibirsk State Technical University and an MBA from Tel-Aviv University.