Big Data Processing in Genomics and Medicine

From terabytes to petabytes, Arvados handles the unique challenges of managing and processing genomic data. Curoverse puts the power of this platform into the hands of research IT organizations, research labs, diagnostic testing labs, and sequencing centers. It empowers bioinformaticians and implements the latest industry standards.


The Arvados data management system, Keep, is well suited to managing genomic data files, including FASTQ, BAM, VCF, and other files created by next-generation sequencing (NGS) and analysis. Keep helps informaticians organize their files into datasets, create canonical references to each dataset, and track the origin and usage of datasets within the system. The system provides both APIs and the ability to interact with datasets through standard POSIX semantics (directories and files).

Keep is ideal for centers running production workflows on whole exomes and genomes, where carefully tracking all the data that flows into and out of the production environment is critical for security and data management practices. The content addressing in Keep creates canonical, cryptographically verifiable references to every file and dataset. Keep can run on a wide range of underlying file systems, including object stores. Finally, Keep delivers the high throughput required for large-scale genomic data processing.
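To make the idea of content addressing concrete, here is a minimal Python sketch. It is not Keep's implementation or API: the `ContentStore` class, the `content_address` function, and the choice of SHA-256 are assumptions made purely for illustration. It shows how a reference derived from the bytes themselves is canonical (identical content always yields the same reference) and verifiable (a reader can recompute the hash and detect tampering).

```python
import hashlib


def content_address(data: bytes) -> str:
    """Derive a reference from the content alone (hypothetical format: hash+size)."""
    return f"{hashlib.sha256(data).hexdigest()}+{len(data)}"


class ContentStore:
    """Toy in-memory store keyed by content address (illustrative only)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        ref = content_address(data)
        # Identical content maps to the same key, so it is stored only once.
        self._blocks[ref] = data
        return ref

    def get(self, ref: str) -> bytes:
        data = self._blocks[ref]
        # Recompute the address on read to verify integrity.
        if content_address(data) != ref:
            raise ValueError("stored data does not match its reference")
        return data


store = ContentStore()
ref = store.put(b"ACGT")
assert store.put(b"ACGT") == ref  # same bytes, same reference
```

Because the reference depends only on the data, two sites holding the same reference can each confirm they hold the same bytes without trusting one another.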

The Arvados containerized workflow engine, Crunch, is an ideal environment for the implementation and execution of production genomic analysis pipelines. It’s already used for common pipelines such as BWA+GATK, bcbio, and other popular tools. Crunch provides a flexible, standards-based way to define pipelines. It supports the use of Docker containers to define stable runtime environments for tools and pipelines. When Crunch runs a pipeline, it automatically provisions nodes in a virtualized environment or an existing HPC cluster. It deploys the Docker containers and tools, manages operations, reports logging and status information, and records each job. As a result, Crunch makes it possible to run multiple instances of a pipeline and record each one for simple and automatic reproducibility. The system also handles pipeline versioning and delivers high-performance throughput for the most demanding genomics requirements.
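The link between recording each job and reproducibility can be sketched in a few lines of Python. This is not how Crunch stores its records: `RunRecord`, its fields, and the sample values are hypothetical. The sketch only illustrates that when a run is described by its container image, command, input references, and pipeline version, two runs with the same description can be recognized as the same computation.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical description of one pipeline run."""

    container_image: str   # pinned runtime environment
    command: tuple         # exact command line that was executed
    input_refs: tuple      # references to the input datasets
    pipeline_version: str  # version of the pipeline definition

    def fingerprint(self) -> str:
        # Serialize with sorted keys so equal records hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


first = RunRecord("example/bwa:pinned", ("bwa", "mem"), ("ref-a", "ref-b"), "v1")
rerun = RunRecord("example/bwa:pinned", ("bwa", "mem"), ("ref-a", "ref-b"), "v1")
updated = RunRecord("example/bwa:pinned", ("bwa", "mem"), ("ref-a", "ref-b"), "v2")

assert first.fingerprint() == rerun.fingerprint()
assert first.fingerprint() != updated.fingerprint()
```

Changing any one element, such as the pipeline version, yields a different fingerprint, which is what makes an audit trail across pipeline versions possible.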

Arvados does not constrain how you work. The platform provides a command line interface (CLI), REST APIs, and an intuitive browser-based UI (“Workbench”). You can use any language to write your tools and applications, and to ease development, Arvados includes SDKs for Python, Perl, R, Ruby, and Go. Moreover, Arvados runs on every major cloud provider and on clusters in your own datacenter.


Arvados is designed for collaboration within a lab and between organizations. Within labs, Arvados projects provide a flexible way to organize datasets and pipelines for collaboration, sharing, and publishing.

Between organizations, Arvados supports a powerful federated computing model. Instead of moving datasets, you can move pipelines between clusters. By providing an environment that cryptographically verifies the integrity of datasets and reliably reproduces pipelines, Arvados supports the reliable, secure distribution of analyses across clusters.

Robust, Flexible Security

Arvados ensures you can enforce flexible security schemas that protect your data. Instead of traditional file- and directory-level Unix-style access control, Arvados applies access controls at the dataset level.

Because datasets are defined without making file copies or creating new directories, this approach is not only flexible but also robust and reliable. The same files can be placed in multiple datasets without a storage cost, and individuals can be given access to those files through permissions set at the dataset level.
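A minimal Python sketch of dataset-level access control follows. The `Dataset` class, the `can_read` helper, and the user and file names are hypothetical and are not the Arvados permission model or API. The sketch assumes a dataset is just a list of references to stored files, so the same file can appear in several datasets without being copied, and access is decided by dataset membership.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset: a named list of file references plus its readers."""

    name: str
    file_refs: list
    readers: set = field(default_factory=set)


def can_read(user: str, file_ref: str, datasets: list) -> bool:
    """A user may read a file if any dataset containing it grants them access."""
    return any(file_ref in d.file_refs and user in d.readers for d in datasets)


# The same file reference appears in two datasets; no copy is made.
cohort = Dataset("cohort", ["sample1.bam", "sample2.bam"], readers={"alice"})
shared = Dataset("shared-subset", ["sample1.bam"], readers={"bob"})
datasets = [cohort, shared]

assert can_read("alice", "sample2.bam", datasets)
assert can_read("bob", "sample1.bam", datasets)
assert not can_read("bob", "sample2.bam", datasets)
```

Granting a collaborator access to a subset of files then amounts to defining a new dataset and setting its permissions, rather than copying files or changing directory modes.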

With Arvados, it’s straightforward to implement a HIPAA-compliant solution.

Precision Medicine and Clinical NGS Diagnostics

Arvados provides an ideal platform for developing clinical NGS diagnostics solutions. The careful tracking of datasets and the reproducibility of pipelines make authoring and deploying CAP/CLIA-compliant tests straightforward. Arvados eases the process of updating pipelines without breaking existing tests, and it keeps a clear audit trail of how tests have changed over time. With this robust provenance, each report delivered to a physician can be reliably traced back to the specific samples, sequencing runs, datasets, and pipelines used to generate the test.

Support for Standards

Curoverse is active in several major standards efforts, including the Global Alliance for Genomics and Health (GA4GH), NIST and CDC efforts to standardize clinical variant representation, the Common Workflow Language (CWL) group, and others. Arvados will soon be an open source reference implementation for these standards efforts. A Curoverse Cluster can be used to create a GA4GH endpoint that serves as a standards-compliant interface to existing HPC systems.

Arvados is also the platform that powers the Harvard Personal Genome Project (PGP) and other PGPs around the world.

Arvados and Bioinformatics Cores

Many institutions already have investments in existing HPC systems for bioinformatics. Arvados can work with these systems to increase utilization and take advantage of the existing compute and storage capacity.

Curoverse Solutions

Curoverse provides solutions with Arvados that are ideal for research centers, bioinformatics cores, pharmaceutical companies, health care systems, and diagnostic labs. We can deploy and operate your Arvados clusters in the cloud or in your datacenter.

Cluster Operation Subscription (COS) offers a set of subscription services that guarantee software maintenance, provide great support and training, and deliver full 24x7x365 remote administration and optimization of clusters. We also provide a wide range of professional services.