Arvados Technology

Open source infrastructure for managing, processing, and
sharing genomic and other biomedical data.

Modern Software Architecture

Arvados is deployed as an integrated, multilayer technology stack built from proven open source software. All the layers work together as a complete solution based on modern distributed computing patterns.

Elastic Infrastructure

Arvados is designed to run on an elastic computing foundation, which can be provided by a cloud or off-the-shelf hardware running virtualization and related services.

System Services

The System Services layer provides the core Arvados services: the Keep data manager and the Crunch job manager.

API

All the services in the system are accessed through a RESTful API, and there are SDKs for Python, Perl, Ruby, Java, and Go.
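
As a brief illustration, here is a minimal sketch of how the Python SDK wraps the REST API. It assumes the arvados Python package is installed and that ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in the environment.

    # List the ten most recently modified collections through the
    # Python SDK (a thin wrapper around the REST API).
    import arvados

    api = arvados.api('v1')

    collections = api.collections().list(
        order=['modified_at desc'],
        limit=10).execute()

    for c in collections['items']:
        print(c['uuid'], c['name'])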

System Interfaces

At the Interface layer, Arvados provides a number of different ways for users and admins to access the capabilities of the system.

Security

Security is woven throughout the system at every layer.

Flexible Data Management

The Arvados data management system, Keep, is a content-addressable storage system. It can manage data on commodity drives or on a wide range of other underlying storage systems, including object/blob stores. Keep does for data what Git does for code.
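
To make content addressing concrete, the sketch below computes a Keep-style block locator: an MD5 digest of the data plus its size. This illustrates the addressing scheme only; it is not a full Keep client.

    # Illustrative sketch: a block's address is derived from its contents
    # (MD5 digest) plus its length, so the same bytes always map to the
    # same locator.
    import hashlib

    def block_locator(data: bytes) -> str:
        return '%s+%d' % (hashlib.md5(data).hexdigest(), len(data))

    print(block_locator(b'GATTACA'))  # identical input, identical address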

Define datasets without folders

Quickly put files into a dataset of any size, without moving or copying them, using dataset management tools instead of folders.

Reliable dataset retrieval

Ensure reliable and durable data retrieval with content addressing that automatically verifies a hash of every file.

API and POSIX semantics

Access data through the API, or mount datasets as network drives and work with file collections using traditional file paths.
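
For example, the same file can be read either through the Python SDK or through an ordinary file path under a FUSE mount created with arv-mount. The collection identifier and the mount point below are placeholders, not real values.

    import arvados.collection

    # 1. Through the API, via the Python SDK.
    reader = arvados.collection.CollectionReader('your-collection-uuid-or-hash')
    with reader.open('reads/sample1.fastq') as f:
        first_line = f.readline()

    # 2. Through a POSIX path, assuming Keep is mounted with arv-mount
    #    (FUSE) at /mnt/keep.
    with open('/mnt/keep/by_id/your-collection-uuid-or-hash/reads/sample1.fastq') as f:
        first_line = f.readline()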

Deduplication

Eliminate duplicate data storage: content addresses are checked automatically on write, so identical data is stored only once.
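
As a sketch of what this means in practice, saving the same bytes twice through the Python SDK yields collections with the same content address, backed by the same blocks in Keep. This assumes a configured client (ARVADOS_API_HOST and ARVADOS_API_TOKEN); the collection names are arbitrary.

    import arvados.collection

    def save_copy(name: str) -> str:
        # Write the same payload into a new collection and return its
        # content address (portable data hash).
        c = arvados.collection.Collection()
        with c.open('data.bin', 'wb') as f:
            f.write(b'the same bytes' * 1000)
        c.save_new(name=name)
        return c.portable_data_hash()

    # Identical content, identical address: no extra blocks are stored.
    print(save_copy('copy one') == save_copy('copy two'))  # True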

Origin and use tracking

Track the origin of datasets and how they are used across the system by recording each pipeline run as metadata.

Fast throughput

Move computations to data and optimize disk access with a single reader and writer for each spindle.

Multi-tier storage

Manage data across different tiers of storage, from production to archive, on premises or in the cloud.

Reproducible Pipeline Processing

The Arvados job manager, Crunch, is a containerized workflow engine that provides a flexible way to define and run computational pipelines, which can be reliably reproduced. It takes advantage of Git, Docker, and other technologies to make life easier.

Pipeline Definition

Define pipelines in an easy-to-use JSON document or script (support for the Common Workflow Language is coming soon).
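
Below is a hypothetical, heavily abbreviated example of such a JSON pipeline definition, built as a Python dictionary and printed as JSON. The field names are illustrative; the exact schema depends on your Arvados version.

    import json

    # One-stage pipeline sketch: an "align" step that runs a script in a
    # Docker image and takes a collection of reads as input (all names
    # here are placeholders).
    pipeline_template = {
        'name': 'Example alignment pipeline',
        'components': {
            'align': {
                'script': 'run-command',
                'script_version': 'master',
                'repository': 'arvados',
                'script_parameters': {
                    'reads': {'dataclass': 'Collection'},
                },
                'runtime_constraints': {
                    'docker_image': 'example/aligner:latest',
                },
            },
        },
    }

    print(json.dumps(pipeline_template, indent=2))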

Docker Orchestration

Use Docker images to define run-time environments for individual jobs.

Compute Provisioning

Let Crunch handle provisioning compute nodes and installing containers and software.

Reproducibility

Reliably reproduce every job and pipeline you run.

Fault Tolerance

Automatically recover from disk and node failures.

Fast Throughput

Move computations to data and optimize disk access.

Portability

Easily move computations between Arvados instances.

Self-service

Run jobs yourself, without needing help managing the cluster.

Status Reporting

Access job status reports during and after job execution.

Optimized Re-running

Save time and money by skipping jobs that don’t need to be re-run.

Launch Applications

Crunch can launch web applications or stand up databases as part of a pipeline.

Scaling

Easily scale jobs to run in parallel on multiple nodes.

Flexible Working Environment

Arvados is designed to provide a highly flexible environment for getting your work done, so it offers a variety of interfaces.

REST APIs

The entire system can be accessed through REST APIs from any programming language.
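
For instance, here is a minimal sketch of calling the REST API directly over HTTP using only the Python standard library; any language with an HTTP client can do the same. The host and token are placeholders for your own cluster's values.

    import json
    import urllib.request

    ARVADOS_API_HOST = 'your.arvados.example.org'   # placeholder
    ARVADOS_API_TOKEN = 'your-api-token'            # placeholder

    req = urllib.request.Request(
        'https://%s/arvados/v1/collections?limit=5' % ARVADOS_API_HOST,
        headers={'Authorization': 'OAuth2 %s' % ARVADOS_API_TOKEN})

    with urllib.request.urlopen(req) as resp:
        for item in json.load(resp)['items']:
            print(item['uuid'], item.get('name'))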

Command Line Interface

If you like working on the command line, that’s always an option.

Software Development Kits (SDKs)

SDKs are available for Python, Perl, Java, Go, and Ruby, with an R SDK coming soon.

Web UI

Workbench is a web application that makes it easy to use Arvados from your browser.

Projects and Metadata

Arvados lets you organize work into projects to make it easier to keep track of the datasets and pipelines you’re using. Everything in the system can be easily tagged with metadata.
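
As a small sketch, metadata can be attached to a collection by updating its properties field through the Python SDK. The UUID, keys, and values below are hypothetical.

    import arvados

    api = arvados.api('v1')

    # Attach simple key/value metadata to an existing collection.
    api.collections().update(
        uuid='your-collection-uuid',
        body={'collection': {
            'properties': {'sample_id': 'S001', 'assay': 'WGS'},
        }}).execute()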

Personal Servers

In a typical configuration, your Arvados cluster will have virtual machines set up for each user, so each user has their own environment for testing and development.

Secure Collaboration and Publishing

Arvados is designed to empower people to securely collaborate, share data, and publish their work.

Collaborate in a Project

Add multiple users to a project so individuals can share and collaborate on the work.

Publish Public Projects

If you want to share your work publicly or provide a URL for the methods in a paper, you can make projects public so that anyone can view them.

Copy Projects Between Clusters

A single command reliably copies every aspect of a project from one Arvados cluster to another.

Authentication

Arvados currently uses OAuth2 for authentication, and it can also be integrated with LDAP and Active Directory.

Federated Computing

If you want to collaborate across clusters, you can move pipelines between environments rather than moving the data itself, which enables secure collaboration without transferring datasets.

Flexible Permissioning

The data manager makes it possible to apply access control permissions at the dataset level, which is much more flexible than traditional file- and directory-level permissions.
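
As a sketch of dataset-level permissioning, read access to a single collection can be granted to a specific user by creating a permission link through the Python SDK. The UUIDs below are placeholders.

    import arvados

    api = arvados.api('v1')

    # Grant one user read access to one dataset (collection).
    api.links().create(body={'link': {
        'link_class': 'permission',
        'name': 'can_read',   # or 'can_write' / 'can_manage'
        'tail_uuid': 'uuid-of-user-receiving-access',
        'head_uuid': 'uuid-of-collection-being-shared',
    }}).execute()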