Data pipelines

ITFoM faces challenging data and compute volumes in both the near term and the long term. In addition, many specialised workflows used in genomics do not exploit large-scale compute infrastructure, owing to a lack of interfacing between domain-specific expertise and computational expertise. In WP4 we will develop a scalable and portable data pipeline system, building on and extending established HPC middleware and storage infrastructures (Partnership for Advanced Computing in Europe, PRACE; EGI.eu; EUDAT). We will also bring in domain expertise from high-throughput genomics and proteomics groups to exploit this infrastructure, again building on and, where necessary, adapting existing systems (ELIXIR, ICGC, VPH, BLUEPRINT).

ITFoM has assembled some of the best-in-class researchers and service providers in this area, spanning high-end computing researchers (such as IBM and Bull), computational workflow experts (such as CERN and UCL) and domain-specific data experts (such as EBI, CNAG, KTH, ETH and UU). The combination of this expertise will allow both the immediate delivery of working pipelines for the early project stages, drawing on pipeline-delivery experience from the 1000 Genomes Project, BLUEPRINT, ICGC, VPH and ENCODE, and the ability to scale to the point where every individual in Europe can have a personalised healthcare plan based on their individual molecular data.

Development of the data management scheme

ITFoM will develop and implement a scalable, state-of-the-art data management system that will be used to store ITFoM data and to implement the data pipelines. Care will be taken to address all issues in the management of patient data (e.g. safe replication, staging, duplication, removal of redundancies, curation, anonymisation, semantic integration and archiving) in compliance with EU and national legislation. In particular, ITFoM will address the issues of storing, managing and deploying heterogeneous data sets in a distributed storage architecture.
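
As an illustrative sketch only (the record fields, staging directory and salt handling are hypothetical assumptions, not part of the ITFoM specification), the following Python fragment shows one way a direct patient identifier could be pseudonymised before data are staged for replication:

    # Minimal sketch: pseudonymisation of patient records before staging.
    # All names (PatientRecord, STAGING_DIR, the salt handling) are illustrative
    # placeholders, not an ITFoM specification.
    import hashlib
    import json
    from dataclasses import dataclass, asdict
    from pathlib import Path

    STAGING_DIR = Path("staging")             # hypothetical staging area
    SITE_SALT = b"replace-with-site-secret"   # per-site secret, never stored with the data


    @dataclass
    class PatientRecord:
        patient_id: str        # direct identifier, must not leave the clinical site
        assay: str             # e.g. "RNA-seq", "proteomics"
        payload_path: str      # path to the raw data file


    def pseudonymise(record: PatientRecord) -> dict:
        """Replace the direct identifier with a salted one-way hash."""
        digest = hashlib.sha256(SITE_SALT + record.patient_id.encode()).hexdigest()
        anon = asdict(record)
        anon.pop("patient_id")
        anon["pseudonym"] = digest[:16]
        return anon


    def stage(record: PatientRecord) -> Path:
        """Write the anonymised metadata to the staging area prior to replication."""
        STAGING_DIR.mkdir(exist_ok=True)
        anon = pseudonymise(record)
        target = STAGING_DIR / f"{anon['pseudonym']}.json"
        target.write_text(json.dumps(anon, indent=2))
        return target


    if __name__ == "__main__":
        print(stage(PatientRecord("PAT-0042", "RNA-seq", "/data/raw/PAT-0042.fastq.gz")))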

To achieve this, ITFoM will collaborate with key European e-infrastructure projects (e.g. EUDAT, VISION Cloud, VPH-Share and ELIXIR, among others) to develop cloud-based infrastructure that enables data pipelines to be run within virtual machines on different cloud infrastructures. Through project partner CERN, ITFoM will draw on experience from the Large Hadron Collider (LHC) to develop a strategy for handling large quantities of raw data. An initial data warehouse and data mining system will be provided by dipSBC (data-integration platform for Systems Biology Collaboration).
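
The cloud-agnostic principle can be sketched as follows; the backend classes and the pipeline image name are illustrative placeholders and do not correspond to the actual APIs of EUDAT, VISION Cloud or VPH-Share:

    # Minimal sketch of a cloud-agnostic launcher: the same virtual-machine image
    # carrying a pipeline can be started on different providers behind one interface.
    from abc import ABC, abstractmethod


    class CloudBackend(ABC):
        @abstractmethod
        def launch_vm(self, image: str, cpus: int, memory_gb: int) -> str:
            """Start a VM from a pipeline image and return an instance handle."""


    class AcademicCloud(CloudBackend):
        def launch_vm(self, image, cpus, memory_gb):
            return f"academic://{image}?cpus={cpus}&mem={memory_gb}"


    class CommercialCloud(CloudBackend):
        def launch_vm(self, image, cpus, memory_gb):
            return f"commercial://{image}?cpus={cpus}&mem={memory_gb}"


    def run_pipeline(backend: CloudBackend, pipeline_image: str) -> str:
        # The pipeline itself is baked into the VM image, so only the
        # resource request changes between infrastructures.
        return backend.launch_vm(pipeline_image, cpus=16, memory_gb=64)


    if __name__ == "__main__":
        for backend in (AcademicCloud(), CommercialCloud()):
            print(run_pipeline(backend, "itfom/rnaseq-pipeline:1.0"))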

Development of the virtualisation infrastructure

Both the near-term and long-term goals of ITFoM demand a robust, practical workflow in which data items flow through the system with multiple quality-control checks. Current pipelines are mixtures of formalised workflows running in a variety of batch-processing environments. To share these first across ITFoM and then more broadly, we will have to virtualise many of these components; often an entire current pipeline will become a single component of a "meta-pipeline" (for example, RNA-seq processing, which is itself a multi-stage pipeline, will be one node in a meta-pipeline for patient data processing).
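
The meta-pipeline notion can be illustrated with a minimal sketch; the stage functions below are placeholders, and real stages would wrap virtualised pipeline components rather than trivial dictionary updates:

    # Minimal sketch of the "meta-pipeline" idea: an existing multi-stage pipeline
    # (here, a toy RNA-seq workflow) is wrapped as a single callable node inside a
    # larger patient-data pipeline. Stage names and the QC check are illustrative.
    from typing import Callable, List

    Stage = Callable[[dict], dict]


    def compose(stages: List[Stage]) -> Stage:
        """Chain stages into one callable; the result can itself be used as a stage."""
        def pipeline(data: dict) -> dict:
            for stage in stages:
                data = stage(data)
            return data
        return pipeline


    # Inner pipeline: RNA-seq processing (itself multi-stage, with a QC check).
    def align(data):       return {**data, "aligned": True}
    def quantify(data):    return {**data, "counts": "gene_counts.tsv"}
    def qc_check(data):
        if not data.get("aligned"):
            raise ValueError("QC failure: alignment missing")
        return data

    rnaseq_pipeline = compose([align, qc_check, quantify])

    # Meta-pipeline: the whole RNA-seq pipeline is just one node among others.
    def ingest(data):      return {**data, "ingested": True}
    def integrate(data):   return {**data, "model_input": True}

    patient_meta_pipeline = compose([ingest, rnaseq_pipeline, integrate])

    if __name__ == "__main__":
        print(patient_meta_pipeline({"patient": "pseudonym-1a2b"}))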

ITFoM will leverage and deploy existing HPC middleware solutions, developed e.g. in PRACE, EGI.eu and DEISA. In close collaboration with EGI.eu, ITFoM will create a deployment framework to ensure the portability and quality assurance of the methods and pipelines as they move from a European research setting to national healthcare systems.
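
One quality-assurance aspect of such a deployment framework can be sketched as follows; the manifest fields are assumptions made for illustration, not an EGI.eu or PRACE specification:

    # Minimal sketch of a deployment check: before a virtualised pipeline is moved
    # from a research setting to another site, its manifest (pinned version and
    # image checksum) is verified against the values recorded at validation time.
    import hashlib
    from dataclasses import dataclass


    @dataclass
    class PipelineManifest:
        name: str
        version: str
        image_path: str        # path to the exported VM/container image
        expected_sha256: str   # checksum recorded when the pipeline was validated


    def verify(manifest: PipelineManifest) -> bool:
        """Return True only if the shipped image matches the validated checksum."""
        sha = hashlib.sha256()
        with open(manifest.image_path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                sha.update(chunk)
        return sha.hexdigest() == manifest.expected_sha256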

Development of domain specific pipelines

ITFoM will develop data pipelines specific to the different data generation approaches by virtualising existing state-of-the-art pipelines for i) reference-based nucleic acid sequencing, ii) de novo nucleic acid sequencing, iii) mass spectrometry proteomics, iv) affinity proteomics, v) metabolomics, vi) high-throughput microscopy image analysis and vii) analysis of genetic variation. To achieve this, partners with suitable pipelines running at their institutes were systematically included in the Consortium.
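
As an illustrative sketch, the domain-specific pipelines could be exposed behind a common registry keyed by data type; the registry and the two example entries below are placeholders standing in for the virtualised pipelines named above:

    # Minimal sketch of dispatching a sample to the pipeline registered for its
    # data type. The keys and pipeline functions are illustrative placeholders.
    from typing import Callable, Dict

    PIPELINES: Dict[str, Callable[[str], str]] = {}


    def register(data_type: str):
        def wrap(fn):
            PIPELINES[data_type] = fn
            return fn
        return wrap


    @register("reference_sequencing")
    def reference_sequencing(sample: str) -> str:
        return f"variants for {sample}"


    @register("mass_spec_proteomics")
    def mass_spec_proteomics(sample: str) -> str:
        return f"protein abundances for {sample}"


    def process(data_type: str, sample: str) -> str:
        # Look up and run the pipeline registered for this data type.
        return PIPELINES[data_type](sample)


    if __name__ == "__main__":
        print(process("reference_sequencing", "pseudonym-1a2b"))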

Methods for data-specific compression

Recent work on compressing DNA sequence data has shown that a domain-specific compression framework can achieve 100- to 1000-fold better compression than the straightforward application of generic compression. This is mainly achieved by transforming the data to make the high implicit redundancy of biological data explicit, since there is only a small amount of variation between individuals. We will extend this compression scheme to exploit the growing understanding of structural variation and complex index structure in the human genome, and to adapt to new technologies, such as very long reads.
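
The underlying idea can be sketched in a few lines: store only an individual's differences from a shared reference, which is what makes the redundancy explicit. Real reference-based schemes, which must handle reads, quality scores, indels and structural variation, are considerably more elaborate than this toy substitution-only example:

    # Minimal sketch of reference-based encoding: only positions that differ
    # from the reference are stored; decoding replays them onto the reference.
    from typing import List, Tuple

    Diff = Tuple[int, str]  # (position, substituted base)


    def encode(sequence: str, reference: str) -> List[Diff]:
        """Record only the positions where the individual differs from the reference."""
        return [(i, b) for i, (a, b) in enumerate(zip(reference, sequence)) if a != b]


    def decode(diffs: List[Diff], reference: str) -> str:
        seq = list(reference)
        for pos, base in diffs:
            seq[pos] = base
        return "".join(seq)


    if __name__ == "__main__":
        reference = "ACGTACGTACGT"
        individual = "ACGTACCTACGT"   # single substitution at position 6
        diffs = encode(individual, reference)
        assert decode(diffs, reference) == individual
        print(diffs)   # [(6, 'C')]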

Evaluation on test cases

The domain-specific, virtualised pipelines will be tested using personalised medicine data from specified test cases.