Data analysis pipelines
The SOCIBP genomic pipeline is a collection of containerized analysis modules that process genomic data from the sequencing platforms, panels, and sample types commonly used in cancer genomics, in either clinical or research settings (Figure). The pipeline design follows current best practices in genomic research and data reproducibility, including the ICGC ARGO best practices for scientific workflow development. Our priorities at this stage are 1) portability (easy deployment of the modules on different servers regardless of their architecture or job scheduler) and 2) reproducibility (processes run in the same way and produce the same results from the same input on any computing platform). Currently, the SOCIBP pipeline is deployed and being tested on the LeoMed server (Zurich) and the Insel HPC (Bern).
Over the past 12 months, we have substantially improved variant filtering in our whole-exome and amplicon-based panel sequencing data analysis pipelines. We have also introduced several value-added analysis modules, especially for the whole-exome data analysis pipeline. These modules cover the detection of microsatellite instability, homologous recombination deficiency, and allele-specific copy number alterations, as well as the analysis of clonality and tumor mutational signatures. For the amplicon-based panel sequencing data analysis pipelines, we have added a new module to detect gene fusions from RNA panel sequencing data.
Modularity is the main characteristic of the SOCIBP pipeline. It allows components to be developed more efficiently and only the modules needed for a specific analysis to be deployed, depending on data types and goals. Currently, the SOCIBP modules are run as independent versioned tools, which is sufficient for expert usage. Our next goal is to develop a versioned umbrella wrapper (written in Nextflow and/or Snakemake) that handles the execution of a single module or a series of modules. This would make the SOCIBP pipeline much easier to use and would facilitate its adoption in hospitals and research centers.
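To illustrate the wrapper concept, the following is a minimal sketch in Python rather than Nextflow or Snakemake, and it is not the SOCIBP implementation. The module names, versions, image names, the dependency relations between modules, the `--sample` argument, and the use of `docker run` as the container launcher are all assumptions made for illustration. The sketch only builds the list of commands (a dry run) and does not execute anything.

```python
from dataclasses import dataclass

WRAPPER_VERSION = "0.1.0"  # hypothetical version of the wrapper itself


@dataclass(frozen=True)
class Module:
    name: str
    version: str          # each module is versioned independently
    image: str            # hypothetical container image name
    requires: tuple = ()  # assumed prerequisites (illustrative only)


# Hypothetical registry: every name, version, and image is a placeholder.
REGISTRY = {
    m.name: m
    for m in [
        Module("variant_filtering", "1.2.0", "socibp/variant-filtering"),
        Module("msi", "0.3.1", "socibp/msi", requires=("variant_filtering",)),
        Module("signatures", "0.2.0", "socibp/signatures",
               requires=("variant_filtering",)),
    ]
}


def resolve_order(selected, registry=REGISTRY):
    """Return the selected modules plus prerequisites, prerequisites first."""
    ordered, seen = [], set()

    def visit(name, stack=()):
        if name in stack:
            raise ValueError(f"circular dependency involving {name!r}")
        if name in seen:
            return
        if name not in registry:
            raise KeyError(f"unknown module {name!r}")
        for dep in registry[name].requires:
            visit(dep, stack + (name,))
        seen.add(name)
        ordered.append(registry[name])

    for name in selected:
        visit(name)
    return ordered


def build_plan(selected, sample_id, registry=REGISTRY):
    """Dry run: build one container command per module without executing it."""
    return [
        # "--sample" is a hypothetical module argument, not a real flag.
        ["docker", "run", "--rm", f"{m.image}:{m.version}", "--sample", sample_id]
        for m in resolve_order(selected, registry)
    ]


if __name__ == "__main__":
    print(f"wrapper version {WRAPPER_VERSION}")
    for cmd in build_plan(["msi", "signatures"], "SAMPLE_001"):
        print(" ".join(cmd))
```

Pinning each module to an explicit version in the registry, and versioning the wrapper separately, means that a given wrapper release always resolves to the same set of module versions, which is one way to support the reproducibility goal stated above. On an HPC system, the launcher would likely differ from the `docker run` call assumed here.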
In summary, the SOCIBP pipeline could help bridge many complex sources of genomic data (whole-genome, whole-exome, targeted exome, and targeted amplicon-based panel sequencing).