

Next-Generation Sequencing (NGS) is becoming a routine approach in most domains of the life sciences. To ensure reproducible results, there is a crucial need to improve the automation of data processing for forthcoming studies relying on big datasets. Although user-friendly solutions are available, there remains a strong need for accessible solutions that allow experimental biologists to analyze and explore their results in an autonomous and flexible way. The protocols describe a modular system that enables the user to compose and fine-tune workflows based on SnakeChunks, a library of rules for the Snakemake workflow engine (Köster and Rahmann, 2012). They are illustrated by a study combining ChIP-seq and RNA-seq to identify target genes of the global transcription factor FNR in Escherichia coli (Myers et al., 2013), with the advantage that results can be compared with the most up-to-date collection of existing knowledge about transcriptional regulation in this model organism, extracted from the RegulonDB database (Gama-Castro et al., 2016).

Next-generation sequencing technologies enable the characterization of gene regulation at an unprecedented scale. Transcription factor binding can be characterized at a genome scale by ChIP-seq, whereas RNA-seq makes it possible to quantify all the transcripts. The analysis of sequenced reads requires a number of successive bioinformatics processing steps, organized into workflows. A workflow, or pipeline, is defined as a chain of commands and tools applied to a set of data files, so that the output of a given step is used as input for the subsequent one (Figure 1).

Figure 1. Schematic wiring of a basic workflow for ChIP-seq analysis.

Ideally, the experimental design should already take into account the bioinformatics analyses that will enable the extraction of relevant information from the raw data. Biological samples are subject to variation, and replicates are thus essential to estimate the statistical significance of the final results and to ensure a tradeoff between sensitivity and specificity. It is also necessary, as in any other biological experiment, to carefully define the control conditions that will distinguish signal from noise (see section Commentary for more details).

Processing the data with properly implemented bioinformatics workflows (with a comprehensive specification of the tools, their versions and the selected parameters) is crucial to ensure that the results can be traced back to, and reproduced from, the raw data. A workflow also makes it possible to apply identical treatments to dozens of samples, using powerful computing infrastructures when necessary. Snakemake (Köster and Rahmann, 2012) is a software tool conceived for building such workflows. Based on the Python language, it inherits concepts from GNU make: a workflow is defined by a set of rules, each defining an operation characterized by its inputs, outputs and parameters, and by a list of target files to be generated through these operations.
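The make-like model used by Snakemake (rules described by their inputs, outputs and parameters, plus a list of target files) can be illustrated with a short, self-contained Python sketch. This is a toy illustration of the general idea of working backwards from targets to the rules that produce them; it is not Snakemake's implementation and does not use Snakemake's syntax or API. The `Rule` class, the `plan` function, and all rule and file names are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A make-like rule: an operation described by its inputs, outputs and parameters."""
    name: str
    inputs: list
    outputs: list
    params: dict = field(default_factory=dict)


def plan(rules, targets, existing):
    """Return the names of the rules to run, in order, to produce `targets`.

    `existing` is the set of files already available (e.g. raw reads).
    """
    # Map each output file to the rule that produces it.
    producer = {out: rule for rule in rules for out in rule.outputs}
    ordered, done = [], set()

    def visit(path, stack=()):
        if path in existing:
            return  # nothing to do, the file is already there
        if path not in producer:
            raise ValueError(f"no rule produces {path!r}")
        rule = producer[path]
        if rule.name in done:
            return  # rule already scheduled
        if rule.name in stack:
            raise ValueError(f"cyclic dependency through rule {rule.name!r}")
        # Schedule everything this rule depends on before the rule itself.
        for dep in rule.inputs:
            visit(dep, stack + (rule.name,))
        done.add(rule.name)
        ordered.append(rule.name)

    for target in targets:
        visit(target)
    return ordered


# Hypothetical three-step ChIP-seq chain; rule and file names are illustrative only.
# The rules are deliberately listed out of order: the plan is derived from the files.
rules = [
    Rule("call_peaks", ["sample.bam"], ["sample_peaks.bed"], {"qvalue": 0.05}),
    Rule("trim", ["sample.fastq"], ["sample_trimmed.fastq"], {"min_length": 20}),
    Rule("align", ["sample_trimmed.fastq", "genome.fa"], ["sample.bam"]),
]

steps = plan(rules, targets=["sample_peaks.bed"], existing={"sample.fastq", "genome.fa"})
print(steps)  # ['trim', 'align', 'call_peaks']
```

The order of execution is never written down explicitly: it follows from matching each rule's inputs to the outputs of other rules, which is why the output of one step can serve as the input of the next without the user chaining the commands by hand.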
