Orchestrating Hi-C analysis with Bioconductor
Author(s): Jacques Serizay
Affiliation(s): Pasteur Institute, Paris; Institut de Biologie de l'Ecole Normale Supérieure, Paris
Hi-C is a chromosome conformation capture method used to comprehensively detect chromatin interactions based on spatial proximity. Over the last decade, it has become a prevalent approach in nuclear biology (gene regulation, genome spatial reorganization, genome rearrangements, etc.), and also in a wide range of more distant fields such as medical biology, microbiology, environmental biology, genome assembly, biophysics and, more recently, synthetic biology. A Hi-C experiment yields genome-wide raw interaction counts for pairs of genomic loci. These counts can be provided either at base-pair resolution (stored in so-called “pairs” text files, in which each line describes a single interaction measured between two genomic loci) or binned at a chosen resolution (stored in symmetric, sparse “matrix” binary files, in which consecutive rows/columns correspond to consecutive genomic bins). Several Bioconductor packages provide statistical methods to investigate Hi-C data. However, they share several limitations: (1) import methods for disk-stored Hi-C matrices or pairs files are still lacking, (2) no standard class has been defined to represent Hi-C data and (3) existing packages rarely operate directly on GInteractions instances. This lack of a standard methodology for handling Hi-C data has impeded its integration into the Bioconductor ecosystem. In the HiCExperiment R package, we implement a representation of Hi-C data built on top of the fundamental BiocFile and GInteractions Bioconductor classes. We provide import methods to parse any of the three main Hi-C file formats currently in use (`.(m)cool`, `.hic` and HiC-Pro derived files) into HiCExperiment instances. These import methods support random access, parsing only the requested subsets of the disk-stored, indexed genome-wide contact matrices.
We further extend pre-existing Bioconductor methods to efficiently coerce a HiCExperiment object into either a (sparse) numeric matrix or a list of pairs of genomic loci. We also provide two companion packages: (1) HiCool, which sets up a self-contained, R-managed conda environment to process Hi-C data from raw reads to contact matrices, and (2) HiContacts, an analysis package that performs common Hi-C operations. Finally, we maintain two gateway packages, fourDNData and DNAZooData, which expose Hi-C contact matrices and other genomic files generated by the 4DN and DNA Zoo consortia, respectively, to the end user.
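
To make the two data representations described above concrete, here is a minimal, language-agnostic sketch written in Python. It aggregates base-pair-resolution "pairs" records into sparse, binned counts. This is not the HiCExperiment API: the function name `bin_pairs`, the four-field record layout and the example coordinates are all hypothetical, chosen only for illustration.

```python
from collections import Counter

def bin_pairs(pairs, resolution):
    """Aggregate base-pair-resolution pairs into sparse binned counts.

    `pairs` is an iterable of (chrom1, pos1, chrom2, pos2) records
    (a hypothetical, simplified layout of a "pairs" file line).
    Returns a Counter keyed by ((chrom, bin), (chrom, bin)).
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        # Map each locus to the index of the genomic bin that contains it.
        a = (chrom1, pos1 // resolution)
        b = (chrom2, pos2 // resolution)
        # The contact matrix is symmetric, so keep only one triangle:
        # order the two bins so that (a, b) and (b, a) share one key.
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts

# Four made-up interactions, one per "pairs" line.
pairs = [
    ("chrI", 1200, "chrI", 8400),
    ("chrI", 1900, "chrI", 8100),
    ("chrI", 8700, "chrI", 1500),
    ("chrI", 300, "chrII", 4100),
]

binned = bin_pairs(pairs, resolution=2000)
for (bin1, bin2), n in sorted(binned.items()):
    print(bin1, bin2, n)
```

Only bin pairs with at least one observed interaction are stored, which is why binned Hi-C matrices are naturally sparse. Real matrix formats additionally index these records on disk so that a genomic region can be read without loading the whole matrix.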