Forensics for Multi-Omics Data Generation: Diagnosing and Resolving Mislabeled Samples by Integrating Multiple Data Sources

Author(s): Ryan Conrad Thompson, Matt Johnson, Diane Marie Del Valle, Edgar Gonzalez-Kozlova, Seunghee Kim-Schulze, Kai Nie, Eric Vornholt, Lora Liharska, Brian Kopell, The Mount Sinai COVID-19 Biobank Team, Eric E Schadt, Miriam Merad, Sacha Gnjatic, Carlos Estevez, Alex W Charney, Noam D Beckmann

Affiliation(s): Icahn School of Medicine at Mount Sinai

Social media: https://twitter.com/DarwinAwdWinner

Mislabeling events, which inevitably occur at varying rates in any study involving human handling of samples, pose a serious, multi-layered threat to the integrity of biological findings. In the best-case scenario, they occur at random points during sample collection, processing, or data generation, through a wide range of mechanisms, and introduce random noise that lowers overall study power. Alternatively, they can arise in a systematic fashion, which makes it possible to pinpoint weak links in the data generation process but also introduces systematic bias into data analyses. Thus, it is essential to eliminate mislabeling events as part of any quality control process, before performing discovery analyses. This can be achieved either by inferring the correct sample-specific labels or by discarding data from mislabeled samples that cannot be corrected. Further, in the context of human subjects research, mislabeled samples can become part of a larger ethical problem: when subjects withdraw from a study, the study team must confirm the identity of every sample and data point from those subjects to ensure proper handling of their data. Multi-omics data provide a unique and powerful way to address this problem, because the complexity and redundancy of multiple measures on a single sample or subject can be leveraged to identify, and often correct, mislabeling events. Here, we will present a novel framework to efficiently identify and correct mislabeling events, as well as to determine their probable cause, in the Mount Sinai COVID-19 Biobank, a repository of blood and other samples from 1000+ subjects hospitalized with COVID-19 and controls in New York City, for which a large array of longitudinal multi-omics assays was performed and clinical data were collected.
This framework includes: matching sample genotypes between NGS- and genotype array-based assays; a novel visualization of these genotype matches and mismatches that simplifies manual annotation of mislabeled samples; and corroboration of genotype-based inferences of mislabeling using other data and metadata such as Olink, ELISA, clinical data, and sample quality control information. We will also discuss practical issues that arise in the process of correcting mislabeled samples for use in downstream analyses, including: inferring how and when a mislabeling event occurred (and why it matters); organizing information on mislabeled samples and their corrected labels to be both human-readable and amenable to programmatic manipulation; and handling mislabeled samples from subjects who withdraw from the study. We are currently developing a software package to encapsulate our framework for correcting mislabeled samples in other multi-omics data sets.
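To make the genotype-matching step concrete, here is a minimal Python sketch of one way such a comparison could work. It is not the software package mentioned above, and every name in it is hypothetical. It assumes genotypes are encoded as alternate-allele counts (0, 1, 2) with -1 for a missing call, that both assays have been restricted to the same ordered set of variants, and that a concordance threshold of 0.9 is a reasonable cutoff; none of these choices comes from the abstract.

```python
import numpy as np


def concordance_matrix(geno_a, geno_b, missing=-1):
    """Fraction of jointly called variants with identical genotypes.

    geno_a: (n_a, n_variants) calls from one assay (e.g. NGS-derived).
    geno_b: (n_b, n_variants) calls from another assay (e.g. genotype array).
    Returns an (n_a, n_b) matrix; pairs with no jointly called variant are NaN.
    """
    geno_a = np.asarray(geno_a)
    geno_b = np.asarray(geno_b)
    out = np.full((geno_a.shape[0], geno_b.shape[0]), np.nan)
    for i, row_a in enumerate(geno_a):
        called = (row_a != missing) & (geno_b != missing)
        same = (row_a == geno_b) & called
        n_called = called.sum(axis=1)
        np.divide(same.sum(axis=1), n_called, out=out[i], where=n_called > 0)
    return out


def flag_mismatches(labels_a, labels_b, conc, threshold=0.9):
    """List samples whose best genotype match carries a different subject label.

    Each finding is (recorded label, label of best match, concordance).
    The threshold is an illustrative assumption, not a recommended value.
    """
    findings = []
    for i, label in enumerate(labels_a):
        if np.all(np.isnan(conc[i])):
            continue  # nothing to compare against for this sample
        j = int(np.nanargmax(conc[i]))
        best = float(conc[i, j])
        if best >= threshold and labels_b[j] != label:
            findings.append((label, labels_b[j], best))
    return findings


# Toy data: three subjects on the array, three NGS samples in which the
# labels of S2 and S3 have been swapped.
array_labels = ["S1", "S2", "S3"]
array_geno = [
    [0, 1, 2, 0, 1, 2, 0, 1],
    [2, 2, 0, 1, 0, 0, 1, 2],
    [1, 0, 1, 2, 2, 1, 2, 0],
]
ngs_labels = ["S1", "S2", "S3"]
ngs_geno = [
    [0, 1, 2, 0, 1, 2, 0, -1],  # truly S1, one missing call
    [1, 0, 1, 2, 2, 1, 2, 0],   # labeled S2, genotype of S3
    [2, 2, 0, 1, 0, 0, 1, 2],   # labeled S3, genotype of S2
]

conc = concordance_matrix(ngs_geno, array_geno)
findings = flag_mismatches(ngs_labels, array_labels, conc)
for recorded, matched, score in findings:
    print(f"sample labeled {recorded} best matches {matched} ({score:.2f})")
```

A reciprocal pair of findings, as in this toy data, suggests a simple swap between two samples. In practice, a genotype-based call like this would still need to be corroborated against the other data and metadata described above before a label is corrected.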