Supervised learning approaches are not limited to population identification: they can also support the exploration and interpretation of single-cell data, making full use of the breadth and depth delivered by single-cell technologies.
Since the term was coined at the end of the 19th century, cytometry has gone through multiple iterations driven by advances in technology. Now an integral part of research and clinical practice, the “measurement of number and characteristics of cells” is linked to key developments in our understanding of cancer and autoimmune diseases, and in our ability to treat them with increasing efficacy. The development of CyTOF and single-cell RNAseq has changed cytometry in an unprecedented way: the number of features that can be measured for every cell now far exceeds our ability to analyze them with classical methods such as manual gating.
To address this challenge, increasingly complex population identification solutions were developed, breathing new life into concepts such as cell types and cellular states. As all downstream analysis depends on a representation of the data, one must strike the right balance between the identification of known populations and the discovery of new states related to the condition of the patient or to treatment. Recent advances in machine learning have helped scale up and refine population identification, but without addressing this underlying trade-off.
To help with the identification of biologically relevant cellular states, we implemented a supervised method that learns differences in cell composition between conditions, bypassing the reliance on population identification. At the other end of the scale, the shift of cells from one cellular state to another is the process underlying most disease- or treatment-related events. Understanding those coordinated changes is required for the correct interpretation of experimental results, yet available statistical methods treat those changes as independent. We used random forest, a tried-and-tested supervised learning technique, and compositional biplots to identify relevant populations and support the interpretation of multiple changes in the immune system detected upon infection in patients.
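To illustrate why changes in cell composition cannot be treated as independent, the sketch below applies the centered log-ratio (CLR) transform, the standard preprocessing step behind compositional biplots, to two samples. It is a minimal illustration and not the implementation used in this work: the function name `clr`, the pseudocount, and the proportions are all hypothetical.

```python
import math


def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's population proportions.

    The pseudocount (an assumption of this sketch) avoids log(0) for
    populations that are absent from a sample.
    """
    shifted = [p + pseudocount for p in proportions]
    total = sum(shifted)
    closed = [p / total for p in shifted]          # re-close to sum to 1
    logs = [math.log(p) for p in closed]
    mean_log = sum(logs) / len(logs)               # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical proportions of three cell populations in one patient,
# before and after an event such as infection.
before = [0.60, 0.30, 0.10]
after = [0.40, 0.30, 0.30]

clr_before = clr(before)
clr_after = clr(after)

# Change of each population relative to the sample's geometric mean.
shift = [a - b for a, b in zip(clr_after, clr_before)]
print(shift)
```

Because proportions sum to one, the expansion of the third population necessarily comes with a relative decrease elsewhere. The second population's raw proportion is unchanged, yet its CLR coordinate still moves, since every coordinate is expressed relative to the whole composition.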