Specialty project - Simulations for detecting modules of bacterial communities using deep variational autoencoders
Biological and clinical context: A typical human intestinal microbiome contains several kilograms of bacteria and around 100 times more genes than the human genome. It has co-evolved with us throughout our history and, in recent years, the intimate connection between our gut flora and our health has emerged as a central theme in medicine. Metagenomics allows us to study the gut microbiome by directly sequencing its genomic DNA, without prior cultivation in the lab, which yields large sequence datasets that can be processed with computational methods. For example, machine learning techniques can be used to study the association between the genetic content of the bacterial community and a phenotype (e.g. a disease), an approach known as a metagenome-wide association study (MWAS). While these associations are usually sought at the gene or species level, it is well known that a given bacterium is involved in several molecular functions and, conversely, that the same molecular function can be carried out by different species.
Goal: We propose here to create simulations to investigate the inference of associations between modules and health status using deep dimensionality reduction.
We propose to discover modules in the abundance signals of genes in metagenomic samples through deep learning. More precisely, we will design a simulation that mimics functional modules and bacterial communities, and try to recover them using deep variational autoencoders. Tests on real gut microbiota data can be carried out if time allows, since the data are available and already preprocessed by the Bioconductor package.
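As a rough illustration of what such a simulation could look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not part of the project specification: the function name `simulate_abundances`, the one-module-per-species structure, the gamma-distributed module activities, and the multinomial model of sequencing reads.

```python
import numpy as np

def simulate_abundances(n_samples=50, n_species=30, n_modules=3,
                        depth=10_000, seed=0):
    """Simulate a samples x species count matrix driven by latent modules.

    Hypothetical generative model, for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Assumed module structure: each species belongs to exactly one module.
    membership = np.arange(n_species) % n_modules
    loadings = np.zeros((n_modules, n_species))
    loadings[membership, np.arange(n_species)] = rng.uniform(0.5, 1.5, n_species)
    # Non-negative activity of each module in each sample.
    activity = rng.gamma(shape=1.0, scale=1.0, size=(n_samples, n_modules))
    # Expected relative abundances: linear admixture, normalized per sample.
    expected = activity @ loadings + 1e-9
    expected /= expected.sum(axis=1, keepdims=True)
    # Imperfect sampling: a finite number of reads is drawn per sample.
    counts = np.vstack([rng.multinomial(depth, p) for p in expected])
    return counts, activity, loadings

# Shallow sequencing makes low-abundance species vanish from some samples.
counts, activity, loadings = simulate_abundances(depth=200)
dropout = (counts == 0).mean()  # fraction of (sample, species) pairs with no reads
```

Varying `depth` and the overlap between modules gives two simple knobs for the noise and modularity levels that a recovery method would be tested against.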
Looking for modules in abundance data: Technically, we propose to use mixture models and deep variational autoencoders to reduce the high-dimensional matrix of strain abundances across samples to a lower-dimensional space whose dimensions are interpreted as functional modules of associated bacteria. We plan to first investigate simple linear admixtures (using techniques such as sparse non-negative matrix factorization), as they are easy to implement and to interpret: the participation of a bacterial species in a module is defined as its mixture coefficient. However, the sampling of metagenomic data is imperfect, so species with low abundance can go completely undetected. Because a substantial minimum abundance of bacteria is needed to detect a signal in the metagenomic data, defining modules with linear mixtures can be problematic. Autoencoders (possibly deep) offer a more powerful approach to dimensionality reduction by applying positive, non-linear activation functions to weighted combinations of their inputs. Their weights are, however, not straightforward to interpret, and in silico sampling of the input space is necessary to infer the communities associated with a module (technically, with the activation of a given neuron in the middle layer of the autoencoder). You will design a simulation plan and investigate up to what level of noise, and for what degree of modularity, the different models remain robust and efficient at detecting functional modules.
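The linear admixture baseline can be sketched as follows. This is a minimal illustration, assuming a Frobenius-norm loss and the classical multiplicative update rules, with an optional L1 penalty (the hypothetical `l1` parameter) on the module matrix to encourage sparsity. The function name `nmf` and the toy data are made up for the example. In practice a library implementation would likely be preferable.

```python
import numpy as np

def nmf(X, n_modules, n_iter=500, l1=0.0, seed=0, eps=1e-9):
    """Factor X (samples x species) as W @ H with W >= 0 and H >= 0.

    Multiplicative updates for the squared Frobenius loss; l1 > 0 adds an
    L1 penalty on H to favour sparse module memberships.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_species = X.shape
    W = rng.uniform(0.1, 1.0, (n_samples, n_modules))   # module activity per sample
    H = rng.uniform(0.1, 1.0, (n_modules, n_species))   # mixture coefficients
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + l1 + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: two non-overlapping modules of three species each, no noise.
rng = np.random.default_rng(1)
true_H = np.zeros((2, 6))
true_H[0, :3] = 1.0
true_H[1, 3:] = 1.0
true_W = rng.gamma(1.0, 1.0, (40, 2))
X = true_W @ true_H

W, H = nmf(X, n_modules=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
# Column j of H holds the mixture coefficients of species j in each module.
dominant_module = H.argmax(axis=0)
```

Here the rows of `H` play the role of modules and each entry is the mixture coefficient of a species. Replacing the noise-free `X` with subsampled counts is one way to probe how quickly this interpretation degrades for low-abundance species.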
Techniques involved: Deep learning, data science, Python/R, Keras/PyTorch.
Contact: Clovis Galiez email@example.com
External references:
- Gilbert et al., Nature Medicine 2018, https://doi.org/10.1038/nm.4517
- Pasolli et al., Nature Methods 2017, https://doi.org/10.1038/nmeth.4468
- https://en.wikipedia.org/wiki/Autoencoder