IRL - Exploiting clustering for bilingual-rooted spoken term selection


Exploiting clustering for bilingual-rooted spoken term selection

Lab: LIG


The goal of this project is to introduce the student to machine learning research through the study of different clustering techniques applied to the task of spoken term selection.

The data used for this project is a collection of clusters, each associated with one French word. A cluster contains a group of different phoneme sequences (word realizations) in a different language. For each (French word, phoneme sequence) pair, there is a confidence metric that can be exploited.

The project will investigate different features (such as edit distance, length, confidence, etc.) for filtering the contents of these clusters, keeping only the N most meaningful phoneme sequences.
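As an illustration only, the filtering step could be sketched as follows. The sketch assumes that a cluster is available as a list of (phoneme sequence, confidence) pairs with confidence in [0, 1], and it ranks each sequence by a weighted mix of its confidence and its closeness (normalized edit distance) to the other sequences of the same cluster. The data layout, the names `edit_distance` and `select_top_n`, the weight `alpha`, and the scoring formula itself are all assumptions made for this example, not part of the project specification.

```python
from typing import List, Sequence, Tuple

# Assumed layout: a phoneme sequence is a tuple of symbols, paired with a confidence.
Realization = Tuple[Tuple[str, ...], float]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def select_top_n(cluster: List[Realization], n: int,
                 alpha: float = 0.5) -> List[Realization]:
    """Keep the n best-scoring realizations of one cluster.

    Hypothetical score: alpha * confidence + (1 - alpha) * closeness, where
    closeness is 1 minus the mean length-normalized edit distance to the
    other members of the cluster.
    """
    scored = []
    for i, (seq, conf) in enumerate(cluster):
        others = [s for j, (s, _) in enumerate(cluster) if j != i]
        if others:
            mean_dist = sum(
                edit_distance(seq, o) / max(len(seq), len(o), 1) for o in others
            ) / len(others)
        else:
            mean_dist = 0.0
        scored.append((alpha * conf + (1 - alpha) * (1 - mean_dist), seq, conf))
    # Sort on the score only, so ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(seq, conf) for _, seq, conf in scored[:n]]
```

Other features mentioned above, such as sequence length, could be added as further terms of the score; which combination works best is exactly what the project is meant to investigate.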

A second objective of this project is concept clustering. For instance, a query for the word “aimer” should return not only the phoneme sequences from the “aimer” cluster, but also those from “adorer” and other related words.
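A minimal sketch of such a query is given below. It assumes that groups of related French words are already known and that clusters are stored as a dictionary from French word to a list of phoneme sequences. How the groups are obtained is left open here, since that is the research question. The names `build_concept_index` and `query` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

PhonemeSeq = Tuple[str, ...]


def build_concept_index(concept_groups: Iterable[Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each word to the set of all words sharing a concept group with it."""
    index: Dict[str, Set[str]] = {}
    for group in concept_groups:
        members = set(group)
        for word in members:
            index.setdefault(word, set()).update(members)
    return index


def query(word: str,
          clusters: Dict[str, List[PhonemeSeq]],
          concept_index: Dict[str, Set[str]]) -> List[PhonemeSeq]:
    """Return the phoneme sequences of the word's cluster and of related clusters."""
    related = concept_index.get(word, {word})
    results: List[PhonemeSeq] = []
    # The queried word comes first, then its related words in alphabetical order.
    for w in [word] + sorted(related - {word}):
        results.extend(clusters.get(w, []))
    return results
```

With a group containing “aimer” and “adorer”, a query for “aimer” returns the sequences of both clusters, while words outside the group are left out.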

Finally, we ask the student to implement a tool for visualizing the obtained clusters.

Key activities

Data pre-processing and feature extraction. Investigation of different clustering techniques for the selection of term candidates. Experimentation with classical machine learning algorithms for clustering. Implementation of a visualization platform.


Good programming skills in Python are required. Knowledge of machine learning libraries is a plus, but not mandatory. The student must be able to read (and fully understand) documents in English; speaking English is not necessary. The final document may also be written in French.