Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes

Authors: Kevin Glocker, Aaricia Herygers, Munir Georges


Link: https://www.isca-speech.org/archive/interspeech_2023/glocker23_interspeech.html


This paper proposes Allophant, a multilingual phoneme recognizer. It requires only a phoneme inventory for cross-lingual transfer to a target language, allowing for low-resource recognition. The architecture combines a compositional phone embedding approach with individually supervised phonetic attribute classifiers in a multi-task architecture. We also introduce Allophoible, an extension of the PHOIBLE database. When combined with a distance-based mapping approach for grapheme-to-phoneme outputs, it allows us to train on PHOIBLE inventories directly. By training and evaluating on 34 languages, we found that the addition of multi-task learning improves the model's ability to be applied to unseen phonemes and phoneme inventories. On supervised languages, we achieve phoneme error rate (PER) improvements of 11 percentage points (pp.) compared to a baseline without multi-task learning. Evaluation of zero-shot transfer on 84 languages yielded a decrease in PER of 2.63 pp. over the baseline.
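As an illustration of the metric reported above, the following is a minimal sketch of how a phoneme error rate and a difference in percentage points could be computed. It assumes the usual definition (Levenshtein edit distance divided by the number of reference phonemes); the function names and the toy sequences are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * edits / total


# Toy example (made-up data): one reference, two system outputs.
refs = [["k", "a", "t", "s"]]
baseline = [["k", "e", "t"]]        # one substitution, one deletion
improved = [["k", "e", "t", "s"]]   # one substitution

per_baseline = phoneme_error_rate(refs, baseline)
per_improved = phoneme_error_rate(refs, improved)
# An improvement in percentage points is the plain difference of two PERs.
improvement_pp = per_baseline - per_improved
```

Reporting the difference in percentage points, rather than as a relative change, keeps the improvement independent of how high the baseline PER is.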

AImotion Challenge Results: a Framework for AirSim Autonomous Vehicles and Motion Replication

Authors: Bruno J. Souza, Lucas C. de Assis, Dominik Rößle, Roberto Z. Freire, Daniel Cremers, Torsten Schön, Munir Georges

Link: https://ieeexplore.ieee.org/document/10026940


The use of simulation environments is becoming more significant in the development of autonomous cars, as it allows for the simulation of high-risk situations while also being less expensive. In this paper, we present a framework that allows the creation of an autonomous vehicle in the AirSim simulation environment and the subsequent transfer of the simulated movements to the TurtleBot. To keep the car heading in the correct direction, computer vision techniques such as object detection and lane detection were used. The vehicle's speed and steering are both determined by Proportional-Integral-Derivative (PID) controllers. A virtual personal assistant was developed using natural language processing to allow the user to interact with the environment by providing movement instructions related to the vehicle's direction. Additionally, the simulator movements were converted into robot movements to test the proposed system in a practical experiment. To evaluate the performance of the algorithms, the real position of the robot was compared with the position of the vehicle in the simulated environment.
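For readers unfamiliar with PID control, the following is a minimal textbook sketch of a discrete PID controller of the kind the abstract mentions for speed and steering. The class name, gains, and time step are illustrative assumptions and do not reflect the authors' implementation or tuning.

```python
class PID:
    """Discrete PID controller: output = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first update

    def update(self, error, dt):
        """Return the control signal for the current error and time step dt."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical usage: the error could be the lateral offset from the lane
# centre (steering) or the gap between target and current speed (throttle).
steering = PID(kp=2.0, ki=0.0, kd=0.0)
command = steering.update(error=1.5, dt=0.1)
```

Using one controller instance per controlled quantity keeps the integral and derivative state of speed and steering separate.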

An End-to-End Neural Network for Image-to-Audio Transformation

Authors: Liu Chen, Michael Deisher, Munir Georges

Link: https://doi.org/10.1109/ICASSP49357.2023.10096121


This paper describes an end-to-end (E2E) neural architecture for the audio rendering of small portions of display content on low-resource personal computing devices. It is intended to address the problem of accessibility for vision-impaired or vision-distracted users at the hardware level. Neural image-to-text (ITT) and text-to-speech (TTS) approaches are reviewed, and a new technique is introduced to integrate them in a way that is both efficient and back-propagatable, leading to a non-autoregressive, trainable E2E image-to-speech (ITS) neural network. Experimental results are presented showing that, compared with the non-E2E approach, the proposed E2E system is 29% faster and uses 19% fewer parameters with a 2% reduction in phone accuracy. A future direction to address accuracy is presented.

Analysis of Knowledge Tracing performance on synthesised student data (2023)

Authors: Panagiotis Pagonis, Kai Hartung, Di Wu, Munir Georges, Sören Gröttrup

Link: https://oa.tib.eu/renate/server/api/core/bitstreams/9a028a47-5c64-4794-8dd7-c9582d4b9d35/content


Knowledge Tracing (KT) aims to predict the future performance of students by tracking the development of their knowledge states. Despite all the recent progress made in this field, the application of KT models in education systems is still restricted from a data perspective: 1) limited access to real-life data due to data protection concerns, 2) lack of diversity in public datasets, and 3) noise in benchmark datasets, such as duplicate records. To resolve these problems, we simulated student data with three statistical strategies based on public datasets and tested their performance on two KT baselines. While we observe only minor performance improvements with additional synthetic data, our work shows that training on synthetic data alone can lead to performance similar to that of training on real data.

Bias in Flemish Automatic Speech Recognition

Authors: Aaricia Herygers, Vass Verkhodanova, Matt Coler, Odette Scharenborg, Munir Georges

Link: https://www.essv.de/paper.php?id=1186


Research has shown that automatic speech recognition (ASR) systems exhibit biases against different speaker groups, e.g., based on age or gender. This paper presents an investigation into bias in recent Flemish ASR. Seeing as Belgian Dutch, which is also known as Flemish, is often not included in Dutch ASR systems, a state-of-the-art ASR system for Dutch is trained using the Netherlandic Dutch data from the Spoken Dutch Corpus. Using the Flemish data from the JASMIN-CGN corpus, word error rates for various regional variants of Flemish are then compared. In addition, the most misrecognized phonemes are compared across speaker groups. The evaluation confirms a bias against speakers from West Flanders and Limburg, as well as against children, male speakers, and non-native speakers.

Measuring Sentiment Bias in Machine Translation

Authors: Kai Hartung, Aaricia Herygers, Shubham Vijay Kurlekar, Khabbab Zakaria, Taylan Volkan, Sören Gröttrup, Munir Georges

Link: https://link.springer.com/chapter/10.1007/978-3-031-40498-6_8


Biases introduced into text by generative models have become an increasingly prominent topic in recent years. In this paper, we explore how machine translation might introduce a bias in sentiments as classified by sentiment analysis models. For this, we compare three open-access machine translation models for five different languages on two parallel corpora to test whether the translation process causes a shift in the sentiment classes recognized in the texts. Though our statistical tests indicate shifts in the label probability distributions, we find none that appears consistent enough to assume a bias induced by the translation process.

The Hochschul-Assistenz-System HAnS: An ML-Based Learning Experience Platform

Authors: Thomas Ranzenberger, Tobias Bocklet, Steffen Freisinger, Lia Frischholz, Munir Georges, Kevin Glocker, Aaricia Herygers, René Peinl, Korbinian Riedhammer, Fabian Schneider, Christopher Simic, Khabbab Zakaria

Link: https://www.essv.de/paper.php?id=1188


The usage of e-learning platforms, online lectures and online meetings for academic teaching increased during the Covid-19 pandemic. Lecturers created video lectures, screencasts, or audio podcasts for online learning. The Hochschul-Assistenz-System (HAnS) is a learning experience platform that uses machine learning (ML) methods to support students and lecturers in the online learning and teaching processes. HAnS is being developed in multiple iterations as an agile open-source collaborative project supported by multiple universities and partners. This paper presents the current state of the development of HAnS on German video lectures.

Unsupervised Multilingual Topic Segmentation of Video Lectures: What can Hierarchical Labels tell us about the Performance?

Authors: Steffen Freisinger, Fabian Schneider, Aaricia Herygers, Munir Georges, Tobias Bocklet, Korbinian Riedhammer

Link: https://www.isca-speech.org/archive/slate_2023/freisinger23_slate.html


The current shift from in-person to online education, e.g., through video lectures, requires novel techniques for quickly searching for and navigating through media content. At this point, an automatic segmentation of the videos into thematically coherent units can be beneficial. Like in a book, the topics in an educational video are often structured hierarchically. There are larger topics, which in turn are divided into different subtopics. We thus propose a metric that considers the hierarchical levels in the reference segmentation when evaluating segmentation algorithms. In addition, we propose a multilingual, unsupervised topic segmentation approach and evaluate it on three datasets with English, Portuguese and German lecture videos. We achieve WindowDiff scores of up to 0.373 and show the usefulness of our hierarchical metric.
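For context on the WindowDiff scores mentioned above, the following is a minimal sketch of one common formulation of the metric: a window of size k slides over both segmentations, and every position where the number of boundaries differs counts as an error (lower is better). The boundary-indicator representation, the window size, and the toy data are assumptions for illustration; the hierarchical metric proposed in the paper is not reproduced here.

```python
def window_diff(ref, hyp, k):
    """WindowDiff between two segmentations of the same length.

    ref, hyp: lists of 0/1 flags, where 1 marks a topic boundary after a unit.
    k: window size (often set to about half the mean reference segment length).
    Returns the fraction of windows whose boundary counts disagree.
    """
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    if not 0 < k <= len(ref):
        raise ValueError("window size must be between 1 and len(ref)")
    n_windows = len(ref) - k + 1
    errors = 0
    for i in range(n_windows):
        # Compare the number of boundaries inside the current window.
        if sum(ref[i:i + k]) != sum(hyp[i:i + k]):
            errors += 1
    return errors / n_windows


# Toy example (made-up data): the hypothesis misses both reference boundaries.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
hypothesis = [0, 0, 0, 0, 0, 0, 0, 0]
score = window_diff(reference, hypothesis, k=2)
```

A perfect segmentation scores 0.0, while a hypothesis that disagrees with the reference in every window scores 1.0.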


Typological Word Order Correlations with Logistic Brownian Motion

Authors: Kai Hartung, Gerhard Jäger, Sören Gröttrup, Munir Georges

Link: https://aclanthology.org/2022.sigtyp-1.3/


In this study, we address the question of to what extent syntactic word-order traits of different languages have evolved under correlation and whether such dependencies can be found universally across all languages or are restricted to specific language families. To do so, we use logistic Brownian Motion under a Bayesian framework to model the trait evolution for 768 languages from 34 language families. We test for trait correlations both in single families and universally over all families. Separate models reveal no universal correlation patterns, and Bayes Factor analysis of models over all covered families also strongly indicates lineage-specific correlation patterns instead of universal dependencies.

Hierarchical Multi-Task Transformers for Crosslingual Low Resource Phoneme Recognition

Authors: Kevin Glocker, Munir Georges

Link: https://aclanthology.org/2022.icnlsp-1.21


This paper proposes a method for multilingual phoneme recognition in unseen, low resource languages. We propose a novel hierarchical multi-task classifier built on a hybrid convolution-transformer acoustic architecture where articulatory attribute and phoneme classifiers are optimized jointly. The model was evaluated on a subset of 24 languages from the Mozilla Common Voice corpus. We found that when using regular multi-task learning, negative transfer effects occurred between attribute and phoneme classifiers. They were reduced by the hierarchical architecture. When evaluating zero-shot crosslingual transfer on a data set with 95 languages, our hierarchical multi-task classifier achieves an absolute PER improvement of 2.78% compared to a phoneme-only baseline.


Audio-Visual Recipe Guidance for Smart Kitchen Devices

Authors: Caroline Kendrick, Mariano Frohnmaier, Munir Georges

Link: https://aclanthology.org/2021.icnlsp-1.30


An important degree of accessibility, novelty, and ease of use is added to smart kitchen devices with the integration of multimodal interactions. We present the design and prototype implementation for one such interaction: guided cooking with a smart food processor, utilizing both voice and touch interfaces. The prototype’s design is based on user research. A new speech corpus consisting of 2,793 user queries related to the guided cooking scenario was created. This annotated data set was used to train and test the neural-network-based natural language understanding (NLU) component. Our evaluation of this new in-domain NLU data set resulted in an intent detection accuracy of 97% with high reliability when tested. Our data and prototype (VoiceCookingAssistant, 2021) are open-sourced to enable further research in audio-visual interaction within the smart kitchen context.