
Profile



Konstantin Wilfried Kühlem, M.Sc.
Room K111
Phone: +49 241 80 29326
Fax: +49 241 80 22134
Email: kuehlem@vr.rwth-aachen.de

Social Virtual Reality Team



Publications


Demo: A Latency-Optimized LLM-based Multimodal Dialogue System for Embodied Conversational Agents in VR


Konstantin Wilfried Kühlem, Jonathan Ehret, Torsten Wolfgang Kuhlen, Andrea Bönsch
ACM International Conference on Intelligent Virtual Agents (IVA)

Interactions with Embodied Conversational Agents (ECAs) are essential in many social Virtual Reality (VR) applications, highlighting the growing demand for free-flowing, context-aware conversations supported by low-latency, multimodal ECA responses. We introduce a modular, extensible framework powered by a Large Language Model (LLM), featuring streaming-based optimization techniques specially crafted for multimodal responses. Our system can control the agent's own behavior and execute tasks, such as moving through the Immersive Virtual Environment (IVE) under direct LLM control, and it can also react to events in the IVE. In our study, the applied optimizations reduced latency by about 66% on average compared to an unoptimized baseline.
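The abstract does not spell out the streaming-based optimizations, but the general idea behind this class of latency reduction can be illustrated. The sketch below is an assumed, minimal Python example, not the paper's implementation: LLM tokens are buffered only until a sentence boundary, and each completed sentence is forwarded to speech synthesis immediately, so audio for the first sentence can play while the rest of the reply is still being generated. stream_sentences, synthesize_speech, and the simulated token stream are hypothetical names.

import re
from typing import Iterable, Iterator

# Sentence boundary: terminal punctuation, optional closing quote/bracket,
# followed by whitespace.
SENTENCE_END = re.compile(r"[.!?][\"')\]]?\s")

def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and yield each sentence once it completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            match = SENTENCE_END.search(buffer)
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

def synthesize_speech(sentence: str) -> None:
    """Hypothetical TTS hook; a real system would start audio playback here."""
    print(f"[TTS] {sentence}")

if __name__ == "__main__":
    # Simulated token stream: the first sentence reaches TTS while the
    # remaining tokens are still arriving.
    fake_stream = iter(["Hello", ", I am", " your guide.", " Please", " follow me!"])
    for sentence in stream_sentences(fake_stream):
        synthesize_speech(sentence)

A real pipeline would run synthesis concurrently with token generation, for example via a queue; the roughly 66% average latency reduction reported above stems from the paper's own optimizations, which this sketch does not reproduce.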

BibTeX:

@inproceedings{Kuehlem2025,
author = {K\"{u}hlem, Konstantin W. and Ehret, Jonathan and Kuhlen, Torsten W. and B\"{o}nsch, Andrea},
title = {A Latency-Optimized LLM-based Multimodal Dialogue System for Embodied Conversational Agents in VR},
year = {2025},
isbn = {9798400715082},
publisher = {Association for Computing Machinery},
doi = {10.1145/3717511.3749287},
abstract = {Interactions with Embodied Conversational Agents (ECAs) are essential in many social Virtual Reality (VR) applications, highlighting the growing demand for free-flowing, context-aware conversations supported by low-latency, multimodal ECA responses. We introduce a modular, extensible framework powered by a Large Language Model (LLM), featuring streaming-based optimization techniques specially crafted for multimodal responses. Our system is capable of controlling self-behavior and task execution, in the form of moving through the Immersive Virtual Environment (IVE) directly controlled by the LLM, and is also capable of reacting to events in the IVE. In our study, our applied optimizations achieved a latency improvement of about 66\% on average compared to having no optimizations.},
booktitle = {Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents},
articleno = {49},
numpages = {3},
series = {IVA '25}
}





Poster: Listening Effort in Populated Audiovisual Scenes Under Plausible Room Acoustic Conditions


Cosima A. Ermert, Karin Loh, Karl Baylan, Konstantin Wilfried Kühlem, Andrea Bönsch, Torsten Wolfgang Kuhlen, Janina Fels
International Symposium on Auditory and Audiological Research (ISAAR) 2025

Listening effort in real-world environments is shaped by a complex interplay of factors, including time-varying background noise, visual and acoustic cues from both interlocutors and distractors, and the acoustic properties of the surrounding space. However, many studies investigating listening effort neglect both auditory and visual fidelity: static background noise is frequently used to avoid variability, talker visualization often disregards acoustic complexity, and experiments are commonly conducted in free-field environments without spatialized sound or realistic room acoustics. These limitations risk undermining the ecological validity of study outcomes. To address this, we developed an audiovisual virtual reality (VR) framework capable of rendering immersive, realistic scenes that integrate dynamic auditory and visual cues. Background noise included time-varying speech and non-speech sounds (e.g., conversations, appliances, traffic), spatialized in controlled acoustic environments. Participants were immersed in a visually rich VR setting populated with animated virtual agents. Listening effort was assessed using a heard-text-recall paradigm embedded in a dual-task design: participants listened to and remembered short stories told by two embodied conversational agents while simultaneously performing a vibrotactile secondary task. We compared three room acoustic conditions: a free-field environment, a room optimized for reverberation time, and an untreated reverberant room. Preliminary results from 30 participants (15 female; age range: 18–33; M = 25.1, SD = 3.05) indicated that room acoustics significantly affected both listening effort and short-term memory performance, with notable differences between free-field and reverberant conditions. These findings underscore the importance of realistic acoustic environments when investigating listening behavior in immersive audiovisual settings.
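As a brief illustration of the dual-task design described above, the following is a generic, assumed Python outline of a single trial, not the study's actual software: story playback serves as the primary task while vibrotactile probes fire at jittered intervals on a background thread, and reaction times (with misses logged as None) serve as the secondary-task effort measure. play_story, trigger_vibration, and wait_for_response are hypothetical device hooks.

import random
import threading
import time
from typing import List, Optional

def play_story(duration_s: float) -> None:
    """Hypothetical hook: spatialized story playback by a virtual agent."""
    time.sleep(duration_s)

def trigger_vibration() -> None:
    """Hypothetical hook: fire a vibrotactile probe on the controller."""
    pass

def wait_for_response(timeout_s: float = 2.0) -> Optional[float]:
    """Hypothetical hook: participant's button press; simulated here."""
    rt = random.uniform(0.3, 1.2)
    time.sleep(rt)
    return rt if rt <= timeout_s else None  # None marks a missed probe

def run_secondary_task(stop: threading.Event, rts: List[Optional[float]]) -> None:
    # Jittered inter-probe intervals; stop.wait() returns True once the
    # primary task has finished.
    while not stop.wait(random.uniform(3.0, 8.0)):
        trigger_vibration()
        rts.append(wait_for_response())

def run_trial(story_duration_s: float = 30.0) -> List[Optional[float]]:
    stop = threading.Event()
    rts: List[Optional[float]] = []
    worker = threading.Thread(target=run_secondary_task, args=(stop, rts))
    worker.start()
    play_story(story_duration_s)  # primary task: listen and remember
    stop.set()
    worker.join()
    return rts  # slower or missed responses index higher listening effort

Slower secondary-task responses under higher primary-task load are the standard dual-task logic; recall accuracy for the stories would be scored separately.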


