Automatic Speech Recognition (ASR) generally works quite well on Standard German recordings. When the recordings are in Bavarian dialect, however, ASR systems falter. This is shown by a research project conducted under the AI for Media Network by Bayerischer Rundfunk and LMU Munich, using bedtime stories as a case study.

The AI for Media Network aims to foster collaborative projects, whether between media organizations or between media and academia. An example of this is the current collaboration between Bayerischer Rundfunk (BR) and Ludwig-Maximilians-Universität (LMU) Munich. Representatives from the BR archive and the Center for Information and Language Processing (CIS) at LMU met in 2024 during the “Science meets Journalism” workshop organized by the AI for Media Network. There, the team led by Prof. Barbara Plank, Chair of AI and Computational Linguistics at LMU, and Gabriele Wenger-Glemser, Head of Documentation & Research at BR, identified dialect transcription as a mutual interest. Shortly thereafter, they agreed on a joint research project: How well can ASR systems transcribe recordings in Bavarian dialect into Standard German?
The Center for Information and Language Processing at LMU fed three different ASR model families (Whisper, XLS-R, and MMS) with six hours of dialect data from BR. “AI research can benefit from our archival content, and it reflects well on us at BR to emphasize the diversity and regionality of AI models,” says Wenger-Glemser. The BR AI guidelines explicitly name the goal of “working on language models that process regional dialects” under the point “diversity and regional focus.”
Specifically, the dialect recordings consisted of a series of bedtime stories for children, known as “Betthupferl,” in three dialect groups: Franconian (Lower Franconian, Middle Franconian, Upper Franconian), Bavarian (Upper Bavarian, Lower Bavarian, Upper Palatinate), and Swabian. Since the programs were recorded by professional speakers in a studio environment, there was no background noise, making the recordings high-quality data. For comparison, the researchers also fed Standard German recordings into the systems. The task in both cases was to produce a transcript in Standard German.
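For illustration, a minimal sketch of this task with the open-source openai-whisper package could look like the following. The file name is a hypothetical placeholder; the study’s actual pipeline is not described here.

```python
# Minimal sketch, not the project's actual pipeline: transcribe one
# dialect recording into German text with the open-source Whisper model
# via the openai-whisper package (pip install openai-whisper).
import whisper

# "betthupferl_example.mp3" is a hypothetical placeholder file name.
model = whisper.load_model("large-v3")
result = model.transcribe("betthupferl_example.mp3", language="de")

print(result["text"])  # the model's attempt at a Standard German transcript
```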
Transcription of dialect data can lose meaning
The result: the speech recognition models made significantly more errors when transcribing dialect recordings than when transcribing Standard German ones. The LMU researchers compared the ASR-generated transcripts with two manually created reference transcripts: one in Standard German and one in the respective dialect. The models struggled to transcribe individual words accurately. For example, a Middle Franconian bedtime story contained the following sentence:
“Sofort alle ausschwärma und da Mathilda ihr Geldstückle sung, sonst zach ich eich, wo da Bartl an Most hoid.”
The Standard German reference sentence would be:
“Sofort alle ausschwärmen und Mathildas Geldstück suchen, sonst zeige ich euch, wo’s langgeht.” (Everybody, immediately spread out and search for Mathilda’s coin, or I’ll show you what’s what!)
The Whisper large-v3 model (like ChatGPT, a product of OpenAI) transcribed it as:
“Sofort alle Ausschwärmer und der Mathilda ihr Geldstück lesung. Sonst zeig ich euch, wo der Badl den Most holt.” (Everybody swarmers and reading Mathilda’s coin, or I’ll show you where the Badl fetches the cider)
In this example, the meaning of the dialect sentence is lost in the transcription. The most important quality metric is the word error rate (WER), which counts how many words are substituted, inserted, or deleted compared to a reference transcript, expressed as a share of the reference words. Overall, the Whisper large-v3 model performed best, with a word error rate of 31% on the Betthupferl data. In contrast, the word error rate for Standard German recordings was only 9%. Whisper (and the other speech recognition models) can transcribe Standard German much better than Bavarian dialect.
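To make the metric concrete, here is a small, self-contained sketch of the standard word-level edit-distance computation behind WER, applied to the Middle Franconian example above. This is not the study’s evaluation code; real evaluations typically also normalize casing and punctuation before scoring.

```python
# Sketch of the word error rate (WER): word-level edit distance
# (substitutions + insertions + deletions) divided by the number of
# words in the reference transcript. Not the study's evaluation code.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = ("Sofort alle ausschwärmen und Mathildas Geldstück suchen, "
             "sonst zeige ich euch, wo's langgeht.")
hypothesis = ("Sofort alle Ausschwärmer und der Mathilda ihr Geldstück lesung. "
              "Sonst zeig ich euch, wo der Badl den Most holt.")
# The exact value depends on tokenization and normalization choices.
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```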
LMU examines whether fine-tuning with dialect data is possible
According to Verena Blaschke, a research associate at CIS, this is due to a lack of training data. The ASR models are typically trained with German or English language data, but not Bavarian dialects: “I suspect that the transcription results would be better if the models were trained with southern German dialect data.”
To improve ASR models’ ability to transcribe Bavarian dialect data, they would need additional training. CIS is currently investigating whether the amount of “Betthupferl” data provided by BR is sufficient for this purpose.
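As a rough illustration of what such additional training involves, the following sketch fine-tunes Whisper on pairs of dialect audio and Standard German transcripts using the Hugging Face transformers and datasets libraries. The file name, data layout, and hyperparameters are assumptions for illustration, not the CIS setup.

```python
# Rough sketch of fine-tuning Whisper on dialect audio paired with
# Standard German transcripts. File name, columns, and hyperparameters
# are illustrative placeholders, not the CIS configuration.
import torch
from datasets import Audio, Dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3", language="german", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Assumed CSV with one dialect clip per row: columns "audio" (file path)
# and "text" (Standard German reference transcript).
ds = Dataset.from_csv("betthupferl.csv")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def preprocess(example):
    # Log-mel features from the dialect audio; token IDs from the
    # Standard German reference transcript.
    example["input_features"] = processor(
        example["audio"]["array"], sampling_rate=16000).input_features[0]
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

ds = ds.map(preprocess, remove_columns=ds.column_names)

def collate(batch):
    # Whisper's feature extractor pads every clip to 30 s, so the
    # spectrograms can simply be stacked; labels are padded and masked.
    features = torch.stack([torch.tensor(x["input_features"]) for x in batch])
    labels = processor.tokenizer.pad(
        [{"input_ids": x["labels"]} for x in batch], return_tensors="pt")
    ids = labels["input_ids"].masked_fill(labels["attention_mask"].eq(0), -100)
    # The model prepends the decoder start token itself, so drop it here.
    if (ids[:, 0] == model.config.decoder_start_token_id).all():
        ids = ids[:, 1:]
    return {"input_features": features, "labels": ids}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="whisper-betthupferl",
                                  per_device_train_batch_size=4,
                                  learning_rate=1e-5, max_steps=500),
    train_dataset=ds,
    data_collator=collate,
)
trainer.train()
```

Whether six hours of Betthupferl audio is enough for such fine-tuning to help is precisely the open question CIS is investigating.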
The study’s results, already published as a preprint, will be presented by Blaschke on August 18 at the Interspeech conference in Rotterdam. She hopes to gain insights into the best methods for fine-tuning. LMU, in consultation with BR, selected open-source models to enable other researchers to replicate the study. Additionally, open-source models can be fine-tuned, unlike closed systems.
A Bavarian model would be an asset
If LMU succeeds in training models for the various Bavarian dialects, BR would benefit significantly. Audio and video spoken in dialect could then be transcribed with minimal errors, allowing passages spoken in Bavarian dialects to be subtitled in Standard German. “A classic example is the agricultural magazine ‘Unser Land’. Bavarian farmers often appear and need subtitles; otherwise, they are not understood by those unfamiliar with their dialect. This dialect should be transcribed with speech-to-text models, and the better they work, the less correction effort is required from the subtitling team,” says Constantin Förster from the BR archive.
Additionally, audio and video could be searched for specific keywords, which is useful for locating particular sound bites. Reliable transcription-based summaries of radio programs would also benefit from a speech recognition model that understands Bavarian dialect.
This dialect transcription project is a prime example of practical collaboration between media and academia, aligning with the AI for Media Network’s stated goals. We will continue to report on the project’s progress.