Fully Transparent German LLM and Encoder Families Trained from Scratch
10 February 2026 9:00am
The School of EECS is hosting the following Guest Seminar:
Fully Transparent German LLM and Encoder Families Trained from Scratch
Speaker: Professor Andreas Hotho, University of Würzburg, Germany
Host: Professor Gianluca Demartini
Abstract:
While most contemporary language models prioritize English, we present two complementary, German-first model families trained entirely from scratch: Llämmlein, a decoder-only LLM family spanning 120M–7B parameters, and ModernGBERT, an encoder-only family spanning 138M–1B parameters. Both are designed for scalable training and full transparency and are trained on a German-only corpus using a custom German tokenizer. We outline the motivations for building fully German models, describe the construction of the dataset and training pipeline, and discuss key technical challenges in tokenization, corpus curation, data quality, and stable scaling across model sizes and heterogeneous cluster configurations. We evaluate the models in a fine-tuned setting using SuperGLEBer, a new German-specific benchmark, and complement this with prompt-based evaluation using the LM Harness to enable standardized, reproducible comparisons across tasks and model sizes.
Finally, we report on downstream alignment and instruction training, including additional stages such as supervised fine-tuning (SFT), preference optimization via DPO, and reinforcement learning (RL). We discuss practical limitations arising from comparatively weak German SFT/DPO resources and describe an empirical mitigation strategy that incorporates English data in a final training phase to improve overall quality and alignment behavior, while preserving a German-first model design.
BIO:
Andreas Hotho is a Professor at the University of Würzburg, where he holds the Chair of Data Science and serves as Speaker of the Centre for Artificial Intelligence and Data Science (CAIDAS). He received his Ph.D. from the University of Karlsruhe, where he worked from 1999 to 2004 at the AIFB Institute in the areas of text, data and web mining, the semantic web, and information retrieval. From 2004 to 2009, he was a senior researcher at the University of Kassel with a focus on machine learning for social media analysis, and from 2011 to 2018 he was a member of the L3S Research Center in Hannover. Since 2005, he has led the development of the social bookmarking and publication sharing platform BibSonomy.
For over a decade, his research group has been advancing data science and machine learning, with emphasis on deep learning, foundation models, language models, and knowledge-enriched AI that combines neural and symbolic representations for interpretable and transparent systems. His work covers applications in climate and environmental modeling, network flows, medicine, and digital humanities, and includes projects such as BigData@Geo, we4Bee, and BeeConnected, which link machine learning with sustainability and sensor-based ecosystem analysis.
He has published more than 350 papers in international journals and conferences, co-edited books and special issues, and serves as Editor-in-Chief of the open-access journal Transactions on Graph Data and Knowledge (TGDK). His research has been recognized with the SWSA Ten-Year Award at ISWC 2018 and the Best Paper Award at the Web Conference 2015.
About Data Science Seminar
This seminar series is hosted by EECS Data Science.
Venue
Room: 78 - 632 (MM Lab)