Multilingual and open source: OpenGPT-X research project releases large language model

November 26, 2024 by Katrin Berkler, Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS

Collected at: https://techxplore.com/news/2024-11-multilingual-source-opengpt-large-language.html

The large language model of the OpenGPT-X research project is now available for download on Hugging Face: “Teuken-7B” has been trained from scratch in all 24 official languages of the European Union and contains 7 billion parameters.

Researchers and companies can use this commercially usable open source model for their own artificial intelligence applications. The OpenGPT-X consortium, led by the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS, has developed an AI language model that is open source and has a distinctly European perspective.

“In the OpenGPT-X project, we’ve spent the last two years researching the underlying technologies for large AI foundation models and training models with leading industry and research partners. We are delighted to be able to make our ‘Teuken-7B’ model freely available, providing a public, research-based alternative for use in academia and industry,” says Prof. Stefan Wrobel, Director of Fraunhofer IAIS.

“Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt and develop the model for their own work and applications. In this way, we want to contribute, both within the scientific community and together with companies from different industries, to the growing demand for transparent and customizable generative AI solutions.”

Teuken-7B is currently one of the few large language models developed multilingually from the ground up. Approximately 50% of its pre-training data is non-English, and it has been trained in all 24 official languages of the European Union. Its performance has proven to be stable and reliable across multiple languages.

This provides added value, particularly for international companies and organizations with multilingual communication requirements, products and services. The open source model allows companies and organizations to run their own customized models in real-world applications. Sensitive corporate data can remain within the company.

In addition to model training, the OpenGPT-X team also addressed a number of research questions, such as how to train and operate multilingual AI language models in a more energy- and cost-efficient way. To this end, the project developed a multilingual “tokenizer.”

The task of a tokenizer is to break words down into individual word components, called tokens: the fewer tokens a text requires, the more quickly and energy-efficiently a language model can generate its answer. The tokenizer developed in the project reduces training costs compared to the tokenizers of other multilingual models such as Llama3 or Mistral. This is particularly valuable for European languages with long word structures, such as German, Finnish or Hungarian.
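
To make the link between vocabulary coverage and token count concrete, here is a minimal sketch in Python. It is not the OpenGPT-X tokenizer: the greedy longest-match splitting, the two tiny vocabularies and the helper names `tokenize` and `tokens_per_word` are all assumptions invented for illustration. It only shows why a long German compound needs far fewer tokens when the vocabulary contains suitable German word components.

```python
def tokenize(word, vocab):
    """Split one word into subword tokens by greedy longest match.

    Toy stand-in for a real subword tokenizer: any character that is not
    covered by the vocabulary becomes a token of its own.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def tokens_per_word(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.lower().split()
    return sum(len(tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical vocabularies, made up for this example only.
english_centric = {"data", "protection", "regulation"}
multilingual = english_centric | {"daten", "schutz", "grund", "verordnung"}

word = "datenschutzgrundverordnung"
print(tokenize(word, multilingual))          # ['daten', 'schutz', 'grund', 'verordnung']
print(len(tokenize(word, english_centric)))  # 26, one token per character
```

Since a model produces its output one token at a time, the first vocabulary would need 26 generation steps for this word and the second only 4.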

Teuken-7B is accessible via the Gaia-X infrastructure. Actors in the Gaia-X ecosystem can thus develop innovative language applications and transfer them into concrete application scenarios in their respective domains. Unlike existing cloud solutions, Gaia-X is a federated ecosystem that allows service providers and data owners to connect. Data remains securely with its owners and is only shared under defined conditions.

“I am excited to witness today’s publication of Teuken-7B, a large language model based on Gaia-X, and would like to congratulate the OpenGPT-X project on having reached this important milestone.

“A special feature of Teuken-7B is that it enables the secure use of sensitive corporate data, as the Gaia-X standards guarantee data storage and processing in accordance with the strictest European data protection and security regulations.

“This new model and innovations like this strengthen the digital sovereignty, competitiveness and resilience of Germany and of Europe,” says Dr. Franziska Brantner, Parliamentary State Secretary at BMWK.

Prof. Bernhard Grill, Director of Fraunhofer IIS, emphasizes the model’s potential for safety-critical applications. “With this independently developed language model, the project partners demonstrate their ability to generate their own large models.

“Access to a large language model enables applications that offer much greater control over this technology without the need for opaque third-party components—for example, in safety-critical fields such as automotive, robotics, medicine and finance. By training on data relevant to a specific application and using application-specific architectures, companies can create customized AI solutions that do not require ‘black box’ components.”

Generative AI by a strong consortium—with a European perspective

Important research results from the OpenGPT-X project have been incorporated into the model development, such as tools and technologies for processing large amounts of data, leveraging powerful European HPC infrastructure and performing efficient model training.

Teuken-7B was trained on the JUWELS supercomputer at Forschungszentrum Jülich. In addition to the two Fraunhofer Institutes and Forschungszentrum Jülich, the consortium’s partners include TU Dresden, the German Research Center for Artificial Intelligence (DFKI), IONOS, Aleph Alpha, ControlExpert, Westdeutscher Rundfunk (WDR) and the German AI Association (KI Bundesverband).

The technology developed in OpenGPT-X will also provide the partners with a basis for training their own models in the future.

“OpenGPT-X is an example of how the resources of a publicly funded project and the collaborative efforts of a broad consortium can deliver valuable foundational technology—from underlying infrastructure to model training to productive applications.

“In the interest of technology and data sovereignty, it is important to build on this foundation: Our hope is that OpenGPT-X will lay the groundwork for many subsequent activities,” emphasizes Daniel Abbou, Managing Director of the German AI Association and President of the European AI Forum.

The research project, which was launched at the beginning of 2022, is now nearing completion. It will run until 31 March 2025 so that further optimizations and evaluations of the models can take place.

The path to using Teuken-7B

Interested developers from academia or industry can download Teuken-7B free of charge from Hugging Face and work with it in their own development environment. The model has already been optimized for chat through “instruction tuning.” Instruction tuning adapts a large language model so that it correctly understands instructions from users, which is important when using the model in practice, for example in a chat application.
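
As a rough sketch of what working with an instruction-tuned model from Hugging Face can look like, the following Python example uses the `transformers` library. Several points are assumptions rather than facts from the project: the role label in `build_messages`, the need for `trust_remote_code=True`, and the optional `template_name` argument (some models ship several named chat templates). The helper names are hypothetical, and the model card at huggingface.co/openGPT-X is the authoritative source for the exact repository identifier and call pattern.

```python
def build_messages(user_text, role="user"):
    """Return a single-turn chat history in the list-of-dicts chat format.

    The role label is an assumption; check the model card for the expected one.
    """
    return [{"role": role, "content": user_text}]


def generate_reply(model_id, user_text, template_name=None, max_new_tokens=128):
    """Download a chat model from Hugging Face and generate one reply.

    model_id is the repository identifier shown on the model card.
    Loading a 7-billion-parameter model needs a correspondingly large
    amount of memory.
    """
    # Imported lazily so build_messages works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code=True is assumed here in case the repository ships
    # custom model or tokenizer code; only enable it for sources you trust.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    model.eval()

    # Pass a named chat template only if the caller asked for one.
    template_kwargs = {"chat_template": template_name} if template_name else {}
    inputs = tokenizer.apply_chat_template(
        build_messages(user_text),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        **template_kwargs,
    )
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs.get("attention_mask"),
            max_new_tokens=max_new_tokens,
        )
    # Drop the prompt tokens and decode only the newly generated part.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Calling `generate_reply` with the repository identifier copied from the model card and a question in any of the supported languages would then return the model's answer as a string.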

Teuken-7B is freely available in two versions: one for research-only purposes and an “Apache 2.0” licensed version that can be used by companies for both research and commercial purposes and integrated into their own AI applications. The performance of the two models is roughly comparable, but some of the datasets used for instruction tuning preclude commercial use and were therefore not used in the Apache 2.0 version.

More information:

Model download and model card: huggingface.co/openGPT-X
Model release blog post on the project website: opengpt-x.de/en/models/teuken-7b
OpenGPT-X publications: opengpt-x.de/news-de
European LLM Leaderboard: huggingface.co/spaces/openGPT- … pean-llm-leaderboard
Discord server for community feedback and technical questions: discord.gg/RvdHpGMvB3
