gerenbench - A Small Proprietary Benchmark for German RAG

Retrieval Augmented Generation (RAG) is quickly maturing into a useful (and boring) technology for connecting proprietary, often local, data to Large Language Models (LLMs), augmenting their training data with additional knowledge through in-context learning. RAG enables important use cases such as tailored AI assistants for answering questions, generating complex reports ("Deep Research"), or automating tasks ("Agentic AI").

RAG employs an information retrieval system, such as a database or search engine, to augment the context of an LLM with relevant information. There are many ways of varying complexity to achieve this, from simple approaches that enrich every user prompt with search results to agentic workflows that let an LLM freely use multiple search engines or databases as tools. When connecting an LLM to proprietary data, it is extremely important to ensure that it cannot access public data sources at the same time, to prevent data exfiltration.
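The simple "enrich every prompt" pattern described above can be sketched in a few lines. This is a minimal illustration, not the setup used in the benchmark: `retrieve` is a hypothetical stand-in for a real search engine or vector store, using naive keyword overlap for scoring.

```python
# Minimal sketch of the "enrich every prompt" RAG pattern.
# `retrieve` is a hypothetical stand-in for any search engine or vector store;
# a real system would use BM25, embeddings, or a database query instead.
def retrieve(query, corpus, top_k=1):
    """Return the top_k documents by naive keyword overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda item: len(query_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def build_rag_prompt(query, corpus):
    """Prepend retrieved documents to the user's question."""
    context = "\n\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion:\n{query}"
```

In a fully local deployment, both the corpus and the retrieval step stay on-premise, which is exactly what prevents the exfiltration risk mentioned above.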

This article presents results from "gerenbench", a small proprietary benchmark that uses German encyclopedia articles to estimate the accuracy of LLMs on RAG tasks. It provides a glimpse of the native and RAG performance of open-weight LLMs runnable on consumer hardware as of March 2026.

Benchmark Setup

The benchmark contains 100 multiple-choice questions based on 25 German encyclopedia articles from the mid-nineties. Each question includes the relevant article and four possible answers. The benchmark encompasses the following two scenarios:

  1. LLM-Only: The LLM is presented with the question and four possible answers. The relevant article is withheld.
  2. LLM+RAG: The LLM is presented with the relevant article, the question and four possible answers.

In both scenarios, the LLM is instructed (via the prompts given in the appendix) to answer the question by generating a single letter (A, B, C, or D) designating the correct answer. Scenario 1 therefore tests world knowledge, while scenario 2 tests world knowledge plus RAG capability. Result analysis is fully deterministic and automated. A redacted example task is given in the appendix of this article. Note that scenario 2 assumes the best case, where the relevant article is included in the LLM's context exclusively, every time.
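The deterministic result analysis can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual scoring code: the model output is reduced to the first standalone letter A–D and compared against the gold answer.

```python
# Sketch of deterministic multiple-choice scoring: extract a single letter
# from the model output and compare it against the gold answer.
import re

def extract_choice(raw_output):
    """Return the first standalone A-D in the model output, or None."""
    match = re.search(r"\b([ABCD])\b", raw_output.strip().upper())
    return match.group(1) if match else None

def accuracy(outputs, gold):
    """Fraction of outputs whose extracted letter matches the gold answer."""
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

Because the expected output is a single letter, no LLM-as-a-judge or fuzzy matching is needed, which keeps the evaluation fully reproducible.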

All experiments were run on an Apple M3 with 24 GB of unified RAM using Ollama Version 0.17.4.
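A single benchmark query against a local Ollama server can be issued with just the standard library, via Ollama's REST API (`POST /api/chat`). The sketch below is an assumption about how such a harness could look, not the benchmark's actual code; the model name is a placeholder.

```python
# Sketch of querying a local Ollama server (default port 11434) for one task.
# Payload shape follows Ollama's /api/chat REST endpoint.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_payload(model, system_prompt, user_prompt):
    """Assemble a non-streaming chat request for the Ollama REST API."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

def ask(model, system_prompt, user_prompt):
    """Send one chat request and return the generated answer text."""
    data = json.dumps(build_payload(model, system_prompt, user_prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"].strip()
```

The response JSON also contains per-request timing fields (e.g. prompt evaluation and generation durations), which is presumably how the durations reported below can be collected.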

Results

The following 10 open-weight LLMs were tested on both scenarios (Table 1):

Table 1: LLMs Tested
| LLM Name | Provider | Model Card | Size | Quantization |
|---|---|---|---|---|
| Gemma 3 1b | Google DeepMind | Hugging Face Link | 1b | q4_K_M |
| Gemma 3 12b | Google DeepMind | Hugging Face Link | 12b | q4_K_M |
| gpt-oss-20b | OpenAI | Hugging Face Link | 20b | q4_K_M |
| LFM2 24b-a2b | Liquid AI | Hugging Face Link | 24b-a2b | q4_K_M |
| Ministral 3 3b | Mistral AI | Hugging Face Link | 3b | q4_K_M |
| Ministral 3 8b | Mistral AI | Hugging Face Link | 8b | q4_K_M |
| Ministral 3 14b | Mistral AI | Hugging Face Link | 14b | q4_K_M |
| Mistral Small 3.2 24b | Mistral AI | Hugging Face Link | 24b | q4_K_M |
| Phi-4-mini 3.8b | Microsoft | Hugging Face Link | 3.8b | q4_K_M |
| Phi-4 14b | Microsoft | Hugging Face Link | 14b | q4_K_M |

Table 2 shows mean accuracies and processing times for scenario 1, i.e. LLM-Only performance, without RAG. "Prompt Duration" denotes the mean time in seconds taken to process the prompt, "Generate Duration" denotes the mean time in seconds taken to generate the answer. Best results are shown in bold font.

Table 2: Accuracy and Processing Time without RAG (Ordered by Mean Accuracy)
| LLM Name | Accuracy (Mean) | Prompt Duration (Mean, s) | Generate Duration (Mean, s) |
|---|---|---|---|
| Ministral 3 14b (it-2512-q4_K_M) | **0.77127** | 2.21029 | 0.15751 |
| Mistral Small 3.2 24b (it-2506-q4_K_M) | 0.75971 | 3.31204 | 0.177936 |
| Ministral 3 8b (it-2512-q4_K_M) | 0.7504 | 1.25622 | 0.291435 |
| gpt-oss-20b | 0.71902 | 0.865917 | 34.4813 |
| Phi-4 14b (q4_K_M) | 0.69161 | 2.39739 | 0.971595 |
| Gemma 3 12b (it-q4_K_M) | 0.6511 | 1.09994 | 0.104123 |
| Ministral 3 3b (it-2512-q4_K_M) | 0.58817 | 0.499747 | 0.0415913 |
| LFM2 24b-a2b (q4_K_M) | 0.56871 | 1.27881 | 0.0333128 |
| Gemma 3 1b (it-q4_K_M) | 0.50059 | **0.261436** | **0.0246014** |
| Phi-4-mini 3.8b (q4_K_M) | 0.4406 | 0.601904 | 0.260388 |

Table 3 shows mean accuracy and processing times for scenario 2, i.e. RAG. Best results are shown in bold font.

Table 3: RAG Accuracy and Processing Time (Ordered by Mean Accuracy)
| LLM Name | RAG Accuracy (Mean) | RAG Prompt Duration (Mean, s) | RAG Generate Duration (Mean, s) |
|---|---|---|---|
| Mistral Small 3.2 24b (it-2506-q4_K_M) | **0.9697** | 11.9376 | 0.172491 |
| gpt-oss-20b | 0.96957 | 2.0207 | 8.44707 |
| Phi-4 14b (q4_K_M) | 0.96907 | 10.1061 | 0.690504 |
| Ministral 3 14b (it-2512-q4_K_M) | 0.96079 | 8.46598 | 0.146143 |
| Ministral 3 8b (it-2512-q4_K_M) | 0.95992 | 5.20979 | 0.130678 |
| Gemma 3 12b (it-q4_K_M) | 0.95978 | 4.45786 | 0.137686 |
| Ministral 3 3b (it-2512-q4_K_M) | 0.90893 | 2.15204 | 0.0413062 |
| LFM2 24b-a2b (q4_K_M) | 0.86951 | 2.62395 | 0.0338279 |
| Phi-4-mini 3.8b (q4_K_M) | 0.82001 | 2.58125 | 0.615699 |
| Gemma 3 1b (it-q4_K_M) | 0.50059 | **0.261436** | **0.0246014** |

Figure 1 shows the accuracies of all tested LLMs for scenario 1 (LLM only) and scenario 2 (LLM+RAG). 95% confidence intervals, generated via bootstrap sampling, are included.

Figure 1: Accuracy for LLM-Only (Scenario 1) vs. LLM+RAG (Scenario 2)
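The bootstrap confidence intervals in Figure 1 can be computed by resampling the per-question correctness indicators with replacement and taking the 2.5th and 97.5th percentiles of the resampled accuracies. The sketch below is an assumed implementation of this standard procedure, not the benchmark's actual code.

```python
# Sketch of a 95% bootstrap confidence interval for accuracy: resample the
# per-question 0/1 correctness indicators and take percentile bounds.
import random

def bootstrap_ci(correct, n_boot=10_000, seed=0):
    """correct: list of 0/1 indicators, one per question.
    Returns the (2.5th, 97.5th) percentile of bootstrapped accuracies."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return accs[int(0.025 * n_boot)], accs[int(0.975 * n_boot)]
```

With only 100 questions, these intervals are fairly wide, which is worth keeping in mind when comparing closely ranked models in the tables above.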

Finally, Figure 2 shows a Pareto plot of LLM accuracy vs. generation time.

Figure 2: Pareto Plot of LLM Accuracy vs. Generation Time for LLM-Only (Scenario 1, Blue) and LLM+RAG (Scenario 2, Red)
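The Pareto-optimal models in such a plot can be identified programmatically: a model is on the frontier if no other model is both at least as accurate and at least as fast, with a strict improvement in at least one dimension. The following is a generic sketch (with hypothetical inputs), not code from the benchmark.

```python
# Sketch of Pareto-frontier selection for (accuracy, generation time) pairs.
# Higher accuracy is better; lower generation time is better.
def pareto_front(models):
    """models maps name -> (accuracy, generation_time_s).
    Returns the set of non-dominated model names."""
    front = set()
    for name, (acc, t) in models.items():
        dominated = any(
            other != name
            and o_acc >= acc and o_t <= t
            and (o_acc > acc or o_t < t)
            for other, (o_acc, o_t) in models.items()
        )
        if not dominated:
            front.add(name)
    return front
```

Models off the frontier are strictly worse trade-offs: for each of them, some other model answers both more accurately and at least as quickly.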

Conclusions and Next Steps

The results show that today's mid-size, open-weight LLMs are capable of solving the benchmark task with over 95% accuracy (an error rate of less than 1 in 20 queries). In contrast, when relying solely on internal world knowledge, even the best models tested are only about 75% accurate, producing an incorrect result for 1 out of 4 queries on average. Whether an accuracy of 95% is good enough should be evaluated against the risk profile of the task at hand.

Result accuracy scales with model size, although there are still significant differences between models of comparable size, demonstrating the value of creating custom (proprietary) benchmarks. Model runtime depends heavily on the inference stack (the hardware and software used), but it also matters in practice, which is why it is reported in the benchmark results.

Currently, the RAG portion of this benchmark (scenario 2) assumes perfect information retrieval, i.e. that relevant information is present every time, which is rarely the case in practice. Future iterations could be improved by including a quota of test cases where relevant information is missing. Furthermore, to move beyond the current multiple-choice format, a third scenario could be added to test free-form generation accuracy. This would involve removing multiple-choice options from the prompt template and evaluating the generated answers using pattern matching or an "LLM-as-a-judge" approach.
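For the proposed free-form scenario, the simplest evaluation option mentioned above is pattern matching. A hedged sketch of what such an evaluator could look like (a naive normalized-substring check, chosen here for illustration; the benchmark does not yet implement this):

```python
# Sketch of a pattern-matching evaluator for free-form answers: the generated
# text counts as correct if it contains the gold answer string after
# case- and whitespace-normalization.
import re

def normalize(text):
    """Lowercase and collapse all whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text.strip().lower())

def matches(generated, gold):
    """True if the normalized gold answer appears in the normalized output."""
    return normalize(gold) in normalize(generated)
```

Substring matching fails on paraphrases and inflected German word forms, which is exactly where an "LLM-as-a-judge" approach would be needed instead.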

Appendix

All LLMs under test were instructed with the following prompt templates:

system_prompt = """
Your task is to answer multiple-choice questions with high precision and accuracy.
You must answer with the letter of the correct choice, i.e. either A,B,C, or D.
You must not generate anything else. 
"""

prompt_template = """**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""

rag_prompt_template = """**Context:**
{context}

**Question:**
{question}

**Multiple Choice Options:**
A) {answer_a}
B) {answer_b}
C) {answer_c}
D) {answer_d}

Based on the given context, answer the given question using one of the multiple choice options.
You must return either A, B, C, or D.
You must return exactly one letter.
---
"""

The system_prompt gives general instructions to the model, while the prompt_template is used for the LLM-Only part of the benchmark (scenario 1) and the rag_prompt_template for the LLM+RAG part (scenario 2).

The following shows a redacted example task of the benchmark:

Article
Thyssen AG, Dachgesellschaft eines in den Bereichen Investitionsgüter, Handel und Dienstleistungen und Stahlerzeugung und -verarbeitung tätigen Konzerns; Sitz: Düsseldorf. Die T. AG entwickelte sich, v.a. aus den beiden Hauptunternehmen Thyssen & Co. KG (gegr. 1871) und August Thyssen-Hütte AG (gegr. 1890), bis 1914 (...redacted...)

Question
Durch wen sollte die (...redacted...)?

Possible Answers
a) (...redacted...)
b) (...redacted...)
c) (...redacted...)
d) (...redacted...)