An Attempt to Evaluate Arabic ASR Models

3/8/2026 • 7 min • Abdullah Altamimi

Automatic Speech Recognition (ASR) is no longer just a transcription tool. It is becoming a core component of the Agentic AI movement.

Agentic systems listen, reason, act, and produce outcomes. In enterprise environments, that often starts with speech. Meetings, calls, briefings, and reviews are all unstructured and spoken. If your ASR layer is weak, everything built on top of it becomes fragile.

In this blog, we share practical learnings from taking Yameen — an agentic AI solution — to production, delivering enterprise-grade meeting minutes in Arabic.

When speech is transformed into structured summaries, decisions, and action items, transcription quality is no longer a cosmetic metric. It directly affects trust, compliance, and operational reliability.

In many scenarios, you run your Arabic ASR model, compute WER, get 20%, and think: this looks decent.

It’s not perfect, but it’s usable.

Then production happens. Users complain about wrong numbers. A negation disappears. A medical term is mistranscribed. And suddenly that 20% WER doesn’t tell you what you actually need to know.

Benchmarking Arabic ASR is not just about computing a single metric. It is a careful process that starts with how you choose your data and ends with manual inspection of the outputs. If you skip parts of that pipeline, your numbers might look acceptable while your system quietly fails where it matters most.


Building a Reliable Arabic ASR Evaluation Pipeline

A proper evaluation pipeline is not a single step. It is a structured process that moves from data to metrics to interpretation.

Building an evaluation pipeline typically follows this order:

  • Choosing the right evaluation data
  • Defining and understanding your evaluation metrics (WER, CER, etc.)
  • Designing normalization rules aligned with your product
  • Inspecting qualitative outputs where impact is highest

Each stage constrains the next. Weak data produces misleading metrics. Poorly chosen metrics misrepresent model quality. Over-aggressive normalization can inflate scores. And without qualitative inspection, critical failures remain hidden.

In the following sections, we dive into each component from a production perspective, not just a research one.



Evaluation Data

Before discussing metrics, you need to decide what you are evaluating on.

Open datasets are convenient and save time, but they also come with risks. You must ask:

  • What dialect does it contain?
  • Is it read speech or spontaneous?
  • Does it match my domain (calls, lectures, recitation, news, etc.)?
  • What is the recording quality?

If your product targets Saudi customer service calls but your evaluation dataset contains mostly Modern Standard Arabic news reading, your results may look reasonable while hiding real weaknesses.

You should also check how the dataset was labeled:

  • Was it transcribed literally?
  • Were multiple annotators involved?
  • Are spelling conventions consistent?

If you want full control, you may decide to annotate your own data. This is expensive but often more reliable.

When annotating Arabic audio, two rules are critical:

Transcribe literally.

Do not clean the text.

If the speaker says: “يعني هو بصراحة يعني ما ادري”

Your reference must be: “يعني هو بصراحة يعني ما ادري”

Do not remove fillers. Do not fix grammar. ASR systems are evaluated against what was actually spoken.

It is also highly recommended to have more than one annotator label the same audio. Different annotators often produce different transcripts for the same recording, especially in Arabic where spelling variations are common.

For example:

“مسؤولية”

“مسئولية”

Both are valid spellings, but WER will treat one as incorrect if it does not exactly match the reference. Multiple annotators help reduce bias and increase the reliability of your ground truth.



Evaluation Metrics

Understanding WER and CER

Word Error Rate (WER) measures how many word-level edits are required to transform the prediction into the reference.

WER = (Substitutions + Deletions + Insertions) / Number of words in reference

Consider this example:

Reference: “انا احب تعلم الالة”

Prediction: “انا احب تعليم الالة”

Word-by-word comparison:

  • انا → correct
  • احب → correct
  • تعلم → تعليم (substitution)
  • الالة → correct

There is 1 substitution.

Number of words in reference = 4

WER = 1 / 4 = 25%

Now let’s compute CER on the full sentence.

Reference sentence: “انا احب تعلم الالة”

Prediction sentence: “انا احب تعليم الالة”

If we compare both sentences at the character level, the only difference is the extra character “ي” in “تعليم”.

The reference sentence contains 18 characters (including spaces).

There is 1 insertion.

CER = 1 / 18 ≈ 5.6%

Notice the contrast:

WER = 25%

CER ≈ 5.6%

The model made a relatively small character-level mistake, but WER amplifies it because evaluation happens at the word level. This is especially important in Arabic, where morphology can turn small character changes into full word substitutions.

Both metrics are useful, but they capture different aspects of model behavior.
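The calculations above can be reproduced with a small Levenshtein-distance script. This is a minimal sketch in pure Python; in practice you would typically reach for an evaluation library such as jiwer, which implements the same alignment:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (free if they match)
            ))
        prev = curr
    return prev[-1]

def wer(reference, prediction):
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)

def cer(reference, prediction):
    return edit_distance(list(reference), list(prediction)) / len(reference)

ref = "انا احب تعلم الالة"
pred = "انا احب تعليم الالة"
print(wer(ref, pred))            # 0.25  (1 substitution / 4 words)
print(round(cer(ref, pred), 3))  # 0.056 (1 insertion / 18 characters)
```

Note that the same helper computes both metrics: WER aligns word lists, CER aligns character lists, so the gap between the two numbers falls out of the unit of comparison alone.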



When You Don’t Have Ground Truth

Creating labeled Arabic data can take weeks or months. Sometimes you simply cannot afford it.

In such cases, some ASR systems provide internal signals that can help approximate quality.

For example, Whisper by OpenAI provides:

  • Average log probability
  • No-speech probability
  • Compression ratio

Average log probability reflects how confident the model is in its predictions. Lower confidence scores often correlate with higher error rates.

No-speech probability helps detect silence segments or over-transcription.

Compression ratio can detect hallucinations or repetitive outputs. If a model outputs:

“السلام السلام السلام السلام”

The compression ratio will often be abnormally high, signaling potential repetition.

These metrics are useful for filtering bad segments or ranking models quickly, but they are not replacements for proper evaluation with references.
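One simple way to act on these signals is to flag suspicious segments before any human review. The sketch below assumes Whisper-style segment dictionaries (the avg_logprob, no_speech_prob, and compression_ratio keys match openai-whisper's transcription output); the thresholds roughly follow that library's defaults but should be tuned on your own data:

```python
def flag_suspicious_segments(segments,
                             logprob_threshold=-1.0,
                             no_speech_threshold=0.6,
                             compression_ratio_threshold=2.4):
    """Return (segment, reasons) pairs for segments that look unreliable."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["avg_logprob"] < logprob_threshold:
            reasons.append("low confidence")
        if seg["no_speech_prob"] > no_speech_threshold:
            reasons.append("likely silence or non-speech")
        if seg["compression_ratio"] > compression_ratio_threshold:
            reasons.append("repetitive output, possible hallucination")
        if reasons:
            flagged.append((seg, reasons))
    return flagged

# Shaped like model.transcribe(audio)["segments"] from openai-whisper
segments = [
    {"text": "السلام عليكم", "avg_logprob": -0.3,
     "no_speech_prob": 0.05, "compression_ratio": 1.4},
    {"text": "السلام السلام السلام السلام", "avg_logprob": -1.6,
     "no_speech_prob": 0.10, "compression_ratio": 3.1},
]
for seg, reasons in flag_suspicious_segments(segments):
    print(seg["text"], "->", "; ".join(reasons))
```

Flagged segments can then be routed to manual review or excluded from downstream processing.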



Normalization in Arabic Evaluation

Normalization is the process of transforming text into a consistent format before computing metrics.

Why is this necessary?

Because Arabic allows multiple valid spellings and formatting variations. Without normalization, you may penalize a model for differences that do not matter for your actual use case.

Examples of common variations:

Punctuation differences:

“السلام عليكم”

“!السلام عليكم”

Alef variations:

“إيمان”

“ايمان”

Hamza variations:

“مسؤولية”

“مسئولية”

Ta Marbuta vs Ha in dialectal writing:

“مدرسة”

“مدرسه”

If you compute WER directly, all of these may count as errors.

Consider this example:

Reference: “!إيمان مسؤولية كبيرة”

Prediction: “ايمان مسؤوليه كبيره”

Without normalization, you might get multiple substitutions.

After normalization (removing punctuation, unifying Alef forms, normalizing Ta Marbuta), both could become:

“ايمان مسؤوليه كبيره”

Now WER becomes 0%.

So how do you choose the right normalization?

It depends entirely on your product.

If your system displays text directly to users and spelling matters, apply minimal normalization.

If your transcription feeds into an LLM, a search engine, or an information retrieval system, stronger normalization may better reflect semantic correctness.

The key is not to maximize your score artificially, but to define normalization rules that reflect what “correct” means in your real application.
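As an illustration, a minimal normalizer covering the variations above might look like the following. The exact rule set is a product decision, not a standard, so treat each rule as optional:

```python
import re

# Illustrative rules only; enable or drop each one based on your product.
ALEF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")  # harakat, dagger alef, tatweel
PUNCTUATION = re.compile(r"[!\"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~،؛؟]")

def normalize(text: str) -> str:
    text = PUNCTUATION.sub("", text)
    text = DIACRITICS.sub("", text)
    text = text.translate(ALEF_VARIANTS)
    text = text.replace("ة", "ه")      # ta marbuta -> ha
    return " ".join(text.split())      # collapse whitespace

reference = "!إيمان مسؤولية كبيرة"
prediction = "ايمان مسؤوليه كبيره"
print(normalize(reference) == normalize(prediction))  # True -> WER becomes 0%
```

Apply the same normalizer to both reference and prediction before computing WER, and keep the rule set versioned so scores remain comparable across experiments.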



Why Qualitative Inspection Still Matters

Even with careful labeling, proper metrics, and well-defined normalization, numbers do not tell the full story.

Consider this example:

Reference: “لا يوجد ورم في الكبد والحالة مستقرة”

Prediction A (lower WER but wrong meaning): “يوجد ورم في الكبد والحالة مستقرة”

WER ≈ 14%

Prediction B (higher WER but correctly preserves the critical meaning): “لا يوجد ورم بالكبد لكن الحالة مستقره”

WER ≈ 57%

Prediction A has a much lower WER. Numerically, it looks better.

But it completely removes the negation and flips the meaning. In medical, legal, or other safety-critical applications, this single deletion can be catastrophic.

Prediction B has a much higher WER, yet it preserves the most important information: there is no tumor.

If your pipeline is:

Audio → ASR → LLM → Final output

A model with slightly lower WER may still produce worse downstream results because it fails on critical words.

That is why you should always inspect samples manually, especially:

  • Negations (لا, لم, لن)
  • Numbers
  • Named entities
  • Domain-specific terminology
  • Rare but high-impact keywords

Qualitative evaluation often reveals systematic weaknesses that aggregate metrics hide.
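One lightweight way to surface such failures at scale is to diff reference and prediction over a small list of critical tokens. The token list below is only an example; in practice you would extend it with numbers, named entities, and your own domain terminology:

```python
# Example critical tokens: Arabic negations. Extend with domain-specific terms.
CRITICAL_TOKENS = {"لا", "لم", "لن", "ليس"}

def critical_token_mismatches(reference, prediction, tokens=CRITICAL_TOKENS):
    """Return critical tokens dropped from the reference and ones hallucinated
    into the prediction, based on simple word-set membership."""
    ref_words = set(reference.split())
    pred_words = set(prediction.split())
    missing = (ref_words - pred_words) & tokens   # spoken, but lost by the model
    inserted = (pred_words - ref_words) & tokens  # never spoken, but transcribed
    return missing, inserted

reference = "لا يوجد ورم في الكبد والحالة مستقرة"
prediction_a = "يوجد ورم في الكبد والحالة مستقرة"
missing, inserted = critical_token_mismatches(reference, prediction_a)
print(missing)  # the dropped negation, despite Prediction A's low WER
```

Samples with any mismatch go to the top of the manual-review queue, which is exactly where a low aggregate WER would otherwise hide them.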



Conclusion

Benchmarking Arabic ASR is not about reporting a single WER number.

It is about:

  • Using data that reflects your real-world use case.
  • Understanding what WER and CER actually measure.
  • Applying normalization thoughtfully.
  • Inspecting outputs manually where impact is highest.

Metrics are powerful abstractions. But your goal is not to win a leaderboard. It is to build a system that works reliably in your specific context.

In Arabic ASR, small orthographic differences and single-word errors can have large consequences. When ASR becomes the foundation of an Agentic AI system, those errors propagate downstream into summaries, decisions, and automated workflows.

Production-grade Agentic AI demands production-grade evaluation.

If you are building an Agentic AI product and preparing to take it to production, especially in Arabic enterprise environments, getting the ASR layer right is not optional. If that is where you are, feel free to reach out.
