
Sarvam-M: The Powerhouse Indic AI Model Built for Efficiency

  • Philip Moses
  • 2 days ago
  • 3 min read
Large language models (LLMs) like ChatGPT excel in English—but struggle with Indian languages. Complex scripts, limited training data, and cultural nuances make it tough for global AI to work well in India.

Enter Sarvam-M, a 24-billion-parameter hybrid AI model built by Sarvam AI, optimized for:

  • 11 Indian languages (Hindi, Tamil, Bengali, and more)

  • Math & coding – performs like models 3x its size!

  • Low-cost deployment – runs efficiently even on CPUs


In this blog, we’ll explore how Sarvam-M was optimized for Indian languages, math, and programming.

We’ll cover:

  1. Training & Fine-Tuning – How Sarvam-M was refined for Indic languages.

  2. Inference Optimizations – Techniques like quantization and lookahead decoding for faster responses.

  3. Knowledge Augmentation – Using Retrieval-Augmented Generation (RAG) for better accuracy.

Let’s dive in!

Why Sarvam-M? The Need for Indic AI

Most large language models (LLMs) struggle with Indian languages due to complex grammar, diverse scripts, and limited training data. Sarvam AI built Sarvam-M to bridge this gap—a 24B-parameter model based on Mistral Small, fine-tuned for:


  • Indian Languages (Hindi, Tamil, Bengali, and more)

  • Math & Coding – Comparable to models 3x its size!

  • Dual Modes – Fast responses ("non-think") vs. deep reasoning ("think" mode).


This makes it ideal for chatbots, translation, and education tools in India.


How Sarvam-M Was Fine-Tuned for Performance

A. Supervised Fine-Tuning (SFT): Better Data = Better AI

Sarvam AI carefully curated 3.7 million high-quality prompts, including translations in 11 Indian languages. Key steps:


  • Deduplication – Removed duplicate or low-quality data.

  • Clustering & Filtering – Grouped similar prompts for balanced training.

  • Cultural Adaptation – Adjusted responses to fit Indian contexts.
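As a minimal illustration of the first curation step, exact-duplicate removal can be sketched with a hash-based filter. This is a toy sketch under assumed behavior, not Sarvam AI's actual pipeline, and the `deduplicate` helper is hypothetical:

```python
import hashlib

def deduplicate(prompts):
    """Drop exact duplicates by hashing normalized text (hypothetical helper)."""
    seen, unique = set(), []
    for p in prompts:
        # Normalize before hashing so trivial variants collapse together.
        key = hashlib.sha256(p.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

prompts = ["Translate to Hindi: hello", "translate to hindi: hello ", "What is 2+2?"]
print(deduplicate(prompts))  # the two normalized duplicates collapse into one entry
```

A real pipeline would add near-duplicate detection (e.g. embedding-based clustering) on top of exact matching, which is where the clustering step above comes in.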


B. Reinforcement Learning (RLVR): Smarter Rewards

Instead of generic rewards, Sarvam-M used verifiable feedback for tasks like:


  • Math Problems – Checking answers against LaTeX-formatted solutions.

  • Coding Tasks – Running code in a sandbox to verify correctness.

  • Translation – Using chrF++ scores to measure quality.


This ensured the model learned from objective, verifiable signals rather than generic preference scores.
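The shape of such verifiable rewards can be sketched as simple pass/fail checks. This is an illustrative toy, not Sarvam's reward code; in particular, a real coding reward runs candidates in an isolated sandbox, not a bare `exec`:

```python
def math_reward(model_answer: str, reference: str) -> float:
    """Binary reward: 1.0 only if the answer matches the reference after normalization."""
    norm = lambda s: s.strip().replace(" ", "")
    return 1.0 if norm(model_answer) == norm(reference) else 0.0

def code_reward(source: str, test_input, expected) -> float:
    """Toy 'sandbox': execute the candidate and check its output.
    (Hypothetical sketch; a real system isolates the process for safety.)"""
    ns = {}
    try:
        exec(source, ns)
        return 1.0 if ns["solve"](test_input) == expected else 0.0
    except Exception:
        return 0.0

print(math_reward("42", " 42 "))                        # 1.0
print(code_reward("def solve(x): return x * 2", 3, 6))  # 1.0
```

Because each reward is checkable, the reinforcement signal cannot drift the way a learned reward model can.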


Making Sarvam-M Faster & Cheaper to Run

A. FP8 Quantization: Smaller Model, Same Accuracy


By reducing precision from 16-bit to 8-bit, Sarvam-M became 50% smaller with almost no loss in performance.

🔹 Key Insight: Calibration data must match real-world usage—bad samples hurt accuracy!
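The scale-and-round principle behind quantization can be shown with a toy per-tensor 8-bit scheme in pure Python. Note this is integer quantization for illustration only; FP8 uses a floating-point 8-bit format, but the core idea of trading precision for size is the same:

```python
def quantize_int8(weights):
    """Per-tensor symmetric 8-bit quantization (toy illustration)."""
    scale = max(abs(w) for w in weights) / 127.0  # map the largest weight to 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 8-bit values."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03]
q, s = quantize_int8(w)
restored = dequantize(q, s)
# Reconstruction error is bounded by one quantization step.
print(max(abs(a - b) for a, b in zip(w, restored)) < s)  # True
```

This also hints at why calibration data matters: the scale is derived from observed values, so unrepresentative samples produce a bad scale and hurt accuracy.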


B. Lookahead Decoding: 2x Faster Responses


This technique predicts multiple tokens at once, speeding up replies.

⚠ Limitation: Struggles with high user loads due to batch constraints.
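A highly simplified sketch of the verification step: drafted tokens are accepted only as long as they match what the model itself would have produced, so several tokens can be confirmed in one pass. The `greedy_next` function below is a stand-in toy model, not a real LLM:

```python
def greedy_next(prefix):
    """Stand-in for the model's greedy next-token function (toy vocabulary)."""
    vocab = {"the": "cat", "cat": "sat", "sat": "down"}
    return vocab.get(prefix[-1], "<eos>")

def lookahead_accept(prefix, draft):
    """Accept the longest draft prefix the model agrees with (simplified)."""
    accepted = []
    for tok in draft:
        if greedy_next(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first disagreement ends the accepted run
    return accepted

print(lookahead_accept(["the"], ["cat", "sat", "up"]))  # ['cat', 'sat']
```

When the draft is mostly right, multiple tokens land per model call; under heavy batching, the extra draft computation competes for the same GPU, which is the limitation noted above.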


C. Dynamic Batching & KV Caching


  • Dynamic Batching – Groups multiple requests to maximize GPU usage.

  • KV Caching – Stores past computations to avoid re-processing.

These tweaks help Sarvam-M handle more users with lower costs.
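The KV-cache idea can be sketched as a per-request, append-only store of past attention states, so each new token only computes its own keys and values. This is a schematic toy, not an actual inference-server implementation:

```python
class KVCache:
    """Toy key/value cache: one append-only list of states per request."""
    def __init__(self):
        self.store = {}

    def append(self, seq_id, kv):
        # Each generation step appends one entry instead of recomputing all.
        self.store.setdefault(seq_id, []).append(kv)

    def get(self, seq_id):
        return self.store.get(seq_id, [])

cache = KVCache()
for kv in ["k0v0", "k1v1", "k2v2"]:
    cache.append("req-1", kv)
# At step 3, only the newest token is computed; the first three are reused.
print(len(cache.get("req-1")))  # 3
```

Dynamic batching sits one level above this: the server groups whichever requests are in flight into a single forward pass, each carrying its own cache entry.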


D. CPU-Friendly AI: Reducing GPU Dependency


Sarvam AI optimized the model to run efficiently even on CPUs, making AI more accessible in India.


Boosting Accuracy with Retrieval-Augmented Generation (RAG)

Since LLMs can’t know everything, Sarvam-M integrates Wikipedia lookups for factual answers. Results:

  • SimpleQA Benchmark – accuracy jumped from 5% → 72%

  • Indic Language QA – beat OpenAI’s models by 12-48%


Optimizations:

  • Better Chunking – Keeps tables and structured data intact.

  • Multilingual Embeddings – Used bge-multilingual-gemma2 for improved search.
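The retrieve-then-answer flow can be sketched with a toy relevance score. Here word overlap stands in for the embedding similarity a real system (e.g. one using bge-multilingual-gemma2) would compute; the `retrieve` helper is illustrative, not Sarvam's implementation:

```python
def score(query, chunk):
    """Toy relevance score: shared-word count (real systems use embeddings)."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c)

def retrieve(query, chunks, k=1):
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

chunks = [
    "The Ganges is a major river in India.",
    "Python is a programming language.",
]
context = retrieve("Which river flows in India?", chunks)
print(context[0])  # the river chunk is selected as grounding context
```

The retrieved chunk is then prepended to the prompt, so the model answers from retrieved facts instead of relying on parametric memory alone.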


Lessons from Failed Experiments

  • Tokenizer Expansion for Indian Languages – adding new tokens hurt the model’s existing knowledge.

  • LLM-Based Rewards in Reinforcement Learning – unstable feedback made training unreliable.


Key Takeaway: Some optimizations need fundamental changes, not quick fixes.


Conclusion: A Blueprint for Efficient, Localized AI

Sarvam-M proves that smaller, specialized models can outperform larger ones when optimized correctly. Key wins:


  • Faster inference with FP8 quantization & lookahead decoding.

  • Better Indic performance through fine-tuning & RAG.

  • Cost-effective deployment on CPUs & optimized GPUs.


This approach makes AI more accessible for India—and could inspire similar models worldwide!

 
 
 
