Inventory

What languages are live right now

This page shows what is available now and what is still being built. Pipeline work is visible for planning, but it is not counted as live.

Live vs pipeline

Last updated · May 20, 2026

Languages now

11

Available now

Dialect clusters

37

With dialect-level metadata

Text tokens

2.4B

Across released packages

Speech hours

18.6k

Segmented + transcript-audited

Regional map

Where we are live, and where we are still building

Five regions are tracked today. Coverage is strongest in the East and Northeast, our home regions.

East
Available now: Bengali, Odia, Maithili, Bhojpuri, Santali

Northeast
Available now: Assamese, Meitei
In pipeline: Khasi, Bodo

North
Available now: Hindi
In pipeline: Punjabi

West
Available now: Marathi
In pipeline: Gujarati

South
Available now: Tamil, Telugu
In pipeline: Kannada, Malayalam

Languages you can review now

For each one we show format, size, tier range, and QA status.

Language: Bengali
Dialect / Variant: Standard · Sylheti · Chittagonian · Rangpuri
Region: East
Modality: Text · Speech · Hybrid
Data volume: 1.1B tokens · 5.2k hrs
Risk tier: Tier 1-2
QA status: Release-ready

Language: Assamese
Dialect / Variant: Standard · Kamrupi · Goalpariya
Region: Northeast
Modality: Text · Speech
Data volume: 220M tokens · 2.8k hrs
Risk tier: Tier 1-2
QA status: Release-ready

Language: Odia
Dialect / Variant: Standard · Sambalpuri · Koraputia
Region: East
Modality: Text · Speech
Data volume: 340M tokens · 1.9k hrs
Risk tier: Tier 1-2
QA status: Release-ready

Language: Maithili
Dialect / Variant: Standard cluster
Region: East
Modality: Text · Hybrid
Data volume: 95M tokens
Risk tier: Tier 1-3
QA status: Sampling complete

Language: Meitei (Manipuri)
Dialect / Variant: Standard · Sub-cluster mix
Region: Northeast
Modality: Text · Speech
Data volume: 70M tokens · 1.4k hrs
Risk tier: Tier 2-3
QA status: Controlled release

Language: Santali
Dialect / Variant: Standard · Ol Chiki
Region: East
Modality: Speech · Hybrid
Data volume: 1.1k hrs
Risk tier: Tier 2-3
QA status: Sampling complete

Language: Bhojpuri
Dialect / Variant: Regional clusters
Region: East
Modality: Text · Speech
Data volume: 140M tokens · 0.9k hrs
Risk tier: Tier 1-3
QA status: Release-ready

Language: Hindi
Dialect / Variant: Standard + mixed metro variants
Region: North
Modality: Text · Hybrid
Data volume: 420M tokens
Risk tier: Tier 1-2
QA status: Release-ready

Language: Marathi
Dialect / Variant: Urban + regional mix
Region: West
Modality: Text · Hybrid
Data volume: 210M tokens
Risk tier: Tier 1-2
QA status: Release-ready

Language: Tamil
Dialect / Variant: Standard + colloquial
Region: South
Modality: Text · Speech
Data volume: 180M tokens · 1.2k hrs
Risk tier: Tier 1-3
QA status: Sampling complete

Language: Telugu
Dialect / Variant: Regional cluster mix
Region: South
Modality: Text · Speech
Data volume: 160M tokens · 0.9k hrs
Risk tier: Tier 1-3
QA status: Sampling complete

In pipeline

Languages planned next

Useful for planning, but not a promise until source review is done.

Language group: Khasi · Garo expansion
Region: Northeast
Current stage: Partner review · field collection scaling
Expected release window: Q3 2026

Language group: Mizo · Bodo
Region: Northeast
Current stage: Provenance refresh and dialect mapping
Expected release window: Q4 2026

Language group: Sylheti speech expansion
Region: East
Current stage: Native review sampling
Expected release window: Next speech release

Language group: Kannada · Malayalam
Region: South
Current stage: Source mapping + dialect review
Expected release window: Pending review

Language group: Gujarati · Punjabi
Region: West · North
Current stage: Selective cluster build-out
Expected release window: Phased rollout

How a language enters the catalog

  • Collected from partners or consented contributors
  • Dialect labels added where they matter
  • Source docs tied to the release record
  • Real sample review before a contract is signed

How releases are versioned

  • Every release receives a version ID and manifest hash
  • Corrections ship as new versions, never silent overwrites
  • Buyers get a change note when affected fields or files change
  • Older versions stay visible for comparison and audit
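
For teams that want to check a delivery against its release record, the versioning rules above can be pictured roughly as follows. This is a minimal sketch, not our production tooling: the function names, the version ID format, the file names, and the choice of SHA-256 over a sorted JSON listing are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Mapping


def manifest_hash(files: Mapping[str, bytes]) -> str:
    """Hash a manifest built from per-file SHA-256 digests (assumed scheme)."""
    # Digest each file, then hash a canonical (sorted, compact) JSON listing,
    # so the result does not depend on the order files are supplied in.
    entries = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def release_record(version_id: str, files: Mapping[str, bytes]) -> dict:
    """Pair a version ID with its manifest hash (hypothetical record shape)."""
    return {"version_id": version_id, "manifest_hash": manifest_hash(files)}


# A correction ships as a new version with its own hash; v1 is left untouched.
v1 = release_record("example-1.0.0", {"train.jsonl": b"row one\n"})
v2 = release_record("example-1.0.1", {"train.jsonl": b"row one (corrected)\n"})
print(v1["manifest_hash"] != v2["manifest_hash"])
```

Because any change to a file changes the manifest hash, a corrected file cannot pass as the earlier version, and both records can be kept side by side for comparison and audit.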

Request coverage for your exact language mix.

We send a CSV with languages, format, volume, license tier, and QA status.
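
Once the CSV arrives, a short script can narrow it to the rows you care about. The sketch below is illustrative only: the column names (`language`, `format`, `volume`, `license_tier`, `qa_status`) are assumed rather than confirmed, and the `license_tier` values are placeholders. The two sample rows reuse figures from the table on this page.

```python
import csv
import io

# Hypothetical layout; the real file's headers may differ.
SAMPLE_CSV = """language,format,volume,license_tier,qa_status
Bengali,Text · Speech · Hybrid,1.1B tokens · 5.2k hrs,placeholder,Release-ready
Maithili,Text · Hybrid,95M tokens,placeholder,Sampling complete
"""


def release_ready(csv_text: str) -> list:
    """Return the languages whose QA status is exactly 'Release-ready'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["language"]
        for row in reader
        if row["qa_status"].strip() == "Release-ready"
    ]


ready = release_ready(SAMPLE_CSV)
print(ready)
```

To read a delivered file instead of the inline sample, open it with `encoding="utf-8"` and `newline=""` and pass the file object to `csv.DictReader` directly.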