Indian language data for model teams

Licensed Indic text and speech data for training and evals.

We collect, clean, and deliver Bengali-led datasets with the license paperwork attached, so legal, model, and data teams can review the same package.

Every dataset has a real license
Native speakers check every release
Delivered straight to your S3 / GCS / Azure bucket

Snapshot

Where the catalog stands today

Live

Languages now

11

Available now

Dialect clusters

37

With dialect-level metadata

Text tokens

2.4B

Across released packages

Speech hours

18.6k

Segmented + transcript-audited

Updated each release
Full coverage map
  • Bengali · Bangla
  • Assamese · Asomiya
  • Odia
  • Meitei · Meetei Mayek
  • Maithili · Devanagari
  • Sylheti
  • Bhojpuri
  • Gujarati
  • Tamil
  • Telugu
  • Kannada
  • Malayalam
Real licenses, not scraped
Tiered by training risk
Manifest + checksum on every release
Dialects labelled, not flattened
Why teams get stuck

Why Indic data deals slow down

Most teams do not get blocked on model quality. They get blocked on source quality, licenses, or unclear delivery. These are the four problems we fix.

Bengali is huge, and missing from most models

230M+ speakers. Most foundation models ship with thin Bengali coverage and barely any Assamese, Odia, or Northeast languages at all. The catalog starts there.

Web-scraped data makes lawyers nervous

If you cannot say where the data came from or what the license allows, the deal stalls. Every dataset ships with a paper trail that documents its source and its license terms.

Your model cannot tell Sylheti from Standard Bengali

When all Bengali looks the same in training, real users suffer. Dialects such as Sylheti, Chittagonian, and Rangpuri are labelled.

Sourcing data should not take six months

Legal review, vendor onboarding, and sample exchange should not burn a quarter. The workflow ships what legal needs upfront.

How the catalog is built

Collect. Clean. Deliver.

We keep the process simple: gather the data, review it hard, then ship it with the docs your team needs.

01

Collect

Field teams in Kolkata, Guwahati, Bhubaneswar, and Imphal partner with universities, archives, and consenting speakers. Source provenance is logged when data is collected.

02

Clean

Each source gets license review, dialect labelling, script normalization, and a native-speaker QA pass. Failed batches stay out.

03

Deliver

Datasets land in your bucket with a manifest, checksum, license paperwork, and a version number. No silent overwrites.

Coverage snapshot

What is live now, and what is coming next

We separate the live catalog from pipeline work so you know what you can review today.

[Coverage map. Regions: East, Northeast, Northwest, South. Cities marked: Kolkata (HQ), Guwahati, Bhubaneswar, Imphal, Patna.]

Available now

Released
  • Bengali, Assamese, Odia, Maithili, Bhojpuri
  • Hindi, Marathi, Tamil, and Telugu also live
  • Text, speech, or mixed packages depending on language

In pipeline

Staged
  • Meitei, Santali, Khasi, Garo, and Bodo work in progress
  • More Bengali speech and dialect expansion
  • Kannada, Malayalam, Gujarati, and Punjabi build-out
Modalities and packaging

Three ways to buy the data

IDX-TXT-26Q2

Text datasets

Large text datasets for pretraining, evals, instruction work, and retrieval across Bengali, Odia, Assamese, Maithili, Hindi, and more.

  • UTF-normalized text
  • Dialect labels where needed
  • Manifest with source docs
IDX-SPH-26Q2

Speech datasets

Speech datasets for ASR training, alignment, and quality checks, starting with Bengali and nearby dialect clusters.

  • QC on each segment
  • Speaker and accent metadata
  • Native-speaker transcript review
IDX-HYB-26Q2

Text + speech bundles

Combined packages for teams that want text and speech in one release, with one manifest and one license summary.

  • One manifest
  • One license summary
  • Versioned changelog
License tiers

A simple answer to “Can we train on this?”

Every source gets a tier. You see what is allowed, what is blocked, and what docs back it up.

  • Tier 1: Green light, train on it. We have a signed license that explicitly allows training. Use it the way the contract says.
  • Tier 2: Train with guardrails. Training is fine, but the license adds limits, such as which markets, which fields, or what you can ship downstream.
  • Tier 3: Talk to legal first. Useful for research and small experiments. Production training needs written approval from both legal teams.
  • Tier 4: Do not train on it. Either the license doesn't allow training or the paper trail isn't strong enough. You can still review it internally.
  • Tier 5: Blocked. We don't ship it. The source has a problem we can't resolve: unclear origin, broken license, or a hard restriction.
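
For teams that want to enforce the tiers in a data loader, the rules above can be sketched roughly as follows. This is an illustration only: the `LicenseTier` names, the `Source` record, and the `written_approval` flag are assumptions made for the example and are not fields of the delivered manifest.

```python
from dataclasses import dataclass
from enum import IntEnum


class LicenseTier(IntEnum):
    """The five tiers described above; member names are illustrative."""
    TRAIN = 1               # signed license explicitly allows training
    TRAIN_WITH_LIMITS = 2   # training allowed, with market/field/downstream limits
    LEGAL_REVIEW = 3        # research use until both legal teams approve in writing
    NO_TRAINING = 4         # internal review only
    BLOCKED = 5             # not shipped


@dataclass(frozen=True)
class Source:
    source_id: str                  # hypothetical identifier
    tier: LicenseTier
    written_approval: bool = False  # hypothetical flag for Tier 3 sign-off


def may_train_in_production(src: Source) -> bool:
    """Return True only when the tier permits production training."""
    if src.tier in (LicenseTier.TRAIN, LicenseTier.TRAIN_WITH_LIMITS):
        # Tier 2 limits (markets, fields, downstream use) are not checked here.
        return True
    if src.tier == LicenseTier.LEGAL_REVIEW:
        return src.written_approval
    return False
```

A check like this only gates on the tier itself. Tier 2 restrictions still need to be read from the license summary and enforced separately.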
Quality framework

How we keep bad data out

Every batch goes through four checks before it ships.

01

Does the file even look right?

Required fields, file structure, and metadata are checked before the batch enters deeper review.

02

Automated checks

Deduplication, outlier detection, language ID, and transcript sanity checks run across every segment.

03

Native speakers review samples

Bengali, Assamese, Odia, and Meitei reviewers listen, read, and flag what machines miss.

04

Release gate

Passing batches receive a version number and scorecard. Failing batches do not ship.
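
As a rough mental model, the four checks can be pictured as a gate that a batch passes through in order. The sketch below is a simplification under stated assumptions, not the production pipeline: the required field names, the exact-match deduplication, the `reviewer_approved` input standing in for native-speaker review, and the `max_reject_rate` threshold are all hypothetical.

```python
import hashlib

REQUIRED_FIELDS = ("id", "text", "language", "dialect")  # assumed schema


def structure_ok(record: dict) -> bool:
    """Check 1: every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def drop_exact_duplicates(records: list[dict]) -> list[dict]:
    """Part of check 2: keep the first record for each distinct text."""
    seen: set[str] = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record["text"].strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def release_gate(records: list[dict], reviewer_approved: bool,
                 max_reject_rate: float = 0.10):
    """Run the checks in order. Returns (passed, kept_records, reject_rate)."""
    if not records:
        return False, [], 0.0
    well_formed = [r for r in records if structure_ok(r)]
    kept = drop_exact_duplicates(well_formed)
    reject_rate = 1 - len(kept) / len(records)
    # Check 3 is a human review, represented here by a boolean verdict.
    # Check 4: the batch ships only if review passed and few records were dropped.
    passed = reviewer_approved and reject_rate <= max_reject_rate
    return passed, kept, reject_rate
```

Real automated checks would also cover outlier detection, language ID, and transcript sanity checks, which are omitted here for brevity.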

97.4%

Transcription accuracy

Bengali speech median across recent releases

91.2%

Dialect label confidence

Checked on each release

4.8%

Rejected segments removed

Removed before buyer export

Access workflow

How delivery works

Five steps from first call to files in your bucket.

01

30-minute call to scope it out

Languages, formats, use case, and buyer constraints are clarified before anything sensitive moves.

02

Sign an NDA, swap paperwork

License, security, buyer requirements, and source evidence are exchanged.

03

Look at sample data

A real slice arrives with a data card, dialect breakdown, license summary, and QA notes.

04

Sign the contract

Scope, price, delivery dates, and commercial limits go into the contract.

05

Data lands in your bucket

The dataset lands in S3, GCS, Azure, or SFTP with manifest, checksums, and source docs.
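
Once the files arrive, a buyer can confirm that nothing was altered in transit by recomputing SHA-256 hashes against the manifest. The sketch below assumes a `manifest.json` with a `files` list of `{"path", "sha256"}` entries. That layout and the file name are assumptions for illustration, since the actual manifest schema is defined per release.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(root: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return the relative paths that are missing or whose hash differs."""
    manifest = json.loads((root / manifest_name).read_text(encoding="utf-8"))
    failures = []
    for entry in manifest["files"]:  # assumed layout: [{"path": ..., "sha256": ...}]
        target = root / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"].lower():
            failures.append(entry["path"])
    return failures
```

An empty list from `verify_release(Path("/data/release"))` would mean every listed file is present and matches its recorded hash. The path in that call is a placeholder.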

Use cases

Built for teams training, evaluating, and shipping multilingual systems

Pretraining enrichment

Add Bengali, Odia, Assamese, Maithili, and adjacent coverage in volumes large enough to matter.

ASR robustness

Train and evaluate on dialect-labelled speech for Sylheti, Chittagonian, Rangpuri, Kamrupi, and more.

Stable benchmarks

Use versioned benchmark slices and release notes so model comparisons remain inspectable over time.

Fits your stack

Use the storage and formats you already have

We deliver into the buckets, file formats, and training stacks your team already uses.

Delivery targets
  • AWS S3 + presigned URLs
  • Google Cloud Storage
  • Azure Blob Storage
  • SFTP with key-based access
Training formats
  • HuggingFace dataset format
  • NVIDIA NeMo Curator schema
  • Mosaic / Composer-compatible JSONL
  • WebDataset shards
Supporting files
  • Manifest CSV + JSON
  • SHA-256 checksum bundles
  • Evidence-reference PDFs
  • Audit-log JSONL

Languages live

11

Available now

Active pilots

4

Model teams in scoping or pilot work

Release windows shipped

9

Shipped in the last 12 months

Native-speaker reviewers

28

Across Bengali, Assamese, Odia, Meitei

Ready when you are

Tell us what languages you need.

Send the languages, format, and timeline. We reply within two business days.