Licensed Indic text and speech data for training and evals.
We collect, clean, and deliver Bengali-led datasets with the license paperwork attached, so legal, model, and data teams can review the same package.
Snapshot
Where the catalog stands today
Languages now
11
Available now
Dialect clusters
37
With dialect-level metadata
Text tokens
2.4B
Across released packages
Speech hours
18.6k
Segmented + transcript-audited
Why Indic data deals slow down
Most teams do not get blocked on model quality. They get blocked on source quality, licenses, or unclear delivery. These are the four problems we fix.
Bengali is huge, and missing from most models
230M+ speakers. Most foundation models ship with thin Bengali coverage and barely any Assamese, Odia, or Northeast languages at all. The catalog starts there.
Web-scraped data makes lawyers nervous
If you cannot say where the data came from or what the license allows, the deal stalls. Every dataset ships with a paper trail that documents both.
Your model cannot tell Sylheti from Standard Bengali
When all Bengali looks the same in training, real users suffer. Our datasets label dialects such as Sylheti, Chittagonian, and Rangpuri.
Sourcing data should not take six months
Legal review, vendor onboarding, and sample exchange should not burn a quarter. The workflow ships what legal needs upfront.
Collect. Clean. Deliver.
We keep the process simple: gather the data, review it hard, then ship it with the docs your team needs.
Collect
Field teams in Kolkata, Guwahati, Bhubaneswar, and Imphal partner with universities, archives, and consenting speakers. Source provenance is logged when data is collected.
Clean
Each source gets license review, dialect labelling, script normalization, and a native-speaker QA pass. Failed batches stay out.
Deliver
Datasets land in your bucket with a manifest, checksum, license paperwork, and a version number. No silent overwrites.
What is live now, and what is coming next
We separate the live catalog from pipeline work so you know what you can review today.
Available now
Released
- Bengali, Assamese, Odia, Maithili, Bhojpuri
- Hindi, Marathi, Tamil, and Telugu also live
- Text, speech, or mixed packages depending on language
In pipeline
Staged
- Meitei, Santali, Khasi, Garo, and Bodo work in progress
- More Bengali speech and dialect expansion
- Kannada, Malayalam, Gujarati, and Punjabi build-out
Three ways to buy the data
Text datasets
Large text datasets for pretraining, evals, instruction work, and retrieval across Bengali, Odia, Assamese, Maithili, Hindi, and more.
- Unicode-normalized text
- Dialect labels where needed
- Manifest with source docs
Speech datasets
Speech datasets for ASR training, alignment, and quality checks, starting with Bengali and nearby dialect clusters.
- QC on each segment
- Speaker and accent metadata
- Native-speaker transcript review
Text + speech bundles
Combined packages for teams that want text and speech in one release, with one manifest and one license summary.
- One manifest
- One license summary
- Versioned changelog
A simple answer to “Can we train on this?”
Every source gets a tier. You see what is allowed, what is blocked, and what docs back it up.
How we keep bad data out
Every batch goes through four checks before it ships.
Does the file even look right?
Required fields, file structure, and metadata are checked before the batch enters deeper review.
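For teams that want to mirror this first check on their side, here is a minimal sketch of a required-field pass over metadata records. The field names are hypothetical placeholders, not our published schema; the data card that ships with each package is the authority on what is required.

```python
from typing import Any

# Hypothetical field names for illustration only; the real list
# comes from the data card delivered with the package.
REQUIRED_FIELDS = ("id", "language", "dialect", "source_id", "license_tier")


def missing_fields(record: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or empty in one record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def split_batch(records: list[dict[str, Any]]):
    """Separate records that pass the structural check from those that fail."""
    passed, failed = [], []
    for record in records:
        gaps = missing_fields(record)
        if gaps:
            # Keep the record id and the missing fields for the QA notes.
            failed.append((record.get("id"), gaps))
        else:
            passed.append(record)
    return passed, failed
```

Records in the `failed` list carry the names of the missing fields, so a reviewer can see why a record was held back rather than just that it was.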
Automated checks
Deduplication, outlier detection, language ID, and transcript sanity checks run across every segment.
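As an illustration of the deduplication part of this step only, the sketch below drops exact duplicates after Unicode normalization and whitespace cleanup. It is an assumption about one reasonable approach, not a description of our production pipeline, and it does not cover near-duplicates, outlier detection, or language ID.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(segments: list[str]) -> list[str]:
    """Keep the first occurrence of each segment, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for segment in segments:
        # Hash the normalized form so the seen-set stays small for long segments.
        key = hashlib.sha256(normalize(segment).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(segment)
    return unique
```

Normalizing before hashing matters for Indic scripts, where the same visible text can be encoded with different code point sequences.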
Native speakers review samples
Bengali, Assamese, Odia, and Meitei reviewers listen, read, and flag what machines miss.
Release gate
Passing batches receive a version number and scorecard. Failing batches do not ship.
97.4%
Transcription accuracy
Bengali speech median across recent releases
91.2%
Dialect label confidence
Checked on each release
4.8%
Rejected segments
Removed before buyer export
How delivery works
Five steps from first call to files in your bucket.
30-minute call to scope it out
Languages, formats, use case, and buyer constraints are clarified before anything sensitive moves.
Sign an NDA, swap paperwork
License terms, security requirements, buyer requirements, and source evidence are exchanged.
Look at sample data
A real slice arrives with a data card, dialect breakdown, license summary, and QA notes.
Sign the contract
Scope, price, delivery dates, and commercial limits go into the contract.
Data lands in your bucket
The dataset lands in S3, GCS, Azure, or SFTP with manifest, checksums, and source docs.
Built for teams training, evaluating, and shipping multilingual systems
Pretraining enrichment
Add Bengali, Odia, Assamese, Maithili, and adjacent coverage in volumes large enough to matter.
ASR robustness
Train and evaluate on dialect-labelled speech for Sylheti, Chittagonian, Rangpuri, Kamrupi, and more.
Stable benchmarks
Use versioned benchmark slices and release notes so model comparisons remain inspectable over time.
Use the storage and formats you already have
We deliver into the buckets, file formats, and training stacks your team already uses.
- AWS S3 + presigned URLs
- Google Cloud Storage
- Azure Blob Storage
- SFTP with key-based access
- HuggingFace dataset format
- NVIDIA NeMo Curator schema
- Mosaic / Composer-compatible JSONL
- WebDataset shards
- Manifest CSV + JSON
- SHA-256 checksum bundles
- Evidence-reference PDFs
- Audit-log JSONL
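Once a delivery lands, the SHA-256 checksums can be verified against the manifest before anything enters a training run. The sketch below assumes a hypothetical JSON manifest named `manifest.json` with a `files` list of `path` and `sha256` entries; the actual manifest layout is documented with each release and may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large audio shards do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_delivery(root: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return paths that are missing or whose digest does not match the manifest.

    Assumes a hypothetical layout: {"files": [{"path": ..., "sha256": ...}]}.
    """
    manifest = json.loads((root / manifest_name).read_text(encoding="utf-8"))
    mismatched = []
    for entry in manifest["files"]:
        target = root / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"].lower():
            mismatched.append(entry["path"])
    return mismatched
```

An empty list from `verify_delivery(Path("/data/delivery"))` means every listed file is present and matches its recorded digest; the directory path here is only an example.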
Languages live
11
Available now
Active pilots
4
Model teams in scoping or pilot work
Release windows shipped
9
Shipped in the last 12 months
Native-speaker reviewers
28
Across Bengali, Assamese, Odia, Meitei
Ready when you are
Tell us what languages you need.
Send the languages, format, and timeline. We reply within two business days.