Built in Kolkata for hard-to-find Indian language data
Indic Data Exchange Labs Pvt. Ltd. collects and delivers Bengali-led text and speech datasets for model teams that need cleaner sourcing and clearer licenses.
Why we started with Bengali and the East
These languages are not small. They are just badly served by the usual data pipeline: too much web junk, too little licensing clarity, and too much time lost in legal review.
We started in Kolkata because that is where the field teams, partner network, and language expertise already existed. From there we expanded into Assamese, Odia, Meitei, and nearby language groups that most global datasets still ignore.
The thesis
If the source story is messy, the deal dies.
A dataset only matters if a buyer can actually use it. That means clear licenses, traceable sources, and versioned releases that legal and engineering can both inspect.
Built for model, ASR, and eval teams. Data you can use, not just data that looks good in a deck.
Languages live
11
Available now
Active pilots
4
Model teams in scoping or pilot work
Release windows shipped
9
Shipped in the last 12 months
Native-speaker reviewers
28
Across Bengali, Assamese, Odia, Meitei
The people running the catalog
Collection and review are run from Kolkata, Guwahati, Bhubaneswar, and Imphal.
Founder · CEO
Kolkata
Partnerships, catalog ops, deals
Builds the catalog and runs partnerships. Background in multilingual data engineering.
Full profile shared under NDA.
Co-founder · CTO
Kolkata · remote
Data pipeline, release records, delivery
Owns the data pipeline, release schema, and delivery systems.
Full profile shared under NDA.
Head of QA · Linguistics
Kolkata
QA, reviewer operations, dialect labels
Leads Bengali, Assamese, and Meitei review work and manages the native-speaker reviewer pool.
Full profile shared under NDA.
Compliance lead
Kolkata
Licensing review, buyer packages
Runs tier reviews and prepares the documents buyers use to clear data with their legal teams.
Full profile shared under NDA.
What we've shipped so far
Founded in Kolkata
Started with Bengali and Assamese: licensed Indic data, not random web dumps.
First university partnerships
Field-team agreements with three regional universities for text corpora and consented speech.
5-tier compliance rubric published
Published the rubric we use to classify which sources can and cannot be used for training.
First release cycle for design partners
Completed the first NDA-to-delivery cycle with manifests, scorecards, and license docs.
Northeast catalog expansion
Added Meitei, Khasi, and Bodo pipelines. Reviewer pool grew to 28 native speakers.
9th release shipped
Bengali 26Q2 R3 included 1.1B tokens of text and 5.2k hours of speech across Tier 1-2 sources.
Fits the rest of your training stack
We deliver into the buckets and formats teams already use.
Delivery targets
- AWS S3 + presigned URLs
- Google Cloud Storage
- Azure Blob Storage
- SFTP with key-based access
Training formats
- Hugging Face Datasets format
- NVIDIA NeMo Curator schema
- Mosaic / Composer-compatible JSONL
- WebDataset shards
Supporting files
- Manifest CSV + JSON
- SHA-256 checksum bundles
- Evidence-reference PDFs
- Audit-log JSONL
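On the buyer side, a delivery can be checked against its manifest before anything enters a training pipeline. The snippet below is a minimal sketch of that check, not our actual release schema: the `path` and `sha256` column names and the `verify_manifest` helper are assumptions made for illustration, and the real manifest layout ships with each release.

```python
import csv
import hashlib
import io
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio shards are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_csv: str, root: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest does not match.

    Assumes a hypothetical manifest with `path` and `sha256` columns.
    """
    failures = []
    for row in csv.DictReader(io.StringIO(manifest_csv)):
        target = root / row["path"]
        if not target.is_file() or sha256_of(target) != row["sha256"].lower():
            failures.append(row["path"])
    return failures
```

An empty list from `verify_manifest(manifest_text, Path("delivery/"))` means every listed file is present and matches its recorded digest. Any returned path should be downloaded again or raised with us before use.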
Raising pre-seed
Building the licensed Indic data layer model teams actually need.
$2-3M pre-seed to fund the next 18 months.
Use of funds
- Expand reviewer network across more Northeast languages
- Grow Bengali speech, Sylheti, Chittagonian, and Kamrupi field collection
- Strengthen data pipeline and delivery systems
- Hire in speech engineering, compliance, and partnerships