About · Kolkata, India

Built in Kolkata for hard-to-find Indian language data

Indic Data Exchange Labs Pvt. Ltd. collects and delivers Bengali-led text and speech datasets for model teams that need cleaner sourcing and clearer licenses.

Founded 2024 · Kolkata, India · CIN · U72900WB2024PTC0XXXXX

Why this exists

Why we started with Bengali and the East

These languages are not small. They are just badly served by the usual data pipeline: too much web junk, too little licensing clarity, and too much time lost in legal review.

We started in Kolkata because that is where the field teams, partner network, and language expertise already existed. From there we expanded into Assamese, Odia, Meitei, and nearby language groups that most global datasets still ignore.

The thesis

If the source story is messy, the deal dies.

A dataset only matters if a buyer can actually use it. That means clear licenses, traceable sources, and versioned releases that legal and engineering can both inspect.

Built for model, ASR, and eval teams. Data you can use, not just data that looks good in a deck.

Languages live

11

Available now

Active pilots

4

Model teams in scoping or pilot work

Release windows shipped

9

Shipped in the last 12 months

Native-speaker reviewers

28

Across Bengali, Assamese, Odia, Meitei

Team

The people running the catalog

Collection and review are run from Kolkata, Guwahati, Bhubaneswar, and Imphal.

RB

Founder · CEO

Kolkata

Partnerships, catalog ops, deals

Builds the catalog and runs partnerships. Background in multilingual data engineering.

Full profile shared under NDA.

ML

Co-founder · CTO

Kolkata · remote

Data pipeline, release records, delivery

Owns the data pipeline, release schema, and delivery systems.

Full profile shared under NDA.

SC

Head of QA · Linguistics

Kolkata

QA, reviewer operations, dialect labels

Leads Bengali, Assamese, and Meitei review work and manages the native-speaker reviewer pool.

Full profile shared under NDA.

AD

Compliance lead

Kolkata

Licensing review, buyer packages

Runs tier reviews and prepares the documents buyers use to clear data with their legal teams.

Full profile shared under NDA.

Milestones

What we've shipped so far

2024 Q3

Founded in Kolkata

Started with Bengali and Assamese: licensed Indic data, not random web dumps.

2024 Q4

First university partnerships

Field-team agreements with three regional universities for text corpora and consented speech.

2025 Q1

5-tier compliance rubric published

Published the rubric we use to decide which sources can and cannot be used for training.

2025 Q3

First release cycle for design partners

Completed the first NDA-to-delivery cycle with manifests, scorecards, and license docs.

2026 Q1

Northeast catalog expansion

Added Meitei, Khasi, and Bodo pipelines. Reviewer pool grew to 28 native speakers.

2026 Q2

9th release shipped

Bengali 26Q2 R3 included 1.1B tokens of text and 5.2k hours of speech across Tier 1-2 sources.

Integrations

Fits the rest of your training stack

We deliver into the buckets and formats teams already use.

Delivery targets

  • AWS S3 + presigned URLs
  • Google Cloud Storage
  • Azure Blob Storage
  • SFTP with key-based access

Training formats

  • HuggingFace dataset format
  • NVIDIA NeMo Curator schema
  • Mosaic / Composer-compatible JSONL
  • WebDataset shards

Supporting files

  • Manifest CSV + JSON
  • SHA-256 checksum bundles
  • Evidence-reference PDFs
  • Audit-log JSONL
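
For teams wiring a delivery into an ingest job, a check along these lines can confirm that files arrived intact before training starts. This is a minimal sketch, not our published schema: the manifest layout (a top-level "files" array with "path" and "sha256" fields) and the function names are assumptions made for illustration, so adapt them to the manifest shipped with your release.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(manifest_path: Path) -> list[str]:
    """Return manifest paths that are missing on disk or whose hash differs.

    Assumed (hypothetical) manifest layout:
        {"files": [{"path": "shard-0000.jsonl", "sha256": "<hex digest>"}]}
    Paths are resolved relative to the manifest's own directory.
    """
    root = manifest_path.parent
    entries = json.loads(manifest_path.read_text(encoding="utf-8"))["files"]
    failures = []
    for entry in entries:
        target = root / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"].lower():
            failures.append(entry["path"])
    return failures
```

Calling verify_release(Path("release/manifest.json")) returns an empty list when every listed file is present and matches its recorded digest. Any returned path is a file to re-download before use.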

For investors

Raising pre-seed

Building the licensed Indic data layer model teams actually need.

$2-3M pre-seed to fund the next 18 months.

Use of funds

  • Expand reviewer network across more Northeast languages
  • Grow Bengali speech, Sylheti, Chittagonian, and Kamrupi field collection
  • Strengthen data pipeline and delivery systems
  • Hire in speech engineering, compliance, and partnerships

Talk to the team.

Reach the team directly at partnerships@indicdata.exchange.