Step 2

Datasets

Assemble training datasets from the QA pairs auto-generated when your pages were crawled.

How it works

When you crawl a page, Kanha automatically generates question-answer pairs from its content using an LLM. These page-level QA pairs are stored and ready to use immediately.

A dataset is an assembly of these existing QA pairs into a single JSONL file for training. Assembly is fast because it makes no LLM calls: it simply collects the pairs that already exist for your selected pages. Each record in the file is a JSON object with the following fields:

{"instruction": "What pricing plans do you offer?",
"response": "We offer three plans: Starter, Pro, and Scale...",
"system_prompt": "You are a helpful assistant for Example Corp.",
"source_url": "https://example.com/pricing"}
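As a rough illustration of why assembly needs no LLM calls, the sketch below collects already-stored pairs for a set of selected pages and writes them out one JSON object per line. The function name `assemble_dataset`, the `pairs_by_page` dictionary, and the in-memory storage are assumptions made for this example; they are not Kanha's actual implementation or API.

```python
import json


def assemble_dataset(pairs_by_page, selected_urls=None):
    """Collect existing QA pairs for the selected pages into JSONL text.

    pairs_by_page: dict mapping a page URL to its list of QA-pair dicts
                   (a hypothetical stand-in for the stored page-level pairs).
    selected_urls: iterable of URLs to include, or None to include all pages.
    """
    urls = list(pairs_by_page) if selected_urls is None else list(selected_urls)
    lines = []
    for url in urls:
        for pair in pairs_by_page.get(url, []):
            # One JSON object per line is what makes the output JSONL.
            lines.append(json.dumps(pair, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")


# Sample data reusing the record shown above; the second page has no pairs.
pairs_by_page = {
    "https://example.com/pricing": [
        {
            "instruction": "What pricing plans do you offer?",
            "response": "We offer three plans: Starter, Pro, and Scale...",
            "system_prompt": "You are a helpful assistant for Example Corp.",
            "source_url": "https://example.com/pricing",
        }
    ],
    "https://example.com/about": [],
}

jsonl_text = assemble_dataset(pairs_by_page)
print(jsonl_text, end="")
```

Because the function only reads and serializes existing records, its cost grows with the number of pairs rather than with any model inference.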

Assemble a dataset

1. Go to Dashboard → Datasets and select the site you want to create a dataset from. Make sure the site is verified and has crawled pages with QA pairs.
2. Optionally, select specific pages to include. By default, all indexed pages are included. Use the page selector to exclude pages you don't want in this dataset.
3. Click Generate. The dataset is assembled quickly from existing QA pairs, with no LLM processing needed.

Important: Datasets are assembled per-site, not per-bot. One dataset covers the QA pairs from selected pages on a site. You'll choose which dataset to use when you start training a bot.

Manage QA pairs

Click on any completed dataset to see its QA pairs. You can review each question-answer pair and manage which ones are included in training.

Exclude / include pairs

Toggle individual QA pairs between included and excluded. Excluded pairs are not used when training or when creating new dataset versions. Use this to remove low-quality or irrelevant pairs.

Filter by page

See which QA pairs came from which page. Click a page in the breakdown to filter the pair list to just that page's contributions.
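The same per-page breakdown can be reproduced offline from a downloaded dataset, since each record carries a `source_url` field. The sketch below is a minimal illustration; the helper names `pairs_per_page` and `filter_by_page` are hypothetical, and the sample records other than the pricing question are made up for the example.

```python
from collections import Counter


def pairs_per_page(pairs):
    """Count QA pairs by the page they came from (the source_url field)."""
    return Counter(pair["source_url"] for pair in pairs)


def filter_by_page(pairs, url):
    """Return only the pairs contributed by the given page."""
    return [pair for pair in pairs if pair["source_url"] == url]


# Illustrative records; only the fields needed for the breakdown are shown.
pairs = [
    {"instruction": "What pricing plans do you offer?",
     "source_url": "https://example.com/pricing"},
    {"instruction": "Is there a free trial?",
     "source_url": "https://example.com/pricing"},
    {"instruction": "Where are you based?",
     "source_url": "https://example.com/about"},
]

breakdown = pairs_per_page(pairs)
pricing_only = filter_by_page(pairs, "https://example.com/pricing")

for url, count in breakdown.most_common():
    print(f"{count:3d}  {url}")
```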

Dataset versioning

After excluding pairs from a dataset, click Create New Version to snapshot the current non-excluded pairs into a new, immutable dataset. This lets you iterate on your training data while keeping a history of what each bot was trained on.

Each version links back to its parent and shows a version number (v1, v2, etc.). You can train a bot on any version and compare results.
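One way to picture versioning is as an immutable snapshot that keeps a reference to its parent. The sketch below models that idea under stated assumptions: the `DatasetVersion` class, the `create_new_version` function, and the use of short string identifiers for pairs are all hypothetical and do not describe how Kanha stores versions internally.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class DatasetVersion:
    version: int
    pairs: Tuple[str, ...]  # identifiers of the pairs in this snapshot
    parent: Optional["DatasetVersion"] = None


def create_new_version(current, excluded_ids):
    """Snapshot the non-excluded pairs of `current` into a new version."""
    kept = tuple(p for p in current.pairs if p not in excluded_ids)
    return DatasetVersion(version=current.version + 1, pairs=kept, parent=current)


v1 = DatasetVersion(version=1, pairs=("q1", "q2", "q3"))
v2 = create_new_version(v1, excluded_ids={"q2"})

print(f"v{v2.version} has {len(v2.pairs)} pairs; parent is v{v2.parent.version}")
```

Because the parent is never modified, a bot trained on the earlier version can still be traced back to exactly the pairs it saw.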

Download

Click Download on any completed dataset to get the raw JSONL file. You can inspect it to verify the quality of the QA pairs before starting a training job.
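A quick script can help with that inspection. The sketch below parses JSONL line by line and reports lines that are not valid JSON or that have missing or empty fields. The list of required fields is taken from the example record earlier on this page and is an assumption about the schema; the function name `check_jsonl` is hypothetical.

```python
import io
import json

# Field names taken from the example record; assumed, not a confirmed schema.
REQUIRED_FIELDS = ("instruction", "response", "system_prompt", "source_url")


def check_jsonl(lines):
    """Return (records, problems) for an iterable of JSONL lines."""
    records, problems = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [f for f in REQUIRED_FIELDS
                   if not str(record.get(f) or "").strip()]
        if missing:
            problems.append(f"line {lineno}: missing or empty {', '.join(missing)}")
        records.append(record)
    return records, problems


# In-memory sample: one good record, one incomplete record, one broken line.
sample = io.StringIO(
    '{"instruction": "What pricing plans do you offer?", '
    '"response": "We offer three plans: Starter, Pro, and Scale...", '
    '"system_prompt": "You are a helpful assistant for Example Corp.", '
    '"source_url": "https://example.com/pricing"}\n'
    '{"instruction": "", "response": "..."}\n'
    'not json\n'
)

records, problems = check_jsonl(sample)
for problem in problems:
    print(problem)
```

To check a real download, pass an open file instead of the in-memory sample, for example `with open("dataset.jsonl", encoding="utf-8") as f: check_jsonl(f)`, where the filename is a placeholder.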

Updating datasets

If you update your site content (recrawl pages, add new pages, remove old ones), the page-level QA pairs are regenerated automatically on recrawl. To incorporate those changes, assemble a new dataset. Each assembly creates a new dataset entry, and you choose which one to use when training.
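To see what a recrawl changed before retraining, you could compare the downloaded JSONL of an older dataset with that of a newer one. The sketch below is one possible approach, assuming a pair can be identified by its `source_url` and `instruction` together; that key, the function names, and the sample records are assumptions for illustration. Regenerated pairs may reword a question, in which case the old wording shows up as removed and the new wording as added.

```python
def pair_key(record):
    """Identify a pair by page and question text (an assumed key)."""
    return (record["source_url"], record["instruction"])


def diff_datasets(old_records, new_records):
    """Return (added, removed, changed) records between two datasets."""
    old = {pair_key(r): r for r in old_records}
    new = {pair_key(r): r for r in new_records}
    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]
    changed = [new[k] for k in new
               if k in old and new[k]["response"] != old[k]["response"]]
    return added, removed, changed


# Made-up records standing in for two parsed JSONL downloads.
old_records = [
    {"instruction": "What pricing plans do you offer?",
     "response": "We offer two plans.",
     "source_url": "https://example.com/pricing"},
    {"instruction": "Where are you based?",
     "response": "Our office is in Berlin.",
     "source_url": "https://example.com/about"},
]
new_records = [
    {"instruction": "What pricing plans do you offer?",
     "response": "We offer three plans: Starter, Pro, and Scale...",
     "source_url": "https://example.com/pricing"},
    {"instruction": "Do you offer annual billing?",
     "response": "Yes.",
     "source_url": "https://example.com/pricing"},
]

added, removed, changed = diff_datasets(old_records, new_records)
print(f"added={len(added)} removed={len(removed)} changed={len(changed)}")
```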

With your dataset ready, the next step is to train a model on it.
