Step 1

Sites & Pages

Add your website, crawl its pages, and build the content index that powers your bot.

Add a site

Navigate to Dashboard → Sites and click Add Site. Enter a name and your website's domain (e.g. docs.example.com).

Once created, the site appears in your list. You can add multiple sites — each site groups its own set of indexed pages.

Crawl via sitemap

Crawling a sitemap is the fastest way to index your content. Click Crawl Sitemap on a site and provide your sitemap URL (e.g. https://example.com/sitemap.xml).

Kanha parses both <urlset> sitemaps (direct page URLs) and <sitemapindex> files (nested sitemaps), recursing up to 3 levels deep. All discovered URLs are queued for crawling automatically.
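The crawler itself is internal, but the recursion described above can be sketched roughly as follows. This is a minimal illustration, not Kanha's actual implementation; the names (`collect_urls`, the `fetch` callback) are assumptions for the example.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def collect_urls(sitemap_xml, fetch, depth=1, max_depth=3):
    """Gather page URLs from a sitemap document.

    A <urlset> contributes its page URLs directly; a <sitemapindex>
    is followed recursively via fetch(url) -> xml, up to max_depth
    levels deep (3, matching the behavior described above).
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    if root.tag == NS + "urlset":
        # Leaf sitemap: direct page URLs.
        urls.extend(loc.text.strip() for loc in root.iter(NS + "loc"))
    elif root.tag == NS + "sitemapindex" and depth < max_depth:
        # Index file: recurse into each nested sitemap.
        for loc in root.iter(NS + "loc"):
            urls.extend(collect_urls(fetch(loc.text.strip()),
                                     fetch, depth + 1, max_depth))
    return urls
```

In practice the URLs returned here would be queued for crawling; the depth cap keeps a malformed or self-referencing sitemap index from recursing forever.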

Auto QA generation: When a page is crawled, Kanha automatically generates question-answer pairs from its content. This means your training data is ready as soon as crawling completes — no extra step needed.

Tip: Most CMS platforms (WordPress, Shopify, Ghost, etc.) generate a sitemap at /sitemap.xml automatically.

Add individual pages

Don't have a sitemap? Click Add Page and paste a URL. The page is queued for crawling and will appear in your page list once indexed.

This is useful for adding specific pages that aren't in your sitemap, or for testing the crawler on a single page before doing a full sitemap crawl.

JavaScript-rendered pages

Kanha auto-detects pages that require JavaScript rendering. If the static HTML yields fewer than 500 characters of content (or the text-to-HTML ratio falls below 2%), the crawler automatically retries with a headless browser.
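The detection heuristic can be approximated like this. The thresholds mirror the numbers above (500 characters, 2% ratio); the regex-based text extraction and the function name are simplifications for illustration, not Kanha's actual code.

```python
import re

def needs_js_rendering(html, min_chars=500, min_ratio=0.02):
    """Guess whether a page needs a headless browser.

    Strips scripts, styles, and tags to approximate the visible text,
    then flags the page if the text is short or is a tiny fraction of
    the raw HTML (typical of SPA shells like an empty <div id="root">).
    """
    # Drop script/style blocks entirely, then remove remaining tags.
    stripped = re.sub(r"(?s)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = " ".join(text.split())
    ratio = len(text) / max(len(html), 1)
    return len(text) < min_chars or ratio < min_ratio
```

An SPA shell with an empty root element trips the check, while an ordinary article page with real body text passes it.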

Pages rendered via JS are marked with a JS badge in your page list. You can also manually toggle a page's render mode if auto-detection doesn't catch it.

Note: JS rendering uses more resources and takes longer. It's only needed for SPAs (React, Vue, Angular) or pages that load content dynamically. Most static sites and blogs don't need it.

Preview & manage pages

Click any page in your list to preview the extracted content. This shows you exactly what the bot will learn from — the cleaned text content, not the raw HTML.

You can also download the content as a text file, or delete pages you don't want included in training.

Recrawling

Content changes? Recrawl a page to update the index. Both new crawls and recrawls count against your page scrape quota (Free: 50 total, Starter: 500/mo, Pro: 2,500/mo, Business: 10,000/mo). On paid plans, overages are billed at $0.10/extra page.
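As a sketch of how the overage math works on paid plans (the function name is hypothetical; only pages beyond the monthly quota are billed):

```python
def monthly_crawl_cost(pages_crawled, plan_quota, overage_rate=0.10):
    """Overage cost in dollars: pages beyond the plan quota
    are billed at $0.10 each on paid plans."""
    overage = max(pages_crawled - plan_quota, 0)
    return overage * overage_rate
```

For example, crawling 600 pages on a Starter plan (500/mo) leaves 100 overage pages, or $10 in extra charges.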

When a page is recrawled, its QA pairs are automatically regenerated from the new content.

Deleting pages & sites

Delete individual pages from the page list. Deleting a page removes it from the index — it won't be included in future dataset generation or training.

You can also delete an entire site from the sites list. This removes all of its indexed pages as well.

Once you have pages indexed, the next step is to generate a training dataset from them.

Next: Datasets