Why On-Device AI Matters for Customer Support

Feb 2026 · 8 min read

Every cloud-based AI chatbot works the same way: your customer types a question, it leaves their device, travels to a server, gets processed by a model you don't control, and the response comes back. The round-trip takes 1–3 seconds. The data touches at least one third party. And you're trusting their retention policies, their security posture, their compliance promises.

This is fine until it isn't.


The data flow nobody thinks about

When a customer asks your chatbot "What's your return policy for the order I placed last Tuesday?", that question contains:

  • The fact that they're a customer
  • That they made a purchase recently
  • That they might be unhappy with it
  • The context of your return policy

With a cloud chatbot, this goes to OpenAI, Anthropic, or whatever model your chatbot vendor is using under the hood. Maybe they don't log it. Maybe they do. Maybe they use it for training. Maybe their subprocessor does. You signed a DPA that says the right things, but the data still left your customer's device and traveled through infrastructure you don't own.

With on-device inference, the question never leaves the browser. The model runs locally via WebGPU. The response is generated on the customer's hardware. There is no server in the loop. Not yours, not ours, not anyone's.

This isn't a privacy policy. It's an architecture.


What "on-device" actually means technically

When a visitor opens your Kanha-powered chat widget, here's what happens:

  1. The SDK checks for WebGPU support in the browser (Chrome 113+, Edge 113+, Safari 26+, most modern devices)
  2. It fetches the model weights from Hugging Face Hub (a one-time download, cached by the browser for subsequent visits)
  3. The inference runtime, compiled to WebAssembly, loads the weights into GPU memory via WebGPU
  4. Every message the customer types is processed locally. Tokens are generated on their GPU. The response appears in the chat

At no point does the customer's question leave their device. The model files are static assets, like images or fonts. Once downloaded, the bot works offline.
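Step 1 of the flow above is simple feature detection. The sketch below is illustrative, not the actual Kanha SDK: `NavigatorLike` and `supportsWebGpu` are hypothetical names. The underlying check, the presence of `navigator.gpu`, is how WebGPU availability is detected in real browsers.

```typescript
// Illustrative sketch of step 1: WebGPU feature detection.
// `NavigatorLike` and `supportsWebGpu` are hypothetical names, not SDK API.
interface NavigatorLike {
  gpu?: unknown; // present as `navigator.gpu` in WebGPU-capable browsers
}

function supportsWebGpu(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined && nav.gpu !== null;
}

// In the browser: supportsWebGpu(navigator)
// When it returns false, the widget can fall back or stay hidden.
```

In a real page you would follow a positive check with `navigator.gpu.requestAdapter()`, which can still return null on unsupported hardware, so production detection needs both steps.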


Why this matters beyond "privacy"

Compliance without complexity

GDPR Article 5 requires data minimization: don't collect personal data you don't need. CCPA gives consumers the right to know what data you've collected. HIPAA restricts how protected health information flows between systems.

If the data never leaves the device, you don't need to account for it in your data processing agreements. You don't need to list a new subprocessor. You don't need to update your privacy policy when your chatbot vendor changes their model provider.

For companies in healthcare, finance, legal, or education, this removes an entire category of compliance burden. Your chatbot is a static asset, not a data pipeline.

Latency

Cloud chatbots have a floor: network round-trip + model inference + response streaming. Even with fast APIs, that's 500ms–2s for the first token.

On-device inference starts generating tokens in under 200ms after the model is warm. No network hop. The conversation feels instant, more like autocomplete than waiting for a server.

Availability

Cloud chatbots go down. APIs have rate limits, outages, and degraded performance during peak hours. If your chatbot vendor's infrastructure has a bad day, your support experience degrades.

An on-device model doesn't have this problem. Once the model is cached in the browser, it works. No internet required for inference. Your support bot is as reliable as the customer's browser.

Cost at scale

This is the big one. Every cloud chatbot charges per query, per token, per resolution, or per "credit." The unit economics work at low volume. At scale, they break.

On-device inference has zero marginal cost per query. The customer's hardware does the work. Whether you serve 100 conversations or 100,000, your Kanha bill is the same flat monthly rate based on how many pages you've indexed and how often you retrain.
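The unit economics can be sketched in a few lines. The prices here are assumptions chosen purely for illustration ($0.05 per cloud conversation, a flat $49/month), not real Kanha or vendor pricing:

```typescript
// Back-of-the-envelope cost comparison. Both prices are hypothetical.
const PER_CONVERSATION_USD = 0.05; // assumed cloud per-query price
const FLAT_MONTHLY_USD = 49;       // assumed flat on-device plan

function cloudMonthlyCost(conversations: number): number {
  return conversations * PER_CONVERSATION_USD; // scales linearly with traffic
}

function onDeviceMonthlyCost(_conversations: number): number {
  return FLAT_MONTHLY_USD; // zero marginal cost per query
}
```

Under these assumed prices, 100 conversations cost $5 on the cloud model; 100,000 cost $5,000, while the flat rate stays at $49 either way. The crossover point is what "break at scale" means in practice.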


The tradeoffs (being honest)

On-device isn't a silver bullet. There are real limitations:

Model size. On-device models are small (0.6B to 4B parameters). They can't match GPT-4's general reasoning ability. But they don't need to. A fine-tuned 0.6B model that has learned your product catalog answers product questions better than a 100B model working from a generic prompt. Domain specificity beats raw scale for focused use cases.

First load. The model download is 200MB–2GB depending on size. On a fast connection, that's 2–10 seconds. On slow mobile data, it's longer. The SDK shows a loading indicator, and subsequent visits use the cached model. But the first visit has a cold start that cloud chatbots don't.

Device requirements. WebGPU requires a relatively modern browser and GPU. This covers about 80% of desktop and 60% of mobile traffic today, and the numbers improve every month. For visitors without WebGPU support, you can fall back to a cloud model or simply not show the widget.

No server-side analytics (by default). If you want to know what your customers are asking (query trends, common questions, knowledge gaps), on-device inference means you don't see that data by default. We're building an opt-in server-side mode for businesses that want these insights. But the default is private.

These are real tradeoffs. But for most support chatbot use cases (product questions, FAQs, shipping policies, documentation), the on-device approach wins on every axis that matters: cost, privacy, latency, and reliability. The tradeoffs apply to edge cases that most businesses won't hit.


Who should care

Any business with European customers. GDPR enforcement is increasing. On-device inference is the simplest path to a chatbot that doesn't require a new entry in your Record of Processing Activities.

Healthcare and fintech companies. If your chatbot might encounter PHI or financial data in customer questions, keeping that data on-device eliminates an entire risk surface.

High-traffic sites. If you're serving 50,000+ conversations a month, per-query pricing is a scaling tax. On-device inference makes that line item disappear.

Privacy-conscious brands. If your brand promise includes respecting customer data, your chatbot should match. "We use AI support that never sees your data" is a real differentiator.


The direction this is heading

WebGPU support is expanding. Browser vendors are shipping better GPU access. Model compression techniques are improving, producing smaller models with better quality. Apple, Google, and Microsoft are all investing in on-device AI capabilities.

Today, on-device inference is an advantage. In two years, it'll be expected. The companies that build this capability now will have a head start on the companies that realize later they need to unwind their dependency on cloud inference.

Kanha is built for this future. Your bot, your customers' devices, zero data in transit.


Try it at kanha.ai. Free tier, no credit card, on-device by default.
