Case study · AI · Privacy-first data cleaning

DataScrub.

A browser-only data cleaning utility that pairs deterministic rule-based automation with optional LLM intelligence. The whole app ships as static files, with no backend and no server, and in Local mode no data ever leaves your machine.

  • 9 color themes
  • 4 LLM providers
  • IQR outlier detection
  • Undo / Redo with a full history stack

/ Stack
JavaScript ES6+, PapaParse, SheetJS, Ollama, Groq, OpenAI, OpenRouter, static-page deploy

01 / Two modes.

DataScrub doesn't force a privacy/intelligence trade-off — you pick which side of the fence each dataset sits on. Sensitive data stays local; non-sensitive data gets frontier-model reasoning.

Local mode (Ollama)

  • Runs against a locally-installed Ollama instance.
  • No data leaves the machine: once the model has been pulled and Ollama is running, the internet connection can be off.
  • Recommended models: qwen2.5:7b, llama3.1:8b, mistral, gemma2.
  • Best for healthcare, finance, HR, anything regulated.
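
As a rough sketch of what a Local mode call can look like, the snippet below posts a prompt to Ollama's /api/chat endpoint on its default port. The function names and prompt handling are illustrative assumptions, not DataScrub's actual code.

```javascript
// Sketch only: function names are hypothetical, not taken from DataScrub.
const OLLAMA_URL = "http://localhost:11434"; // Ollama's default local address

// Build the fetch() arguments for a non-streaming call to Ollama's /api/chat.
function buildOllamaChatRequest(model, userPrompt, baseUrl = OLLAMA_URL) {
  return {
    url: `${baseUrl}/api/chat`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        stream: false, // ask for one JSON object instead of a stream of chunks
        messages: [{ role: "user", content: userPrompt }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
async function askLocalModel(model, userPrompt) {
  const { url, init } = buildOllamaChatRequest(model, userPrompt);
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`Ollama returned HTTP ${response.status}`);
  const data = await response.json();
  return data.message.content;
}
```

Because the only URL involved is localhost, the request never touches an external host. Calling `askLocalModel("qwen2.5:7b", "Summarise this column profile: ...")` would resolve to the model's reply as a string.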

Cloud mode

  • Connects to Groq (low latency), OpenAI, or OpenRouter.
  • Only snippets transit the wire — column schemas, summaries, specific queries — never the full dataset.
  • API keys live in browser localStorage and can be cleared at any time.
  • Best for general analytical work where regulatory constraints don't apply.
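
To make the "only snippets" idea concrete, here is a minimal sketch of a column summary that contains names, inferred types, and missingness rates but no cell values. The payload shape, the helper names, and the list of null-like tokens are assumptions for illustration; the real snippet format is not documented here.

```javascript
// Hypothetical helpers: summarise columns without including any cell values.
const NULL_TOKENS = new Set(["", "null", "n/a", "none"]); // assumed token list

function isMissing(value) {
  return value === null || value === undefined ||
    NULL_TOKENS.has(String(value).trim().toLowerCase());
}

// rows: array of objects that share the same keys, e.g. parsed CSV rows.
function buildSchemaSnippet(rows) {
  const columns = rows.length > 0 ? Object.keys(rows[0]) : [];
  return {
    rowCount: rows.length,
    columns: columns.map((name) => {
      const present = rows.map((row) => row[name]).filter((v) => !isMissing(v));
      const numeric = present.length > 0 &&
        present.every((v) => Number.isFinite(Number(v)));
      return {
        name,
        type: numeric ? "numeric" : "text",
        missingRate: rows.length
          ? (rows.length - present.length) / rows.length
          : 0,
      };
    }),
  };
}

const snippet = buildSchemaSnippet([
  { name: "Ada", age: "36" },
  { name: "Grace", age: "N/A" },
]);
```

Here `snippet` describes two columns (`name` as text, `age` as numeric with a missing rate of 0.5), and serialising it produces a payload in which neither "Ada" nor "Grace" appears.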

02 / Privacy by design.

Three architectural decisions remove entire categories of risk from a typical "AI for data" tool.

No DataScrub servers

The project has no backend. The whole application is a set of static files (HTML/CSS/JS) — open them in any modern browser. There's nothing to log, nothing to breach.

No telemetry

No analytics, no error reporting, no usage pings. The team has zero visibility into who runs it or what data they clean.

API keys stay local

When Cloud mode is enabled, the user's OpenAI/Groq/OpenRouter key lives in localStorage only. Requests originate from the user's browser directly to the provider — DataScrub is not a man-in-the-middle.
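
A minimal sketch of that flow, assuming a hypothetical storage key name and helper functions: the key is read from localStorage and attached to a request that goes straight to the provider's OpenAI-compatible chat endpoint. The `storage` parameter is injectable only so the sketch can run outside a browser.

```javascript
// Public OpenAI-compatible chat endpoints for the three cloud providers.
const PROVIDER_ENDPOINTS = {
  openai: "https://api.openai.com/v1/chat/completions",
  groq: "https://api.groq.com/openai/v1/chat/completions",
  openrouter: "https://openrouter.ai/api/v1/chat/completions",
};

// Hypothetical key naming scheme, not DataScrub's actual one.
const keyName = (provider) => `datascrub.apiKey.${provider}`;

function saveApiKey(provider, apiKey, storage = globalThis.localStorage) {
  storage.setItem(keyName(provider), apiKey);
}

function clearApiKey(provider, storage = globalThis.localStorage) {
  storage.removeItem(keyName(provider));
}

// Build fetch() arguments addressed directly to the provider.
function buildCloudRequest(provider, model, prompt, storage = globalThis.localStorage) {
  const url = PROVIDER_ENDPOINTS[provider];
  if (!url) throw new Error(`Unknown provider: ${provider}`);
  const apiKey = storage.getItem(keyName(provider));
  if (!apiKey) throw new Error(`No API key stored for ${provider}`);
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}
```

The only hosts in this sketch are the providers themselves, which is the point of the design: there is no intermediate server that could see the key or the prompt.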

03 / What the agent does.

Rule-based automation handles the boring 80% (whitespace, NULL/N/A/None standardisation, exact duplicates, IQR outliers). The AI layer handles the judgment calls.
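
The rule-based pass can be sketched in a few lines. This is an illustration under stated assumptions rather than DataScrub's implementation: the null-like token list and function names are hypothetical, and duplicates are compared after normalisation.

```javascript
// Tokens treated as missing (compared case-insensitively); the list is an assumption.
const NULL_LIKE = new Set(["", "null", "n/a", "none"]);

// Trim strings and map null-like tokens to a real null.
function normaliseCell(value) {
  if (value === null || value === undefined) return null;
  if (typeof value !== "string") return value;
  const trimmed = value.trim();
  return NULL_LIKE.has(trimmed.toLowerCase()) ? null : trimmed;
}

// Normalise every cell, then drop rows that exactly repeat an earlier row.
function cleanRows(rows) {
  const seen = new Set();
  const cleaned = [];
  for (const row of rows) {
    const normalised = Object.fromEntries(
      Object.entries(row).map(([column, value]) => [column, normaliseCell(value)])
    );
    // Rows from one parsed file share a column order, so the JSON text
    // works as a fingerprint for exact-duplicate detection.
    const fingerprint = JSON.stringify(normalised);
    if (seen.has(fingerprint)) continue;
    seen.add(fingerprint);
    cleaned.push(normalised);
  }
  return cleaned;
}

const cleaned = cleanRows([
  { name: " Ada ", city: "N/A" },
  { name: "Ada", city: "none" },
  { name: "Grace", city: "Leeds" },
]);
```

The first two input rows collapse into one (`{ name: "Ada", city: null }`) because they are identical once whitespace and null tokens are standardised, leaving two rows in `cleaned`.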

  1. Profile

    The agent inspects each column — type inference, distribution, missingness rate, suspected role (key / numeric / categorical / freeform). Output is a one-page summary, not a table dump.

  2. Suggest imputations

    Per-column: mean / median / mode / "leave as null" / "drop the row" — with statistical justification per choice. Users can supply domain context ("this is healthcare data, prefer median for age") so suggestions align with the actual problem.

  3. Flag outliers

    IQR-based detection with cap or remove options. The agent annotates each flagged row with a reason rather than a binary outlier tag.

  4. Natural-language queries

    "How many rows have invalid postcodes?" "Show me the 10 worst missingness columns." The agent translates these into operations on the dataset and returns the result inline.

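For the outlier step above, a minimal sketch of IQR fences with cap or remove handling and a per-value reason might look like the following. It assumes a non-empty array of numbers, the common 1.5 × IQR multiplier, and linear-interpolation quartiles; the source does not say which quartile convention or multiplier DataScrub uses, and the function names are hypothetical.

```javascript
// Linear-interpolation quantile on an ascending-sorted array (one common convention).
function quantile(sorted, q) {
  const position = (sorted.length - 1) * q;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}

// Compute the lower and upper fences: Q1 - k*IQR and Q3 + k*IQR.
function iqrFences(values, k = 1.5) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return { low: q1 - k * iqr, high: q3 + k * iqr };
}

// mode "cap" clamps outliers to the nearest fence; mode "remove" drops them.
function handleOutliers(values, mode = "cap") {
  const { low, high } = iqrFences(values);
  const flagged = [];
  const result = [];
  values.forEach((value, index) => {
    if (value >= low && value <= high) {
      result.push(value);
      return;
    }
    const isHigh = value > high;
    const fence = isHigh ? high : low;
    flagged.push({
      index,
      value,
      reason: `${value} is ${isHigh ? "above the upper" : "below the lower"} fence ${fence}`,
    });
    if (mode === "cap") result.push(fence);
  });
  return { result, flagged, low, high };
}

const capped = handleOutliers([10, 12, 11, 13, 12, 100], "cap");
```

For this input the fences come out at 9 and 15, so only the value 100 (index 5) is flagged; in cap mode it becomes 15, and in remove mode the result has five values.
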
04 / Local setup.

Local mode needs Ollama running on the same machine. The connection is browser → localhost:11434, which means a one-time CORS configuration step.

# 1. Install Ollama from https://ollama.com

# 2. Pull a model
ollama pull qwen2.5:7b

# 3. Start the server with CORS open for the DataScrub page
#    ("*" allows requests from any origin; OLLAMA_ORIGINS also accepts specific origins)
OLLAMA_ORIGINS="*" ollama serve

# 4. In DataScrub settings, select "Ollama" and point at
#    http://localhost:11434

# Done — no data leaves your machine from this point on.
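
To confirm the setup from the browser side, one option is to query Ollama's /api/tags endpoint, which lists the locally pulled models. The sketch below is an assumption about how such a check could be written, with hypothetical function names; it is not DataScrub's own connection test.

```javascript
// Extract model names from the JSON returned by GET /api/tags.
function installedModels(tagsResponse) {
  return (tagsResponse.models ?? []).map((m) => m.name);
}

// Ollama reports names with a tag, so "mistral" is listed as "mistral:latest".
function hasModel(names, model) {
  return names.includes(model) || names.includes(`${model}:latest`);
}

// Resolve to { ok: true } or { ok: false, reason } without throwing.
async function checkOllama(model, baseUrl = "http://localhost:11434") {
  let response;
  try {
    response = await fetch(`${baseUrl}/api/tags`);
  } catch (err) {
    // fetch rejects the same way for a stopped server and a CORS block.
    return { ok: false, reason: "Cannot reach Ollama (not running, or origin blocked by CORS)" };
  }
  if (!response.ok) return { ok: false, reason: `HTTP ${response.status}` };
  const names = installedModels(await response.json());
  return hasModel(names, model)
    ? { ok: true }
    : { ok: false, reason: `Model ${model} not found; run: ollama pull ${model}` };
}
```

A failed fetch here usually means either step 3 was skipped or the server is not running, which are the two things worth checking first.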

Try it.

MIT licensed. Open the index.html locally, or fork and host on any static-page provider (GitHub Pages, Cloudflare Pages, Netlify).