DataScrub
AI-powered data cleaning agent
DataScrub is a client-side data cleaning utility that pairs deterministic rule-based automation with optional LLM intelligence. The whole application runs in the browser from self-contained static files: there is no backend or server, and in Local Mode no data ever leaves your machine. Users can upload raw datasets, automatically fix common inconsistencies (stray whitespace, missing-value indicators, duplicate rows), and engage an AI agent for advanced profiling, imputation suggestions, and outlier handling. Multi-provider support (Ollama for local, privacy-first inference; Groq for low-latency cloud inference; OpenAI/OpenRouter for frontier reasoning) lets it adapt to whatever environment a data scientist works in.
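The rule-based pass described above can be sketched roughly as follows. This is an illustration only, not DataScrub's actual implementation: the `Row` type, the `MISSING_MARKERS` list, and the function names are assumptions made for the example.

```typescript
// Hypothetical types and names; DataScrub's real rules may differ.
type Row = Record<string, string | null>;

// Assumed set of strings treated as missing-value indicators (compared in lowercase).
const MISSING_MARKERS = new Set(["", "na", "n/a", "null", "none", "nan", "-", "?"]);

// Trim a cell, collapse internal whitespace runs, and map missing markers to null.
function normalizeCell(value: string | null): string | null {
  if (value === null) return null;
  const trimmed = value.trim().replace(/\s+/g, " ");
  return MISSING_MARKERS.has(trimmed.toLowerCase()) ? null : trimmed;
}

// Normalize every cell, then drop rows that are exact duplicates after normalization.
function cleanRows(rows: Row[]): Row[] {
  const seen = new Set<string>();
  const cleaned: Row[] = [];
  for (const row of rows) {
    const normalized: Row = {};
    for (const key of Object.keys(row)) {
      normalized[key] = normalizeCell(row[key] ?? null);
    }
    // Sort the keys so property order does not affect duplicate detection.
    const signature = JSON.stringify(
      Object.keys(normalized).sort().map((k) => [k, normalized[k] ?? null])
    );
    if (!seen.has(signature)) {
      seen.add(signature);
      cleaned.push(normalized);
    }
  }
  return cleaned;
}
```

Because every step is a pure function of the input rows, the same file always produces the same cleaned output, which is what makes this layer deterministic and independent of any LLM.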
Key Features
Privacy-first by design
DataScrub is architected as a pure client-side application. In Local Mode (Ollama), data processing happens entirely on the user's device and no data crosses the network. In Cloud Mode (third-party APIs), only the snippets needed for a request (schema, summaries, or specific queries) are transmitted to the chosen provider for processing. API keys are stored in the browser's localStorage for convenience and can be cleared at any time. The application has no telemetry and no server backend.
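To make the Cloud Mode idea concrete, here is a minimal sketch of how a compact per-column summary could be built in the browser and sent in place of raw rows. The `ColumnSummary` shape and the `summarizeColumns` function are hypothetical; the payload DataScrub actually transmits is not specified here.

```typescript
// Hypothetical summary shape; not DataScrub's actual Cloud Mode payload.
type DataRow = Record<string, string | null>;

interface ColumnSummary {
  name: string;
  inferredType: "number" | "string" | "empty";
  missingCount: number;
  distinctCount: number;
}

// Build one summary per column, taking column names from the first row.
function summarizeColumns(rows: DataRow[]): ColumnSummary[] {
  const names = Object.keys(rows[0] ?? {});
  return names.map((name): ColumnSummary => {
    // Collect the non-missing values of this column.
    const present: string[] = [];
    for (const row of rows) {
      const value = row[name];
      if (value !== null && value !== undefined && value !== "") present.push(value);
    }
    const allNumeric = present.every((v) => Number.isFinite(Number(v)));
    const inferredType: ColumnSummary["inferredType"] =
      present.length === 0 ? "empty" : allNumeric ? "number" : "string";
    return {
      name,
      inferredType,
      missingCount: rows.length - present.length,
      distinctCount: new Set(present).size,
    };
  });
}
```

A summary like this carries column names, types, and counts but no cell values, so a provider could suggest cleaning strategies without receiving the underlying records.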
Local inference with Ollama
For sensitive data, DataScrub recommends running a local LLM via Ollama. After installing Ollama and pulling a model (qwen2.5:7b, llama3.1:8b, mistral, or gemma2), users allow browser requests through Ollama's CORS setting and point DataScrub at the local Ollama endpoint. From then on, no internet connection is required and no data leaves the machine.
- Install Ollama from the official site
- Run: ollama pull qwen2.5:7b
- Configure CORS: OLLAMA_ORIGINS="*" ollama serve (the wildcard accepts requests from any origin; a specific origin can be listed instead)
- Select Ollama in DataScrub settings
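Once the steps above are done, a browser page can talk to Ollama over plain HTTP. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/chat` endpoint with streaming turned off. The helper names `buildChatBody` and `askOllama` are hypothetical and are not taken from DataScrub's source.

```typescript
// Assumption: Ollama is listening on its default local address.
const OLLAMA_URL = "http://localhost:11434";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface OllamaChatBody {
  model: string;
  messages: ChatMessage[];
  stream: boolean;
}

// Build the JSON body for a single non-streaming chat request.
// The optional context becomes a system message placed before the question.
function buildChatBody(model: string, question: string, context?: string): OllamaChatBody {
  const messages: ChatMessage[] = [];
  if (context) messages.push({ role: "system", content: context });
  messages.push({ role: "user", content: question });
  return { model, messages, stream: false };
}

// Send the request to the local Ollama server and return the reply text.
async function askOllama(model: string, question: string, context?: string): Promise<string> {
  const response = await fetch(`${OLLAMA_URL}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatBody(model, question, context)),
  });
  if (!response.ok) throw new Error(`Ollama returned HTTP ${response.status}`);
  const data = (await response.json()) as { message?: { content?: string } };
  return data.message?.content ?? "";
}
```

A call such as `askOllama("qwen2.5:7b", "Which columns look categorical?")` would then stay entirely on the local machine. If the CORS setting from the third step is missing, the browser blocks the request before it reaches Ollama.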
AI agent capabilities
Beyond rule-based automation, the AI agent layer offers deeper analytical assistance. It can profile a dataset's structure, suggest column-by-column imputation strategies based on data type and distribution, flag outliers with statistical justification, and respond to natural language queries about the data. Users can supply project-specific context (e.g., "this is healthcare data, prefer median imputation for age") so the agent's recommendations align with domain rules.
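As one example of what "statistical justification" for an outlier could look like, the sketch below applies the common interquartile-range (IQR) rule. This is an assumed technique chosen for illustration; the text does not say which method the agent uses, and `flagOutliersIqr` is a hypothetical name.

```typescript
interface OutlierReport {
  lower: number;      // values below this bound are flagged
  upper: number;      // values above this bound are flagged
  outliers: number[]; // flagged values, in their original order
}

// Quantile by linear interpolation on an already sorted array (one common definition).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  const loVal = sorted[lo] ?? NaN;
  const hiVal = sorted[hi] ?? NaN;
  return loVal + (hiVal - loVal) * (pos - lo);
}

// Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the conventional default.
function flagOutliersIqr(values: number[], k = 1.5): OutlierReport {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - k * iqr;
  const upper = q3 + k * iqr;
  return { lower, upper, outliers: values.filter((v) => v < lower || v > upper) };
}
```

Returning the bounds along with the flagged values lets the interface explain why a value was flagged (for instance, "above the upper fence of 15") instead of only marking it.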
Self-contained deployment
DataScrub ships as a self-contained set of static files. There is no build step, no server, and no database. Users can run it by opening index.html directly, or fork the repo and host it on any static-page provider (GitHub Pages, Cloudflare Pages, Netlify, etc.). The whole project is MIT-licensed for personal or commercial use.
