KatCore — Docs

01 · Start here

Getting started

KatCore is a browser-based workspace. There's nothing to install and no API to wire up — sign up, upload a file, and start asking questions.

1. Sign up for a free account. The Free tier never expires and needs no credit card.

2. Create a collection, a typed container for related files (e.g. Sales or Support).

3. Upload your first file. KatCore parses it, labels every column, and runs an automatic audit in the background.

02 · Bring data in

Ingesting data

Drag and drop up to 50 files per batch, with a progress indicator for each file. Each file can be up to 100 MB and is normalized, cleaned, and stored automatically.

Supported formats

CSV, JSON, XLSX, PDF, TXT, DOCX, MD, Parquet, HTML

For multi-sheet XLSX files, you pick which sheet to ingest. Documents (PDF, DOCX, MD, TXT) are chunked and embedded for retrieval.

Four ways to bring data in

File

Drag-drop or browse. Batch up to 50 files at once.

URL

Pull from a public or authenticated URL.

API

REST GET/POST with Bearer, API-key, or Basic auth and JSON record-path extraction.

Database

PostgreSQL, MySQL, SQL Server. MongoDB coming soon.
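To make JSON record-path extraction concrete, here is a minimal sketch that walks a path into a parsed API response and returns the records found there. The function name `extract_records`, the dot-separated path syntax, and the sample payload are all assumptions made for illustration; the path syntax KatCore actually accepts may differ.

```python
import json
from typing import Any


def extract_records(payload: Any, record_path: str) -> list:
    """Follow a dot-separated path (hypothetical syntax) to a list of records."""
    node = payload
    for key in filter(None, record_path.split(".")):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"path segment {key!r} not found")
        node = node[key]
    # A single object at the end of the path is treated as one record.
    if isinstance(node, dict):
        return [node]
    if not isinstance(node, list):
        raise TypeError("record path must point at an object or a list")
    return node


# Sample response body, invented for this example.
body = json.loads('{"data": {"items": [{"id": 1}, {"id": 2}]}, "next": null}')
records = extract_records(body, "data.items")
```

Here `records` holds the two objects under `data.items`, and the pagination field `next` is ignored.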

Every file is versioned. Re-ingesting the same source creates a new version with full lineage — your history is never overwritten.

03 · The centerpiece

Understanding your data

On ingest, KatCore labels every column (id, email, date, currency, region, PII…) with a deterministic 5-step pipeline, then writes a natural-language description of each column in the background. These descriptions ground both the audit and the chat.

The Readiness Audit

The audit produces a fully-explainable 0–100 AI-Readiness Score with a letter grade. It's signal-based, not a vibe — and every point is traceable to a specific issue.

A ≥ 90 · Excellent
B ≥ 80 · Good
C ≥ 70 · Needs Work
D ≥ 60 · Needs Work
F < 60 · Not Ready
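The grade bands above can be read as a simple threshold lookup. The sketch below is illustrative only; the function name `letter_grade` is an assumption, and the thresholds are copied from the scale above.

```python
def letter_grade(score: float) -> tuple:
    """Map a 0-100 score to (letter, label) using the bands listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    bands = [
        (90, "A", "Excellent"),
        (80, "B", "Good"),
        (70, "C", "Needs Work"),
        (60, "D", "Needs Work"),
    ]
    # Bands are checked from highest to lowest; the first floor reached wins.
    for floor, letter, label in bands:
        if score >= floor:
            return letter, label
    return "F", "Not Ready"
```

Each band is inclusive at its lower bound, so a score of exactly 90 is an A and 89.9 is a B.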

The six dimensions

Dimension	Weight	What it measures
Completeness	25%	How much data is actually present — missing (null) and blank/whitespace values.
Validity	25%	Whether values are well-formed and plausible — IQR outliers, malformed emails/phones, unparseable dates, business-rule violations.
Uniqueness	15%	Freedom from unintended duplicates — repeated identifiers and exact duplicate rows.
PII Exposure	15%	Sensitive personal data left in the clear — unmasked emails, phones, names, addresses.
Consistency	10%	One canonical convention — variant spellings ("USA" vs "U.S.A.") and non-standard column names.
Semantic Completeness	10%	Every column documented so AI and analysts understand it.
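One way to picture how the weights combine is a weighted average of per-dimension subscores. The sketch below assumes each dimension produces its own 0-100 subscore and that the overall score is their weighted sum; that combination rule, the dictionary keys, and the example subscores are assumptions for illustration. Only the weights come from the table above.

```python
# Weights in percent, taken from the table above.
WEIGHTS = {
    "completeness": 25,
    "validity": 25,
    "uniqueness": 15,
    "pii_exposure": 15,
    "consistency": 10,
    "semantic_completeness": 10,
}


def readiness_score(subscores: dict) -> float:
    """Weighted average of 0-100 subscores (assumed combination rule)."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[d] * subscores[d] for d in WEIGHTS) / 100


# Invented subscores for a single file.
example = {
    "completeness": 92,
    "validity": 80,
    "uniqueness": 100,
    "pii_exposure": 40,
    "consistency": 90,
    "semantic_completeness": 70,
}
score = readiness_score(example)
```

With these invented subscores the weak PII Exposure dimension costs 9 of the 20 missing points, which shows why a 15% dimension can still dominate the fix-list.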

The fix-list

Each audit returns a severity-ranked fix-list (critical / high / medium / low) where every fix shows the exact points it will recover — and those points provably sum to 100 − your score. The checklist literally is your score, decomposed. Each fix carries sample evidence: the actual offending values and 0-based row indices (up to 20), IQR fences for outliers, and duplicate groups.
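The outlier evidence described above (fences plus offending rows) can be sketched as follows. The 1.5 × IQR multiplier is the conventional Tukey rule and the quartile method is Python's "inclusive" interpolation; both are assumptions here, since the docs do not state which variants KatCore uses. The function names and the sample column are hypothetical.

```python
import statistics


def iqr_fences(values: list, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_evidence(values: list, max_samples: int = 20) -> dict:
    """Collect fences plus up to max_samples 0-based row indices and values."""
    low, high = iqr_fences(values)
    rows = [i for i, v in enumerate(values) if v < low or v > high]
    rows = rows[:max_samples]
    return {
        "fences": (low, high),
        "row_indices": rows,
        "sample_values": [values[i] for i in rows],
    }


# Invented column of order amounts with one obvious outlier.
amounts = [10, 12, 11, 13, 12, 11, 500, 12]
evidence = outlier_evidence(amounts)
```

For this column the fences come out at 9.125 and 14.125, so only the value 500 at row index 6 is reported.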

One-click cleaning

Each issue maps to a suggested action — entity resolution, schema standardization, PII masking, imputation, anomaly quarantine, smart date parsing, or fix all. Preview the before/after rows and impact counts, then apply to produce a cleaned new version of the file.

Unmasked PII is always scored critical. The UI shows a projected score after fixes so you know the payoff before you commit.
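The relationship between the fix-list and the projected score can be sketched with a few lines of arithmetic. The `Fix` class, the tie-breaking rule inside a severity level, and the sample fixes are assumptions for illustration; the invariant that fix points sum to 100 minus the score comes from the description of the fix-list.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Fix:
    issue: str
    severity: str
    points: float  # points recovered when this fix is applied


def ranked(fixes: list) -> list:
    """Sort by severity, then by points recovered (assumed tie-break)."""
    return sorted(fixes, key=lambda f: (SEVERITY_ORDER[f.severity], -f.points))


def projected_score(score: float, selected: list) -> float:
    """Score expected after applying the selected fixes, capped at 100."""
    return min(100.0, score + sum(f.points for f in selected))


# Invented audit result.
score = 80.0
fixes = [
    Fix("duplicate customer ids", "high", 4.0),
    Fix("unmasked email column", "critical", 9.0),
    Fix("blank values in region", "medium", 5.0),
    Fix("non-standard column names", "low", 2.0),
]

# The fix-list decomposes the gap to a perfect score.
gap_is_explained = sum(f.points for f in fixes) == 100 - score

top_two = ranked(fixes)[:2]
after_top_two = projected_score(score, top_two)
```

Applying only the critical and high fixes in this invented example would lift the score from 80 to 93; applying all four reaches 100.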

04 · Chat

Asking questions (Kat)

Ask a question in natural language. An intent classifier routes it, semantic search finds the right files (grounded by the auto-descriptions), and a plan → execute → synthesize loop over DuckDB returns a written answer with the numbers and the source file cited. No SQL required.

# You ask
"What was the growth trend of SaaS subscriptions in Q3 vs Q2?"

# Kat answers
SaaS subscriptions grew 14.2% in Q3, driven by the
Enterprise tier (+22% seats). Source: sales_report.csv

Phrase questions the way you'd ask a colleague. Every answer cites the file it came from, so you can trace the number back to its source.

05 · Keep it fresh

Schedules

Set up cron-driven recurring ingestion from a URL or API. Schedules are timezone-aware (IANA), can be paused and resumed, track failures, and auto-disable after repeated failures.
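The pause, failure-tracking, and auto-disable behavior can be pictured as a small state machine. In the sketch below the `Schedule` class, its field names, the threshold of three, and the choice to count consecutive failures are all assumptions; the docs do not state how many failures trigger auto-disable.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # assumed threshold, not stated in the docs


@dataclass
class Schedule:
    cron: str
    timezone: str  # IANA name, e.g. "UTC"
    paused: bool = False
    enabled: bool = True
    consecutive_failures: int = 0

    def record_run(self, succeeded: bool) -> None:
        """Track the outcome of a run and disable after repeated failures."""
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.enabled = False

    def should_run(self) -> bool:
        """A schedule fires only when it is enabled and not paused."""
        return self.enabled and not self.paused


schedule = Schedule(cron="0 6 * * *", timezone="UTC")
for outcome in (False, False, True, False, False, False):
    schedule.record_run(outcome)
```

In this run the success after two failures resets the counter, and the three failures that follow disable the schedule.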

Smart polling

KatCore caches ETag and Last-Modified. If the upstream source hasn't changed, the run is a no-op — no duplicate data. Versioned lineage means each refresh is a new version of the same file.

A schedule like 0 6 * * * in UTC runs daily at 06:00 UTC — and silently skips when the source is unchanged.
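Smart polling of this kind is usually built on standard HTTP conditional requests: the client sends back the cached validators in `If-None-Match` and `If-Modified-Since`, and a `304 Not Modified` response means nothing changed. The sketch below shows that logic without making a network call; the function names and the shape of the cache dictionary are assumptions, not KatCore internals.

```python
def conditional_headers(cache: dict) -> dict:
    """Build request headers from cached validators, if any."""
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers


def handle_response(status: int, response_headers: dict, cache: dict) -> bool:
    """Return True when new data should be ingested, False for a no-op."""
    if status == 304:
        return False  # upstream unchanged, skip this run
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in response_headers.items()}
    if "etag" in lowered:
        cache["etag"] = lowered["etag"]
    if "last-modified" in lowered:
        cache["last_modified"] = lowered["last-modified"]
    return True


cache = {}
first = handle_response(200, {"ETag": '"abc123"'}, cache)
next_request = conditional_headers(cache)
second = handle_response(304, {}, cache)
```

The first run ingests and stores the validator, the next request carries it, and the `304` on the second run makes that run a no-op.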

06 · The payoff

Notebooks & Artifacts

An artifact is a Jupyter .ipynb notebook that lives inside KatCore: viewable, cell-by-cell editable (markdown + code, GitHub-flavored, syntax-highlighted), and downloadable as a real .ipynb to share or open anywhere.

The auto-generated Data Quality Report

Every time you run an audit, KatCore generates a Quality Report notebook — no setup. It contains:

Scorecard

Grade, per-dimension breakdown, and prioritized fixes.

Narrative

Natural-language per-column findings with evidence — null row indices, sample values, outlier ranges.

Remediation plan

Entity mappings, schema renames, PII masking rules, imputation strategies, date patterns.

DuckDB SQL

A single ready-to-run block that applies every fix.

It's interactive, not a dead PDF. Edit cells in place, tweak the remediation, re-run the audit to watch the score climb, then download to share with your team.

Bring your own

Upload your own notebooks (≤ 10 MB, validated nbformat 4) and attach them to any dataset file.
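Before uploading, a notebook can be sanity-checked locally against the two stated limits. This is a minimal sketch, not KatCore's validator: the function name is hypothetical, "10 MB" is assumed to mean 10 × 1024 × 1024 bytes, and only the top-level `nbformat` and `cells` fields are inspected. Full schema validation would need something like the `nbformat` package.

```python
import json

MAX_NOTEBOOK_BYTES = 10 * 1024 * 1024  # assumes "10 MB" means 10 MiB


def check_notebook(raw: bytes) -> list:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    if len(raw) > MAX_NOTEBOOK_BYTES:
        problems.append("file is larger than 10 MB")
    try:
        notebook = json.loads(raw)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return problems + ["file is not valid JSON"]
    if not isinstance(notebook, dict):
        return problems + ["top-level JSON value must be an object"]
    if notebook.get("nbformat") != 4:
        problems.append("nbformat must be 4")
    if not isinstance(notebook.get("cells"), list):
        problems.append("missing cells list")
    return problems


# Smallest notebook that passes these checks.
minimal = json.dumps(
    {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": []}
).encode("utf-8")
problems = check_notebook(minimal)
```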

07 · Reference

FAQ

What's the maximum file size?

100 MB per file, up to 50 files per batch. Larger datasets are handled via scheduled ingestion and versioning.

Which formats are supported?

CSV, JSON, XLSX, PDF, TXT, DOCX, MD, Parquet, and HTML. Documents are chunked and embedded for retrieval.

How is my data isolated?

Every workspace is logically separated and encrypted at rest. Data is normalized to Parquet and stored on Cloudflare R2.

How does the chat behave?

Answers are grounded in your files and the auto-generated column descriptions, and every answer cites its source file. The audit and score are deterministic and fully explainable.
