An AI agent platform that investigates incidents, proposes fixes,
and waits for your approval.
An alert fires at 2am. A service degrades. A build pipeline breaks. A memory leak grows. And because the one person who knows how to dig into it is unavailable — on PTO, in a meeting, swamped with other work — your team stares at dashboards, files tickets, and hopes.
KaizenCore changes that. It connects to your Kubernetes clusters, Linux and Windows servers, AWS accounts, PostgreSQL databases, and GitHub repositories. It watches for alerts. It digs into problems autonomously. It proposes fixes with full context. And it never executes a single write operation without a human reviewing exactly what will run and explicitly approving it.
The DevOps engineer is still in the loop — but now they approve a fix from their phone in 30 seconds instead of being woken up, spending 45 minutes investigating, and then having to type the commands themselves.
The KaizenCore dashboard: active agents, pending proposals, recent investigations, and platform health — all in one view.
Set an AI agent to run on a cron schedule against any platform. Every 5 minutes, every night at 2am, every Monday morning — the agent runs, reads your infrastructure state, and reports. When something is wrong, it proposes a fix. You review, you approve, it executes.
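The schedule check itself is ordinary cron matching. As an illustration only — KaizenCore's actual scheduler isn't shown in this document — here is a minimal matcher for a simplified subset of cron syntax (`*`, `*/n`, and literal numbers):

```python
from datetime import datetime

def field_matches(spec: str, value: int) -> bool:
    """Match one cron field against a value: '*', '*/n', or a literal number."""
    if spec == "*":
        return True
    if spec.startswith("*/"):
        return value % int(spec[2:]) == 0
    return value == int(spec)

def cron_matches(expr: str, dt: datetime) -> bool:
    """True if dt falls on the schedule. Fields: minute hour day month weekday.
    Weekday uses the cron convention (0 = Sunday), hence isoweekday() % 7."""
    minute, hour, day, month, weekday = expr.split()
    return (field_matches(minute, dt.minute)
            and field_matches(hour, dt.hour)
            and field_matches(day, dt.day)
            and field_matches(month, dt.month)
            and field_matches(weekday, dt.isoweekday() % 7))

# "Every 5 minutes", "every night at 2am", "every Monday morning":
assert cron_matches("*/5 * * * *", datetime(2024, 1, 1, 2, 10))
assert cron_matches("0 2 * * *", datetime(2024, 1, 1, 2, 0))
assert cron_matches("0 8 * * 1", datetime(2024, 1, 1, 8, 0))  # 2024-01-01 is a Monday
```

Real cron implementations also support ranges and lists; this sketch only covers the three schedule shapes named above.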
Describe a problem in plain English. KaizenCore runs an AI investigator across every connected platform simultaneously. It reads logs, queries databases, inspects pods, checks AWS resources, and searches your codebase — all in parallel. You watch the tool calls stream in live.
Connect Prometheus, Grafana, or PagerDuty to KaizenCore's webhook endpoint. The moment an alert fires, an investigation starts — automatically. By the time your on-call engineer looks at their phone, KaizenCore has already been investigating for 3 minutes.
Every agent is tied to a platform, a schedule, a tool set, and a system prompt. Admins approve agents before they go live.
Every write operation in KaizenCore — restarting a service, running a kubectl command, executing a SQL query, pushing a commit — creates a Proposal. The proposal shows the reviewer exactly what will run and why the AI wants to run it.
The reviewer clicks Approve — the command runs, the result is captured. The reviewer clicks Reject — the command never runs, the agent adapts. No commands execute in the background. No surprises.
If optional TOTP 2FA is enabled, high-risk approvals require a 6-digit authenticator code before execution. Multi-step proposals let the agent propose a complete remediation plan — restart service, verify health, roll back if needed — as a sequential series of steps reviewers see in full before approving step one.
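The TOTP check is standard RFC 6238. A minimal sketch of the verification step using only Python's standard library (the function names are illustrative, not KaizenCore's API):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp_code(secret_b32: str, at=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the 30-second counter, dynamically truncated."""
    key = base64.b32decode(secret_b32)
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = (int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

def verify(secret_b32: str, submitted: str) -> bool:
    """Constant-time compare of the submitted code against the current window."""
    return hmac.compare_digest(totp_code(secret_b32), submitted)

# RFC 6238 test vector: ASCII key "12345678901234567890", T = 59s -> 94287082
assert totp_code("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", at=59) == "287082"
```

Production verifiers typically also accept the adjacent time window to tolerate clock drift.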
A multi-step proposal: the AI describes its full remediation plan. Step-by-step execution only begins after the reviewer approves.
KaizenCore connects to the platforms you already run. Add credentials once; every agent and investigation you create automatically has access to the tools for that platform.
Connect Kubernetes, Linux, Windows, PostgreSQL, AWS, and GitHub — all from the same UI.
Pod status, logs, node metrics, deployment events. Proposes any kubectl command — exec, apply, scale, rollout undo.
CPU/memory/disk/IO, process lists, failed services, port bindings, application logs. Proposes shell commands or service restarts.
System info, Windows services, Event Log errors, port bindings. Proposes PowerShell via OpenSSH. Quick Setup handles authorized_keys automatically.
Slow queries via pg_stat_statements, table sizes, replication lag, lock conflicts, index health. Proposes SQL as a reviewed proposal.
EC2, ECS, EKS, RDS, Lambda, CloudWatch, S3, ELB. Reads across services; proposals for reboot, scale, and force-deploy.
Repository search, file read, diff analysis. Proposes branch creation, file writes, and pull requests — all human-gated.
You type a problem description — "nginx is returning 502 errors" or "the nightly ETL job failed" — and select one or more platforms. KaizenCore creates a run, builds a context-aware agent, and starts executing.
You watch it happen in real time. Every tool call streams to the timeline as it happens — the exact input, the exact output, no black box. You see the AI read your logs, query your database, check your pods, search your code.
The tool permission system extends investigations into territory the AI can't read by default. If the agent needs to run netstat -tlnp or kubectl exec, it pauses, tells you exactly what it wants to run and why, and waits. You approve or deny. Follow-up chat keeps investigations alive after the initial run — type a follow-up and a new run picks up with full context of the previous investigation.
An active investigation: tool calls stream in real time. Each call shows its exact input and output. A permission request banner appears when the AI needs elevated access.
KaizenCore exposes an inbound webhook endpoint. Point your alerting system at it and connect an investigation to fire automatically every time an alert arrives. Supported sources: Prometheus AlertManager, Grafana, PagerDuty, and any generic JSON webhook.
POST /api/webhooks/inbound/{your-token}

Alert labels map to platforms (e.g., namespace=production → your production K8s cluster). By the time your on-call engineer opens the Slack notification, KaizenCore has already spent 3–5 minutes reading logs, checking services, and investigating the blast radius.
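The label-to-platform routing can be pictured like this — a sketch assuming a Prometheus Alertmanager-style webhook payload; the mapping structure and function names here are hypothetical, not KaizenCore's actual schema:

```python
def route_alerts(payload: dict, label_map: dict) -> list:
    """Map each alert in the webhook body to a platform via its labels.
    label_map: {(label_name, label_value): platform_id}."""
    routed = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        for (name, value), platform in label_map.items():
            if labels.get(name) == value:
                routed.append((platform, labels.get("alertname", "unknown")))
                break  # first matching mapping wins
    return routed

# Alertmanager-style webhook body (abbreviated)
payload = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "HighErrorRate", "namespace": "production"},
         "annotations": {"summary": "nginx 502 rate above 5%"}},
    ],
}
label_map = {("namespace", "production"): "k8s-prod"}
assert route_alerts(payload, label_map) == [("k8s-prod", "HighErrorRate")]
```

Each routed pair would then seed an investigation against the matched platform.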
Webhook source configuration: map alert labels to platforms, choose your AI model, and write a description template using alert data.
Every environment has tools that aren't in any default toolset. A custom kubectl plugin. An internal health check script. A deployment trigger API. KaizenCore's custom tools let you teach the AI about these one time, and it uses them naturally alongside built-in tools.
Define a shell or kubectl command with parameter placeholders. Mark as read (executes immediately) or write (creates a proposal for approval).
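A sketch of that substitution-and-routing idea — the `$name` placeholder syntax and the returned field names are assumptions for illustration; the document doesn't specify KaizenCore's actual template format:

```python
import shlex
from string import Template

def run_custom_tool(template: str, params: dict, mode: str) -> dict:
    """Fill placeholders (shell-quoted to prevent injection), then route:
    read tools execute immediately, write tools become proposals."""
    command = Template(template).substitute(
        {k: shlex.quote(v) for k, v in params.items()}
    )
    if mode == "read":
        return {"action": "execute", "command": command}
    return {"action": "create_proposal", "command": command}

read_call = run_custom_tool("kubectl get pods -n $namespace", {"namespace": "prod"}, "read")
write_call = run_custom_tool("systemctl restart $service", {"service": "nginx"}, "write")
assert read_call == {"action": "execute", "command": "kubectl get pods -n prod"}
assert write_call["action"] == "create_proposal"
```

Shell-quoting the parameters matters because the AI supplies them at call time.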
The AI POSTs a JSON payload to your endpoint. Useful for triggering internal systems, sending structured data to custom APIs, or integrating with tools that have HTTP APIs.
Every custom tool is scoped to a platform type so it only appears for relevant agents and investigations.
Custom tools extend the AI's built-in capabilities. Command tools run shell or kubectl commands; webhook tools POST to your endpoints.
KaizenCore is provider-agnostic. Configure named provider profiles once; use them everywhere. Different agents can use different models — routine health checks on a cost-efficient model, critical incident investigations on the most powerful one available.
Create a named profile (e.g., "Production — Claude Sonnet" or "Cost-Optimized — GPT-3.5"). Assign it to agents and investigations. When you want to switch the model, update the profile — every agent using it updates automatically.
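The mechanism is simple indirection: agents store a profile name, not a model. A minimal sketch (class and field names are illustrative, not KaizenCore's data model):

```python
from dataclasses import dataclass

@dataclass
class ProviderProfile:
    provider: str
    model: str

@dataclass
class Agent:
    name: str
    profile_name: str  # indirection: the agent references a profile, not a model

# Hypothetical profile and agents for illustration
profiles = {"Production — Claude Sonnet": ProviderProfile("anthropic", "claude-sonnet")}
agents = [Agent("nightly-db-check", "Production — Claude Sonnet"),
          Agent("k8s-health", "Production — Claude Sonnet")]

def model_for(agent: Agent) -> str:
    """Resolve the model at run time through the profile."""
    return profiles[agent.profile_name].model

# Switch the model in one place; every agent using the profile follows.
profiles["Production — Claude Sonnet"].model = "claude-opus"
assert all(model_for(a) == "claude-opus" for a in agents)
```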
Named AI provider profiles: configure once, assign to any number of agents and investigations. Switch models in one place.
All platform credentials are encrypted before database storage. Production deployments can use HashiCorp Vault KV v2 — credentials live in Vault, not the database. One-button migration with automatic rollback on failure.
An agent running against your Kubernetes cluster cannot access the SSH key for your PostgreSQL server. Every run can only read credentials for its assigned platform — enforced at the executor level.
Every mutation is logged with actor ID, timestamp, action type, and resource. The log is enforced at the database level as INSERT-only — no record can be modified or deleted. Export for compliance audits.
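The database-level guarantee can be demonstrated with triggers. This sketch uses SQLite for portability — KaizenCore runs on PostgreSQL, where the equivalent is a trigger function that raises an exception — but the append-only pattern is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY,
        actor TEXT NOT NULL,
        action TEXT NOT NULL,
        resource TEXT NOT NULL,
        at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- INSERT-only: any UPDATE or DELETE aborts inside the database itself
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit log is append-only'); END;
    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit log is append-only'); END;
""")
db.execute("INSERT INTO audit_log (actor, action, resource) "
           "VALUES ('user-7', 'approve_proposal', 'proposal-42')")

try:
    db.execute("UPDATE audit_log SET action = 'tampered'")
    tampered = True
except sqlite3.DatabaseError:
    tampered = False
assert tampered is False
assert db.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0] == 1
```

Because the triggers live in the database, even a compromised application process cannot rewrite history.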
Three roles: Admin, Editor, Viewer. OIDC/SSO supported with configurable default role for new users. Multiple OAuth 2.0 providers.
Reviewers can enable authenticator app 2FA. Approving a proposal requires a valid 6-digit TOTP code — a stolen session cannot execute infrastructure changes.
Nothing that modifies infrastructure runs without creating a visible Proposal record first. Every write operation goes through the same review gate. This is enforced in code, not just the UI.
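In code terms, the invariant is that write paths return a pending Proposal rather than executing anything — a sketch of the gate (class and field names are illustrative, not KaizenCore's implementation):

```python
class ProposalGate:
    """Every write is wrapped: nothing runs until a human approves."""

    def __init__(self):
        self.proposals = []

    def propose(self, description, command, execute_fn):
        p = {"description": description, "command": command,
             "execute": execute_fn, "status": "pending", "result": None}
        self.proposals.append(p)
        return p  # a visible record is created first; no side effects yet

    def approve(self, p):
        p["status"] = "approved"
        p["result"] = p["execute"]()  # the command runs only now
        return p["result"]

    def reject(self, p):
        p["status"] = "rejected"  # the command never runs

gate = ProposalGate()
executed = []
p = gate.propose("Restart nginx on web-1", "systemctl restart nginx",
                 lambda: executed.append("ran") or "ok")
assert executed == []            # proposing executes nothing
assert gate.approve(p) == "ok"   # execution happens only on approval
assert executed == ["ran"]
```

Routing every write through one gate is what makes the guarantee enforceable in code rather than in UI conventions.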
Example proposals: `kubectl rollout undo deployment/api -n production`; `CREATE INDEX CONCURRENTLY` to restore a dropped `idx_orders_status` index.

"The goal isn't to remove the engineer. It's to remove the 45-minute investigation phase so that by the time the engineer looks at the incident, all the context they need to make a decision is already assembled."
KaizenCore is built around one conviction: the human must always be in the loop for writes.
Investigations are fully autonomous — the AI reads everything, diagnoses everything, explains everything. That's safe. Reading production logs and querying databases doesn't break anything.
But when it comes to changing something — restarting a service, running a command, pushing code, executing SQL — a human sees exactly what will run, understands why the AI thinks it should run, and makes the call. This isn't a limitation. It's the feature.
The AI does the work. The engineer does the thinking. Everyone goes home on time.
docs/deployment-kubernetes.md

| Feature | Description |
|---|---|
| Scheduled Agents | Cron-based agents with configurable tools and system prompts |
| On-Demand Investigations | Plain-English problem → AI investigation → findings summary |
| Alert-Triggered Investigations | Prometheus/Grafana/PagerDuty webhooks auto-start investigations |
| Proposal System | All writes require human review before execution |
| Multi-Step Proposals | Complex fixes bundled as sequential steps with human approval |
| TOTP 2FA on Approvals | Authenticator code required for high-risk proposal approvals |
| Proposal Expiry | Configurable TTL per agent; auto-expire stale pending proposals |
| Tool Permission Requests | AI pauses to request elevated access; human approves/denies |
| Follow-Up Chat | Continue investigations with new messages; full context preserved |
| Custom Tools | User-defined command and webhook tools per platform type |
| Kubernetes Platform | Full read + kubectl proposals |
| Linux Platform | Full read + shell proposals + Quick SSH setup |
| Windows Platform | Full read + PowerShell proposals + Quick SSH setup |
| PostgreSQL Platform | Full read + SQL proposals |
| AWS Platform | EC2/ECS/EKS/RDS/Lambda/CloudWatch/S3/ELB read + write proposals |
| GitHub Platform | Code read/search + branch/PR/file write proposals |
| HashiCorp Vault | Optional secrets backend; Kubernetes auth; one-way migration |
| RBAC | Admin / Editor / Viewer roles |
| OIDC/SSO | Multiple OAuth 2.0 providers; configurable default role |
| Immutable Audit Log | DB-enforced INSERT-only log of all mutations |
| Real-Time WebSocket | Live updates for runs, proposals, and permission requests |
| ntfy Push Notifications | Proposal alerts with optional approve/reject action buttons |
| Multi-Platform Investigations | Investigate Kubernetes + AWS + GitHub simultaneously in one run |
| Agent Status Workflow | Draft → Pending Review → Active → Disabled (with admin approval gate) |
| Agent Cloning | Copy any agent as a starting point for a new one |
| Provider Profiles | Named AI provider configs reused across all agents and investigations |
| Database Backups | Scheduled PostgreSQL backups with configurable retention |
| AES-256-GCM Encryption | All credentials encrypted at rest before database storage |