KaizenCore · Orchestration Platform

Your infrastructure never sleeps.
Neither does your AI team.

An AI agent platform that investigates incidents, proposes fixes,
and waits for your approval.

An alert fires at 2am. A service degrades. A build pipeline breaks. A memory leak grows. And because the one person who knows how to dig into it is unavailable — on PTO, in a meeting, swamped with other work — your team stares at dashboards, files tickets, and hopes.

KaizenCore changes that. It connects to your Kubernetes clusters, Linux and Windows servers, AWS accounts, PostgreSQL databases, and GitHub repositories. It watches for alerts. It digs into problems autonomously. It proposes fixes with full context. And it never executes a single write operation without a human reviewing exactly what will run and explicitly approving it.

The DevOps engineer is still in the loop — but now they approve a fix from their phone in 30 seconds instead of being woken up, spending 45 minutes investigating, and then having to type the commands themselves.

KaizenCore Dashboard

The KaizenCore dashboard: active agents, pending proposals, recent investigations, and platform health — all in one view.

Three things that used to require a human expert. Now automated, audited, and human-approved.

Scheduled Health Agents

Set an AI agent to run on a cron schedule against any platform. Every 5 minutes, every night at 2am, every Monday morning — the agent runs, reads your infrastructure state, and reports. When something is wrong, it proposes a fix. You review, you approve, it executes.
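In standard five-field cron syntax, the schedules mentioned above read as:

```
*/5 * * * *   # every 5 minutes
0 2 * * *     # every night at 2am
0 8 * * 1     # every Monday morning (8am, as in the weekly review scenario below)
```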

On-Demand Investigations

Describe a problem in plain English. KaizenCore runs an AI investigator across every connected platform simultaneously. It reads logs, queries databases, inspects pods, checks AWS resources, and searches your codebase — all in parallel. You watch the tool calls stream in live.

Alert-Triggered Auto-Investigations

Connect Prometheus, Grafana, or PagerDuty to KaizenCore's webhook endpoint. The moment an alert fires, an investigation starts — automatically. By the time your on-call engineer looks at their phone, KaizenCore has already been investigating for 3 minutes.

Scheduled Agents

Every agent is tied to a platform, a schedule, a tool set, and a system prompt. Admins approve agents before they go live.

The AI investigates. Humans decide. Nothing executes without approval.

Every write operation in KaizenCore — restarting a service, running a kubectl command, executing a SQL query, pushing a commit — creates a Proposal. The proposal shows the reviewer:

01 · Exactly what will run: the tool name, the command, every argument — nothing hidden.
02 · Why the agent wants to run it: the AI's full reasoning in plain language, not just the command.
03 · What platform it targets: Linux server, Kubernetes cluster, AWS account — no ambiguity.
04 · A live countdown timer: proposals expire, so stale approvals don't run unexpectedly.

The reviewer clicks Approve — the command runs, the result is captured. The reviewer clicks Reject — the command never runs, the agent adapts. No commands execute in the background. No surprises.

If optional TOTP 2FA is enabled, high-risk approvals require a 6-digit authenticator code before execution. Multi-step proposals let the agent propose a complete remediation plan — restart service, verify health, roll back if needed — as a sequential series of steps reviewers see in full before approving step one.

Multi-step Proposal

A multi-step proposal: the AI describes its full remediation plan. Step-by-step execution only begins after the reviewer approves.

Six platforms. One interface. One AI team.

KaizenCore connects to the platforms you already run. Add credentials once; every agent and investigation you create automatically has access to the tools for that platform.

Platform Options

Connect Kubernetes, Linux, Windows, PostgreSQL, AWS, and GitHub — all from the same UI.

Kubernetes

Full cluster visibility

Pod status, logs, node metrics, deployment events. Proposes any kubectl command — exec, apply, scale, rollout undo.

CrashLoopBackOff at 2am? KaizenCore reads the logs, traces the bad env var, proposes kubectl rollout undo.
Linux

Deep system introspection

CPU/memory/disk/IO, process lists, failed services, port bindings, application logs. Proposes shell commands or service restarts.

Disk at 94%? Finds the runaway log file, proposes truncate + service restart in one step.
Windows

PowerShell-native proposals

System info, Windows services, Event Log errors, port bindings. Proposes PowerShell via OpenSSH. Quick Setup handles authorized_keys automatically.

IIS app pool recycling? Reads Event Log, identifies memory threshold, proposes PowerShell fix.
PostgreSQL

Query-level diagnostics

Slow queries via pg_stat_statements, table sizes, replication lag, lock conflicts, index health. Proposes SQL as a reviewed proposal.

Query time tripled? Identifies sequential scan, proposes CREATE INDEX CONCURRENTLY on the 50M-row table.
AWS

Multi-service coverage

EC2, ECS, EKS, RDS, Lambda, CloudWatch, S3, ELB. Reads across services; proposals for reboot, scale, and force-deploy.

ECS service unhealthy? Correlates CloudWatch logs, task failures, and ELB target health into one root-cause summary.
GitHub

Code read + write proposals

Repository search, file read, diff analysis. Proposes branch creation, file writes, and pull requests — all human-gated.

Bad deploy? Clones repo, reads the diff, identifies the breaking change, opens a corrected PR for review.

Describe the problem. Watch the AI work. Get an answer.

You type a problem description — "nginx is returning 502 errors" or "the nightly ETL job failed" — and select one or more platforms. KaizenCore creates a run, builds a context-aware agent, and starts executing.

You watch it happen in real time. Every tool call streams to the timeline as it happens — the exact input, the exact output, no black box. You see the AI read your logs, query your database, check your pods, search your code.

The tool permission system extends investigations into territory the AI can't read by default. If the agent needs to run netstat -tlnp or kubectl exec, it pauses, tells you exactly what it wants to run and why, and waits. You approve or deny. Follow-up chat keeps investigations alive after the initial run — type a follow-up and a new run picks up with full context of the previous investigation.

Investigate & Plan

An active investigation: tool calls stream in real time. Each call shows its exact input and output. A permission request banner appears when the AI needs elevated access.

Alert fires. Investigation starts. Results waiting before you finish reading the alert.

KaizenCore exposes an inbound webhook endpoint. Point your alerting system at it and connect an investigation to fire automatically every time an alert arrives. Supported sources: Prometheus AlertManager, Grafana, PagerDuty, and any generic JSON webhook.

Alert fires in your monitoring system
Payload POSTs to /api/webhooks/inbound/{your-token}
KaizenCore extracts alert name, status, labels, and annotations
Maps alert labels to the target platform (e.g., namespace=production → your production K8s cluster)
Creates an investigation with an auto-generated description from a configurable template
Starts the run immediately — no human required to kick it off
Sends a push notification via ntfy with a deep link to findings
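As a sketch, simulating an inbound alert with curl might look like the following. The payload shape follows AlertManager's webhook format, but the hostname, token variable, and label values are illustrative placeholders, not part of a real deployment:

```shell
# Sketch of an AlertManager-style payload for KaizenCore's inbound endpoint.
# Hostname and $TOKEN are placeholders; the namespace label is what the
# label-to-platform mapping would match against.
PAYLOAD='{"alerts":[{"status":"firing","labels":{"alertname":"HighErrorRate","namespace":"production"},"annotations":{"summary":"5xx rate above 5% for 10m"}}]}'

# Uncomment to send it against a real deployment:
# curl -X POST "https://kaizencore.example.com/api/webhooks/inbound/$TOKEN" \
#      -H "Content-Type: application/json" \
#      --data "$PAYLOAD"

# Validate the payload shape locally:
echo "$PAYLOAD" | python3 -m json.tool >/dev/null && echo "payload OK"
```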

By the time your on-call engineer opens the Slack notification, KaizenCore has already spent 3–5 minutes reading logs, checking services, and investigating the blast radius.

Webhook Alerts

Webhook source configuration: map alert labels to platforms, choose your AI model, and write a description template using alert data.

The AI knows your infrastructure's custom commands too.

Every environment has tools that aren't in any default toolset. A custom kubectl plugin. An internal health check script. A deployment trigger API. KaizenCore's custom tools let you teach the AI about these one time, and it uses them naturally alongside built-in tools.

Command Tools

Define a shell or kubectl command with parameter placeholders. Mark as read (executes immediately) or write (creates a proposal for approval).

# Example
Name: get_pod_resource_usage
Cmd: kubectl top pod {{pod_name}}
Type: read

Webhook Tools

The AI POSTs a JSON payload to your endpoint. Useful for triggering internal systems, sending structured data to custom APIs, or integrating with tools that have HTTP APIs.
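By analogy with the command tool example, a webhook tool definition might look like this — the name and URL are hypothetical, not a real endpoint:

```
# Example (hypothetical endpoint)
Name: trigger_cache_flush
URL: https://ops.internal.example/api/flush
Type: write
```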

Every custom tool is scoped to a platform type so it only appears for relevant agents and investigations.

Tools Page

Custom tools extend the AI's built-in capabilities. Command tools run shell or kubectl commands; webhook tools POST to your endpoints.

Bring your own AI. Switch models without reconfiguring anything.

KaizenCore is provider-agnostic. Configure named provider profiles once; use them everywhere. Different agents can use different models — routine health checks on a cost-efficient model, critical incident investigations on the most powerful one available.

Anthropic Claude
OpenAI GPT-4
Any OpenAI-compatible endpoint
Local / self-hosted models

Create a named profile (e.g., "Production — Claude Sonnet" or "Cost-Optimized — GPT-3.5"). Assign it to agents and investigations. When you want to switch the model, update the profile — every agent using it updates automatically.

AI Providers

Named AI provider profiles: configure once, assign to any number of agents and investigations. Switch models in one place.

Enterprise-grade security. Nothing trusts nothing.

AES-256-GCM Encryption at Rest

All platform credentials are encrypted before database storage. Production deployments can use HashiCorp Vault KV v2 — credentials live in Vault, not the database. One-button migration with automatic rollback on failure.

Agent-Scoped Credential Access

An agent running against your Kubernetes cluster cannot access the SSH key for your PostgreSQL server. Every run can only read credentials for its assigned platform — enforced at the executor level.

Immutable Audit Log

Every mutation is logged with actor ID, timestamp, action type, and resource. The log is enforced at the database level as INSERT-only — no record can be modified or deleted. Export for compliance audits.

RBAC + OIDC/SSO

Three roles: Admin, Editor, Viewer. OIDC/SSO supported with configurable default role for new users. Multiple OAuth 2.0 providers.

TOTP 2FA on Approvals

Reviewers can enable authenticator app 2FA. Approving a proposal requires a valid 6-digit TOTP code — a stolen session cannot execute infrastructure changes.

Zero Surprise Execution

Nothing that modifies infrastructure runs without creating a visible Proposal record first. Every write operation goes through the same review gate. This is enforced in code, not just the UI.

What this looks like when your team actually uses it.

Scenario 01 It's 3am. Production is down. The on-call engineer is asleep.
  • KaizenCore received the PagerDuty alert 4 minutes ago
  • Listed all pods in production namespace, found 3 in CrashLoopBackOff
  • Read container logs, checked recent deployment events
  • Identified the commit that introduced the breaking change
  • Proposed: kubectl rollout undo deployment/api -n production
  • Engineer wakes up, reads 3-sentence summary, taps Approve. Goes back to sleep.
Scenario 02 A developer merged a bad migration. The database is degrading.
  • Alert fires on slow query time
  • pg_stat_statements reveals a new sequential scan on a 50M-row table
  • GitHub diff confirms the migration dropped idx_orders_status
  • Proposes CREATE INDEX CONCURRENTLY to restore the index
  • Creates a GitHub PR to add the index back to the migration file
  • Both proposals reviewed and approved. Index restored in minutes.
Scenario 03 A new engineer needs to debug an unfamiliar service.
  • Types: "The payment service is returning 500 errors. I don't know this codebase."
  • KaizenCore reads app logs across Linux and Kubernetes
  • Finds the exception stacktrace, clones the repo, reads the relevant code path
  • Identifies a third-party API key that expired
  • Proposes updating the secret in Kubernetes
  • Engineer approves the fix. Learned something about the system in the process.
Scenario 04 Weekly infrastructure health review. Nobody wants to do it manually.
  • Scheduled agent runs every Monday at 8am across all platforms
  • Checks for pods with high restart counts
  • Queries PostgreSQL for growing table bloat
  • Reads failed systemd services on all Linux hosts
  • Checks CloudWatch for Lambda error rate trends
  • Summary lands in Slack before standup. Issues that would hide for weeks get surfaced early.

AI that augments engineers. Not AI that replaces them.

"The goal isn't to remove the engineer. It's to remove the 45-minute investigation phase so that by the time the engineer looks at the incident, all the context they need to make a decision is already assembled."

KaizenCore is built around one conviction: the human must always be in the loop for writes.

Investigations are fully autonomous — the AI reads everything, diagnoses everything, explains everything. That's safe. Reading production logs and querying databases doesn't break anything.

But when it comes to changing something — restarting a service, running a command, pushing code, executing SQL — a human sees exactly what will run, understands why the AI thinks it should run, and makes the call. This isn't a limitation. It's the feature.

The AI does the work. The engineer does the thinking. Everyone goes home on time.

Running in under 10 minutes.

# Clone and start development environment
git clone <repo>
docker-compose -f docker-compose.dev.yml up

# Frontend: http://localhost:3000
# Backend: http://localhost:8080
# Set ADMIN_EMAIL and ADMIN_PASSWORD in compose env
01 Add a platform — Settings → Platforms → Add Platform
02 Test connectivity — click "Test Connection" after saving
03 Configure an AI provider — Settings → AI Providers → Add Provider
04 Create your first investigation — Investigate → New Investigation
05 Watch the tool call timeline stream in
For production deployments
Use the Kubernetes deployment manifests in docs/deployment-kubernetes.md
Enable Vault for secrets storage — Settings → Secrets Storage
Configure OIDC for SSO — Settings → OIDC
Set up ntfy for push notifications — Settings → Notifications
Configure webhook sources for alert-triggered investigations — Settings → Webhooks

Complete capability overview.

Feature · Description
Scheduled Agents · Cron-based agents with configurable tools and system prompts
On-Demand Investigations · Plain-English problem → AI investigation → findings summary
Alert-Triggered Investigations · Prometheus/Grafana/PagerDuty webhooks auto-start investigations
Proposal System · All writes require human review before execution
Multi-Step Proposals · Complex fixes bundled as sequential steps with human approval
TOTP 2FA on Approvals · Authenticator code required for high-risk proposal approvals
Proposal Expiry · Configurable TTL per agent; auto-expire stale pending proposals
Tool Permission Requests · AI pauses to request elevated access; human approves/denies
Follow-Up Chat · Continue investigations with new messages; full context preserved
Custom Tools · User-defined command and webhook tools per platform type
Kubernetes Platform · Full read + kubectl proposals
Linux Platform · Full read + shell proposals + Quick SSH setup
Windows Platform · Full read + PowerShell proposals + Quick SSH setup
PostgreSQL Platform · Full read + SQL proposals
AWS Platform · EC2/ECS/EKS/RDS/Lambda/CloudWatch/S3/ELB read + write proposals
GitHub Platform · Code read/search + branch/PR/file write proposals
HashiCorp Vault · Optional secrets backend; Kubernetes auth; one-way migration
RBAC · Admin / Editor / Viewer roles
OIDC/SSO · Multiple OAuth 2.0 providers; configurable default role
Immutable Audit Log · DB-enforced INSERT-only log of all mutations
Real-Time WebSocket · Live updates for runs, proposals, and permission requests
ntfy Push Notifications · Proposal alerts with optional approve/reject action buttons
Multi-Platform Investigations · Investigate Kubernetes + AWS + GitHub simultaneously in one run
Agent Status Workflow · Draft → Pending Review → Active → Disabled (with admin approval gate)
Agent Cloning · Copy any agent as a starting point for a new one
Provider Profiles · Named AI provider configs reused across all agents and investigations
Database Backups · Scheduled PostgreSQL backups with configurable retention
AES-256-GCM Encryption · All credentials encrypted at rest before database storage