An AI agent platform that investigates incidents, proposes fixes,
and waits for your approval.
An alert fires at 2am. A service degrades. A build pipeline breaks. A memory leak grows. And because the one person who knows how to dig into it is unavailable — on PTO, in a meeting, swamped with other work — your team stares at dashboards, files tickets, and hopes.
KaizenCore changes that. It connects to your Kubernetes clusters, Linux and Windows servers, AWS accounts, PostgreSQL databases, and GitHub repositories. It watches for alerts. It digs into problems autonomously. It proposes fixes with full context. And it never executes a single write operation without a human reviewing exactly what will run and explicitly approving it.
The DevOps engineer is still in the loop — but now they approve a fix from their phone in 30 seconds instead of being woken up, spending 45 minutes investigating, and then having to type the commands themselves.
The KaizenCore dashboard: active agents, pending proposals, recent investigations, and platform health — all in one view.
Set an AI agent to run on a cron schedule against any platform. Every 5 minutes, every night at 2am, every Monday morning — the agent runs, reads your infrastructure state, and reports. When something is wrong, it proposes a fix. You review, you approve, it executes.
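The schedule check itself is ordinary cron matching. As an illustration only — KaizenCore's actual scheduler isn't shown in this document — here is a minimal matcher for a simplified subset of cron syntax (`*`, `*/n`, and literal numbers):

```python
from datetime import datetime

def field_matches(spec: str, value: int) -> bool:
    """Match one cron field against a value: '*', '*/n', or a literal number."""
    if spec == "*":
        return True
    if spec.startswith("*/"):
        return value % int(spec[2:]) == 0
    return value == int(spec)

def cron_matches(expr: str, dt: datetime) -> bool:
    """True if dt falls on the schedule. Fields: minute hour day month weekday.
    Weekday uses the cron convention (0 = Sunday), hence isoweekday() % 7."""
    minute, hour, day, month, weekday = expr.split()
    return (field_matches(minute, dt.minute)
            and field_matches(hour, dt.hour)
            and field_matches(day, dt.day)
            and field_matches(month, dt.month)
            and field_matches(weekday, dt.isoweekday() % 7))

# "Every 5 minutes", "every night at 2am", "every Monday morning":
assert cron_matches("*/5 * * * *", datetime(2024, 1, 1, 2, 10))
assert cron_matches("0 2 * * *", datetime(2024, 1, 1, 2, 0))
assert cron_matches("0 8 * * 1", datetime(2024, 1, 1, 8, 0))  # 2024-01-01 is a Monday
```

Real cron implementations also support ranges and lists; this sketch only covers the three schedule shapes named above.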
Describe a problem in plain English. KaizenCore runs an AI investigator across every connected platform simultaneously. It reads logs, queries databases, inspects pods, checks AWS resources, and searches your codebase — all in parallel. You watch the tool calls stream in live.
Connect Prometheus, Grafana, or PagerDuty to KaizenCore's webhook endpoint. The moment an alert fires, an investigation starts — automatically. By the time your on-call engineer looks at their phone, KaizenCore has already been investigating for 3 minutes.
Every agent is tied to a platform, a schedule, a tool set, and a system prompt. Admins approve agents before they go live.
Every write operation in KaizenCore — restarting a service, running a kubectl command, executing a SQL query, pushing a commit — creates a Proposal. The proposal shows the reviewer exactly what will run and why the AI wants to run it.
The reviewer clicks Approve — the command runs, the result is captured. The reviewer clicks Reject — the command never runs, the agent adapts. No commands execute in the background. No surprises.
If optional TOTP 2FA is enabled, high-risk approvals require a 6-digit authenticator code before execution. Multi-step proposals let the agent propose a complete remediation plan — restart service, verify health, roll back if needed — as a sequential series of steps reviewers see in full before approving step one.
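The TOTP check is standard RFC 6238. A minimal sketch of the verification step using only Python's standard library (the function names are illustrative, not KaizenCore's API):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp_code(secret_b32: str, at=None, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the 30-second counter, dynamically truncated."""
    key = base64.b32decode(secret_b32)
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = (int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

def verify(secret_b32: str, submitted: str) -> bool:
    """Constant-time compare of the submitted code against the current window."""
    return hmac.compare_digest(totp_code(secret_b32), submitted)

# RFC 6238 test vector: ASCII key "12345678901234567890", T = 59s -> 94287082
assert totp_code("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", at=59) == "287082"
```

Production verifiers typically also accept the adjacent time window to tolerate clock drift.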
A multi-step proposal: the AI describes its full remediation plan. Step-by-step execution only begins after the reviewer approves.
KaizenCore connects to the platforms you already run. Add credentials once; every agent and investigation you create automatically has access to the tools for that platform.
Connect Kubernetes, Linux, Windows, PostgreSQL, AWS, and GitHub — all from the same UI.
Pod status, logs, node metrics, deployment events. Proposes any kubectl command — exec, apply, scale, rollout undo.
CPU/memory/disk/IO, process lists, failed services, port bindings, application logs. Proposes shell commands or service restarts.
System info, Windows services, Event Log errors, port bindings. Proposes PowerShell via OpenSSH. Quick Setup handles authorized_keys automatically.
Slow queries via pg_stat_statements, table sizes, replication lag, lock conflicts, index health. Proposes SQL as a reviewed proposal.
EC2, ECS, EKS, RDS, Lambda, CloudWatch, S3, ELB. Reads across services; proposals for reboot, scale, and force-deploy.
Repository search, file read, diff analysis. Proposes branch creation, file writes, and pull requests — all human-gated.
You type a problem description — "nginx is returning 502 errors" or "the nightly ETL job failed" — and select one or more platforms. KaizenCore creates a run, builds a context-aware agent, and starts executing.
You watch it happen in real time. Every tool call streams to the timeline as it happens — the exact input, the exact output, no black box. You see the AI read your logs, query your database, check your pods, search your code.
The tool permission system extends investigations into territory the AI can't read by default. If the agent needs to run netstat -tlnp or kubectl exec, it pauses, tells you exactly what it wants to run and why, and waits. You approve or deny. Follow-up chat keeps investigations alive after the initial run — type a follow-up and a new run picks up with full context of the previous investigation.
An active investigation: tool calls stream in real time. Each call shows its exact input and output. A permission request banner appears when the AI needs elevated access.
KaizenCore exposes an inbound webhook endpoint. Point your alerting system at it and connect an investigation to fire automatically every time an alert arrives. Supported sources: Prometheus AlertManager, Grafana, PagerDuty, and any generic JSON webhook.
POST /api/webhooks/inbound/{your-token}

Alert labels map to platforms (e.g., namespace=production → your production K8s cluster). By the time your on-call engineer opens the Slack notification, KaizenCore has already spent 3–5 minutes reading logs, checking services, and investigating the blast radius.
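The label-to-platform routing can be pictured like this — a sketch assuming a Prometheus Alertmanager-style webhook payload; the mapping structure and function names here are hypothetical, not KaizenCore's actual schema:

```python
def route_alerts(payload: dict, label_map: dict) -> list:
    """Map each alert in the webhook body to a platform via its labels.
    label_map: {(label_name, label_value): platform_id}."""
    routed = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        for (name, value), platform in label_map.items():
            if labels.get(name) == value:
                routed.append((platform, labels.get("alertname", "unknown")))
                break  # first matching mapping wins
    return routed

# Alertmanager-style webhook body (abbreviated)
payload = {
    "status": "firing",
    "alerts": [
        {"labels": {"alertname": "HighErrorRate", "namespace": "production"},
         "annotations": {"summary": "nginx 502 rate above 5%"}},
    ],
}
label_map = {("namespace", "production"): "k8s-prod"}
assert route_alerts(payload, label_map) == [("k8s-prod", "HighErrorRate")]
```

Each routed pair would then seed an investigation against the matched platform.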
Webhook source configuration: map alert labels to platforms, choose your AI model, and write a description template using alert data.
Every environment has tools that aren't in any default toolset. A custom kubectl plugin. An internal health check script. A deployment trigger API. KaizenCore's custom tools let you teach the AI about these one time, and it uses them naturally alongside built-in tools.
Define a shell or kubectl command with parameter placeholders. Mark as read (executes immediately) or write (creates a proposal for approval).
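A sketch of that substitution-and-routing idea — the `$name` placeholder syntax and the returned field names are assumptions for illustration; the document doesn't specify KaizenCore's actual template format:

```python
import shlex
from string import Template

def run_custom_tool(template: str, params: dict, mode: str) -> dict:
    """Fill placeholders (shell-quoted to prevent injection), then route:
    read tools execute immediately, write tools become proposals."""
    command = Template(template).substitute(
        {k: shlex.quote(v) for k, v in params.items()}
    )
    if mode == "read":
        return {"action": "execute", "command": command}
    return {"action": "create_proposal", "command": command}

read_call = run_custom_tool("kubectl get pods -n $namespace", {"namespace": "prod"}, "read")
write_call = run_custom_tool("systemctl restart $service", {"service": "nginx"}, "write")
assert read_call == {"action": "execute", "command": "kubectl get pods -n prod"}
assert write_call["action"] == "create_proposal"
```

Shell-quoting the parameters matters because the AI supplies them at call time.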
The AI POSTs a JSON payload to your endpoint. Useful for triggering internal systems, sending structured data to custom APIs, or integrating with tools that have HTTP APIs.
Every custom tool is scoped to a platform type so it only appears for relevant agents and investigations.
Custom tools extend the AI's built-in capabilities. Command tools run shell or kubectl commands; webhook tools POST to your endpoints.
KaizenCore is provider-agnostic. Configure named provider profiles once; use them everywhere. Different agents can use different models — routine health checks on a cost-efficient model, critical incident investigations on the most powerful one available.
Create a named profile (e.g., "Production — Claude Sonnet" or "Cost-Optimized — GPT-3.5"). Assign it to agents and investigations. When you want to switch the model, update the profile — every agent using it updates automatically.
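The mechanism is simple indirection: agents store a profile name, not a model. A minimal sketch (class and field names are illustrative, not KaizenCore's data model):

```python
from dataclasses import dataclass

@dataclass
class ProviderProfile:
    provider: str
    model: str

@dataclass
class Agent:
    name: str
    profile_name: str  # indirection: the agent references a profile, not a model

# Hypothetical profile and agents for illustration
profiles = {"Production — Claude Sonnet": ProviderProfile("anthropic", "claude-sonnet")}
agents = [Agent("nightly-db-check", "Production — Claude Sonnet"),
          Agent("k8s-health", "Production — Claude Sonnet")]

def model_for(agent: Agent) -> str:
    """Resolve the model at run time through the profile."""
    return profiles[agent.profile_name].model

# Switch the model in one place; every agent using the profile follows.
profiles["Production — Claude Sonnet"].model = "claude-opus"
assert all(model_for(a) == "claude-opus" for a in agents)
```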
Named AI provider profiles: configure once, assign to any number of agents and investigations. Switch models in one place.
All platform credentials are encrypted before database storage. Production deployments can use HashiCorp Vault KV v2 — credentials live in Vault, not the database. One-button migration with automatic rollback on failure.
An agent running against your Kubernetes cluster cannot access the SSH key for your PostgreSQL server. Every run can only read credentials for its assigned platform — enforced at the executor level.
Every mutation is logged with actor ID, timestamp, action type, and resource. The log is enforced at the database level as INSERT-only — no record can be modified or deleted. Export for compliance audits.
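The database-level guarantee can be demonstrated with triggers. This sketch uses SQLite for portability — KaizenCore runs on PostgreSQL, where the equivalent is a trigger function that raises an exception — but the append-only pattern is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY,
        actor TEXT NOT NULL,
        action TEXT NOT NULL,
        resource TEXT NOT NULL,
        at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- INSERT-only: any UPDATE or DELETE aborts inside the database itself
    CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit log is append-only'); END;
    CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
        BEGIN SELECT RAISE(ABORT, 'audit log is append-only'); END;
""")
db.execute("INSERT INTO audit_log (actor, action, resource) "
           "VALUES ('user-7', 'approve_proposal', 'proposal-42')")

try:
    db.execute("UPDATE audit_log SET action = 'tampered'")
    tampered = True
except sqlite3.DatabaseError:
    tampered = False
assert tampered is False
assert db.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0] == 1
```

Because the triggers live in the database, even a compromised application process cannot rewrite history.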
Three roles: Admin, Editor, Viewer. OIDC/SSO supported with configurable default role for new users. Multiple OAuth 2.0 providers.
Reviewers can enable authenticator app 2FA. Approving a proposal requires a valid 6-digit TOTP code — a stolen session cannot execute infrastructure changes.
Nothing that modifies infrastructure runs without creating a visible Proposal record first. Every write operation goes through the same review gate. This is enforced in code, not just the UI.
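In code terms, the invariant is that write paths return a pending Proposal rather than executing anything — a sketch of the gate (class and field names are illustrative, not KaizenCore's implementation):

```python
class ProposalGate:
    """Every write is wrapped: nothing runs until a human approves."""

    def __init__(self):
        self.proposals = []

    def propose(self, description, command, execute_fn):
        p = {"description": description, "command": command,
             "execute": execute_fn, "status": "pending", "result": None}
        self.proposals.append(p)
        return p  # a visible record is created first; no side effects yet

    def approve(self, p):
        p["status"] = "approved"
        p["result"] = p["execute"]()  # the command runs only now
        return p["result"]

    def reject(self, p):
        p["status"] = "rejected"  # the command never runs

gate = ProposalGate()
executed = []
p = gate.propose("Restart nginx on web-1", "systemctl restart nginx",
                 lambda: executed.append("ran") or "ok")
assert executed == []            # proposing executes nothing
assert gate.approve(p) == "ok"   # execution happens only on approval
assert executed == ["ran"]
```

Routing every write through one gate is what makes the guarantee enforceable in code rather than in UI conventions.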
Example proposals: `kubectl rollout undo deployment/api -n production`; `CREATE INDEX CONCURRENTLY` to restore a dropped `idx_orders_status` index.

"The goal isn't to remove the engineer. It's to remove the 45-minute investigation phase so that by the time the engineer looks at the incident, all the context they need to make a decision is already assembled."
KaizenCore is built around one conviction: the human must always be in the loop for writes.
Investigations are fully autonomous — the AI reads everything, diagnoses everything, explains everything. That's safe. Reading production logs and querying databases doesn't break anything.
But when it comes to changing something — restarting a service, running a command, pushing code, executing SQL — a human sees exactly what will run, understands why the AI thinks it should run, and makes the call. This isn't a limitation. It's the feature.
The AI does the work. The engineer does the thinking. Everyone goes home on time.
docs/deployment-kubernetes.md

| Feature | Description |
|---|---|
| Scheduled Agents | Cron-based agents with configurable tools and system prompts |
| On-Demand Investigations | Plain-English problem → AI investigation → findings summary |
| Alert-Triggered Investigations | Prometheus/Grafana/PagerDuty webhooks auto-start investigations |
| Proposal System | All writes require human review before execution |
| Multi-Step Proposals | Complex fixes bundled as sequential steps with human approval |
| TOTP 2FA on Approvals | Authenticator code required for high-risk proposal approvals |
| Proposal Expiry | Configurable TTL per agent; auto-expire stale pending proposals |
| Tool Permission Requests | AI pauses to request elevated access; human approves/denies |
| Follow-Up Chat | Continue investigations with new messages; full context preserved |
| Custom Tools | User-defined command and webhook tools per platform type |
| Kubernetes Platform | Full read + kubectl proposals |
| Linux Platform | Full read + shell proposals + Quick SSH setup |
| Windows Platform | Full read + PowerShell proposals + Quick SSH setup |
| PostgreSQL Platform | Full read + SQL proposals |
| AWS Platform | EC2/ECS/EKS/RDS/Lambda/CloudWatch/S3/ELB read + write proposals |
| GitHub Platform | Code read/search + branch/PR/file write proposals |
| HashiCorp Vault | Optional secrets backend; Kubernetes auth; one-way migration |
| RBAC | Admin / Editor / Viewer roles |
| OIDC/SSO | Multiple OAuth 2.0 providers; configurable default role |
| Immutable Audit Log | DB-enforced INSERT-only log of all mutations |
| Real-Time WebSocket | Live updates for runs, proposals, and permission requests |
| ntfy Push Notifications | Proposal alerts with optional approve/reject action buttons |
| Multi-Platform Investigations | Investigate Kubernetes + AWS + GitHub simultaneously in one run |
| Agent Status Workflow | Draft → Pending Review → Active → Disabled (with admin approval gate) |
| Agent Cloning | Copy any agent as a starting point for a new one |
| Provider Profiles | Named AI provider configs reused across all agents and investigations |
| Database Backups | Scheduled PostgreSQL backups with configurable retention |
| AES-256-GCM Encryption | All credentials encrypted at rest before database storage |