Kubernetes/OpenShift Deployment Architecture for NanoClaw

Author: Brenner Axiom, #B4mad Industries
Date: 2026-02-23
Bead: nanoclaw-k8s-r1


Abstract

This paper investigates architectural approaches for deploying NanoClaw containers on Kubernetes and OpenShift platforms. NanoClaw currently uses Docker as its container runtime to execute Claude Agent SDK instances in isolated environments. We analyze the existing Docker-based architecture, propose three distinct Kubernetes deployment patterns, and provide detailed trade-off analysis for each approach. We recommend a Job-based architecture with PersistentVolumeClaims for initial implementation due to minimal code disruption, OpenShift compatibility, and clear evolution paths. This paper targets technical readers familiar with container orchestration and Kubernetes primitives.


1. Context: Why Kubernetes for NanoClaw?

NanoClaw is a lightweight personal AI assistant framework that runs Claude Code in isolated Linux containers. Each agent session spawns an ephemeral Docker container with filesystem isolation, supporting:

  • Multi-group isolation — Each WhatsApp/Telegram group gets its own container sandbox
  • Concurrent execution — Up to 5 containers running simultaneously (configurable)
  • Filesystem-based IPC — Host controller communicates with containers via polling
  • Security by isolation — Bind mounts for workspace access, secrets via stdin

Current Limitations

The Docker-based architecture works well for single-host deployments but lacks:

  1. Multi-node scaling — Cannot distribute workload across multiple machines
  2. Resource orchestration — No native quotas, limits, or priority scheduling
  3. High availability — Single point of failure (Docker daemon on one host)
  4. Enterprise security — OpenShift Security Context Constraints (SCC) not enforceable

Migrating to Kubernetes/OpenShift enables cloud-native deployment patterns while preserving NanoClaw’s simplicity and security model.


2. Current Architecture Analysis

2.1 Container Lifecycle

File: /workspace/project/src/container-runner.ts

Each agent session follows this lifecycle:

  1. Spawn — docker run with bind mounts for workspace, IPC, sessions
  2. Stream — Parse stdout for structured results (sentinel markers)
  3. Idle — Container stays alive 30min after completion (handles follow-ups)
  4. Cleanup — Graceful docker stop or force kill after timeout

Key characteristics:

  • Ephemeral containers (--rm flag, no persistent state)
  • Short-lived (30min max per session)
  • Named pattern: nanoclaw-{groupFolder}-{timestamp}

2.2 Volume Mount Strategy

File: /workspace/project/src/container-runner.ts (lines 53-179)

NanoClaw uses Docker bind mounts to provide filesystem isolation:

/workspace/project    → {projectRoot}              (read-only)
/workspace/group      → groups/{folder}/           (read-write)
/home/node/.claude    → data/sessions/{folder}     (read-write)
/workspace/ipc        → data/ipc/{folder}/         (read-write)
/workspace/extra/*    → {additionalMounts}         (validated)
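The table above maps one-to-one onto docker run -v flags. A minimal sketch of how the host side might assemble them (the type and function names here are illustrative, not NanoClaw's actual API; the host paths in the usage example are placeholders):

```typescript
// Illustrative sketch: turn a mount table into `docker run` -v arguments.
interface MountSpec {
  hostPath: string;
  containerPath: string;
  readOnly: boolean;
}

function buildMountArgs(mounts: MountSpec[]): string[] {
  return mounts.flatMap((m) => [
    '-v',
    // Docker bind-mount syntax: host:container[:ro]
    `${m.hostPath}:${m.containerPath}${m.readOnly ? ':ro' : ''}`,
  ]);
}

const args = buildMountArgs([
  { hostPath: '/srv/nanoclaw/project', containerPath: '/workspace/project', readOnly: true },
  { hostPath: '/srv/nanoclaw/groups/main', containerPath: '/workspace/group', readOnly: false },
]);
// First pair: '-v', '/srv/nanoclaw/project:/workspace/project:ro'
```

The same MountSpec list can later feed a Kubernetes volumeMounts builder, which is what makes the runtime abstraction in Section 3 cheap.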

Security boundaries:

  • Main group gets read-only access to project root (prevents code tampering)
  • Non-main groups forced read-only for extra mounts (security boundary)
  • Mount allowlist stored outside project (~/.config/nanoclaw/mount-allowlist.json)

2.3 IPC Mechanism

File: /workspace/project/container/agent-runner/src/index.ts

Communication between host controller and container uses filesystem polling:

Host → Container:

  • Write JSON files to /workspace/ipc/input/{timestamp}.json
  • Write sentinel _close to signal shutdown

Container → Host:

  • Write structured output to stdout (parsed by host)
  • Wrap results in ---NANOCLAW_OUTPUT_START--- markers

Why filesystem?

  • Simple, reliable, no network dependencies
  • Works across container runtimes (Docker, Apple Container, Kubernetes)
  • No port conflicts or service discovery
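Both halves of the protocol are small. A sketch of the host side, assuming the file layout and sentinel markers described above (the END marker name is an assumption, mirrored from the START marker; function names are illustrative):

```typescript
import { writeFileSync, mkdirSync } from 'node:fs';
import { join } from 'node:path';

// Host → Container: drop a JSON message into the group's IPC input directory.
function writeIpcMessage(ipcDir: string, payload: object): string {
  const inputDir = join(ipcDir, 'input');
  mkdirSync(inputDir, { recursive: true });
  const file = join(inputDir, `${Date.now()}.json`);
  writeFileSync(file, JSON.stringify(payload));
  return file;
}

// Container → Host: extract the structured result from captured stdout.
// The END marker name is assumed here, by analogy with the START marker.
function parseAgentOutput(stdout: string): unknown | null {
  const START = '---NANOCLAW_OUTPUT_START---';
  const END = '---NANOCLAW_OUTPUT_END---';
  const start = stdout.indexOf(START);
  const end = stdout.indexOf(END);
  if (start === -1 || end === -1) return null;
  return JSON.parse(stdout.slice(start + START.length, end).trim());
}
```

Because both sides touch only files and stdout, the same code runs unchanged whether the peer is a Docker container, an Apple Container, or a Kubernetes Pod.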

2.4 Concurrency Model

File: /workspace/project/src/group-queue.ts

A GroupQueue manages concurrent container execution:

  • Global limit: 5 containers (configurable via MAX_CONCURRENT_CONTAINERS)
  • Per-group state: Active process, idle flag, pending messages/tasks
  • Queue behavior: FIFO processing when slots become available
  • Preemption: Idle containers can be killed for pending high-priority tasks
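The queue semantics above can be sketched as a FIFO with a global slot limit (a simplification: the real GroupQueue also tracks per-group state and preemption, and the class name here is illustrative):

```typescript
// Minimal sketch of the concurrency model: FIFO dispatch under a global cap.
type Task = () => Promise<void>;

class SlotQueue {
  private active = 0;
  private pending: Task[] = [];

  constructor(private readonly maxConcurrent = 5) {}

  enqueue(task: Task): void {
    this.pending.push(task);
    this.drain();
  }

  private drain(): void {
    // Start queued tasks while free slots remain.
    while (this.active < this.maxConcurrent && this.pending.length > 0) {
      const task = this.pending.shift()!;
      this.active++;
      task().finally(() => {
        this.active--;
        this.drain(); // a finished task frees a slot for the next in line
      });
    }
  }
}
```

The cap corresponds to MAX_CONCURRENT_CONTAINERS; under Kubernetes the same gate simply throttles Job creation instead of docker run.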

2.5 Security Model

Secrets — Never written to disk:

  • Read from .env only where needed
  • Passed to container via stdin
  • Stripped from Bash subprocess environment
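A sketch of the stdin handoff using Node's child_process (function and variable names are illustrative): secrets are removed from the child's environment and delivered once over stdin, so they appear in neither argv, ps output, nor the subprocess env.

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// Illustrative sketch: pass secrets over stdin, never via argv or env.
function spawnWithSecrets(
  cmd: string,
  args: string[],
  secrets: Record<string, string>
): ChildProcess {
  // Strip secret keys from the child's environment.
  const env = { ...process.env };
  for (const key of Object.keys(secrets)) delete env[key];

  const child = spawn(cmd, args, { env, stdio: ['pipe', 'pipe', 'inherit'] });
  // One-shot delivery; the agent reads and holds secrets in memory only.
  child.stdin!.write(JSON.stringify(secrets));
  child.stdin!.end();
  return child;
}
```

Kubernetes preserves this pattern via stdin: true / stdinOnce: true on the Job's container (see the manifest in Section 3.1), avoiding Secret objects written to etcd or mounted files.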

User isolation — UID/GID mapping:

  • Container runs as host user (not root)
  • Ensures bind-mounted files have correct permissions
  • Skipped for root (uid 0) or container default (uid 1000)

Mount security — Allowlist validation:

  • Blocked patterns: .ssh, .aws, .kube, .env, private keys
  • Enforced on host before container creation (tamper-proof)
  • Non-main groups forced read-only for extra mounts
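A sketch of the host-side check, with a blocked-pattern list modeled on the bullets above (the exact patterns and the function name are assumptions, not NanoClaw's shipped list):

```typescript
// Illustrative blocked patterns, mirroring the categories listed above.
const BLOCKED_PATTERNS: RegExp[] = [
  /\.ssh(\/|$)/,
  /\.aws(\/|$)/,
  /\.kube(\/|$)/,
  /\.env$/,
  /id_(rsa|ed25519)/, // private key files
];

// Enforced on the host before any container is created, so a compromised
// agent cannot tamper with the check.
function isMountAllowed(hostPath: string, allowlist: string[]): boolean {
  if (BLOCKED_PATTERNS.some((p) => p.test(hostPath))) return false;
  // Only paths at or under an allowlisted prefix may be mounted.
  return allowlist.some(
    (prefix) => hostPath === prefix || hostPath.startsWith(prefix + '/')
  );
}
```

The same validation carries straight into the Kubernetes runtimes: it runs before a Job or StatefulSet manifest is built, regardless of which volume type backs the mount.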

3. Kubernetes Deployment Approaches

We propose three architectures, each with different trade-offs for complexity, performance, and multi-node support.

3.1 Approach 1: Job-Based with Persistent Volumes

Overview

Each agent session spawns a Kubernetes Job → one Pod → auto-cleanup after completion. State persists via PersistentVolumeClaims (PVCs).

Architecture Diagram

┌──────────────────────────────────────────────────┐
│  Host Controller (Deployment)                    │
│  ┌────────────────────────────────────────────┐  │
│  │ GroupQueue                                 │  │
│  │ - Queue pending messages/tasks             │  │
│  │ - Create Job when slot available           │  │
│  │ - Poll Job status for completion           │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  Mounted PVCs:                                   │
│  - /data/ipc/{groupFolder}/  (IPC polling)       │
│  - /data/sessions/{groupFolder}/                 │
└──────────────────────────────────────────────────┘
                    │
                    │ Creates Job
                    ▼
┌──────────────────────────────────────────────────┐
│  Kubernetes Job: nanoclaw-main-1708712345        │
│  ┌────────────────────────────────────────────┐  │
│  │ Pod (ephemeral)                            │  │
│  │                                            │  │
│  │ Volumes:                                   │  │
│  │ - nanoclaw-group-main                      │  │
│  │     → /workspace/group                     │  │
│  │ - nanoclaw-ipc-main                        │  │
│  │     → /workspace/ipc                       │  │
│  │ - nanoclaw-sessions-main                   │  │
│  │     → /home/node/.claude                   │  │
│  │ - nanoclaw-project-ro                      │  │
│  │     → /workspace/project (read-only)       │  │
│  │                                            │  │
│  │ securityContext:                           │  │
│  │   runAsUser: 1000                          │  │
│  │   fsGroup: 1000                            │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  activeDeadlineSeconds: 1800  (30min timeout)    │
│  ttlSecondsAfterFinished: 300  (5min cleanup)    │
└──────────────────────────────────────────────────┘

Volume Strategy

PVC per resource type:

# Group workspace (read-write)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nanoclaw-group-main
spec:
  accessModes:
    - ReadWriteMany  # Multi-node requires RWX
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs  # Or cephfs, efs, etc.
---
# IPC directory (read-write)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nanoclaw-ipc-main
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
# Project root (read-only)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nanoclaw-project-ro
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi

Job manifest template:

apiVersion: batch/v1
kind: Job
metadata:
  name: nanoclaw-main-{{timestamp}}
spec:
  activeDeadlineSeconds: 1800
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: agent
        image: nanoclaw-agent:latest
        stdin: true
        stdinOnce: true
        volumeMounts:
        - name: group-workspace
          mountPath: /workspace/group
        - name: ipc
          mountPath: /workspace/ipc
        - name: sessions
          mountPath: /home/node/.claude
        - name: project
          mountPath: /workspace/project
          readOnly: true
      volumes:
      - name: group-workspace
        persistentVolumeClaim:
          claimName: nanoclaw-group-main
      - name: ipc
        persistentVolumeClaim:
          claimName: nanoclaw-ipc-main
      - name: sessions
        persistentVolumeClaim:
          claimName: nanoclaw-sessions-main
      - name: project
        persistentVolumeClaim:
          claimName: nanoclaw-project-ro

Implementation Changes

New file: /workspace/project/src/k8s-runtime.ts

import * as k8s from '@kubernetes/client-node';

export interface JobStatus {
  succeeded: boolean;
  message?: string;
}

function makeBatchApi(): k8s.BatchV1Api {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  return kc.makeApiClient(k8s.BatchV1Api);
}

export async function createAgentJob(
  groupFolder: string,
  timestamp: number,
  volumeMounts: VolumeMount[]
): Promise<string> {
  const batchV1 = makeBatchApi();

  const jobName = `nanoclaw-${groupFolder}-${timestamp}`;
  const job = buildJobManifest(jobName, groupFolder, volumeMounts);

  await batchV1.createNamespacedJob('default', job);
  return jobName;
}

export async function pollJobStatus(
  jobName: string
): Promise<JobStatus> {
  const batchV1 = makeBatchApi();
  // Poll Job.status.conditions until Complete or Failed appears
  for (;;) {
    const { body } = await batchV1.readNamespacedJob(jobName, 'default');
    const conditions = body.status?.conditions ?? [];
    const done = conditions.find(
      (c) => (c.type === 'Complete' || c.type === 'Failed') && c.status === 'True'
    );
    if (done) {
      return { succeeded: done.type === 'Complete', message: done.message };
    }
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}

Modified: /workspace/project/src/container-runtime.ts

export const CONTAINER_RUNTIME_TYPE =
  process.env.CONTAINER_RUNTIME || 'docker';  // 'docker' | 'kubernetes'

export function getRuntime(): ContainerRuntime {
  if (CONTAINER_RUNTIME_TYPE === 'kubernetes') {
    return new K8sRuntime();
  }
  return new DockerRuntime();
}
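For reference, a sketch of the interface the factory could return; the exact method set is an assumption, but the point is that container-runner.ts codes against the abstraction rather than against Docker or the K8s client directly. A trivial in-memory stub makes the dispatcher unit-testable:

```typescript
// Illustrative abstraction; method names are assumptions, not NanoClaw's API.
interface AgentResult {
  exitCode: number;
  output: string; // raw stdout, including the sentinel-wrapped result
}

interface ContainerRuntime {
  /** Start an agent session; resolves with a handle (container or Job name). */
  start(groupFolder: string, timestamp: number): Promise<string>;
  /** Wait for completion and return the raw result. */
  wait(handle: string): Promise<AgentResult>;
  /** Graceful stop, force kill after timeout. */
  stop(handle: string): Promise<void>;
}

// Tiny in-memory stub, useful for unit-testing the dispatcher without
// Docker or a cluster.
class FakeRuntime implements ContainerRuntime {
  async start(groupFolder: string, timestamp: number): Promise<string> {
    return `nanoclaw-${groupFolder}-${timestamp}`;
  }
  async wait(_handle: string): Promise<AgentResult> {
    return { exitCode: 0, output: '' };
  }
  async stop(_handle: string): Promise<void> {}
}
```

DockerRuntime and K8sRuntime would each implement the same interface, which is what makes the CONTAINER_RUNTIME env-var toggle (and the rollback story) a one-line change.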

Modified: /workspace/project/src/container-runner.ts

const runtime = getRuntime();

if (runtime instanceof K8sRuntime) {
  const jobName = await runtime.createAgentJob(groupFolder, timestamp, mounts);
  const result = await runtime.pollJobStatus(jobName);
  // Parse result same as Docker output
} else {
  // Existing Docker spawn() logic
}

Pros & Cons

Aspect                Assessment
Code changes          ✅ Low (abstraction layer only)
IPC mechanism         ✅ Unchanged (filesystem polling works)
OpenShift compatible  ✅ Yes (PVC + SCC friendly)
Latency               ⚠️ Medium (Job creation ~2-5s vs Docker <1s)
Multi-node            ⚠️ Requires ReadWriteMany PVCs (NFS, CephFS)
Resource usage        ✅ Low (ephemeral Pods, auto-cleanup)
Complexity            ✅ Low (native K8s primitives)
Rollback              ✅ Easy (just switch runtime back to Docker)

3.2 Approach 2: StatefulSet with Sidecar Pattern

Overview

Replace ephemeral Jobs with long-lived Pods (one per group) that stay idle between sessions. Host controller sends work via IPC (unchanged).

Architecture Diagram

┌──────────────────────────────────────────────────┐
│  Host Controller (Deployment)                    │
│  - Sends IPC messages to wake idle Pods          │
│  - Scales StatefulSet to 0 after idle timeout    │
└──────────────────────────────────────────────────┘
                    │
                    │ IPC via PVC
                    ▼
┌──────────────────────────────────────────────────┐
│  StatefulSet: nanoclaw-main (1 replica)          │
│  ┌────────────────────────────────────────────┐  │
│  │ Pod: nanoclaw-main-0 (always running)      │  │
│  │                                            │  │
│  │ Container loops forever:                   │  │
│  │ 1. Poll /workspace/ipc/input/              │  │
│  │ 2. Process message if present              │  │
│  │ 3. Write output                            │  │
│  │ 4. Sleep 500ms, repeat                     │  │
│  │                                            │  │
│  │ Idle timeout: 30min → graceful shutdown    │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  volumeClaimTemplate:                            │
│  - workspace (10Gi, RWO)                         │
└──────────────────────────────────────────────────┘

Volume Strategy

StatefulSet automatically provisions PVCs via volumeClaimTemplates:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nanoclaw-main
spec:
  serviceName: nanoclaw
  replicas: 1
  selector:
    matchLabels:
      app: nanoclaw
      group: main
  template:
    metadata:
      labels:
        app: nanoclaw
        group: main
    spec:
      containers:
      - name: agent
        image: nanoclaw-agent:latest
        command: ["/app/entrypoint-loop.sh"]  # Modified entrypoint
        volumeMounts:
        - name: workspace
          mountPath: /workspace
  volumeClaimTemplates:
  - metadata:
      name: workspace
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Implementation Changes

Modified: /workspace/project/container/agent-runner/src/index.ts

// Replace single-shot execution with a long-running poll loop
let lastActivity = Date.now();

while (true) {
  const message = await pollIpcInput();
  if (message === '_close') {
    console.log('Shutdown signal received');
    break;
  }
  if (message) {
    await processQuery(message);
    lastActivity = Date.now();  // reset the idle clock on real work
  }
  await sleep(500);

  // Idle timeout
  if (Date.now() - lastActivity > IDLE_TIMEOUT) {
    console.log('Idle timeout, shutting down');
    break;
  }
}

Modified: /workspace/project/src/group-queue.ts

// Instead of spawning a new container, ensure the group's StatefulSet exists
async ensureStatefulSet(groupFolder: string) {
  if (!await k8s.statefulSetExists(groupFolder)) {
    await k8s.createStatefulSet(groupFolder);
  }
  await k8s.waitForPodReady(groupFolder);
}

// Send an IPC message to wake the idle Pod
async enqueueMessageCheck(groupFolder: string, message: Message) {
  await this.ensureStatefulSet(groupFolder);
  await writeIpcMessage(groupFolder, message);
}

Pros & Cons

Aspect                Assessment
Code changes          ⚠️ Medium (queue + agent-runner modifications)
Latency               ✅ Low (Pod already running, no Job creation)
Resource usage        ❌ High (idle Pods consume memory/CPU)
IPC mechanism         ✅ Unchanged
OpenShift compatible  ✅ Yes
Session reuse         ✅ Claude SDK stays warm (faster startup)
Complexity            ⚠️ Medium (StatefulSet lifecycle, idle timeout logic)
Multi-node            ⚠️ Requires RWX PVCs

3.3 Approach 3: DaemonSet Controller + Job Workers

Overview

The host controller runs as a DaemonSet Pod on each K8s node. Jobs are pinned, via node affinity, to the node that holds their group's data directory. This approach is optimized for multi-node clusters with hostPath volumes (local-disk speed).

Architecture Diagram

┌────────────────────────────────────────────────────────┐
│  Kubernetes Cluster (3 nodes)                          │
│                                                        │
│  Node 1                Node 2               Node 3     │
│  ┌─────────────┐      ┌─────────────┐     ┌──────┐     │
│  │ nanoclaw-   │      │ nanoclaw-   │     │ ...  │     │
│  │ controller  │      │ controller  │     └──────┘     │
│  │ DaemonSet   │      │ DaemonSet   │                  │
│  │ Pod         │      │ Pod         │                  │
│  │             │      │             │                  │
│  │ Manages:    │      │ Manages:    │                  │
│  │ - group-a   │      │ - group-c   │                  │
│  │ - group-b   │      │ - group-d   │                  │
│  └─────────────┘      └─────────────┘                  │
│         │                     │                        │
│         │ Creates Job         │ Creates Job            │
│         │ with nodeSelector   │ with nodeSelector      │
│         ▼                     ▼                        │
│  ┌─────────────┐      ┌─────────────┐                  │
│  │ Job: group-a│      │ Job: group-c│                  │
│  │ (Node 1)    │      │ (Node 2)    │                  │
│  │             │      │             │                  │
│  │ hostPath:   │      │ hostPath:   │                  │
│  │ /var/       │      │ /var/       │                  │
│  │ nanoclaw/   │      │ nanoclaw/   │                  │
│  │ group-a/    │      │ group-c/    │                  │
│  └─────────────┘      └─────────────┘                  │
└────────────────────────────────────────────────────────┘

Group → Node Assignment

Assign groups to nodes with a deterministic hash (note: plain hash-modulo reshuffles assignments whenever the node count changes; the ConfigMap mapping below pins existing assignments to limit churn):

import { createHash } from 'node:crypto';

function getNodeForGroup(groupFolder: string, nodes: Node[]): string {
  const hash = createHash('sha256')
    .update(groupFolder)
    .digest('hex');
  const index = parseInt(hash.slice(0, 8), 16) % nodes.length;
  return nodes[index].metadata.name;
}

Store mapping in ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nanoclaw-group-assignments
data:
  group-main: "node-1"
  group-family: "node-2"
  group-work: "node-1"

Volume Strategy

hostPath volumes for zero network latency:

apiVersion: batch/v1
kind: Job
metadata:
  name: nanoclaw-main-{{timestamp}}
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-1  # Pinned to same node as controller
      containers:
      - name: agent
        volumeMounts:
        - name: ipc
          mountPath: /workspace/ipc
        - name: group
          mountPath: /workspace/group
      volumes:
      - name: ipc
        hostPath:
          path: /var/nanoclaw/ipc/main
          type: Directory
      - name: group
        hostPath:
          path: /var/nanoclaw/groups/main
          type: Directory

Implementation Changes

New file: /workspace/project/src/k8s-daemonset.ts

export async function assignGroupToNode(groupFolder: string): Promise<string> {
  const nodes = await k8s.listNodes();
  const nodeName = getNodeForGroup(groupFolder, nodes);

  // Persist the assignment so it survives controller restarts
  await k8s.updateConfigMap('nanoclaw-group-assignments', {
    [groupFolder]: nodeName
  });

  return nodeName;
}

export async function createJobWithAffinity(
  groupFolder: string,
  nodeName: string
): Promise<string> {
  const job = buildJobManifest(groupFolder, {
    nodeSelector: {
      'kubernetes.io/hostname': nodeName
    },
    volumes: buildHostPathVolumes(groupFolder)
  });
  await k8s.createJob(job);
  return job.metadata.name;
}

Pros & Cons

Aspect                Assessment
Performance           ✅ Best (local disk I/O, no network mounts)
Multi-node            ✅ Native (DaemonSet per node)
Resource usage        ⚠️ Medium (one controller per node)
Code changes          ❌ High (distributed state, node affinity logic)
Security              ❌ Poor (hostPath requires privileged access)
OpenShift compatible  ❌ No (hostPath blocked by restricted SCC)
Complexity            ❌ High (node assignment, rebalancing, failure handling)

4. Comparison Matrix

Criterion               Approach 1: Job+PVC   Approach 2: StatefulSet   Approach 3: DaemonSet
Code complexity         ✅ Low                 ⚠️ Medium                  ❌ High
Job/Pod latency         ⚠️ 2-5s                ✅ <500ms                  ✅ <500ms
Resource idle cost      ✅ Low                 ❌ High                    ⚠️ Medium
Multi-node support      ⚠️ Requires RWX        ⚠️ Requires RWX            ✅ Native
Volume I/O performance  ⚠️ Network (NFS)       ⚠️ Network (NFS)           ✅ Local disk
OpenShift SCC           ✅ Compatible          ✅ Compatible              ❌ Blocked
IPC mechanism           ✅ Unchanged           ✅ Unchanged               ✅ Unchanged
Rollback ease           ✅ Easy                ⚠️ Medium                  ❌ Hard
Production readiness    ✅ Good                ✅ Good                    ⚠️ Experimental
Recommended for         POC, single-node      Production, <50 groups    High-scale, >100 groups

5. Recommendation: Approach 1, Job-Based with PersistentVolumeClaims

Rationale

  1. Minimal disruption — Abstraction layer only, IPC unchanged
  2. OpenShift compatible — No hostPath, SCC-friendly
  3. Easy rollback — Runtime flag toggles Docker/K8s
  4. Natural evolution — Can upgrade to StatefulSet later if needed

Migration Path

Phase 1: Single-Node Kubernetes (Week 1-2)

  • Implement k8s-runtime.ts with Job API client
  • Create PVCs for main group (group, IPC, sessions, project)
  • Test Job creation, status polling, output parsing
  • Validate IPC mechanism works across PVCs

Phase 2: Multi-Group Support (Week 3-4)

  • Dynamic PVC provisioning per group
  • Test concurrent Job execution (5 simultaneous groups)
  • Performance benchmarking (Job creation latency, PVC I/O)

Phase 3: Multi-Node Deployment (Week 5-6)

  • Evaluate RWX PVC backends (NFS vs CephFS vs AWS EFS)
  • Test cross-node scheduling (Pod on Node 2, PVC on Node 1)
  • If latency unacceptable: pilot Approach 3 (DaemonSet + hostPath)

Phase 4: Production Hardening (Week 7-8)

  • OpenShift SCC validation
  • Security audit (PVC isolation, secrets handling)
  • Resource limits and quotas
  • Monitoring and alerting (Job failures, PVC capacity)

Risk Mitigation

High Risk: PVC Performance

  • Symptom: Slow I/O on NFS-backed PVCs
  • Mitigation: Benchmark early (Phase 2), pivot to DaemonSet if needed
  • Fallback: Use ReadWriteOnce + node affinity (pseudo-hostPath)

Medium Risk: Job Creation Latency

  • Symptom: 5-10s delay for Job → Running
  • Mitigation: Pre-warm Pod pool (StatefulSet with scale=0, scale up on demand)
  • Fallback: Accept latency or switch to StatefulSet (Approach 2)

Low Risk: OpenShift SCC

  • Symptom: PVC mount permissions fail
  • Mitigation: Use fsGroup in securityContext, request anyuid SCC if needed
  • Fallback: Manual PVC permission fixing via initContainer

6. Implementation Checklist

Prerequisites

  • Kubernetes cluster (1.24+) or OpenShift (4.12+)
  • StorageClass with ReadWriteMany support (NFS, CephFS, EFS)
  • Container registry for nanoclaw-agent image
  • RBAC permissions (create Jobs, PVCs, read Pods)

Code Changes

  • Create /workspace/project/src/k8s-runtime.ts (Job API client)
  • Modify /workspace/project/src/container-runtime.ts (runtime detection)
  • Modify /workspace/project/src/container-runner.ts (Job dispatcher)
  • Add /workspace/project/src/config.ts (CONTAINER_RUNTIME, K8S_NAMESPACE)
  • Add /workspace/project/k8s/pvc-templates.yaml (PVC manifests)
  • Add tests for K8s runtime abstraction

Deployment

  • Build and push nanoclaw-agent image to registry
  • Create namespace: kubectl create namespace nanoclaw
  • Apply PVC templates: kubectl apply -f k8s/pvc-templates.yaml
  • Deploy host controller (Deployment with PVC mounts)
  • Set CONTAINER_RUNTIME=kubernetes env var
  • Verify Job creation: kubectl get jobs -n nanoclaw

Testing

  • Single-group test (main group)
  • Concurrent execution test (5 groups simultaneously)
  • IPC round-trip test (follow-up messages work)
  • Idle timeout test (Pod cleans up after 30min)
  • Failure recovery test (Job fails, retry logic works)
  • Performance test (Job latency, PVC throughput)
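The IPC round-trip check from the list above can be prototyped host-side only, before any cluster is involved (the helper name and file layout details are illustrative):

```typescript
import { mkdtempSync, writeFileSync, readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Sketch of the IPC round-trip test: write an input file the way the host
// controller would, then verify a simulated agent-side poll can pick it up.
function ipcRoundTrip(): boolean {
  const ipcDir = mkdtempSync(join(tmpdir(), 'nanoclaw-ipc-'));
  const inputFile = join(ipcDir, `${Date.now()}.json`);
  writeFileSync(inputFile, JSON.stringify({ text: 'follow-up message' }));

  // Simulated agent side: poll the directory, read the oldest message.
  const files = readdirSync(ipcDir).sort();
  if (files.length === 0) return false;
  const msg = JSON.parse(readFileSync(join(ipcDir, files[0]), 'utf8'));
  return msg.text === 'follow-up message';
}
```

In the cluster version the only change is that ipcDir is a PVC mount shared between the controller Deployment and the agent Pod, which is exactly what Phase 1 validates.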

7. Future Work

Short-Term (1-3 months)

  • Performance optimization: Pre-warm Pod pool to reduce Job creation latency
  • Dynamic PVC provisioning: Auto-create PVCs for new groups
  • Multi-cluster support: Federate Jobs across multiple K8s clusters

Long-Term (6-12 months)

  • Native K8s IPC: Replace filesystem polling with HTTP (Pod → Service)
  • Serverless integration: Knative for auto-scaling (scale to zero when idle)
  • Operator pattern: Custom Resource Definitions (CRD) for NanoClaw groups

8. Conclusion

Deploying NanoClaw on Kubernetes/OpenShift unlocks multi-node scaling, resource orchestration, and enterprise security without sacrificing simplicity. The Job-based architecture with PersistentVolumeClaims provides the best balance of low complexity, OpenShift compatibility, and clear evolution paths. Implementation requires minimal code changes (~500 LOC) and preserves the existing IPC mechanism.

For organizations running NanoClaw at scale (>10 groups, multi-node), this migration enables cloud-native deployment patterns while maintaining the framework’s core philosophy: secure by isolation, simple by design.

