Kubernetes/OpenShift Deployment Architecture for NanoClaw

Author: Brenner Axiom, #B4mad Industries
Date: 2026-02-23
Bead: nanoclaw-k8s-r1


Abstract

This paper investigates architectural approaches for deploying NanoClaw containers on Kubernetes and OpenShift platforms. NanoClaw currently uses Docker as its container runtime to execute Claude Agent SDK instances in isolated environments. We analyze the existing Docker-based architecture, propose three distinct Kubernetes deployment patterns, and provide detailed trade-off analysis for each approach. We recommend a Job-based architecture with PersistentVolumeClaims for initial implementation due to minimal code disruption, OpenShift compatibility, and clear evolution paths. This paper targets technical readers familiar with container orchestration and Kubernetes primitives.


1. Context: Why Kubernetes for NanoClaw?

NanoClaw is a lightweight personal AI assistant framework that runs Claude Code in isolated Linux containers. Each agent session spawns an ephemeral Docker container with filesystem isolation, supporting:

  • Multi-group isolation — Each WhatsApp/Telegram group gets its own container sandbox
  • Concurrent execution — Up to 5 containers running simultaneously (configurable)
  • Filesystem-based IPC — Host controller communicates with containers via polling
  • Security by isolation — Bind mounts for workspace access, secrets via stdin

Current Limitations

The Docker-based architecture works well for single-host deployments but lacks:

  1. Multi-node scaling — Cannot distribute workload across multiple machines
  2. Resource orchestration — No native quotas, limits, or priority scheduling
  3. High availability — Single point of failure (Docker daemon on one host)
  4. Enterprise security — OpenShift Security Context Constraints (SCC) not enforceable

Migrating to Kubernetes/OpenShift enables cloud-native deployment patterns while preserving NanoClaw’s simplicity and security model.


2. Current Architecture Analysis

2.1 Container Lifecycle

File: /workspace/project/src/container-runner.ts

Each agent session follows this lifecycle:

  1. Spawn — docker run with bind mounts for workspace, IPC, sessions
  2. Stream — Parse stdout for structured results (sentinel markers)
  3. Idle — Container stays alive 30min after completion (handles follow-ups)
  4. Cleanup — Graceful docker stop or force kill after timeout

Key characteristics:

  • Ephemeral containers (--rm flag, no persistent state)
  • Short-lived (30min max per session)
  • Named pattern: nanoclaw-{groupFolder}-{timestamp}

2.2 Volume Mount Strategy

File: /workspace/project/src/container-runner.ts (lines 53-179)

NanoClaw uses Docker bind mounts to provide filesystem isolation:

/workspace/project    → {projectRoot}              (read-only)
/workspace/group      → groups/{folder}/           (read-write)
/home/node/.claude    → data/sessions/{folder}     (read-write)
/workspace/ipc        → data/ipc/{folder}/         (read-write)
/workspace/extra/*    → {additionalMounts}         (validated)
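The table above maps one-to-one onto docker run -v flags. A minimal sketch of how the host side might assemble them (the type and function names here are illustrative, not NanoClaw's actual API; the host paths in the usage example are placeholders):

```typescript
// Illustrative sketch: turn a mount table into `docker run` -v arguments.
interface MountSpec {
  hostPath: string;
  containerPath: string;
  readOnly: boolean;
}

function buildMountArgs(mounts: MountSpec[]): string[] {
  return mounts.flatMap((m) => [
    '-v',
    // Docker bind-mount syntax: host:container[:ro]
    `${m.hostPath}:${m.containerPath}${m.readOnly ? ':ro' : ''}`,
  ]);
}

const args = buildMountArgs([
  { hostPath: '/srv/nanoclaw/project', containerPath: '/workspace/project', readOnly: true },
  { hostPath: '/srv/nanoclaw/groups/main', containerPath: '/workspace/group', readOnly: false },
]);
// First pair: '-v', '/srv/nanoclaw/project:/workspace/project:ro'
```

The same MountSpec list can later feed a Kubernetes volumeMounts builder, which is what makes the runtime abstraction in Section 3 cheap.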

Security boundaries:

  • Main group gets read-only access to project root (prevents code tampering)
  • Non-main groups forced read-only for extra mounts (security boundary)
  • Mount allowlist stored outside project (~/.config/nanoclaw/mount-allowlist.json)

2.3 IPC Mechanism

File: /workspace/project/container/agent-runner/src/index.ts

Communication between host controller and container uses filesystem polling:

Host → Container:

  • Write JSON files to /workspace/ipc/input/{timestamp}.json
  • Write sentinel _close to signal shutdown

Container → Host:

  • Write structured output to stdout (parsed by host)
  • Wrap results in ---NANOCLAW_OUTPUT_START--- markers

Why filesystem?

  • Simple, reliable, no network dependencies
  • Works across container runtimes (Docker, Apple Container, Kubernetes)
  • No port conflicts or service discovery
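Both halves of the protocol are small. A sketch of the host side, assuming the file layout and sentinel markers described above (the END marker name is an assumption, mirrored from the START marker; function names are illustrative):

```typescript
import { writeFileSync, mkdirSync } from 'node:fs';
import { join } from 'node:path';

// Host → Container: drop a JSON message into the group's IPC input directory.
function writeIpcMessage(ipcDir: string, payload: object): string {
  const inputDir = join(ipcDir, 'input');
  mkdirSync(inputDir, { recursive: true });
  const file = join(inputDir, `${Date.now()}.json`);
  writeFileSync(file, JSON.stringify(payload));
  return file;
}

// Container → Host: extract the structured result from captured stdout.
// The END marker name is assumed here, by analogy with the START marker.
function parseAgentOutput(stdout: string): unknown | null {
  const START = '---NANOCLAW_OUTPUT_START---';
  const END = '---NANOCLAW_OUTPUT_END---';
  const start = stdout.indexOf(START);
  const end = stdout.indexOf(END);
  if (start === -1 || end === -1) return null;
  return JSON.parse(stdout.slice(start + START.length, end).trim());
}
```

Because both sides touch only files and stdout, the same code runs unchanged whether the peer is a Docker container, an Apple Container, or a Kubernetes Pod.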

2.4 Concurrency Model

File: /workspace/project/src/group-queue.ts

A GroupQueue manages concurrent container execution:

  • Global limit: 5 containers (configurable via MAX_CONCURRENT_CONTAINERS)
  • Per-group state: Active process, idle flag, pending messages/tasks
  • Queue behavior: FIFO processing when slots become available
  • Preemption: Idle containers can be killed for pending high-priority tasks
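The queue semantics above can be sketched as a FIFO with a global slot limit (a simplification: the real GroupQueue also tracks per-group state and preemption, and the class name here is illustrative):

```typescript
// Minimal sketch of the concurrency model: FIFO dispatch under a global cap.
type Task = () => Promise<void>;

class SlotQueue {
  private active = 0;
  private pending: Task[] = [];

  constructor(private readonly maxConcurrent = 5) {}

  enqueue(task: Task): void {
    this.pending.push(task);
    this.drain();
  }

  private drain(): void {
    // Start queued tasks while free slots remain.
    while (this.active < this.maxConcurrent && this.pending.length > 0) {
      const task = this.pending.shift()!;
      this.active++;
      task().finally(() => {
        this.active--;
        this.drain(); // a finished task frees a slot for the next in line
      });
    }
  }
}
```

The cap corresponds to MAX_CONCURRENT_CONTAINERS; under Kubernetes the same gate simply throttles Job creation instead of docker run.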

2.5 Security Model

Secrets — Never written to disk:

  • Read from .env only where needed
  • Passed to container via stdin
  • Stripped from Bash subprocess environment
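A sketch of the stdin handoff using Node's child_process (function and variable names are illustrative): secrets are removed from the child's environment and delivered once over stdin, so they appear in neither argv, ps output, nor the subprocess env.

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// Illustrative sketch: pass secrets over stdin, never via argv or env.
function spawnWithSecrets(
  cmd: string,
  args: string[],
  secrets: Record<string, string>
): ChildProcess {
  // Strip secret keys from the child's environment.
  const env = { ...process.env };
  for (const key of Object.keys(secrets)) delete env[key];

  const child = spawn(cmd, args, { env, stdio: ['pipe', 'pipe', 'inherit'] });
  // One-shot delivery; the agent reads and holds secrets in memory only.
  child.stdin!.write(JSON.stringify(secrets));
  child.stdin!.end();
  return child;
}
```

Kubernetes preserves this pattern via stdin: true / stdinOnce: true on the Job's container (see the manifest in Section 3.1), avoiding Secret objects written to etcd or mounted files.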

User isolation — UID/GID mapping:

  • Container runs as host user (not root)
  • Ensures bind-mounted files have correct permissions
  • Skipped for root (uid 0) or container default (uid 1000)

Mount security — Allowlist validation:

  • Blocked patterns: .ssh, .aws, .kube, .env, private keys
  • Enforced on host before container creation (tamper-proof)
  • Non-main groups forced read-only for extra mounts
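A sketch of the host-side check, with a blocked-pattern list modeled on the bullets above (the exact patterns and the function name are assumptions, not NanoClaw's shipped list):

```typescript
// Illustrative blocked patterns, mirroring the categories listed above.
const BLOCKED_PATTERNS: RegExp[] = [
  /\.ssh(\/|$)/,
  /\.aws(\/|$)/,
  /\.kube(\/|$)/,
  /\.env$/,
  /id_(rsa|ed25519)/, // private key files
];

// Enforced on the host before any container is created, so a compromised
// agent cannot tamper with the check.
function isMountAllowed(hostPath: string, allowlist: string[]): boolean {
  if (BLOCKED_PATTERNS.some((p) => p.test(hostPath))) return false;
  // Only paths at or under an allowlisted prefix may be mounted.
  return allowlist.some(
    (prefix) => hostPath === prefix || hostPath.startsWith(prefix + '/')
  );
}
```

The same validation carries straight into the Kubernetes runtimes: it runs before a Job or StatefulSet manifest is built, regardless of which volume type backs the mount.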

3. Kubernetes Deployment Approaches

We propose three architectures, each with different trade-offs for complexity, performance, and multi-node support.

3.1 Approach 1: Job-Based with Persistent Volumes

Overview

Each agent session spawns a Kubernetes Job → one Pod → auto-cleanup after completion. State persists via PersistentVolumeClaims (PVCs).

Architecture Diagram

┌──────────────────────────────────────────────────┐
│  Host Controller (Deployment)                    │
│  ┌────────────────────────────────────────────┐  │
│  │ GroupQueue                                 │  │
│  │ - Queue pending messages/tasks             │  │
│  │ - Create Job when slot available           │  │
│  │ - Poll Job status for completion           │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  Mounted PVCs:                                   │
│  - /data/ipc/{groupFolder}/  (IPC polling)       │
│  - /data/sessions/{groupFolder}/                 │
└──────────────────────────────────────────────────┘
                    │
                    │ Creates Job
                    ▼
┌──────────────────────────────────────────────────┐
│  Kubernetes Job: nanoclaw-main-1708712345        │
│  ┌────────────────────────────────────────────┐  │
│  │ Pod (ephemeral)                            │  │
│  │                                            │  │
│  │ Volumes:                                   │  │
│  │ - nanoclaw-group-main                      │  │
│  │     → /workspace/group                     │  │
│  │ - nanoclaw-ipc-main                        │  │
│  │     → /workspace/ipc                       │  │
│  │ - nanoclaw-sessions-main                   │  │
│  │     → /home/node/.claude                   │  │
│  │ - nanoclaw-project-ro                      │  │
│  │     → /workspace/project (read-only)       │  │
│  │                                            │  │
│  │ securityContext:                           │  │
│  │   runAsUser: 1000                          │  │
│  │   fsGroup: 1000                            │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  activeDeadlineSeconds: 1800  (30min timeout)    │
│  ttlSecondsAfterFinished: 300  (5min cleanup)    │
└──────────────────────────────────────────────────┘

Volume Strategy

PVC per resource type:

# Group workspace (read-write)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nanoclaw-group-main
spec:
  accessModes:
    - ReadWriteMany  # Multi-node requires RWX
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs  # Or cephfs, efs, etc.
---
# IPC directory (read-write)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nanoclaw-ipc-main
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
# Project root (read-only)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nanoclaw-project-ro
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi

Job manifest template:

apiVersion: batch/v1
kind: Job
metadata:
  name: nanoclaw-main-{{timestamp}}
spec:
  activeDeadlineSeconds: 1800
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
      - name: agent
        image: nanoclaw-agent:latest
        stdin: true
        stdinOnce: true
        volumeMounts:
        - name: group-workspace
          mountPath: /workspace/group
        - name: ipc
          mountPath: /workspace/ipc
        - name: sessions
          mountPath: /home/node/.claude
        - name: project
          mountPath: /workspace/project
          readOnly: true
      volumes:
      - name: group-workspace
        persistentVolumeClaim:
          claimName: nanoclaw-group-main
      - name: ipc
        persistentVolumeClaim:
          claimName: nanoclaw-ipc-main
      - name: sessions
        persistentVolumeClaim:
          claimName: nanoclaw-sessions-main
      - name: project
        persistentVolumeClaim:
          claimName: nanoclaw-project-ro

Implementation Changes

New file: /workspace/project/src/k8s-runtime.ts

import * as k8s from '@kubernetes/client-node';

export interface JobStatus {
  succeeded: boolean;
  message?: string;
}

function makeBatchApi(): k8s.BatchV1Api {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  return kc.makeApiClient(k8s.BatchV1Api);
}

export async function createAgentJob(
  groupFolder: string,
  timestamp: number,
  volumeMounts: VolumeMount[]
): Promise<string> {
  const batchV1 = makeBatchApi();

  const jobName = `nanoclaw-${groupFolder}-${timestamp}`;
  const job = buildJobManifest(jobName, groupFolder, volumeMounts);

  await batchV1.createNamespacedJob('default', job);
  return jobName;
}

export async function pollJobStatus(
  jobName: string
): Promise<JobStatus> {
  const batchV1 = makeBatchApi();
  // Poll Job.status.conditions until Complete or Failed appears
  for (;;) {
    const { body } = await batchV1.readNamespacedJob(jobName, 'default');
    const conditions = body.status?.conditions ?? [];
    const done = conditions.find(
      (c) => (c.type === 'Complete' || c.type === 'Failed') && c.status === 'True'
    );
    if (done) {
      return { succeeded: done.type === 'Complete', message: done.message };
    }
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
}

Modified: /workspace/project/src/container-runtime.ts

export const CONTAINER_RUNTIME_TYPE =
  process.env.CONTAINER_RUNTIME || 'docker';  // 'docker' | 'kubernetes'

export function getRuntime(): ContainerRuntime {
  if (CONTAINER_RUNTIME_TYPE === 'kubernetes') {
    return new K8sRuntime();
  }
  return new DockerRuntime();
}
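For reference, a sketch of the interface the factory could return; the exact method set is an assumption, but the point is that container-runner.ts codes against the abstraction rather than against Docker or the K8s client directly. A trivial in-memory stub makes the dispatcher unit-testable:

```typescript
// Illustrative abstraction; method names are assumptions, not NanoClaw's API.
interface AgentResult {
  exitCode: number;
  output: string; // raw stdout, including the sentinel-wrapped result
}

interface ContainerRuntime {
  /** Start an agent session; resolves with a handle (container or Job name). */
  start(groupFolder: string, timestamp: number): Promise<string>;
  /** Wait for completion and return the raw result. */
  wait(handle: string): Promise<AgentResult>;
  /** Graceful stop, force kill after timeout. */
  stop(handle: string): Promise<void>;
}

// Tiny in-memory stub, useful for unit-testing the dispatcher without
// Docker or a cluster.
class FakeRuntime implements ContainerRuntime {
  async start(groupFolder: string, timestamp: number): Promise<string> {
    return `nanoclaw-${groupFolder}-${timestamp}`;
  }
  async wait(_handle: string): Promise<AgentResult> {
    return { exitCode: 0, output: '' };
  }
  async stop(_handle: string): Promise<void> {}
}
```

DockerRuntime and K8sRuntime would each implement the same interface, which is what makes the CONTAINER_RUNTIME env-var toggle (and the rollback story) a one-line change.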

Modified: /workspace/project/src/container-runner.ts

const runtime = getRuntime();

if (runtime instanceof K8sRuntime) {
  const jobName = await runtime.createAgentJob(groupFolder, timestamp, mounts);
  const result = await runtime.pollJobStatus(jobName);
  // Parse result same as Docker output
} else {
  // Existing Docker spawn() logic
}

Pros & Cons

Aspect                Assessment
Code changes          ✅ Low (abstraction layer only)
IPC mechanism         ✅ Unchanged (filesystem polling works)
OpenShift compatible  ✅ Yes (PVC + SCC friendly)
Latency               ⚠️ Medium (Job creation ~2-5s vs Docker <1s)
Multi-node            ⚠️ Requires ReadWriteMany PVCs (NFS, CephFS)
Resource usage        ✅ Low (ephemeral Pods, auto-cleanup)
Complexity            ✅ Low (native K8s primitives)
Rollback              ✅ Easy (just switch runtime back to Docker)

3.2 Approach 2: StatefulSet with Sidecar Pattern

Overview

Replace ephemeral Jobs with long-lived Pods (one per group) that stay idle between sessions. Host controller sends work via IPC (unchanged).

Architecture Diagram

┌──────────────────────────────────────────────────┐
│  Host Controller (Deployment)                    │
│  - Sends IPC messages to wake idle Pods          │
│  - Scales StatefulSet to 0 after idle timeout    │
└──────────────────────────────────────────────────┘
                    │
                    │ IPC via PVC
                    ▼
┌──────────────────────────────────────────────────┐
│  StatefulSet: nanoclaw-main (1 replica)          │
│  ┌────────────────────────────────────────────┐  │
│  │ Pod: nanoclaw-main-0 (always running)      │  │
│  │                                            │  │
│  │ Container loops forever:                   │  │
│  │ 1. Poll /workspace/ipc/input/              │  │
│  │ 2. Process message if present              │  │
│  │ 3. Write output                            │  │
│  │ 4. Sleep 500ms, repeat                     │  │
│  │                                            │  │
│  │ Idle timeout: 30min → graceful shutdown    │  │
│  └────────────────────────────────────────────┘  │
│                                                  │
│  volumeClaimTemplate:                            │
│  - workspace (10Gi, RWO)                         │
└──────────────────────────────────────────────────┘

Volume Strategy

StatefulSet automatically provisions PVCs via volumeClaimTemplates:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nanoclaw-main
spec:
  serviceName: nanoclaw
  replicas: 1
  selector:
    matchLabels:
      app: nanoclaw
      group: main
  template:
    metadata:
      labels:
        app: nanoclaw
        group: main
    spec:
      containers:
      - name: agent
        image: nanoclaw-agent:latest
        command: ["/app/entrypoint-loop.sh"]  # Modified entrypoint
        volumeMounts:
        - name: workspace
          mountPath: /workspace
  volumeClaimTemplates:
  - metadata:
      name: workspace
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Implementation Changes

Modified: /workspace/project/container/agent-runner/src/index.ts

// Replace single-shot execution with a long-running poll loop
let lastActivity = Date.now();

while (true) {
  const message = await pollIpcInput();
  if (message === '_close') {
    console.log('Shutdown signal received');
    break;
  }
  if (message) {
    await processQuery(message);
    lastActivity = Date.now();  // reset the idle clock on real work
  }
  await sleep(500);

  // Idle timeout
  if (Date.now() - lastActivity > IDLE_TIMEOUT) {
    console.log('Idle timeout, shutting down');
    break;
  }
}

Modified: /workspace/project/src/group-queue.ts

// Instead of spawning a new container, ensure the group's StatefulSet exists
async ensureStatefulSet(groupFolder: string) {
  if (!await k8s.statefulSetExists(groupFolder)) {
    await k8s.createStatefulSet(groupFolder);
  }
  await k8s.waitForPodReady(groupFolder);
}

// Send an IPC message to wake the idle Pod
async enqueueMessageCheck(groupFolder: string, message: Message) {
  await this.ensureStatefulSet(groupFolder);
  await writeIpcMessage(groupFolder, message);
}

Pros & Cons

Aspect                Assessment
Code changes          ⚠️ Medium (queue + agent-runner modifications)
Latency               ✅ Low (Pod already running, no Job creation)
Resource usage        ❌ High (idle Pods consume memory/CPU)
IPC mechanism         ✅ Unchanged
OpenShift compatible  ✅ Yes
Session reuse         ✅ Claude SDK stays warm (faster startup)
Complexity            ⚠️ Medium (StatefulSet lifecycle, idle timeout logic)
Multi-node            ⚠️ Requires RWX PVCs

3.3 Approach 3: DaemonSet Controller + Job Workers

Overview

The host controller runs as a DaemonSet Pod on each K8s node. Jobs are pinned, via node affinity, to the node that holds their group's data directory. This approach is optimized for multi-node clusters with hostPath volumes (local-disk speed).

Architecture Diagram

┌────────────────────────────────────────────────────────┐
│  Kubernetes Cluster (3 nodes)                          │
│                                                        │
│  Node 1                Node 2               Node 3     │
│  ┌─────────────┐      ┌─────────────┐     ┌──────┐     │
│  │ nanoclaw-   │      │ nanoclaw-   │     │ ...  │     │
│  │ controller  │      │ controller  │     └──────┘     │
│  │ DaemonSet   │      │ DaemonSet   │                  │
│  │ Pod         │      │ Pod         │                  │
│  │             │      │             │                  │
│  │ Manages:    │      │ Manages:    │                  │
│  │ - group-a   │      │ - group-c   │                  │
│  │ - group-b   │      │ - group-d   │                  │
│  └─────────────┘      └─────────────┘                  │
│         │                     │                        │
│         │ Creates Job         │ Creates Job            │
│         │ with nodeSelector   │ with nodeSelector      │
│         ▼                     ▼                        │
│  ┌─────────────┐      ┌─────────────┐                  │
│  │ Job: group-a│      │ Job: group-c│                  │
│  │ (Node 1)    │      │ (Node 2)    │                  │
│  │             │      │             │                  │
│  │ hostPath:   │      │ hostPath:   │                  │
│  │ /var/       │      │ /var/       │                  │
│  │ nanoclaw/   │      │ nanoclaw/   │                  │
│  │ group-a/    │      │ group-c/    │                  │
│  └─────────────┘      └─────────────┘                  │
└────────────────────────────────────────────────────────┘

Group → Node Assignment

Assign groups to nodes with a deterministic hash (note: plain hash-modulo reshuffles assignments whenever the node count changes; the ConfigMap mapping below pins existing assignments to limit churn):

import { createHash } from 'node:crypto';

function getNodeForGroup(groupFolder: string, nodes: Node[]): string {
  const hash = createHash('sha256')
    .update(groupFolder)
    .digest('hex');
  const index = parseInt(hash.slice(0, 8), 16) % nodes.length;
  return nodes[index].metadata.name;
}

Store mapping in ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nanoclaw-group-assignments
data:
  group-main: "node-1"
  group-family: "node-2"
  group-work: "node-1"

Volume Strategy

hostPath volumes for zero network latency:

apiVersion: batch/v1
kind: Job
metadata:
  name: nanoclaw-main-{{timestamp}}
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-1  # Pinned to same node as controller
      containers:
      - name: agent
        volumeMounts:
        - name: ipc
          mountPath: /workspace/ipc
        - name: group
          mountPath: /workspace/group
      volumes:
      - name: ipc
        hostPath:
          path: /var/nanoclaw/ipc/main
          type: Directory
      - name: group
        hostPath:
          path: /var/nanoclaw/groups/main
          type: Directory

Implementation Changes

New file: /workspace/project/src/k8s-daemonset.ts

export async function assignGroupToNode(groupFolder: string): Promise<string> {
  const nodes = await k8s.listNodes();
  const nodeName = getNodeForGroup(groupFolder, nodes);

  // Persist the assignment so it survives controller restarts
  await k8s.updateConfigMap('nanoclaw-group-assignments', {
    [groupFolder]: nodeName
  });

  return nodeName;
}

export async function createJobWithAffinity(
  groupFolder: string,
  nodeName: string
): Promise<string> {
  const job = buildJobManifest(groupFolder, {
    nodeSelector: {
      'kubernetes.io/hostname': nodeName
    },
    volumes: buildHostPathVolumes(groupFolder)
  });
  await k8s.createJob(job);
  return job.metadata.name;
}

Pros & Cons

Aspect                Assessment
Performance           ✅ Best (local disk I/O, no network mounts)
Multi-node            ✅ Native (DaemonSet per node)
Resource usage        ⚠️ Medium (one controller per node)
Code changes          ❌ High (distributed state, node affinity logic)
Security              ❌ Poor (hostPath requires privileged access)
OpenShift compatible  ❌ No (hostPath blocked by restricted SCC)
Complexity            ❌ High (node assignment, rebalancing, failure handling)

4. Comparison Matrix

Criterion               Approach 1: Job+PVC   Approach 2: StatefulSet   Approach 3: DaemonSet
Code complexity         ✅ Low                 ⚠️ Medium                  ❌ High
Job/Pod latency         ⚠️ 2-5s                ✅ <500ms                  ✅ <500ms
Resource idle cost      ✅ Low                 ❌ High                    ⚠️ Medium
Multi-node support      ⚠️ Requires RWX        ⚠️ Requires RWX            ✅ Native
Volume I/O performance  ⚠️ Network (NFS)       ⚠️ Network (NFS)           ✅ Local disk
OpenShift SCC           ✅ Compatible          ✅ Compatible              ❌ Blocked
IPC mechanism           ✅ Unchanged           ✅ Unchanged               ✅ Unchanged
Rollback ease           ✅ Easy                ⚠️ Medium                  ❌ Hard
Production readiness    ✅ Good                ✅ Good                    ⚠️ Experimental
Recommended for         POC, single-node      Production, <50 groups    High-scale, >100 groups

5. Recommendation: Approach 1, Job-Based with PersistentVolumeClaims

Rationale

  1. Minimal disruption — Abstraction layer only, IPC unchanged
  2. OpenShift compatible — No hostPath, SCC-friendly
  3. Easy rollback — Runtime flag toggles Docker/K8s
  4. Natural evolution — Can upgrade to StatefulSet later if needed

Migration Path

Phase 1: Single-Node Kubernetes (Week 1-2)

  • Implement k8s-runtime.ts with Job API client
  • Create PVCs for main group (group, IPC, sessions, project)
  • Test Job creation, status polling, output parsing
  • Validate IPC mechanism works across PVCs

Phase 2: Multi-Group Support (Week 3-4)

  • Dynamic PVC provisioning per group
  • Test concurrent Job execution (5 simultaneous groups)
  • Performance benchmarking (Job creation latency, PVC I/O)

Phase 3: Multi-Node Deployment (Week 5-6)

  • Evaluate RWX PVC backends (NFS vs CephFS vs AWS EFS)
  • Test cross-node scheduling (Pod on Node 2, PVC on Node 1)
  • If latency unacceptable: pilot Approach 3 (DaemonSet + hostPath)

Phase 4: Production Hardening (Week 7-8)

  • OpenShift SCC validation
  • Security audit (PVC isolation, secrets handling)
  • Resource limits and quotas
  • Monitoring and alerting (Job failures, PVC capacity)

Risk Mitigation

High Risk: PVC Performance

  • Symptom: Slow I/O on NFS-backed PVCs
  • Mitigation: Benchmark early (Phase 2), pivot to DaemonSet if needed
  • Fallback: Use ReadWriteOnce + node affinity (pseudo-hostPath)

Medium Risk: Job Creation Latency

  • Symptom: 5-10s delay for Job → Running
  • Mitigation: Pre-warm Pod pool (StatefulSet with scale=0, scale up on demand)
  • Fallback: Accept latency or switch to StatefulSet (Approach 2)

Low Risk: OpenShift SCC

  • Symptom: PVC mount permissions fail
  • Mitigation: Use fsGroup in securityContext, request anyuid SCC if needed
  • Fallback: Manual PVC permission fixing via initContainer

6. Implementation Checklist

Prerequisites

  • Kubernetes cluster (1.24+) or OpenShift (4.12+)
  • StorageClass with ReadWriteMany support (NFS, CephFS, EFS)
  • Container registry for nanoclaw-agent image
  • RBAC permissions (create Jobs, PVCs, read Pods)

Code Changes

  • Create /workspace/project/src/k8s-runtime.ts (Job API client)
  • Modify /workspace/project/src/container-runtime.ts (runtime detection)
  • Modify /workspace/project/src/container-runner.ts (Job dispatcher)
  • Add /workspace/project/src/config.ts (CONTAINER_RUNTIME, K8S_NAMESPACE)
  • Add /workspace/project/k8s/pvc-templates.yaml (PVC manifests)
  • Add tests for K8s runtime abstraction

Deployment

  • Build and push nanoclaw-agent image to registry
  • Create namespace: kubectl create namespace nanoclaw
  • Apply PVC templates: kubectl apply -f k8s/pvc-templates.yaml
  • Deploy host controller (Deployment with PVC mounts)
  • Set CONTAINER_RUNTIME=kubernetes env var
  • Verify Job creation: kubectl get jobs -n nanoclaw

Testing

  • Single-group test (main group)
  • Concurrent execution test (5 groups simultaneously)
  • IPC round-trip test (follow-up messages work)
  • Idle timeout test (Pod cleans up after 30min)
  • Failure recovery test (Job fails, retry logic works)
  • Performance test (Job latency, PVC throughput)
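The IPC round-trip check from the list above can be prototyped host-side only, before any cluster is involved (the helper name and file layout details are illustrative):

```typescript
import { mkdtempSync, writeFileSync, readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

// Sketch of the IPC round-trip test: write an input file the way the host
// controller would, then verify a simulated agent-side poll can pick it up.
function ipcRoundTrip(): boolean {
  const ipcDir = mkdtempSync(join(tmpdir(), 'nanoclaw-ipc-'));
  const inputFile = join(ipcDir, `${Date.now()}.json`);
  writeFileSync(inputFile, JSON.stringify({ text: 'follow-up message' }));

  // Simulated agent side: poll the directory, read the oldest message.
  const files = readdirSync(ipcDir).sort();
  if (files.length === 0) return false;
  const msg = JSON.parse(readFileSync(join(ipcDir, files[0]), 'utf8'));
  return msg.text === 'follow-up message';
}
```

In the cluster version the only change is that ipcDir is a PVC mount shared between the controller Deployment and the agent Pod, which is exactly what Phase 1 validates.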

7. Future Work

Short-Term (1-3 months)

  • Performance optimization: Pre-warm Pod pool to reduce Job creation latency
  • Dynamic PVC provisioning: Auto-create PVCs for new groups
  • Multi-cluster support: Federate Jobs across multiple K8s clusters

Long-Term (6-12 months)

  • Native K8s IPC: Replace filesystem polling with HTTP (Pod → Service)
  • Serverless integration: Knative for auto-scaling (scale to zero when idle)
  • Operator pattern: Custom Resource Definitions (CRD) for NanoClaw groups

8. Conclusion

Deploying NanoClaw on Kubernetes/OpenShift unlocks multi-node scaling, resource orchestration, and enterprise security without sacrificing simplicity. The Job-based architecture with PersistentVolumeClaims provides the best balance of low complexity, OpenShift compatibility, and clear evolution paths. Implementation requires minimal code changes (~500 LOC) and preserves the existing IPC mechanism.

For organizations running NanoClaw at scale (>10 groups, multi-node), this migration enables cloud-native deployment patterns while maintaining the framework’s core philosophy: secure by isolation, simple by design.

