
Colony Lifecycle Management

Full lifecycle from spawn to stop — states, transitions, supervision trees, and service orchestration in Colony

Every colony follows a well-defined lifecycle. Understanding these states helps you monitor progress and debug issues.

Lifecycle States

spawning → provisioning → building → running → stopping → stopped
    ↓           ↓            ↓          ↓
  [error]    [error]      [error]    [manual]
State         Description                                       Duration
spawning      Creating network namespace and Jujutsu workspace  <100ms
provisioning  Running environment setup, installing deps        5-30s
building      Running build commands (npm build, cargo build)   10-60s
running       Services are live, agent is working               Minutes-hours
stopping      Graceful shutdown, cleaning up resources          <5s
stopped       Colony is inactive, resources freed               —
error         Unrecoverable failure at any stage                —

State Transitions

Transitions happen automatically based on success or failure:

// Successful path
Spawning -> Provisioning -> Building -> Running

// Error paths
Spawning [error] -> Error
Provisioning [error] -> Error (with rollback)
Building [error] -> Error
Running -> Stopping -> Stopped (manual or on completion)
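The rules above amount to a small lookup table. A minimal sketch in Python (names are illustrative, not Colony's actual API; running can reach error because a crashed service may take the colony down, as described under Service Health Monitoring):

```python
# Allowed lifecycle transitions, mirroring the success and error paths above.
TRANSITIONS = {
    "spawning": {"provisioning", "error"},
    "provisioning": {"building", "error"},
    "building": {"running", "error"},
    "running": {"stopping", "error"},
    "stopping": {"stopped"},
    "stopped": set(),   # terminal
    "error": set(),     # terminal
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```

Anything not in the table (for example, jumping from stopped back to running) is rejected, which keeps state handling honest.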

State changes are broadcast via WebSocket to Bloom. The UI updates instantly when a colony transitions.

Spawning Stage

Spawning creates the isolated environment.

What Happens

  1. Generate colony ID — unique identifier (e.g., colony-alpha-7f3d)
  2. Create network namespace — ip netns add ns-{colony_id}
  3. Set up veth pair — virtual ethernet connecting namespace to host
  4. Configure networking — assign IP, set up routing, enable NAT
  5. Create Jujutsu workspace — jj workspace add {colony_id}
  6. Initialize database — SQLite instance at db/{colony_id}.db
  7. Create log buffer — ETS ring buffer for log storage
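The networking and workspace steps can be sketched as a dry-run helper. This is an illustration only: the subnet scheme, veth naming, and workspace path are assumptions, and the real implementation runs on the BEAM, not Python.

```python
import subprocess

def spawn_colony_env(colony_id: str, dry_run: bool = True) -> list[str]:
    """Render (or execute) the host-side setup commands for a colony."""
    ns = f"ns-{colony_id}"
    cmds = [
        # Step 2: isolated network namespace
        ["ip", "netns", "add", ns],
        # Step 3: veth pair linking namespace to host (names assumed)
        ["ip", "link", "add", f"veth-{colony_id}", "type", "veth",
         "peer", "name", "eth0", "netns", ns],
        # Step 5: Jujutsu workspace (path assumed)
        ["jj", "workspace", "add", f"workspaces/{colony_id}"],
    ]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # ip commands require root
    return [" ".join(cmd) for cmd in cmds]

# Dry run: inspect the commands without touching the host.
for line in spawn_colony_env("colony-alpha-7f3d"):
    print(line)
```

Keeping the command list in one place also makes the rollback below easy to derive: each setup command has an obvious inverse.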

Error Handling

If spawning fails (namespace already exists, for example), we roll back everything:

# Rollback actions
ip netns del ns-{colony_id}
rm -rf workspaces/{colony_id}
rm -f db/{colony_id}.db

Provisioning Stage

Provisioning sets up the environment based on colony.toml.

Configuration Parsing

[environment]
node_version = "20"
packages = ["git", "curl", "jq"]

[services]
# Services defined here

Actions

  1. Parse colony.toml — load configuration from workspace
  2. Install language runtime — nvm, rustup, pyenv based on config
  3. Install system packages — apt/dnf based on distro
  4. Run setup scripts — custom provisioning commands
  5. Install dependencies — npm install, cargo fetch, pip install

Provisioning runs inside the namespace via ip netns exec. All installed tools are namespace-local.
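Wrapping a provisioning command so it runs namespace-locally looks roughly like this (helper name and dry-run mode are illustrative):

```python
import subprocess

def netns_exec(colony_id: str, command: list[str], dry_run: bool = True):
    """Run a command inside the colony's network namespace."""
    full = ["ip", "netns", "exec", f"ns-{colony_id}", *command]
    if dry_run:
        return " ".join(full)
    return subprocess.run(full, check=True)  # requires root

print(netns_exec("colony-alpha-7f3d", ["npm", "install"]))
# → ip netns exec ns-colony-alpha-7f3d npm install
```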

Rollback on Failure

If provisioning fails (npm install error, for example), we restore the previous state:

  1. Restore workspace to pre-provision commit
  2. Clear installed packages (if possible)
  3. Transition to error state with logs attached

Building Stage

Building compiles or bundles your application.

Build Commands

[build]
command = "npm run build"
timeout = 300  # 5 minutes

We execute the build command and stream output to the log buffer. Exit code 0? Transition to running. Non-zero? Transition to error.
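That policy can be sketched as a single function (function and state names are illustrative):

```python
import subprocess

def run_build(command: str, timeout: int = 300) -> str:
    """Run the build command; map the outcome onto the next lifecycle state."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "error"  # build exceeded its configured timeout
    # Exit code 0 -> running, anything else -> error.
    return "running" if result.returncode == 0 else "error"
```

For example, `run_build("exit 0")` yields "running" while `run_build("exit 1")` yields "error"; a hung build is caught by the timeout rather than wedging the colony.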

Parallel Builds

Multiple colonies can build at the same time. They’re isolated, so there are no conflicts over temp files or ports.
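Because builds share no state, they parallelize trivially. A sketch using a thread pool (the real scheduler is BEAM processes, not Python threads, and the build command here is a stand-in):

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def build(colony_id: str) -> tuple[str, int]:
    # Each colony builds in its own isolated workspace and namespace,
    # so concurrent builds cannot collide on temp files or ports.
    result = subprocess.run("exit 0", shell=True)  # stand-in for the real build
    return colony_id, result.returncode

colonies = ["colony-alpha", "colony-beta", "colony-gamma"]
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(build, colonies))

print(results)  # every colony built concurrently
```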

Running Stage

This is where agents do their work and services are live.

Service Spawning

For each service in colony.toml:

[[services]]
name = "web"
command = "npm start"
port = 4001

We:

  1. Spawn dedicated owner process — separate from colony actor
  2. Open Erlang port — open_port({spawn, "npm start"}, [...])
  3. Capture OS PID — via erlang:port_info(Port, os_pid)
  4. Stream output — stdout/stderr to log buffer
  5. Register Caddy route — non-blocking HTTP call to Caddy API

Why Dedicated Owner Process?

The colony actor needs to stay responsive to handle API requests (get state, stream logs, etc.). If we did synchronous HTTP calls or blocking I/O inside the actor, it’d become unresponsive.

Solution: spawn (not spawn_link) a separate process that owns the port. If the service crashes, only that process dies. The colony actor survives and reports the error.

// Simplified version
pub fn spawn_service(service: Service) -> Result(Pid, Error) {
  // Spawn separate process to own the port
  let owner_pid = process.spawn(fn() {
    let port = open_port(service.command)
    let os_pid = get_os_pid(port)

    // Register route (non-blocking, separate process)
    task.async(fn() { register_caddy_route(service) })

    // Stream output forever
    stream_output_loop(port)
  })

  Ok(owner_pid)
}

Service Health Monitoring

The owner process monitors the port for exit signals:

receive
  {Port, {exit_status, 0}} ->
    % Service exited gracefully
    notify_actor(service_stopped);
  {Port, {exit_status, Code}} ->
    % Service crashed
    notify_actor({service_crashed, Code})
end

Crashed services don’t crash the colony. The actor gets the error and can restart the service or transition to error state.
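The same reap-and-report pattern, outside the BEAM, looks like this (names are illustrative):

```python
import subprocess
import sys

def watch_service(command: list[str]) -> tuple[str, int]:
    """Run a service and report whether it stopped gracefully or crashed."""
    proc = subprocess.Popen(command)
    code = proc.wait()  # blocks until the service exits
    if code == 0:
        return ("service_stopped", 0)
    return ("service_crashed", code)

print(watch_service([sys.executable, "-c", "raise SystemExit(3)"]))
# → ('service_crashed', 3)
```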

Stopping Stage

Stopping performs graceful shutdown.

Actions

  1. Kill services — os:cmd("kill {os_pid}") for each service
  2. Wait for port exit — timeout 5s, force kill if needed
  3. Deregister Caddy routes — remove reverse proxy rules
  4. Close log buffer — flush logs to disk (optional)
  5. Keep namespace alive — don’t destroy (allows restart)
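Steps 1 and 2 (graceful kill, then escalate after the timeout) can be sketched like so; on POSIX, terminate() sends SIGTERM and kill() sends SIGKILL, and a signal-terminated child reports a negative return code:

```python
import subprocess
import sys

def stop_service(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the service to exit; escalate to SIGKILL after the timeout."""
    proc.terminate()                 # graceful: SIGTERM
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                  # force: SIGKILL
        return proc.wait()

# A stand-in long-running service that ignores nothing and sleeps.
svc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
code = stop_service(svc, timeout=1.0)
print(code)  # negative on POSIX: terminated by signal
```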

OS PID Cleanup

Erlang port_close(Port) doesn’t kill the underlying OS process. We explicitly kill by PID:

% Get OS PID from port
{os_pid, OsPid} = erlang:port_info(Port, os_pid),

% Kill the process
os:cmd(io_lib:format("kill ~p", [OsPid])),

% Close the port
port_close(Port).

This ensures services are fully terminated, not orphaned.

OTP Supervision Tree

The colony manager uses an OTP supervisor with one_for_one strategy:

ColonyManager (Supervisor)
    ├─ ColonyActor[colony-alpha] (GenServer)
    │   └─ ServiceOwner[web] (Pid)
    │   └─ ServiceOwner[api] (Pid)
    ├─ ColonyActor[colony-beta] (GenServer)
    └─ ColonyActor[colony-gamma] (GenServer)

If a colony actor crashes (unhandled exception), the supervisor restarts only that actor. Other colonies keep running.

Fault Isolation

// If this crashes...
pub fn handle_call(msg: GetState, state: State) {
  case msg {
    GetState -> {
      let invalid_state = crash_here()  // Oops!
      Response(invalid_state)
    }
  }
}

// ...only this colony actor restarts.
// Other colonies are unaffected.

This is OTP’s superpower: automatic fault recovery with surgical precision.

State Diagram (ASCII Art)

              ┌─────────────┐
              │   spawning  │──── error ───────┐
              └──────┬──────┘                  │
                     │                         │
                success                        │
                     │                         │
              ┌──────▼──────────┐              │
              │  provisioning   │─── error ────┤
              └──────┬──────────┘              │
                     │                         │
                success                        │
                     │                         │
              ┌──────▼──────┐                  │
              │  building   │──── error ───────┤
              └──────┬──────┘                  │
                     │                         │
                success                        │
                     │                         │
              ┌──────▼──────┐                  │
              │   running   │                  │
              └──────┬──────┘                  │
                     │                         │
                stop command                   │
                     │                         │
              ┌──────▼──────┐                  │
              │  stopping   │                  │
              └──────┬──────┘                  │
                     │                         │
              ┌──────▼──────┐           ┌──────▼──────┐
              │   stopped   │           │    error    │
              └─────────────┘           └─────────────┘

Caddy Route Registration

When a service starts, we register a reverse proxy rule with Caddy:

{
  "handle": [{
    "handler": "reverse_proxy",
    "upstreams": [{
      "dial": "10.200.1.2:4001"
    }]
  }]
}

Now the service is accessible at http://web-4001.colony.local/ from the host and Bloom.
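The registration itself is a small JSON POST to Caddy's admin API. A sketch (the admin address, route @id scheme, and hostname matcher are assumptions based on the snippet above; Colony's actual call is made from the BEAM):

```python
import json
import urllib.request

def build_route(colony_id: str, service: str, ip: str, port: int) -> dict:
    """Build the Caddy route object for one colony service."""
    return {
        "@id": f"route-{colony_id}-{service}",
        "match": [{"host": [f"{service}-{port}.colony.local"]}],
        "handle": [{
            "handler": "reverse_proxy",
            "upstreams": [{"dial": f"{ip}:{port}"}],
        }],
    }

def register_route(route: dict, admin: str = "http://localhost:2019") -> None:
    """POST appends the route to the server's route list (network call)."""
    req = urllib.request.Request(
        f"{admin}/config/apps/http/servers/colony/routes",
        data=json.dumps(route).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

route = build_route("colony-alpha", "web", "10.200.1.2", 4001)
print(route["handle"][0]["upstreams"][0]["dial"])  # → 10.200.1.2:4001
```

Tagging the route with an @id is what makes later deregistration a one-line DELETE.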

Deregistration on Stop

When the colony stops, we delete the Caddy route:

DELETE /config/apps/http/servers/colony/routes/route-colony-alpha-web

No stale routes pointing to dead services.
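A sketch of that DELETE, mirroring the path above (admin address assumed; the network call is commented out here):

```python
import urllib.request

def deregister_route(colony_id: str, service: str,
                     admin: str = "http://localhost:2019") -> str:
    """Delete the colony service's reverse proxy rule; returns the path used."""
    path = f"/config/apps/http/servers/colony/routes/route-{colony_id}-{service}"
    req = urllib.request.Request(admin + path, method="DELETE")
    # urllib.request.urlopen(req)  # actual network call, omitted in this sketch
    return path

print(deregister_route("colony-alpha", "web"))
# → /config/apps/http/servers/colony/routes/route-colony-alpha-web
```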

Next Steps