# Talos Bare-Metal Cluster

Nested Pulumi project for bootstrapping bare-metal Talos Linux clusters.
## Overview

This project handles the full lifecycle of a bare-metal Talos cluster:

- Phase 1: Talos bootstrap (machine secrets, config apply, bootstrap)
- Phase 2: Post-install (Cilium CNI, Longhorn storage, Gateway API)
## Architecture

```text
talos-cluster-baremetal/
├── __main__.py              # Main orchestration (2 phases)
├── patches.py               # PatchConfig dataclass + patch builders
├── check_node_state.py      # Pre-flight node state detection
├── wait_for_talos_node.py   # Dynamic provider: wait for Talos API
├── wait_for_k8s_api.py      # Dynamic provider: wait for K8s API
├── tests/                   # pytest tests for patch generation
├── outputs/                 # Generated credentials (gitignored)
│   ├── talosconfig          # Talos client config
│   ├── kubeconfig           # Kubernetes kubeconfig
│   ├── controlplane.yaml    # Generated machine config
│   └── machine-secrets.yaml # Cluster secrets
└── Pulumi.dev.yaml          # Stack config with node details
```
## Node State Detection

Before declaring resources, `check_node_state.py` probes the node:

| State | Meaning | Action |
|---|---|---|
| `MAINTENANCE` | Node in maintenance mode | Full bootstrap pipeline |
| `RUNNING` | Cluster already running | Skip config apply + bootstrap |
| `UNREACHABLE` | Node not responding | Assume maintenance, full pipeline |

This makes `pulumi up` idempotent: re-running after a successful deploy skips the bootstrap steps.
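The state-to-action decision can be captured in a small pure function. This is an illustrative sketch, not the actual code in `check_node_state.py`; the `NodeState` enum and function name are assumptions:

```python
from enum import Enum, auto


class NodeState(Enum):
    """Possible results of probing a Talos node before deployment."""
    MAINTENANCE = auto()
    RUNNING = auto()
    UNREACHABLE = auto()


def needs_bootstrap(state: NodeState) -> bool:
    """Return True when the full bootstrap pipeline should run.

    RUNNING means a cluster already exists, so config apply and
    bootstrap are skipped. MAINTENANCE and UNREACHABLE both get the
    full pipeline: an unreachable node is assumed to be waiting in
    maintenance mode.
    """
    return state is not NodeState.RUNNING
```

Treating `UNREACHABLE` the same as `MAINTENANCE` is what lets the first `pulumi up` against a freshly imaged (but still booting) node proceed.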
## Deployment Phases

### Phase 1: Talos Bootstrap

- Generate machine secrets
- Build machine config with patches
- Apply config to node (triggers reboot)
- Wait for Talos API (50000/tcp)
- Bootstrap cluster (etcd init)
- Generate kubeconfig
### Phase 2: Post-Install

- Wait for Kubernetes API (6443/tcp)
- Install Gateway API CRDs
- Install Cilium CNI (with L2 announcements)
- Create Longhorn namespace
- Install external-snapshotter CRDs
- Install Longhorn storage
- Create VolumeSnapshotClass
## Configuration

### Required Settings

```yaml
# Pulumi.dev.yaml
config:
  openchoreo:control_plane_node: "192.168.1.100"
  openchoreo:network_addresses: '["192.168.1.100/24"]'
  openchoreo:network_gateway: "192.168.1.1"
  openchoreo:cluster_endpoint: "https://192.168.1.100:6443"
```
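Note that `network_addresses` is a JSON string, not a native YAML list. A minimal decoding helper (illustrative only; the project's actual config-loading code may differ) could look like:

```python
import json


def parse_network_addresses(raw: str) -> list[str]:
    """Decode the JSON-encoded network_addresses config value.

    Pulumi stores the value as a plain string, so the list of CIDR
    addresses arrives as e.g. '["192.168.1.100/24"]'.
    """
    addresses = json.loads(raw)
    if not isinstance(addresses, list) or not addresses:
        raise ValueError("network_addresses must be a non-empty JSON list")
    return addresses
```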
### Optional Settings

| Config | Default | Description |
|---|---|---|
| `cluster_name` | `openchoreo` | Talos cluster name |
| `talos_version` | `v1.12.5` | Talos Linux version |
| `kubernetes_version` | `1.33.0` | Kubernetes version |
| `schematic_id` | (none) | Image Factory schematic |
| `wipe_install_disk` | `false` | Wipe disk on install |
| `longhorn_disk` | (none) | Dedicated Longhorn disk |
| `enable_cloudflared` | `false` | Install Cloudflare tunnel |
| `enable_nvidia` | `false` | Enable NVIDIA GPU support |
| `enable_zfs` | `false` | Enable ZFS storage |
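Resolving an optional setting against its default can be sketched as follows. This uses a plain dict in place of `pulumi.Config` and is an assumption about shape, not the project's actual code; the key names mirror the table above:

```python
# Defaults for the optional settings documented above.
DEFAULTS = {
    "cluster_name": "openchoreo",
    "talos_version": "v1.12.5",
    "kubernetes_version": "1.33.0",
    "wipe_install_disk": False,
    "enable_cloudflared": False,
    "enable_nvidia": False,
    "enable_zfs": False,
}


def resolve_setting(config: dict, key: str):
    """Return the configured value for key, falling back to its default.

    Boolean settings may arrive as the strings "true"/"false" from
    stack config, so they are coerced back to bool here.
    """
    value = config.get(key, DEFAULTS.get(key))
    if isinstance(DEFAULTS.get(key), bool) and isinstance(value, str):
        return value.lower() == "true"
    return value
```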
Patch System
patches.py builds Talos machine config patches using a PatchConfig dataclass:
from patches import PatchConfig, build_control_plane_patches
cfg = PatchConfig(
cluster_name="openchoreo",
cluster_endpoint="https://192.168.1.100:6443",
network_interface="enp0s1",
network_addresses=["192.168.1.100/24"],
# ... more settings
)
patches = build_control_plane_patches(cfg)
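Each builder returns a plain dict matching the Talos machine-config schema. As an example of the general shape, a static-network patch might look like the sketch below; the field names follow the public Talos v1alpha1 config, but the exact output of the project's `static_network_patch` may differ:

```python
def static_network_patch(interface: str, addresses: list[str],
                         gateway: str) -> dict:
    """Build a machine-config patch for static interface addressing."""
    return {
        "machine": {
            "network": {
                "interfaces": [
                    {
                        "interface": interface,
                        "addresses": addresses,
                        "routes": [
                            # Default route via the configured gateway.
                            {"network": "0.0.0.0/0", "gateway": gateway},
                        ],
                    }
                ]
            }
        }
    }
```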
### Patch Categories

| Patch | Purpose |
|---|---|
| `cluster_name_patch` | Sets cluster name in Talos config |
| `install_disk_patch` | Configures install disk (WWID or auto) |
| `static_network_patch` | Static IP, gateway, DNS |
| `cert_sans_patch` | Additional SANs for API cert |
| `kubelet_extra_mounts_patch` | Mounts for containerd/kubelet |
| `longhorn_disk_patch` | Dedicated disk for Longhorn |
| `cloudflared_patch` | Cloudflare tunnel sidecar |
| `nvidia_patch` | NVIDIA container runtime |
| `zfs_patch` | ZFS kernel module |
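Several of these patches are conditional on config. A plausible assembly loop (illustrative; the real `build_control_plane_patches` and the patch bodies shown here are assumptions) is:

```python
def assemble_patches(cfg) -> list[dict]:
    """Collect patches, including optional ones only when configured.

    cfg is any object carrying PatchConfig-style attributes; the
    patch dict contents below are simplified placeholders.
    """
    patches = [
        {"cluster": {"clusterName": cfg.cluster_name}},  # always present
    ]
    if getattr(cfg, "longhorn_disk", None):
        # Only emitted when a dedicated Longhorn disk is configured.
        patches.append({"machine": {"disks": [{"device": cfg.longhorn_disk}]}})
    if getattr(cfg, "enable_zfs", False):
        # Only emitted when ZFS support is enabled.
        patches.append({"machine": {"kernel": {"modules": [{"name": "zfs"}]}}})
    return patches
```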
## Dynamic Providers

### WaitForTalosNodeReady

Waits for the Talos API on port 50000, then runs `talosctl health`:

```python
WaitForTalosNodeReady(
    "wait-talos",
    node="192.168.1.100",
    endpoint="192.168.1.100",
    talosconfig_path="outputs/talosconfig",
    timeout=600,
)
```
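Under the hood, both providers amount to a TCP port wait followed by a health check. The port-wait part can be sketched as follows (illustrative, not the providers' actual code):

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 600.0,
                  interval: float = 5.0) -> bool:
    """Poll until a TCP connect to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the API server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out; back off before the next attempt.
            time.sleep(min(interval, max(0.0, deadline - time.monotonic())))
    return False
```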
### WaitForKubernetesAPI

Waits for the Kubernetes API on port 6443:

```python
WaitForKubernetesAPI(
    "wait-k8s",
    host="192.168.1.100",
    port=6443,
    timeout=600,
)
```
## Testing

```shell
# Run patch generation tests
uv run pytest tests/

# Test with a specific config
uv run pytest tests/test_config_patches.py -v
```

Tests use fixtures from `conftest.py` that provide `PatchConfig` instances.
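The tests assert on the generated patch dicts rather than on a live cluster. A hypothetical example of the style (the `PatchConfig` stand-in here only mirrors the fields shown earlier, not the project's real dataclass):

```python
from dataclasses import dataclass, field


@dataclass
class PatchConfig:
    """Minimal stand-in for the project's PatchConfig dataclass."""
    cluster_name: str = "openchoreo"
    cluster_endpoint: str = "https://192.168.1.100:6443"
    network_addresses: list[str] = field(
        default_factory=lambda: ["192.168.1.100/24"])


def test_endpoint_uses_https():
    cfg = PatchConfig()
    assert cfg.cluster_endpoint.startswith("https://")


def test_addresses_are_cidrs():
    cfg = PatchConfig()
    assert all("/" in addr for addr in cfg.network_addresses)
```

Because patch builders are pure functions of `PatchConfig`, these tests run without any node access.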
## Outputs

After a successful deployment the stack exports:

| Output | Description |
|---|---|
| `cluster_name` | Talos cluster name |
| `control_plane_ip` | Node IP address |
| `kubeconfig_raw` | Kubeconfig (secret) |
| `talosconfig_raw` | Talosconfig (secret) |
| `kubernetes_endpoint` | API server URL |
| `written_files` | Paths to generated files |
### Files Written

Credentials are written to `outputs/` (gitignored):

| File | Mode | Purpose |
|---|---|---|
| `talosconfig` | 0600 | talosctl configuration |
| `kubeconfig` | 0600 | kubectl configuration |
| `controlplane.yaml` | 0644 | Generated machine config |
| `machine-secrets.yaml` | 0600 | Cluster bootstrap secrets |
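Creating a file with a restrictive mode from the start (rather than chmod-ing after write) avoids a window where credentials are world-readable. A sketch of such a writer (not the project's actual code):

```python
import os


def write_credential(path: str, data: str, mode: int = 0o600) -> None:
    """Write data to path, creating the file with a restrictive mode.

    Passing the mode to os.open means the file never exists with
    default permissions, unlike open() followed by os.chmod().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    with os.fdopen(fd, "w") as f:
        f.write(data)
```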
## Component Versions

| Component | Version | Purpose |
|---|---|---|
| Cilium | 1.17.6 | CNI with L2 announcements |
| Longhorn | 1.9.1 | Distributed block storage |
| Gateway API | v1.3.0 | Gateway API CRDs |
| external-snapshotter | v8.3.0 | CSI snapshot controller |
## Conventions

### Naming

- Pulumi resources: kebab-case (`wait-talos-node-ready`)
- Python variables: snake_case (`control_plane_node`)
- Config keys: snake_case (`control_plane_node`)

### Dependencies

- Always use `opts=pulumi.ResourceOptions(depends_on=[...])` for ordering
- Phase 2 resources depend on kubeconfig or wait resources

### Secrets

- Credentials written with mode 0600
- Outputs marked as `pulumi.Output.secret()`
- `outputs/` directory is gitignored