Deploying ClickHouse at Scale: Kubernetes Patterns, Storage Choices and Backup Strategies

2026-02-26

Hands-on DevOps guide for running ClickHouse on Kubernetes: patterns, storage, HA, backups, and observability for production OLAP.

If your ClickHouse cluster fails under load, this guide fixes the root causes

Running ClickHouse for production analytics on Kubernetes is not just “lift-and-shift.” Engineering teams tell us the same story: nodes reboot during upgrades, merges pile up and kill I/O, backups are inconsistent, and DNS/volume storms create cascading failures. This guide gives you a pragmatic, 2026-focused DevOps playbook for deploying ClickHouse on Kubernetes at scale — including cluster patterns, stateful workload primitives, storage decisions, and robust backup/restore strategies that actually work in the wild.

Top-level recommendations (most important first)

  • Prefer an operator that understands ClickHouse topology and DDL sequencing — it dramatically simplifies upgrades and re-sharding.
  • Use local NVMe or fast block storage for hot parts and S3-compatible object storage for backups and cold storage tiering.
  • Replicate aggressively (RF ≥ 3) and enforce strict anti-affinity + PodDisruptionBudget to preserve availability during maintenance.
  • Combine CSI volume snapshots with logical backups (clickhouse-backup) to get fast restores and point-in-time consistency of metadata + data.
  • Measure and alert on merge/parts metrics — these determine cluster health more than query latency alone.

Why ClickHouse on Kubernetes matters in 2026

ClickHouse adoption exploded after 2024 as teams replaced monolithic OLAP stacks with distributed, high-performance column stores. Late 2025 positioned ClickHouse as a major OLAP choice for cloud-native analytics — the space continues to benefit from improved operator maturity, built-in keeper services, and first-class object storage integrations (S3 tiering). Kubernetes provides the orchestration, but stateful analytics workloads demand platform patterns beyond vanilla deployments.

Production architecture patterns for ClickHouse OLAP clusters

Pick a pattern based on data size, ingestion profile, and failure domains.

1) Small clusters — single shard with replication

Best for teams with a single region and moderate ingestion. Use a single shard across N replicas (N ≥ 3). The shard handles all queries; replicas provide HA and offload reads.

2) Sharded clusters for scale-out

Split data into logical shards (range or hash), each with replicated replicas. This pattern supports linear scale-out for write throughput and storage. Plan shard keys carefully — re-sharding is expensive.
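A minimal sketch of the sharded pattern. The cluster name `main`, the table names, and the shard key are illustrative placeholders — substitute whatever your remote_servers configuration defines:

```shell
# Hash-sharded, replicated layout: a local ReplicatedMergeTree table on every
# replica plus a Distributed table that routes writes by shard key.
clickhouse-client --multiquery <<'SQL'
-- Local replicated table, created on every shard/replica via ON CLUSTER.
CREATE TABLE IF NOT EXISTS events_local ON CLUSTER main
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (user_id, event_date);

-- Distributed facade: writes are routed by hash of the shard key.
CREATE TABLE IF NOT EXISTS events ON CLUSTER main AS events_local
ENGINE = Distributed(main, currentDatabase(), events_local, cityHash64(user_id));
SQL
```

Choosing `user_id` as the shard key keeps each user's rows on one shard; changing it later means re-sharding, which is why the text urges planning it up front.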

3) Multi-tenant isolation

Use namespaces, resource quotas, and node pools to isolate workloads. Run critical analytics on dedicated node pools with fast storage; run less critical tenants on slower classes with S3 tiering.

Coordination service

ClickHouse historically used ZooKeeper; more recent releases and operators also support ClickHouse Keeper (a lightweight built-in alternative). For stricter operational control, run a dedicated ensemble (3 or 5 nodes) and treat Keeper/ZooKeeper backups as first-class artifacts.
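Keeper answers the same four-letter-word commands as ZooKeeper (e.g. `mntr`) on its client port, which makes ensemble checks scriptable. A tiny sketch — the host name in the usage comment is a placeholder:

```shell
# keeper_state reads `mntr` output on stdin and prints the node's role
# (leader/follower), useful in health checks for a dedicated ensemble.
keeper_state() {
  awk '$1 == "zk_server_state" { print $2 }'
}

# Live usage (requires nc and a reachable Keeper node, default port 9181):
#   echo mntr | nc keeper-0.keeper-headless 9181 | keeper_state
```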

Kubernetes primitives: StatefulSet vs Operator

Two common deployment approaches exist; choose based on operational maturity.

StatefulSet (DIY)

StatefulSet + headless service + PVC templates works for simple setups. But you’ll miss many cluster-level semantics: DDL coordination, controlled rolling restarts, and safe re-sharding.

  • Pros: Familiar, Kubernetes-native.
  • Cons: Manual handling of replica bootstrapping, metadata migrations, and rolling upgrades.

Operator (recommended)

Use a maintained operator (open-source or vendor-provided) that exposes CRDs for clusters, shards, replicas, and keeper ensembles. Operators implement safe rolling upgrades, apply DDLs, and coordinate re-replication — saving hours during incidents.

Key Kubernetes controls to use

  • PodDisruptionBudget (PDB) — enforce minimum available replicas during maintenance.
  • Pod Anti-Affinity — avoid collocating replicas on the same node or zone.
  • VolumeBindingMode: WaitForFirstConsumer — ensures topology-aware provisioning for stateful shards.
  • PriorityClass — protect ClickHouse pods during node pressure.
  • Readiness/Liveness probes — probe both process health and replication state.
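The last bullet deserves emphasis: a probe that only checks the process will mark a badly lagged replica as ready. A readiness sketch — `system.replicas` and `absolute_delay` are standard, but the threshold and probe wiring are assumptions to adapt:

```shell
#!/bin/sh
# Readiness sketch: the pod is ready only when replication lag is under
# MAX_LAG seconds, not merely when the server process answers.
MAX_LAG="${MAX_LAG:-30}"

lag_ok() {  # $1 = max replica absolute_delay in seconds
  [ "${1:-0}" -le "$MAX_LAG" ]
}

# Live usage inside the container (clickhouse-client on PATH):
#   lag=$(clickhouse-client -q "SELECT max(absolute_delay) FROM system.replicas")
#   lag_ok "${lag:-0}" || exit 1
```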

Example PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: clickhouse

Persistent storage: patterns and trade-offs

Storage choice is the single biggest performance and availability lever for ClickHouse. The engine is I/O-bound: merges write and rewrite large parts. Your storage decisions must align with your SLA and budget.

Storage options

  • Local NVMe / instance storage — Best performance for heavy-write workloads. Combine with replication since local volumes don’t survive node loss.
  • Fast block volumes (gp3, io2) — Good cloud option; use iops-provisioned types for consistent latency.
  • Network filesystems (NFS) — Avoid for hot MergeTree data; they cause unpredictable latency.
  • S3 / Object storage — Use for backups and cold tiering. ClickHouse supports disk configurations pointing to S3 for colder data.

Use fast local NVMe or provisioned block volumes for the ClickHouse data disks (hot), plus an S3-backed disk for cold storage and backups. Use the CSI snapshot API for fast point-in-time snapshots and combine with logical backups to capture metadata.
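The hot/cold split above can be expressed as table-level TTLs once an S3-backed disk exists. A sketch — it assumes a storage policy containing a disk named `s3` is already defined in the server's storage configuration, and the table/partition names are illustrative:

```shell
# Tiering sketch: age parts older than 30 days onto the S3-backed disk,
# plus a one-off move of a specific partition to cold storage.
clickhouse-client --multiquery <<'SQL'
ALTER TABLE events_local ON CLUSTER main
    MODIFY TTL event_date + INTERVAL 30 DAY TO DISK 's3';

-- Manual escape hatch for a single partition (toYYYYMM partitioning assumed):
ALTER TABLE events_local MOVE PARTITION 202601 TO DISK 's3';
SQL
```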

StorageClass example (Kubernetes)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: clickhouse-fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

High Availability: replication, affinity and PDBs

Replication factor should be at least 3 for production. Place replicas across failure domains (nodes, racks, AZs). Combine replication with:

  • Pod anti-affinity and topology spread constraints.
  • Strict PodDisruptionBudget to avoid losing quorum during node drains.
  • Separate node pools for Keeper/ZooKeeper to protect quorum services.

Rolling upgrades and zero-downtime

Operators can orchestrate graceful restarts: drain writes from a replica, wait for replication lag to zero, then restart. For StatefulSet DIY, automate the sequence with a control script that uses system tables (system.replication_queue, system.replicas) to track progress.
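For the DIY path, the wait step can be isolated into a small loop. The probe is injected as a command so the loop itself has no hard-coded cluster details; the pod/host names in the usage comment are placeholders:

```shell
# Control-loop sketch for DIY rolling restarts: block until a replica's
# replication queue is empty before restarting it.
wait_for_drain() {  # $1 = command that prints current queue size, $2 = poll interval (s)
  while :; do
    n=$("$1")
    [ "${n:-0}" -eq 0 ] && return 0
    sleep "${2:-5}"
  done
}

# Live usage:
#   queue_size() { clickhouse-client -h "$POD" -q \
#       "SELECT count() FROM system.replication_queue"; }
#   wait_for_drain queue_size 10 && kubectl delete pod "$POD"
```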

Backups and restores — strategy and playbook

Backups for ClickHouse require both data and coordination metadata (Keeper/ZooKeeper). Snapshotting just volumes misses ZooKeeper state; logical dumps miss binary compaction state. Combine multiple approaches.

Backup components

  • Metadata — table schemas, users, configs (extract from system tables and /etc/clickhouse-server).
  • Local data — MergeTree parts stored on node disks.
  • Coordination state — ZooKeeper/ClickHouse Keeper snapshots.

Two-tier backup recipe (fast restore + safety)

  1. Use CSI VolumeSnapshot to capture block storage snapshots for each replica simultaneously — this provides fast local restore capability.
  2. Run a logical backup using clickhouse-backup (widely used in 2026) to export metadata and upload parts to S3. This tool understands MergeTree formats and uploads incremental backups.
  3. Snapshot Keeper/ZooKeeper data directories (via snapshot or filesystem copy) and store in S3, ensuring you capture the same logical time as the data snapshots.
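Steps 1 and 2 wired together might look like the sketch below. The namespace, PVC name, snapshot class, and pod name are placeholders for your environment (`clickhouse-backup create_remote` combines create and upload in one step):

```shell
#!/bin/sh
# Two-tier backup sketch: CSI snapshot of the data PVC, then a logical
# backup uploaded to S3 from inside the ClickHouse pod.
set -eu
STAMP=$(date +%F)

# 1. CSI snapshot of a replica's data PVC (repeat per replica).
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: clickhouse-data-0-${STAMP}
  namespace: analytics
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-clickhouse-0
EOF

# 2. Logical backup + S3 upload in one step.
kubectl exec -n analytics clickhouse-0 -- \
  clickhouse-backup create_remote "full_${STAMP}"
```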

Automated backup example (clickhouse-backup)

# create a backup and upload to S3
clickhouse-backup create full_$(date +%F)
clickhouse-backup upload full_$(date +%F) --config /etc/clickhouse-backup/config.yml

# restore
clickhouse-backup download full_2026-01-01
clickhouse-backup restore full_2026-01-01

Schedule backups during low-activity windows and test restores monthly. Always verify Keeper snapshots can be restored to recover replica topology.

Cluster-consistent backups

For full cluster consistency: 1) pause writes or route writes to a maintenance shard, 2) snapshot data disks and Keeper, 3) resume writes. If pausing writes is impossible, rely on logical backups plus replication to rebuild missing parts.

Restore playbook — quick recovery steps

  1. Restore Keeper/ZooKeeper snapshot and ensure the ensemble reaches quorum.
  2. Restore metadata (users, replicated table definitions) before real data.
  3. Restore local PV snapshots or download parts from S3, then start ClickHouse replicas one at a time.
  4. Verify replication status (system.replicas) and monitor for missing parts; trigger manual fetches if necessary.
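Step 4 in practice — health queries against standard system tables, plus `SYSTEM RESTORE REPLICA` for a replica that comes up read-only after a Keeper restore. Table names are illustrative:

```shell
# Post-restore verification sketch: surface unhealthy replicas, then
# rebuild Keeper metadata from local parts where needed.
clickhouse-client --multiquery <<'SQL'
-- Any read-only replicas or stuck queues?
SELECT database, table, is_readonly, queue_size, absolute_delay
FROM system.replicas
WHERE is_readonly OR absolute_delay > 30;

-- Re-attach a replica whose coordination metadata was lost:
SYSTEM RESTORE REPLICA events_local;
SQL
```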

Observability: what to monitor and why

In 2026 teams combine Prometheus, Grafana, and OTLP for ClickHouse observability. Key signals drive most incidents:

  • replication_delay_seconds — lag means writes not fully durable across replicas.
  • merge_queue_size and parts — too many parts indicate write patterns that will explode I/O.
  • disk_usage_bytes and free_space — merges can fail without disk headroom.
  • query_duration_seconds — tail spikes indicate resource contention.
  • threads_active and memory_per_query — watch these to avoid OOMs.
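Most of these signals can be pulled straight from system tables for ad hoc checks. A sketch using standard tables (the LIMIT and lag threshold are arbitrary choices):

```shell
# Quick health queries behind the signals above.
clickhouse-client --multiquery <<'SQL'
-- Tables approaching "too many parts" territory:
SELECT database, table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 10;

-- Current replication lag per table:
SELECT database, table, absolute_delay
FROM system.replicas
ORDER BY absolute_delay DESC;
SQL
```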

Sample PromQL alerts

# Alert when replication lag > 30s
max_over_time(clickhouse_replica_lag_seconds[5m]) > 30

# Alert when merge queue grows
sum(clickhouse_merge_queue_length) by (instance) > 50

CI/CD for ClickHouse on Kubernetes

Apply GitOps for your ClickHouse CRs, storage policies, and RBAC configs. Key practices:

  • Store DDLs in version control and use a migration job that runs idempotent ALTERs.
  • Test schema changes on a staging cluster with representative data and merges enabled for realistic compaction behavior.
  • Use canary replicas for large ALTERs or compression changes: apply on a single replica, observe disk and merge behavior, then roll out.
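A minimal migration-job runner for the first bullet: apply versioned `.sql` files in lexical order through an injected runner command, so each DDL (written to be idempotent, e.g. with IF NOT EXISTS) can be re-run safely. The directory layout is an assumption:

```shell
# GitOps migration sketch: NNN_description.sql files applied in order.
run_migrations() {  # $1 = directory of migration files, $2 = runner command
  for f in "$1"/*.sql; do
    [ -e "$f" ] || continue   # no-op when the directory is empty
    "$2" "$f" || return 1     # stop at the first failing migration
  done
}

# Live usage:
#   apply() { clickhouse-client --multiquery < "$1"; }
#   run_migrations ./migrations apply
```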

Capacity planning: a quick calculator

Estimate storage need with a simple formula:

Estimated disk (GB) = (daily_rows * average_row_size * retention_days * replication_factor) / compression_ratio

Example: 50M rows/day * 200 bytes/row * 90 days * 3 replicas / 10 compression ≈ 270 GB. Always add 30–50% headroom for merges and temporary files.
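The formula as a small shell function (awk does the floating-point arithmetic), reproducing the worked example:

```shell
# Disk estimate in decimal GB.
# Inputs: daily_rows, avg_row_bytes, retention_days, replicas, compression_ratio.
estimate_disk_gb() {
  awk -v rows="$1" -v bytes="$2" -v days="$3" -v rf="$4" -v cr="$5" \
    'BEGIN { printf "%.0f\n", rows * bytes * days * rf / cr / 1e9 }'
}

estimate_disk_gb 50000000 200 90 3 10   # prints: 270
```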

Common failure modes and how to recover

1) Node loss with local NVMe

Replace node, attach new disk, bring up ClickHouse replica, and allow it to replicate missing parts from other replicas. If a replica is permanently lost, reconfigure ReplicatedMergeTree to exclude the missing replica after verifying data safety.
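The "exclude the missing replica" step maps to `SYSTEM DROP REPLICA`, which removes the dead replica's coordination metadata so the shard stops waiting for it. The replica and table names below are placeholders:

```shell
# Remove a permanently lost replica's Keeper/ZooKeeper metadata.
# Only run after confirming the data is safe on surviving replicas.
clickhouse-client -q \
  "SYSTEM DROP REPLICA 'chi-main-0-2' FROM TABLE default.events_local"
```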

2) Keeper/ZooKeeper quorum loss

Bring up new Keeper instances from backups. Avoid rolling through more than one Keeper at a time. Test Keeper restores regularly — many incidents trace back to untested recovery scripts.

3) Merge storm / disk full

Throttle inserts, route reads to less-loaded replicas, and give merges time to complete. Use emergency TTL or drop old partitions if retention permits. For long-term stability, adjust insertion batching and use partitioning that supports partition drops.
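The emergency levers as commands — table and partition IDs are examples, and you should verify retention policy before dropping anything:

```shell
# Merge-storm / disk-full mitigation sketch.
clickhouse-client --multiquery <<'SQL'
-- Stop new merges on the hot table while you make room:
SYSTEM STOP MERGES events_local;

-- Reclaim space by dropping the oldest partition (if retention allows):
ALTER TABLE events_local DROP PARTITION 202501;

-- Resume merges once there is headroom again:
SYSTEM START MERGES events_local;
SQL
```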

Trends to watch

  • Operators will continue to add safe DDL orchestration and automated re-sharding tools.
  • S3-native MergeTree extensions and smarter cold-tiering reduce hot disk demand.
  • Edge analytics and low-latency read replicas will push hybrid architectures combining local inference with central OLAP clusters.
  • Greater standardization of CSI snapshot semantics will improve backup/restore reliability across clouds.

“In 2026, architecting ClickHouse clusters on Kubernetes means choosing orchestration and storage patterns intentionally — not by default.”

Actionable checklist (apply today)

  • Adopt a ClickHouse operator and store all cluster CRs in Git (GitOps).
  • Set replication factor ≥ 3 and configure Pod anti-affinity across AZs.
  • Use local NVMe or provisioned IO for hot data; enable S3 tiering for cold data.
  • Implement combined backups: CSI snapshots + clickhouse-backup uploads to S3 and Keeper snapshots.
  • Instrument Prometheus metrics and add alerts for replication lag and merge queues.
  • Run and rehearse full restores quarterly (including Keeper restore).

Closing — next steps

Deploying ClickHouse on Kubernetes in production is within reach if you pair the right operator with storage patterns and a hardened backup strategy. Start with a small sharded test cluster in a dedicated node pool using local SSDs, wire up clickhouse-backup to S3, and automate restores into a sandbox. Once you can reliably restore, scale shards and tune retention.

Want hands-on help? If you need a partner to evaluate architecture choices, run a proof-of-concept, or take over operations, contact the qubit.host team for an architecture review and managed deployment options tailored to ClickHouse on Kubernetes.


Related Topics

#ClickHouse #Kubernetes #DevOps
