ClickHouse

ClickHouse is an open-source columnar OLAP database built for real-time analytical queries on large datasets. It is the primary platform for high-throughput analytics workloads where query latency and storage efficiency are critical.


When to Use ClickHouse

ClickHouse is the right choice when:

  • Analytical queries need to run in milliseconds on billions of rows
  • Storage costs matter (ClickHouse achieves 5–10× compression vs row-oriented databases)
  • The workload is predominantly read-heavy with append-only writes
  • You need to serve dashboards with sub-second response times at scale

It is not the right choice for:

  • OLTP workloads (frequent point updates, transactions)
  • Datasets smaller than ~100M rows where PostgreSQL or DuckDB suffice
  • Workloads dominated by multi-table joins that cannot be denormalized away (ClickHouse favors wide, denormalized tables)

Core Architecture

Columnar Storage

ClickHouse stores each column in a separate file on disk. This means:

  • Queries that select 5 columns from a 200-column table read only 2.5% of the data
  • Column-level compression is highly effective (similar values stored together)
  • Vectorized CPU operations process entire column arrays at once

MergeTree Engine Family

All production tables use the MergeTree engine family:

Engine                  Use Case
MergeTree               Standard append-only analytical tables
ReplacingMergeTree      Deduplication: keeps the latest row per sorting key
AggregatingMergeTree    Pre-aggregation patterns
SummingMergeTree        Automatic sum aggregation on merge
ReplicatedMergeTree     Multi-node replication
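
As a sketch, a minimal MergeTree table and a ReplacingMergeTree variant (database, table, and column names here are illustrative):

```sql
-- Standard append-only fact table
CREATE TABLE analytics.page_views
(
    event_date Date,
    user_id    UInt64,
    url        String
)
ENGINE = MergeTree
ORDER BY (event_date, user_id);

-- Deduplicating variant: on background merges, keeps the row
-- with the highest `version` per (event_date, user_id)
CREATE TABLE analytics.page_views_latest
(
    event_date Date,
    user_id    UInt64,
    url        String,
    version    UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (event_date, user_id);
```

Note that ReplacingMergeTree deduplicates during background merges, not at insert time, so queries may still see duplicates until a merge runs (or unless they use FINAL).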

Compression

ClickHouse compresses each column independently, using LZ4 by default; ZSTD is available when a higher compression ratio is worth the extra CPU.

Custom codecs can be applied per column:

CREATE TABLE events (          -- illustrative table name
    event_time DateTime CODEC(Delta, ZSTD),   -- delta-encode timestamps, then compress
    status     LowCardinality(String),        -- dictionary encoding for repetitive values
    revenue    Decimal(10,2) CODEC(ZSTD(3))
) ENGINE = MergeTree ORDER BY event_time;

Key Design Principles

Denormalization is Preferred

Unlike traditional data warehouses built around normalized star or snowflake schemas:

  • ClickHouse performs best on wide, flat tables with pre-joined dimensions
  • JOIN operations exist but are expensive at scale
  • Embed dimension attributes directly in fact rows when possible
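
A hedged sketch of this pattern, embedding customer dimension attributes directly in the fact row (table and column names are assumptions):

```sql
-- Wide fact table: dimension attributes stored in each row,
-- avoiding a runtime JOIN against a customer dimension table
CREATE TABLE analytics.fct_orders_wide
(
    order_id         UInt64,
    event_date       Date,
    customer_id      UInt64,
    -- denormalized customer attributes
    customer_country LowCardinality(String),
    customer_segment LowCardinality(String),
    revenue          Decimal(10,2)
)
ENGINE = MergeTree
ORDER BY (event_date, customer_id);
```

LowCardinality keeps the repeated dimension values cheap to store, so the duplication cost of denormalization is usually small.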

Primary Key Drives Everything

The ORDER BY clause defines the sorting key, which powers the sparse primary index.

  • Queries using the primary key prefix skip granules → fast
  • Queries not using the primary key → full scan

Design the primary key around your most frequent query filters:

ORDER BY (event_date, customer_id)   -- if you always filter by date + customer
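
With that key, the prefix rule plays out as follows (table name is illustrative; `EXPLAIN indexes = 1` shows how many granules each query actually reads):

```sql
-- Uses the primary-key prefix (event_date): granules are skipped
SELECT count() FROM analytics.fct_orders
WHERE event_date = '2024-01-15';

-- Filters only on the second key column: no usable prefix,
-- so this falls back to scanning all granules
SELECT count() FROM analytics.fct_orders
WHERE customer_id = 42;
```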

Partitioning for Pruning

Monthly partitioning is the standard for time-series data:

PARTITION BY toYYYYMM(event_date)

ClickHouse skips entire partitions that don't match the query filter, reducing I/O dramatically.
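
Putting partitioning and the sorting key together, a typical time-series DDL might look like this (names are illustrative):

```sql
CREATE TABLE analytics.fct_orders
(
    event_date  Date,
    customer_id UInt64,
    revenue     Decimal(10,2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, customer_id);

-- Touches only the 2024-01 partition; all others are pruned
SELECT sum(revenue)
FROM analytics.fct_orders
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31';
```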


Integration Points

Tool        Integration Method
Power BI    ODBC driver
Grafana     Native ClickHouse plugin
dbt         dbt-clickhouse adapter
Python      clickhouse-driver or clickhouse-connect
PySpark     JDBC driver or HTTP connector
Kafka       ClickHouse Kafka engine (native)
Airflow     HTTP operator or Python operator
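
The Kafka integration, for example, typically pairs a Kafka engine table with a materialized view; a sketch, where broker, topic, and table names are assumptions:

```sql
-- Kafka engine table: a streaming consumer, not a storage table
CREATE TABLE analytics.orders_queue
(
    order_id UInt64,
    revenue  Decimal(10,2)
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'broker:9092',
         kafka_topic_list  = 'orders',
         kafka_group_name  = 'clickhouse-orders',
         kafka_format      = 'JSONEachRow';

-- Materialized view continuously moves consumed rows into a MergeTree table
CREATE MATERIALIZED VIEW analytics.orders_mv
TO analytics.orders
AS SELECT order_id, revenue FROM analytics.orders_queue;
```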

Deployment Options

Self-Hosted

  • Deploy on bare metal or VM
  • Full control over configuration and resources
  • Requires operational expertise

ClickHouse Cloud

  • Managed service by ClickHouse Inc.
  • Automatic scaling and backups
  • Consumption-based pricing
  • Recommended for teams without dedicated infrastructure engineering

Kubernetes

  • Deploy via a ClickHouse operator (e.g. Altinity's clickhouse-operator)
  • Fits teams already standardized on Kubernetes
  • Combines self-hosted control with declarative, automated management


Operational Considerations

Monitoring

Key metrics to monitor:

  • Query duration via system.query_log
  • Merge backlog via system.merges
  • Disk usage via system.parts
  • Replication lag (if replicated) via system.replication_queue
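
For instance, the first and third checks can be run directly against the system tables (queries are sketches; `system.query_log` is populated by default):

```sql
-- Slowest queries in the last hour
SELECT query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Disk usage per table, from active parts only
SELECT table, formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
```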

Backups

ClickHouse supports native backup to S3:

BACKUP TABLE analytics.fct_orders TO S3('s3://my-bucket/backups/', 'key', 'secret');
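
The corresponding restore uses the same destination syntax (bucket path and credentials above and below are placeholders):

```sql
RESTORE TABLE analytics.fct_orders FROM S3('s3://my-bucket/backups/', 'key', 'secret');
```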

Access Control

  • Use dedicated users per application (ETL, BI, admin)
  • Apply row-level policies for multi-tenant scenarios
  • Enable TLS for all client connections in production
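
A minimal sketch of the first two points, assuming a read-only BI user and a `tenant_id` column on the fact table (user name, password, and column are illustrative):

```sql
-- Dedicated read-only user for BI tools
CREATE USER bi_reader IDENTIFIED WITH sha256_password BY 'change-me';
GRANT SELECT ON analytics.* TO bi_reader;

-- Row-level policy: bi_reader only sees one tenant's rows
CREATE ROW POLICY tenant_a_only ON analytics.fct_orders
FOR SELECT USING tenant_id = 'tenant_a' TO bi_reader;
```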
