Event ingestion explained

Last updated: Nov 07, 2022

On this page

  • Ingestion data flow
  • Client libraries
  • Capture API
  • App server
  • Kafka
  • ClickHouse
  • Non-sharded ClickHouse events ingestion
  • kafka_events table
  • events_mv Materialized View
  • events table
  • Sharded events ingestion
  • writable_events table
  • sharded_events table
  • events table
  • Persons ingestion

In its simplest form, PostHog is an analytics data store where events come in and get analyzed.

This document gives an overview of how data ingestion works.

Ingestion data flow

  Client library → /decide API (configuration and feature flags)
  Client library → Capture API → Kafka → Plugin server → Kafka → ClickHouse
  Plugin server ↔ PostgreSQL (persons table)

The following sections break each part down in more detail.

Client libraries

Client libraries are responsible for capturing user interactions and sending the events to us.

Note that various client libraries can also call the /decide endpoint:

  • posthog-js: on load, to fetch compression, session recording, feature flag, and other autocapture-related settings
  • other libraries: to check feature flags

Capture API

The capture API is responsible for:

  • Validating API keys.
  • Anonymizing IPs according to project settings.
  • Decompressing and normalizing the shape of event data for the rest of the system.
  • Sending the processed events to the events_plugin_ingestion Kafka topic.
  • Logging events to the Kafka dead_letter_queue table if communication with Postgres fails.

The design goal of this service is to be as simple and resilient as possible to avoid dropping events.

App server

At a high level, during ingestion the app server (also known as the plugin server):

  • Reads events from the events_plugin_ingestion Kafka topic.
  • Runs user-created apps on the events, potentially modifying their shape.
  • Handles person (and group) creation and updates, using the posthog_person PostgreSQL table as the source of truth.
  • Sends events, persons, and groups to specialized Kafka topics for ClickHouse to read.
  • Does all of this in a highly parallelized way to handle high ingestion volume.
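
The person-update step deserves a closer look. Below is a hypothetical sketch of what a single property update against Postgres could look like; the jsonb merge, the ids, and the version bump are assumptions about the mechanics, not the exact production queries:

SQL
-- Hypothetical: merge new properties into the person row and bump its
-- version so downstream consumers can order updates correctly.
UPDATE posthog_person
SET properties = properties || '{"plan": "scale"}'::jsonb,
    version = version + 1
WHERE team_id = 1
  AND id = 42
RETURNING version;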

Kafka

Kafka is used as a resilient message bus between different services.

You can find the relevant Kafka topics in the PostHog codebase.

ClickHouse

ClickHouse is our main analytics backend.

Instead of data being inserted directly into ClickHouse, ClickHouse itself pulls data from Kafka. This makes our ingestion pipeline more resilient to outages.

The following sections go more into depth in how this works exactly.

Non-sharded ClickHouse events ingestion

  Kafka (clickhouse_events_proto topic)
    → kafka_events table (Kafka table engine)
    → events_mv table (Materialized view)
    → events table (ReplacingMergeTree table engine)

kafka_events table

The kafka_events table uses the Kafka table engine.

In essence, it behaves as a Kafka consumer group: reading from this table reads from the underlying Kafka topic and advances the current offset.
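
A minimal sketch of what such a table can look like (column list abridged; the broker address, consumer group name, and schema reference are illustrative, not the production values):

SQL
-- Reading from this table consumes messages from the topic and advances
-- the consumer group's offset.
CREATE TABLE posthog.kafka_events (
    `uuid` UUID,
    `event` String,
    `properties` String,
    `timestamp` DateTime64(6, 'UTC'),
    `team_id` Int64,
    `distinct_id` String
    -- remaining columns match the events table
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickhouse_events_proto',
         kafka_group_name = 'clickhouse-ingestion',
         kafka_format = 'Protobuf',
         kafka_schema = 'events:Event'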

events_mv Materialized View

The events_mv table is a materialized view.

In this case, it acts as a data pipe which periodically pulls data from kafka_events and pushes the results into the events table.
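
A minimal sketch of such a view (column list abridged; _timestamp and _offset are virtual columns exposed by the Kafka table engine):

SQL
-- Every batch consumed from kafka_events is written straight into events.
CREATE MATERIALIZED VIEW posthog.events_mv
TO posthog.events
AS SELECT
    uuid, event, properties, timestamp, team_id, distinct_id,
    _timestamp, _offset
FROM posthog.kafka_events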

events table

The events table uses the ReplacingMergeTree table engine. Insights and other features query this table for analytics results.

Note that while ReplacingMergeTree is used, it is not an effective deduplication method, so we should avoid writing duplicate data into the table.
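
As an illustration (the UUID below is a made-up value): rows sharing the same sorting key collapse only when a background merge runs, so a plain read may still see duplicates.

SQL
-- May return 2 right after a duplicate insert, before any merge happens:
SELECT count() FROM events
WHERE uuid = toUUID('00000000-0000-0000-0000-000000000001');

-- FINAL applies replacing semantics at read time and returns 1, but it is
-- too expensive for routine queries, hence: avoid writing duplicates.
SELECT count() FROM events FINAL
WHERE uuid = toUUID('00000000-0000-0000-0000-000000000001');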

Sharded events ingestion

PostHog Cloud currently has more than a single ClickHouse instance. To support this, a sharded setup with a different ClickHouse schema is used.

  Kafka (clickhouse_events_proto topic)
    → kafka_events table (Kafka table engine)
    → events_mv table (Materialized view)
    → writable_events table (Distributed table engine)
    → sharded_events table (ReplicatedReplacingMergeTree table engine)

  Queries read from the events table (Distributed table engine), which in turn reads from sharded_events.

writable_events table

The writable_events table uses the Distributed table engine.

The schema looks something like this:

SQL
CREATE TABLE posthog.writable_events (
    `uuid` UUID,
    `event` String,
    `properties` String,
    `timestamp` DateTime64(6, 'UTC'),
    `team_id` Int64,
    `distinct_id` String,
    `elements_hash` String,
    `created_at` DateTime64(6, 'UTC'),
    `_timestamp` DateTime,
    `_offset` UInt64,
    `elements_chain` String
) ENGINE = Distributed('posthog', 'posthog', 'sharded_events', sipHash64(distinct_id))

This table:

  • Receives rows pushed from the events_mv table.
  • For every row, calculates a hash based on the distinct_id column.
  • Based on the hash, sends the row to the right shard of the posthog cluster, into the posthog.sharded_events table.
  • Does not contain materialized columns, as they would hinder INSERT queries.
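
To illustrate the routing step (the three-shard cluster below is an assumption for the example, not the real shard count):

SQL
-- With N equally-weighted shards, a row lands on shard
-- sipHash64(distinct_id) % N, so all events for one distinct_id colocate.
SELECT sipHash64('user-123') AS hash,
       sipHash64('user-123') % 3 AS shard_index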

sharded_events table

The sharded_events table uses the ReplicatedReplacingMergeTree table engine.

This table:

  • Stores the event data.
  • Is sharded and replicated.
  • Is queried indirectly via the events table.
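
A sketch of what the engine declaration can look like; the ZooKeeper path, macros, partitioning, and sorting key below follow common ClickHouse conventions and are assumptions rather than the exact production DDL:

SQL
CREATE TABLE posthog.sharded_events (
    `uuid` UUID,
    `event` String,
    `properties` String,
    `timestamp` DateTime64(6, 'UTC'),
    `team_id` Int64,
    `distinct_id` String,
    `_timestamp` DateTime
    -- remaining columns match writable_events
) ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/posthog.sharded_events', -- replication path
    '{replica}',                                         -- replica name macro
    _timestamp                                           -- version column
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))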

events table

Similar to writable_events, the events table uses the Distributed table engine.

This is the table the app queries. For every query, it figures out which shard(s) to query and aggregates the results from those shards.
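
For example, a query like the following (the team_id is hypothetical) is fanned out to every shard's sharded_events table, and the partial results are merged on the coordinating node:

SQL
SELECT event, count() AS occurrences
FROM events
WHERE team_id = 1
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY event
ORDER BY occurrences DESC
LIMIT 10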

Persons ingestion

Persons ingestion works similarly to events ingestion, except there are two tables involved: person and person_distinct_id.

Note that querying both tables requires handling duplicated rows. Check out the PersonQuery code for an example of how this is done.
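
A sketch of that deduplication pattern, assuming the person table carries version and is_deleted columns: per person id, take the values from the row with the highest version, and drop persons whose latest row is marked deleted.

SQL
-- The team_id filter is hypothetical; argMax picks properties from the
-- row with the largest version per group.
SELECT id,
       argMax(properties, version) AS properties
FROM person
WHERE team_id = 1
GROUP BY id
HAVING argMax(is_deleted, version) = 0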

In sharded setups, the person and person_distinct_id tables are not sharded; instead, they are replicated onto each node to avoid JOINs over the network.


Authors

  • Karl-Aksel Puulmann
  • justinjones
  • rafalmierzwiak
