# Tinybird Documentation
## Table of Contents
- [Introduction](#introduction)
- [Journey of Data](#journey-of-data-from-cdp-to-insights)
- [Architecture Overview](#architecture-overview)
- [Making Changes to Resources](#making-changes-to-resources)
- [How to Iterate on Data](#how-to-iterate-on-data)
- [Testing Tinybird Pipes Locally](#testing-tinybird-pipes-locally)
- [Creating Backups](#creating-a-backup-datasource-in-tinybird)
- [Glossary](#glossary)

## Introduction

This directory contains documentation for the CDP and Tinybird integration. **Tinybird** is a real-time analytics database built on ClickHouse that powers our Insights platform with fast, scalable queries on community activity data.
### System Role

Tinybird sits between our Community Data Platform (CDP) backend and the Insights frontend:

1. **Data Ingestion**: Receives data from Postgres (via Sequin → Kafka → Kafka Connect)
2. **Processing**: Enriches, filters, and aggregates activity data using two parallel architectures (Bucketing and Lambda)
3. **Serving**: Provides fast API endpoints for the Insights dashboard and other consumers
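The three stages above can be sketched end to end in miniature. The record shape, field names (`memberId`, `segmentId`, `isBot`), and the bot-filtering rule below are illustrative assumptions, not the actual Tinybird schema:

```python
# Hypothetical sketch of the enrich -> filter -> aggregate processing stage.
# Field names are made up for illustration; the real schema lives in the
# Tinybird datasource definitions.

def enrich(activity, members):
    # Attach member attributes to the raw activity record.
    member = members.get(activity["memberId"], {})
    return {**activity, "isBot": member.get("isBot", False)}

def process(activities, members):
    # Enrich every activity, drop bot activity, then aggregate per segment.
    enriched = [enrich(a, members) for a in activities]
    kept = [a for a in enriched if not a["isBot"]]
    counts = {}
    for a in kept:
        counts[a["segmentId"]] = counts.get(a["segmentId"], 0) + 1
    return counts

members = {"m1": {"isBot": False}, "m2": {"isBot": True}}
activities = [
    {"memberId": "m1", "segmentId": "s1"},
    {"memberId": "m2", "segmentId": "s1"},  # bot activity, dropped by the filter
    {"memberId": "m1", "segmentId": "s2"},
]
print(process(activities, members))  # {'s1': 1, 's2': 1}
```

In production these steps run as Tinybird pipes over the Kafka-ingested datasources; the sketch only shows the shape of the transformation.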
## Journey of Data from CDP to Insights

See [dataflow](./dataflow.md) for a visual diagram showing how data flows from the CDP backend through Tinybird to Insights.

## Architecture Overview

We use **two parallel architectures** to process activityRelations data:
### Lambda Architecture

The Lambda architecture serves two purposes:

1. Deduplicates activityRelations without any filtering. Mainly consumed by CDP and monitoring pipes. Output: `activityRelations_enriched_deduplicated_ds`
2. Ingests pull request event data and merges it with existing events. Output: `pull_requests_analyzed`

For details, see [lambda-architecture.md](./lambda-architecture.md).

- Filtering: UNFILTERED (includes bots, all activities)
- Used by: Pull requests, CDP pipes, monitoring
- Details: Lambda architecture pattern for deduplication, enrichment, and pull request processing
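Because each copy run appends a full snapshot, readers of the Lambda output keep only the rows tagged with the latest `snapshotId`. A toy sketch of that query pattern, with made-up rows rather than the production pipe SQL:

```python
# Append-mode snapshots: every copy run writes a complete new snapshot,
# and consumers filter by max(snapshotId) to see only the latest one.
rows = [
    {"activityId": "a1", "snapshotId": 100},
    {"activityId": "a2", "snapshotId": 100},
    {"activityId": "a1", "snapshotId": 101},  # newer snapshot of a1
    {"activityId": "a2", "snapshotId": 101},
    {"activityId": "a3", "snapshotId": 101},
]

latest = max(r["snapshotId"] for r in rows)
current = [r for r in rows if r["snapshotId"] == latest]
print(sorted(r["activityId"] for r in current))  # ['a1', 'a2', 'a3']
```

The 6-hour TTL noted in the comparison table bounds how many of these stale snapshots accumulate before ClickHouse drops them.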
### Bucketing Architecture

Produces filtered data (10 buckets) for Insights API queries. For details, see [bucketing-architecture.md](./bucketing-architecture.md).

- Output: `activityRelations_deduplicated_cleaned_bucket_*_ds` (10 buckets)
- Filtering: FILTERED (valid members, enabled repos only)
- Used by: Insights API queries
- Details: Hash-based bucketing architecture for parallel processing
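Hash-based routing to the 10 buckets can be sketched as follows. The choice of `segmentId` as the routing key and CRC32 as the hash are assumptions for illustration; the real pipes define their own key and ClickHouse hash function:

```python
import zlib

NUM_BUCKETS = 10

def bucket_for(segment_id: str) -> int:
    # Stable hash of the routing key, modulo the bucket count.
    # (Illustrative: the production pipes use a ClickHouse hash function.)
    return zlib.crc32(segment_id.encode()) % NUM_BUCKETS

def datasource_for(segment_id: str) -> str:
    # Map a segment to the bucket datasource that holds its rows.
    return f"activityRelations_deduplicated_cleaned_bucket_{bucket_for(segment_id)}_ds"

# The same segment always routes to the same bucket, so a project-specific
# widget only ever needs to query a single bucket datasource.
print(datasource_for("segment-42"))
```

This is what makes the architecture horizontally scalable: adding buckets only changes the modulus, and each bucket can be refreshed in parallel.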
### Comparison

The following table compares the two parallel architectures processing activityRelations data:

| Aspect | Lambda Architecture | Bucketing Architecture |
|--------|---------------------|------------------------|
| **Primary Use Case** | Pull requests, CDP, monitoring, member management | Insights API queries |
| **Output Datasource** | `pull_requests_analyzed`, `activityRelations_enriched_deduplicated_ds` | `activityRelations_deduplicated_cleaned_bucket_0-9_ds` |
| **Data Filtering** | UNFILTERED (includes bots, all repos) | FILTERED (valid members, enabled repos) |
| **Partitioning Strategy** | Single datasource, snapshot-based | 10 parallel buckets, hash-based |
| **Copy Mode** | Append (creates new snapshots) | Replace (hourly full refresh) |
| **Query Pattern** | Filter by max(snapshotId) | Union all buckets or route to specific bucket |
| **TTL** | 6 hours (keeps ~6 snapshots) | No TTL on buckets (replace mode) |
| **Scalability** | Vertical (single large datasource) | Horizontal (add more buckets) |
| **Dependencies** | Single-table triggers work well | Multi-table dependencies (members, repos) |
**Which activityRelations output to use:**

- **Use Bucketing Architecture output** (`activityRelations_deduplicated_cleaned_bucket_*_ds`) for the Insights API, project-specific analytics, and filtered queries. Each bucket contains a subset of project data, so the main use case is project-specific widgets.

- **Use Lambda Architecture output** (`activityRelations_enriched_deduplicated_ds`) for CDP operations, monitoring, and any use case that requires complete, unfiltered data and therefore cannot use the buckets.
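A query that is not scoped to a single project has to read every bucket. One way to picture the "union all buckets" pattern from the comparison table is to generate the UNION over all ten datasources; this is a sketch of the query shape, not a real pipe definition:

```python
NUM_BUCKETS = 10
TEMPLATE = "SELECT * FROM activityRelations_deduplicated_cleaned_bucket_{i}_ds"

def union_all_buckets() -> str:
    # Build a UNION ALL over every bucket datasource, for queries that
    # cannot be routed to one specific bucket.
    return "\nUNION ALL\n".join(TEMPLATE.format(i=i) for i in range(NUM_BUCKETS))

sql = union_all_buckets()
print(sql.count("UNION ALL"))  # 9 separators joining 10 SELECTs
```

Project-scoped queries skip this entirely and hit the single bucket that owns the project's hash, which is why the bucketed output is the right fit for project-specific widgets.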
## Making changes to resources

1. Install the **tb client** for classic Tinybird
@@ -90,7 +157,7 @@ GRANT SELECT ON "tableName" to sequin; |
Switching between old and new datasources can lead to **temporary downtime**, but only for **endpoint pipes that consume raw datasources directly**.

**No downtime** if the endpoint pipe uses a **copy pipe result**:
- You can safely remove the raw datasource after stopping the copy pipe
- The copy pipe result datasource will continue to serve data
- New fields will be included in the **next copy run**
@@ -270,3 +337,17 @@ tb sql "SELECT count() FROM activities_backup FINAL" |
- (3) = (4) → same number of logical records after deduplication

If both pairs match, the backup is **logically consistent** with the source dataset.
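The pairwise comparison can be wrapped in a small checker. The numbers below are placeholders standing in for the four `count()` results returned by the `tb sql` queries above:

```python
def backup_is_consistent(src_raw: int, bak_raw: int,
                         src_final: int, bak_final: int) -> bool:
    # Raw counts must match (same physical rows were copied) AND the
    # FINAL counts must match (same logical records after deduplication).
    return src_raw == bak_raw and src_final == bak_final

# Placeholder counts for illustration only.
print(backup_is_consistent(1_000, 1_000, 900, 900))  # True: both pairs match
print(backup_is_consistent(1_000, 1_000, 900, 850))  # False: FINAL counts diverge
```

Checking both pairs matters: matching raw counts alone could hide duplicate or missing row versions that only surface after ClickHouse's FINAL deduplication.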
## Glossary

- **CDP (Community Data Platform)**: Customer data operations and management pipelines
- **Tinybird**: Real-time analytics database built on ClickHouse, used for fast query processing
- **Datasource**: A Tinybird table where data is stored (analogous to database tables)
- **Pipe**: A Tinybird SQL query that can be scheduled or materialized
- **MV (Materialized View)**: A pipe that triggers automatically on INSERT to a datasource
- **Copy Pipe**: A scheduled pipe that copies/transforms data from one datasource to another
- **Sequin**: Database replication tool that streams Postgres changes to Kafka
- **Insights**: The frontend analytics interface for community data
- **segmentId**: Unique identifier for a project/community segment
- **snapshotId**: Timestamp identifier used for deduplication and versioning in the Lambda architecture