
Commit c181a2f

feat: activityRelations buckets for subset of projects (#3675)
1 parent 57b2d74 commit c181a2f

File tree: 170 files changed (+3283 / −548 lines)


services/libs/tinybird/README.md

Lines changed: 87 additions & 6 deletions
@@ -1,11 +1,78 @@
# Tinybird Documentation

## Table of Contents

- [Introduction](#introduction)
- [Journey of Data](#journey-of-data-from-cdp-to-insights)
- [Making Changes to Resources](#making-changes-to-resources)
- [How to Iterate on Data](#how-to-iterate-on-data)
- [Testing Tinybird Pipes Locally](#testing-tinybird-pipes-locally)
- [Creating Backups](#creating-a-backup-datasource-in-tinybird)
- [Glossary](#glossary)
## Introduction

This directory documents the CDP and Tinybird integration. **Tinybird** is a real-time analytics database built on ClickHouse that powers our Insights platform with fast, scalable queries on community activity data.

### System Role

Tinybird sits between our Community Data Platform (CDP) backend and the Insights frontend:

1. **Data Ingestion**: Receives data from Postgres (via Sequin → Kafka → Kafka Connect)
2. **Processing**: Enriches, filters, and aggregates activity data using two architectures (Bucketing & Lambda)
3. **Serving**: Provides fast API endpoints for the Insights dashboard and other consumers
## Journey of Data from CDP to Insights

See [dataflow](./dataflow.md) for a visual diagram showing how data flows from CDP Backend through Tinybird to Insights.

## Architecture Overview

We use **two parallel architectures** to process activityRelations data:
### Lambda Architecture

1. Deduplicates activityRelations without any filtering. Mainly consumed by CDP and monitoring pipes. Output: `activityRelations_enriched_deduplicated_ds`
2. Ingests pull request event data and merges it with existing events. Output: `pull_requests_analyzed`

For details see [lambda-architecture.md](./lambda-architecture.md)

- Filtering: UNFILTERED (includes bots, all activities)
- Used by: Pull requests, CDP pipes, monitoring
- Details: Lambda architecture pattern for deduplication, enrichment, and pull request processing
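Because the lambda output is snapshot-based and appends new snapshots on each copy run, consumers typically pin their reads to the latest snapshot. A hedged sketch (the column list is illustrative; only the datasource and `snapshotId` names come from this document):

```sql
-- Read only the most recent snapshot; older snapshots age out via the 6-hour TTL
SELECT *
FROM activityRelations_enriched_deduplicated_ds
WHERE snapshotId = (
    SELECT max(snapshotId)
    FROM activityRelations_enriched_deduplicated_ds
)
```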
### Bucketing Architecture
Produces filtered data (10 buckets) for Insights API queries. For details see [bucketing-architecture.md](./bucketing-architecture.md)

- Output: `activityRelations_deduplicated_cleaned_bucket_*_ds` (10 buckets)
- Filtering: FILTERED (valid members, enabled repos only)
- Used by: Insights API queries
- Details: Hash-based bucketing architecture for parallel processing
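Hash-based routing means a project-specific query only needs to touch one bucket. The following is a sketch, not the actual pipe: the hash function, the bucketing key, and the `{{ String(...) }}` parameter usage are assumptions here.

```sql
-- Hypothetical routing: a segment hashes to one of the 10 buckets,
-- e.g. cityHash64(segmentId) % 10 = 3, so only bucket 3 is queried
SELECT *
FROM activityRelations_deduplicated_cleaned_bucket_3_ds
WHERE segmentId = {{ String(segmentId) }}
```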
### Comparison
The following table compares the two parallel architectures processing activityRelations data:

| Aspect | Lambda Architecture | Bucketing Architecture |
|--------|---------------------|------------------------|
| **Primary Use Case** | Pull requests, CDP, monitoring, member management | Insights API queries |
| **Output Datasource** | `pull_requests_analyzed`, `activityRelations_enriched_deduplicated_ds` | `activityRelations_deduplicated_cleaned_bucket_0-9_ds` |
| **Data Filtering** | UNFILTERED (includes bots, all repos) | FILTERED (valid members, enabled repos) |
| **Partitioning Strategy** | Single datasource, snapshot-based | 10 parallel buckets, hash-based |
| **Copy Mode** | Append (creates new snapshots) | Replace (hourly full refresh) |
| **Query Pattern** | Filter by max(snapshotId) | Union all buckets or route to specific bucket |
| **TTL** | 6 hours (keeps ~6 snapshots) | No TTL on buckets (replace mode) |
| **Scalability** | Vertical (single large datasource) | Horizontal (add more buckets) |
| **Dependencies** | Single-table triggers work well | Multi-table dependencies (members, repos) |
**Which activityRelations output to use:**
- **Use Bucketing Architecture output** (`activityRelations_deduplicated_cleaned_bucket_*_ds`) for: Insights API, project-specific analytics, and filtered queries. Since each bucket contains a subset of project data, the main use case is project-specific widgets.
- **Use Lambda Architecture output** (`activityRelations_enriched_deduplicated_ds`) for: CDP operations, monitoring, and any use case that requires complete, unfiltered data where the buckets cannot be used.
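When a query must span all projects but still needs the filtered data, the buckets can be combined with `UNION ALL`. A hedged sketch (abbreviated; only the datasource naming pattern comes from this document):

```sql
-- Combine the bucketed datasources for a cross-project query
SELECT * FROM activityRelations_deduplicated_cleaned_bucket_0_ds
UNION ALL
SELECT * FROM activityRelations_deduplicated_cleaned_bucket_1_ds
-- ... repeat for buckets 2 through 9
```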

## Making changes to resources
1. Install the **tb client** for classic tinybird
@@ -90,7 +157,7 @@ GRANT SELECT ON "tableName" to sequin;
Switching between old and new datasources can lead to **temporary downtime**, but only for **endpoint pipes that consume raw datasources directly**.

**No Downtime** if the endpoint pipe uses a **copy pipe result**:
- You can safely remove the raw datasource after stopping the copy pipe
- The copy pipe result datasource will continue to serve data
- New fields will be included in the **next copy run**

@@ -270,3 +337,17 @@ tb sql "SELECT count() FROM activities_backup FINAL"
- (3) = (4) → same number of logical records after deduplication

If both pairs match, the backup is **logically consistent** with the source dataset.
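The pair checks above can be run with plain count queries. In this sketch, `activities_backup` appears in the backup commands earlier in this guide, while `activities` as the source datasource name is an assumption:

```sql
-- Pair 1: raw row counts should match (source datasource name assumed)
SELECT count() FROM activities;
SELECT count() FROM activities_backup;

-- Pair 2: logical record counts after FINAL deduplication should also match
SELECT count() FROM activities FINAL;
SELECT count() FROM activities_backup FINAL;
```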
## Glossary

- **CDP (Community Data Platform)**: Customer data operations and management pipelines
- **Tinybird**: Real-time analytics database built on ClickHouse, used for fast query processing
- **Datasource**: A Tinybird table where data is stored (analogous to database tables)
- **Pipe**: A Tinybird SQL query that can be scheduled or materialized
- **MV (Materialized View)**: A pipe that triggers automatically on INSERT to a datasource
- **Copy Pipe**: A scheduled pipe that copies/transforms data from one datasource to another
- **Sequin**: Database replication tool that streams Postgres changes to Kafka
- **Insights**: The frontend analytics interface for community data
- **segmentId**: Unique identifier for a project/community segment
- **snapshotId**: Timestamp identifier used for deduplication and versioning in the lambda architecture
