Dataplex is Google Cloud's intelligent data fabric that unifies distributed data across data lakes, data warehouses, and data marts. It provides centralized data management, governance, and discovery without moving or duplicating data.
Key capabilities:
🏗️ Organize data using Lakes → Zones → Assets hierarchy
🔍 Automatic metadata discovery and cataloging
✅ Data quality monitoring and validation
🔗 Data lineage tracking across BigQuery and Cloud Storage
🔐 Unified security and access control
This tutorial will guide you through setting up and using Dataplex to manage your data estate on GCP.
I am using the same BigQuery table as in the previous tutorial, bq-data-masking-example.
Data Catalog in Dataplex provides a unified discovery platform that helps both technical and non-technical users quickly find and access data across the organization through searchable metadata. It helps maintain data quality consistency and regulatory compliance while reducing unnecessary costs, and it lets organizations trace data lineage to understand where data originated, how it was transformed, and who used it. Users can add rich text descriptions to tables, assign data stewards for metadata management, and establish clear ownership to improve trust and confidence in data assets. Data Catalog also integrates with Sensitive Data Protection to automatically identify and tag sensitive data using tag templates, centralizing governance and reducing search friction across the organization.
Here you can find the key information about your data: entry type, platform, system, creation time, last modification time, labels, description, contacts, and fully qualified name. Here is an example for the users table under the bq_data_masking_demo dataset.
Overview lets you provide an additional description for the entry. Usage shows how the entry is being used. Aspects lets you attach structured metadata to the entry.
The tag count shows the tags associated with columns in the table; in our case, this table has 4 tags masking sensitive data.
Dataplex Universal Catalog makes it easier to understand and analyze your data by automatically profiling your BigQuery tables.
Profiling is like getting a detailed health report for your data. It gives you key statistics, such as common values, how the data is spread out (distribution), and how many entries are missing (null counts). This information speeds up your analysis.
Data profiling automatically detects sensitive information and lets you set access control policies. It recommends data quality check rules to ensure your data stays reliable.
Here is an example of the Data Profile of the users table.
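If you want to reproduce a few of these statistics yourself, the same kind of numbers can be computed directly in BigQuery. The query below is a minimal sketch against the users table; it only assumes the id and first_name columns shown later in this tutorial, so adjust the column names to your schema.

SELECT
  COUNT(*) AS row_count,
  COUNTIF(first_name IS NULL) AS first_name_null_count,
  ROUND(COUNTIF(first_name IS NULL) / COUNT(*) * 100, 2) AS first_name_null_pct,
  COUNT(DISTINCT first_name) AS first_name_distinct_count,
  APPROX_TOP_COUNT(first_name, 5) AS first_name_top_values,  -- most common values with their counts
  MIN(id) AS min_id,
  MAX(id) AS max_id
FROM `elt-project-482220.bq_data_masking_demo.users`;

The data profile scan computes these statistics per column automatically; the query is only meant to show what the numbers in the report correspond to.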
Data Lineage in Dataplex provides automated tracking of data flows across your Google Cloud environment. It visualizes how data moves between BigQuery tables, Cloud Storage buckets, and other GCP services, showing upstream sources and downstream consumers. This enables teams to understand data dependencies, assess the impact of schema changes, ensure compliance, and troubleshoot data quality issues by tracing data back to its origin.
In this example, I created additional tables and joined them to demonstrate data lineage and trace data origins. On the right, you can see the BigQuery queries used to join these tables. Here is an example:
CREATE OR REPLACE TABLE `elt-project-482220.bq_data_masking_demo.users_with_purchases` AS
SELECT
  u.id,
  u.first_name,
  u.last_name,
  p.item,
  p.amount,
  p.purchase_date
FROM `elt-project-482220.bq_data_masking_demo.users` AS u
INNER JOIN `elt-project-482220.bq_data_masking_demo.user_purchases` AS p
  ON u.id = p.user_id;

You can also view detailed information about each table, filter connections by column name, and use the upstream/downstream directions with time range filters.
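To show one more hop in the lineage graph, you could create a further downstream table that aggregates users_with_purchases, for example a roll-up per user. The table name purchase_totals_per_user below is just an illustrative choice, not something that exists in the demo dataset.

CREATE OR REPLACE TABLE `elt-project-482220.bq_data_masking_demo.purchase_totals_per_user` AS
SELECT
  id,
  first_name,
  last_name,
  COUNT(*) AS purchase_count,   -- number of purchases per user
  SUM(amount) AS total_amount   -- total spend per user
FROM `elt-project-482220.bq_data_masking_demo.users_with_purchases`
GROUP BY id, first_name, last_name;

After the job runs, the lineage view should show users and user_purchases upstream of users_with_purchases, with the new roll-up table downstream of it.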
Use a business glossary to establish a standardized vocabulary for your data assets, which reduces ambiguity and improves data discovery and governance across your organization. By creating a common language for data with the Dataplex Universal Catalog business glossary, you can achieve the following:
Define a clear hierarchy of business categories and terms.
Link concepts using synonyms and show relationships between terms.
Search for data resources based on business concepts, not just technical names.
Dataplex Universal Catalog business glossary helps streamline data discovery and reduce ambiguity, resulting in better governance, more accurate analysis, and faster insights.
Here is an example of a 'Customer Transaction Data Glossary'.
When you connect terms to tables, the linked tables appear under Related entries for the term.
The terms, in turn, appear under Glossary Terms in the table's Details view in Dataplex.
Business terms can also be attached to columns under the table schema in Dataplex.
Dataplex organizes data using a hierarchical structure: Lakes contain Zones, and Zones contain Assets (data stored in BigQuery or Cloud Storage).
Lake: A lake is a logical domain that groups related data together based on business function, department, or data domain (e.g., "Customer Data Lake", "Finance Lake", "Marketing Lake"). Lakes provide:
High-level organizational boundaries
Centralized governance and access control
A way to manage related datasets as a unified domain
Zone: Zones are subdivisions within a lake that organize data by processing stage, data quality level, or functional area. Common zone patterns include:
Raw Zone - Ingested data in original format, unprocessed
Curated Zone - Cleaned, validated, and transformed data
Processed/Analytics Zone - Business-ready data for analytics and reporting
Zones can be either:
Raw zones - contain unstructured or semi-structured data (typically Cloud Storage)
Curated zones - contain structured, quality-controlled data (typically BigQuery tables)
Asset: Assets are the actual data resources (BigQuery datasets/tables or Cloud Storage buckets) attached to zones.
Benefits:
Clear data lifecycle management (raw → curated → analytics)
Consistent governance policies applied at lake or zone level
Better data discovery and organization
Simplified access control management
As an example, a lake named My data mesh was created here.
Then two zones were added: a Curated Zone for the BigQuery dataset and a Raw Zone for the GCS bucket.
There are options to see the details of the Zone and send alerts based on specific rules.
Dataplex Universal Catalog lets you define and measure the quality of the data in your BigQuery tables. You can automate the data scanning, validate data against defined rules, and log alerts if your data doesn't meet quality requirements. Auto data quality lets you manage data quality rules and deployments as code, improving the integrity of data production pipelines.
You can use predefined quality rules or build custom rules.
Dataplex Universal Catalog provides monitoring, troubleshooting, and Cloud Logging alerting that's integrated with auto data quality. More about Data Quality scans here.
Creating and using a data quality scan consists of the following steps:
Define data quality rules
Configure rule execution
Analyze data quality scan results
Set up monitoring and alerting
Troubleshoot data quality failures
Here is an example where we scan the users table. First, there is an option to schedule the scan.
Then we have to configure validation rules. Use SQL to create your own rules or use built-in rule types.
Here is an example of built-in rules.
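For custom rules, a SQL row condition is a boolean expression that every row must satisfy, while a SQL assertion is a query that must return zero rows for the rule to pass. The sketch below is only an illustration (a not-null check and a uniqueness check); the only column it assumes is id from the demo table, and here the scanned table is referenced by its full name.

-- SQL row condition: a row passes when this expression evaluates to TRUE
id IS NOT NULL

-- SQL assertion: the rule fails if this query returns any rows (duplicate ids here)
SELECT id
FROM `elt-project-482220.bq_data_masking_demo.users`
GROUP BY id
HAVING COUNT(*) > 1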
It is possible to export scan results to a BigQuery table, as well as receive notification reports via email.
All details and rules are visible inside the scan.
Each scan is visible in Dataplex.
Each job provides detailed results.
Each scan is visible in the Dataplex Data Quality view for the table.
Results are published in a BQ table as well.
And here is the AutoDQ email report.
