1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ dist/
.tox/
databricks_migration_tool.egg-info
migrate.iml
export_dir/
248 changes: 248 additions & 0 deletions custom/README.md
@@ -0,0 +1,248 @@
# Databricks UC Metastore Migration Guide

This guide provides step-by-step instructions for exporting and importing Unity Catalog (UC) metastore data between Databricks workspaces, including table metadata and access control lists (ACLs).

## Overview

This migration process allows you to:
- Export metastore metadata from a source workspace
- Update S3 paths and prefixes for cross-region migration
- Export table ACLs
- Import metastore and ACLs into a target workspace

## Prerequisites

1. **Clone the repository**
```bash
git clone -b updates https://github.com/arvindh-km/databricks-migrate.git
cd databricks-migrate
```
**Note:** Make sure to clone the `updates` branch as it contains the latest migration scripts and fixes.

2. **Install Databricks CLI**
- Using Homebrew: `brew install databricks-cli`
- Or using the curl-based install script (see the example below)
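For example, the install script documented for the Databricks CLI (verify the URL against the current Databricks documentation before running):
```bash
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```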

3. **Set up Python environment**
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Initial Setup

### Step 1: Configure Source Workspace

1. **Create a Personal Access Token (PAT)**
- Navigate to your source Databricks workspace
- Go to User Settings → Access Tokens
- Generate a new token and save it securely

2. **Configure Databricks CLI profile**
```bash
databricks configure --token --profile sg_dsp
```
- Replace `sg_dsp` with a profile name of your choice
- When prompted, enter your workspace host URL and PAT token

3. **Create migration clusters**
- Open `custom/create_clusters.py`
- Update the `profile` value to match your profile name
- Update the `email` field
- Run the script:
```bash
python3 custom/create_clusters.py
```
- **Important**: Wait for cluster creation to complete in the source workspace before proceeding
- If cluster creation fails due to workspace-specific init scripts or restrictions, update the script accordingly

## Export Process

### Step 2: Export Metastore

Run the following command to export the metastore:

```bash
python3 migration_pipeline.py \
--profile <your-profile> \
--set-export-dir <local-dir-path> \
--cluster-name "metastore-migrate-mti" \
--export-pipeline \
--use-checkpoint \
--num-parallel 10 \
--retry-total 3 \
--retry-backoff 2 \
--keep-tasks metastore \
--session run1 \
--skip-failed \
--metastore-unicode \
--repair-metastore-tables \
--database <schema-name>
```

**Parameters to update:**
- `<your-profile>`: Your Databricks CLI profile name
- `<local-dir-path>`: Local directory path (e.g., `/export_dir/`)
- `<schema-name>`: Schema/database name to export (leave empty for all schemas)
- `--cluster-name`: Should match the cluster created in Step 1

**Output:** Metastore data will be exported to `/export_dir/run1/metastore/`

### Step 3: Update S3 Paths and Prefixes

Before importing, you need to update S3 paths to point to the target region's buckets:

1. Open `custom/metastore_s3.py`
2. Update the following fields:
- `schemas`: Add all schemas you want to migrate
- `path`: Update to the metastore export path (e.g., `/export_dir/run1/metastore/`)
- `s3_buckets` dictionary: Update with equivalent region-specific bucket names
- `uc_prefix` dictionary: Update with required region-specific prefixes
3. Save the file and run:
```bash
python3 custom/metastore_s3.py
```

**What this does:**
- Updates S3 paths from source region (e.g., Singapore) to target region (e.g., Mumbai) buckets
- Removes UC-specific prefixes that prevent table creation for Managed Tables
- Applies any additional filters you configure for your bucket layout (a rough sketch of the rewrite is shown below)
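
For orientation, the rewrite amounts to a find-and-replace over the exported DDL files. A minimal sketch, assuming one DDL file per table under `<export-path>/<schema>/`; the bucket names and prefixes below are placeholders, not the values from `custom/metastore_s3.py`:

```python
import os

# Placeholder values -- see custom/metastore_s3.py for the real configuration
path = "/export_dir/run1/metastore/"
schemas = ["dsp"]
s3_buckets = {"s3://example-sg-bucket": "s3://example-mumbai-bucket"}            # source -> target buckets
uc_prefix = {"s3://example-mumbai-bucket/__uc/": "s3://example-mumbai-bucket/"}  # UC prefixes to strip

for schema in schemas:
    schema_dir = os.path.join(path, schema)
    for fname in os.listdir(schema_dir):
        fpath = os.path.join(schema_dir, fname)
        with open(fpath) as f:
            ddl = f.read()
        # Point table locations at the target-region buckets
        for src, dst in s3_buckets.items():
            ddl = ddl.replace(src, dst)
        # Strip UC-specific prefixes that block managed table creation
        for prefix, replacement in uc_prefix.items():
            ddl = ddl.replace(prefix, replacement)
        with open(fpath, "w") as f:
            f.write(ddl)
```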

### Step 4: Export Table ACLs

Export table access control lists:

```bash
python3 migration_pipeline.py \
--profile <your-profile> \
--set-export-dir <local-dir-path> \
--cluster-name "table-acls-migrate-mti" \
--export-pipeline \
--use-checkpoint \
--num-parallel 10 \
--retry-total 3 \
--retry-backoff 2 \
--keep-tasks metastore_table_acls \
--session run1 \
--skip-failed \
--metastore-unicode \
--repair-metastore-tables \
--database <schema-name>
```

**Note:** Users and groups do not need to be exported separately. The ACL export works without them.

**Output:** Table ACLs will be exported to `/export_dir/run1/table_acls/` as zip files (no need to decompress)

## Import Process

### Step 5: Configure Target Workspace

1. **Create PAT token in target workspace** (e.g., Mumbai workspace)
2. **Configure Databricks CLI profile for target workspace:**
```bash
databricks configure --token --profile <target-profile>
```
3. **Create clusters in target workspace:**
- Follow Step 1 instructions, but use the target workspace profile and PAT

### Step 6: Export and Import Catalog ACLs

1. **Export catalog ACLs from source workspace:**
- Clone the notebook `custom/export_catalog_acls.py` in your source workspace
- Update the `catalog` variable to match your catalog name (e.g., `'prod'`)
- Run the notebook to generate catalog ACL export
- Copy the JSON output containing the GRANT commands

2. **Import catalog ACLs to target workspace:**
- Clone the notebook `custom/import_catalog_acls.py` in your target workspace
- Replace the `grant_cmds` list with the output from the previous step
- Run all cells in the notebook to apply catalog ACLs
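
In essence, the import notebook executes each pasted GRANT statement with `spark.sql`. A minimal sketch of that step (the actual `custom/import_catalog_acls.py` may differ; the commands below are placeholders, and `spark` is the session predefined in Databricks notebooks):

```python
# Placeholder -- replace with the JSON output from export_catalog_acls.py
grant_cmds = [
    "GRANT USE CATALOG ON CATALOG prod TO `data-engineers`",
    "GRANT SELECT ON CATALOG prod TO `analyst@example.com`",
]

for cmd in grant_cmds:
    print(f"Applying: {cmd}")
    spark.sql(cmd)
```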

### Step 7: Export and Import Schemas

1. **Export schemas from source workspace:**
- Clone the notebook `custom/export_schema_s3.py` in your source workspace
- Update the `schemas_to_filter` list with the schemas you want to migrate
- Run the notebook to generate schema export
- Copy the JSON output containing schema creation commands

2. **Import schemas to target workspace:**
- Clone the notebook `custom/import_schema_s3.py` in your target workspace
- Update the schema map using the output from the previous step
- Run all cells in the notebook to create schemas
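
As with the catalog ACLs, the schema import boils down to running the exported commands. A minimal sketch (the actual `custom/import_schema_s3.py` may differ; the map below is a placeholder for the output copied from `export_schema_s3.py`):

```python
# Placeholder -- replace with the JSON output from export_schema_s3.py
schema_commands = {
    "data_science_prod": "CREATE SCHEMA IF NOT EXISTS prod.data_science_prod LOCATION 's3://example-mumbai-bucket/data_science_prod/'",
}

for schema, cmd in schema_commands.items():
    print(f"Creating schema: {schema}")
    spark.sql(cmd)
```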

### Step 8: Export and Import Schema ACLs

1. **Export schema ACLs from source workspace:**
- Clone the notebook `custom/export_schema_acls.py` in your source workspace
- Update the `schemas_to_filter` list with the schemas you want to export ACLs for (leave empty to export all schemas)
- Run the notebook to generate schema ACL export
- Copy the JSON output containing the schema ACL map

2. **Import schema ACLs to target workspace:**
- Clone the notebook `custom/import_schema_acls.py` in your target workspace
- Replace the `grant_cmds` dictionary with the JSON output from the previous step
- Run all cells in the notebook to apply schema ACLs
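
The schema ACL import follows the same pattern, iterating the per-schema grant lists. A minimal sketch (the actual `custom/import_schema_acls.py` may differ; values below are placeholders):

```python
# Placeholder -- replace with the JSON output from export_schema_acls.py
grant_cmds = {
    "dsp": [
        "GRANT USE SCHEMA ON SCHEMA dsp TO `data-engineers`",
        "GRANT SELECT ON SCHEMA dsp TO `analyst@example.com`",
    ],
}

for schema, grants in grant_cmds.items():
    for cmd in grants:
        print(f"Applying on {schema}: {cmd}")
        spark.sql(cmd)
```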

### Step 9: Import Metastore

Import the metastore into the target workspace:

```bash
python3 migration_pipeline.py \
--profile <target-profile> \
--set-export-dir <local-dir-path> \
--cluster-name "metastore-migrate-mti" \
--import-pipeline \
--use-checkpoint \
--num-parallel 8 \
--retry-total 3 \
--retry-backoff 2 \
--keep-tasks metastore \
--session run1 \
--skip-failed \
--metastore-unicode \
--repair-metastore-tables \
--database <schema-name>
```

**Important:**
- Use the target workspace profile (e.g., Mumbai workspace profile)
- Maintain the same export directory used during export
- Specify the schema you want to import
- This creates tables on top of the updated S3 paths

### Step 10: Import Table ACLs

Update table access control lists in the target workspace:

```bash
python3 migration_pipeline.py \
--profile <target-profile> \
--set-export-dir <local-dir-path> \
--cluster-name "metastore-migrate-mti" \
--import-pipeline \
--use-checkpoint \
--num-parallel 8 \
--retry-total 3 \
--retry-backoff 2 \
--keep-tasks metastore_table_acls \
--session run1 \
--skip-failed \
--metastore-unicode \
--repair-metastore-tables \
--database <schema-name>
```

**Parameters to update:**
- `<target-profile>`: Target workspace profile (e.g., Mumbai workspace profile)
- `<schema-name>`: Schema for which you want to update access controls

## Troubleshooting

- **Cluster creation failures**: Update `custom/create_clusters.py` with workspace-specific init scripts and restrictions
- **S3 path issues**: Verify bucket names and prefixes in `custom/metastore_s3.py` match your target region configuration
- **Import failures**: Ensure schemas are created in the target workspace before importing metastore

71 changes: 71 additions & 0 deletions custom/create_clusters.py
@@ -0,0 +1,71 @@
# Creates migration clusters in Databricks workspace for metastore and table ACLs migration
import os
import subprocess
import json

# Update these values before running
profile = 'dsp'
email = "arvindh.km@swiggy.in"

# Cluster configuration for metastore migration
cluster_config_for_metastore = {
    "cluster_name": "metastore-migrate-mti",
    "spark_version": "13.3.x-scala2.12",
    "aws_attributes": {
        "first_on_demand": 0,
        "availability": "ON_DEMAND",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
        "ebs_volume_count": 0
    },
    "node_type_id": "m6gd.large",
    "driver_node_type_id": "r6gd.xlarge",
    "autotermination_minutes": 30,
    "enable_elastic_disk": True,
    "single_user_name": email,
    "enable_local_disk_encryption": False,
    "data_security_mode": "DATA_SECURITY_MODE_DEDICATED",
    "runtime_engine": "STANDARD",
    "kind": "CLASSIC_PREVIEW",
    "is_single_node": False,
    "autoscale": {
        "min_workers": 1,
        "max_workers": 5
    },
    "apply_policy_default_values": False
}

# --no-wait returns immediately after submitting the request; confirm in the workspace UI that the cluster was created before starting the migration
result = subprocess.run(['databricks', '--profile', profile, 'clusters', 'create', '--json', json.dumps(cluster_config_for_metastore), '--no-wait'])

print(result.returncode)

# Cluster configuration for table ACLs migration
cluster_config_for_table_acls = {
    "cluster_name": "table-acls-migrate-mti",
    "spark_version": "13.3.x-scala2.12",
    "aws_attributes": {
        "first_on_demand": 0,
        "availability": "ON_DEMAND",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
        "ebs_volume_count": 0
    },
    "node_type_id": "m6gd.large",
    "driver_node_type_id": "r6gd.xlarge",
    "autotermination_minutes": 30,
    "enable_elastic_disk": True,
    "enable_local_disk_encryption": False,
    "data_security_mode": "DATA_SECURITY_MODE_STANDARD",
    "runtime_engine": "STANDARD",
    "kind": "CLASSIC_PREVIEW",
    "is_single_node": False,
    "autoscale": {
        "min_workers": 1,
        "max_workers": 5
    },
    "apply_policy_default_values": False
}

result = subprocess.run(['databricks', '--profile', profile, 'clusters', 'create', '--json', json.dumps(cluster_config_for_table_acls), '--no-wait'])

print(result.returncode)
15 changes: 15 additions & 0 deletions custom/export_catalog_acls.py
@@ -0,0 +1,15 @@
# Exports catalog ACLs as GRANT commands in JSON format
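# Intended to run as a Databricks notebook, where `spark` is predefined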
import json

# Update catalog name as needed
catalog = 'prod'

result = []

for grant in spark.sql(f"SHOW GRANT ON catalog {catalog}").collect():
    action = grant.ActionType
    principal = grant.Principal
    # Backtick-quote the principal so emails and group names with special characters parse on import
    grant_cmd = f"GRANT {action} ON CATALOG {catalog} TO `{principal}`"
    result.append(grant_cmd)

print(json.dumps(result, indent=4))
19 changes: 19 additions & 0 deletions custom/export_schema_acls.py
@@ -0,0 +1,19 @@
'''
Export schema ACLs from source workspace to a JSON.
To filter particular schemas, list the schema names (without the catalog) in schemas_to_filter; if the list is empty, all schemas are exported.
'''
import json

schemas = [i.databaseName for i in spark.sql('show schemas in prod').collect()]

schemas_to_filter = ['dsp']

# Only filter when the list is non-empty; an empty list exports all schemas
if schemas_to_filter:
    schemas = [schema for schema in schemas if schema in schemas_to_filter]

schema_acl_map = {}

# Assumes the notebook's current catalog is the source catalog (e.g., prod)
for schema in schemas:
    grants = spark.sql(f"SHOW GRANT ON SCHEMA {schema}").collect()
    schema_grants = [
        f"GRANT {grant.ActionType} ON SCHEMA {schema} TO `{grant.Principal}`"
        for grant in grants
        if grant.ObjectType == 'SCHEMA'
    ]
    schema_acl_map[schema] = schema_grants

print(json.dumps(schema_acl_map, indent=4))
21 changes: 21 additions & 0 deletions custom/export_schema_s3.py
@@ -0,0 +1,21 @@
# Exports schema creation commands with S3 locations
import json

schemas = [i.databaseName for i in spark.sql('show schemas in prod').collect()]

schema_commands = {}

# Update with schemas to export (empty list exports all schemas)
schemas_to_filter = ['data_science_prod']
if schemas_to_filter:
    schemas = [schema for schema in schemas if schema in schemas_to_filter]

for schema in schemas:
    schema_desc = spark.sql(f'describe schema {schema}').collect()
    catalog, name, location = None, None, None
    for i in schema_desc:
        if i.database_description_item == 'Catalog Name':
            catalog = i.database_description_value
        elif i.database_description_item == 'Namespace Name':
            name = i.database_description_value
        elif i.database_description_item == 'RootLocation':
            location = i.database_description_value
    schema_commands[schema] = f"CREATE SCHEMA IF NOT EXISTS {catalog}.{name} LOCATION '{location}'"

print(json.dumps(schema_commands, indent=4))