
Expose container-level containerProperties configuration (IPC mode, ulimits, sharedMemorySize, linuxParameters) in @batch decorator #2692

@aman5319


We are running Metaflow on AWS Batch and recently hit a limitation when deploying GPU-based Graph Neural Network workloads (PyG). PyG requires larger shared memory (/dev/shm) and specific ulimits to avoid Bus Errors during large graph allocations.

These require Docker flags such as:

--ipc=host
--ulimit memlock=-1
--ulimit stack=67108864
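For reference, here is a rough sketch of how these flags might translate into a containerProperties fragment for an AWS Batch job definition. The `ulimits` and `linuxParameters.sharedMemorySize` fields are part of the Batch job definition schema; the helper name and the 8192 MiB value are illustrative assumptions only. I am not aware of a direct equivalent of --ipc=host inside containerProperties, so the sketch raises the /dev/shm size instead.

```python
def docker_flags_to_container_properties(shm_mib, ulimits):
    """Hypothetical helper: build a containerProperties fragment.

    shm_mib -- size of /dev/shm in MiB (stands in for --ipc=host / --shm-size)
    ulimits -- mapping of ulimit name to limit, e.g. {"memlock": -1}
    """
    return {
        "linuxParameters": {"sharedMemorySize": shm_mib},
        "ulimits": [
            # Docker's --ulimit name=value sets soft and hard limit to the same value.
            {"name": name, "softLimit": limit, "hardLimit": limit}
            for name, limit in ulimits.items()
        ],
    }


fragment = docker_flags_to_container_properties(
    shm_mib=8192,  # assumed value, pick whatever the workload needs
    ulimits={"memlock": -1, "stack": 67108864},
)
```

The resulting dictionary is what would need to end up in the job definition that Metaflow registers.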

This is the exact warning I get when running the workload:

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyG. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Currently, the @batch decorator does not expose any way to configure these container-level properties. From what I can see in batch_client.py, only max_swap and swappiness are exposed via linuxParameters, and there is no escape hatch to modify the full containerProperties.

This makes certain ML workloads (e.g., PyG, DGL, large CUDA graphs, RAPIDS) impossible to run without modifying Metaflow internals.

Requested Feature
Expose either of the following:

Option A: Direct kwargs pass-through
Allow @batch(container_properties=...) to merge the given values into the containerProperties of the AWS Batch job definition.

Option B: Escape hatch
Something like:

@batch(
    gpu=1,
    image="xyz",
    ecs_overrides={
        "containerProperties": {...}
    }
)
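Either option comes down to merging user-supplied values into the containerProperties that Metaflow already builds. A minimal sketch of such a merge, assuming a recursive dictionary merge where user values win on conflict; the function name and the sample dictionaries are hypothetical and do not reflect Metaflow's current internals:

```python
import copy


def merge_container_properties(base, overrides):
    """Recursively merge overrides into a copy of base.

    Nested dicts are merged key by key; any other value (including lists
    such as ulimits) replaces the base value outright.
    """
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_container_properties(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Stand-in for what the decorator generates today (illustrative values).
base = {"image": "xyz", "linuxParameters": {"maxSwap": 0, "swappiness": 60}}

# Stand-in for what a user would pass through the new argument.
overrides = {
    "linuxParameters": {"sharedMemorySize": 8192},
    "ulimits": [{"name": "memlock", "softLimit": -1, "hardLimit": -1}],
}

merged = merge_container_properties(base, overrides)
```

With this approach the existing max_swap and swappiness settings survive, and the user only has to specify the fields they want to add or change. Whether lists should be replaced or appended is a design question for the API discussion.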

Why this matters
Graph-based workloads, high-memory CUDA ops, and frameworks like PyG require larger SHMEM and specific ulimits. Without these, Metaflow cannot support a class of real-world ML tasks that rely on Batch GPU compute.

I’m happy to submit a PR once there’s agreement on the API design.
