Cannot find the correct attributes or methods to get the correct values of cropping #4820

AntoineFachez · 2025-11-28T19:47:33Z

AntoineFachez
Nov 28, 2025

Hi there!

I am trying to get the data using the script below, but I am still struggling to get the crop values of cropped images in PDFs that were exported by Mac Pages or Mac Keynote.

Any ideas where to point me?

Goal/Original:

Current state:

```python
import fitz
import math

def rgb_to_hex(rgb_tuple):
    # Converts (r, g, b) float tuple (0-1) to #RRGGBB hex string
    if not rgb_tuple or len(rgb_tuple) < 3:
        return None
    try:
        r, g, b = [max(0, min(255, int(c * 255))) for c in rgb_tuple[:3]]
        return f"#{r:02x}{g:02x}{b:02x}"
    except (TypeError, ValueError):
        print(f"Warning: Could not convert color tuple {rgb_tuple} to hex.")
        return None

def get_rotation_from_matrix(matrix):
    """Calculates rotation angle in degrees (0, 90, 180, 270) from a fitz.Matrix."""
    if not isinstance(matrix, fitz.Matrix):
        return 0
    a, b, _, _, _, _ = matrix  # Only need 'a' and 'b' for rotation

    # Use atan2 for robust angle calculation
    angle = math.degrees(math.atan2(b, a))

    # Normalize and snap to the nearest 90-degree angle
    final_angle = round(angle / 90) * 90 % 360
    return int(final_angle)

def process_shapes_and_lines(page, z_counter):
    """
    Extracts vector drawings (shapes and lines) from a page.
    This version correctly identifies thin filled rectangles as lines.
    Returns a list of vector elements and the updated z_counter.
    """
    vector_elements = []
    page_area = page.rect.get_area()
    drawings = page.get_drawings()

    for path in drawings:
        bbox = fitz.Rect(path.get("rect", (0, 0, 0, 0)))
        if not bbox.is_valid or bbox.is_empty:
            continue

        # Filter out large white background rectangles
        is_white_background = (path.get("fill") == (1.0, 1.0, 1.0) and
                               path.get("fill_opacity", 1.0) == 1.0 and
                               bbox.get_area() > page_area * 0.90)
        if is_white_background:
            continue

        # Filter out clipping paths (they aren't visible elements)
        if path.get("clip"):
            continue

        element_data = {
            "position": {'x0': bbox.x0, 'y0': bbox.y0, 'x1': bbox.x1, 'y1': bbox.y1},
            "zIndex": z_counter
        }

        is_line = False
        is_shape = False

        stroke_color = rgb_to_hex(path.get("color"))
        line_width = path.get("width", 0)

        # 1. CHECK FOR "STROKED" LINES
        if stroke_color and line_width > 0:

            if bbox.height < max(line_width * 2, 2) and bbox.width > bbox.height * 3:
                element_data.update({"type": "line", "strokeColor": stroke_color, "strokeWidth": line_width})
                is_line = True
            elif bbox.width < max(line_width * 2, 2) and bbox.height > bbox.width * 3:
                element_data.update({"type": "line", "strokeColor": stroke_color, "strokeWidth": line_width})
                is_line = True

        # 2. CHECK FOR "FILLED" ELEMENTS (Shapes & Filled Rectangular Lines)
        fill_color = rgb_to_hex(path.get("fill"))
        fill_opacity = path.get("fill_opacity", 1.0)

        if not is_line and fill_color:

            if bbox.height < 2.0 and bbox.width > bbox.height * 3:
                element_data.update({"type": "line", "strokeColor": fill_color, "strokeWidth": bbox.height})
                is_line = True

            elif bbox.width < 2.0 and bbox.height > bbox.width * 3:
                element_data.update({"type": "line", "strokeColor": fill_color, "strokeWidth": bbox.width})
                is_line = True

            else:
                element_data.update({"type": "shape", "backgroundColor": fill_color, "opacity": fill_opacity})
                is_shape = True

        # 3. ADD THE ELEMENT TO THE LIST
        if is_line or is_shape:
            vector_elements.append(element_data)
            z_counter += 1

    return vector_elements, z_counter

def process_images(doc, page, image_bucket, document_id, z_counter):
    """
    Extracts raster images from a page, uploads them, and determines
    the final visually cropped bounds and transformation data.

    The visually cropped bounds are determined by page.get_image_rects,
    which respects PDF clipping paths.
    """
    image_elements = []
    page_cropbox = page.cropbox
    image_info_list = page.get_image_info(xrefs=True)

    # Get the visual bounding boxes of all drawn images (clipped)
    # bboxlog returns (type, rect) tuples in drawing order.
    # We filter for 'fill-image'.
    try:
        bbox_log = page.get_bboxlog()
        # bbox_log returns (type, rect_tuple). We need to convert rect_tuple to fitz.Rect.
        visible_image_rects = [fitz.Rect(r) for t, r in bbox_log if t == "fill-image"]
    except Exception as e:
        print(f"Warning: Could not get bboxlog: {e}")
        visible_image_rects = []

    # Check for images used as fills in drawings (common in Mac Pages)
    image_fill_rects = {}  # Map xref -> list of rects
    # Also collect explicit clipping paths
    clipping_paths = []

    try:
        drawings = page.get_drawings()
        for draw in drawings:
            # 1. Check for image fills
            if "fill_images" in draw and draw["fill_images"]:
                for xref in draw["fill_images"]:
                    if xref not in image_fill_rects:
                        image_fill_rects[xref] = []
                    image_fill_rects[xref].append(fitz.Rect(draw["rect"]))

            # 2. Check for clipping paths
            if draw["type"] == "clip":
                clipping_paths.append(fitz.Rect(draw["rect"]))

    except Exception as e:
        print(f"Warning: Could not get drawings: {e}")

    for img_index, info in enumerate(image_info_list):
        xref = info["xref"]
        if not xref:
            continue

        smask = info.get("smask", 0)

        # --- 1. Image Bytes Extraction and Transparency Handling ---
        try:
            image_ext = "png"  # Default ext if transparency is involved
            base_image = doc.extract_image(xref)
            image_bytes = None

            if smask > 0:
                # Handle images with transparency masks
                mask_image = doc.extract_image(smask)
                pix1 = fitz.Pixmap(doc, xref)
                mask = fitz.Pixmap(mask_image["image"])
                pix_with_mask = fitz.Pixmap(pix1, mask)
                image_bytes = pix_with_mask.tobytes("png")
                pix1 = mask = pix_with_mask = None
            else:
                image_bytes = base_image["image"]
                image_ext = base_image["ext"]

            if not image_bytes:
                print(f"Warning: Image xref {xref} has empty bytes. Skipping.")
                continue

            # 3. Extract RAW image metadata to get original dimensions
            base_img = doc.extract_image(xref)
            orig_w = 0
            orig_h = 0
            if base_img:
                orig_w = base_img["width"]
                orig_h = base_img["height"]

                # Check for Aspect Ratio Distortion
                bbox_rect = fitz.Rect(info["bbox"])
                if bbox_rect.width > 0 and bbox_rect.height > 0 and orig_w > 0 and orig_h > 0:
                    orig_ratio = orig_w / orig_h
                    display_ratio = bbox_rect.width / bbox_rect.height

                    if abs(orig_ratio - display_ratio) > 0.1:
                        print(f"    • ⚠️ DISTORTION DETECTED for Image {img_index} (Ref: {xref}):")
                        print(f"      Orig: {orig_w}x{orig_h} (Ratio: {orig_ratio:.2f})")
                        print(f"      Display: {bbox_rect.width:.1f}x{bbox_rect.height:.1f} (Ratio: {display_ratio:.2f})")

            # --- 2. Determine Effective Crop Rectangle (Visual Bounds) ---
            full_pos_rect = fitz.Rect(info["bbox"])

            # Try to find the matching clipped rect
            effective_crop_rect = full_pos_rect

            # Priority 1: Check if image is a fill for a drawing (Mac Pages style)
            # Handle multiple occurrences of the same image xref
            found_fill_match = False
            if xref in image_fill_rects:
                # Find the fill rect that best overlaps with the current image bbox
                best_fill = None
                best_overlap_area = 0

                for fill_rect in image_fill_rects[xref]:
                    # Calculate intersection with the reported bbox
                    inter = full_pos_rect.intersect(fill_rect)
                    area = inter.get_area()

                    # We look for significant overlap
                    if area > best_overlap_area:
                        best_overlap_area = area
                        best_fill = fill_rect

                # If we found a good match (overlap > 0), use it
                if best_fill and best_overlap_area > 0:
                    effective_crop_rect = best_fill
                    found_fill_match = True
                    # print(f"  -> Found match in drawings (fill): {effective_crop_rect}")

            if not found_fill_match:
                # Priority 2: Check for intersecting clipping paths
                # Heuristic: If a clip rect intersects the image, it's likely masking it.
                best_clip = None
                best_clip_area = 0

                for clip_rect in clipping_paths:
                    if clip_rect.intersects(full_pos_rect):
                        # Calculate intersection area
                        inter = full_pos_rect.intersect(clip_rect)
                        area = inter.get_area()
                        if area > best_clip_area:
                            best_clip_area = area
                            best_clip = clip_rect

                if best_clip:
                    effective_crop_rect = full_pos_rect.intersect(best_clip)
                    # print(f"  -> Found match in clipping paths: {best_clip} -> Crop: {effective_crop_rect}")
                else:
                    # Fallback: intersect with page cropbox
                    effective_crop_rect = full_pos_rect.intersect(page_cropbox)

            if effective_crop_rect.is_empty:
                effective_crop_rect = full_pos_rect

            # Ensure crop is not larger than full (sanity check)
            effective_crop_rect = effective_crop_rect.intersect(full_pos_rect)

            # --- 3. Transformation/Rotation Logic ---
            rotation = 0
            is_flipped_horizontal = False
            is_flipped_vertical = False

            # Default values for layout
            unrotated_w = full_pos_rect.width
            unrotated_h = full_pos_rect.height
            center_x = (full_pos_rect.x0 + full_pos_rect.x1) / 2
            center_y = (full_pos_rect.y0 + full_pos_rect.y1) / 2

            transform_data = info.get("transform")
            if transform_data and len(transform_data) == 6:
                transform_matrix = fitz.Matrix(transform_data)

                # Check determinant for mirroring/flipping (negative determinant = mirror)
                det = transform_matrix.a * transform_matrix.d - transform_matrix.b * transform_matrix.c
                if det < 0:
                    is_flipped_horizontal = True
                    # Un-flip the matrix to get correct rotation/dims
                    transform_matrix.a = -transform_matrix.a
                    transform_matrix.b = -transform_matrix.b

                # Calculate unrotated dimensions (scale factors)
                # The matrix maps the unit square (0..1, 0..1) to the image quad.
                # Width vector is (a, b), Height vector is (c, d)
                unrotated_w = math.sqrt(transform_matrix.a**2 + transform_matrix.b**2)
                unrotated_h = math.sqrt(transform_matrix.c**2 + transform_matrix.d**2)

                # Calculate rotation
                rotation = get_rotation_from_matrix(transform_matrix)

                # Calculate center point (transform of 0.5, 0.5)
                # x = a*0.5 + c*0.5 + e
                # y = b*0.5 + d*0.5 + f
                center_x = transform_matrix.a * 0.5 + transform_matrix.c * 0.5 + transform_matrix.e
                center_y = transform_matrix.b * 0.5 + transform_matrix.d * 0.5 + transform_matrix.f

            # --- 4. Upload Logic ---
            image_filename = f"{document_id}/page{page.number + 1}_img{img_index}.{image_ext}"
            blob = image_bucket.blob(image_filename)
            blob.upload_from_string(image_bytes, content_type=f"image/{image_ext}")
            public_url = f"https://storage.googleapis.com/{image_bucket.name}/{image_filename}"

            # --- 5. Build Element Data ---
            image_element = {
                "type": "image", "src": public_url,
                "position": {'x0': full_pos_rect.x0, 'y0': full_pos_rect.y0, 'x1': full_pos_rect.x1, 'y1': full_pos_rect.y1},
                # The 'crop' attribute now holds the precise visual dimensions.
                "crop": {'x0': effective_crop_rect.x0, 'y0': effective_crop_rect.y0, 'x1': effective_crop_rect.x1, 'y1': effective_crop_rect.y1},
                "rotation": rotation,
                "isFlippedHorizontal": is_flipped_horizontal,
                "isFlippedVertical": is_flipped_vertical,
                "originalWidth": orig_w,
                "originalHeight": orig_h,
                # New layout data for precise positioning
                "layout": {
                    "width": unrotated_w,
                    "height": unrotated_h,
                    "centerX": center_x,
                    "centerY": center_y
                },
                "zIndex": z_counter
            }

            image_elements.append(image_element)
            z_counter += 1

        except Exception as e:
            print(f"Error processing image xref {xref} on page {page.number + 1}. Error: {e}")

    return image_elements, z_counter
```

JorjMcKie · 2025-11-29T15:45:53Z

JorjMcKie
Nov 29, 2025
Maintainer

This is not an issue but a typical "Discussions" item. Transferring ...

0 replies

JorjMcKie · 2025-11-29T15:54:15Z

JorjMcKie
Nov 29, 2025
Maintainer

You did not include the "problem" file, so there is no way to check out the situation.
You leave us guessing what your overall goal might be: re-generating the original from the ground up maybe?
The included code is also very long - no reasonable way for us to debug this under the circumstances. You also mixed up code blocks and standard text, making it even harder to understand.

Nonetheless one idea you may want to pursue is that the picture may be covered by gray areas ... and is not really cropped.
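One way to test the "covered, not cropped" idea is to look at drawing order. `page.get_bboxlog()` returns `(type, rect)` tuples in the order the page content is painted, so a filled path that appears after a `"fill-image"` entry and overlaps it is a candidate cover. The sketch below works on a plain list shaped like that output, so it runs without a PDF; the sample data is invented and `covering_paths` is a hypothetical helper name, not part of PyMuPDF.

```python
def overlap_area(a, b):
    """Area shared by two (x0, y0, x1, y1) boxes; 0.0 when they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def covering_paths(bboxlog, min_overlap=1.0):
    """Map each 'fill-image' index to the later 'fill-path' boxes that overlap it."""
    result = {}
    for i, (kind, rect) in enumerate(bboxlog):
        if kind != "fill-image":
            continue
        result[i] = [
            (j, later_rect)
            for j, (later_kind, later_rect) in enumerate(bboxlog[i + 1:], start=i + 1)
            if later_kind == "fill-path" and overlap_area(rect, later_rect) >= min_overlap
        ]
    return result


# Invented sample in the shape of page.get_bboxlog() output.
sample_log = [
    ("fill-path", (0, 0, 600, 800)),       # page background, painted first
    ("fill-image", (100, 100, 500, 400)),  # full image box
    ("fill-path", (100, 100, 160, 400)),   # strip painted later, hides the left edge
    ("fill-text", (120, 420, 300, 440)),
]

print(covering_paths(sample_log))
```

If an image index maps to a non-empty list, the visible part of that image is probably determined by those later shapes rather than by a clip rectangle.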

0 replies

AntoineFachez · 2025-12-03T10:18:46Z

AntoineFachez
Dec 3, 2025
Author

Hello Jorj,

Thank you for your initial feedback. I apologize for the previous post's poor formatting and for posting on the wrong board—thank you for moving this conversation to the appropriate channel. I also understand that the extensive code made debugging difficult.

My overall goal is to accurately reconstruct a PDF page as an interactive HTML/CSS stylesheet using PyMuPDF to extract all visual elements (text, vectors, and images) with their correct rendered positions and crops. I had no issues finding the accurate positions, flipping images, getting the fonts and colors.

The core issue is that images exported to PDF by Apple Pages and Keynote (which often embed the full image but apply a clipping mask) are not being correctly clipped or masked when I extract their bounding box (BBox) for CSS rendering.

🖼️ The Problem Illustrated

The extracted images always bleed because the BBox I retrieve corresponds to the raw, uncropped image dimensions, not the visually clipped area; the crop values come out identical to the image's own dimensions.
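For the rendering side, once a crop rectangle that differs from the full image box is available, it could be applied in CSS with `clip-path: inset(...)`. The following is only a sketch under assumptions: `crop_to_css` is a hypothetical helper, the dictionaries follow the `position` / `crop` shape used in the script, and page units are assumed to map 1:1 to CSS pixels.

```python
def crop_to_css(position, crop):
    """Build inline CSS for an absolutely positioned <img> clipped to `crop`.

    Both arguments are dicts with x0, y0, x1, y1. The inset values are the
    distances from each edge of the full image box to the crop rectangle.
    """
    width = position["x1"] - position["x0"]
    height = position["y1"] - position["y0"]
    # Clamp at 0 so a crop larger than the image box never produces negative insets.
    top = max(0.0, crop["y0"] - position["y0"])
    right = max(0.0, position["x1"] - crop["x1"])
    bottom = max(0.0, position["y1"] - crop["y1"])
    left = max(0.0, crop["x0"] - position["x0"])
    return (
        f"position:absolute; left:{position['x0']:g}px; top:{position['y0']:g}px; "
        f"width:{width:g}px; height:{height:g}px; "
        f"clip-path:inset({top:g}px {right:g}px {bottom:g}px {left:g}px);"
    )


position = {"x0": 100, "y0": 50, "x1": 500, "y1": 350}
crop = {"x0": 150, "y0": 50, "x1": 450, "y1": 300}
print(crop_to_css(position, crop))
```

When `crop` equals `position`, all four insets are zero and nothing is hidden, which matches the bleeding described above.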

Extracted PDF source:
250724_SERIE_NICHT-GANZ-SAUBER_KONZEPT (dragged).pdf

HTML rendered result:

🛠️ Trials and Current Approach

I have extensively modified my image extraction logic (process_images in https://github.com/AntoineFachez/script-pitcher/blob/main/run/pdf-processor/visuals.py) over several days to address this specific clipping problem, using coding agents, trying different maskings, croppings, and clippings of PDF outputs, and consulting the PyMuPDF documentation.

My latest attempt combines several strategies based on PyMuPDF's advanced extraction features:

get_image_info() with bbox: The default bbox (which is often the unclipped size).

get_drawings(): Used to detect explicit clipping paths (draw["type"] == "clip") and, crucially, to identify images used as fill patterns (draw["fill_images"]), which is a common Apple-exporter technique.

Soft Masks (smask): Explicitly checked and handled using doc.extract_image(smask).

Heuristics: Using the largest-overlapping clipping path or image fill rectangle as the effective_crop_rect.
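A detail in the get_drawings() step that may be worth double-checking: according to the PyMuPDF documentation, entries with `"type": "clip"` are only returned when the method is called as `page.get_drawings(extended=True)`, and in that mode clip entries carry a `"scissor"` rectangle and every path a `"level"`. A `"fill_images"` key does not seem to appear among the documented path keys, so that branch may never fire. The sketch below filters plain dictionaries shaped like the extended output; the sample data is invented and `collect_clips` is a hypothetical helper name.

```python
def collect_clips(paths):
    """Return index, level and rectangle of every clip entry in an extended path list."""
    clips = []
    for index, path in enumerate(paths):
        if path.get("type") == "clip":
            # Prefer the reported clip rectangle; fall back to the path's own rect.
            rect = path.get("scissor", path.get("rect"))
            clips.append({"index": index, "level": path.get("level", 0), "rect": tuple(rect)})
    return clips


# Invented sample in the shape of page.get_drawings(extended=True) output.
sample_paths = [
    {"type": "clip", "level": 0, "rect": (150, 100, 450, 350), "scissor": (150, 100, 450, 350)},
    {"type": "f", "level": 1, "rect": (150, 100, 450, 350), "fill": (0.5, 0.5, 0.5)},
    {"type": "s", "level": 0, "rect": (0, 0, 600, 1), "color": (0, 0, 0)},
]

print(collect_clips(sample_paths))
```

Note that get_drawings() reports vector paths only, so matching a clip to an image still has to go through position or drawing order.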

The logic now attempts to set the final crop attribute to the geometrically derived effective_crop_rect.

💻 Relevant Code Snippets

Here is the current state of my process_images function that attempts to find the correct visual bounding box/crop:

```python
# From pdf-processor/visuals.py

def process_images(doc, page, image_bucket, document_id, z_counter):
    image_elements = []
    page_cropbox = page.cropbox
    image_info_list = page.get_image_info(xrefs=True)

    # Check for images used as fills in drawings (common in Mac Pages)
    image_fill_rects = {}  # Map xref -> list of rects
    clipping_paths = []

    try:
        drawings = page.get_drawings()
        for draw in drawings:
            # 1. Check for image fills
            if "fill_images" in draw and draw["fill_images"]:
                for xref in draw["fill_images"]:
                    if xref not in image_fill_rects:
                        image_fill_rects[xref] = []
                    image_fill_rects[xref].append(fitz.Rect(draw["rect"]))

            # 2. Check for clipping paths
            if draw["type"] == "clip":
                clipping_paths.append(fitz.Rect(draw["rect"]))

    except Exception as e:
        print(f"Warning: Could not get drawings: {e}")

    for img_index, info in enumerate(image_info_list):
        xref = info["xref"]
        if not xref:
            continue

        smask = info.get("smask", 0)
        full_pos_rect = fitz.Rect(info["bbox"])
        effective_crop_rect = full_pos_rect  # Initialize

        # ... (Image extraction and smask handling omitted for brevity) ...

        # Priority 1: Check if image is a fill for a drawing (Mac Pages style)
        found_fill_match = False
        if xref in image_fill_rects:
            # Logic to find the fill rect that best overlaps with the current image bbox
            best_fill = None
            best_overlap_area = 0
            for fill_rect in image_fill_rects[xref]:
                inter = full_pos_rect.intersect(fill_rect)
                area = inter.get_area()
                if area > best_overlap_area:
                    best_overlap_area = area
                    best_fill = fill_rect

            if best_fill and best_overlap_area > 0:
                effective_crop_rect = best_fill
                found_fill_match = True

        if not found_fill_match:
            # Priority 2: Check for intersecting clipping paths
            best_clip = None
            best_clip_area = 0
            for clip_rect in clipping_paths:
                if clip_rect.intersects(full_pos_rect):
                    inter = full_pos_rect.intersect(clip_rect)
                    area = inter.get_area()
                    if area > best_clip_area:
                        best_clip_area = area
                        best_clip = clip_rect

            if best_clip:
                effective_crop_rect = full_pos_rect.intersect(best_clip)
            else:
                # Fallback: intersect with page cropbox
                effective_crop_rect = full_pos_rect.intersect(page_cropbox)

        # Final adjustment
        effective_crop_rect = effective_crop_rect.intersect(full_pos_rect)

        # ... (Transformation/rotation and upload logic omitted for brevity) ...

        image_element = {
            "type": "image", "src": public_url,
            "position": {'x0': full_pos_rect.x0, 'y0': full_pos_rect.y0, 'x1': full_pos_rect.x1, 'y1': full_pos_rect.y1},
            "crop": {'x0': effective_crop_rect.x0, 'y0': effective_crop_rect.y0, 'x1': effective_crop_rect.x1, 'y1': effective_crop_rect.y1},
            # ... (other data)
        }
        # ...
```

❓ My Core Question

Despite this complex logic involving get_drawings (for clips and fills) and get_image_info, the effective_crop_rect I calculate still often matches the raw full_pos_rect when processing PDFs from Mac apps.
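One possible contributor to crop and position coming out identical, independent of how the PDF was produced: as documented, PyMuPDF's `Rect.intersect()` replaces the rectangle it is called on and returns that same object. A line such as `effective_crop_rect = full_pos_rect.intersect(best_clip)` therefore shrinks `full_pos_rect` itself and leaves both names pointing at one rectangle, whereas `full_pos_rect & best_clip` returns a new `Rect`. The sketch below shows the non-mutating pattern with plain tuples so that it runs without PyMuPDF; `intersection` and `best_overlap` are hypothetical helper names.

```python
def intersection(a, b):
    """Return a new (x0, y0, x1, y1) tuple, or None when the boxes do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or y1 <= y0:
        return None
    return (x0, y0, x1, y1)


def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def best_overlap(bbox, candidates):
    """Largest intersection of `bbox` with any candidate; `bbox` is never modified."""
    best = None
    for cand in candidates:
        inter = intersection(bbox, cand)
        if inter is not None and (best is None or area(inter) > area(best)):
            best = inter
    return best


full_pos = (100.0, 100.0, 500.0, 400.0)
clips = [(0.0, 0.0, 50.0, 50.0), (150.0, 100.0, 450.0, 350.0)]
crop = best_overlap(full_pos, clips)
print(full_pos)  # the image box keeps its original values
print(crop)      # the crop is a separate, smaller rectangle
```

With `fitz.Rect` objects, the equivalent would be to compute `full_pos_rect & clip_rect` inside the loops and to start from a copy such as `fitz.Rect(full_pos_rect)` rather than a second reference to the same object.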

Is there a specific PyMuPDF function or MuPDF concept (perhaps related to the graphics state stack or a different type of masking operator used by Apple products) that I am missing to reliably determine the visually rendered bounding box of a clipped image?

I included the original PDF file for analysis.

Thank you for your time and continued support.

Anthony

0 replies

JorjMcKie · 2025-12-03T17:35:59Z

JorjMcKie
Dec 3, 2025
Maintainer

I'm afraid there is only a limited degree of freedom to deal with this. As I wrote: the image is neither "cropped" nor "clipped".
It is covered by stuff (gray area) more in the foreground. These things occur "later" in the source page's appearance source (/Contents objects). As long as you do not replicate the sequence of things when writing to your target file, you will not be able to achieve your goal.
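In the script posted above, `z_counter` is advanced separately inside `process_shapes_and_lines` and `process_images`, so all shapes end up either below or above all images depending on call order, regardless of the order in the page's content stream. A minimal sketch of replicating the sequence instead: take a single ordering from a log shaped like `page.get_bboxlog()` output and assign each extracted element the index of its matching entry. The matching by kind and near-equal box is an assumption made for illustration (real boxes may differ, for example by stroke width), and `z_from_sequence` is a hypothetical helper name.

```python
def rects_close(a, b, tol=1.0):
    """True when all four coordinates differ by at most `tol`."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))


def z_from_sequence(elements, bboxlog, tol=1.0):
    """Assign zIndex from the first unused log entry whose kind and box match."""
    used = set()
    for el in elements:
        el["zIndex"] = None
        for seq, (kind, rect) in enumerate(bboxlog):
            if seq in used or kind != el["kind"]:
                continue
            if rects_close(el["rect"], rect, tol):
                el["zIndex"] = seq
                used.add(seq)
                break
    # Unmatched elements sort last so they stay visible for debugging.
    return sorted(elements, key=lambda el: (el["zIndex"] is None, el["zIndex"] or 0))


# Invented sample: the shape was extracted before the image, but is painted after it.
log = [
    ("fill-image", (100, 100, 500, 400)),
    ("fill-path", (100, 100, 160, 400)),
]
extracted = [
    {"kind": "fill-path", "rect": (100, 100, 160, 400)},
    {"kind": "fill-image", "rect": (100, 100, 500, 400)},
]

ordered = z_from_sequence(extracted, log)
print([(el["kind"], el["zIndex"]) for el in ordered])
```

Emitting the HTML elements in this order (or using the assigned `zIndex` values) lets the later shape paint over the image, as it does in the PDF.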

1 reply

AntoineFachez Dec 8, 2025
Author

I haven't yet been able to access these elements and replicate the sequence when writing to the target file, but thanks for your immediate replies, Jorj.
