Skip to content

Conversation

@emerose
Copy link
Contributor

@emerose emerose commented Jan 24, 2026

When processing PowerPoint files containing picture shapes that reference external images (rather than embedded images), the python-pptx library raises a ValueError("no embedded image") when accessing the image property.

Previously, this caused the entire document conversion to fail because:

  1. The hasattr(shape, "image") check at line 690 would trigger the property getter, which raises ValueError (hasattr only catches AttributeError, not ValueError)

  2. The exception handler in _handle_pictures() only caught UnidentifiedImageError and OSError, not ValueError

This fix:

  • Removes the unnecessary hasattr check since we already verify the shape type is MSO_SHAPE_TYPE.PICTURE
  • Adds ValueError to the exception handler in _handle_pictures() so that picture shapes with external references are gracefully skipped with a warning instead of crashing the pipeline

@github-actions
Copy link
Contributor

github-actions bot commented Jan 24, 2026

DCO Check Passed

Thanks @emerose, all your commits are properly signed off. 🎉

@dosubot
Copy link

dosubot bot commented Jan 24, 2026

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.

How did I do? Any feedback?  Join Discord

@mergify
Copy link

mergify bot commented Jan 24, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@codecov
Copy link

codecov bot commented Jan 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@ceberam ceberam added bug Something isn't working pptx issue related to pptx backend labels Jan 29, 2026
@ceberam
Copy link
Member

ceberam commented Jan 29, 2026

Thanks @emerose for catching this issue and suggesting this fix.
It looks like a clean and valid approach. Just to validate it, could you please share an example of document with that external image reference? Eventually, you could add it to the .pptx test file powerpoint_with_image.pptx.

@emerose
Copy link
Contributor Author

emerose commented Jan 30, 2026

@ceberam unfortunately, the source file that caused the problem was an internal / private company document that I can't share directly. they were created by a partner, but I don't think that there was anything particularly unusual about them. let me know if there are other ways I can help

When processing PowerPoint files containing picture shapes that reference
external images (rather than embedded images), the python-pptx library
raises a ValueError("no embedded image") when accessing the `image`
property.

Previously, this caused the entire document conversion to fail because:

1. The `hasattr(shape, "image")` check at line 690 would trigger the
   property getter, which raises ValueError (hasattr only catches
   AttributeError, not ValueError)

2. The exception handler in `_handle_pictures()` only caught
   UnidentifiedImageError and OSError, not ValueError

This fix:
- Removes the unnecessary hasattr check since we already verify the
  shape type is MSO_SHAPE_TYPE.PICTURE
- Adds ValueError to the exception handler in `_handle_pictures()` so
  that picture shapes with external references are gracefully skipped
  with a warning instead of crashing the pipeline

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@ceberam ceberam force-pushed the fix/pptx-external-image-valueerror branch from 7045926 to 2c78604 Compare January 30, 2026 17:59
@ceberam
Copy link
Member

ceberam commented Jan 30, 2026

@emerose I've just added a commit with a modification of the test file powerpoint_with_image.pptx that includes a image referenced to an external file. I checked that our latest release fails indeed, and your fix works as expected.

I'll approve this PR. Maybe some thoughts for further improvement:

  • it would make sense to report the issue to python-pptx
  • we may want to fetch the image and embed it, if the user requests it. We have this pattern in the HTML backend parser.

@ceberam ceberam self-assigned this Jan 30, 2026
emerose and others added 2 commits January 31, 2026 19:57
I, Sam Quigley <[email protected]>, hereby add my Signed-off-by to this commit: e69779e

Signed-off-by: Sam Quigley <[email protected]>
@ceberam ceberam force-pushed the fix/pptx-external-image-valueerror branch from 2c78604 to 0c025d5 Compare January 31, 2026 18:59
@dolfim-ibm
Copy link
Member

@ceberam @emerose I understand this avoids a failure in the processing, it is actually not fetching the referenced image, right?

In case of loading referenced/remote images we should add the explicit option as for html, see https://github.com/docling-project/docling/blob/main/docling/datamodel/backend_options.py#L31.

@dolfim-ibm dolfim-ibm merged commit 5e452a2 into docling-project:main Feb 1, 2026
44 of 45 checks passed
@ceberam
Copy link
Member

ceberam commented Feb 1, 2026

@ceberam @emerose I understand this avoids a failure in the processing, it is actually not fetching the referenced image, right?

Correct, that was the intent of this PR

In case of loading referenced/remote images we should add the explicit option as for html, see https://github.com/docling-project/docling/blob/main/docling/datamodel/backend_options.py#L31.

That’s what I meant with my earlier comment on this conversation

#2914 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working pptx issue related to pptx backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants