Setting core_properties.last_modified_by makes document invalid #1037

Open
reinold123 opened this issue Dec 24, 2021 · 16 comments

Comments

@reinold123

Changing the last-modified user through the core properties makes the document invalid in Word, and the lastModifiedBy user is still the old one.

If I look in the core.xml the cp:lastModifiedBy property is there twice.

@scanny
Contributor

scanny commented Dec 24, 2021

Please post the code you used.

@reinold123
Author

reinold123 commented Dec 28, 2021

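from docx import Document

# GENERATING_FOLDER and file_name are defined elsewhere in the reporter's code.
docx_doc = Document(f"{GENERATING_FOLDER}/{file_name}")
core_properties = docx_doc.core_properties
core_properties.author = 'Myself'
core_properties.last_modified_by = 'Someone'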

@reinold123
Author

It seems to happen when the xmlns:cp="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" property is added. Other elements are replaced, but this one ends up duplicated. @scanny

@scanny
Contributor

scanny commented Dec 28, 2021

Paste in a snippet of the core.xml that shows it in there twice.

@scanny
Contributor

scanny commented Dec 28, 2021

Also, what do you mean by:

It seems to happen when the xmlns:cp="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" property is added

That is a namespace declaration not a property and who is adding it and how can you tell?

@reinold123
Author

reinold123 commented Dec 30, 2021

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcmitype="http://purl.org/dc/dcmitype/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>
  <dc:title/>
  <dc:subject/>
  <dc:creator>Author</dc:creator>
  <cp:keywords/>
  <dc:description></dc:description>
  <cp:lastModifiedBy>first_author</cp:lastModifiedBy>
  <cp:revision>30</cp:revision>
  <dcterms:created xsi:type="dcterms:W3CDTF">2020-12-15T09:27:00Z</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2021-11-22T20:52:00Z</dcterms:modified>
  <cp:lastModifiedBy xmlns:cp="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties">
    Second author
  </cp:lastModifiedBy>
</cp:coreProperties>

@reinold123
Author

So there are two elements: <cp:lastModifiedBy xmlns:cp="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"> and <cp:lastModifiedBy>.

@reinold123
Author

Also, what do you mean by:

It seems to happen when the xmlns:cp="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" property is added

That is a namespace declaration not a property and who is adding it and how can you tell?

You're right, it's not an attribute, my bad ;-)

@scanny
Contributor

scanny commented Dec 30, 2021

Okay, so this looks like a namespace collision. What is the provenance of the document you are making this change to? What happens if you make this change to a newly-created Word document?

The cp: namespace prefix (namespace abbreviation) cannot refer both to the namespace "http://schemas.openxmlformats.org/package/2006/metadata/core-properties" and the namespace "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties". It should, in fact, be used only for the core-properties namespace.
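
To make the collision concrete, here is a minimal sketch using only the Python standard library. It parses a trimmed copy of the core.xml posted above and shows that the two cp:lastModifiedBy elements resolve to different fully qualified names, so they are different elements as far as an XML parser is concerned. The constant names CORE_NS, CUSTOM_NS and CORE_XML are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

CORE_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
CUSTOM_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"

# Trimmed version of the core.xml posted earlier in this thread.
CORE_XML = f"""\
<cp:coreProperties xmlns:cp="{CORE_NS}">
  <cp:lastModifiedBy>first_author</cp:lastModifiedBy>
  <cp:revision>30</cp:revision>
  <cp:lastModifiedBy xmlns:cp="{CUSTOM_NS}">Second author</cp:lastModifiedBy>
</cp:coreProperties>
"""

root = ET.fromstring(CORE_XML)

# ElementTree exposes each tag as "{namespace-uri}localname", so the shared
# "cp:" prefix no longer hides which namespace an element belongs to.
in_core_ns = root.findall(f"{{{CORE_NS}}}lastModifiedBy")
in_custom_ns = root.findall(f"{{{CUSTOM_NS}}}lastModifiedBy")

for child in root:
    print(child.tag, "->", (child.text or "").strip())

# One element lives in each namespace; a reader that only looks in the
# core-properties namespace still sees the old value.
print(len(in_core_ns), len(in_custom_ns))
```

A reader that looks up lastModifiedBy in the core-properties namespace only finds "first_author", which would be consistent with the old value still being reported.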

In addition to an account of the provenance, please paste in the original core.xml, before any changes are made. Also please format the pasted XML for readability, see the reformatted message 3 above this one (enter edit mode on that message and see how it is done).

Also, show the output of the following:

>>> document = Document(...)
>>> core_properties = document.core_properties
>>> core_properties.last_modified_by
...
>>> core_properties.last_modified_by = 'Someone'
>>> core_properties.last_modified_by
...

@Kisioj

Kisioj commented Jan 30, 2024

I found the solution and it's not in python-docx. I encountered this problem when using docxtpl which uses python-docx.
The problem is that docxtpl uses docxcompose which changes nsmap and overrides cp namespace.

from docx.oxml.ns import nsmap
NS = {
    ...
    'cp': 'http://schemas.openxmlformats.org/officeDocument/2006/custom-properties',
    ...
}
nsmap.update(NS)

However, I think the bug can be fixed on python-docx's side. The coreProperties element keeps its own information about the cp: namespace, and even when nsmap is overridden the docx is still saved with the proper namespace:
<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties...

Why is it that when we write
core_properties.last_modified_by = 'Someone'
last_modified_by is assumed to be the lastModifiedBy from the global cp: namespace rather than the one coreProperties itself uses? I think that in cases like this, even if cp: is globally overwritten, the parent's namespace should always be used, so there is no need to add a new tag when the old one exists and no need to add xmlns:cp= to that new tag.
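
The mechanism described above can be illustrated with a small standard-library sketch. It assumes a shared, module-level prefix map and a "find the child, create it if missing" setter; the names nsmap, qn and set_text below are local stand-ins written for this example, not python-docx's own code.

```python
import xml.etree.ElementTree as ET

# Local stand-in for a shared, module-level prefix map (an assumption made
# for this illustration).
nsmap = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
}


def qn(tag: str) -> str:
    """Turn 'prefix:local' into '{uri}local' using the shared map."""
    prefix, local = tag.split(":")
    return f"{{{nsmap[prefix]}}}{local}"


def set_text(parent: ET.Element, tag: str, text: str) -> None:
    """Set the text of the child named `tag`, creating the child if missing."""
    child = parent.find(qn(tag))
    if child is None:
        child = ET.SubElement(parent, qn(tag))
    child.text = text


root = ET.Element(qn("cp:coreProperties"))
set_text(root, "cp:lastModifiedBy", "first_author")

# Another library rebinds the 'cp' prefix in the shared map.
nsmap.update({
    "cp": "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties",
})

# The same prefixed name now resolves to a different qualified name, so the
# lookup misses the existing element and a second one is appended.
set_text(root, "cp:lastModifiedBy", "Someone")

print([child.tag for child in root])
```

After the prefix is rebound, the root element holds two lastModifiedBy children in different namespaces, which matches the shape of the core.xml posted earlier in the thread.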

@bhavin-qryptal

bhavin-qryptal commented Aug 18, 2024

We are encountering a somewhat related problem while using python-docx (indirectly via docxtpl).

Merely reading the docx template's properties by calling doc.core_properties produces a duplicate docProps/core.xml entry.

Ref: elapouya/python-docx-template#558
Input.docx
Output.docx

The problem occurs when Input.docx is generated by LibreOffice. If Input.docx is generated by MS Word, this issue does not happen.

Run the attached code against any LibreOffice generated docx file to reproduce this issue.

from docx import Document

# Load any document generated using LibreOffice
doc = Document('Input.docx')
core_properties = doc.core_properties

# Following line generates UserWarning and output file is corrupt. It could not be opened using MS Word 2013.
# ...\Lib\zipfile.py:1566: UserWarning: Duplicate name: 'docProps/core.xml'
# return self._open_to_write(zinfo, force_zip64=force_zip64)
doc.save('Output.docx')
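
To check whether a saved file is affected, without relying on the warning being visible, one option is to list the archive members and look for repeated names. The sketch below uses only the standard library; the helper name duplicate_members is hypothetical, and the in-memory archive merely imitates the symptom rather than reproducing it with python-docx.

```python
import io
import warnings
import zipfile
from collections import Counter


def duplicate_members(zip_source) -> list[str]:
    """Return the member names that occur more than once in a zip archive."""
    with zipfile.ZipFile(zip_source) as zf:
        counts = Counter(info.filename for info in zf.infolist())
    return sorted(name for name, n in counts.items() if n > 1)


# Build a tiny in-memory archive with a repeated member to imitate the symptom.
buf = io.BytesIO()
with warnings.catch_warnings():
    # zipfile emits "UserWarning: Duplicate name: ..." here; silence it.
    warnings.simplefilter("ignore", UserWarning)
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", "<first/>")
        zf.writestr("docProps/core.xml", "<second/>")
        zf.writestr("word/document.xml", "<doc/>")
buf.seek(0)

print(duplicate_members(buf))
```

Calling duplicate_members('Output.docx') on a file saved as in the snippet above should then report docProps/core.xml if the file is affected.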

@cip91sk

cip91sk commented Oct 1, 2024

@bhavin-qryptal I had the same problem and I made a PR with a fix: #1436

@scanny
Contributor

scanny commented Oct 1, 2024

Looks like this is a documented non-conformance, aka. "Normative Variation":
https://learn.microsoft.com/en-us/openspecs/office_standards/ms-oi29500/28beaa8f-42ce-41e6-820d-4d7af34457a5

I'd be interested to see what Word does when you load a file like this and change a core-property. I'm inclined to think it would convert the namespace to the ../package/.. version and save it like that.

I don't see how we could have aliases for a namespace.

It's one thing to recognize that a part is already present and not add a new one, it would be something else to decide which namespace to use for the elements of that part at runtime.

@cip91sk

cip91sk commented Oct 2, 2024

Attached are some test files:

  • libreoffice_original.docx: a new docx created by LibreOffice with "Word 2010-365" set as the output format
  • word_save.docx: the result of just opening and saving with Word
  • word_save_as.docx: the result of doing "Save As" from Word
  • word_changed_author.docx: the base file opened again, with its author changed, and saved

As you said, when some core properties are changed, or when just doing "Save As", the namespace is changed to the ../package/.. version.

@scanny
Contributor

scanny commented Oct 2, 2024

@cip91sk have you observed that new DOCX files produced by LibreOffice when choosing the "Word 2010-365" format have the cp:.../officedocument/... namespace mapping?

I'm inclined to think the desirable behavior for python-docx would be to inspect the core-properties part and change the cp: namespace mapping to the ../package/.. namespace when it finds the ../officedocument/.. namespace in use.

This could happen when accessing the core-properties part, in the same spot where python-docx currently creates a new one. So the logic would become something like:

On access to document.core_properties:

  • check for a core-properties part having expected mapping; if it exists return it
  • check for a core-properties part having ../officedocument/.. namespace mapping; if it exists remap the namespace to ../package/.. and return that
  • otherwise create a new core-properties part.
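
The three steps above could be sketched roughly as follows. This is an illustration of the control flow only, not the code in the PR and not python-docx internals: the Part class, the kind field and the function name are hypothetical, and the two constants reuse the abbreviated ../package/.. and ../officedocument/.. labels from this thread instead of spelling out full URIs.

```python
from dataclasses import dataclass

# Abbreviated labels standing in for the two mappings discussed above; the
# full URIs are deliberately left out of this sketch.
PACKAGE_KIND = "../package/.."
OFFICEDOCUMENT_KIND = "../officedocument/.."


@dataclass
class Part:
    """Hypothetical stand-in for a package part and the mapping it uses."""
    kind: str
    xml: str


def core_properties_part(parts: list[Part]) -> Part:
    # 1. A part with the expected mapping already exists: return it.
    for part in parts:
        if part.kind == PACKAGE_KIND:
            return part
    # 2. A part with the variant mapping exists: remap it and return it,
    #    instead of adding a second core-properties part.
    for part in parts:
        if part.kind == OFFICEDOCUMENT_KIND:
            part.kind = PACKAGE_KIND
            return part
    # 3. Nothing found: create a new, empty core-properties part.
    new_part = Part(PACKAGE_KIND, "<cp:coreProperties/>")
    parts.append(new_part)
    return new_part


libreoffice_parts = [Part(OFFICEDOCUMENT_KIND, "<cp:coreProperties>...</cp:coreProperties>")]
found = core_properties_part(libreoffice_parts)
print(found.kind, len(libreoffice_parts))
```

The point of step 2 is that the existing part is reused, so the package still contains exactly one core-properties part after access.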

@cip91sk

cip91sk commented Oct 7, 2024

@scanny I added a commit to the PR that should do what you asked
