Add ability to clean XML Document / preserve case in cleaned HTML #1930

Open
Juli4nSc opened this issue Mar 29, 2023 · 1 comment

Juli4nSc commented Mar 29, 2023

Hello,

I'm trying to sanitize some SVG content and am using jsoup for that.
It is possible to get an XML Document by using the XML parser, as below:

Document document = Jsoup.parse(svg, Parser.xmlParser());

However, there is then no way to clean this XML against a whitelist (Safelist).
The content is handled as if it were HTML.
Is there a way of doing this? I would expect it to be possible when XML parsing is enabled.

What I need here is preserving case sensitivity on the Attributes and Tags which is only possible when using XML parsing

Owner

jhy commented Mar 29, 2023

Right, the Cleaner is currently designed to take HTML body content and clean that. I had been thinking of adding support for cleaning a complete Document (vs. a body fragment). That path would then also support XML Documents.

Another (and for your case, probably better) feature would be to enable case-insensitive attribute checks and output case-preserving HTML. You can almost do that now: the Cleaner checks tags by their normalized (lower-cased) names, but does not do that for attributes. So currently, through the Cleaner, tag case can be preserved, but not attribute case.
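
To illustrate the idea of a case-insensitive check with case-preserving output, here is a minimal, library-independent sketch. It does not use jsoup's Cleaner or Safelist and is not how jsoup implements anything; the class `CasePreservingAttributeFilter`, its `ALLOWED` set, and the `filter` method are hypothetical names made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class CasePreservingAttributeFilter {
    // Hypothetical allowlist, stored in lower case so that lookups are case-insensitive.
    private static final Set<String> ALLOWED = Set.of("viewbox", "width", "height");

    /**
     * Keeps only allowed attributes. Names are matched case-insensitively,
     * but the original spelling of each kept name is emitted unchanged.
     */
    public static Map<String, String> filter(Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>(); // keeps insertion order
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            // Normalize only for the lookup, never for the output.
            String normalized = attr.getKey().toLowerCase(Locale.ROOT);
            if (ALLOWED.contains(normalized)) {
                kept.put(attr.getKey(), attr.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("viewBox", "0 0 10 10");
        attrs.put("onClick", "alert(1)");
        attrs.put("WIDTH", "10");
        System.out.println(filter(attrs)); // prints {viewBox=0 0 10 10, WIDTH=10}
    }
}
```

The point of the sketch is that normalization is applied to the allowlist lookup only, so `viewBox` passes the check against `viewbox` and still comes out as `viewBox`.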

> What I need here is preserving case sensitivity on the Attributes and Tags which is only possible when using XML parsing

For just parsing (not the cleaner, as noted above), you can preserve tag and attribute case and still use the HTML parser. E.g.:

Document doc = Jsoup.parse(
    "<SVG viewBox=123 />",
    Parser.htmlParser()
        .settings(ParseSettings.preserveCase)
);
System.out.println(doc.html());

Gives

<SVG viewBox="123" />

Another nice-to-have might be to automatically preserve case in SVG elements when they appear in HTML.

@jhy changed the title from "Unable to Clean XML Document" to "Add ability to clean XML Document / preserve case in cleaned HTML" on Mar 29, 2023
@jhy added the feature label on Mar 29, 2023