-
Notifications
You must be signed in to change notification settings - Fork 378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SimpleHtmlSanitizer should support self-closing variants of whitelisted tags #9994
Comments
Thanks for the report - I do note the following from your link:
That paragraph goes on to discuss ways that different implementations might interpret such malformed html differently. Then this note:
As such, it seems very reasonable that the SimpleHtmlSanitizer does not support XML or XHTML, as these are not the same thing. Am I missing something here that would make it correct for valid HTML to contain self-closing tags? We wouldn't want this to be ambiguous, but it seems it could be safe for void elements? I'll leave this point open to more discussion. For the rest of this, let's assume that we've either elected to add a SimpleXhtmlSanitizer, or have found a valid way in which HTML can contain self-closing tags: I think solving the "allow a whitespace char" could be solved distinctly from "allow self-closing tags", but in the same block of code: If I'm reading this right, this would capture just the tag name itself so that either The most strict interpretation I can find here then is that if we had a specific list of void elements and only permit the above transformation for those (as you noted, this would be only |
You can of course make an argument that self-closing tags are not valid in the first place. And since the purpose of the SimpleHtmlSanitizer seems to be to do a best effort to sanitize input that may not be entirely under your control (while still allowing it to make some use of basic formatting tags), I'd argue that it wouldn't hurt to also go the extra mile and support these cases that may not be fully conforming to the spec but may be encountered in the wild nonetheless. A distinction between a separate sanitizer for HTML and XHTML just for these cases may not be suitable, because you may not know in advance whether the input you receive is aiming to be HTML, XHTML or even just some weird mix that would still work in modern browsers. And if the SimpleXhtmlSanitizer would basically do the same thing as the regular one except for also accounting for self-closing tags, it would rather boil down to the question of "how lenient do you want to be with input that does not follow the HTML5 spec to a tee?". Side note: The lack of support for self-closing tags seems to have also been brought up in a comment way back when support for |
I'm not disagreeing that this would be a good addition - that's why I outlined how the change could be made. But I think there is a solid case to be made that this is not what the tool intended, since it forbids lots of other valid HTML5 content too, and supports/normalized no invalid HTML5. "Sanitizing output outside of your control" usually means a best effort has to be made - is there any example of a case that isn't fully conforming that is supported today? My naming might not have been ideal - something that implies that like tools like jsoup it is rewriting the content to be valid and then sanitizing to a subset could make sense. I don't have a concrete suggestion here, but I think as I showed the implementation shouldn't be terribly difficult. What about either a subclass that adds this feature (and documents it accordingly), or a default property or ctor arg so that existing implementations continue to function as they did, but new uses can opt-in to slightly more flexible behavior? Again, I'm not objecting to this, but I want there to be some discussion around suitability and design before loosening what is designed to be a safety feature. |
GWT version: 2.11
Browser (with version): Firefox 129.0.2
Operating System: Windows 10
Description
The
SimpleHtmlSanitizer
supports the<br>
and<hr>
tags in a whitelist to leave them unchanged. But this does not work for self-closing variants of these void elements.Depending on the browser implementation, it will likely simply discard the trailing slash on start elements. But the sanitizer should still support those variants, since they are used if you aim to create valid XHTML instead of HTML5, and the input you are sanitizing may not be fully under your control in that regard.
Of the list of tags that are already being whitelisted by the
SimpleHtmlSanitizer
,<br>
and<hr>
are the only ones that are also listed as official void elements in the documentation above. So these are probably all that we need to support.A trivial solution would be to add
"br/"
,"br /"
,"hr/"
and"hr /"
to the existing whitelist. Alternatively, you could try to change the code to handle these trailing slashes with and without whitespace properly.Steps to reproduce
Known workarounds
The text was updated successfully, but these errors were encountered: