Does sanitize-html convert "messy" HTML (which you find across the web) into standard HTML? #641

lancejpollard · 2024-01-17T11:02:23Z

Years ago I read that the Chrome browser could parse any HTML, even when you had self-closing tags on non-self-closable elements like <div /> and other "messy" stuff. I would assume it can even recover from things like <div <p>hello world</p</div>, but I'm not totally sure.

Does sanitize-html handle this sort of stuff? Basically, I'm wondering: if someone inputs some arbitrary webpage HTML, will it return more "standardized" HTML as output, like what Chrome would see? That way, I wouldn't have to go down the road of using htmlparser2 or parse5 to parse messy HTML into an AST and then convert the AST back to standards-compliant HTML.

This way, when converting HTML to PDF with pandoc (given malformed HTML, it runs successfully but creates a malformed PDF), I could first "sanitize" the HTML to get standards-compliant HTML, and then render that to PDF. That would fix that problem, among other things.

If not, what does sanitize-html do differently?

Answered by boutell

boutell · 2024-01-17T13:27:44Z

Maintainer

sanitize-html uses htmlparser2 to parse the document, so you're going to get htmlparser2's interpretation of the document, for good or ill (generally for good in my experience; I'm not casting shade).

I think that answers the question, but if you have further questions about it you can check into the htmlparser2 documentation.
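
To make "you get the parser's interpretation" concrete, here is a minimal sketch of the parse-then-reserialize round trip the question describes. It is not sanitize-html or htmlparser2: it uses Python's standard-library `html.parser` as a stand-in tolerant parser, and the names `Reserializer`, `normalize`, and `VOID_TAGS` are made up for this illustration. It drops comments and doctypes and does not special-case `<script>` or `<style>` content, so treat it as a demonstration of the idea rather than a normalizer to rely on.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Reserializer(HTMLParser):
    """Re-emit balanced markup from the events a tolerant parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, joined at the end
        self.open_tags = []  # stack of currently open non-void tags

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close anything still open inside the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape text.
        self.out.append(escape(data, quote=False))


def normalize(html):
    parser = Reserializer()
    parser.feed(html)
    parser.close()
    # Close whatever the input never closed.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(normalize("<p>hello <b>world"))  # unclosed tags get closed
print(normalize("<i>a</b></i>"))       # stray </b> is dropped
```

Every choice in the output (what happens to a stray closing tag, when an unclosed element ends) comes from this parser and serializer, not from any standard. A browser would, for example, treat `<div />` as an opening tag and ignore the slash, while this sketch closes the element immediately. Whether a given tool's output matches "what Chrome would see" depends on whether its parser follows the same recovery rules.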
