Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support implied end tags. #51

Closed
wants to merge 2 commits into from
Closed

Conversation

aartaka
Copy link

@aartaka aartaka commented Jul 14, 2023

This is a somewhat frivolous interpretation of HTML Standard sections mentioning implied end tags (the respective section numbers are listed in coments). The implementation is hacking into the parser (read-tag) and throwing catch tags for closed tags up the stack.

Performance (not the least important thing, as I guessed) is preserved in 1.5-2 range from Plump without this PR. Tested on SBCL with

(defvar page (dexador:get "https://html.spec.whatwg.org/multipage/parsing.html"))
(time (loop repeat 1000 do (plump:parse page)))

Without implicit end tags:

Evaluation took:
  65.939 seconds of real time
  65.933210 seconds of total run time (65.383450 user, 0.549760 system)
  [ Run times consist of 1.025 seconds GC time, and 64.909 seconds non-GC time. ]
  99.99% CPU
  204,143,004,686 processor cycles
  22,649,541,840 bytes consed

With implied end tags:

Evaluation took:
  108.214 seconds of real time
  108.142281 seconds of total run time (107.410723 user, 0.731558 system)
  [ Run times consist of 1.199 seconds GC time, and 106.944 seconds non-GC time. ]
  99.93% CPU
  334,864,850,988 processor cycles
  23,030,772,272 bytes consed

To me, the improved correctness is well worth the slowdown. I've only tested correctness on simple examples, mostly generated by Spinneret for my own website. So it might actually break on some more complex cases. Hopefully it doesn't.

Fixes #50.

@Shinmera
Copy link
Owner

Thanks for your work.

Unfortunately I cannot merge this as-is, as it interferes in xml parsing modes that do not have the same HTML self-closing behaviour.

@aartaka
Copy link
Author

aartaka commented Jul 15, 2023

Okay, what if the doctype is dynamically bound during parsing? Maybe make a toggle to force HTML/XML parsing?

What's the default parsing mode for Plump, actually? Is it XML?

@Shinmera
Copy link
Owner

There already is a toggle, it's binding the *tag-dispatchers*.

The default parsing mode is a mix between the two, but that really doesn't matter for this.

@aartaka
Copy link
Author

aartaka commented Jul 15, 2023

If I'm reading it right, then checking for XML tag presence in *tag-dispatchers* should be enough to say whether it's XML-ish parsing or no-XML HTML one?

@aartaka
Copy link
Author

aartaka commented Jul 15, 2023

Something like:

(unless (any (lambda (dispatcher)
               (find (tag-dispatcher-name dispatcher) *tag-dispatchers* :key #'tag-dispatcher-name))
             *xml-tags*)
  #|...|#)

@Shinmera
Copy link
Owner

No, that would be an extremely ugly and leaky hack that would break custom dispatcher tables. The fundamental approach of this PR is not mergable.

@aartaka
Copy link
Author

aartaka commented Jul 15, 2023

Okay, what about defining a fully custom default dispatcher (whatever that means, I don't yet understand all this infrastructure of dispatchers) for HTML tags, making it absolutely separate from XML?

@Shinmera
Copy link
Owner

I suppose you could install a custom default parser in the dispatch table for html.

@aartaka
Copy link
Author

aartaka commented Nov 14, 2023

Closing this. Thanks for feedback and pointers!

@aartaka aartaka closed this Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plump doesn't handle self closing tags
2 participants