Support implied end tags. #51

aartaka · 2023-07-14T13:27:49Z

This is a somewhat loose interpretation of the HTML Standard sections that mention implied end tags (the respective section numbers are listed in comments). The implementation hooks into the parser (read-tag) and throws to catch tags for closed tags up the stack.
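A minimal sketch of the catch/throw idea, for illustration only. It is not the code from this PR and does not use Plump: the token format, the `node` struct, the rule table, and every function name below are hypothetical. Each open element establishes a `catch` named after its tag; an end tag for an ancestor is consumed and then thrown up the stack, implicitly closing everything in between.

```lisp
;; Toy model, NOT Plump's parser: tokens are (:open NAME), (:close NAME)
;; or (:text STRING); all names here are assumptions for illustration.
(defvar *tokens* nil "Remaining input tokens.")

(defparameter *closed-by-start-tag* '((:li :li) (:p :p :ul))
  "(ELEMENT START-TAG...): ELEMENT ends implicitly when a START-TAG opens.")

(defstruct node name (children '()))

(defun parse-children (node open-tags)
  "Fill NODE with children; OPEN-TAGS names the ancestors, innermost first."
  (let* ((name (node-name node))
         (open-tags (cons name open-tags)))
    ;; An end tag for this element found deeper in the recursion lands here.
    (catch name
      (loop while *tokens* do
        (destructuring-bind (kind value) (first *tokens*)
          (ecase kind
            (:text (pop *tokens*)
             (push value (node-children node)))
            (:open
             ;; Implied end: leave the token for the parent to handle.
             (when (member value (rest (assoc name *closed-by-start-tag*)))
               (return))
             (pop *tokens*)
             ;; Attach the child first, so a later THROW cannot lose it.
             (let ((child (make-node :name value)))
               (push child (node-children node))
               (parse-children child open-tags)))
            (:close
             (pop *tokens*)
             (cond ((eq value name) (return))
                   ;; Closes an ancestor: unwind to its CATCH.
                   ((member value open-tags) (throw value nil))))))))
    node))

(defun tree (node)
  "Convert NODE into nested lists for easy inspection."
  (if (node-p node)
      (cons (node-name node) (mapcar #'tree (reverse (node-children node))))
      node))

(defun parse-tokens (tokens)
  (let ((*tokens* tokens))
    (tree (parse-children (make-node :name :root) '()))))
```

For the token stream of `<ul><li>a<li>b</ul>c`, both `li` elements end up as siblings under `ul` even though neither has an explicit end tag:

```lisp
(parse-tokens '((:open :ul) (:open :li) (:text "a")
                (:open :li) (:text "b") (:close :ul) (:text "c")))
;; => (:ROOT (:UL (:LI "a") (:LI "b")) "c")
```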

Performance (not the least important thing, as I guessed) stays within a 1.5-2x slowdown relative to Plump without this PR. Tested on SBCL with:

(defvar page (dexador:get "https://html.spec.whatwg.org/multipage/parsing.html"))
(time (loop repeat 1000 do (plump:parse page)))

Without implied end tags:

Evaluation took:
  65.939 seconds of real time
  65.933210 seconds of total run time (65.383450 user, 0.549760 system)
  [ Run times consist of 1.025 seconds GC time, and 64.909 seconds non-GC time. ]
  99.99% CPU
  204,143,004,686 processor cycles
  22,649,541,840 bytes consed

With implied end tags:

Evaluation took:
  108.214 seconds of real time
  108.142281 seconds of total run time (107.410723 user, 0.731558 system)
  [ Run times consist of 1.199 seconds GC time, and 106.944 seconds non-GC time. ]
  99.93% CPU
  334,864,850,988 processor cycles
  23,030,772,272 bytes consed

To me, the improved correctness is well worth the slowdown. I've only tested correctness on simple examples, mostly generated by Spinneret for my own website. So it might actually break on some more complex cases. Hopefully it doesn't.

Fixes #50.

Shinmera · 2023-07-15T12:57:54Z

Thanks for your work.

Unfortunately I cannot merge this as-is, as it interferes with XML parsing modes, which do not have HTML's self-closing behaviour.

aartaka · 2023-07-15T13:54:18Z

Okay, what if the doctype is dynamically bound during parsing? Maybe make a toggle to force HTML/XML parsing?

What's the default parsing mode for Plump, actually? Is it XML?

Shinmera · 2023-07-15T13:59:53Z

There already is a toggle, it's binding the *tag-dispatchers*.

The default parsing mode is a mix between the two, but that really doesn't matter for this.
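For illustration, a minimal sketch of the toggle mentioned above, assuming Plump is already loaded (for example via Quicklisp) and that it exports `*xml-tags*` and `*html-tags*` alongside `*tag-dispatchers*`. The two wrapper function names are hypothetical.

```lisp
;; Assumes Plump is loaded, e.g. with (ql:quickload :plump).
;; PARSE-AS-XML and PARSE-AS-HTML are hypothetical helper names.
(defun parse-as-xml (input)
  "Parse INPUT with only the XML tag dispatchers active."
  (let ((plump:*tag-dispatchers* plump:*xml-tags*))
    (plump:parse input)))

(defun parse-as-html (input)
  "Parse INPUT with only the HTML tag dispatchers active."
  (let ((plump:*tag-dispatchers* plump:*html-tags*))
    (plump:parse input)))
```

Because the choice is a dynamic binding, it applies only for the duration of the call and leaves the default mixed mode untouched elsewhere.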

aartaka · 2023-07-15T14:12:41Z

If I'm reading it right, then checking whether any XML tag dispatcher is present in *tag-dispatchers* should be enough to tell XML-ish parsing apart from plain HTML parsing?

aartaka · 2023-07-15T14:17:19Z

Something like:

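;; Only handle implied end tags when no XML dispatcher is active.
;; SOME is the standard Common Lisp predicate here (there is no ANY).
(unless (some (lambda (dispatcher)
                (find (tag-dispatcher-name dispatcher) *tag-dispatchers*
                      :key #'tag-dispatcher-name))
              *xml-tags*)
  #|...|#)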

Shinmera · 2023-07-15T14:25:28Z

No, that would be an extremely ugly and leaky hack that would break custom dispatcher tables. The fundamental approach of this PR is not mergeable.

aartaka · 2023-07-15T17:59:59Z

Okay, what about defining a fully custom default dispatcher (whatever that means, I don't yet understand all this infrastructure of dispatchers) for HTML tags, making it absolutely separate from XML?

Shinmera · 2023-09-17T09:41:32Z

I suppose you could install a custom default parser in the dispatch table for html.

aartaka · 2023-11-14T17:56:48Z

Closing this. Thanks for feedback and pointers!

aartaka added 2 commits July 14, 2023 16:52

parser: Support implied end tags.

471f8b9

parser(read-tag): Add a TODO about implied end tags incompleteness.

895d84b

aartaka mentioned this pull request Jul 21, 2023

Use another library for HTML parsing atlas-engineer/nyxt#3092

Closed

aartaka closed this Nov 14, 2023
