Is it possible to get line/column number of a tag? #532
Comments
Hi, @yasoob! This is an interesting idea. I think it is not possible with the current data structure, as you said, but it could work if we had a more "complete" tree representation like the one we use internally today. However, I'm not sure how much work it would take if we decide to expose that.

One bit of additional context: the Mochiweb parser is not the most aligned with the specs 😅
So as a fun experiment I spent some time yesterday looking into it. I wanted to get the line numbers of the:
I ended up updating the tokenize function:

tokenize(B, S = #decoder{offset = O}) ->
    case B of
        %% ... Truncated ...
        <<_:O/binary, "</", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?ADV_COL(S, 2)),
            {S2, _} = find_gt(B, S1),
            {{end_tag, Tag, {line_no, S#decoder.line}}, S2};
        <<_:O/binary, "<", C, _/binary>> when
            ?IS_WHITESPACE(C); not ?IS_LETTER(C)
        ->
            %% This isn't really strict HTML
            {{data, Data, _Whitespace}, S1} = tokenize_data(B, ?INC_COL(S)),
            {{data, <<$<, Data/binary>>, false}, S1};
        <<_:O/binary, "<", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?INC_COL(S)),
            {Attrs, S2} = tokenize_attributes(B, S1),
            {S3, HasSlash} = find_gt(B, S2),
            Singleton = HasSlash orelse is_singleton(Tag),
            {{start_tag, Tag, Attrs, Singleton, {line_no, S#decoder.line}}, S3};
        _ ->
            tokenize_data(B, S)
    end.

I did something similar for the attributes and added line numbers there as well. So if I directly use this new tokenize function like this:
It produces output like this:

{:start_tag, "style", [{"type", "text/css", {:line_no, 13}}], false,
 {:line_no, 13}},
{:end_tag, "style", {:line_no, 68}},

I checked the line numbers in the output and they were correct. But as you can imagine, this output can't really be used for any further processing, as all other functions expect a different data structure. This is a long-winded way of saying that it is not only feasible but also works correctly in the scenarios that I tested.

As for the Mochiweb parser not following the HTML specs, do you mind sharing a concrete example? This would help me see if it breaks the kind of work I am trying to do. I don't really care whether the final HTML output is "correct". That is, I don't want Mochiweb to add a missing tag to the final output to make it compliant, but I do want it to accurately tokenize what is present in the input. I actually want the broken output, where tags that are missing in the source are also missing in the tokenized output. This would have been much easier to implement if we had a low-level tokenizer in Elixir, but Mochiweb is what we have.

I had previously tried to add this support in the html5ever NIF, as it also calls an internal method to update the line number during parsing/tokenizing according to this issue. I managed to get as far as printing a line number in the terminal, but it wasn't very reliable and my Rust is very "rusty". I doubt I can get anywhere with that solution without learning more Rust. Maybe you or someone else with more Rust experience can look into it. If we can get this working in the Rust NIF, that would be an even bigger win, but at this point I am open to whatever solution we can come up with to add this support in Floki itself.

I also wasn't aware of the
This is awesome! :D
I can say that most of our bugs are related to lack of support from our current parser. There is one example that can affect your output: multiple whitespace chars are collapsed to just one. So if you have multiple new lines, I think it is going to count incorrectly (I didn't try with your patch).
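To make the whitespace concern concrete, here is a minimal Rust sketch (Rust only because it already appears in this thread). It is not how Floki or Mochiweb is implemented; collapse_whitespace and line_of are hypothetical helpers invented for illustration. It only shows that a line number derived by counting newlines is lost once runs of whitespace have been collapsed, so whether a patch is affected would depend on whether the line counter runs before or after any collapsing.

```rust
/// Collapses every run of ASCII whitespace into a single space.
/// Hypothetical helper for illustration; not Floki's or Mochiweb's code.
fn collapse_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_whitespace = false;
    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Emit one space for the whole run, then skip the rest of it.
            if !in_whitespace {
                out.push(' ');
            }
            in_whitespace = true;
        } else {
            out.push(ch);
            in_whitespace = false;
        }
    }
    out
}

/// 1-based line number of the first occurrence of `needle`,
/// computed by counting the newlines that precede it.
fn line_of(haystack: &str, needle: &str) -> Option<usize> {
    let idx = haystack.find(needle)?;
    Some(haystack[..idx].matches('\n').count() + 1)
}
```

With the input "<p>a</p>\n\n\n<a href=\"#\">x</a>", line_of reports line 4 for "<a " on the raw text, but line 1 once the text has gone through collapse_whitespace.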
I will take a look when I can!
Thinking about it now, I guess we would need to change the parsing.

I cannot promise to add the feature soon, but I look forward to working on this. Also, if you feel comfortable, don't hesitate to send PRs. They are more than welcome!
So I spent some time on this and was able to get the line number from Html5ever as well with the following changes:
pub struct Node {
    id: NodeHandle,
    line_no: u64,
    children: PoolOrVec<NodeHandle>,
    parent: Option<NodeHandle>,
    data: NodeData,
}

impl Node {
    fn new(id: usize, line_no: u64, data: NodeData, pool: &Vec<NodeHandle>) -> Self {
        Node {
            id: NodeHandle(id),
            parent: None,
            children: PoolOrVec::new(pool),
            line_no,
            data,
        }
    }
}

pub struct FlatSink {
    pub root: NodeHandle,
    pub nodes: Vec<Node>,
    pub pool: Vec<NodeHandle>,
    pub current_line: u64,
}

impl FlatSink {
    pub fn new() -> FlatSink {
        let mut sink = FlatSink {
            root: NodeHandle(0),
            nodes: Vec::with_capacity(200),
            pool: Vec::with_capacity(2000),
            current_line: 1,
        };
        // Element 0 is always root
        sink.nodes
            .push(Node::new(0, 1, NodeData::Document, &sink.pool));
        sink
    }
    // ... trunc ...
}

impl TreeSink for FlatSink {
    // ... trunc ...
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }
}

// Do this for all Node types:
NodeData::Document => map
    .map_put(atoms::type_().encode(env), atoms::document().encode(env))
    .map_err(to_custom_error)?
    .map_put(atoms::line_no().encode(env), node.line_no.encode(env))
    .map_err(to_custom_error),

Now if I call the updated NIF, I get:

%{
  0 => %{id: 0, line_no: 1, parent: nil, type: :document},
  1 => %{
    attrs: [],
    children: [2, 27, 28],
    id: 1,
    line_no: 1,
    name: "html",
    parent: 0,
    type: :element
  },
  2 => %{
    attrs: [],
    children: [3, 4, 5, 7, 8, 9, 26],
    id: 2,
    line_no: 2,
    name: "head",
    parent: 1,
    type: :element
  },
  # ...
}

I did not create a PR for the Html5ever repo because this change will break quite a lot of other things and I don't have enough knowledge or experience to fix it all. But I wanted to give you a head start if/when you decide to implement this. Note that Html5ever does not expose column details; it only exposes line numbers.

I hope this helps! This was fun, as I had to learn some Rust and was able to create a separate NIF for a CSS inliner as well. All in all, a good thing to have worked on :D
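The idea behind the patch above can be shown in isolation. The following is a simplified, self-contained Rust sketch and not the actual Html5ever NIF code: LineTrackingSink, its Node type, and create_element are hypothetical stand-ins, and the point where nodes get stamped is an assumption, since that part of the patch is truncated. The parser reports the current line through set_current_line, and each node created afterwards records whatever line was current at that moment.

```rust
/// Simplified stand-in for a tree node; not html5ever's real type.
#[derive(Debug, PartialEq)]
struct Node {
    id: usize,
    line_no: u64,
    name: String,
}

/// Hypothetical sink that remembers the line most recently reported to it.
struct LineTrackingSink {
    nodes: Vec<Node>,
    current_line: u64,
}

impl LineTrackingSink {
    fn new() -> Self {
        LineTrackingSink {
            nodes: Vec::new(),
            current_line: 1,
        }
    }

    /// Called by the parser driver whenever the input line changes.
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }

    /// Creates a node stamped with the line that is current right now
    /// and returns its index in `nodes`.
    fn create_element(&mut self, name: &str) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node {
            id,
            line_no: self.current_line,
            name: name.to_string(),
        });
        id
    }
}
```

In this sketch a node's line is the line at which it was created, which matches the output above, where html is on line 1 and head is on line 2.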
Hi!
I am trying to find an HTML tokenizer for Elixir that can also provide me with the line number of the matching tag. I see that floki_mochi_html has a #decoder{offset} record and there are also references to INC_COL in the code base. I could have tried to extract this information on my own, but I am not well-versed in Erlang. Do you think it is possible to expose this information from Floki?

This is probably going to require changes to the data structure. Maybe flat_parse could contain this additional information?

Please let me know if this is doable. Even if this is not a good fit for Floki, I would love to hear your suggestions on how I could go about implementing this on my own.

Just for some added context, a sample use case for this could be a tool using Floki that extracts all a tags and then lists their line numbers/locations in the HTML document. My own use case is a bit different, but this one is a simpler representative example.
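As a rough illustration of the sample use case (listing the line number of every a tag), here is a minimal Rust sketch. It does not use Floki, Mochiweb, or Html5ever; anchor_tag_lines is a hypothetical helper, and it is a naive byte scan rather than a spec-compliant tokenizer, so it will be fooled by comments, script bodies, or attribute values that contain "<a".

```rust
/// Returns the 1-based line number of every `<a ...>` start tag in `html`.
/// Hypothetical helper for illustration only: a naive scan, not a real
/// HTML tokenizer (comments, <script> bodies, etc. are not handled).
fn anchor_tag_lines(html: &str) -> Vec<usize> {
    let bytes = html.as_bytes();
    let mut line = 1;
    let mut lines = Vec::new();
    for (i, &b) in bytes.iter().enumerate() {
        if b == b'\n' {
            line += 1;
        } else if b == b'<' {
            // The tag name must be exactly "a" (either case) ...
            let is_a = matches!(bytes.get(i + 1).copied(), Some(b'a') | Some(b'A'));
            // ... so the next byte has to end the name.
            let ends_name = matches!(
                bytes.get(i + 2).copied(),
                Some(b' ') | Some(b'\t') | Some(b'\n') | Some(b'\r') | Some(b'>') | Some(b'/')
            );
            if is_a && ends_name {
                lines.push(line);
            }
        }
    }
    lines
}
```

For a document whose a tags open on lines 3 and 6, this returns [3, 6]; a tag that spans several lines is reported at the line where its "<" appears.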