Floki removes blank text nodes without option to avoid this #75
Comments
Hi @Eiji7. This is a known issue in the Mochiweb HTML parser, which is the parser behind Floki. The suggested workaround would replace white-spaces between tags with a I tried some expressions to replace and I came up with this:
"<span>5</span> <span>=</span> <span>5</span>"
|> String.replace(~r/>[ \n\r]+</, "> <")
|> Floki.parse
# => [{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]
Not sure if this is safe to do inside Maybe the option you are suggesting could come as
@Eiji7 Surely it's going to add a lot of overhead if you don't need this though? Maybe it should be Floki.preserve_inter_tag_spaces and documented to say that it will copy the entire string to do this.
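To make the opt-in idea concrete, here is a minimal sketch of the pre-processing step from the workaround above, wrapped so the string is only copied when asked for. The module name `InterTagSpaces`, both function names, and the `:preserve_inter_tag_spaces` option are hypothetical and not part of Floki. Note that the regex matches any `>`, whitespace, `<` sequence, so it would also rewrite such runs inside `<pre>` or `<script>` content.

```elixir
defmodule InterTagSpaces do
  # Hypothetical helper for illustration only; Floki does not ship this.

  # Collapses every whitespace-only run between a ">" and a "<" into one space.
  # String.replace/3 builds a new copy of the whole document, which is the
  # overhead discussed above.
  def normalize(html) when is_binary(html) do
    String.replace(html, ~r/>[ \n\r]+</, "> <")
  end

  # Opt-in wrapper: the copy only happens when the (assumed) option is set.
  def maybe_normalize(html, opts \\ []) when is_binary(html) do
    if Keyword.get(opts, :preserve_inter_tag_spaces, false) do
      normalize(html)
    else
      html
    end
  end
end
```

The result of `InterTagSpaces.maybe_normalize(html, preserve_inter_tag_spaces: true)` would then be passed to the parser; without the option the input is returned untouched.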
@aphillipo: yup, this way or #37.
then I could think about creating an XML/HTML parser and validator in Elixir. I can start working on it now, but I haven't worked on a similar project, so I can't promise that it will be really fast. As listed in the 3rd point of issue #37, we need to think about a good algorithm to avoid a decrease in efficiency.
Building HTML parsers is very difficult - I want to use Servo's for this eventually...
@philss and @aphillipo: Did you know
Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:4:4] [async-threads:10]
Interactive Elixir (1.3.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :xmerl_scan.string('<p><span>5</span> <span>=</span> <span>5</span></p>')
{{:xmlElement, :p, :p, [], {:xmlNamespace, [], []}, [], 1, [],
[{:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 1, [],
[{:xmlText, [span: 1, p: 1], 1, [], '5', :text}], [], '/home/eiji',
:undeclared}, {:xmlText, [p: 1], 2, [], ' ', :text},
{:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 3, [],
[{:xmlText, [span: 3, p: 1], 1, [], '=', :text}], [], :undefined,
:undeclared}, {:xmlText, [p: 1], 4, [], ' ', :text},
{:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 5, [],
[{:xmlText, [span: 5, p: 1], 1, [], '5', :text}], [], :undefined,
:undeclared}], [], '/home/eiji', :undeclared}, []}
What do you think about it?
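For comparison with Floki's output, here is a rough sketch that converts the `:xmerl_scan` result above into Floki-style `{tag, attrs, children}` tuples. The module name `XmerlToTree` is made up for this example, and the tuple patterns assume the `xmlElement`, `xmlText` and `xmlAttribute` record layouts visible in the iex output. Keep in mind that xmerl is an XML parser, so it only accepts well-formed input and will fail on a lot of real-world HTML.

```elixir
defmodule XmerlToTree do
  # Hypothetical converter for illustration; not part of Floki or xmerl.

  # Parses a well-formed XML/XHTML string and returns a Floki-like tree.
  def parse(html) when is_binary(html) do
    {root, _rest} = html |> String.to_charlist() |> :xmerl_scan.string()
    convert(root)
  end

  # xmlElement record: attributes and content sit at the 8th and 9th positions.
  defp convert({:xmlElement, name, _exp, _nsinfo, _ns, _parents, _pos, attrs, content, _lang, _base, _def}) do
    children =
      content
      |> Enum.map(&convert/1)
      |> Enum.reject(&is_nil/1)

    {Atom.to_string(name), Enum.map(attrs, &attribute/1), children}
  end

  # xmlText record: whitespace-only values are kept as they are.
  defp convert({:xmlText, _parents, _pos, _lang, value, _type}) do
    List.to_string(value)
  end

  # Comments, processing instructions, etc. are dropped in this sketch.
  defp convert(_other), do: nil

  # xmlAttribute record: the value is a charlist for scanned documents.
  defp attribute({:xmlAttribute, name, _exp, _nsinfo, _ns, _parents, _pos, _lang, value, _norm}) do
    {Atom.to_string(name), List.to_string(value)}
  end
end
```

Calling `XmerlToTree.parse/1` on the `<p>…</p>` string from the session above keeps the `" "` text nodes between the spans, which is the behaviour this issue asks for.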
@Eiji7 Here are some references:
One of the possibilities to fix this is to fix mochiweb's HTML parser. As @aphillipo said, HTML parsers are very difficult to build. I created that issue (#37) without understanding that. Now I'm more inclined to the idea of reusing an existing HTML parser. Servo's html5ever is my favorite choice for now. I have not started working on this, but help trying out this "bridge" between Elixir and Rust would be very welcome! Here is a project that is very promising: https://github.com/hansihe/Rustler.
@philss: Ok, I see your point. I see that your way (the "bridge") is faster to implement. I don't know exactly how it will work. Will it be as fast (with that "bridge") as a native Elixir parser? It may be interesting for me in the more distant future.
@Eiji7 sorry for the delay. I'm not sure how it will work. There does not seem to be a big performance gap when using, for example, an Erlang NIF written in C or Rust. The problems are more related to the safety of the code. If the C/Rust code crashes, the entire Erlang VM can crash. This is one of the reasons I want to give Rust a try with Cool! The idea of writing an HTML parser entirely in Elixir is not dead yet! It is just very hard to do the "right way". I will leave this open as a "bug" and try to figure out a "quick fix" for it. Thanks!
I tried to use So here's a (quite expensive) solution for this problem:
Then, you use this function before your
Hello, I am having the same problem. Now it seems that the mochiweb parser is distributed with Floki, so the problem could be fixed directly instead of relying on a third party (which does not exist anymore AFAICT). I have a workaround with html5ever that works for small payloads, I'm gonna test it with huge, dirty HTML chunks. But it would be nice to have an option to keep all the text nodes regardless of their contents, don't you think?
Actual results:
Expected results:
Note: this is really important in some cases. For example, please try parsing HTML generated from GitHub markup (code samples).