BOM character is not eaten in Jsoup.parse(String) #1070
Comments
I can't reproduce on 1.11.3, 1.11.2, 1.11.1, 1.10.2.
and the result always is:
|
@krystiangorecki thank you for the test.

```java
import java.io.IOException;

import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupIssue1070 {
    public static void main(final String[] args) throws IOException {
        // Fetch the sample page, then parse the response body as a plain String.
        Response response = Jsoup.connect("https://gist.githubusercontent.com/amal/8d489507c25a329df22a4a0d806aff0b/raw/83d7b76c30650f0dbf381569f990d49ba51b09f6/ancient-assyrian-palace-iraq.html").execute();
        Document doc = Jsoup.parse(response.body());

        System.out.println(doc.select("head>meta").size()); // should be more than 0
        System.out.println(doc.select("body>meta").size()); // should be 0
        System.out.println(doc.toString().substring(0, 200));
    }
}
```

my result:
|
You are right. It may be a bug, but it's the same with 1.9.2, 1.8.1 and 1.6.0.
@jhy Is there a reason why |
@amal @krystiangorecki After digging around for a while, I found the cause of the problem: the leading BOM character is not handled properly. The inconsistency between (1) @jhy The BOM character problem has already been reported in several issues, #348 and #1003, and it's only fixed in |
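For readers who want to see the effect in isolation, here is a minimal sketch using only the Java standard library. It shows that a UTF-8 BOM (bytes `EF BB BF`) survives decoding as the character U+FEFF at the start of the string, and how a caller might drop it before handing the string to `Jsoup.parse(String)`. The class `BomSketch` and the helper `stripLeadingBom` are hypothetical names made up for this illustration; they are not part of jsoup and not necessarily the fix proposed in this thread.

```java
import java.nio.charset.StandardCharsets;

public class BomSketch {
    // U+FEFF is the byte order mark; in UTF-8 it is encoded as EF BB BF.
    private static final char BOM = '\uFEFF';

    /** Hypothetical helper (not a jsoup API): drops one leading BOM, if present. */
    static String stripLeadingBom(String html) {
        if (html != null && !html.isEmpty() && html.charAt(0) == BOM) {
            return html.substring(1);
        }
        return html;
    }

    public static void main(String[] args) {
        // Simulate a response body that starts with a UTF-8 BOM.
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'p', '>'};
        String body = new String(raw, StandardCharsets.UTF_8);

        // Java's UTF-8 decoder keeps the BOM as the first character.
        System.out.println(body.charAt(0) == BOM); // prints true

        String cleaned = stripLeadingBom(body);
        System.out.println(cleaned.startsWith("<p>")); // prints true

        // Assumed usage with jsoup on the classpath:
        // Document doc = Jsoup.parse(cleaned);
    }
}
```

Stripping only a single leading character keeps the sketch conservative: a U+FEFF appearing later in the document is left untouched.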
For anyone stumbling upon this proposed fix, watch out that it relies on |
Recent versions of Jsoup parse some pages into a broken DOM.
Here is an example of such HTML:
https://gist.github.com/amal/8d489507c25a329df22a4a0d806aff0b
As a result, the contents of `<head>` are not valid: all head tags end up in the body instead of the head.