BOM character is not eaten in Jsoup.parse(String) #1070

Open
amal opened this issue May 30, 2018 · 6 comments

@amal

amal commented May 30, 2018

Recent versions of Jsoup parse some pages into a broken DOM.

Here is an example of such HTML:
https://gist.github.com/amal/8d489507c25a329df22a4a0d806aff0b

As a result, the <head> is left empty and all head tags end up in <body> instead:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head></head>
 <body>
     
  <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.0, minimal-ui"> 
  <!-- header tags --> 
  <meta name="guid" content="150305195222805">
  <title>ISIL fighters bulldoze ancient Assyrian palace in Iraq | Iraq News | Al Jazeera</title>
  <meta name="title" content="ISIL fighters bulldoze ancient Assyrian palace in Iraq">
  <meta name="channel" content="">
  <meta name="author" content="">

...
@krystiangorecki
Contributor

krystiangorecki commented May 31, 2018

I can't reproduce on 1.11.3, 1.11.2, 1.11.1, 1.10.2.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupIssue1070 {
	public static void main(final String[] args) throws IOException {
		Document doc = Jsoup.connect("https://gist.githubusercontent.com/amal/8d489507c25a329df22a4a0d806aff0b/raw/38934f389fc300574046080d44338a445e9c9b1e/ancient-assyrian-palace-iraq.html").get();
		System.out.println(doc.select("head>meta").size()); // should be more than 0
		System.out.println(doc.select("body>meta").size()); // should be 0
		System.out.println(doc.toString().substring(0, 100));
	}
}

and the result is always:

30
0
<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta name="viewport" content=

@amal
Author

amal commented May 31, 2018

@krystiangorecki thank you for the test.
The bug seems to be reproducible only with Jsoup.parse.
Try this snippet:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.Connection.Response;

public class JsoupIssue1070 {
	public static void main(final String[] args) throws IOException {
		Response response = Jsoup.connect("https://gist.githubusercontent.com/amal/8d489507c25a329df22a4a0d806aff0b/raw/83d7b76c30650f0dbf381569f990d49ba51b09f6/ancient-assyrian-palace-iraq.html").execute();
		Document doc = Jsoup.parse(response.body());
		System.out.println(doc.select("head>meta").size()); // should be more than 0
		System.out.println(doc.select("body>meta").size()); // should be 0
		System.out.println(doc.toString().substring(0, 200));
	}
}

My result:

0
30
<html xmlns="http://www.w3.org/1999/xhtml">
 <head></head>
 <body>
     
  <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.0, minimal-ui"> 
  <

@krystiangorecki
Contributor

krystiangorecki commented May 31, 2018

You are right. It may be a bug, but the behavior is the same with 1.9.2, 1.8.1 and 1.6.0.
It seems the proper way to parse the response is:
Document doc = response.parse();

@krystiangorecki
Contributor

@jhy Is there a reason why Jsoup.parse(response.body()) is not consistent with response.parse()?

@zuozhiw
Copy link

zuozhiw commented Jun 12, 2018

@amal @krystiangorecki After digging around for a while, I found the cause of the problem: the leading BOM character is not handled properly.

The inconsistency between (1) Jsoup.parse(string) and (2) response.parse() arises because, internally, (2) uses a function parseInputStream, which reads the input stream repeatedly through a smaller buffer. This way Jsoup can avoid reading all the data into one big string and save memory. The BOM character is handled properly in the parseInputStream function; method (1), however, doesn't use that function and lacks the code to handle the BOM.

@jhy The BOM character problem has already come up in several issues, #348 and #1003, and it is fixed only in the parseInputStream function. I created PR #1073 to fix the same issue for method (1), Jsoup.parse(string).
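
Until the string path handles the BOM itself, a caller-side workaround is to drop a leading U+FEFF before handing the string to Jsoup.parse. The sketch below is only an illustration: BomStripper and stripLeadingBom are hypothetical names, not part of jsoup, and the code assumes the BOM survived decoding as the single character U+FEFF at index 0.

```java
/** Hypothetical helper for illustration; not part of jsoup. */
final class BomStripper {
    // After decoding, a byte order mark shows up as this single char.
    private static final char BOM = '\uFEFF';

    private BomStripper() {}

    /** Returns html without a leading BOM; other input is returned unchanged. */
    static String stripLeadingBom(String html) {
        if (html != null && !html.isEmpty() && html.charAt(0) == BOM) {
            return html.substring(1);
        }
        return html;
    }

    public static void main(String[] args) {
        String raw = "\uFEFF<html><head><title>t</title></head><body></body></html>";
        String clean = stripLeadingBom(raw);
        System.out.println(clean.startsWith("<html>")); // prints "true"
        // With jsoup on the classpath, the cleaned string would then be parsed:
        // Document doc = Jsoup.parse(clean);
    }
}
```

When the Response object is still available, response.parse() as suggested earlier in the thread avoids the need for this step.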

@panthony

For anyone stumbling upon this proposed fix: watch out, it relies on Charset.defaultCharset() to convert the string back into a byte array, which may lead to the BOM being handled badly if the default charset cannot represent it (e.g. US-ASCII).
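
To make the charset concern concrete, here is a small standalone check of how a BOM character encodes under two charsets. It uses explicit charsets instead of Charset.defaultCharset() so the result does not depend on the machine, and it does not reproduce the code from the PR; BomCharsetDemo and hex are names made up for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; shows how U+FEFF encodes under different charsets. */
final class BomCharsetDemo {
    private BomCharsetDemo() {}

    /** Formats bytes as upper-case hex without separators. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        // UTF-8 can represent the BOM, so the usual three-byte sequence comes out.
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // prints "EFBBBF"
        // US-ASCII cannot, so getBytes substitutes '?' and the BOM is lost.
        System.out.println(hex(bom.getBytes(StandardCharsets.US_ASCII))); // prints "3F"
    }
}
```

Once the BOM has been replaced by '?', byte-level BOM detection no longer has anything to match, which is presumably the failure mode being warned about.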

@jhy changed the title from "Broken DOM with last Jsoup versions" to "BOM character is not eaten in Jsoup.parse(String)" on Dec 22, 2020