BOM character is not eaten in Jsoup.parse(String) #1070

amal · 2018-05-30T19:48:27Z

Recent versions of Jsoup parse some pages into a broken DOM.

Here is an example of such HTML:
https://gist.github.com/amal/8d489507c25a329df22a4a0d806aff0b

As a result, the <head> is left empty and all of its tags end up in <body>:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head></head>
 <body>
     
  <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.0, minimal-ui"> 
  <!-- header tags --> 
  <meta name="guid" content="150305195222805">
  <title>ISIL fighters bulldoze ancient Assyrian palace in Iraq | Iraq News | Al Jazeera</title>
  <meta name="title" content="ISIL fighters bulldoze ancient Assyrian palace in Iraq">
  <meta name="channel" content="">
  <meta name="author" content="">

...


krystiangorecki · 2018-05-31T13:00:40Z

I can't reproduce on 1.11.3, 1.11.2, 1.11.1, 1.10.2.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupIssue1070 {
	public static void main(final String[] args) throws IOException {
		Document doc = Jsoup.connect("https://gist.githubusercontent.com/amal/8d489507c25a329df22a4a0d806aff0b/raw/38934f389fc300574046080d44338a445e9c9b1e/ancient-assyrian-palace-iraq.html").get();
		System.out.println(doc.select("head>meta").size()); // should be more than 0
		System.out.println(doc.select("body>meta").size()); // should be 0
		System.out.println(doc.toString().substring(0, 100));
	}
}

and the result is always:

30
0
<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta name="viewport" content=

amal · 2018-05-31T22:40:26Z

@krystiangorecki thank you for the test.
The bug seems to be reproducible only with Jsoup.parse(String).
Try this snippet:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.Connection.Response;

public class JsoupIssue1070 {
	public static void main(final String[] args) throws IOException {
		Response response = Jsoup.connect("https://gist.githubusercontent.com/amal/8d489507c25a329df22a4a0d806aff0b/raw/83d7b76c30650f0dbf381569f990d49ba51b09f6/ancient-assyrian-palace-iraq.html").execute();
		Document doc = Jsoup.parse(response.body());
		System.out.println(doc.select("head>meta").size()); // should be more than 0
		System.out.println(doc.select("body>meta").size()); // should be 0
		System.out.println(doc.toString().substring(0, 200));
	}
}

my result:

0
30
<html xmlns="http://www.w3.org/1999/xhtml">
 <head></head>
 <body>
     
  <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.0, minimal-ui"> 
  <

krystiangorecki · 2018-05-31T22:54:25Z

You are right. It may be a bug, but it's the same with 1.9.2, 1.8.1 and 1.6.0.
It seems the proper way to parse a response is:
Document doc = response.parse();

krystiangorecki · 2018-05-31T23:14:32Z

@jhy Is there a reason why Jsoup.parse(response.body()) is not consistent with response.parse()?

zuozhiw · 2018-06-12T21:10:44Z

@amal @krystiangorecki After digging around for a while, I found the cause of the problem: the leading BOM character is not handled properly.

(1) Jsoup.parse(string) and (2) response.parse() are inconsistent because, internally, (2) goes through the parseInputStream function, which reads the input stream repeatedly through a smaller buffer. That way Jsoup avoids reading all the data into one big string and saves memory. The BOM character is handled properly in parseInputStream, but method (1) doesn't use that function and lacks the code to handle the BOM.

@jhy The BOM character problem has already come up in several issues, #348 and #1003, and it has only been fixed in the parseInputStream function. I created PR #1073 to fix the same problem for method (1), Jsoup.parse(string).
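
Until the String path handles the BOM itself, a caller-side workaround can be sketched as follows. This is a minimal sketch, not the change proposed in #1073: the class name BomStripper and the method stripLeadingBom are hypothetical, and it assumes the BOM reaches the caller as a single leading U+FEFF character in the decoded string.

```java
final class BomStripper {
    // U+FEFF is the byte order mark as it appears once the bytes have been decoded to a String.
    private static final char BOM = '\uFEFF';

    /** Returns the input without one leading BOM; any other input is returned unchanged. */
    static String stripLeadingBom(String html) {
        if (html != null && !html.isEmpty() && html.charAt(0) == BOM) {
            return html.substring(1);
        }
        return html;
    }

    public static void main(String[] args) {
        // Hypothetical input: a tiny document that starts with a BOM.
        String withBom = "\uFEFF<html><head><title>t</title></head></html>";
        String cleaned = stripLeadingBom(withBom);
        System.out.println(cleaned.startsWith("<html>")); // prints "true"
    }
}
```

The cleaned string can then be passed on, for example Jsoup.parse(BomStripper.stripLeadingBom(response.body())). Given the diagnosis above, this should keep the BOM from reaching the parser, but it has not been checked against the page from this issue.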

panthony · 2018-12-13T11:06:50Z

For anyone stumbling upon this proposed fix: watch out, it relies on Charset.defaultCharset() to convert the string back into a byte array, which may lead to the BOM being mishandled if the default charset is "invalid" (e.g. US-ASCII).
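
The charset pitfall can be seen with the standard library alone. The sketch below is an illustration, not code from the PR: the class BomCharsetDemo and the method encodeBom are hypothetical names. It encodes a lone U+FEFF with two charsets. US-ASCII cannot represent the character, so String.getBytes substitutes the charset's replacement byte '?', and the BOM is no longer recognizable in the resulting byte array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BomCharsetDemo {
    /** Encodes a string consisting only of U+FEFF with the given charset. */
    static byte[] encodeBom(Charset charset) {
        return "\uFEFF".getBytes(charset);
    }

    public static void main(String[] args) {
        // UTF-8 yields the familiar three-byte BOM EF BB BF (shown as signed bytes).
        System.out.println(Arrays.toString(encodeBom(StandardCharsets.UTF_8)));    // [-17, -69, -65]
        // US-ASCII has no mapping for U+FEFF, so the encoder emits '?' (63) instead.
        System.out.println(Arrays.toString(encodeBom(StandardCharsets.US_ASCII))); // [63]
    }
}
```

Passing an explicit charset such as StandardCharsets.UTF_8 instead of relying on the platform default avoids this dependence on the environment.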

zuozhiw mentioned this issue Jun 12, 2018

[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

Open

jhy changed the title ~~Broken DOM with last Jsoup versions~~ BOM character is not eaten in Jsoup.parse(String) Dec 22, 2020
