[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

zuozhiw · 2018-06-12T21:01:33Z

Issue #1070 shows a case where Jsoup not parsing an HTML page correctly and putting all the information in the body instead of head. The problem can be reproducible only if use Jsoup.parse(response.body()). Using another method Jsoup.connect(url).get() won't reproduce the same problem.

After digging around for a while, it turns out that the HTML page has the BOM character 65279 in its first character. The reason that the second method, Jsoup.connect(url).get(), internally uses DataUtil.parseInputStream(), which handles the BOM character correctly. However, Jsoup.parse(string) constructs a HTMLTreeBuilder to parse the document directly, therefore the BOM handling process is not there.

This PR fixes this issue by introducing the same handling process in TreeBuilder constructor. Before it passes the input reader to the tokenizer, it reuses the same helper function DataUtil.detectCharsetFromBom to detect the BOM character, and skip the BOM character if needed.

This PR also adds a new test case in DataUtilTest to test if Jsoup.parse(string) can work correctly with BOM. After adding the fix, the problem in issue #1070 can produce the correct result.

There are already several efforts to fix the problem caused by the BOM character:
Issue #348 is the first issue mentioning the problem, and it is fixed in commit 3f9f33d , this commit only fixes the issue in parseByteData (which later becomes parseInputStream).

Later there are several commits to refactor the BOM handling code, such as c3cbe1b , 4eb4f2b

The latest issue mentioning the problem is Issue #1003, and the corresponding fix is in 0f7e0cc , however, this commit still only fixes the function parseInputStream, which is not used by Jsoup.parse(string).

@jhy I'm not sure if TreeBuilder class is the best place to put in the fix code. I am willing to devote more efforts to improve it if you could give some more advice. Thanks :)

panthony · 2018-07-25T08:10:15Z

Any idea when this could be merged ?

panthony · 2018-12-13T10:04:24Z

Using this fix I eventually encountered a nasty issue that made me lost a fair amount of time.

Long story short, I had a piece of code that sometimes where running on a machine that had US-ASCII as default charset instead of UTF-8.

Because of this, the BOM was badly handled was it relies on Charset.defaultCharset().

This is often bad practice to rely on default Locale/Charset, maybe JSoup could take a look at https://github.com/policeman-tools/forbidden-apis (story behind this project: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html)

jhy · 2018-12-27T03:25:06Z

Hi, thanks for this. I am worried that this only handles the UTF8 BOM, and probably only works if the string was actually decoded as a UTF8 string. It doesn't help (afaict) for UTF16 and 32 BOMs.

My thought is that we should implement so that we expect any input String to have been correctly decoded.

The methods in Connection and Jsoup.load (for file and input stream) correctly handle the different BOMs. The user is being impacted for getting the request body string and that's where the BOM is not handles (and the string is invalid effectively). How about we add the BOM routine to the body() method?

panthony · 2018-12-27T07:06:49Z

Not sure which method we are talking about but you cannot convert back the given string to bytes without taking which charset it was since you cannot rely on defaultCharset.

In my case, if JSoup.parse were to take an InputStream instead of a String I could just wrap it inside a DOMInputStream (Apache commons-io).

Maybe this could just be documented:

This is the connection layer that handles the BOM implicitly
The String input must be cleared from the BOM markup (only the user can do this since he's the only one to know how it is encoded)
The JSoup could provide an helper that takes a the necessary information (string, charset) to cleanup the string if necessary

fix jsoup parse string not skip BOM character correctly

6bae9ba

zuozhiw mentioned this pull request Jun 12, 2018

BOM character is not eaten in Jsoup.parse(String) #1070

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

zuozhiw commented Jun 12, 2018

panthony commented Jul 25, 2018

panthony commented Dec 13, 2018

jhy commented Dec 27, 2018

panthony commented Dec 27, 2018 •

edited

Loading

[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

Are you sure you want to change the base?

[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

Conversation

zuozhiw commented Jun 12, 2018

panthony commented Jul 25, 2018

panthony commented Dec 13, 2018

jhy commented Dec 27, 2018

panthony commented Dec 27, 2018 • edited Loading

panthony commented Dec 27, 2018 •

edited

Loading