-
-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-English characters are not converted correctly #454
Comments
Hi @tverona1 👋 Thank you for submitting the issue, really interesting finding about latin1 encoding. First of all - could you please check which version are you using? Would be nice to understand if there is a difference in using I'm open to review a PR if this encoding changes do not break other functionality. Would be super-nice if we can have a test which demonstrates a problem and code change which fixes it. Do you think this changes should be applied by default? |
I'm using got, although request will behave similarly according to its documentation - since you're setting encoding to binary (aka latin1), both request and got will pass this as the encoding to Buffer's toString(). See this: https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings I think the desired default behavior for the scraper should be: Always get the body as a Buffer object (not text). When saving to file, save it as either binary or convert to utf8 according to its mime type. There's a pretty good library called mime-types with a function called mime.charset that will return the default charset of a content-type. The above may be a breaking change, so should probably only be introduced in v5. I don't have a ready made repro on the existing code base - I've actually forked it and changed it to use got a while back. I'll see if I can give it a try in the next couple of weeks. It should be easy to repro with a test case that has some non-latin content. |
Also, we should make sure there's a test case for scraping binary content (i.e. png) to make sure thesr changes don't regress that. |
Reopen because a version with a fix 5.1.0 was deprecated and reverted |
@s0ph1e do you have any suggestions on the best way to resolve this? We may need to use this plugin fairly soon to scrape a site which will suffer this issue so I can assign some engineering time to resolving the problem. However we're unfamiliar with the project so some guidance may be useful. |
Hi @phawxby Thank you for your suggestion 👍 As a very quick solution you can use version 5.1.0, it's deprecated because it breaks functionality related to plugins, but if you don't need plugins you can give it a try I don't have a clear vision how to implement a fix in a good way without breaking compatibility and/or major refactoring. I'll try to have a closer look on your PR #496 this week |
* fix: non-english char encoding * fix: optional chaining not supported * fix: various tests not finished yet * fix: tests now pass * fix: testing and code complexity * fix: code climate * fix: change comments * fix: simplify code complexity * refactor: simplify logic and add more testing * fix: reduce cognitive complexity * fix: more cognitive complexity * chore: damn space * fix: broke a test * chore: update docs * test: re-enable utf8 test * chore: update request test
commit 65b8056 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue Dec 6 01:50:13 2022 +0200 Bump fs-extra from 10.1.0 to 11.1.0 (website-scraper#523) Bumps [fs-extra](https://github.com/jprichardson/node-fs-extra) from 10.1.0 to 11.1.0. - [Release notes](https://github.com/jprichardson/node-fs-extra/releases) - [Changelog](https://github.com/jprichardson/node-fs-extra/blob/master/CHANGELOG.md) - [Commits](jprichardson/node-fs-extra@10.1.0...11.1.0) --- updated-dependencies: - dependency-name: fs-extra dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 52a341a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Dec 5 20:36:07 2022 +0000 Bump sinon from 14.0.2 to 15.0.0 (website-scraper#524) commit faa8bab Author: Illia Antypenko <[email protected]> Date: Thu Nov 17 00:11:58 2022 +0200 Update stale.yml commit 5886558 Author: Illia Antypenko <[email protected]> Date: Wed Nov 16 23:24:22 2022 +0200 Update stale.yml commit 2797213 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Nov 16 21:09:28 2022 +0000 Bump actions/setup-node from 2 to 3 (website-scraper#519) commit d9d147a Author: Illia Antypenko <[email protected]> Date: Wed Nov 16 23:06:02 2022 +0200 Delete no-response.yml commit 7b6cbb5 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Nov 16 20:54:40 2022 +0000 Bump paambaati/codeclimate-action from 3.0.0 to 3.2.0 (website-scraper#520) commit 28d6b40 Author: Illia Antypenko <[email protected]> Date: Wed Nov 16 22:40:53 2022 +0200 Delete codeql-analysis.yml commit 145f31d Author: Illia Antypenko <[email protected]> Date: Wed Nov 16 22:40:44 2022 +0200 Create codeql.yml commit 4237388 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Nov 16 17:15:31 2022 +0000 Bump normalize-url from 7.2.0 to 8.0.0 (website-scraper#517) commit 9dc9ef6 Author: Illia Antypenko <[email protected]> Date: Wed Nov 16 19:04:39 2022 +0200 Auto-update of Github Actions commit b169e86 Author: Sofiia Antypenko <[email protected]> Date: Sun Oct 9 09:56:22 2022 +0200 Fix ENOENT when running from working directory without package.json (website-scraper#516) commit de75674 Author: ryanolee <[email protected]> Date: Wed Sep 21 21:28:24 2022 +0100 Remove plugin exports and update docs (website-scraper#511), closes website-scraper#510 * chore(website-scraper#510): Remove plugin exports and update docs * core(website-scraper#510) Update messaging on internal plugins Co-authored-by: ryanolee <[email protected]> commit 6c5cfc5 Author: Sofiia Antypenko <[email protected]> Date: Thu Sep 15 21:55:48 2022 +0200 Revert "RFC: Refactor builtin plugins to expose class properties (website-scraper#506)" (website-scraper#509) This reverts commit 5dde8ec. commit b845fb0 Author: Sofiia Antypenko <[email protected]> Date: Mon Sep 12 10:01:34 2022 +0200 Remove lodash from dependencies (website-scraper#508) commit 5dde8ec Author: ryanolee <[email protected]> Date: Mon Sep 12 08:55:31 2022 +0100 RFC: Refactor builtin plugins to expose class properties (website-scraper#506) * core: Refactor builtin plugins to expose class properties * chore: EsLint Co-authored-by: ryan lee <[email protected]> commit c6ff84e Author: Illia Antypenko <[email protected]> Date: Thu Sep 1 11:55:54 2022 +0300 Remove demo link from the README commit 8d6b5aa Author: s0ph1e <[email protected]> Date: Tue Aug 30 12:23:56 2022 +0200 5.3.0 commit 6bb4e20 Author: Sophia Antipenko <[email protected]> Date: Mon Aug 29 21:56:22 2022 +0200 Use encoding from resource text (website-scraper#504) commit 07c4c02 Author: Illia Antypenko <[email protected]> Date: Sat Jul 2 00:36:57 2022 +0300 Use codeclimate project id instead of secret commit 525dc50 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Fri Jul 1 09:20:14 2022 +0200 Bump cheerio from 1.0.0-rc.11 to 1.0.0-rc.12 (website-scraper#499) Bumps [cheerio](https://github.com/cheeriojs/cheerio) from 1.0.0-rc.11 to 1.0.0-rc.12. - [Release notes](https://github.com/cheeriojs/cheerio/releases) - [Changelog](https://github.com/cheeriojs/cheerio/blob/main/History.md) - [Commits](cheeriojs/cheerio@v1.0.0-rc.11...v1.0.0-rc.12) --- updated-dependencies: - dependency-name: cheerio dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 07e6858 Author: Sophia Antipenko <[email protected]> Date: Thu Jun 23 08:57:14 2022 +0200 Bump version 5.2.0 commit e8c6f9d Author: Paul Hawxby <[email protected]> Date: Thu Jun 23 07:24:45 2022 +0100 fix: non-english char encoding, close website-scraper#454, close website-scraper#466 (website-scraper#496) * fix: non-english char encoding * fix: optional chaining not supported * fix: various tests not finished yet * fix: tests now pass * fix: testing and code complexity * fix: code climate * fix: change comments * fix: simplify code complexity * refactor: simplify logic and add more testing * fix: reduce cognitive complexity * fix: more cognitive complexity * chore: damn space * fix: broke a test * chore: update docs * test: re-enable utf8 test * chore: update request test commit 3bd2555 Author: Sophia Antipenko <[email protected]> Date: Sat Jun 11 14:43:32 2022 +0200 Revert encoding fix website-scraper#482 (website-scraper#495) commit c6f60b8 Author: Sophia Antipenko <[email protected]> Date: Thu Jun 9 14:18:08 2022 +0200 Remove link to gitter from CONTRIBUTING.md commit 087d943 Author: Sophia Antipenko <[email protected]> Date: Thu Jun 9 14:07:10 2022 +0200 Bump version 5.1.0 commit 5a58f48 Author: Illia Antypenko <[email protected]> Date: Thu Jun 9 14:04:51 2022 +0200 Fix encoding issue for non-English websites, closes website-scraper#454 website-scraper#466 (website-scraper#482) commit d80e9b0 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sat May 28 08:57:53 2022 +0200 Bump cheerio from 1.0.0-rc.10 to 1.0.0-rc.11 (website-scraper#491) Bumps [cheerio](https://github.com/cheeriojs/cheerio) from 1.0.0-rc.10 to 1.0.0-rc.11. - [Release notes](https://github.com/cheeriojs/cheerio/releases) - [Changelog](https://github.com/cheeriojs/cheerio/blob/main/History.md) - [Commits](cheeriojs/cheerio@v1.0.0-rc.10...v1.0.0-rc.11) --- updated-dependencies: - dependency-name: cheerio dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 9397b07 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue May 10 10:58:35 2022 +0200 Bump sinon from 13.0.2 to 14.0.0 (website-scraper#488) Bumps [sinon](https://github.com/sinonjs/sinon) from 13.0.2 to 14.0.0. - [Release notes](https://github.com/sinonjs/sinon/releases) - [Changelog](https://github.com/sinonjs/sinon/blob/main/docs/changelog.md) - [Commits](sinonjs/sinon@v13.0.2...v14.0.0) --- updated-dependencies: - dependency-name: sinon dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 66b1e49 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Tue May 3 14:49:33 2022 +0200 Bump mocha from 9.2.2 to 10.0.0 (website-scraper#486) Bumps [mocha](https://github.com/mochajs/mocha) from 9.2.2 to 10.0.0. - [Release notes](https://github.com/mochajs/mocha/releases) - [Changelog](https://github.com/mochajs/mocha/blob/master/CHANGELOG.md) - [Commits](mochajs/mocha@v9.2.2...v10.0.0) --- updated-dependencies: - dependency-name: mocha dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit 848ca1a Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun Feb 6 19:26:26 2022 +0200 Bump sinon from 12.0.1 to 13.0.0 (website-scraper#483) Bumps [sinon](https://github.com/sinonjs/sinon) from 12.0.1 to 13.0.0. - [Release notes](https://github.com/sinonjs/sinon/releases) - [Changelog](https://github.com/sinonjs/sinon/blob/master/docs/changelog.md) - [Commits](sinonjs/sinon@v12.0.1...v13.0.0) --- updated-dependencies: - dependency-name: sinon dependency-type: direct:development update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Hi Sophie,
I noticed that some non-English content end up not getting converted correctly. As a random example, try scraping this page: http://scp-wiki-cn.wikidot.com/scp-083
You'll see that the page is saved with a bunch of � characters.
It looks like the crux of the issue is that the content is always converted to "latin1" encoding.
The reason this is happening is that the request encoding is overridden to 'binary' (in config/defaults.js) but the responseType is the default of 'text'. So, when we get the request body, 'got' calls toString on its internal Buffer, passing in the 'binary' encoding, which is mapped to latin1 here:
https://github.com/nodejs/node/blob/fdce138e1dd86f63f141cbd5ec6b670eeec68986/lib/buffer.js#L708
if (encoding === 'latin1' || encoding === 'binary')
return encodingOps.latin1;
In order to address this, I get the response body as a Buffer (not text) and then later convert to utf8. Specifically, I'm doing the following:
case Buffer.isBuffer(result):
return {
body: result,
metadata: result.metadata || null
};
I haven't generalized this enough to submit a pull request but let me know if you need this.
The text was updated successfully, but these errors were encountered: