Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-English characters are not converted correctly #454

Closed
tverona1 opened this issue Aug 16, 2021 · 6 comments · Fixed by #496
Closed

Non-English characters are not converted correctly #454

tverona1 opened this issue Aug 16, 2021 · 6 comments · Fixed by #496
Labels

Comments

@tverona1
Copy link

Hi Sophie,

I noticed that some non-English content end up not getting converted correctly. As a random example, try scraping this page: http://scp-wiki-cn.wikidot.com/scp-083

You'll see that the page is saved with a bunch of � characters.

It looks like the crux of the issue is that the content is always converted to "latin1" encoding.

The reason this is happening is that the request encoding is overridden to 'binary' (in config/defaults.js) but the responseType is the default of 'text'. So, when we get the request body, 'got' calls toString on its internal Buffer, passing in the 'binary' encoding, which is mapped to latin1 here:

https://github.com/nodejs/node/blob/fdce138e1dd86f63f141cbd5ec6b670eeec68986/lib/buffer.js#L708
if (encoding === 'latin1' || encoding === 'binary')
return encodingOps.latin1;

In order to address this, I get the response body as a Buffer (not text) and then later convert to utf8. Specifically, I'm doing the following:

  1. In beforeRequest: Set request options to: requestOptions = {...requestOptions, encoding: 'binary', responseType: 'buffer'}
  2. In afterResponse: If mime type is html / css / js, return reponse.body.toString('utf8'). Otherwise, return response.body (i.e. the Buffer object).
  3. In request.js: In transformResult(), added an additional case:
    case Buffer.isBuffer(result):
    return {
    body: result,
    metadata: result.metadata || null
    };
  4. In save-resource-to-fs-plugin.js, if type is html / css / js, set encoding to 'utf8' instead of 'binary'.

I haven't generalized this enough to submit a pull request but let me know if you need this.

@s0ph1e
Copy link
Member

s0ph1e commented Aug 18, 2021

Hi @tverona1 👋

Thank you for submitting the issue, really interesting finding about latin1 encoding.

First of all - could you please check which version are you using? Would be nice to understand if there is a difference in using request vs using got for http calls.
Now latest published to npm version is 4.x (it uses request, not got) - it's in separate branch 4.x
Main branch contains work-in-progress for future version 5 which uses got which is not published to npm yet.

I'm open to review a PR if this encoding changes do not break other functionality. Would be super-nice if we can have a test which demonstrates a problem and code change which fixes it.

Do you think this changes should be applied by default?

@tverona1
Copy link
Author

tverona1 commented Aug 18, 2021

I'm using got, although request will behave similarly according to its documentation - since you're setting encoding to binary (aka latin1), both request and got will pass this as the encoding to Buffer's toString(). See this: https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

I think the desired default behavior for the scraper should be: Always get the body as a Buffer object (not text). When saving to file, save it as either binary or convert to utf8 according to its mime type. There's a pretty good library called mime-types with a function called mime.charset that will return the default charset of a content-type.

The above may be a breaking change, so should probably only be introduced in v5.

I don't have a ready made repro on the existing code base - I've actually forked it and changed it to use got a while back. I'll see if I can give it a try in the next couple of weeks. It should be easy to repro with a test case that has some non-latin content.

@tverona1
Copy link
Author

Also, we should make sure there's a test case for scraping binary content (i.e. png) to make sure thesr changes don't regress that.

@s0ph1e
Copy link
Member

s0ph1e commented Jun 11, 2022

Reopen because a version with a fix 5.1.0 was deprecated and reverted

@s0ph1e s0ph1e reopened this Jun 11, 2022
@phawxby
Copy link
Contributor

phawxby commented Jun 18, 2022

@s0ph1e do you have any suggestions on the best way to resolve this? We may need to use this plugin fairly soon to scrape a site which will suffer this issue so I can assign some engineering time to resolving the problem. However we're unfamiliar with the project so some guidance may be useful.

@s0ph1e
Copy link
Member

s0ph1e commented Jun 20, 2022

Hi @phawxby

Thank you for your suggestion 👍

As a very quick solution you can use version 5.1.0, it's deprecated because it breaks functionality related to plugins, but if you don't need plugins you can give it a try

I don't have a clear vision how to implement a fix in a good way without breaking compatibility and/or major refactoring.

I'll try to have a closer look on your PR #496 this week

s0ph1e pushed a commit that referenced this issue Jun 23, 2022
* fix: non-english char encoding

* fix: optional chaining not supported

* fix: various tests

not finished yet

* fix: tests now pass

* fix: testing and code complexity

* fix: code climate

* fix: change comments

* fix: simplify code complexity

* refactor: simplify logic and add more testing

* fix: reduce cognitive complexity

* fix: more cognitive complexity

* chore: damn space

* fix: broke a test

* chore: update docs

* test: re-enable utf8 test

* chore: update request test
yamanyadav8810 added a commit to yamanyadav8810/node-website-scraper that referenced this issue Feb 14, 2023
commit 65b8056
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Dec 6 01:50:13 2022 +0200

    Bump fs-extra from 10.1.0 to 11.1.0 (website-scraper#523)

    Bumps [fs-extra](https://github.com/jprichardson/node-fs-extra) from 10.1.0 to 11.1.0.
    - [Release notes](https://github.com/jprichardson/node-fs-extra/releases)
    - [Changelog](https://github.com/jprichardson/node-fs-extra/blob/master/CHANGELOG.md)
    - [Commits](jprichardson/node-fs-extra@10.1.0...11.1.0)

    ---
    updated-dependencies:
    - dependency-name: fs-extra
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 52a341a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Dec 5 20:36:07 2022 +0000

    Bump sinon from 14.0.2 to 15.0.0 (website-scraper#524)

commit faa8bab
Author: Illia Antypenko <[email protected]>
Date:   Thu Nov 17 00:11:58 2022 +0200

    Update stale.yml

commit 5886558
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 23:24:22 2022 +0200

    Update stale.yml

commit 2797213
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 16 21:09:28 2022 +0000

    Bump actions/setup-node from 2 to 3 (website-scraper#519)

commit d9d147a
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 23:06:02 2022 +0200

    Delete no-response.yml

commit 7b6cbb5
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 16 20:54:40 2022 +0000

    Bump paambaati/codeclimate-action from 3.0.0 to 3.2.0 (website-scraper#520)

commit 28d6b40
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 22:40:53 2022 +0200

    Delete codeql-analysis.yml

commit 145f31d
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 22:40:44 2022 +0200

    Create codeql.yml

commit 4237388
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 16 17:15:31 2022 +0000

    Bump normalize-url from 7.2.0 to 8.0.0 (website-scraper#517)

commit 9dc9ef6
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 19:04:39 2022 +0200

    Auto-update of Github Actions

commit b169e86
Author: Sofiia Antypenko <[email protected]>
Date:   Sun Oct 9 09:56:22 2022 +0200

    Fix ENOENT when running from working directory without package.json (website-scraper#516)

commit de75674
Author: ryanolee <[email protected]>
Date:   Wed Sep 21 21:28:24 2022 +0100

    Remove plugin exports and update docs (website-scraper#511), closes website-scraper#510

    * chore(website-scraper#510): Remove plugin exports and update docs

    * core(website-scraper#510) Update messaging on internal plugins

    Co-authored-by: ryanolee <[email protected]>

commit 6c5cfc5
Author: Sofiia Antypenko <[email protected]>
Date:   Thu Sep 15 21:55:48 2022 +0200

    Revert "RFC: Refactor builtin plugins to expose class properties (website-scraper#506)" (website-scraper#509)

    This reverts commit 5dde8ec.

commit b845fb0
Author: Sofiia Antypenko <[email protected]>
Date:   Mon Sep 12 10:01:34 2022 +0200

    Remove lodash from dependencies (website-scraper#508)

commit 5dde8ec
Author: ryanolee <[email protected]>
Date:   Mon Sep 12 08:55:31 2022 +0100

    RFC: Refactor builtin plugins to expose class properties (website-scraper#506)

    * core: Refactor builtin plugins to expose class properties

    * chore: EsLint

    Co-authored-by: ryan lee <[email protected]>

commit c6ff84e
Author: Illia Antypenko <[email protected]>
Date:   Thu Sep 1 11:55:54 2022 +0300

    Remove demo link from the README

commit 8d6b5aa
Author: s0ph1e <[email protected]>
Date:   Tue Aug 30 12:23:56 2022 +0200

    5.3.0

commit 6bb4e20
Author: Sophia Antipenko <[email protected]>
Date:   Mon Aug 29 21:56:22 2022 +0200

    Use encoding from resource text (website-scraper#504)

commit 07c4c02
Author: Illia Antypenko <[email protected]>
Date:   Sat Jul 2 00:36:57 2022 +0300

    Use codeclimate project id instead of secret

commit 525dc50
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Jul 1 09:20:14 2022 +0200

    Bump cheerio from 1.0.0-rc.11 to 1.0.0-rc.12 (website-scraper#499)

    Bumps [cheerio](https://github.com/cheeriojs/cheerio) from 1.0.0-rc.11 to 1.0.0-rc.12.
    - [Release notes](https://github.com/cheeriojs/cheerio/releases)
    - [Changelog](https://github.com/cheeriojs/cheerio/blob/main/History.md)
    - [Commits](cheeriojs/cheerio@v1.0.0-rc.11...v1.0.0-rc.12)

    ---
    updated-dependencies:
    - dependency-name: cheerio
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 07e6858
Author: Sophia Antipenko <[email protected]>
Date:   Thu Jun 23 08:57:14 2022 +0200

    Bump version 5.2.0

commit e8c6f9d
Author: Paul Hawxby <[email protected]>
Date:   Thu Jun 23 07:24:45 2022 +0100

    fix: non-english char encoding, close website-scraper#454, close website-scraper#466 (website-scraper#496)

    * fix: non-english char encoding

    * fix: optional chaining not supported

    * fix: various tests

    not finished yet

    * fix: tests now pass

    * fix: testing and code complexity

    * fix: code climate

    * fix: change comments

    * fix: simplify code complexity

    * refactor: simplify logic and add more testing

    * fix: reduce cognitive complexity

    * fix: more cognitive complexity

    * chore: damn space

    * fix: broke a test

    * chore: update docs

    * test: re-enable utf8 test

    * chore: update request test

commit 3bd2555
Author: Sophia Antipenko <[email protected]>
Date:   Sat Jun 11 14:43:32 2022 +0200

    Revert encoding fix website-scraper#482 (website-scraper#495)

commit c6f60b8
Author: Sophia Antipenko <[email protected]>
Date:   Thu Jun 9 14:18:08 2022 +0200

    Remove link to gitter from CONTRIBUTING.md

commit 087d943
Author: Sophia Antipenko <[email protected]>
Date:   Thu Jun 9 14:07:10 2022 +0200

    Bump version 5.1.0

commit 5a58f48
Author: Illia Antypenko <[email protected]>
Date:   Thu Jun 9 14:04:51 2022 +0200

    Fix encoding issue for non-English websites, closes website-scraper#454 website-scraper#466 (website-scraper#482)

commit d80e9b0
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sat May 28 08:57:53 2022 +0200

    Bump cheerio from 1.0.0-rc.10 to 1.0.0-rc.11 (website-scraper#491)

    Bumps [cheerio](https://github.com/cheeriojs/cheerio) from 1.0.0-rc.10 to 1.0.0-rc.11.
    - [Release notes](https://github.com/cheeriojs/cheerio/releases)
    - [Changelog](https://github.com/cheeriojs/cheerio/blob/main/History.md)
    - [Commits](cheeriojs/cheerio@v1.0.0-rc.10...v1.0.0-rc.11)

    ---
    updated-dependencies:
    - dependency-name: cheerio
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 9397b07
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue May 10 10:58:35 2022 +0200

    Bump sinon from 13.0.2 to 14.0.0 (website-scraper#488)

    Bumps [sinon](https://github.com/sinonjs/sinon) from 13.0.2 to 14.0.0.
    - [Release notes](https://github.com/sinonjs/sinon/releases)
    - [Changelog](https://github.com/sinonjs/sinon/blob/main/docs/changelog.md)
    - [Commits](sinonjs/sinon@v13.0.2...v14.0.0)

    ---
    updated-dependencies:
    - dependency-name: sinon
      dependency-type: direct:development
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 66b1e49
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue May 3 14:49:33 2022 +0200

    Bump mocha from 9.2.2 to 10.0.0 (website-scraper#486)

    Bumps [mocha](https://github.com/mochajs/mocha) from 9.2.2 to 10.0.0.
    - [Release notes](https://github.com/mochajs/mocha/releases)
    - [Changelog](https://github.com/mochajs/mocha/blob/master/CHANGELOG.md)
    - [Commits](mochajs/mocha@v9.2.2...v10.0.0)

    ---
    updated-dependencies:
    - dependency-name: mocha
      dependency-type: direct:development
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 848ca1a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Feb 6 19:26:26 2022 +0200

    Bump sinon from 12.0.1 to 13.0.0 (website-scraper#483)

    Bumps [sinon](https://github.com/sinonjs/sinon) from 12.0.1 to 13.0.0.
    - [Release notes](https://github.com/sinonjs/sinon/releases)
    - [Changelog](https://github.com/sinonjs/sinon/blob/master/docs/changelog.md)
    - [Commits](sinonjs/sinon@v12.0.1...v13.0.0)

    ---
    updated-dependencies:
    - dependency-name: sinon
      dependency-type: direct:development
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants