Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: non-english char encoding #496

Merged
merged 16 commits into from
Jun 23, 2022
Merged

fix: non-english char encoding #496

merged 16 commits into from
Jun 23, 2022

Conversation

phawxby
Copy link
Contributor

@phawxby phawxby commented Jun 18, 2022

This closes #454 and works in a nice generic way by passing through an encoding param which is used to save the file. This required some changes to the afterResponse API to allow it to work correctly however via normalizeResponse it will retain compatibility. Now I just need to fix-up the tests

@phawxby phawxby marked this pull request as ready for review June 18, 2022 18:43
@phawxby
Copy link
Contributor Author

phawxby commented Jun 20, 2022

@s0ph1e moving the conversation here. I've refactored a bit further this morning and I believe what I've got now is solid and retains compatibility with the existing API. However if people using it want it to work right then afterResponse handlers should be updated to return the modified response object rather than just the body. This way it can read the encoding from the incoming response to pass through to the resource it make sure it is written correctly.

@phawxby
Copy link
Contributor Author

phawxby commented Jun 20, 2022

@s0ph1e for full transparency I think this PR might make #386 worse. utf8 representations of strings in memory are larger than binary representations. You get a more accurate result at the expense of memory.

console.log({
  utf8: Buffer.byteLength('Обладнання та ПЗ', 'utf8'),
  binary: Buffer.byteLength('Обладнання та ПЗ', 'binary'),
});

Result: { utf8: 30, binary: 16 }

Copy link
Member

@s0ph1e s0ph1e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @phawxby 👋

Thank you for your PR
I tested it locally and looks like everything works fine 💯 🎉
Few comments from my side:

// Wow, happy to see Ukrainian here 🇺🇦 😉 
Buffer.byteLength('Обладнання та ПЗ', 'utf8') 

* `body` (response body, string)
* `encoding` (`binary` or `utf8`) used to save the file, binary used by default.
* `metadata` (object) - everything you want to save for this resource (like headers, original text, timestamps, etc.), scraper will not use this field at all, it is only for result.
* a binary `string`. This is advised against because of the binary assumption being made can foul up saving of `utf8` responses to the filesystem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we bump major version for this change then we can get rid of string case here. That will simplify the code a lot

@aivus
Copy link
Member

aivus commented Jun 22, 2022

For some reason code coverage job fails - do you think it may be related to the changes? (on master it passes)

We can ignore this currently. It's GitHub security limitations.

@phawxby
Copy link
Contributor Author

phawxby commented Jun 22, 2022

Hi @phawxby 👋

Thank you for your PR I tested it locally and looks like everything works fine 💯 🎉 Few comments from my side:

// Wow, happy to see Ukrainian here 🇺🇦 😉 
Buffer.byteLength('Обладнання та ПЗ', 'utf8') 

No problem, this module has helped us in a pinch and mostly served our needs apart from these issues so was happy to contribute back to resolve them, Ukrainian characters were misbehaving and the string above was more problematic than most for some reason I didn't have time to investigate.

  1. Happy to remove string support but perhaps we do it in feat: improve memory performance and custom container classes #497 to avoid a major version bump.
  2. Done, I've also expanded the test and added some more elsewhere.
  3. I don't know but it's suggested we can ignore it.

@phawxby phawxby requested a review from s0ph1e June 22, 2022 14:24
Copy link
Member

@s0ph1e s0ph1e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you @phawxby ✅ 🚀

Going to merge & release a new version today

@s0ph1e s0ph1e merged commit e8c6f9d into website-scraper:master Jun 23, 2022
yamanyadav8810 added a commit to yamanyadav8810/node-website-scraper that referenced this pull request Feb 14, 2023
commit 65b8056
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Dec 6 01:50:13 2022 +0200

    Bump fs-extra from 10.1.0 to 11.1.0 (website-scraper#523)

    Bumps [fs-extra](https://github.com/jprichardson/node-fs-extra) from 10.1.0 to 11.1.0.
    - [Release notes](https://github.com/jprichardson/node-fs-extra/releases)
    - [Changelog](https://github.com/jprichardson/node-fs-extra/blob/master/CHANGELOG.md)
    - [Commits](jprichardson/node-fs-extra@10.1.0...11.1.0)

    ---
    updated-dependencies:
    - dependency-name: fs-extra
      dependency-type: direct:production
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 52a341a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Dec 5 20:36:07 2022 +0000

    Bump sinon from 14.0.2 to 15.0.0 (website-scraper#524)

commit faa8bab
Author: Illia Antypenko <[email protected]>
Date:   Thu Nov 17 00:11:58 2022 +0200

    Update stale.yml

commit 5886558
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 23:24:22 2022 +0200

    Update stale.yml

commit 2797213
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 16 21:09:28 2022 +0000

    Bump actions/setup-node from 2 to 3 (website-scraper#519)

commit d9d147a
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 23:06:02 2022 +0200

    Delete no-response.yml

commit 7b6cbb5
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 16 20:54:40 2022 +0000

    Bump paambaati/codeclimate-action from 3.0.0 to 3.2.0 (website-scraper#520)

commit 28d6b40
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 22:40:53 2022 +0200

    Delete codeql-analysis.yml

commit 145f31d
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 22:40:44 2022 +0200

    Create codeql.yml

commit 4237388
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 16 17:15:31 2022 +0000

    Bump normalize-url from 7.2.0 to 8.0.0 (website-scraper#517)

commit 9dc9ef6
Author: Illia Antypenko <[email protected]>
Date:   Wed Nov 16 19:04:39 2022 +0200

    Auto-update of Github Actions

commit b169e86
Author: Sofiia Antypenko <[email protected]>
Date:   Sun Oct 9 09:56:22 2022 +0200

    Fix ENOENT when running from working directory without package.json (website-scraper#516)

commit de75674
Author: ryanolee <[email protected]>
Date:   Wed Sep 21 21:28:24 2022 +0100

    Remove plugin exports and update docs (website-scraper#511), closes website-scraper#510

    * chore(website-scraper#510): Remove plugin exports and update docs

    * core(website-scraper#510) Update messaging on internal plugins

    Co-authored-by: ryanolee <[email protected]>

commit 6c5cfc5
Author: Sofiia Antypenko <[email protected]>
Date:   Thu Sep 15 21:55:48 2022 +0200

    Revert "RFC: Refactor builtin plugins to expose class properties (website-scraper#506)" (website-scraper#509)

    This reverts commit 5dde8ec.

commit b845fb0
Author: Sofiia Antypenko <[email protected]>
Date:   Mon Sep 12 10:01:34 2022 +0200

    Remove lodash from dependencies (website-scraper#508)

commit 5dde8ec
Author: ryanolee <[email protected]>
Date:   Mon Sep 12 08:55:31 2022 +0100

    RFC: Refactor builtin plugins to expose class properties (website-scraper#506)

    * core: Refactor builtin plugins to expose class properties

    * chore: EsLint

    Co-authored-by: ryan lee <[email protected]>

commit c6ff84e
Author: Illia Antypenko <[email protected]>
Date:   Thu Sep 1 11:55:54 2022 +0300

    Remove demo link from the README

commit 8d6b5aa
Author: s0ph1e <[email protected]>
Date:   Tue Aug 30 12:23:56 2022 +0200

    5.3.0

commit 6bb4e20
Author: Sophia Antipenko <[email protected]>
Date:   Mon Aug 29 21:56:22 2022 +0200

    Use encoding from resource text (website-scraper#504)

commit 07c4c02
Author: Illia Antypenko <[email protected]>
Date:   Sat Jul 2 00:36:57 2022 +0300

    Use codeclimate project id instead of secret

commit 525dc50
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Jul 1 09:20:14 2022 +0200

    Bump cheerio from 1.0.0-rc.11 to 1.0.0-rc.12 (website-scraper#499)

    Bumps [cheerio](https://github.com/cheeriojs/cheerio) from 1.0.0-rc.11 to 1.0.0-rc.12.
    - [Release notes](https://github.com/cheeriojs/cheerio/releases)
    - [Changelog](https://github.com/cheeriojs/cheerio/blob/main/History.md)
    - [Commits](cheeriojs/cheerio@v1.0.0-rc.11...v1.0.0-rc.12)

    ---
    updated-dependencies:
    - dependency-name: cheerio
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 07e6858
Author: Sophia Antipenko <[email protected]>
Date:   Thu Jun 23 08:57:14 2022 +0200

    Bump version 5.2.0

commit e8c6f9d
Author: Paul Hawxby <[email protected]>
Date:   Thu Jun 23 07:24:45 2022 +0100

    fix: non-english char encoding, close website-scraper#454, close website-scraper#466 (website-scraper#496)

    * fix: non-english char encoding

    * fix: optional chaining not supported

    * fix: various tests

    not finished yet

    * fix: tests now pass

    * fix: testing and code complexity

    * fix: code climate

    * fix: change comments

    * fix: simplify code complexity

    * refactor: simplify logic and add more testing

    * fix: reduce cognitive complexity

    * fix: more cognitive complexity

    * chore: damn space

    * fix: broke a test

    * chore: update docs

    * test: re-enable utf8 test

    * chore: update request test

commit 3bd2555
Author: Sophia Antipenko <[email protected]>
Date:   Sat Jun 11 14:43:32 2022 +0200

    Revert encoding fix website-scraper#482 (website-scraper#495)

commit c6f60b8
Author: Sophia Antipenko <[email protected]>
Date:   Thu Jun 9 14:18:08 2022 +0200

    Remove link to gitter from CONTRIBUTING.md

commit 087d943
Author: Sophia Antipenko <[email protected]>
Date:   Thu Jun 9 14:07:10 2022 +0200

    Bump version 5.1.0

commit 5a58f48
Author: Illia Antypenko <[email protected]>
Date:   Thu Jun 9 14:04:51 2022 +0200

    Fix encoding issue for non-English websites, closes website-scraper#454 website-scraper#466 (website-scraper#482)

commit d80e9b0
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sat May 28 08:57:53 2022 +0200

    Bump cheerio from 1.0.0-rc.10 to 1.0.0-rc.11 (website-scraper#491)

    Bumps [cheerio](https://github.com/cheeriojs/cheerio) from 1.0.0-rc.10 to 1.0.0-rc.11.
    - [Release notes](https://github.com/cheeriojs/cheerio/releases)
    - [Changelog](https://github.com/cheeriojs/cheerio/blob/main/History.md)
    - [Commits](cheeriojs/cheerio@v1.0.0-rc.10...v1.0.0-rc.11)

    ---
    updated-dependencies:
    - dependency-name: cheerio
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 9397b07
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue May 10 10:58:35 2022 +0200

    Bump sinon from 13.0.2 to 14.0.0 (website-scraper#488)

    Bumps [sinon](https://github.com/sinonjs/sinon) from 13.0.2 to 14.0.0.
    - [Release notes](https://github.com/sinonjs/sinon/releases)
    - [Changelog](https://github.com/sinonjs/sinon/blob/main/docs/changelog.md)
    - [Commits](sinonjs/sinon@v13.0.2...v14.0.0)

    ---
    updated-dependencies:
    - dependency-name: sinon
      dependency-type: direct:development
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 66b1e49
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue May 3 14:49:33 2022 +0200

    Bump mocha from 9.2.2 to 10.0.0 (website-scraper#486)

    Bumps [mocha](https://github.com/mochajs/mocha) from 9.2.2 to 10.0.0.
    - [Release notes](https://github.com/mochajs/mocha/releases)
    - [Changelog](https://github.com/mochajs/mocha/blob/master/CHANGELOG.md)
    - [Commits](mochajs/mocha@v9.2.2...v10.0.0)

    ---
    updated-dependencies:
    - dependency-name: mocha
      dependency-type: direct:development
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 848ca1a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Feb 6 19:26:26 2022 +0200

    Bump sinon from 12.0.1 to 13.0.0 (website-scraper#483)

    Bumps [sinon](https://github.com/sinonjs/sinon) from 12.0.1 to 13.0.0.
    - [Release notes](https://github.com/sinonjs/sinon/releases)
    - [Changelog](https://github.com/sinonjs/sinon/blob/master/docs/changelog.md)
    - [Commits](sinonjs/sinon@v12.0.1...v13.0.0)

    ---
    updated-dependencies:
    - dependency-name: sinon
      dependency-type: direct:development
      update-type: version-update:semver-major
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non-English characters are not converted correctly
3 participants