
Double encoding of special chars in URL doesn't work well in all cases #45

Open
NicolasMassart opened this issue Mar 11, 2021 · 1 comment

@NicolasMassart (Contributor)

See for instance tcort/markdown-link-check#155
A user is trying to check https://en.wikipedia.org/wiki/%3F:
But link-check decodes the encoded ? and does not re-encode it, since ? is a legal URL character; however, it has the special meaning of starting the query string. So https://en.wikipedia.org/wiki/%3F: becomes https://en.wikipedia.org/wiki/?:, which means https://en.wikipedia.org/wiki/ with the parameter :.
We need to find a way to handle encoded characters without these issues. The test suite already covers a number of specific cases, and those have to keep passing, but right now the risk is accumulating a pile of special cases. Finding a generic way to deal with this would be preferable.
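A minimal Node.js sketch of the round trip described above. It reproduces why the decode-then-encode step mangles the Wikipedia URL: decodeURIComponent turns %3F into a literal ?, and encodeURI leaves ? untouched because it is a reserved URL character.

```javascript
// The user's URL contains %3F, an encoded '?', inside the path.
const original = "https://en.wikipedia.org/wiki/%3F:";

// decodeURIComponent turns %3F back into a literal '?'...
const decoded = decodeURIComponent(original);

// ...and encodeURI leaves '?' as-is, since it is a reserved character,
// so the '?' is now interpreted as the start of the query string.
const reencoded = encodeURI(decoded);

console.log(reencoded); // "https://en.wikipedia.org/wiki/?:"
```

The round trip is lossy: there is no way for encodeURI to know that this particular ? was originally data in the path rather than the query delimiter.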

@NicolasMassart NicolasMassart self-assigned this Mar 11, 2021
@jan-guenter

I encountered the same issue using markdown-link-check failing on the URL https://libraries.io/npm/@action-class%2Fcore/tree

With %2F decoded to / (and not re-encoded), the request results in a 404, since the path parsing now receives an additional path component.

In my opinion the manual re-encoding is unnecessary, since the new URL() call already normalizes the URL.

Example:

new URL("https://example.com/foo%2Fbar/foo bar/?test=arg%20with+spaces&test2=arg unencoded").toString()

results in

https://example.com/foo%2Fbar/foo%20bar/?test=arg%20with+spaces&test2=arg%20unencoded

So my suggestion would be to completely remove the encodeURI and decodeURIComponent calls from https://github.com/tcort/link-check/blob/master/lib/proto/http.js#L40 and rely on the normalization of JavaScript's URL class.

diff --git a/lib/proto/http.js b/lib/proto/http.js
index f6530a4..548e7a8 100644
--- a/lib/proto/http.js
+++ b/lib/proto/http.js
@@ -31,13 +31,8 @@ module.exports = {
 
         let user_agent = opts.user_agent || `${pkg.name}/${pkg.version}`;
 
-        // Decoding and encoding is required to prevent encoding already encoded URLs
-        // We decode using the decodeURIComponent as it will decode a wider range of 
-        // characters that were not necessary to be encoded at first, then we re-encode
-        // only the required ones using encodeURI.
-        // Note that we don't use encodeURIComponents as it adds too much non-necessary encodings
-        // see "Not Escaped" list in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#description
-        const url = encodeURI(decodeURIComponent(new URL(link, opts.baseUrl).toString()));
+        // rebase relative urls and normalize url encoding
+        const url = new URL(link, opts.baseUrl).toString();

         const options = {
             user_agent: user_agent,

Alternatively, adding an option to skip the re-encoding would be highly appreciated. That way a per-URL option could be added to the markdown-link-check config to disable this 'feature' for problematic URLs.
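To make the comparison concrete, here is a hedged sketch (plain Node.js, not library code) contrasting the current decode/encode round trip with the proposed plain new URL() normalization on the URL from this comment:

```javascript
const link = "https://libraries.io/npm/@action-class%2Fcore/tree";

// Current behavior in link-check: %2F is decoded to a literal '/' and never
// re-encoded (encodeURI leaves '/' alone), introducing an extra path component.
const roundTripped = encodeURI(decodeURIComponent(new URL(link).toString()));
console.log(roundTripped); // "https://libraries.io/npm/@action-class/core/tree"

// Proposed behavior: new URL() alone preserves the encoded %2F in the path.
const normalized = new URL(link).toString();
console.log(normalized);   // "https://libraries.io/npm/@action-class%2Fcore/tree"
```

The WHATWG URL parser does not decode percent-encoded sequences in the path when serializing, which is exactly why dropping the manual round trip keeps %2F intact.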
