Double encoding of special chars in URL doesn't work well in all cases #45

NicolasMassart · 2021-03-11T11:18:45Z

See for instance tcort/markdown-link-check#155
User is looking to check https://en.wikipedia.org/wiki/%3F:
But the ? sign is then decoded by link-check and not reencoded as it's a legit char for a url but that have a specific meaning of being the param part start. So https://en.wikipedia.org/wiki/%3F: becomes https://en.wikipedia.org/wiki/?: meaning https://en.wikipedia.org/wiki/ with parameter :.
We need to find a way to deal with encoded chars without all these issues. There's already a lot of tests for some specific cases in the test suite. They have to continue to work, but for now the risk is to have a pile of specific cases. Finding a generic way to deal with this would be nice.

The text was updated successfully, but these errors were encountered:

jan-guenter · 2023-04-18T10:05:30Z

I encountered the same issue using markdown-link-check failing on the URL https://libraries.io/npm/@action-class%2Fcore/tree

With the reencoding of %2F to / the request results in a 404 since the path argument parsing now receives an additional path component.

In my opinion the manual reencoding is unnecessary since the new URL() call already normalizes the URL.

Example:

new URL("https://example.com/foo%2Fbar/foo bar/?test=arg%20with+spaces&test2=arg unencoded").toString()

results in

https://example.com/foo%2Fbar/foo%20bar/?test=arg%20with+spaces&test2=arg%20unencoded

So my suggestion would be to completely remove the encodeURI and decodeURIComponen calls from https://github.com/tcort/link-check/blob/master/lib/proto/http.js#L40 and rely on the normalization of JavaScripts URL class.

diff --git a/lib/proto/http.js b/lib/proto/http.js
index f6530a4..548e7a8 100644
--- a/lib/proto/http.js
+++ b/lib/proto/http.js
@@ -31,13 +31,8 @@ module.exports = {
 
         let user_agent = opts.user_agent || `${pkg.name}/${pkg.version}`;
 
-        // Decoding and encoding is required to prevent encoding already encoded URLs
-        // We decode using the decodeURIComponent as it will decode a wider range of 
-        // characters that were not necessary to be encoded at first, then we re-encode
-        // only the required ones using encodeURI.
-        // Note that we don't use encodeURIComponents as it adds too much non-necessary encodings
-        // see "Not Escaped" list in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#description
-        const url = encodeURI(decodeURIComponent(new URL(link, opts.baseUrl).toString()));
+        // rebase relative urls and normalize url encoding
+        const url = new URL(link, opts.baseUrl).toString();

         const options = {
             user_agent: user_agent,

Alternatively adding an option parameter to skip the reencoding would be highly appreciated. That way an per URL option could be added to the markdown-link-check config to disable this 'feature' for problematic URLs.

NicolasMassart self-assigned this Mar 11, 2021

NicolasMassart added the bug label Mar 11, 2021

NicolasMassart removed their assignment May 10, 2023

cc-stjm mentioned this issue Oct 11, 2024

%23 in URL interpreted as # tcort/markdown-link-check#264

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Double encoding of special chars in URL doesn't work well in all cases #45

Double encoding of special chars in URL doesn't work well in all cases #45

NicolasMassart commented Mar 11, 2021

jan-guenter commented Apr 18, 2023

Double encoding of special chars in URL doesn't work well in all cases #45

Double encoding of special chars in URL doesn't work well in all cases #45

Comments

NicolasMassart commented Mar 11, 2021

jan-guenter commented Apr 18, 2023