Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle UTF8 and long-paths on Windows #189

Open
Safihre opened this issue Sep 11, 2023 · 11 comments · May be fixed by #194
Open

Properly handle UTF8 and long-paths on Windows #189

Safihre opened this issue Sep 11, 2023 · 11 comments · May be fixed by #194

Comments

@Safihre
Copy link
Contributor

Safihre commented Sep 11, 2023

As requested by @animetosho, reporting the original bug here 💯

On Windows, par2cmdline can basically only handle ASCII filenames. This was fine when par2cmdline was created long ago, but the world is UTF8 now.
Simply try to open a file named 你好世界.par2, will fail.

par2 file must not have a wildcard in it.
failed to set the main par file

This was fixed long ago in par2-tbb, but that fork has been unmaintained for a long long time.
It also displays any UTF8-charahters as question marks in the console output.

Additionally, par2cmdline does not seem to really support long-path notation on Windows. Which is needed to open files where the path exceeds 255 characters.
For example \\\\?\\C:\\test_win_unicode\\test_win_unicode.par2 results also in:

par2 file must not have a wildcard in it.
failed to set the main par file

We use that by default so we can avoid any problems on longer paths and having to check path-lengths every time.

animetosho/issues/13
animetosho/issues/17
sabnzbd/sabnzbd/pull/2674

@Safihre
Copy link
Contributor Author

Safihre commented Nov 26, 2023

@animetosho any chance we can get you interested in fixing this one?
Then we could use par2cmdline-turbo also on Windows and don't need Multipar anymore.

@animetosho
Copy link
Contributor

If someone submits a pull request to par2cmdline, I can consider merging it into par2cmdline-turbo.
Otherwise, I have enough on my plate for now.

@mnightingale
Copy link

I intend to submit a PR for this; I've got UTF8 working including in the console it just needs tidying up.

I haven't really looked at the long paths issue yet, but from a quick glance I think there are numerous issues;

  • par2 file must not have a wildcard in it. is because ? is not allowed - but it's needed for \\?\
  • I'm note sure the FileExists or GetFileSize methods will work for long paths, various places try to use _stat
    • I think may also limit filesize to 4GB
    • Maybe GetFileAttributes and GetFileSize are needed
  • _MAX_PATH has been defined but only used for renames, GetCanonicalPathname uses MAX_PATH (260) but I haven't looked into the implications of that.

@Safihre
Copy link
Contributor Author

Safihre commented Aug 19, 2024

@mnightingale Any updates on this work?

@mnightingale
Copy link

Sorry I have been busy with other things, the Unicode problem is much involved than expected, needs a lot of MultiByteToWideChar and convert the other way for outputting to the code page in use the console.

I have some changes that work, but it is very messy and would break non-Windows builds.

The long path problem appears a simpler fix at least for verify, create needs more work on how it splits a path into components and the MAX_PATH it used for buffers.

@Safihre
Copy link
Contributor Author

Safihre commented Aug 20, 2024

Is there anything we can learn from nzbget handling of this?
https://github.com/nzbgetcom/nzbget/blob/develop/daemon/util/FileSystem.cpp

@mnightingale
Copy link

mnightingale commented Aug 21, 2024

Looks like their might be interest from their to fix it (mentioned issue), they certainly have more experienced C++ programmers than me 😄

But yes nzbget handles some of what needs implementing here, see helpers UtfPathToWidePath and WidePathToUtfPath along with handling long paths as required it looks like there are several string helpers to make the manipulation easier.

How MultiPar handles it may also be useful to someone else looking https://github.com/Yutaka-Sawada/MultiPar/blob/d733ada21ae81405de468dd2cd458bcbf09ab9ea/source/par2j/common2.c#L66-L204 specifically the parts for converting to the code page used by the console and handling the possible errors (try again with dwFlags = 0).

edit: just to mention the other part which nzbget doesn't have to worry about is every filename including from par packets, needs preparing for the console when printing log/debug lines, so you don't get "?" or broken characters.

@mnightingale
Copy link

mnightingale commented Aug 23, 2024

@Safihre I've cleaned up what I'd started working on https://github.com/mnightingale/par2cmdline/tree/unicode_filenames it won't compile outside of Windows currently and it only attempts to addresses the problems with UTF8 filenames.

I haven't made a draft PR for now as I don't want to put someone else having a go.

edit: main thing is figuring out how to remove TCHAR, at the time just wanted to get it to compile.

@mnightingale
Copy link

I think it's about there now, I've added support for \\?\ paths, an improvement could be made to always use such paths even if the path passed in isn't but I wasn't going to bother with that.

utf8::compatible tries to deal with normalisation of filenames in packets on Windows.
I'm not sure how to do similar for other platforms, may need a separate library or just not bother so both are broken.

Changes can be applied applied with minimal changes to par2cmdline-turbo

@Safihre
Copy link
Contributor Author

Safihre commented Aug 26, 2024

@mnightingale We only have to care for Windows, on the other platforms there are no problems. The SABnzbd unit-tests already show that :)
So all things can be wrapped in macros to only build on Windows platforms!

@mnightingale
Copy link

utf8::compatible was specifically to deal with test_win_unicode sabnzbd/sabnzbd#1633 which as far as I can tell fails on all platforms.

Test file has:

00000070  84 00 00 00 00 00 00 00  66 72 e8 6e 63 68 5f 67  |........fr.nch_g|
00000080  65 72 6d 61 6e 5f 64 65  6d f6 2e 72 61 72 00 00  |erman_dem..rar..|

But generating new recovery volumns with par2cmdline uses unicode, against the spec but I don't think it's the only implementation to do as such.

000000d0  00 00 00 00 66 72 c3 a8  6e 63 68 5f 67 65 72 6d  |....fr..nch_germ|
000000e0  61 6e 5f 64 65 6d c3 b6  2e 72 61 72 50 41 52 32  |an_dem...rarPAR2|

I think I'll just remove the part trying to handle this, then it will behave the same broken way on all platforms.

@mnightingale mnightingale linked a pull request Aug 27, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants