issue - Filename too long #71
Comments
I am also thinking the same, to limit the characters to 80. Share your thoughts.
Yes, the common names are an issue. So I came up with another idea: adding a config variable for alternative filenames. I added a variable named 'filename_alt' to config.ini and put an alternative name for the file there, without any extension. Then I added a condition in the script to check the filename length. If it exceeds 80, the script takes the alternative name instead. I have tried it with 3 books so far and all worked as expected. I can send you a pull request if you like.
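A minimal sketch of the fallback described above, not the code from the actual pull request. It assumes Python 3, and the section name `settings` and the helper `choose_filename` are hypothetical; only `filename_alt` and the 80-character threshold come from this thread.

```python
import configparser

MAX_NAME_CHARS = 80  # threshold discussed in this thread


def choose_filename(config_text, original_name):
    """Return the alternative name when the original is too long.

    The [settings] section name is an assumption for illustration;
    the real config.ini layout may differ.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback="" covers both a missing section and a missing option
    alt = parser.get("settings", "filename_alt", fallback="").strip()
    if len(original_name) > MAX_NAME_CHARS and alt:
        return alt
    return original_name
```

If 'filename_alt' is empty or missing, the sketch keeps the original name, so the mkdir error would still occur for that book; combining this with truncation would cover that case.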
Great. Share the example book URLs and do a pull request. Thanks.
Issue
Most modern filesystems, including ext3 and ext4, limit a file or folder name to 255 bytes, or we can say 255 ANSI characters. If anyone uses data encryption (mostly eCryptfs, the Ubuntu default) as a layer on top of the filesystem, this limit comes down.
Moreover, in Indic languages we use Unicode instead of ANSI / ASCII. Each character is then stored as several bytes (three bytes per Bengali character in UTF-8), so in some cases we can only use about 80-85 Unicode characters in practice.
Some of the books on Wikisource have long names, e.g. 'বঙ্গের_জাতীয়_ইতিহাস_(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu'. So the temp folder name becomes very long with its prefix 'OCR' and timestamp suffix. When mkdir tries to make the directory, it throws an error, 'filename too long'.
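The arithmetic behind the 80-85 figure can be checked with a short Python 3 sketch. The helper names are hypothetical; the only assumptions are a 255-byte name limit and UTF-8 encoding.

```python
def utf8_byte_length(name: str) -> int:
    """Number of bytes the name occupies on disk when encoded as UTF-8."""
    return len(name.encode("utf-8"))


def max_chars_for_limit(limit_bytes: int, bytes_per_char: int) -> int:
    """How many fixed-width characters fit into a byte limit."""
    return limit_bytes // bytes_per_char


# A single Bengali letter takes 3 bytes in UTF-8, an ASCII letter takes 1.
print(utf8_byte_length("ক"))        # 3
print(utf8_byte_length("a"))        # 1
print(max_chars_for_limit(255, 3))  # 85
```

ASCII parts of the name (underscores, brackets, the '.djvu' extension, the 'OCR' prefix and the timestamp) cost one byte each, so the usable number of Bengali characters depends on how much of the name is ASCII.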
Possible solution:
I was fiddling around with the script and came up with the idea of separating the basename and the filename. My proposed solution is as follows.
do_ocr.py:109
basename = os.path.basename(original_url)
filename = basename[:80]  # limit the filename if longer than 80 chars
mediawiki_uploader.py:212
pagename = basename.encode('utf-8') + "/" + indic_page_number
This is a very rough idea, but I think you get my point.
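Since the filesystem limit is in bytes rather than characters, one variation on the idea above is to truncate by encoded length. This is only a sketch under assumptions: it targets Python 3 (the snippets above look like Python 2), and the helper `truncate_filename` is hypothetical, not part of do_ocr.py.

```python
import os


def truncate_filename(name: str, max_bytes: int = 255) -> str:
    """Shorten the stem so that stem + extension fits in max_bytes of UTF-8."""
    stem, ext = os.path.splitext(name)
    budget = max_bytes - len(ext.encode("utf-8"))
    if budget <= 0:
        raise ValueError("max_bytes is too small to keep the extension")
    # Cut at the byte budget, then drop any partial multi-byte
    # sequence left at the end of the slice.
    cut = stem.encode("utf-8")[:budget]
    return cut.decode("utf-8", errors="ignore") + ext


short = truncate_filename("ক" * 100 + ".djvu", max_bytes=80)
print(len(short.encode("utf-8")))  # 80
```

The cut never splits a code point, but it can still separate a base letter from a following vowel sign or hasanta, so the shortened name may look slightly different at the end. The budget should also leave room for the 'OCR' prefix and timestamp suffix added to the temp folder name.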
Thanks.