
cp1250 is not detected #18

Open · jaruba opened this issue Jul 31, 2015 · 5 comments
jaruba commented Jul 31, 2015

It does not work with Romanian subtitle files. OpenSubtitles detects these files as "cp1250", jschardet detects the encoding as "windows-1252".

Wrong characters: ã þ º
Correct Romanian special characters: ă Ă â Â î Î ş Ş ţ Ţ

Test file: http://dl.opensubtitles.org/en/download/file/1954820326.srt

I've tested with many more files too; if I use iconv-lite with "cp1250" (instead of the detected "windows-1252"), it converts the file to "utf8" correctly.
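
For reference, a minimal repro sketch, assuming the test file above is saved locally (the output file name is just an example):

```js
// Minimal repro, assuming the test file was saved locally as
// 1954820326.srt and that jschardet and iconv-lite are installed.
var fs = require('fs');
var jschardet = require('jschardet');
var iconv = require('iconv-lite');

var buf = fs.readFileSync('1954820326.srt');

// jschardet picks windows-1252 here, not cp1250.
console.log(jschardet.detect(buf.toString('binary')));
// => { encoding: 'windows-1252', confidence: ~0.95 } on this file

// Forcing cp1250 decodes the Romanian diacritics correctly:
// 0xE3 is ã in windows-1252 but ă in cp1250, 0xFE is þ vs ţ, 0xBA is º vs ş.
var text = iconv.decode(buf, 'cp1250');
fs.writeFileSync('1954820326.utf8.srt', text, 'utf8');
```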

aadsm (Owner) commented Aug 9, 2015

Am I right to assume that cp1250 is the same as Windows-1250?
I looked into it, and these are the two highest confidence levels I get when processing that file:

windows-1250 confidence = 0.8075649621709267
windows-1252 confidence = 0.9498656452563347

The problem is that the Latin1 prober (the windows-1252 one) is winning. The two encodings actually have very similar tables, and since there are many more plain Latin1 characters in the text (compared to the more distinctive ă Ă â Â î Î ş Ş ţ Ţ characters), that's probably why its confidence ends up higher and it wins in the end...
The confidence for Latin1 is already artificially lowered for exactly this reason, but maybe it's not enough in some situations... (https://github.com/aadsm/jschardet/blob/master/src/latin1prober.js#L151-L155)
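
To illustrate (this is just a sketch, not the actual source; the prober interface and the 0.73 constant are simplified stand-ins for what the linked lines do), the selection boils down to a highest-confidence-wins loop, with the Latin1 score scaled down before comparing:

```js
// Illustrative sketch only, NOT the real jschardet internals.
var LATIN1_PENALTY = 0.73; // stand-in value; the real lowering is in latin1prober.js

function pickEncoding(probers, bytes) {
  var best = { encoding: null, confidence: 0 };
  probers.forEach(function (prober) {
    var confidence = prober.getConfidence(bytes);
    if (prober.charsetName === 'windows-1252') {
      // Penalized on purpose because its table overlaps
      // so many other single-byte encodings.
      confidence *= LATIN1_PENALTY;
    }
    if (confidence > best.confidence) {
      best = { encoding: prober.charsetName, confidence: confidence };
    }
  });
  return best;
}
```

The confidence values quoted above are presumably already post-penalty, which is why windows-1252 still comes out on top for this file.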

Charset encoding detection is purely based on heuristics; these encodings have a statistical model based on character frequency, so it's never going to be 100% accurate.

I wonder how OpenSubtitles does it; maybe the browser sends that information when posting the file?

jaruba (Author) commented Aug 9, 2015

> Am I right to assume that cp1250 is the same as Windows-1250?

Yes, you are correct.

> I wonder how OpenSubtitles does it; maybe the browser sends that information when posting the file?

Nope, that info is not sent by the browser, as the subtitle files are never encoded correctly to begin with.

Here is some info about how OpenSubtitles detects/converts encodings:
http://forum.opensubtitles.org/viewtopic.php?f=1&t=14992&p=30697#p30697

wdtbrchan commented:

Here is another similar problem, with the Czech language (windows-1250):
windows1250.txt
It is detected as CP866.

digitalnature commented:

Wouldn't it be better to use dictionaries when multiple encodings return high confidence levels?
A dictionary for each encoding could hold the 20-30 most common words of the languages that use it. This would certainly work for large pieces of text like subtitles.
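
Something like this sketch; the word lists here are just placeholders for the real 20-30-word dictionaries, and the candidates are only the two encodings from this issue:

```js
var iconv = require('iconv-lite');

// Placeholder mini-dictionaries; a real implementation would hold
// the 20-30 most common words per language, as suggested above.
var DICTIONARIES = {
  'windows-1250': ['şi', 'să', 'că', 'este', 'pentru'],  // e.g. Romanian
  'windows-1252': ['the', 'and', 'that', 'les', 'der']   // Western European
};

// Decode the raw bytes with each high-confidence candidate and count
// dictionary hits; the candidate with the most hits wins the tie.
function breakTie(buf, candidates) {
  var best = candidates[0], bestHits = -1;
  candidates.forEach(function (enc) {
    var words = iconv.decode(buf, enc).toLowerCase().split(/[^\wăâîşţ]+/);
    var hits = words.filter(function (w) {
      return DICTIONARIES[enc].indexOf(w) !== -1;
    }).length;
    if (hits > bestHits) { bestHits = hits; best = enc; }
  });
  return best;
}

// breakTie(fs.readFileSync('1954820326.srt'), ['windows-1250', 'windows-1252'])
// should pick 'windows-1250' for the Romanian file above.
```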

SylwesterZarebski commented Oct 12, 2017

The heuristics should be based on which decoding yields the most readable/writable characters from the file, not just on character probabilities, because the differences between European encodings are small.
They should answer the question: which encoding displays the most characters as readable text (words, sentences, etc.)?
CP-1252 does not display many Eastern European characters that exist in the CP-1250 encoding.

The Polish diacritic characters are:

ąćęłńóśżź
ĄĆĘŁŃÓŚŻŹ

which are shown incorrectly when text originally written in CP-1250 is decoded as CP-1252:

¹æê³ñ󜿟
¥ÆÊ£ÑÓŒ¯�

It is clear that some of the above characters are not word characters in CP-1252, so the text fails the requirement of consisting of word characters in that encoding (i.e., its probability should be low).
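
A rough sketch of what I mean, assuming iconv-lite and a Node version with Unicode regex escapes (the letter test and the candidate list are simplifications):

```js
var iconv = require('iconv-lite');

// Score a candidate encoding by how readable the decoded text is.
// Only the non-ASCII characters differ between these single-byte
// encodings, so score just those: a correct decode turns the high
// bytes into letters (ą, ć, ę, ...), a wrong one into symbols and
// stray punctuation (¹, ³, ¥, ¯, ...).
function readableScore(buf, encoding) {
  var text = iconv.decode(buf, encoding);
  var high = text.split('').filter(function (c) { return c > '\u007f'; });
  if (high.length === 0) return 1; // pure ASCII: every candidate is fine
  // \p{L} (any Unicode letter) needs Node 10+ with the /u flag.
  var letters = high.filter(function (c) { return /\p{L}/u.test(c); });
  return letters.length / high.length;
}

function pickMostReadable(buf, candidates) {
  return candidates.reduce(function (best, enc) {
    return readableScore(buf, enc) > readableScore(buf, best) ? enc : best;
  });
}

// For Polish text written as CP-1250: decoding as windows-1250 gives
// "ąćęłńóśżź" (9/9 letters), decoding as windows-1252 gives "¹æê³ñ󜿟"
// (6/9 letters), so windows-1250 scores higher.
```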
