
cp1250 is not detected #18

Open · jaruba opened this issue Jul 31, 2015 · 5 comments
jaruba commented Jul 31, 2015

It does not work with Romanian subtitle files. OpenSubtitles detects these files as "cp1250", jschardet detects the encoding as "windows-1252".

Wrong characters: ã þ º
Correct Romanian special characters: ă Ă â Â î Î ş Ş ţ Ţ

Test file: http://dl.opensubtitles.org/en/download/file/1954820326.srt

I've tested with many more files too; if I use iconv-lite with "cp1250" (instead of the detected "windows-1252"), it converts the file to "utf8" correctly.
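
For reference, a minimal repro sketch, assuming the test file above is saved locally (the output file name is just an example):

```js
// Minimal repro, assuming the test file was saved locally as
// 1954820326.srt and that jschardet and iconv-lite are installed.
var fs = require('fs');
var jschardet = require('jschardet');
var iconv = require('iconv-lite');

var buf = fs.readFileSync('1954820326.srt');

// jschardet picks windows-1252 here, not cp1250.
console.log(jschardet.detect(buf.toString('binary')));
// => { encoding: 'windows-1252', confidence: ~0.95 } on this file

// Forcing cp1250 decodes the Romanian diacritics correctly:
// 0xE3 is ã in windows-1252 but ă in cp1250, 0xFE is þ vs ţ, 0xBA is º vs ş.
var text = iconv.decode(buf, 'cp1250');
fs.writeFileSync('1954820326.utf8.srt', text, 'utf8');
```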

aadsm (Owner) commented Aug 9, 2015

Am I right to assume that cp1250 is the same as Windows-1250?
I looked into it, and these are the two highest confidence levels I get when processing that file:

windows-1250 confidence = 0.8075649621709267
windows-1252 confidence = 0.9498656452563347

The problem is that the Latin1 prober (the windows-1252 one) is winning. The two encodings actually have very similar tables, and since there are many more plain Latin1 characters in the text (compared to the more distinctive ă Ă â Â î Î ş Ş ţ Ţ characters), that's probably why its confidence ends up higher and it wins in the end...
The confidence for Latin1 is already artificially lowered for exactly this reason, but maybe it's not enough in some situations... (https://github.com/aadsm/jschardet/blob/master/src/latin1prober.js#L151-L155)
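
To illustrate (this is just a sketch, not the actual source; the prober interface and the 0.73 constant are simplified stand-ins for what the linked lines do), the selection boils down to a highest-confidence-wins loop, with the Latin1 score scaled down before comparing:

```js
// Illustrative sketch only, NOT the real jschardet internals.
var LATIN1_PENALTY = 0.73; // stand-in value; the real lowering is in latin1prober.js

function pickEncoding(probers, bytes) {
  var best = { encoding: null, confidence: 0 };
  probers.forEach(function (prober) {
    var confidence = prober.getConfidence(bytes);
    if (prober.charsetName === 'windows-1252') {
      // Penalized on purpose because its table overlaps
      // so many other single-byte encodings.
      confidence *= LATIN1_PENALTY;
    }
    if (confidence > best.confidence) {
      best = { encoding: prober.charsetName, confidence: confidence };
    }
  });
  return best;
}
```

The confidence values quoted above are presumably already post-penalty, which is why windows-1252 still comes out on top for this file.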

Charset encoding detection is purely based on heuristics; these encodings have a statistical model based on character frequency, so it's never going to be 100% accurate.

I wonder how OpenSubtitles does it; maybe the browser sends that information when posting the file?

jaruba (Author) commented Aug 9, 2015

> Am I right to assume that cp1250 is the same as Windows-1250?

Yes, you are correct.

> I wonder how OpenSubtitles does it; maybe the browser sends that information when posting the file?

Nope, that info is not sent by the browser, as the subtitle files are never encoded correctly to begin with.

Here is some info about how OpenSubtitles detects/converts encodings:
http://forum.opensubtitles.org/viewtopic.php?f=1&t=14992&p=30697#p30697

wdtbrchan commented:

Here is another similar problem, with the Czech language (windows-1250):
windows1250.txt
It is detected as CP866.

digitalnature commented:

Wouldn't it be better to use dictionaries when multiple encodings return high confidence levels?
A dictionary for each encoding could hold the 20-30 most common words of the languages that use it. This would certainly work for large pieces of text like subtitles.
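
Something like this sketch; the word lists here are just placeholders for the real 20-30-word dictionaries, and the candidates are only the two encodings from this issue:

```js
var iconv = require('iconv-lite');

// Placeholder mini-dictionaries; a real implementation would hold
// the 20-30 most common words per language, as suggested above.
var DICTIONARIES = {
  'windows-1250': ['şi', 'să', 'că', 'este', 'pentru'],  // e.g. Romanian
  'windows-1252': ['the', 'and', 'that', 'les', 'der']   // Western European
};

// Decode the raw bytes with each high-confidence candidate and count
// dictionary hits; the candidate with the most hits wins the tie.
function breakTie(buf, candidates) {
  var best = candidates[0], bestHits = -1;
  candidates.forEach(function (enc) {
    var words = iconv.decode(buf, enc).toLowerCase().split(/[^\wăâîşţ]+/);
    var hits = words.filter(function (w) {
      return DICTIONARIES[enc].indexOf(w) !== -1;
    }).length;
    if (hits > bestHits) { bestHits = hits; best = enc; }
  });
  return best;
}

// breakTie(fs.readFileSync('1954820326.srt'), ['windows-1250', 'windows-1252'])
// should pick 'windows-1250' for the Romanian file above.
```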

SylwesterZarebski commented Oct 12, 2017

The heuristics should be based on which decoding yields the most readable/writable characters from the file, not just on character probabilities, because the differences between European encodings are small.
They should answer the question: which encoding displays the most characters as readable text (words, sentences, etc.)?
CP-1252 does not display many Eastern European characters that exist in the CP-1250 encoding.

The Polish diacritic characters are:

ąćęłńóśżź
ĄĆĘŁŃÓŚŻŹ

which are shown incorrectly when text originally written in CP-1250 is decoded as CP-1252:

¹æê³ñ󜿟
¥ÆÊ£ÑÓŒ¯�

It is clear that some of the above characters are not word characters in CP-1252, so the text fails the requirement of consisting of word characters in that encoding (i.e., its probability should be low).
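
A rough sketch of what I mean, assuming iconv-lite and a Node version with Unicode regex escapes (the letter test and the candidate list are simplifications):

```js
var iconv = require('iconv-lite');

// Score a candidate encoding by how readable the decoded text is.
// Only the non-ASCII characters differ between these single-byte
// encodings, so score just those: a correct decode turns the high
// bytes into letters (ą, ć, ę, ...), a wrong one into symbols and
// stray punctuation (¹, ³, ¥, ¯, ...).
function readableScore(buf, encoding) {
  var text = iconv.decode(buf, encoding);
  var high = text.split('').filter(function (c) { return c > '\u007f'; });
  if (high.length === 0) return 1; // pure ASCII: every candidate is fine
  // \p{L} (any Unicode letter) needs Node 10+ with the /u flag.
  var letters = high.filter(function (c) { return /\p{L}/u.test(c); });
  return letters.length / high.length;
}

function pickMostReadable(buf, candidates) {
  return candidates.reduce(function (best, enc) {
    return readableScore(buf, enc) > readableScore(buf, best) ? enc : best;
  });
}

// For Polish text written as CP-1250: decoding as windows-1250 gives
// "ąćęłńóśżź" (9/9 letters), decoding as windows-1252 gives "¹æê³ñ󜿟"
// (6/9 letters), so windows-1250 scores higher.
```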
