unreliable detection - windows1250 #70

peminator · 2021-09-16T11:56:23Z

windows-1250 is labelled as Hungarian, but it is really the Central European code page, so the text may just as well be Slovak or Czech, or another language of the region. The proper name is "Central European". Examples of the accented characters to recognize: čČšŠťŤžŽéÉľĽ

Found in VS Code, which uses this library: text saved as windows-1250 gets detected on reopen as windows-1252, another 125x code page, or even ISO-8859-2. It depends on which subset of these non-ASCII characters the content contains.
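To illustrate why the result depends on which characters are present, here is a minimal Python sketch (not jschardet code; the helper name `survives` is made up for this example). It encodes each of the characters above as windows-1250 and checks whether the same byte still decodes to the same character under windows-1252 or ISO-8859-2.

```python
SAMPLE = "čČšŠťŤžŽéÉľĽ"


def survives(ch: str, other: str) -> bool:
    """Return True if the windows-1250 byte for `ch` decodes to `ch` under `other`."""
    raw = ch.encode("cp1250")
    try:
        return raw.decode(other) == ch
    except UnicodeDecodeError:
        # The byte is not assigned at all in the other code page.
        return False


if __name__ == "__main__":
    for ch in SAMPLE:
        byte = ch.encode("cp1250").hex()
        print(ch, byte, survives(ch, "cp1252"), survives(ch, "iso8859_2"))
```

For example, č (byte 0xE8) reads as è under windows-1252 but stays č under ISO-8859-2, while š (byte 0x9A) survives windows-1252 and turns into a control character under ISO-8859-2. Text containing only one of these groups can therefore look perfectly valid in the wrong code page.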

peminator · 2024-06-06T07:38:38Z

Hey bro, still broken? I just tested the new VS Code Insiders build, which uses jschardet, and it
DOES NOT DETECT windows-1250 AT ALL

check here:
microsoft/vscode#208550 (comment)

aadsm · 2024-06-11T22:23:56Z

Hey! I just saw your comment on microsoft/vscode#208550. Let me take a look into this. Also, don't-bro-me ;P.

aadsm · 2024-06-11T23:54:49Z

Yeah, that code page is one of those I wasn't able to come up with a test string for. I was only able to test windows-1251.

Thanks for providing one (windows1250.zip). I'm going to use it to create the test and figure out what's going on. Fwiw Sublime Text is also not able to correctly detect it.

I'm less familiar with the Eastern European languages. Out of curiosity, can you tell me the difference between the two? And when is one used over the other?

peminator · 2024-06-12T05:10:36Z

> Yeah, that code page is one of those I wasn't able to come up with a test string for. I was only able to test windows-1251.
>
> Thanks for providing one (windows1250.zip). I'm going to use it to create the test and figure out what's going on. Fwiw Sublime Text is also not able to correctly detect it.
>
> I'm less familiar with the Eastern European languages. Out of curiosity, can you tell me the difference between the two? And when is one used over the other?

Afaik Microsoft used this sentence to test fonts (a pangram that showcases the accented characters):
**Příliš žluťoučký kůň úpěl ďábelské ódy.**
That's a sentence in Czech. Another country that certainly used this code page is Slovakia, so I would add
**Päť tôní, ľahký skok**
Just copy the sentences and save them as Windows-1250 in VS Code or any other editor that supports it.
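If saving from an editor is inconvenient, the same fixture can be produced programmatically. This is only a sketch in Python, assuming the two sentences above as test strings; `fixture_bytes` and `write_fixture` are hypothetical helper names, and Python's `cp1250` codec stands in for Windows-1250.

```python
from pathlib import Path

CZECH = "Příliš žluťoučký kůň úpěl ďábelské ódy."
SLOVAK = "Päť tôní, ľahký skok"


def fixture_bytes(text: str) -> bytes:
    """Encode the text as Windows-1250; raises UnicodeEncodeError if a character is not representable."""
    return text.encode("cp1250")


def write_fixture(path: str, text: str) -> None:
    """Write the Windows-1250 bytes to `path` without a BOM or newline translation."""
    Path(path).write_bytes(fixture_bytes(text))


if __name__ == "__main__":
    # Hypothetical file names, chosen only for this example.
    write_fixture("czech-1250.txt", CZECH)
    write_fixture("slovak-1250.txt", SLOVAK)
```

Every character in both sentences maps to a single byte in this code page, so the byte length of each fixture equals the character count of the sentence.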

My suggestion: if you're unsure, whenever Windows-1252 or ISO-8859-2 is detected with any confidence, also add 1250 to the possible results with a slightly lower confidence level. The encodings are similar, and the characters are also listed on the Wikipedia page for Windows-1250. So if there is a chance it's 1252, there's also a chance it's 1250.

That way current users would still get what they expected before, and in VS Code, where I can provide multiple guess candidates in recent Insiders builds, I could configure it to look only for 1250 and get what I need...

That would be an immediate quick fix, to be perfected later.
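The idea could look roughly like the sketch below. It is written in Python purely as an illustration and is not jschardet's actual API or implementation: the `Guess` type, the `LOOKALIKES` set, the function names, and the `penalty` of 0.05 are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guess:
    encoding: str
    confidence: float


# Encodings whose detection should also suggest windows-1250 (assumed set).
LOOKALIKES = {"windows-1252", "ISO-8859-2"}


def with_1250_fallback(guesses: list[Guess], penalty: float = 0.05) -> list[Guess]:
    """Add windows-1250 just below the best look-alike guess, unless it is already listed."""
    result = list(guesses)
    already_listed = any(g.encoding == "windows-1250" for g in result)
    similar = [g.confidence for g in result if g.encoding in LOOKALIKES]
    if similar and not already_listed:
        result.append(Guess("windows-1250", max(0.0, max(similar) - penalty)))
    # Highest confidence first; existing guesses keep their original scores.
    return sorted(result, key=lambda g: g.confidence, reverse=True)


def restrict_to(guesses: list[Guess], allowed: set[str]) -> list[Guess]:
    """Keep only the encodings the caller asked for, as an editor setting might."""
    return [g for g in guesses if g.encoding in allowed]


if __name__ == "__main__":
    detected = [Guess("windows-1252", 0.9)]
    candidates = with_1250_fallback(detected)
    print(candidates)
    print(restrict_to(candidates, {"windows-1250"}))
```

Because the original guesses are left untouched and windows-1250 always ranks below the look-alike that triggered it, callers who take only the top result see no change, while callers who filter the candidate list can still pick 1250.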

Also, I suggest you rename it. It was never named Hungarian afaik; its official name is "Central European" because it's used in many countries there, and naming it after one country may upset the others
(exactly how it upset me as a Slovak: Hungary tried to absorb Slovakia in the past as part of its empire, behaved with great arrogance back then, and Slovak people were oppressed; many Slovaks still feel great resentment about it). It never was an open war, but it was often not far from it, so naming it Hungarian feels to me like "the empire strikes back", long after the empire was dissolved ;D

This is in relation to #70. The differences between the two are minimal, so this will be a workaround for now. The two encodings use very different detection models. The windows-1252 detector is based purely on the occurrence probability of each character's class. The windows-1250 detector uses a Hungarian language model to decide whether the text is in Hungarian. This is brittle, as other languages also use windows-1250.

aadsm · 2024-06-18T01:42:09Z

I was busy on the weekend. I think your suggestion makes sense so I went ahead and implemented just that.

The reason it's named "windows-1250 (Hungarian)" is that it uses a Hungarian language model to predict whether the text is in Hungarian. Like you mentioned, other countries used the same encoding, so I imagine that's the reason we're getting no match at all. But it could also be that windows-1252 is detected from fewer characters than windows-1250 needs to reach any confidence. This is something I might look into in the future, as it could also affect other encodings.

Ah yeah, the Slavic countries have been a source of invasion and dispute for centuries 😬. I didn't follow the breakup of those countries after the fall of the USSR, but I still remember Yugoslavia and Czechoslovakia being on the news.
The word slave actually comes from "Slav" due to the slavery of Slavs that happened during the caliphate :(.

peminator changed the title ~~unreliable detection, wrong naming for 1250~~ unreliable detection - windows1250 Jun 6, 2024

aadsm self-assigned this Jun 11, 2024
