unreliable detection - windows1250 #70

Open
peminator opened this issue Sep 16, 2021 · 5 comments

@peminator

windows-1250 is labeled as Hungarian, but it is really the Central European code page, so the text may just as well be Slovak or Czech, or some other language. The proper name is "Central European". The accented characters to recognize include, for example, čČšŠťŤžŽéÉľĽ.

Found in VS Code, which uses this library: text saved as windows-1250 is detected on reopen as windows-1252, as one of the other 125* code pages, or even as ISO-8859-2. Which one depends on what subset of these non-ASCII characters the content contains.
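To illustrate why the guess depends on which accented characters are present, here is a minimal Python sketch. It is not part of jschardet and uses only Python's built-in codecs (`cp1250`, `cp1252`, `iso8859_2`); the sample string is just a few of the characters listed above.

```python
# Encode a few Central European letters as windows-1250 (Python codec "cp1250").
sample = "čšťžľ"
raw = sample.encode("cp1250")

# Decode the same bytes under the encodings the detector tends to report.
# errors="replace" is needed because some byte values are unassigned in cp1252.
as_1250 = raw.decode("cp1250")
as_1252 = raw.decode("cp1252", errors="replace")
as_8859_2 = raw.decode("iso8859_2", errors="replace")

print("bytes     ", raw.hex(" "))
for name, text in [("cp1250", as_1250), ("cp1252", as_1252), ("iso8859_2", as_8859_2)]:
    print(f"{name:10} {text!r}")
```

š and ž have the same byte values in windows-1250 and windows-1252, so text that contains only those gives a detector nothing to tell the two apart. č decodes as è under windows-1252, the byte for ť is unassigned there, and ISO-8859-2 places š, ť, ž and ľ at different byte values altogether.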

@peminator
Author

peminator commented Jun 6, 2024

Hey bro, still bad? I just tested the new VS Code Insiders build, which uses jschardet, and it
DOES NOT DETECT windows-1250 AT ALL.

check here:
microsoft/vscode#208550 (comment)

peminator changed the title from "unreliable detection, wrong naming for 1250" to "unreliable detection - windows1250" on Jun 6, 2024
@aadsm
Owner

aadsm commented Jun 11, 2024

Hey! I just saw your comment on microsoft/vscode#208550. Let me look into this. Also, don't-bro-me ;P.

aadsm self-assigned this on Jun 11, 2024
@aadsm
Owner

aadsm commented Jun 11, 2024

Yeah, that code page is in the group of tests I wasn't able to come up with a string for. I was only able to test windows-1251.

Thanks for providing one (windows1250.zip). I'm going to use it to create the test and figure out what's going on. Fwiw Sublime Text is also not able to correctly detect it.

I'm less familiar with the Eastern European languages. Out of curiosity, can you tell me the difference between the two? And when is one used more than the other?

@peminator
Author

peminator commented Jun 12, 2024

> Yeah, that code page is in the group of tests I wasn't able to come up with a string for. I was only able to test windows-1251.
>
> Thanks for providing one (windows1250.zip). I'm going to use it to create the test and figure out what's going on. Fwiw Sublime Text is also not able to correctly detect it.
>
> I'm less familiar with the Eastern European languages. Out of curiosity, can you tell me the difference between the two? And when is one used more than the other?

AFAIK Microsoft used this sentence to test fonts (it is a pangram that showcases the accented characters):
**Příliš žluťoučký kůň úpěl ďábelské ódy.**
That's a sentence in Czech. Slovakia certainly used this code page too, so I would add:
**Päť tôní, ľahký skok**
Just copy the sentences and save them as Windows-1250 in VS Code or any other editor that can do it.
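For anyone who prefers to generate the test file instead of saving it from an editor, here is a small Python sketch under the same idea. The file name and the `write_fixture` helper are made up for illustration; only the two sentences come from this comment.

```python
from pathlib import Path
import tempfile

PANGRAMS = [
    "Příliš žluťoučký kůň úpěl ďábelské ódy.",  # Czech
    "Päť tôní, ľahký skok",  # Slovak
]


def write_fixture(path: Path) -> bytes:
    """Write the pangrams to `path` as windows-1250 and return the raw bytes."""
    data = "\n".join(PANGRAMS).encode("cp1250")
    path.write_bytes(data)
    return data


# Hypothetical file name, placed in the system temp directory.
fixture_path = Path(tempfile.gettempdir()) / "windows1250_sample.txt"
fixture_bytes = write_fixture(fixture_path)
print(len(fixture_bytes), "bytes written to", fixture_path)
```

Because windows-1250 is a single-byte encoding, the file has exactly one byte per character, and the bytes are not valid UTF-8, so a detector cannot fall back on a UTF-8 guess.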

My suggestion for the unsure case: if either Windows-1252 or ISO-8859-2 is detected with any confidence, also add Windows-1250 to the possible results with a slightly lower confidence level. The code pages are similar, and the differing characters are listed in the Wikipedia article on Windows-1250. So if there is a chance it's 1252, there is also a chance it's 1250.

That way, current users would still get what they got before. And in VS Code, where recent Insiders builds let me provide multiple guess candidates, I could configure it to look only for 1250 and get what I need...

That would be an immediate quick fix, to be perfected later.
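A rough Python sketch of that idea, only to make the proposal concrete. It is not jschardet's code: the `add_windows_1250` helper, the `(name, confidence)` candidate shape, and the 0.05 penalty are all assumptions.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (encoding name, confidence between 0 and 1)

# Encodings whose detection should also suggest windows-1250 (per the proposal above).
LOOKALIKES = {"windows-1252", "iso-8859-2"}
PENALTY = 0.05  # assumed value; "slightly lower" is not quantified in this thread


def add_windows_1250(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates, plus a windows-1250 guess when a lookalike was detected."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    names = {name.lower() for name, _ in ranked}
    if "windows-1250" in names:
        return ranked  # already present, nothing to add
    scores = [conf for name, conf in ranked if name.lower() in LOOKALIKES and conf > 0]
    if not scores:
        return ranked  # no lookalike detected, leave the result untouched
    extra = ("windows-1250", max(0.0, max(scores) - PENALTY))
    return sorted(ranked + [extra], key=lambda c: c[1], reverse=True)


print(add_windows_1250([("windows-1252", 0.95)]))
```

Existing callers that only read the top candidate still see the same winner, while a caller that filters for windows-1250 now finds it in the list.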

Also, I suggest you rename it. AFAIK it was never named Hungarian; its official name is "Central European" because it is used in many countries there, and naming it after one country may trigger the others
(exactly how it triggered me as a Slovak: in the past Hungary tried to absorb Slovakia into its empire, it behaved with great arrogance back then and Slovak people were oppressed, and many Slovak people still feel great hatred about it). It never was an open war, but it was often not far from it, so naming it Hungarian feels to me like "the empire strikes back", even long after it was dissolved ;D

aadsm added a commit that referenced this issue Jun 18, 2024
This is in relation to #70. The differences between the two are minimal, so this will be a workaround for now. These two encodings use very different models for detection. The windows-1252 detector is based purely on the occurrence probability of each character's class. The windows-1250 detector uses a Hungarian language model to detect whether the text is in Hungarian. This is brittle, as there are other languages that use windows-1250.
@aadsm
Owner

aadsm commented Jun 18, 2024

I was busy on the weekend. I think your suggestion makes sense so I went ahead and implemented just that.

The reason it's named "windows-1250 (Hungarian)" is that it uses a Hungarian language model to predict whether the text is in Hungarian. Like you mentioned, other countries used the same encoding, so I imagine that's the reason we're getting no match at all. But it could also be that windows-1252 is detected with fewer characters than windows-1250 needs to come up with any confidence. This is something I might look into in the future, as it could also affect other encodings.

Ah yeah, the Slavic countries have been a source of invasion and dispute for centuries 😬. I didn't follow the breakup of those countries after the fall of the USSR, but I still remember Yugoslavia and Czechoslovakia being on the news.
The word "slave" actually comes from "Slav", due to the slavery of Slavs that happened during the caliphate :(.
