Adds custom boost::locale::numpunct #65

salvoilmiosi · 2021-07-04T13:50:28Z

#64 This PR attempts to fix the issue with the suggestion I've been making
It removes the std::numpunct implementation and instead replaces it with a simple, custom class with a similar interface that returns string_type instead of CharType.

artyom-beilis · 2021-07-04T15:28:54Z

Unfortunately it neither fixes the problem but also introduces new problems.

Lets start from what it broke. boost.locale uses std::numpunct facet to format numbers for different locales. It gives partial solution (since it can't provide NBSP or custom digits etc) but at least gives something. Once you removed it as::number wouldn't work in std/posix/winapi backends.

Additionally note thousands and decimal separator aren't enough - there is grouping - how to format it isn't always groups of three, sometimes it is two or other for example "10,00,00,00,00,000". So you need to extract both grouping and marks. But in ICU formatting of groups is very different from std::locale/posix locale. In you case if grouping returns "" - than no thousands separators are used.

I recommend you to study the subject deeply before you write such a patch - in both terms localization and how it is done and how it is implemented in Boost.Locale. You can always ask questions

salvoilmiosi · 2021-07-04T15:56:03Z

I must admit that my patch was a bit rushed. I have also now added grouping info to my icu implementation.
I think we could add a std::numpunct facet by just pulling the first character from the string output and defaulting them to ', .' if they're more than one byte, just like it did in win32 and posix implementation.

artyom-beilis · 2021-07-04T15:59:30Z

It is little bit complicated than that.

https://en.wikipedia.org/wiki/Decimal_separator#Examples_of_use

I really suggest first define what are you trying to solve. Than look what exists in icu and multiple locales (usually you'll be surprised how euro-centric your thinking will be) and how to solve it properly.

It is highly complex topic usually not solved by such a simple manner... And I'm not going to merge half baked solutions.

salvoilmiosi · 2021-07-04T17:04:11Z

I have made my boost::locale::numpunct a subclass of std::numpunct to un-break winapi. Now this solves my problem of being able to access decimal_point, thousands_sep and grouping from icu backend and is compatible with all api's. I have pretty much just used the same technique that have already been in place for posix and winapi.

artyom-beilis · 2021-07-04T18:31:05Z

I have made my boost::locale::numpunct a subclass of std::numpunct to un-break winapi. Now this solves my problem of being able to access decimal_point, thousands_sep and grouping from icu backend and is compatible with all api's. I have pretty much just used the same technique that have already been in place for posix and winapi.

But it does not solve the issue: std::numpunct should return "C" locale separators for std/posix/winapi backend as it is done in ICU. What you suggest is something entirely different.

And I don't think it is correct thing to do.

salvoilmiosi · 2021-07-04T18:44:44Z

So you're telling me that the current std::numpunct implementation in winapi and posix is broken but it's necessary because it's called by libstc++ in order for number formatting and parsing to work? Also why should numpunct always return "C" locale separators? Can't you just imbue std::locale::classic() if you want that functionality?

artyom-beilis · 2021-07-04T18:47:03Z

So you're telling me that the current std::numpunct implementation in winapi and posix is broken but it's necessary because it's called by libstc++ in order for number formatting and parsing to work?

Exactly.

artyom-beilis · 2021-07-04T18:49:41Z

Also why should numpunct always return "C" locale separators? Can't you just imbue std::locale::classic() if you want that functionality?

Because by default the formatting (without specific flags) by boost.locale is std::locale::classic() formatting

salvoilmiosi · 2021-07-04T18:51:23Z

I guess I'll live with my fork in my own project then.

Flamefire · 2021-07-06T06:53:57Z

I think it all boils down to this:

I really suggest first define what are you trying to solve.

See #64 (comment) for why it is working as defined for ICU and #64 (comment) for what is currently NOT working

Patches and fixes are always great, but first the expected behavior needs to be defined and evaluated. Especially when it diverges from what was initially designed and reviewed(!)

salvoilmiosi · 2021-07-06T09:35:38Z

All I need to do is to get decimal point, thousands separator and grouping because I need to define a regex to find and parse numbers from strings, and the library that I use to deal with numbers only supports std::numpunct. There is no way currently to get that information from boost::locale without hacks.

Flamefire · 2021-07-06T09:47:04Z

So you need a std::numpunct for the output of bl::as::number? But that would be impossible as the interface cannot be fulfilled. See https://en.cppreference.com/w/cpp/locale/numpunct

Let's take #64 (comment)

If you want to format number it is also wrong to use facet because number formatting is much more than just separator for example ١٢ is very valid and good number.

If that is a possible output of bl::as::number then the whole point of numpunct is mood. How would your library deal with that?

-> Don't use bl::as::number, fix the posix numpunct in Boost.Locale and use that.

bialasjaroslaw · 2022-07-24T11:00:06Z

It is long. TL;DR -> skip 3 paragraphs

I came here because I wanted to have a locale aware formatting functionality, that does not force me to use any heavy commercial product (if there is anything else that can work cross-platform I will be happy to try it).

How hard it could be, right? There is a std::locale, so that is all, done. Except, it is not working if locale is not generated on the OS (exception is thrown to give ma a clear indication that I am screwed). That's not an option, are there any alternatives? Boost.Locale for the rescue. I checked difference between 3 backends: std, posix and icu (working on Linux). Looks fine, as it is enough to use boost::generator to generate a locale and imbue it on ostream. Although only icu backend works for locales that aren't generated for current OS, it is fine for me.

Finally I can use generated locale and inject it to formatting library to have double an int localized - not so fast. FMT (which I am using) does not care about locale I am injecting. Why? Because it is using numpunct facet methods (like grouping and thousands_sep) and those are broken because numpunct is broken.

TL;DR

To sum up. I can not even have ' ' for thousand separator, ',' for decimal separator and grouping size of 3, because there is NBSP - C2 A0, instead of ' ' in a locale that I want to support. There is no easy alternative provided, because (as far as I understand) no solution is better option, than broken (but close enough and maybe for some people acceptable) solution.

Workaround

I can always use stringstream and boost::locale::as::number to get similar results, but it will also require to change some code that I already have in my codebase with locale object injected to 3rd party library. I can at least try to apply some patching for FMT in case a different facet can be used with correct separators, but I can see none in the documentation except one that is mentioned in this toppic.
One can also said that because I only need 3 simple pieces of data for each locale, it will be simpler to write a script that will generate that for my code.

Flamefire · 2022-07-24T15:23:18Z

I still don't understand the issue that is trying to be solved here.

I need to define a regex to find and parse numbers from strings, and the library that I use to deal with numbers only supports std::numpunct

std::numpunct does not support multiple bytes for the separators, so returning the correct values for the selected locale is impossible.

There is no easy alternative provided, because (as far as I understand) no solution is better option, than broken (but close enough and maybe for some people acceptable) solution.

So the issue is that the current std::numpunct implementation used by Boost.Locale should fall back to the values of the C locale if the separators would be multi-byte to provide a "best-effort solution" for those cases, correct?

Can you show a few simple test codes and expected outputs?

bialasjaroslaw · 2022-07-24T18:13:06Z

I have put some code in the repository on GitHub
I have used couple options to show what is going on with std, icu and posix backends for 3 different locales and couple ways of formating output (fmt and stringstream). I have also add a possibility to check how Qt behaves with same locales.

It is worth mentioning what locales I have on my OS:

$ locale -a
C
C.UTF-8
en_US.utf8
pl_PL.utf8
POSIX

Question is: Should different backends behave differently in the same conditions?
Right now ICU locale for FMT is always behaving like there is only C locale available, even if there is only single character group delimiter. Backends like std and posix are doing that only for locales not available in the OS, for others they provide "best-effort solution" (empty char instead of nbsp for pl_PL). In case of en_US it is correct delimiter as there is only single character.

Stream based formating works well, but only for ICU, as std and posix provide only "best-effort" and produce the same output as in case of FMT format, so C for missing locales, and "best-effort" for multi character ones.

QtLocale formatting is using ICU under the hood. As you can see numbers are formatted correctly in all cases.

To sum up.
I don't know if there should be "reasonable" replacement in case of multi-character separators or just fallback to C locale. I want to have one because it is convenient, but that might be not good enough (there might be some projects that rely on that behavior). On the other hand there is no easy way of formatting int/double values correctly for multiple locales except for ICU + streams, which is not always a possibility (like libfmt).

There should be easily accessible and correct solution for a community. It was not "so easy" to write a code to inherit from std::numpunct to get at least "best-effort" results, and believe me I was determined.
For now I can see couple options:

modify ICU backend and return "best-effort" separator. Problem is if it should be coherent with std and posix and return empty separator for nbsp or ' '. Other issue is some people may depend on current "broken" solution, but I don't know if it is a big deal
provide a header with custom facet that can be used to obtain separators, grouping and delimiters but as a correct type. Maybe there is already one somewhere, but I was not able to find it.
Force C++ standardization group to fix std::numpunct facet - how long could it take, and how hard can it be, right?

Flamefire · 2022-07-25T08:11:27Z

Ok, I'll tackle that soonish although I can't provide estimates. First I want to get the tests of this lib back passing to avoid introducing regressions.

Question is: Should different backends behave differently in the same conditions?

Yes. IMO (I didn't invent this library, I only help maintaining) each backend should do formatting as good as possible. ICU is basically the "gold standard" in this regards so other backends may differ from ICU due to lack of the data tables etc. that ICU provides.

provide a header with custom facet that can be used to obtain separators, grouping and delimiters but as a correct type.

So basically the approach done by this PR.

So in summary this would be the solution:

provide a boost::locale::numpunct for all backends which for non-ICU backends behaves like the current std::numpunct solution (i.e. no change except having a facet in the Boost namespace which exists for all backends)
Enhance the facet to return strings via additional functions. For non-ICU this returns 1-char strings, for ICU it returns the correctly encoded string
For the ICU backend fall back to the C locale separator when it would be multi-char

Hence power users could do:

std::use_facet<boost::locale::numpunct<char>>(loc).decimal_point()
std::use_facet<boost::locale::numpunct<char>>(loc).decimal_point_str()

and other libs can always do std::use_facet<std::numpunct<char>>(loc).decimal_point() and get the same result as std::use_facet<boost::locale::numpunct<char>>(loc).decimal_point()

This addresses all 3 points of your options:

best-effort result also for ICU
custom facet with string-type separators
something which can be added to the standard if sufficient and widely used and someone has the time and energy to write a paper and propose it for the standard

Flamefire · 2023-05-14T10:06:04Z

FYI: I'm working on this and will open a new PR containing your change, rebased to develop with conflicts resolved already. That was non-trivial so I didn't want to burden you with that especially as it took that long till I got to this.

I'll also add tests for this which was a lot harder earlier but now should be pretty straight forward. So nothing do to from your side.

Anyway: Thanks for your proposal and work and sorry for not getting to this earlier.

salvoilmiosi force-pushed the numpunct_fix branch from 77d1e5c to d43f0d0 Compare July 4, 2021 13:52

salvoilmiosi force-pushed the numpunct_fix branch from d43f0d0 to 3672482 Compare July 4, 2021 15:52

salvoilmiosi force-pushed the numpunct_fix branch from 3672482 to d1598d3 Compare July 4, 2021 15:58

salvoilmiosi force-pushed the numpunct_fix branch 4 times, most recently from b1f5d00 to 770ab80 Compare July 4, 2021 16:58

salvoilmiosi force-pushed the numpunct_fix branch 3 times, most recently from 348b7a8 to 8ddfc7e Compare July 5, 2021 18:10

Added custom boost::locale::numpunct

ff22efc

salvoilmiosi force-pushed the numpunct_fix branch from 8ddfc7e to ff22efc Compare July 5, 2021 18:15

fix icu_numpunct constructor

856ebf2

salvoilmiosi force-pushed the numpunct_fix branch from bb6342c to 856ebf2 Compare July 30, 2022 12:44

Merge remote-tracking branch 'upstream/develop' into numpunct_fix

0bddea8

salvoilmiosi force-pushed the numpunct_fix branch 15 times, most recently from d033f1e to 8f3a906 Compare July 31, 2022 20:01

fixes

6d6942c

salvoilmiosi force-pushed the numpunct_fix branch from 8f3a906 to 6d6942c Compare July 31, 2022 20:02

Merge remote-tracking branch 'upstream/develop' into numpunct_fix

f854557

Flamefire mentioned this pull request May 21, 2023

Add custom boost::locale::numpunct #170

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds custom boost::locale::numpunct #65

Adds custom boost::locale::numpunct #65

salvoilmiosi commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021

salvoilmiosi commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021

salvoilmiosi commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021 •

edited

Loading

salvoilmiosi commented Jul 4, 2021 •

edited

Loading

artyom-beilis commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021

salvoilmiosi commented Jul 4, 2021

Flamefire commented Jul 6, 2021

salvoilmiosi commented Jul 6, 2021

Flamefire commented Jul 6, 2021

bialasjaroslaw commented Jul 24, 2022 •

edited

Loading

Flamefire commented Jul 24, 2022

bialasjaroslaw commented Jul 24, 2022

Flamefire commented Jul 25, 2022

Flamefire commented May 14, 2023

Adds custom boost::locale::numpunct #65

Are you sure you want to change the base?

Adds custom boost::locale::numpunct #65

Conversation

salvoilmiosi commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021

salvoilmiosi commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021

salvoilmiosi commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021 • edited Loading

salvoilmiosi commented Jul 4, 2021 • edited Loading

artyom-beilis commented Jul 4, 2021

artyom-beilis commented Jul 4, 2021

salvoilmiosi commented Jul 4, 2021

Flamefire commented Jul 6, 2021

salvoilmiosi commented Jul 6, 2021

Flamefire commented Jul 6, 2021

bialasjaroslaw commented Jul 24, 2022 • edited Loading

TL;DR

Workaround

Flamefire commented Jul 24, 2022

bialasjaroslaw commented Jul 24, 2022

Flamefire commented Jul 25, 2022

Flamefire commented May 14, 2023

artyom-beilis commented Jul 4, 2021 •

edited

Loading

salvoilmiosi commented Jul 4, 2021 •

edited

Loading

bialasjaroslaw commented Jul 24, 2022 •

edited

Loading