A small gem to aid with locale-sensitive string comparison (collation), which Ruby lacks by default. Roughly based on Matz's rather ancient code. However, instead of creating a wrapper around these functions, I call them using FFI.
Everything this library does could be accomplished by adding two functions to ffi-libc. However, I didn't need any of the extra bindings ffi-libc would bring, and decided to separate the functionality.
The library offers only 4 functions, all of them thin wrappers over libc functionality:
You don't need ffi-locale if you:
- are using your ORM & RDBMS to sort strings - both major open-source DBs have had good or decent support for years
- will only ever be using ASCII
- think i18n is only about translating some messages
You need ffi-locale if you:
- are OCD about proper sorting
- process messy textual data from third-party sources
- keep your strings in a byte-oriented or otherwise localization-oblivious storage
- twitter_cldr offers the same functionality, and much, much more.
- ICU has collation, encoding detection and more.
- sort-alphabetical does a kind of collation that sorts accented letters same as their non-accented counterparts. It's not proper locale-sensitive collation, but might fit your needs.
- clocale is basically a clone of ffi-locale, but written as a C Ruby extension, which makes it work on non-GNU C libraries (BSD, Windows).
Add this line to your Gemfile:
gem 'ffi-locale', github: 'k3rni/ffi-locale'
You need to install the GitHub version of this gem, because it was never pushed to RubyGems due to naming conflicts. RubyGems has seanohalpin's very similar gem under this name. Check for that before reporting errors.
irb> FFILocale.setlocale FFILocale::LC_COLLATE, 'pl_PL.UTF8'
irb> FFILocale.strcoll "łyk", "myk"
-1 # Correct collation order. In Polish alphabet, 'ł' comes between 'l' and 'm'.
irb> "łyk" <=> "myk"
1 # Incorrect collation. Correct with respect to Ruby semantics, which compares bytewise.
irb> %w(m l ł).sort { |a, b| FFILocale.strcoll a, b }
["l", "ł", "m"]
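The result of `<=>` above follows directly from the UTF-8 encoding of the two strings. A minimal sketch in plain Ruby (no ffi-locale involved; the variable names are only illustrative) that makes the bytewise comparison visible:

```ruby
# Plain Ruby, no locale support involved.
polish = "łyk"
other  = "myk"

# 'ł' is U+0142, which UTF-8 encodes as two bytes; 'm' is a single ASCII byte.
lead_bytes = polish.bytes.first(2)  # => [197, 130]
m_byte     = other.bytes.first      # => 109

# String#<=> compares byte by byte, so 197 > 109 decides the result
# before any alphabet rules could matter.
bytewise = polish <=> other         # => 1
```

Any character encoded with a lead byte above the ASCII range sorts after every ASCII letter under this comparison, whatever its place in the alphabet.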
strxfrm approach (mass string sorting: bulk-transform first, then rely on Ruby built-in string comparison):
irb> strings = %w(Ágnes Andor Cecil Cvi Csaba Elemér Éva Géza Gizella György Győző Lóránd Lotár Lőrinc Lukács Orsolya Ödön Ulrika Üllő)
irb> FFILocale.setlocale FFILocale::LC_COLLATE, 'hu_HU.UTF8'
irb> sorted = strings.shuffle.sort_by{|s| FFILocale.strxfrm(s)}
=> ["Ágnes", "Andor", "Cecil", "Cvi", "Csaba", "Elemér", "Éva", "Géza", "Gizella", "György", "Győző", "Lóránd", "Lotár", "Lőrinc", "Lukács", "Orsolya", "Ödön", "Ulrika", "Üllő"]
irb> sorted == strings
true
One advantage of using strxfrm with sort_by is performance: the collation transform is computed only once for each item. Another is that sort_by makes it easier to sort by a compound value (e.g. multiple columns):
irb> FFILocale.setlocale FFILocale::LC_COLLATE, 'hu_HU.UTF8'
irb> [{name: "Ágnes", id: 789}, {name: "Andor", id: 456}, {name: "Ágnes", id: 123}].sort_by{|u| [FFILocale.strxfrm(u[:name]), u[:id]] }
=> [{:name=>"Ágnes", :id=>123}, {:name=>"Ágnes", :id=>789}, {:name=>"Andor", :id=>456}]
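The "computed only once" claim can be checked without any locale set up. The sketch below is an illustration only: `key` is a hypothetical stand-in for `FFILocale.strxfrm` (it merely downcases), wrapped in a counter so the number of calls is visible.

```ruby
# Hypothetical stand-in for FFILocale.strxfrm: maps a string to a sort key.
# It only downcases, and counts how often it is called.
calls = 0
key = ->(s) { calls += 1; s.downcase }

names = %w(delta Alpha charlie Bravo echo)

# sort_by computes the key once per element, then sorts the keys.
calls = 0
by_key = names.sort_by { |s| key.call(s) }
sort_by_calls = calls

# A sort block recomputes both keys on every comparison.
calls = 0
by_block = names.sort { |a, b| key.call(a) <=> key.call(b) }
sort_calls = calls

# Compound keys work because Ruby compares arrays element by element.
users = [{ name: "Bravo", id: 2 }, { name: "alpha", id: 9 }, { name: "Bravo", id: 1 }]
by_name_then_id = users.sort_by { |u| [key.call(u[:name]), u[:id]] }
```

Both orderings are identical; only the number of key computations differs, and the gap widens as the list grows.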
- Extensions to String class, to facilitate collation.
- Altering default String sort order. Bad idea - won't be implemented.
- Extensions to Array or Enumerable, to add or alter sort methods. Unnecessary, because passing blocks to sort and sort_by solves the issue (see example above).
- Not tested beyond Linux. Patches are welcome.
Copyright © 2011-2017 Krzysztof Zych. See LICENSE.txt for further details.