Handle non-utf8 string comparison
When normalizing field contents for comparison, if the contents include
invalid UTF-8 byte sequences, try to transcode the string from MARC-8 to
UTF-8 before continuing with the normalization. If the string is also
not valid MARC-8, use the unnormalized string for comparison.
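
The fallback order described above can be sketched without the marc gem. This is a minimal illustration, not the code from this commit: `normalize_with_fallback` and `FAKE_MARC8` are hypothetical names, the stand-in transcoder only knows the MARC-8 copyright byte (0xC3), and the sketch checks `valid_encoding?` up front rather than rescuing an exception.

```ruby
# Hypothetical helper (not part of marc_wrangler) showing the three-step
# fallback: UTF-8 first, then a transcoder, then the untouched input.
# `transcoder` is any callable that returns a UTF-8 string or raises.
def normalize_with_fallback(str, transcoder)
  utf8 = str.dup.force_encoding('UTF-8') # dup so the caller's string is not mutated
  return utf8.unicode_normalize.rstrip if utf8.valid_encoding?

  begin
    transcoder.call(str).unicode_normalize.rstrip
  rescue StandardError
    # Neither encoding fits: hand back the bytes unchanged for diffing.
    str.dup
  end
end

# Stand-in for a real MARC-8 transcoder (assumption for the demo only):
# accepts ASCII plus 0xC3, which MARC-8 uses for the copyright sign.
FAKE_MARC8 = lambda do |s|
  bytes = s.bytes
  unless bytes.all? { |b| b < 0x80 || b == 0xC3 }
    raise ArgumentError, 'not valid MARC-8'
  end
  bytes.map { |b| b == 0xC3 ? 0xA9 : b }.pack('U*')
end

puts normalize_with_fallback("$c\xC32008", FAKE_MARC8)
p normalize_with_fallback("\xC8", FAKE_MARC8).bytes
```

In the commit itself the transcoder role is played by `MARC::Marc8::ToUnicode#transcode` from ruby-marc.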
ldss-jm committed Feb 26, 2020
1 parent f894fd5 commit ef1fb28
Showing 4 changed files with 45 additions and 2 deletions.
22 changes: 21 additions & 1 deletion lib/marc_wrangler/comparable_field.rb
@@ -20,12 +20,32 @@ def self.spec=(spec)
     end

     def self.norm_string(str)
-      fs = str.force_encoding('UTF-8').unicode_normalize.gsub(/ +$/, '')
+      # ruby-marc assumes files are utf-8 unless another encoding is specified;
+      # LDR/09 value is not considered. If the string includes invalid encoding
+      # for utf-8, try assuming marc-8 encoding. If string is not valid marc-8
+      # either, return the string for comparison unnormalized (which for diffing
+      # two strings should be better than scrubbing invalid bytes).
+      begin
+        fs = str.force_encoding('UTF-8').unicode_normalize
+      rescue ArgumentError
+        begin
+          fs = marc8_transcoder.transcode(str).unicode_normalize
+        rescue StandardError
+          return str.dup
+        end
+      end
+      fs.rstrip!
       fs.gsub!(/(.)\uFE20(.)\uFE21/, "\\1\u0361\\2") if fs =~ /\uFE20/
       fs.gsub!(/\.$/, '') if @ignore_trailing_periods
       fs
     end

+    def self.marc8_transcoder
+      return @marc8_transcoder if @marc8_transcoder
+      require 'marc/marc8/to_unicode'
+      @marc8_transcoder = MARC::Marc8::ToUnicode.new
+    end
+
     def self.omitted_subfields_string(field)
       return field.to_s unless @tags_w_sf_omissions&.key?(field.tag)

2 changes: 1 addition & 1 deletion lib/marc_wrangler/version.rb
@@ -1,3 +1,3 @@
 module MarcWrangler
-  VERSION = '0.1.2.2'.freeze
+  VERSION = '0.1.3'.freeze
 end
3 changes: 3 additions & 0 deletions marc_wrangler.gemspec
@@ -37,4 +37,7 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency 'highline', "~> 2.0.1"
   spec.add_runtime_dependency 'marc', "~> 1.0.2"
   spec.add_runtime_dependency 'enhanced_marc', "~> 0.3.2"
+
+  # unf_ext 0.0.7.6 was released without windows binaries
+  spec.add_runtime_dependency 'unf_ext', "0.0.7.5"
 end
20 changes: 20 additions & 0 deletions spec/comparable_fields_spec.rb
@@ -0,0 +1,20 @@
+require 'spec_helper'
+
+RSpec.describe MarcWrangler::ComparableField do
+  describe '#norm_string' do
+    xit 'normalizes utf-8 strings' do
+    end
+
+    it 'also handles marc-8 encoded strings' do
+      marc8 = "$c\xC32008"
+      expect(described_class.norm_string(marc8)).to eq('$c©2008')
+    end
+
+    context 'string is not valid utf-8 or marc-8' do
+      it 'returns original, unnormalized, string' do
+        str = "\xC8"
+        expect(described_class.norm_string(str)).to eq(str)
+      end
+    end
+  end
+end
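
The pending 'normalizes utf-8 strings' example leaves the valid UTF-8 path untested. As a rough sketch of what that path does once the string is decoded, the steps can be written as a stand-alone function. `normalize_valid_utf8` and its keyword argument are hypothetical names for illustration; the gem keeps the trailing-period setting in a class-level instance variable instead.

```ruby
# Hypothetical stand-alone version of the steps norm_string applies to a
# string that is already valid UTF-8. Not the gem's API.
def normalize_valid_utf8(str, ignore_trailing_periods: false)
  fs = str.unicode_normalize # NFC is the default form
  fs = fs.rstrip             # drop trailing whitespace
  # A ligature written with two combining half marks (U+FE20 ... U+FE21)
  # becomes one combining double inverted breve (U+0361) between the letters.
  fs = fs.gsub(/(.)\uFE20(.)\uFE21/, "\\1\u0361\\2")
  fs = fs.gsub(/\.$/, '') if ignore_trailing_periods
  fs
end

# Decomposed e + acute composes to a single code point; padding and the
# final period are removed.
p normalize_valid_utf8("Cafe\u0301.  ", ignore_trailing_periods: true)
# Ligature halves collapse to U+0361.
p normalize_valid_utf8("t\uFE20s\uFE21").codepoints.map { |c| format('U+%04X', c) }
```

Assertions along these lines could fill in the pending example, assuming the class is configured with the matching trailing-period setting.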
