[R] Encoding info losses for non-ASCII column names #335

Open
shrektan opened this issue Apr 9, 2018 · 1 comment

shrektan commented Apr 9, 2018

If the column names contain non-ASCII characters, their encoding marks are lost when the data is read back from a local feather file: Encoding() reports "unknown" for every name. The following example was run on my Mac.

It is even worse on Windows, where feather appears to convert the column names from the unknown encoding to the native encoding, producing garbled column names that cannot be converted back.

Minimal Reproducible Example

utf8_strings <- c("çile", "façile", "El. paÅ¡tas", "¡tas", "Þ")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
tbl <- data.frame(utf8_strings, latin1_strings, stringsAsFactors = FALSE)
colnames(tbl) <- c(utf8_strings[2], latin1_strings[2])
tbl2 <- local({
  tmp_file <- tempfile(fileext = ".feather")
  on.exit(unlink(tmp_file), add = TRUE)
  feather::write_feather(tbl, tmp_file)
  feather::read_feather(tmp_file)
})
colnames(tbl)
#> [1] "façile" "façile"
colnames(tbl2)
#> [1] "façile"    "fa\xe7ile" ############SEE HERE############
Encoding(colnames(tbl))
#> [1] "UTF-8"  "latin1"
Encoding(colnames(tbl2))
#> [1] "unknown" "unknown"
Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
colnames(tbl2)
#> [1] "façile" "façile"

sessionInfo()

#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.16    digest_0.6.15   rprojroot_1.3-2 backports_1.1.2
#>  [5] formatR_1.5     magrittr_1.5    evaluate_0.10.1 pillar_1.2.1   
#>  [9] rlang_0.2.0     stringi_1.1.7   rmarkdown_1.9   tools_3.4.3    
#> [13] stringr_1.3.0   feather_0.3.1   hms_0.4.2       yaml_2.1.18    
#> [17] compiler_3.4.3  pkgconfig_2.0.1 htmltools_0.3.6 knitr_1.20     
#> [21] tibble_1.4.2

On Windows, the output becomes:

> colnames(tbl)
[1] "façile"    "fa<e7>ile"
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile"    
> Encoding(colnames(tbl))
[1] "UTF-8"  "latin1"
> Encoding(colnames(tbl2))
[1] "unknown" "unknown"
> Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile"     ######NOTICE THE FIRST ONE###########
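
A possible workaround, as a minimal sketch only: convert the column names to UTF-8 with enc2utf8() before writing, then re-declare them as UTF-8 after reading. The helpers normalise_names_utf8() and mark_names_utf8() are hypothetical names introduced here and are not part of feather. This assumes the bytes stored in the file are still valid UTF-8 when read back, which matches the Mac output above; it is untested on Windows, where the names may already have been converted by the time they reach R.

```r
# Hypothetical helper (not part of feather): convert all column names to
# UTF-8 so that only one encoding ever reaches the file.
normalise_names_utf8 <- function(df) {
  names(df) <- enc2utf8(names(df))
  df
}

# Hypothetical helper (not part of feather): declare the column names to be
# UTF-8 without changing their bytes. Pure-ASCII names are left unmarked.
mark_names_utf8 <- function(df) {
  nms <- names(df)
  Encoding(nms) <- "UTF-8"
  names(df) <- nms
  df
}

# A one-column table whose name is marked as latin1.
latin1_name <- iconv("fa\u00e7ile", from = "UTF-8", to = "latin1")
tbl <- data.frame(x = 1:2)
names(tbl) <- latin1_name

# After normalising, the name is stored and marked as UTF-8.
tbl <- normalise_names_utf8(tbl)
print(Encoding(names(tbl)))

# Round trip through feather, only if the package is installed.
if (requireNamespace("feather", quietly = TRUE)) {
  tmp_file <- tempfile(fileext = ".feather")
  feather::write_feather(tbl, tmp_file)
  tbl2 <- mark_names_utf8(feather::read_feather(tmp_file))
  unlink(tmp_file)
  print(Encoding(names(tbl2)))
}
```

Normalising before writing avoids having to remember a per-column encoding, which the read side cannot recover once every name comes back as "unknown".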

wesm commented Apr 10, 2020

Does this issue persist in the arrow library?
