update documentation for CRAN release
leeper committed Jun 17, 2017
1 parent 77419f7 commit 68a1071
Showing 6 changed files with 123 additions and 59 deletions.
2 changes: 1 addition & 1 deletion NEWS.md
@@ -1,6 +1,6 @@
# CHANGES TO v0.5.5

* Expanded test suite and increased test coverage.
* Expanded test suite and increased test coverage, fixing a few tests that were failing on certain CRAN builds.

# CHANGES TO v0.5.4

17 changes: 10 additions & 7 deletions README.Rmd
@@ -143,7 +143,13 @@ The core advantage of **rio** is that it makes assumptions that the user is prob

3. **rio**, wherever possible, does not import character strings as factors.

4. **rio** stores metadata from rich file formats (SPSS, Stata, etc.) in variable-level attributes in a consistent form regardless of file type or underlying import function. These attributes are identified as:
4. **rio** supports web-based imports natively, including from SSL (HTTPS) URLs, from shortened URLs, from URLs that lack proper extensions, and from (public) Google Documents Spreadsheets.

5. **rio** imports from single-file .zip and .tar archives automatically, without the need to explicitly decompress them. Export to compressed directories is also supported.

6. **rio** wraps a variety of faster, more stream-lined I/O packages than those provided by base R or the **foreign** package. It uses [**data.table**](https://cran.r-project.org/package=data.table) for delimited formats, [**haven**](https://cran.r-project.org/package=haven) for SAS, Stata, and SPSS files, smarter and faster fixed-width file import and export routines, and [**readxl**](https://cran.r-project.org/package=readxl) and [**openxlsx**](https://cran.r-project.org/package=openxlsx) for reading and writing Excel workbooks.

7. **rio** stores metadata from rich file formats (SPSS, Stata, etc.) in variable-level attributes in a consistent form regardless of file type or underlying import function. These attributes are identified as:

- `label`: a description of variable
- `labels`: a vector mapping numeric values to character strings those values represent
@@ -161,14 +167,11 @@ The core advantage of **rio** is that it makes assumptions that the user is prob
})
export(spread_attrs(dat), "data.dta")
```

In addition, two functions (added in v0.5.5) provide easy ways to create character and factor variables from these "labels" attributes. `characterize()` converts a single variable or all variables in a data frame that have "labels" attributes into character vectors based on the mapping of values to value labels. `factorize()` does the same but returns factor variables. This can be especially helpful for converting these rich file formats into open formats (e.g., `export(characterize(import("file.dta")), "file.csv")`).

5. **rio** supports web-based imports natively, including from SSL (HTTPS) URLs, from shortened URLs, from URLs that lack proper extensions, and from (public) Google Documents Spreadsheets.

6. **rio** imports from single-file .zip and .tar archives automatically, without the need to explicitly decompress them. Export to compressed directories is also supported.

7. **rio** imports and exports files based on an internal S3 class infrastructure. This means that other packages can contain extensions to **rio** by registering S3 methods. These methods should take the form `.import.rio_X()` and `.export.rio_X()`, where `X` is the file extension of a file type. An example is provided in the [rio.db package](https://github.com/leeper/rio.db).
8. **rio** imports and exports files based on an internal S3 class infrastructure. This means that other packages can contain extensions to **rio** by registering S3 methods. These methods should take the form `.import.rio_X()` and `.export.rio_X()`, where `X` is the file extension of a file type. An example is provided in the [rio.db package](https://github.com/leeper/rio.db).

8. **rio** wraps a variety of faster, more stream-lined I/O packages than those provided by base R or the **foreign** package. It uses [**data.table**](https://cran.r-project.org/package=data.table) for delimited formats, [**haven**](https://cran.r-project.org/package=haven) for SAS, Stata, and SPSS files, smarter and faster fixed-width file import and export routines, and [**readxl**](https://cran.r-project.org/package=readxl) and [**openxlsx**](https://cran.r-project.org/package=openxlsx) for reading and writing Excel workbooks.

## Package Installation

17 changes: 10 additions & 7 deletions README.md
@@ -175,7 +175,13 @@ The core advantage of **rio** is that it makes assumptions that the user is prob

3. **rio**, wherever possible, does not import character strings as factors.

4. **rio** stores metadata from rich file formats (SPSS, Stata, etc.) in variable-level attributes in a consistent form regardless of file type or underlying import function. These attributes are identified as:
4. **rio** supports web-based imports natively, including from SSL (HTTPS) URLs, from shortened URLs, from URLs that lack proper extensions, and from (public) Google Documents Spreadsheets.

5. **rio** imports from single-file .zip and .tar archives automatically, without the need to explicitly decompress them. Export to compressed directories is also supported.

6. **rio** wraps a variety of faster, more stream-lined I/O packages than those provided by base R or the **foreign** package. It uses [**data.table**](https://cran.r-project.org/package=data.table) for delimited formats, [**haven**](https://cran.r-project.org/package=haven) for SAS, Stata, and SPSS files, smarter and faster fixed-width file import and export routines, and [**readxl**](https://cran.r-project.org/package=readxl) and [**openxlsx**](https://cran.r-project.org/package=openxlsx) for reading and writing Excel workbooks.

7. **rio** stores metadata from rich file formats (SPSS, Stata, etc.) in variable-level attributes in a consistent form regardless of file type or underlying import function. These attributes are identified as:

- `label`: a description of variable
- `labels`: a vector mapping numeric values to character strings those values represent
@@ -193,14 +199,11 @@ The core advantage of **rio** is that it makes assumptions that the user is prob
})
export(spread_attrs(dat), "data.dta")
```

In addition, two functions (added in v0.5.5) provide easy ways to create character and factor variables from these "labels" attributes. `characterize()` converts a single variable or all variables in a data frame that have "labels" attributes into character vectors based on the mapping of values to value labels. `factorize()` does the same but returns factor variables. This can be especially helpful for converting these rich file formats into open formats (e.g., `export(characterize(import("file.dta")), "file.csv")`).

5. **rio** supports web-based imports natively, including from SSL (HTTPS) URLs, from shortened URLs, from URLs that lack proper extensions, and from (public) Google Documents Spreadsheets.

6. **rio** imports from single-file .zip and .tar archives automatically, without the need to explicitly decompress them. Export to compressed directories is also supported.

7. **rio** imports and exports files based on an internal S3 class infrastructure. This means that other packages can contain extensions to **rio** by registering S3 methods. These methods should take the form `.import.rio_X()` and `.export.rio_X()`, where `X` is the file extension of a file type. An example is provided in the [rio.db package](https://github.com/leeper/rio.db).
8. **rio** imports and exports files based on an internal S3 class infrastructure. This means that other packages can contain extensions to **rio** by registering S3 methods. These methods should take the form `.import.rio_X()` and `.export.rio_X()`, where `X` is the file extension of a file type. An example is provided in the [rio.db package](https://github.com/leeper/rio.db).

8. **rio** wraps a variety of faster, more stream-lined I/O packages than those provided by base R or the **foreign** package. It uses [**data.table**](https://cran.r-project.org/package=data.table) for delimited formats, [**haven**](https://cran.r-project.org/package=haven) for SAS, Stata, and SPSS files, smarter and faster fixed-width file import and export routines, and [**readxl**](https://cran.r-project.org/package=readxl) and [**openxlsx**](https://cran.r-project.org/package=openxlsx) for reading and writing Excel workbooks.

## Package Installation

77 changes: 40 additions & 37 deletions tests/testthat/test_gather_attrs.R
@@ -1,39 +1,42 @@
context("Gather attrs")
e <- try(import("http://www.stata-press.com/data/r13/auto.dta"))

e <- import("http://www.stata-press.com/data/r13/auto.dta")

test_that("Gather attrs from Stata", {
g <- gather_attrs(e)
expect_true(length(attributes(e[[1]])) >= 1)
expect_true(length(attributes(g[[1]])) == 0)
expect_true(length(attributes(e)) == 5)
expect_true(length(attributes(g)) == 8)
expect_true("label" %in% names(attributes(e[[1]])))
expect_true(!"label" %in% names(attributes(g[[1]])))
expect_true("label" %in% names(attributes(g)))
expect_true("labels" %in% names(attributes(g)))
})

test_that("Spread gathered attributes", {
g <- gather_attrs(e)
expect_true(all.equal(spread_attrs(g), e, check.attributes = TRUE))
})

test_that("Gather empty attributes", {
require("datasets")
g <- gather_attrs(iris)
expect_true(length(attributes(iris[[1]])) == 0)
expect_true(length(attributes(g[[1]])) == 0)
expect_true(length(attributes(iris)) == 3)
expect_true(length(attributes(g)) == 3)
})

test_that("gather_attrs() fails on non-data frame", {
expect_error(gather_attrs(letters))
})

test_that("spread_attrs() fails on non-data frame", {
expect_error(spread_attrs(letters))
})

rm(e)
if (!inherits(e, "try-error")) {

test_that("Gather attrs from Stata", {
g <- gather_attrs(e)
expect_true(length(attributes(e[[1]])) >= 1)
expect_true(length(attributes(g[[1]])) == 0)
expect_true(length(attributes(e)) == 5)
expect_true(length(attributes(g)) == 8)
expect_true("label" %in% names(attributes(e[[1]])))
expect_true(!"label" %in% names(attributes(g[[1]])))
expect_true("label" %in% names(attributes(g)))
expect_true("labels" %in% names(attributes(g)))
})

test_that("Spread gathered attributes", {
g <- gather_attrs(e)
expect_true(all.equal(spread_attrs(g), e, check.attributes = TRUE))
})

test_that("Gather empty attributes", {
require("datasets")
g <- gather_attrs(iris)
expect_true(length(attributes(iris[[1]])) == 0)
expect_true(length(attributes(g[[1]])) == 0)
expect_true(length(attributes(iris)) == 3)
expect_true(length(attributes(g)) == 3)
})

test_that("gather_attrs() fails on non-data frame", {
expect_error(gather_attrs(letters))
})

test_that("spread_attrs() fails on non-data frame", {
expect_error(spread_attrs(letters))
})

rm(e)

}
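
The guard introduced in this file can be illustrated without a network connection. The following is a minimal sketch in base R, assuming a hypothetical `fetch_remote()` helper that stands in for `import()` on a URL; neither it nor `run_checks()` is part of rio.

```r
# Hypothetical stand-in for a remote import; it fails when `online` is FALSE.
fetch_remote <- function(online) {
  if (!online) stop("could not resolve host")
  data.frame(x = 1:3)
}

# try() returns an object of class "try-error" instead of aborting the script.
e_fail <- try(fetch_remote(online = FALSE), silent = TRUE)
e_ok <- try(fetch_remote(online = TRUE), silent = TRUE)

# Run the dependent checks only when the download succeeded.
run_checks <- function(e) {
  if (inherits(e, "try-error")) {
    return("skipped")
  }
  stopifnot(is.data.frame(e))
  "checked"
}

run_checks(e_fail)
run_checks(e_ok)
```

One trade-off of this pattern is that a failed download leaves no trace in the test report. Calling `testthat::skip()` inside each test is a possible alternative that records the skip explicitly.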
23 changes: 16 additions & 7 deletions tests/testthat/test_remote.R
@@ -1,16 +1,23 @@
context("Remote Files")

test_that("Import Remote Stata File", {
expect_true(is.data.frame(import("http://www.stata-press.com/data/r13/auto.dta")))
f <- try(import("http://www.stata-press.com/data/r13/auto.dta"))
if (!inherits(f, "try-error")) {
expect_true(is.data.frame(f))
}
})

test_that("Import Remote GitHub File", {
rfile <- "https://raw.githubusercontent.com/leeper/csvy/master/inst/examples/example.csvy"
expect_true(inherits(import(rfile), "data.frame"), label = "Import remote file")

rfile_imported <- try(import(rfile))
if (!inherits(rfile_imported, "try-error")) {
expect_true(inherits(rfile_imported, "data.frame"), label = "Import remote file")
}
lfile <- try(remote_to_local(rfile))
expect_true(file.exists(lfile), label = "Remote file copied successfully")
expect_true(inherits(import(lfile), "data.frame"), label = "Import local copy successfully")
if (!inherits(lfile, "try-error")) {
expect_true(file.exists(lfile), label = "Remote file copied successfully")
expect_true(inherits(import(lfile), "data.frame"), label = "Import local copy successfully")
}
})

test_that("Import Remote File from Shortened URL", {
@@ -19,6 +26,8 @@ test_that("Import Remote File from Shortened URL", {
})

test_that("Import from Google Sheets", {
googleurl <- "https://docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo/edit#gid=0"
expect_true(inherits(import(googleurl), "data.frame"), label = "Import google sheets")
googleurl1 <- "https://docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo/edit#gid=0"
expect_true(inherits(import(googleurl1), "data.frame"), label = "Import google sheets (specified sheet)")
googleurl2 <- "https://docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo/edit"
expect_true(inherits(import(googleurl2), "data.frame"), label = "Import google sheets (unspecified sheet)")
})
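
The two Google Sheets tests differ only in whether the URL carries a `gid` (sheet id) fragment. As a rough illustration of why both forms can be handled, the sketch below rewrites an "edit" URL into Google's CSV export form. `sheet_export_url()` is a hypothetical helper written for this example and is not taken from rio's source; the key `abc123` is made up.

```r
# Hypothetical helper: turn a Google Sheets "edit" URL into a CSV export URL.
sheet_export_url <- function(url) {
  # The document key is the path segment after "/spreadsheets/d/".
  key <- sub("^.*/spreadsheets/d/([^/]+).*$", "\\1", url)
  out <- paste0("https://docs.google.com/spreadsheets/d/", key,
                "/export?format=csv")
  # The sheet id, when present, follows "gid="; append it only if given.
  if (grepl("gid=[0-9]+", url)) {
    gid <- sub("^.*gid=([0-9]+).*$", "\\1", url)
    out <- paste0(out, "&gid=", gid)
  }
  out
}

with_gid <- sheet_export_url(
  "https://docs.google.com/spreadsheets/d/abc123/edit#gid=0"
)
without_gid <- sheet_export_url(
  "https://docs.google.com/spreadsheets/d/abc123/edit"
)
```

When no `gid` is supplied, the sketch simply omits the parameter and leaves the choice of sheet to the server.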
46 changes: 46 additions & 0 deletions vignettes/rio.Rmd
@@ -203,9 +203,55 @@ unlink("fwf.csv")
unlink(fwf)
```

With metadata-rich file formats (e.g., Stata, SPSS, SAS), it can also be useful to pass imported data through `characterize()` or `factorize()` when converting to an open, text-delimited format: `characterize()` converts a single variable or all variables in a data frame that have "labels" attributes into character vectors based on the mapping of values to value labels (e.g., `export(characterize(import("file.dta")), "file.csv")`). An alternative approach is exporting to CSVY format, which records metadata in a YAML-formatted header at the beginning of a CSV file.
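
To make the value-to-label mapping concrete, here is a minimal base R sketch. It assumes the "labels" attribute is a named vector whose names are the labels and whose values are the codes; `labels_to_character()` is a toy function for illustration, not rio's implementation, and the data are made up.

```r
# A numeric vector carrying a "labels" attribute, similar in shape to what an
# import from Stata or SPSS might produce (values here are invented).
x <- c(1, 2, 2, 1, 3)
attr(x, "labels") <- c(low = 1, medium = 2, high = 3)

# Toy version of the value-to-label mapping.
labels_to_character <- function(v) {
  labs <- attr(v, "labels")
  if (is.null(labs)) {
    return(v)
  }
  # names(labs) are the labels; labs themselves are the coded values.
  names(labs)[match(v, labs)]
}

as_chr <- labels_to_character(x)

# A factor version keeps the label order given by the attribute.
as_fct <- factor(as_chr, levels = names(attr(x, "labels")))
```

The character result can be written to CSV without losing the meaning of the codes, which is the use case the paragraph above describes.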

It is also possible to use **rio** on the command-line by calling `Rscript` with the `-e` (expression) argument. For example, to convert a file from Stata (.dta) to comma-separated values (.csv), simply do the following:

```
Rscript -e "rio::convert('mtcars.dta', 'mtcars.csv')"
```

## Package Philosophy

The core advantage of **rio** is that it makes assumptions that the user is probably willing to make. Eight of these are important:

1. **rio** uses the file extension of a file name to determine what kind of file it is. This is the same logic used by Windows OS, for example, in determining what application is associated with a given file type. By removing the need to manually match a file type (which a beginner may not recognize) to a particular import or export function, **rio** allows almost all common data formats to be read with the same function. And if a file extension is incorrect, users can force a particular import method by specifying the `format` argument. Other packages do this as well, but **rio** aims to be more complete and more consistent than each:

- [**reader**](https://cran.r-project.org/package=reader) handles certain text formats and R binary files
- [**io**](https://cran.r-project.org/package=io) offers a set of custom formats
- [**ImportExport**](https://cran.r-project.org/package=ImportExport) focuses on select binary formats (Excel, SPSS, and Access files) and provides a Shiny interface.
- [**SchemaOnRead**](https://cran.r-project.org/package=SchemaOnRead) iterates through a large number of possible import methods until one works successfully

2. **rio** uses `data.table::fread()` for text-delimited files to automatically determine the file format regardless of the extension. So, a CSV that is actually tab-separated will still be correctly imported. It's also crazy fast.

3. **rio**, wherever possible, does not import character strings as factors.

4. **rio** supports web-based imports natively, including from SSL (HTTPS) URLs, from shortened URLs, from URLs that lack proper extensions, and from (public) Google Documents Spreadsheets.

5. **rio** imports from single-file .zip and .tar archives automatically, without the need to explicitly decompress them. Export to compressed directories is also supported.

6. **rio** wraps a variety of faster, more stream-lined I/O packages than those provided by base R or the **foreign** package. It uses [**data.table**](https://cran.r-project.org/package=data.table) for delimited formats, [**haven**](https://cran.r-project.org/package=haven) for SAS, Stata, and SPSS files, smarter and faster fixed-width file import and export routines, and [**readxl**](https://cran.r-project.org/package=readxl) and [**openxlsx**](https://cran.r-project.org/package=openxlsx) for reading and writing Excel workbooks.

7. **rio** stores metadata from rich file formats (SPSS, Stata, etc.) in variable-level attributes in a consistent form regardless of file type or underlying import function. These attributes are identified as:

- `label`: a description of variable
- `labels`: a vector mapping numeric values to character strings those values represent
- `format`: a character string describing the variable storage type in the original file

The `gather_attrs()` function makes it easy to move variable-level attributes to the data frame level (and `spread_attrs()` reverses that gathering process). These can be useful, especially, during file conversion to more easily modify attributes that are handled differently across file formats. As an example, the following idiom can be used to trim SPSS value labels to the 32-character maximum allowed by Stata:

```R
dat <- gather_attrs(rio::import("data.sav"))
attr(dat, "labels") <- lapply(attributes(dat)$labels, function(x) {
if (!is.null(x)) {
names(x) <- substring(names(x), 1, 32)
}
x
})
export(spread_attrs(dat), "data.dta")
```

In addition, two functions (added in v0.5.5) provide easy ways to create character and factor variables from these "labels" attributes. `characterize()` converts a single variable or all variables in a data frame that have "labels" attributes into character vectors based on the mapping of values to value labels. `factorize()` does the same but returns factor variables. This can be especially helpful for converting these rich file formats into open formats (e.g., `export(characterize(import("file.dta")), "file.csv")`).

8. **rio** imports and exports files based on an internal S3 class infrastructure. This means that other packages can contain extensions to **rio** by registering S3 methods. These methods should take the form `.import.rio_X()` and `.export.rio_X()`, where `X` is the file extension of a file type. An example is provided in the [rio.db package](https://github.com/leeper/rio.db).
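
The extension mechanism in the last item relies on ordinary S3 dispatch. The sketch below re-creates the idea in base R under simplified assumptions: all names prefixed `toy_` are hypothetical, and the code is not rio's internal implementation.

```r
# Tag the file name with a class derived from its extension, then dispatch.
toy_import <- function(file, ...) {
  ext <- tolower(tools::file_ext(file))
  tagged <- structure(file, class = c(paste0("toy_", ext), "character"))
  toy_import_dispatch(tagged, ...)
}

toy_import_dispatch <- function(file, ...) {
  UseMethod("toy_import_dispatch")
}

# Fallback when no method is registered for the extension.
toy_import_dispatch.default <- function(file, ...) {
  stop("unsupported format: ", tools::file_ext(unclass(file)))
}

# Built-in method for .csv files.
toy_import_dispatch.toy_csv <- function(file, ...) {
  utils::read.csv(unclass(file), ...)
}

# An "extension" could add support for another format by defining one method.
toy_import_dispatch.toy_tsv <- function(file, ...) {
  utils::read.delim(unclass(file), ...)
}

# Usage: round-trip a small data frame through a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y")), tmp, row.names = FALSE)
dat <- toy_import(tmp)
unlink(tmp)

# An unknown extension falls through to the default method and errors.
bad <- try(toy_import("file.xyz"), silent = TRUE)
```

The appeal of this design is that adding a format requires no change to the generic: a new method named after the extension is enough.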
