Islbs port #77

Merged
merged 4 commits into from
Jan 1, 2025
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -22,7 +22,7 @@ License: GPL-3
Encoding: UTF-8
LazyData: true
LazyDataCompression: xz
-RoxygenNote: 7.3.1
+RoxygenNote: 7.3.2
Suggests:
broom,
dplyr,
5 changes: 5 additions & 0 deletions NEWS.md
@@ -1,3 +1,8 @@
# Developmental

* Added new datasets:
* `LEAP`, `arenosa`, `cdc`, `cdc.samp`, `census.2010`, `danish.ed.primary`, `danish.ed.validation`, `dds.discr`, `famuss`, `forest.birds`, `frog`, `hyperuricemia`, `hyperuricemia.samp`, `infant_mortality_2022`, `mcas`, `nhanes.samp`, `nhanes.samp.adult`, `nhanes.samp.adult.500`, `opp_insights_colleges`, `opp_insights_colleges_4year`, `prevend`, `prevend.samp`, `sugar.levels.A`, `sugar.levels.B`, `swim`, `tb.interruption`, `thermometry`, `wdi_2022` ported from ISLBS by [@npaterno](https://github.com/npaterno)

# openintro 2.5.0

* Added new datasets:
44 changes: 44 additions & 0 deletions R/data-LEAP.R
@@ -0,0 +1,44 @@
#' Patient level data on the randomized trial Learning Early About Peanut (LEAP) allergies.
#'
#' This study examined whether early exposure to peanuts increased tolerance and
#' protection from developing a peanut allergy in children who are allergic to
#' eggs or who have severe eczema. Participants between 4 and 11 months old were
#' randomized to either avoid or consume peanut-based products during the
#' first three years of life. The longer title of the study is "Induction of
#' Tolerance Through Early Introduction of Peanut in High-Risk Children"; it can
#' be found at \url{https://clinicaltrials.gov/} as study NCT00329784.
#'
#' More variables are available at the site in the source.
#'
#' @docType data
#' @format A data frame with 640 rows and 7 columns
#' \describe{
#' \item{\code{participant.ID}}{Character vector, unique identifier for each participant.}
#' \item{\code{stratum}}{Factor, outcome of a skin prick test (SPT) conducted
#' before randomization, with levels \code{SPT-Negative}, participant
#' shows no evidence of peanut allergy, and \code{SPT-Positive}, evidence
#' of a peanut allergy. Participants were
#' randomized separately within each stratum. The primary analysis of the
#' study is typically restricted to the SPT-Negative group.}
#' \item{\code{treatment.group}}{Factor, randomized assignment for each participant,
#' with levels \code{Peanut Avoidance} and \code{Peanut Consumption}.}
#' \item{\code{age.months}}{Participant age in months at randomization.}
#' \item{\code{sex}}{Factor, sex of participant with levels \code{Female} and
#' \code{Male}}
#' \item{\code{primary.ethnicity}}{Factor variable with levels \code{Asian},
#' \code{Black}, \code{Other}, \code{Mixed}, and \code{White}.}
#' \item{\code{overall.V60.outcome}}{Factor, indicating whether, after 5 years,
#' the participant had an allergic reaction to a peanut-based oral food
#' challenge (OFC), with levels \code{FAIL OFC} (allergic reaction) and
#' \code{PASS OFC} (no allergic reaction).}
#' }
#' @source These data are a subset of variables from the file ADSTART0_2015-03-03_14-20-10.txt,
#' available by downloading study files from
#' \url{https://www.immport.org/shared/study/SDY660}
#' @references Du Toit, George, et al. "Randomized trial of peanut consumption in
#' infants at risk for peanut allergy."
#' New England Journal of Medicine 372.9 (2015): 803-813.
#' \url{https://doi.org/10.1056/NEJMoa1414850}
#'
"LEAP"
39 changes: 39 additions & 0 deletions R/data-arenosa.R
@@ -0,0 +1,39 @@
#' arenosa
#'
#' Data from a published study that used RNA-Seq to investigate how cold
#' responsiveness differs in two populations of A. arenosa:
#' TBG (collected from Triberg, Germany) and
#' KA (collected from Kasparstein, Austria). Each row corresponds to a gene;
#' the first column contains the gene name; other columns correspond to expression
#' measured in a plant sample. Three plants of each population were exposed
#' to cold (vernalized, denoted by v), and three were not (non-vernalized,
#' denoted by nv). Expression was measured in gene counts
#' (i.e. the number of RNA transcripts present in a sample);
#' the data were then normalized to allow comparison between samples.
#'
#' @name arenosa
#' @docType data
#' @format A tibble with 1088 rows and 13 variables:
#' \describe{
#' \item{\code{gene.name}}{a character vector}
#' \item{\code{ka.nv.1}}{a numeric vector}
#' \item{\code{ka.nv.2}}{a numeric vector}
#' \item{\code{ka.nv.3}}{a numeric vector}
#' \item{\code{ka.v.1}}{a numeric vector}
#' \item{\code{ka.v.2}}{a numeric vector}
#' \item{\code{ka.v.3}}{a numeric vector}
#' \item{\code{tbg.nv.1}}{a numeric vector}
#' \item{\code{tbg.nv.2}}{a numeric vector}
#' \item{\code{tbg.nv.3}}{a numeric vector}
#' \item{\code{tbg.v.1}}{a numeric vector}
#' \item{\code{tbg.v.2}}{a numeric vector}
#' \item{\code{tbg.v.3}}{a numeric vector}
#' }
#' @references Pierre Baduel, Brian Arnold, Cara M. Weisman, Ben Hunter, Kirsten Bomblies,
#' Habitat-Associated Life History and
#' Stress-Tolerance Variation in Arabidopsis arenosa, Plant Physiology,
#' Volume 171, Issue 1, May 2016, Pages 437–451,
#' \url{https://doi.org/10.1104/pp.15.01875}
#' @source K Bomblies Harvard University lab.
#'
"arenosa"
27 changes: 27 additions & 0 deletions R/data-cdc.R
@@ -0,0 +1,27 @@
#' cdc
#'
#' A dataset from the 2000 Behavioral Risk Factor Surveillance System (BRFSS),
#' conducted by the US Centers for Disease Control and Prevention, used to
#' illustrate inference on demographic data.
#'
#' @name cdc
#' @docType data
#' @format A data frame with 20,000 rows and 9 variables:
#' \describe{
#' \item{\code{genhlth}}{Factor with levels \code{excellent}, \code{very good},
#' \code{good}, \code{fair}, \code{poor}}
#' \item{\code{exerany}}{Numeric vector; 1 if the respondent exercised in the
#' past month and 0 otherwise.}
#' \item{\code{hlthplan}}{Numeric; 1 if the respondent has some form
#' of health coverage and 0 otherwise.}
#' \item{\code{smoke100}}{Numeric; 1 if the respondent has smoked at least 100
#' cigarettes in their entire life and 0 otherwise.}
#' \item{\code{height}}{Numeric; respondent's height in inches.}
#' \item{\code{weight}}{Numeric; respondent's weight in pounds.}
#' \item{\code{wtdesire}}{Numeric; respondent's desired weight in pounds.}
#' \item{\code{age}}{Numeric; respondent's age in years.}
#' \item{\code{gender}}{Factor with two levels, \code{m} and \code{f}}
#' }
#' @source \url{https://www.cdc.gov/brfss/index.html}
#'
"cdc"
26 changes: 26 additions & 0 deletions R/data-cdc.samp.R
@@ -0,0 +1,26 @@
#' cdc.samp
#'
#' A sample of 60 individuals from the 2000 Behavioral Risk Factor Surveillance System
#' (BRFSS) conducted by the US Centers for Disease Control and Prevention.
#'
#' @name cdc.samp
#' @docType data
#' @format A tibble with 60 rows and 9 variables:
#' \describe{
#' \item{\code{genhlth}}{Factor with levels \code{excellent}, \code{very good},
#' \code{good}, \code{fair}, \code{poor}}
#' \item{\code{exerany}}{Numeric vector; 1 if the respondent exercised in the
#' past month and 0 otherwise.}
#' \item{\code{hlthplan}}{Numeric vector; 1 if the respondent has some form
#' of health coverage and 0 otherwise.}
#' \item{\code{smoke100}}{Numeric; 1 if the respondent has smoked at least 100
#' cigarettes in their entire life and 0 otherwise.}
#' \item{\code{height}}{Numeric; respondent's height in inches.}
#' \item{\code{weight}}{Numeric; respondent's weight in pounds.}
#' \item{\code{wtdesire}}{Numeric; respondent's desired weight in pounds.}
#' \item{\code{age}}{Numeric; respondent's age in years.}
#' \item{\code{gender}}{Factor with two levels, \code{m} and \code{f}}
#' }
#' @source \url{http://www.openintro.org/stat/data/cdc.R}
#'
"cdc.samp"
26 changes: 26 additions & 0 deletions R/data-census.2010.R
@@ -0,0 +1,26 @@
#' census.2010
#'
#' United States 2010 infant mortality and number of physicians by state,
#' including the District of Columbia.
#'
#' Data were abstracted from the 2010 Statistical Abstract of the United States.
#' Due to a lag in recording state level data, the infant mortality data are from
#' 2009 and the data on physicians are from 2007. Both measurements are subject to
#' change annually, so these data are not current and should not be used for
#' inference about infant mortality. More current data can be found at the US
#' Centers for Disease Control and Prevention (\url{https://www.cdc.gov/nchs/pressroom/sosmap/infant_mortality_rates/infant_mortality.htm}), and in the dataset \code{infant_mortality_2022}.
#'
#' @name census.2010
#' @docType data
#' @format A data frame with 51 rows and 3 columns.
#' \describe{
#' \item{\code{state}}{Character vector, US state, including the District of Columbia}
#' \item{\code{inf.mort}}{Numeric vector, number of deaths between 1 day and
#' 1 year of age per 1,000 live births}
#' \item{\code{doctors}}{Numeric vector, active physicians per 100,000 population}
#' }
#' @source \url{https://www.census.gov/library/publications/2009/compendia/statab/129ed/births-deaths-marriages-divorces.html},
#' \url{https://www.census.gov/library/publications/2009/compendia/statab/129ed/health-nutrition.html}
#'
"census.2010"

56 changes: 56 additions & 0 deletions R/data-danish.ed.primary.R
@@ -0,0 +1,56 @@
#' danish.ed.primary
#'
#' Data from a Danish study on triage in an emergency department (ED)
#'
#' Data from a prospective cohort study of triage scoring for an emergency
#' department (ED). The study examined whether the use of patient level
#' measurements would improve an existing triage score. These data are the
#' training data (called primary data in the original manuscript) used for model
#' building. Some variable names have been changed for readability, but the data
#' on 21 variables for the 6,249 participants are otherwise unchanged.
#'
#' @name danish.ed.primary
#' @docType data
#' @format A tibble with 6249 rows and 21 variables:
#' \describe{
#' \item{\code{mort30}}{numeric, 1 if patient died within 30 days of admission, 0
#' otherwise}
#' \item{\code{triage}}{factor, triage score given at arrival to ED.
#' Values \code{green}, \code{yellow}, \code{orange}, \code{red}, from lowest
#' to highest priority
#' for treatment. The value \code{blue} normally denotes severity not
#' warranting admission to the ED, but no participants coded blue
#' are in these data.}
#' \item{\code{age}}{numeric, age in years, rounded to lower integer}
#' \item{\code{sex}}{factor, values \code{female}, \code{male}}
#' \item{\code{albumin}}{numeric, serum albumin, in g/L}
#' \item{\code{creatinine}}{numeric, serum creatinine, in umol/L}
#' \item{\code{hemaglobin}}{numeric, serum hemoglobin, in mmol/L}
#' \item{\code{potassium}}{numeric, serum potassium, in mmol/L}
#' \item{\code{leuk.count}}{numeric, blood leukocyte count, in 10^9/L}
#' \item{\code{sodium}}{numeric, serum sodium, in mmol/L}
#' \item{\code{c.react.protein}}{numeric, serum C-reactive protein}
#' \item{\code{oxygen.sat}}{numeric, peripheral arterial oxygen saturation, as a percent}
#' \item{\code{resp.rate}}{numeric, respiratory rate per minute}
#' \item{\code{heart.rate}}{numeric, heart rate, beats/min}
#' \item{\code{systolic.bp}}{numeric, systolic blood pressure, in mmHg}
#' \item{\code{glasgow.coma.scale}}{numeric, extent
#' of impaired consciousness in patients with acute medical condition or
#' trauma, scored between 3 and 15, 3 being the worst and 15 the best. Score
#' is based on 3 subscales, best eye, verbal and motor responses.}
#' \item{\code{readmit.hosp}}{factor, readmitted to hospital within 30 days,
#' values \code{yes}, \code{no}}
#' \item{\code{days.in.hosp}}{numeric, number of days admitted to hospital}
#' \item{\code{icu.time}}{numeric, number of days in the intensive care unit (ICU).
#' A value of 99999 indicates that the patient was not admitted to the ICU}
#' \item{\code{icu.status}}{factor, patient admitted to ICU, values \code{yes},
#' \code{no}}
#' }
#' @references Kristensen, Michael, et al. "Routine blood tests are associated
#' with short term mortality and can improve emergency department triage: a cohort
#' study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and
#' Emergency Medicine 25 (2017): 1-8.
#' \url{https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader}
#' @source \url{https://doi.org/10.5061/dryad.m2bq5}
#'
"danish.ed.primary"
50 changes: 50 additions & 0 deletions R/data-danish.ed.validation.R
@@ -0,0 +1,50 @@
#' Data from a Danish study on triage in an emergency department (ED)
#'
#' Data from a prospective cohort study of triage scoring for an emergency
#' department (ED). The study examined whether the use of patient level
#' measurements would improve an existing triage score. These data were used as
#' a test set (called validation in the manuscript) to examine the performance
#' of the model built using the training (primary) cohort. Some variable names
#' have been changed for readability and for consistency with the primary dataset,
#' but the data on 18 variables for the 6,383 participants are otherwise unchanged.
#' Some variables in the primary dataset do not appear in these data.
#'
#' @name danish.ed.validation
#' @docType data
#' @format A tibble with 6383 rows and 18 variables:
#' \describe{
#' \item{\code{mort30}}{numeric, 1 if patient died within 30 days of admission, 0
#' otherwise}
#' \item{\code{triage}}{factor, triage score given at arrival to ED.
#' Values \code{blue}, \code{green}, \code{yellow}, \code{orange}, \code{red},
#' from lowest to highest priority
#' for treatment. The value \code{blue} normally denotes severity not
#' warranting admission to the ED. Participants coded \code{blue}
#' are in these data but not in the primary data.}
#' \item{\code{age}}{numeric, age in years, rounded to lower integer}
#' \item{\code{sex}}{factor, \code{female}, \code{male}}
#' \item{\code{albumin}}{numeric, serum albumin, in g/L}
#' \item{\code{creatinine}}{numeric, serum creatinine, in umol/L}
#' \item{\code{hemaglobin}}{numeric, serum hemoglobin, in mmol/L}
#' \item{\code{potassium}}{numeric, serum potassium, in mmol/L}
#' \item{\code{leuk.count}}{numeric, blood leukocyte count, in 10^9/L}
#' \item{\code{sodium}}{numeric, serum sodium, in mmol/L}
#' \item{\code{c.react.protein}}{numeric, serum C-reactive protein}
#' \item{\code{oxygen.sat}}{numeric, peripheral arterial oxygen saturation, as a percent}
#' \item{\code{resp.rate}}{numeric, respiratory rate per minute}
#' \item{\code{heart.rate}}{numeric, heart rate, beats/min}
#' \item{\code{systolic.bp}}{numeric, systolic blood pressure, in mmHg}
#' \item{\code{readmit.hosp}}{factor, readmitted to hospital within 30 days,
#' with values \code{yes}, \code{no}}
#' \item{\code{days.in.hosp}}{numeric, number of days admitted to hospital}
#' \item{\code{icu.status}}{factor, patient admitted to ICU, with values
#' \code{yes}, \code{no}}
#' }
#' @references Kristensen, Michael, et al. "Routine blood tests are associated
#' with short term mortality and can improve emergency department triage: a cohort
#' study of >12,000 patients." Scandinavian Journal of Trauma, Resuscitation and
#' Emergency Medicine 25 (2017): 1-8.
#' \url{https://sjtrem.biomedcentral.com/articles/10.1186/s13049-017-0458-x?report=reader}
#' @source \url{https://doi.org/10.5061/dryad.m2bq5}
#'
"danish.ed.validation"
36 changes: 36 additions & 0 deletions R/data-dds.dscr.R
@@ -0,0 +1,36 @@
#' A dataset on disbursements from the California Department of Developmental Services (DDS)
#'
#' The dataset represents a sample of 1,000 DDS consumers (out of a total
#' population of approximately 250,000), and includes information about age,
#' gender, ethnicity, and the amount of financial support per consumer provided
#' by the DDS. The dataset is based on recorded attributes of consumers, but has
#' been altered to maintain consumer privacy. From the Taylor and Mickel paper:
#' "The data set originated from DDS’s Client Master File. In order to remain in
#' compliance with California State Legislation, the data have been altered to
#' protect the rights and privacy of specific individual consumers. The provided
#' data set is based on actual attributes of consumers."
#'
#' @name dds.discr
#' @docType data
#' @format A data frame with 1000 rows and 6 variables:
#' \describe{
#' \item{\code{id}}{Numeric, unique identification code for each resident}
#' \item{\code{age.cohort}}{A factor, \code{0-5} years,
#' \code{6-12} years, \code{13-17} years, \code{18-21} years, \code{22-50} years,
#' and \code{51+} years}
#' \item{\code{age}}{Numeric, age measured in years}
#' \item{\code{gender}}{A factor, with levels \code{Female} and \code{Male}}
#' \item{\code{expenditures}}{Numeric, amount spent by the
#' State on an individual annually, measured in USD}
#' \item{\code{ethnicity}}{Factor, Ethnic group, recorded as
#' \code{American Indian}, \code{Asian}, \code{Black}, \code{Hispanic},
#' \code{Multi Race}, \code{Native Hawaiian}, \code{Other},
#' \code{White not Hispanic}}
#' }
#' @references Taylor, Stanley A., and Amy E. Mickel. Simpson's paradox: A data
#' set and discrimination case study exercise. Journal of Statistics Education
#' 22.1 (2014). www.amstat.org/publications/jse/v22n1/mickel.pdf
#' Data contained in supplement B of Taylor and Mickel.
#'
"dds.discr"

42 changes: 42 additions & 0 deletions R/data-famuss.R
@@ -0,0 +1,42 @@
#' A dataset to examine the relationship between muscle strength and the single nucleotide polymorphism (SNP) actn3.r577x.
#'
#' This dataset is a subset of the larger data set from the Functional SNPs
#' Associated with Muscle Size and Strength (FAMuSS) study by Thompson et al. It
#' contains demographic data, response variables, and SNP coding for the study participants.
#' Unlike the data in the previous version of the \code{oibiostat} data package,
#' this dataset retains the missing values. The data are also discussed in the
#' Foulkes text. Strength was measured in both dominant and non-dominant arms
#' before and after resistance training. The particular gene of interest was
#' ACTN3, the "sports gene."
#'
#' @name famuss
#' @docType data
#' @format A tibble with 1397 rows and 10 variables
#' \describe{
#' \item{\code{ndrm.ch}}{A numeric vector, the percent change in strength
#' in the non-dominant arm, from before to after training.}
#' \item{\code{drm.ch}}{A numeric vector, percent change in strength in
#' dominant arm.}
#' \item{\code{sex}}{A factor with levels \code{Female} and \code{Male}}
#' \item{\code{age}}{A numeric vector, age in years.}
#' \item{\code{race}}{A factor with levels \code{African Am}, \code{Asian},
#' \code{Caucasian}, \code{Hispanic}, \code{Other}}
#' \item{\code{height}}{A numeric vector,
#' height in inches.}
#' \item{\code{weight}}{A numeric vector, weight in pounds.}
#' \item{\code{actn3.r577x}}{A factor with levels \code{CC}, \code{CT}, \code{TT},
#' that shows the genotype at rs540874 (location r577x) within the ACTN3
#' gene.}
#' \item{\code{bmi}}{A numeric vector, body mass index}
#' }
#' @source Personal communication from A. Foulkes
#' @references Thompson P, Moyna N, Seip R, et al. Medicine and Science in Sports and
#' Exercise 36(7): 1132-1139, 2004. Clarkson P, et al. Journal of Applied
#' Physiology 99: 154-163, 2005. Pescatello L, et al. Highlights from the
#' functional single nucleotide polymorphisms associated with human muscle
#' size and strength or FAMuSS study. BioMed Research International, 2013. Foulkes, Andrea S.
#' Applied Statistical Genetics with R: For Population-based Association Studies.
#' Springer, 2009.
#'
"famuss"
