diff --git a/_freeze/modules/Module00-Welcome/execute-results/html.json b/_freeze/modules/Module00-Welcome/execute-results/html.json index c575693..40023c1 100644 --- a/_freeze/modules/Module00-Welcome/execute-results/html.json +++ b/_freeze/modules/Module00-Welcome/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "ed52f4aca1e410664cf99c84a47733ab", + "hash": "935876f3d43f893eeb810bf6e7aa04ef", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Welcome to SISMID Workshop: Introduction to R\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n## Welcome to SISMID Workshop: Introduction to R!\n\n**Amy Winter (she/her)** \n\nAssistant Professor, Department of Epidemiology and Biostatistics\n\nEmail: awinter@uga.edu\n\n
\n\n**Zane Billings (he/him)** \n\nPhD Candidate, Department of Epidemiology and Biostatistics\n\nEmail: Wesley.Billings@uga.edu\n\n\n## Introductions\n\n* Name?\n* Current position / institution?\n* Past experience with other statistical programs, including R?\n* Why do you want to learn R?\n* Favorite useful app\n* Favorite guilty pleasure app\n\n\n## What is R?\n\n- R is a language and environment for statistical computing and graphics developed in 1991\n\n- R is the open source implementation of the [S language](https://en.wikipedia.org/wiki/S_(programming_language)), which was developed by [Bell laboratories](https://ca.slack-edge.com/T023TPZA8LF-U024EN26Q0L-113294823b2c-512) in the 70s.\n\n- The aim of the S language, as expressed by John Chambers, is \"to turn ideas into software, quickly and faithfully\"\n\n## What is R?\n\n- **R**oss Ihaka and **R**obert Gentleman at the University of Auckland, New Zealand developed R\n\n\n- R is both [open source](https://en.wikipedia.org/wiki/Open_source) and [open development](https://en.wikipedia.org/wiki/Open-source_software_development)\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://www.r-project.org/logo/Rlogo.png){fig-align='center' fig-alt='R logo' width=20%}\n:::\n:::\n\n\n\n## What is R?\n\n* R possesses an extensive catalog of statistical and graphical methods \n * includes machine learning algorithm, linear regression, time series, statistical inference to name a few. \n\n* Data analysis with R is done in a series of steps; programming, transforming, discovering, modeling and communicate the results\n\n\n## What is R?\n\n- Program: R is a clear and accessible programming tool\n- Transform: R is made up of a collection of packages/libraries designed specifically for statistical computing\n- Discover: Investigate the data, refine your hypothesis and analyze them\n- Model: R provides a wide array of tools to capture the right model for your data\n- Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world\n\n\n## Why R?\n\n* Free (open source)\n\n* High level language designed for statistical computing\n\n* Powerful and flexible - especially for data wrangling and visualization\n\n* Extensive add-on software (packages)\n\n* Strong community \n\n\n## Why not R?\n\n \n* Little centralized support, relies on online community and package developers\n\n* Annoying to update\n\n* Slower, and more memory intensive, than the more traditional programming languages (C, Perl, Python)\n\n\n## Is R Difficult?\n\n* Short answer – It has a steep learning curve, like all programming languages\n* Years ago, R was a difficult language to master. \n* Hadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore.\n\n\n## Overall Workshop Objectives\n\nBy the end of this workshop, you should be able to \n\n1. start a new project, read in data, and conduct basic data manipulation, analysis, and visualization\n2. know how to use and find packages/functions that we did not specifically learn in class\n3. troubleshoot errors\n\n\n## This workshop differs from \"Introduction to Tidyverse\"\n\nWe will focus this class on using **Base R** functions and packages, i.e., pre-installed into R and the basis for most other functions and packages! 
If you know Base R then are will be more equipped to use all the other useful/pretty packages that exit.\n\nThe Tidyverse is one set of useful/pretty sets of packages, designed to can make your code more **intuitive** as compared to the original older Base R. **Tidyverse advantages**: \n\n-\t**consistent structure** - making it easier to learn how to use different packages\n-\tparticularly good for **wrangling** (manipulating, cleaning, joining) data \n-\tmore flexible for **visualizing** data \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://tidyverse.tidyverse.org/logo.png){fig-align='center' fig-alt='Tidyverse hex sticker' width=10%}\n:::\n:::\n\n\n\n\n## Workshop Overview\n\n14 lecture blocks that will each:\n\n- Start with learning objectives\n- End with summary slides\n- Include mini-exercise(s) or a full exercise\n\nThemes that will show up throughout the workshop:\n\n- Reproducibility\n- Good coding techniques\n- Thinking algorithmically\n- [Basic terms / R jargon](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf)\n\n\n## Reproducibility\n\n* **Reproducible research**: the idea that other people should be able to\nverify the claims you make -- usually by being able to see your data and run\nyour code.\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](../images/repspectrum.JPG){fig-align='center'}\n:::\n:::\n\n\n\n* **2023 was the US government's year of open science** -- specific\naspects of reproducibility will be mandated for federally funded research!\n* Sharing and documenting your code is a massive step towards making your\nwork reproducible, and the R ecosystem can play a big role in that!\n\n\n## Useful (+ Free) Resources\n\n**Want more?** \n\n- R for Data Science: http://r4ds.had.co.nz/ \n(great general information)\n\n- Fundamentals of Data Visualization: https://clauswilke.com/dataviz/ \n\n- R for Epidemiology: https://www.r4epi.com/\n\n- The Epidemiologist R Handbook: https://epirhandbook.com/en/\n\n- R basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html\n(great general information)\n \n- Open Case Studies: https://www.opencasestudies.org/ \n(resource for specific public health cases with statistical implementation and interpretation)\n\n## Useful (+Free) Resources\n\n**Need help?** \n\n- Various \"Cheat Sheets\": https://github.com/rstudio/cheatsheets/\n\n- R reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf \n\n- R jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n## Installing R\n\n\nHopefully everyone has pre-installed R and RStudio. We will take a moment to go around and make sure everyone is ready to go. Please open up your RStudio and leave it open as we check everyone's laptops.\n\n- Install the latest version from: [http://cran.r-project.org/](http://cran.r-project.org/ )\n- [Install RStudio](https://www.rstudio.com/products/rstudio/download/)\n\n\n", - "supporting": [], + "markdown": "---\ntitle: \"Welcome to SISMID Workshop: Introduction to R\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n\n## Welcome to SISMID Workshop: Introduction to R!\n\n**Amy Winter (she/her)** \n\nAssistant Professor, Department of Epidemiology and Biostatistics\n\nEmail: awinter@uga.edu\n\n
\n\n**Zane Billings (he/him)** \n\nPhD Candidate, Department of Epidemiology and Biostatistics\n\nEmail: Wesley.Billings@uga.edu\n\n\n## Introductions\n\n* Name?\n* Current position / institution?\n* Past experience with other statistical programs, including R?\n* Why do you want to learn R?\n* Favorite useful app\n* Favorite guilty pleasure app\n\n## Course website\n\n* All of the materials for this course can be found online here: [here](https://uga-idd.github.io/SISMID-2024/).\n* This contains the schedule, course resources, and online versions of all of\nour slide decks.\n* The **Course Resources** page contains download links for all of the data,\nexercises, and slides for this class.\n* Please feel free to download these resources and share them -- all of the\ncourse content is under the [Creative Commons BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/).\n\n\n## What is R?\n\n- R is a language and environment for statistical computing and graphics developed in 1991\n\n- R is the open source implementation of the [S language](https://en.wikipedia.org/wiki/S_(programming_language)), which was developed by [Bell laboratories](https://ca.slack-edge.com/T023TPZA8LF-U024EN26Q0L-113294823b2c-512) in the 70s.\n\n- The aim of the S language, as expressed by John Chambers, is \"to turn ideas into software, quickly and faithfully\"\n\n## What is R?\n\n- **R**oss Ihaka and **R**obert Gentleman at the University of Auckland, New Zealand developed R\n\n\n- R is both [open source](https://en.wikipedia.org/wiki/Open_source) and [open development](https://en.wikipedia.org/wiki/Open-source_software_development)\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://www.r-project.org/logo/Rlogo.png){fig-align='center' fig-alt='R logo' width=20%}\n:::\n:::\n\n\n\n\n## What is R?\n\n* R possesses an extensive catalog of statistical and graphical methods \n * includes machine learning algorithm, linear regression, time series, statistical inference to name a few. \n\n* Data analysis with R is done in a series of steps; programming, transforming, discovering, modeling and communicate the results\n\n\n## What is R?\n\n- Program: R is a clear and accessible programming tool\n- Transform: R is made up of a collection of packages/libraries designed specifically for statistical computing\n- Discover: Investigate the data, refine your hypothesis and analyze them\n- Model: R provides a wide array of tools to capture the right model for your data\n- Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world\n\n\n## Why R?\n\n* Free (open source)\n\n* High level language designed for statistical computing\n\n* Powerful and flexible - especially for data wrangling and visualization\n\n* Extensive add-on software (packages)\n\n* Strong community \n\n\n## Why not R?\n\n \n* Little centralized support, relies on online community and package developers\n\n* Annoying to update\n\n* Slower, and more memory intensive, than the more traditional programming languages (C, Perl, Python)\n\n\n## Is R Difficult?\n\n* Short answer – It has a steep learning curve, like all programming languages\n* Years ago, R was a difficult language to master. \n* Hadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore.\n\n\n## Overall Workshop Objectives\n\nBy the end of this workshop, you should be able to \n\n1. 
start a new project, read in data, and conduct basic data manipulation, analysis, and visualization\n2. know how to use and find packages/functions that we did not specifically learn in class\n3. troubleshoot errors\n\n\n## This workshop differs from \"Introduction to Tidyverse\"\n\nWe will focus this class on using **Base R** functions and packages, i.e., pre-installed into R and the basis for most other functions and packages! If you know Base R then are will be more equipped to use all the other useful/pretty packages that exit.\n\nThe Tidyverse is one set of useful/pretty sets of packages, designed to can make your code more **intuitive** as compared to the original older Base R. **Tidyverse advantages**: \n\n-\t**consistent structure** - making it easier to learn how to use different packages\n-\tparticularly good for **wrangling** (manipulating, cleaning, joining) data \n-\tmore flexible for **visualizing** data \n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://tidyverse.tidyverse.org/logo.png){fig-align='center' fig-alt='Tidyverse hex sticker' width=10%}\n:::\n:::\n\n\n\n\n\n## Workshop Overview\n\n14 lecture blocks that will each:\n\n- Start with learning objectives\n- End with summary slides\n- Include mini-exercise(s) or a full exercise\n\nThemes that will show up throughout the workshop:\n\n- Reproducibility\n- Good coding techniques\n- Thinking algorithmically\n- [Basic terms / R jargon](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf)\n\n\n## Reproducibility\n\n* **Reproducible research**: the idea that other people should be able to\nverify the claims you make -- usually by being able to see your data and run\nyour code.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](../images/repspectrum.JPG){fig-align='center'}\n:::\n:::\n\n\n\n\n* **2023 was the US government's year of open science** -- specific\naspects of reproducibility will be mandated for federally funded research!\n* Sharing and documenting your code is a massive step towards making your\nwork reproducible, and the R ecosystem can play a big role in that!\n\n\n## Useful (+ Free) Resources\n\n**Want more?** \n\n- R for Data Science: http://r4ds.had.co.nz/ \n(great general information)\n\n- Fundamentals of Data Visualization: https://clauswilke.com/dataviz/ \n\n- R for Epidemiology: https://www.r4epi.com/\n\n- The Epidemiologist R Handbook: https://epirhandbook.com/en/\n\n- R basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html\n(great general information)\n \n- Open Case Studies: https://www.opencasestudies.org/ \n(resource for specific public health cases with statistical implementation and interpretation)\n\n## Useful (+Free) Resources\n\n**Need help?** \n\n- Various \"Cheat Sheets\": https://github.com/rstudio/cheatsheets/\n\n- R reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf \n\n- R jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n## Installing R\n\n\nHopefully everyone has pre-installed R and RStudio. We will take a moment to go around and make sure everyone is ready to go. 
Please open up your RStudio and leave it open as we check everyone's laptops.\n\n- Install the latest version from: [http://cran.r-project.org/](http://cran.r-project.org/ )\n- [Install RStudio](https://www.rstudio.com/products/rstudio/download/)\n\n\n", + "supporting": [ + "Module00-Welcome_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], @@ -16,4 +18,4 @@ "preserve": {}, "postProcess": true } -} +} \ No newline at end of file diff --git a/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json b/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json index 1f76fc9..b98cb35 100644 --- a/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json +++ b/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "71acdd2b69ea5af23987d320525e59a1", + "hash": "659422f556ed54450a8839eee24c84dd", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 7: Variable Creation, Classes, and Summaries\"\nformat:\n revealjs:\n smaller: true\n scrollable: true\n toc: false\n---\n\n\n\n## Learning Objectives\n\nAfter module 7, you should be able to...\n\n- Create new variables\n- Characterize variable classes\n- Manipulate the classes of variables\n- Conduct 1 variable data summaries\n\n## Import data for this module\nLet's first read in the data from the previous module and look at it briefly with a new function `head()`. `head()` allows us to look at the first `n` observations.\n\n\n\n\n::: {.cell layout-align=\"left\"}\n::: {.cell-output-display}\n![](images/head_args.png){fig-align='left' width=100%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n\n## Adding new columns with `$` operator\n\nYou can add a new column, called `log_IgG` to `df`, using the `$` operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum log_IgG\n1 5772 0.3176895 2 Female Non slum -1.146681\n2 8095 3.4368231 4 Female Non slum 1.234548\n3 9784 0.3000000 4 Male Non slum -1.203973\n```\n\n\n:::\n:::\n\n\n\nNote, my use of the underscore in the variable name rather than a space. This is good coding practice and make calling variables much less prone to error.\n\n## Adding new columns with `transform()`\n\nWe can also add a new column using the `transform()` function:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?transform\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' 
arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)\n```\n\n\n:::\n:::\n\n\n\n## Adding new columns with `transform()`\n\nFor example, adding a binary column for seropositivity called `seropos`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |\n:::\n:::\n\n\n\n\n## Creating conditional variables\n\nOne frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which \"returns a value depending on whether the element of test is `TRUE` or `FALSE`.\"\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?ifelse\n```\n:::\n\nConditional Element Selection\n\nDescription:\n\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\n\nUsage:\n\n ifelse(test, yes, no)\n \nArguments:\n\n test: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\n\nDetails:\n\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\n\nValue:\n\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\n\nWarning:\n\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'if'.\n\nExamples:\n\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1\n\n\n\n\n## `ifelse` example\n\nReminder of the first three arguments in the `ifelse()` function are `ifelse(test, yes, no)`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |young |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |young |\n:::\n:::\n\n\n\n## `ifelse` example\nLet's delve into what is actually happening, with a focus on the NA values in `age` variable.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\n## Nesting two `ifelse` statements example\n\n`ifelse(test1, yes_to_test1, ifelse(test2, no_to_test2_yes_to_test2, no_to_test1_no_to_test2))`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n```\n:::\n\n\n\nLet's use the `table()` function to check if it worked.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n```\n\n::: {.cell-output-display}\n\n\n|age/ | middle| old| young| NA|\n|:----|------:|---:|-----:|--:|\n|1 | 0| 0| 44| 0|\n|2 | 0| 0| 72| 0|\n|3 | 0| 0| 79| 0|\n|4 | 0| 0| 80| 0|\n|5 | 0| 0| 41| 0|\n|6 | 38| 0| 0| 0|\n|7 | 38| 0| 0| 0|\n|8 | 39| 0| 0| 0|\n|9 | 20| 0| 0| 0|\n|10 | 44| 0| 0| 0|\n|11 | 0| 41| 0| 0|\n|12 | 0| 23| 0| 0|\n|13 | 0| 35| 0| 0|\n|14 | 0| 37| 0| 0|\n|15 | 0| 11| 0| 0|\n|NA | 0| 0| 0| 9|\n:::\n:::\n\n\n\nNote, it puts the variable levels in alphabetical order, we will show how to change this later.\n\n# Data Classes\n\n## Overview - Data Classes\n\n1. One dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\n\n2. Two dimensional types (e.g., matrix, data frame, tibble)\n\n3. Special data classes (e.g., lists, dates). 
\n\n## \t`class()` function\n\nThe `class()` function allows you to evaluate the class of an object.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## One dimensional data types\n\n* Character: strings or individual characters, quoted\n* Numeric: any real number(s)\n - Double: contains fractional values (i.e., double precision) - default numeric\n - Integer: any integer(s)/whole numbers\n* Logical: variables composed of TRUE or FALSE\n* Factor: categorical/qualitative variables\n\n## Character and numeric\n\nThis can also be a bit tricky. \n\nIf only one character in the whole vector, the class is assumed to be character\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(1, 2, \"tree\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\nHere because integers are in quotations, it is read as a character class by R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(\"1\", \"4\", \"7\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\nNote, instead of creating a new vector object (e.g., `x <- c(\"1\", \"4\", \"7\")`) and then feeding the vector object `x` into the first argument of the `class()` function (e.g., `class(x)`), we combined the two steps and directly fed a vector object into the class function.\n\n## Numeric Subclasses\n\nThere are two major numeric subclasses\n\n1. `Double` is a special subset of `numeric` that contains fractional values. `Double` stands for [double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)\n2. `Integer` is a special subset of `numeric` that contains only whole numbers. \n\n`typeof()` identifies the vector type (double, integer, logical, or character), whereas `class()` identifies the root class. The difference between the two will be more clear when we look at two dimensional classes below.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"double\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n:::\n\n\n\n\n## Logical\n\nReminder `logical` is a type that only has three possible elements: `TRUE` and `FALSE` and `NA`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"logical\"\n```\n\n\n:::\n:::\n\n\n\nNote that when creating `logical` object the `TRUE` and `FALSE` are NOT in quotes. Putting R special classes (e.g., `NA` or `FALSE`) in quotations turns them into character value. \n\n\n## Other useful functions for evaluating/setting classes\n\nThere are two useful functions associated with practically all R classes: \n\n- `is.CLASS_NAME(x)` to **logically check** whether or not `x` is of certain class. 
For example, `is.integer` or `is.character` or `is.numeric`\n- `as.CLASS_NAME(x)` to **coerce between classes** `x` from current `x` class into a another class. For example, `as.integer` or `as.character` or `as.numeric`. This is particularly useful is maybe integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).\n\n## Examples `is.CLASS_NAME(x)`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis.numeric(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\n## Examples `as.CLASS_NAME(x)`\n\nIn some cases, coercing is seamless\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.character(c(1, 4, 7))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"1\" \"4\" \"7\"\n```\n\n\n:::\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 7\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\nIn some cases the coercing is not possible; if executed, will return `NA`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7a\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: NAs introduced by coercion\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 NA\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE NA\n```\n\n\n:::\n:::\n\n\n\n\n## Factors\n\nA `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables. Use the `factor()` function to create factors from character values. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$age_group)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"factor\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"middle\" \"old\" \"young\" \n```\n\n\n:::\n:::\n\n\n\nNote 1, that levels are, by default, set to **alphanumerical** order! And, the first is always the \"reference\" group. However, we often prefer a different reference group.\n\nNote 2, we can also make ordered factors using `factor(... ordered=TRUE)`, but we won't talk more about that.\n\n## Reference Groups \n\n**Why do we care about reference groups?** \n\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being <5 years old is associated with higher IgG antibody concentrations \n\nBy default `middle` is the reference group therefore we will only generate beta coefficients comparing `middle` to `young` AND `middle` to `old`. 
But, we want `young` to be the reference group so we will generate beta coefficients comparing `young` to `middle` AND `young` to `old`.\n\n## Changing factor reference \n\nChanging the reference group of a factor variable.\n\n- If the object is already a factor then use `relevel()` function and the `ref` argument to specify the reference.\n- If the object is a character then use `factor()` function and `levels` argument to specify the order of the values, the first being the reference.\n\n\nLet's look at the `relevel()` help file\n\n\nReorder Levels of Factor\n\nDescription:\n\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\n\nUsage:\n\n relevel(x, ref, ...)\n \nArguments:\n\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\n\nDetails:\n\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\n\nValue:\n\n A factor of the same length as 'x'.\n\nSee Also:\n\n 'factor', 'contr.treatment', 'levels', 'reorder'.\n\nExamples:\n\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\n\n\n
\n\nLet's look at the `factor()` help file\n\n\nFactors\n\nDescription:\n\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\n\nUsage:\n\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n\n x: a vector of data, usually taking a small number of distinct\n values.\n\n levels: an optional vector of the unique values (as character\n strings) that 'x' might have taken. The default is the\n unique set of values taken by 'as.character(x)', sorted into\n increasing order _of 'x'_. Note that this set can be\n specified as smaller than 'sort(unique(x))'.\n\n labels: _either_ an optional character vector of labels for the\n levels (in the same order as 'levels' after removing those in\n 'exclude'), _or_ a character string of length 1. Duplicated\n values in 'labels' can be used to map different values of 'x'\n to the same factor level.\n\n exclude: a vector of values to be excluded when forming the set of\n levels. This may be factor with the same level set as 'x' or\n should be a 'character'.\n\n ordered: logical flag to determine if the levels should be regarded as\n ordered (in the order given).\n\n nmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\n\n ifany: only add an 'NA' level if it is used, i.e. if\n 'any(is.na(x))'.\n\n object: an R object.\n\nDetails:\n\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and the model-fitting functions treat the two classes\n quite differently.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. 
For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). Under those circumstances\n missing values are currently printed as '', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\n\nValue:\n\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject()'.\n\nWarning:\n\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. 
It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\n\nComparison operators and group generic methods:\n\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. (The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\n\nNote:\n\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\n\nReferences:\n\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\n\nExamples:\n\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. 
<- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 \n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = \n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))\n\n\n\n\n## Changing factor reference examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\nOR\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\nArranging, tabulating, and plotting the data will reflect the new order\n\n\n## Two-dimensional data classes\n\nTwo-dimensional classes are those we would often use to store data read from a file \n\n* a matrix (`matrix` class)\n* a data frame (`data.frame` or `tibble` classes)\n\n\n## Matrices\n\nMatrices, like data frames are also composed of rows and columns. Matrices, unlike `data.frame`, the entire matrix is composed of one R class. **For example: all entries are `numeric`, or all entries are `character`**\n\n`as.matrix()` creates a matrix from a data frame (where all values are the same class). As a reminder, here is the matrix signature function to help remind us how to build a matrix\n\n```\nmatrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)\n```\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol = 2) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 2|\n| 3| 4|\n| 5| 6|\n:::\n:::\n\n\n\nNote, the first matrix filled in numbers 1-6 by columns first and then rows because default `byrow` argument is FALSE. 
In the second matrix, we changed the argument `byrow` to `TRUE`, and now numbers 1-6 are filled by rows first and then columns.\n\n## Data frame \n\nYou can transform an existing matrix into data frames using `as.data.frame()` \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.data.frame(matrix(1:6, ncol = 2) ) \n```\n\n::: {.cell-output-display}\n\n\n| V1| V2|\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n:::\n\n\n\n\n## Numeric variable data summary\n\nData summarization on numeric vectors/variables:\n\n-\t`mean()`: takes the mean of x\n-\t`sd()`: takes the standard deviation of x\n-\t`median()`: takes the median of x\n-\t`quantile()`: displays sample quantiles of x. Default is min, IQR, max\n-\t`range()`: displays the range. Same as `c(min(), max())`\n-\t`sum()`: sum of x\n-\t`max()`: maximum value in x\n-\t`min()`: minimum value in x\n- `colSums()`: get the columns sums of a data frame\n- `rowSums()`: get the row sums of a data frame\n- `colMeans()`: get the columns means of a data frame\n- `rowMeans`()`: get the row means of a data frame\n\nNote, the top 8 functions have an `na.rm` **argument for missing data**\n\n## Numeric variable data summary\n\nLet's look at a help file for `mean()` to make note of the `na.rm` argument\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?range\n```\n:::\n\nRange of Values\n\nDescription:\n\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\n\nUsage:\n\n range(..., na.rm = FALSE)\n \n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n \nArguments:\n\n ...: any 'numeric' or character objects.\n\n na.rm: logical, indicating if 'NA''s should be omitted.\n\n finite: logical, indicating if all non-finite elements should be\n omitted.\n\nDetails:\n\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\n\nS4 methods:\n\n This is part of the S4 'Summary' group generic. Methods for it\n must use the signature 'x, ..., na.rm'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\n\nExamples:\n\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)\n\n\n\n## Numeric variable data summary examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output-display}\n\n\n| |observation_id |IgG_concentration | age | gender | slum | log_IgG | seropos | age_group |age_group_factor |\n|:--|:--------------|:-----------------|:--------------|:----------------|:----------------|:---------------|:-------------|:----------------|:----------------|\n| |Min. :5006 |Min. : 0.0054 |Min. : 1.000 |Length:651 |Length:651 |Min. 
:-5.2231 |Mode :logical |Length:651 |young :316 |\n| |1st Qu.:6306 |1st Qu.: 0.3000 |1st Qu.: 3.000 |Class :character |Class :character |1st Qu.:-1.2040 |FALSE:360 |Class :character |middle:179 |\n| |Median :7495 |Median : 1.6658 |Median : 6.000 |Mode :character |Mode :character |Median : 0.5103 |TRUE :281 |Mode :character |old :147 |\n| |Mean :7492 |Mean : 87.3683 |Mean : 6.606 |NA |NA |Mean : 1.6074 |NA's :10 |NA |NA's : 9 |\n| |3rd Qu.:8749 |3rd Qu.:141.4405 |3rd Qu.:10.000 |NA |NA |3rd Qu.: 4.9519 |NA |NA |NA |\n| |Max. :9982 |Max. :916.4179 |Max. :15.000 |NA |NA |Max. : 6.8205 |NA |NA |NA |\n| |NA |NA's :10 |NA's :9 |NA |NA |NA's :10 |NA |NA |NA |\n:::\n\n```{.r .cell-code}\nrange(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NA NA\n```\n\n\n:::\n\n```{.r .cell-code}\nrange(df$age, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 15\n```\n\n\n:::\n\n```{.r .cell-code}\nmedian(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.665753\n```\n\n\n:::\n:::\n\n\n\n\n## Character variable data summaries\n\nData summarization on character or factor vectors/variables using `table()`\n\n\t\t\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?table\n```\n:::\n\nCross Tabulation and Table Creation\n\nDescription:\n\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\n\nUsage:\n\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\n\n exclude: levels to remove for all factors in '...'. If it does not\n contain 'NA' and 'useNA' is not specified, it implies 'useNA\n = \"ifany\"'. See 'Details' for its interpretation for\n non-factor arguments.\n\n useNA: whether to include 'NA' values in the table. See 'Details'.\n Can be abbreviated.\n\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\n\ndeparse.level: controls how the default 'dnn' is constructed. See\n 'Details'.\n\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\n\nrow.names: a character vector giving the row names for the data frame.\n\nresponseName: The name to be used for the column of table entries,\n usually counts.\n\nstringsAsFactors: logical: should the classifying factors be returned\n as factors (the default) or character vectors?\n\nsep, base: passed to 'provideDimnames'.\n\nDetails:\n\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. 
For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\n\nValue:\n\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\n\nExamples:\n\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? --- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. 
The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))\n\n\n\n\n## Character variable data summary examples\n\nNumber of observations in each category\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|------:|----:|\n| 325| 326|\n:::\n\n```{.r .cell-code}\ntable(df$gender, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male| NA|\n|------:|----:|--:|\n| 325| 326| 0|\n:::\n\n```{.r .cell-code}\ntable(df$age_group, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young| NA|\n|------:|---:|-----:|--:|\n| 179| 147| 316| 9|\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)/nrow(df) #if no NA values\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|--------:|--------:|\n| 0.499232| 0.500768|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n:::\n\n\n\n\n## Summary\n\n- You can create new columns/variable to a data frame by using `$` or the `transform()` function\n- One useful function for creating new variables based on existing variables is the `ifelse()` function, which returns a value depending on whether the element of test is `TRUE` or `FALSE`\n- The `class()` function allows you to evaluate the class of an object.\n- There are two types of numeric class objects: integer and double\n- Logical class objects only have `TRUE` or `False` (without quotes)\n- `is.CLASS_NAME(x)` can be used to test the class of an object x\n- `as.CLASS_NAME(x)` can be used to change the class of an object x\n- Factors are a special character class that has levels \n- There are many fairly intuitive data summary functions you can perform on a vector (i.e., `mean()`, `sd()`, `range()`) or on rows or columns of a data frame (i.e., `colSums()`, `colMeans()`, `rowSums()`)\n- The `table()` function builds frequency tables of the counts at each combination of categorical levels\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", - "supporting": [], + "markdown": "---\ntitle: \"Module 7: Variable Creation, Classes, and Summaries\"\nformat:\n revealjs:\n smaller: true\n scrollable: true\n toc: false\n---\n\n\n\n## Learning Objectives\n\nAfter module 7, you should be able to...\n\n- Create new variables\n- Characterize variable classes\n- Manipulate the classes of variables\n- Conduct 1 variable data summaries\n\n## Import data for this module\nLet's first read in the data from the previous module and look at it briefly with a new function `head()`. 
`head()` allows us to look at the first `n` observations.\n\n\n\n\n::: {.cell layout-align=\"left\"}\n::: {.cell-output-display}\n![](images/head_args.png){fig-align='left' width=100%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n\n## Adding new columns with `$` operator\n\nYou can add a new column, called `log_IgG` to `df`, using the `$` operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum log_IgG\n1 5772 0.3176895 2 Female Non slum -1.146681\n2 8095 3.4368231 4 Female Non slum 1.234548\n3 9784 0.3000000 4 Male Non slum -1.203973\n```\n\n\n:::\n:::\n\n\n\nNote, my use of the underscore in the variable name rather than a space. This is good coding practice and make calling variables much less prone to error.\n\n## Adding new columns with `transform()`\n\nWe can also add a new column using the `transform()` function:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?transform\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. 
The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)\n```\n\n\n:::\n:::\n\n\n\n## Adding new columns with `transform()`\n\nFor example, adding a binary column for seropositivity called `seropos`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |\n:::\n:::\n\n\n\n\n## Creating conditional variables\n\nOne frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which \"returns a value depending on whether the element of test is `TRUE` or `FALSE`.\"\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?ifelse\n```\n:::\n\nConditional Element Selection\n\nDescription:\n\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\n\nUsage:\n\n ifelse(test, yes, no)\n \nArguments:\n\n test: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\n\nDetails:\n\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\n\nValue:\n\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\n\nWarning:\n\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'if'.\n\nExamples:\n\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1\n\n\n\n\n## `ifelse` example\n\nReminder of the first three arguments in the `ifelse()` function are `ifelse(test, yes, no)`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |young |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |young |\n:::\n:::\n\n\n\n## `ifelse` example\nLet's delve into what is actually happening, with a focus on the NA values in `age` variable.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\n## Nesting two `ifelse` statements example\n\n`ifelse(test1, yes_to_test1, ifelse(test2, no_to_test2_yes_to_test2, no_to_test1_no_to_test2))`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n```\n:::\n\n\n\nLet's use the `table()` function to check if it worked.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n```\n\n::: {.cell-output-display}\n\n\n|age/ | middle| old| young| NA|\n|:----|------:|---:|-----:|--:|\n|1 | 0| 0| 44| 0|\n|2 | 0| 0| 72| 0|\n|3 | 0| 0| 79| 0|\n|4 | 0| 0| 80| 0|\n|5 | 0| 0| 41| 0|\n|6 | 38| 0| 0| 0|\n|7 | 38| 0| 0| 0|\n|8 | 39| 0| 0| 0|\n|9 | 20| 0| 0| 0|\n|10 | 44| 0| 0| 0|\n|11 | 0| 41| 0| 0|\n|12 | 0| 23| 0| 0|\n|13 | 0| 35| 0| 0|\n|14 | 0| 37| 0| 0|\n|15 | 0| 11| 0| 0|\n|NA | 0| 0| 0| 9|\n:::\n:::\n\n\n\nNote, it puts the variable levels in alphabetical order, we will show how to change this later.\n\n# Data Classes\n\n## Overview - Data Classes\n\n1. One dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\n\n2. Two dimensional types (e.g., matrix, data frame, tibble)\n\n3. Special data classes (e.g., lists, dates). 
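\n\nBefore we go through each of these, here is a small illustrative sketch (the objects below are hypothetical, made up for demonstration rather than taken from our dataset) showing one example from each category and what `class()` reports for it:\n\n```\n# One-dimensional: an atomic vector (here, character)\nantibody_levels <- c(\"low\", \"medium\", \"high\")  # hypothetical example values\nclass(antibody_levels)        # \"character\"\n\n# Two-dimensional: a small data frame built from vectors\ntiny_df <- data.frame(id = 1:3, level = antibody_levels)\nclass(tiny_df)                # \"data.frame\"\n\n# Special data classes: a list and a date\nmisc <- list(ids = 1:3, note = \"lists can hold anything\")\nclass(misc)                   # \"list\"\nclass(as.Date(\"2024-01-01\"))  # \"Date\"\n```\n\nWe will mostly work with the first two categories in this module.\n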
\n\n## \t`class()` function\n\nThe `class()` function allows you to evaluate the class of an object.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## One dimensional data types\n\n* Character: strings or individual characters, quoted\n* Numeric: any real number(s)\n - Double: contains fractional values (i.e., double precision) - default numeric\n - Integer: any integer(s)/whole numbers\n* Logical: variables composed of TRUE or FALSE\n* Factor: categorical/qualitative variables\n\n## Character and numeric\n\nThis can also be a bit tricky. \n\nIf only one character in the whole vector, the class is assumed to be character\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(1, 2, \"tree\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\nHere because integers are in quotations, it is read as a character class by R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(\"1\", \"4\", \"7\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\nNote, instead of creating a new vector object (e.g., `x <- c(\"1\", \"4\", \"7\")`) and then feeding the vector object `x` into the first argument of the `class()` function (e.g., `class(x)`), we combined the two steps and directly fed a vector object into the class function.\n\n## Numeric Subclasses\n\nThere are two major numeric subclasses\n\n1. `Double` is a special subset of `numeric` that contains fractional values. `Double` stands for [double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)\n2. `Integer` is a special subset of `numeric` that contains only whole numbers. \n\n`typeof()` identifies the vector type (double, integer, logical, or character), whereas `class()` identifies the root class. The difference between the two will be more clear when we look at two dimensional classes below.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"double\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n:::\n\n\n\n\n## Logical\n\nReminder `logical` is a type that only has three possible elements: `TRUE` and `FALSE` and `NA`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"logical\"\n```\n\n\n:::\n:::\n\n\n\nNote that when creating `logical` object the `TRUE` and `FALSE` are NOT in quotes. Putting R special classes (e.g., `NA` or `FALSE`) in quotations turns them into character value. \n\n\n## Other useful functions for evaluating/setting classes\n\nThere are two useful functions associated with practically all R classes: \n\n- `is.CLASS_NAME(x)` to **logically check** whether or not `x` is of certain class. 
For example, `is.integer` or `is.character` or `is.numeric`\n- `as.CLASS_NAME(x)` to **coerce between classes**, converting `x` from its current class into another class. For example, `as.integer` or `as.character` or `as.numeric`. This is particularly useful if, for example, an integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).\n\n## Examples `is.CLASS_NAME(x)`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis.numeric(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\n## Examples `as.CLASS_NAME(x)`\n\nIn some cases, the coercion is seamless\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.character(c(1, 4, 7))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"1\" \"4\" \"7\"\n```\n\n\n:::\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 7\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\nIn some cases the coercion is not possible; if executed, it will return `NA`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7a\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: NAs introduced by coercion\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 NA\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE NA\n```\n\n\n:::\n:::\n\n\n\n\n## Factors\n\nA `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables. Use the `factor()` function to create factors from character values. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$age_group)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"factor\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"middle\" \"old\" \"young\" \n```\n\n\n:::\n:::\n\n\n\nNote 1: levels are, by default, set to **alphanumerical** order! And the first level is always the \"reference\" group. However, we often prefer a different reference group.\n\nNote 2: we can also make ordered factors using `factor(... ordered=TRUE)`, but we won't talk more about that.\n\n## Reference Groups \n\n**Why do we care about reference groups?** \n\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being <5 years old is associated with higher IgG antibody concentrations.\n\nBy default, `middle` is the reference group; therefore, we will only generate beta coefficients comparing `middle` to `young` AND `middle` to `old`. 
But we want `young` to be the reference group, so we will generate beta coefficients comparing `young` to `middle` AND `young` to `old`.\n\n## Changing factor reference \n\nThere are two ways to change the reference group of a factor variable:\n\n- If the object is already a factor, use the `relevel()` function with the `ref` argument to specify the reference.\n- If the object is a character, use the `factor()` function with the `levels` argument to specify the order of the values, the first being the reference.\n\n\nLet's look at the `relevel()` help file\n\n\nReorder Levels of Factor\n\nDescription:\n\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\n\nUsage:\n\n relevel(x, ref, ...)\n \nArguments:\n\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\n\nDetails:\n\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\n\nValue:\n\n A factor of the same length as 'x'.\n\nSee Also:\n\n 'factor', 'contr.treatment', 'levels', 'reorder'.\n\nExamples:\n\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\n\n\n
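\n\nTo make the payoff concrete, here is a minimal sketch (assuming the `df` data frame with the `log_IgG` and `age_group_factor` columns created earlier in this module) showing how the reference level changes which contrasts `lm()` reports:\n\n```\n# With the default alphabetical levels, \"middle\" is the reference,\n# so the coefficients contrast \"old\" and \"young\" against \"middle\"\ncoef(lm(log_IgG ~ age_group_factor, data = df))\n\n# Relevel a copy so \"young\" is the reference, then refit:\n# the coefficients now contrast \"middle\" and \"old\" against \"young\"\ndf2 <- df\ndf2$age_group_factor <- relevel(df2$age_group_factor, ref = \"young\")\ncoef(lm(log_IgG ~ age_group_factor, data = df2))\n```\n\nThe fitted model is the same either way; only the parameterization changes, i.e., which group the intercept and coefficients are measured against.\n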
\n\nLet's look at the `factor()` help file\n\n\nFactors\n\nDescription:\n\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\n\nUsage:\n\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n\n x: a vector of data, usually taking a small number of distinct\n values.\n\n levels: an optional vector of the unique values (as character\n strings) that 'x' might have taken. The default is the\n unique set of values taken by 'as.character(x)', sorted into\n increasing order _of 'x'_. Note that this set can be\n specified as smaller than 'sort(unique(x))'.\n\n labels: _either_ an optional character vector of labels for the\n levels (in the same order as 'levels' after removing those in\n 'exclude'), _or_ a character string of length 1. Duplicated\n values in 'labels' can be used to map different values of 'x'\n to the same factor level.\n\n exclude: a vector of values to be excluded when forming the set of\n levels. This may be factor with the same level set as 'x' or\n should be a 'character'.\n\n ordered: logical flag to determine if the levels should be regarded as\n ordered (in the order given).\n\n nmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\n\n ifany: only add an 'NA' level if it is used, i.e. if\n 'any(is.na(x))'.\n\n object: an R object.\n\nDetails:\n\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and model-fitting functions may treat the two classes\n quite differently, see 'options(\"contrasts\")'.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. 
For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). Under those circumstances\n missing values are currently printed as '', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\n\nValue:\n\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject()'.\n\nWarning:\n\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. 
It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\n\nComparison operators and group generic methods:\n\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. (The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\n\nNote:\n\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\n\nReferences:\n\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\n\nExamples:\n\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. 
<- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 <NA> <NA>\n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps <NA>, as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = <NA>\n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))\n\n\n\n\n## Changing factor reference examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\nOR\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\nArranging, tabulating, and plotting the data will reflect the new order.\n\n\n## Two-dimensional data classes\n\nTwo-dimensional classes are those we would often use to store data read from a file \n\n* a matrix (`matrix` class)\n* a data frame (`data.frame` or `tibble` classes)\n\n\n## Matrices\n\nMatrices, like data frames, are also composed of rows and columns. Unlike a `data.frame`, however, the entire matrix must be composed of a single R class. **For example: all entries are `numeric`, or all entries are `character`**\n\n`as.matrix()` creates a matrix from a data frame (where all values are the same class). As a reminder, here is the `matrix()` function signature to help us recall how to build a matrix\n\n```\nmatrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)\n```\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol = 2) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 2|\n| 3| 4|\n| 5| 6|\n:::\n:::\n\n\n\nNote, the first matrix filled in numbers 1-6 by columns first and then rows because the default `byrow` argument is `FALSE`. 
In the second matrix, we changed the argument `byrow` to `TRUE`, and now numbers 1-6 are filled by rows first and then columns.\n\n## Data frame \n\nYou can transform an existing matrix into a data frame using `as.data.frame()` \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.data.frame(matrix(1:6, ncol = 2) ) \n```\n\n::: {.cell-output-display}\n\n\n| V1| V2|\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n:::\n\n\n\nYou can create a new data frame out of vectors (and potentially lists, but\nthis is an advanced feature and unusual) by using the `data.frame()` function.\nRecall that all of the vectors that make up a data frame must be the same\nlength.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlotr <- \n data.frame(\n name = c(\"Frodo\", \"Sam\", \"Aragorn\", \"Legolas\", \"Gimli\"),\n race = c(\"Hobbit\", \"Hobbit\", \"Human\", \"Elf\", \"Dwarf\"),\n age = c(53, 38, 87, 2931, 139)\n )\n```\n:::\n\n\n\n## Numeric variable data summary\n\nData summarization on numeric vectors/variables:\n\n-\t`mean()`: takes the mean of x\n-\t`sd()`: takes the standard deviation of x\n-\t`median()`: takes the median of x\n-\t`quantile()`: displays sample quantiles of x. Default is the min, 25th/50th/75th percentiles, and max\n-\t`range()`: displays the range. Same as `c(min(), max())`\n-\t`sum()`: sum of x\n-\t`max()`: maximum value in x\n-\t`min()`: minimum value in x\n- `colSums()`: get the column sums of a data frame\n- `rowSums()`: get the row sums of a data frame\n- `colMeans()`: get the column means of a data frame\n- `rowMeans()`: get the row means of a data frame\n\nNote, all of these functions have an `na.rm` **argument for missing data**.\n\n## Numeric variable data summary\n\nLet's look at the help file for `range()` to make note of the `na.rm` argument\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?range\n```\n:::\n\nRange of Values\n\nDescription:\n\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\n\nUsage:\n\n range(..., na.rm = FALSE)\n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n ## same for classes 'Date' and 'POSIXct'\n \n .rangeNum(..., na.rm, finite, isNumeric)\n \nArguments:\n\n ...: any 'numeric' or character objects.\n\n na.rm: logical, indicating if 'NA''s should be omitted.\n\n finite: logical, indicating if all non-finite elements should be\n omitted.\n\nisNumeric: a 'function' returning 'TRUE' or 'FALSE' when called on\n 'c(..., recursive = TRUE)', 'is.numeric()' for the default\n 'range()' method.\n\nDetails:\n\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\n\nS4 methods:\n\n This is part of the S4 'Summary' group generic. Methods for it\n must use the signature 'x, ..., na.rm'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\n\nExamples:\n\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)\n\n\n\n## Numeric variable data summary examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output-display}\n\n\n| |observation_id |IgG_concentration | age | gender | slum | log_IgG | seropos | age_group |age_group_factor |\n|:--|:--------------|:-----------------|:--------------|:----------------|:----------------|:---------------|:-------------|:----------------|:----------------|\n| |Min. :5006 |Min. : 0.0054 |Min. : 1.000 |Length:651 |Length:651 |Min. :-5.2231 |Mode :logical |Length:651 |young :316 |\n| |1st Qu.:6306 |1st Qu.: 0.3000 |1st Qu.: 3.000 |Class :character |Class :character |1st Qu.:-1.2040 |FALSE:360 |Class :character |middle:179 |\n| |Median :7495 |Median : 1.6658 |Median : 6.000 |Mode :character |Mode :character |Median : 0.5103 |TRUE :281 |Mode :character |old :147 |\n| |Mean :7492 |Mean : 87.3683 |Mean : 6.606 |NA |NA |Mean : 1.6074 |NA's :10 |NA |NA's : 9 |\n| |3rd Qu.:8749 |3rd Qu.:141.4405 |3rd Qu.:10.000 |NA |NA |3rd Qu.: 4.9519 |NA |NA |NA |\n| |Max. :9982 |Max. :916.4179 |Max. :15.000 |NA |NA |Max. : 6.8205 |NA |NA |NA |\n| |NA |NA's :10 |NA's :9 |NA |NA |NA's :10 |NA |NA |NA |\n:::\n\n```{.r .cell-code}\nrange(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NA NA\n```\n\n\n:::\n\n```{.r .cell-code}\nrange(df$age, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 15\n```\n\n\n:::\n\n```{.r .cell-code}\nmedian(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.665753\n```\n\n\n:::\n:::\n\n\n\n\n## Character variable data summaries\n\nData summarization on character or factor vectors/variables using `table()`\n\n\t\t\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?table\n```\n:::\n\nCross Tabulation and Table Creation\n\nDescription:\n\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\n\nUsage:\n\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\n\n exclude: levels to remove for all factors in '...'. If it does not\n contain 'NA' and 'useNA' is not specified, it implies 'useNA\n = \"ifany\"'. See 'Details' for its interpretation for\n non-factor arguments.\n\n useNA: whether to include 'NA' values in the table. See 'Details'.\n Can be abbreviated.\n\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\n\ndeparse.level: controls how the default 'dnn' is constructed. See\n 'Details'.\n\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. 
Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\n\nrow.names: a character vector giving the row names for the data frame.\n\nresponseName: the name to be used for the column of table entries,\n usually counts.\n\nstringsAsFactors: logical: should the classifying factors be returned\n as factors (the default) or character vectors?\n\nsep, base: passed to 'provideDimnames'.\n\nDetails:\n\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\n\nValue:\n\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\n\nExamples:\n\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), dnn = NULL) # dimnames() have no names\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? --- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. 
The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))\n\n\n\n\n## Character variable data summary examples\n\nNumber of observations in each category\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|------:|----:|\n| 325| 326|\n:::\n\n```{.r .cell-code}\ntable(df$gender, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male| NA|\n|------:|----:|--:|\n| 325| 326| 0|\n:::\n\n```{.r .cell-code}\ntable(df$age_group, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young| NA|\n|------:|---:|-----:|--:|\n| 179| 147| 316| 9|\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)/nrow(df) #if no NA values\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|--------:|--------:|\n| 0.499232| 0.500768|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n:::\n\n\n\n\n## Summary\n\n- You can create new columns/variable to a data frame by using `$` or the `transform()` function\n- One useful function for creating new variables based on existing variables is the `ifelse()` function, which returns a value depending on whether the element of test is `TRUE` or `FALSE`\n- The `class()` function allows you to evaluate the class of an object.\n- There are two types of numeric class objects: integer and double\n- Logical class objects only have `TRUE` or `False` (without quotes)\n- `is.CLASS_NAME(x)` can be used to test the class of an object x\n- `as.CLASS_NAME(x)` can be used to change the class of an object x\n- Factors are a special character class that has levels \n- There are many fairly intuitive data summary functions you can perform on a vector (i.e., `mean()`, `sd()`, `range()`) or on rows or columns of a data frame (i.e., `colSums()`, `colMeans()`, `rowSums()`)\n- The `table()` function builds frequency tables of the counts at each combination of categorical levels\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "supporting": [ + "Module07-VarCreationClassesSummaries_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/modules/Module095-DataAnalysisWalkthrough/execute-results/html.json b/_freeze/modules/Module095-DataAnalysisWalkthrough/execute-results/html.json new file mode 100644 index 0000000..a290184 --- /dev/null +++ b/_freeze/modules/Module095-DataAnalysisWalkthrough/execute-results/html.json @@ -0,0 +1,19 @@ +{ + "hash": "ddd75cf90c1c86ca2d3a9928b1ac32ef", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Data Analysis Walkthrough\"\nformat:\n revealjs:\n toc: false\nexecute: \n echo: false\n---\n\n\n\n\n## Learning goals\n\n* Use logical operators, 
subsetting functions, and math calculations in R\n* Translate human-understandable problem descriptions into instructions that\nR can understand.\n\n# Remember, R always does EXACTLY what you tell it to do!\n\n## Instructions\n\n* Make a new R script for this case study, and save it to your code folder.\n* We'll use the diphtheria serosample data from Exercise 1 for this case study.\nLoad it into R and use the functions we've learned to look at it.\n\n## Instructions\n\n* Make a new R script for this case study, and save it to your code folder.\n* We'll use the diphtheria serosample data from Exercise 1 for this case study.\nLoad it into R and use the functions we've learned to look at it.\n* The `str()` of your dataset should look like this.\n\n\n\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\ntibble [250 × 5] (S3: tbl_df/tbl/data.frame)\n $ age_months : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...\n $ group : chr [1:250] \"urban\" \"rural\" \"urban\" \"urban\" ...\n $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...\n $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...\n $ DP_vacc : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...\n```\n\n\n:::\n:::\n\n\n\n\n## Q1: Was the overall prevalence higher in urban or rural areas?\n\n::: {.incremental}\n\n1. How do we calculate the prevalence from the data?\n1. How do we calculate the prevalence separately for urban and rural areas?\n1. How do we determine which prevalence is higher and if the difference is\nmeaningful?\n\n:::\n\n## Q1: How do we calculate the prevalence from the data?\n\n::: {.incremental}\n\n* The variable `DP_infection` in our dataset is binary / dichotomous.\n* The prevalence is the number or percent of people who had the disease over\nsome duration.\n* The average of a binary variable gives the prevalence!\n\n:::\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(diph$DP_infection)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8\n```\n\n\n:::\n:::\n\n\n\n\n## Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(diph[diph$group == \"urban\", ]$DP_infection)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.8235294\n```\n\n\n:::\n\n```{.r .cell-code}\nmean(diph[diph$group == \"rural\", ]$DP_infection)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.778626\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n* There are many ways you could write this code! 
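For instance, here is one alternative sketch (illustrative only, not from the module, assuming the same `diph` data frame) using base R's `tapply()`:\n\n```r\n# Hypothetical alternative: the mean of the binary infection indicator,\n# computed within each group, gives both prevalences in one call\ntapply(diph$DP_infection, diph$group, mean)\n```\n\n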
You can use `subset()` or you\ncan write the indices many ways.\n* Using `tbl_df` objects from `haven` uses different `[[` rules than a base R\ndata frame.\n\n## Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n* One easy way is to use the `aggregate()` function.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\naggregate(DP_infection ~ group, data = diph, FUN = mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n group DP_infection\n1 rural 0.7786260\n2 urban 0.8235294\n```\n\n\n:::\n:::\n\n\n\n\n## Q1: How do we determine which prevalence is higher and if the difference is meaningful?\n\n::: {.incremental}\n\n* We probably need to include a confidence interval in our calculation.\n* This is actually not so easy without more advanced tools that we will learn\nin upcoming modules.\n* Right now the best options are to do it by hand or google a function.\n\n:::\n\n## Q1: By hand\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == \"urban\", ]))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == \"rural\", ])) \n\nresult_urban <- paste0(\n\t\"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n\tround(p_urban - 1.96 * se_urban, 2), \", \",\n\tround(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n\t\"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n\tround(p_rural - 1.96 * se_rural, 2), \", \",\n\tround(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nUrban: 0.82; 95% CI: (0.76, 0.89)\nRural: 0.78; 95% CI: (0.71, 0.85)\n```\n\n\n:::\n:::\n\n\n\n\n## Q1: By hand\n\n* We can see that the 95% CI's overlap, so the groups are probably not that\ndifferent. **To be sure, we need to do a 2-sample test! But this is not a\nstatistics class.**\n* Some people will tell you that coding like this is \"bad\". **But 'bad' code\nthat gives you answers is better than broken code!** We will learn techniques for writing this with less work and less repetition\nin upcoming modules.\n\n## Q1: Googling a package\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254\n```\n\n\n:::\n:::\n\n\n\n\n## You try it!\n\n* Using any of the approaches you can think of, answer this question!\n* **How many children under 5 were vaccinated? 
In children under 5, did\nvaccination lower the prevalence of infection?**\n\n## You try it!\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# How many children under 5 were vaccinated\nsum(diph$DP_vacc[diph$age_months < 60])\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 91\n```\n\n\n:::\n\n```{.r .cell-code}\n# Prevalence in both vaccine groups for children under 5\naggregate(\n\tDP_infection ~ DP_vacc,\n\tdata = subset(diph, age_months < 60),\n\tFUN = DescTools::MeanCI\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 0 0.4285714 0.1977457 0.6593972\n2 1 0.6373626 0.5366845 0.7380407\n```\n\n\n:::\n:::\n\n\n\n\nIt appears that prevalence was HIGHER in the vaccine group? That is\ncounterintuitive, but the sample size for the unvaccinated group is too small\nto be sure.\n\n## Congratulations for finishing the first case study!\n\n* What R functions and skills did you practice?\n* What other questions could you answer about the same dataset with the skills\nyou know now?\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module11-RMarkdown/execute-results/html.json b/_freeze/modules/Module11-RMarkdown/execute-results/html.json new file mode 100644 index 0000000..f16f643 --- /dev/null +++ b/_freeze/modules/Module11-RMarkdown/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "f0e1a70494222d96692b28e0509006d1", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Module 11: Literate Programming\"\nformat:\n revealjs:\n toc: false\n---\n\n\n\n\n\n\n\n## Learning goals\n\n1. Define literate programming\n1. Implement literate programming in `R` using `knitr` and either `R Markdown`\nor `Quarto`\n1. Include plots, tables, and references along with your code in a written\nreport.\n1. Locate additional resources for literate programming with `R Markdown` or\n`Quarto`.\n\n## What is literate programming?\n\n* Programming files contain **code** along with **text**, **code results**,\nand other supporting information.\n* Instead of having separate code and text, that you glue together in Word,\nwe have one document which combines code and text.\n\n## What is literate programming?\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![R markdown example, from https://rmarkdown.rstudio.com/authoring_quick_tour.html](../images/rmdexample.png)\n:::\n:::\n\n\n\n\n## Literate programming examples {.smaller}\n\n* Writing a research paper with R Markdown: [https://github.com/wzbillings/Patient-vs-Clinician-Symptom-Reports](https://github.com/wzbillings/Patient-vs-Clinician-Symptom-Reports)\n* Writing a book with R Markdown: [https://github.com/moderndive/ModernDive_book](https://github.com/moderndive/ModernDive_book)\n* Personal websites (like my tutorial!): [https://jadeyryan.com/blog/2024-02-19_beginner-quarto-netlify/](https://jadeyryan.com/blog/2024-02-19_beginner-quarto-netlify/)\n* Other examples: [https://bookdown.org/yihui/rmarkdown/basics-examples.html](https://bookdown.org/yihui/rmarkdown/basics-examples.html)\n\n## `R Markdown` and `Quarto` {.smaller}\n\n* `R Markdown` and `Quarto` are both implementations of literate programming\nusing R, with the `knitr` package for the backend. 
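For example, behind RStudio's Knit button the rendering step is just a function call; a minimal sketch (the file name here is hypothetical) is:\n\n```r\n# Render an R Markdown file to its default output format\nrmarkdown::render(\"my_report.Rmd\")\n```\n\n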
Both are supported by RStudio.\n* To use `R Markdown`, you need to `install.packages(\"rmarkdown\")`.\n* `Quarto` comes with new versions of RStudio, but you can also install the\nlatest version from the [Quarto website](https://quarto.org/docs/get-started/).\n* `R Markdown` is older and now very commonly used. `Quarto` is newer and so\nhas many fancy new features, but more bugs that are constantly being found and\nfixed.\n* In this class, we will use **R Markdown**. But if you decide to use quarto,\n90% of your knowledge will transfer since they are very similar.\n - Advantages of R Markdown: more online resources, most common bugs have\n been fixed over the years, many people are familiar with it.\n - Advantages of Quarto: supports other programming languages like Python\n and Julia, uses more modern syntax, less slapped together overall.\n\n# Getting started with R Markdown\n\n## A few sticking points {.smaller .incremental}\n\n* Knitting to `html` format is really easy, but most scientist don't like\nhtml format for some reason. If you want to knit to `pdf`, you should install\nthe package `tinytex` and read the [intro](https://bookdown.org/yihui/rmarkdown-cookbook/install-latex.html).\n* If you want to knit to `word` (what many journals in epidemiology require),\nyou need to have Word installed on your computer. **Note that with word,\nyou are a bit more restricted in your formatting options, so if weird things\nhappen you'll have to try some other options.**\n* You maybe noticed in the tutorial that I used the `here::here()` function\nfor all of my file paths. This is because **R Markdown and Quarto files use\na different working directory from the R Project.** Using `here::here()`\ntranslates relative paths into absolute paths based on your R Project, so it\nmakes sure your R Markdown files can always find the right path!\n\n# Research paper example in R Markdown\n\n## You try it! {.smaller}\n\n1. Create an R Markdown document. Write about either the measles or diphtheria\nexample data sets, and include a figure and a table.\n2. BONUS EXERCISE: read the intro of the `bookdown` book, and create a\n`bookdown` document. Modify your writeup to have a few references with a\nbibliography, and cross-references with your figures and tables.\n3. BONUS: Try to structure your document like a report, with a section stating\nthe questions you want to answer (intro), a section with your R code and\nresults, and a section with your interpretations (discussion). 
This is a very\nopen ended exercise but by now I believe you can do it, and you'll have a nice\ndocument you can put on your portfolio or show employers!\n", + "supporting": [ + "Module11-RMarkdown_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module11-Rmarkdown-Demo/execute-results/html.json b/_freeze/modules/Module11-Rmarkdown-Demo/execute-results/html.json new file mode 100644 index 0000000..5c7089c --- /dev/null +++ b/_freeze/modules/Module11-Rmarkdown-Demo/execute-results/html.json @@ -0,0 +1,17 @@ +{ + "hash": "619f44d89afa1121169dd28cf07347f7", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"R Markdown Notes\"\nauthor: \"Zane and Amy\"\ndate: \"2024-07-13\"\noutput:\n html_document:\n fig_caption: true\n number_sections: false\nbibliography: example-bib.bib\n---\n\n\n\n\n\n# This is an example R Markdown document\n\n* The top part of this document (between the `---`) is called the **YAML\nheader**. You specify options here that change the configuration of the document.\n* Text in the R Markdown **body** is formatted in the Pandoc Markdown language.\nMost of the syntax can be found on the cheat sheets in the references section.\n* To include a bibliography in your document, add the `bibliography` option to\nyour YAML header and include a BIBTEX file. A bibtex file looks like this:\n\n\n\n\n```{.bibtex}\n@Book{rmarkdown-cookbook,\n title = {R Markdown Cookbook},\n author = {Yihui Xie and Christophe Dervieux and Emily Riederer},\n publisher = {Chapman and Hall/CRC},\n address = {Boca Raton, Florida},\n year = {2020},\n isbn = {9780367563837},\n url = {https://bookdown.org/yihui/rmarkdown-cookbook},\n}\n\n@Manual{rmarkdown-package,\n title = {rmarkdown: Dynamic Documents for R},\n author = {JJ Allaire and Yihui Xie and Christophe Dervieux and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},\n year = {2024},\n note = {R package version 2.27},\n url = {https://github.com/rstudio/rmarkdown},\n}\n```\n\n\n\n* You can then add citations from your bibliography by adding special text in\nyour R Markdown document: `@rmarkdown-cookbook`. That's how we can get this\ncitation here [@rmarkdown-cookbook].\n\n# Including R code in your Markdown document\n\nYou have to put all of your code in a \"Code chunk\" and tell `knitr` that you\nare using R code.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\"))\nstr(meas)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t12438 obs. 
of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ...\n```\n\n\n:::\n:::\n\n\n\nYou can make plots and add captions in Markdown as well.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas_plot <- subset(meas, country == \"India\" & vaccine_antigen == \"MCV1\")\nplot(\n\tmeas_plot$time, meas_plot$Cases,\n\txlab = \"Year\",\n\tylab = \"Measles cases by year in India\",\n\ttype = \"b\"\n)\n```\n\n::: {.cell-output-display}\n![Meases cases over time in India.](Module11-Rmarkdown-Demo_files/figure-html/indiaplot-1.png){width=672}\n:::\n:::\n\n\n\nNote that if you want to **automatically reference your figures** like you would\nneed to for a research paper, you will also need to use the `bookdown` package,\nand you can read about it [here](https://bookdown.org/yihui/rmarkdown-cookbook/cross-ref.html). For\nthis document, we would have to write out \"Figure 1.\" manually in our text.\n\n# Including tables and figures from files\n\nIncluding tables is a bit more complicated, because unlike `plot()`, R cannot\nproduce any tables on its own. Instead we need to use another package. The\neasiest option is to use the `knitr` package which has a function called\n`knitr::kable()` that can make a table for us, like this.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas_table <- data.frame(\n\t\"Median cases\" = median(meas_plot$Cases),\n\t\"IQR cases\" = IQR(meas_plot$Cases)\n)\n\nknitr::kable(\n\tmeas_table,\n\tcaption = \"Median and IQR number of measles cases across all years in India.\"\n)\n```\n\n::: {.cell-output-display}\n\n\nTable: Median and IQR number of measles cases across all years in India.\n\n| Median.cases| IQR.cases|\n|------------:|---------:|\n| 47072| 44015.5|\n\n\n:::\n:::\n\n\n\nYou can also use the `kableExtra` package to format your table more nicely.\nIn general there are a lot of nice table making packages in R, like we saw\nwith the `tinytable` package in the exercise.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntinytable::tt(meas_table)\n```\n\n::: {.cell-output-display}\n\n```{=html}\n \n\n \n \n \n tinytable_2alx60l5mnkbv7h7p4gx\n \n \n \n \n \n\n \n
\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Median.casesIQR.cases
4707244015.5
\n
\n\n \n\n \n\n\n```\n\n:::\n:::\n\n\n\nFinally, if you want to include a figure that you already saved somewhere,\nyou can do that with `knitr` also.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nknitr::include_graphics(here::here(\"images\", \"xkcd.png\"))\n```\n\n::: {.cell-output-display}\n![](../images/xkcd.png)\n:::\n:::\n\n\n\n\n# R Markdown resources\n\n* Yihui Xie, the creator of R Markdown, has written three very helpful and FREE\nbooks on R Markdown, which can answer many of your questions.\n - [The Definitive Guide to R Markdown](https://bookdown.org/yihui/rmarkdown/)\n - [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/)\n - [Bookdown: Authoring books and technical documents with R Markdown](https://bookdown.org/yihui/bookdown/)\n* Before Quarto came around, R Studio created a bunch of great R Markdown resources.\n - RStudio has created a cheatsheet with the most common commands, that you can get [here](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf).\n - There's also a slightly longer [reference guide](https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf?_ga=2.135548086.688985490.1593521771-610318113.1566927154).\n - RStudio also has a [series of online lessons](https://rmarkdown.rstudio.com/lesson-1.html) about R Markdown.\n* To learn more about making and presenting tables in R and R Markdown, you\ncan check out [this free online course material](https://andreashandel.github.io/MADAcourse/content/module-data-presentation/presenting-results-overview.html).\n* And if you still don't quite get that R Project and `here` package stuff,\n[here](http://jenrichmond.rbind.io/post/how-to-use-the-here-package/) and\n[here](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) are some good readings to help.\n \n# References\n\n\n\n\n", + "supporting": [ + "Module11-Rmarkdown-Demo_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module11-Rmarkdown-Demo/figure-html/indiaplot-1.png b/_freeze/modules/Module11-Rmarkdown-Demo/figure-html/indiaplot-1.png new file mode 100644 index 0000000..9e47b8b Binary files /dev/null and b/_freeze/modules/Module11-Rmarkdown-Demo/figure-html/indiaplot-1.png differ diff --git a/_freeze/modules/Module13-Iteration/execute-results/html.json b/_freeze/modules/Module13-Iteration/execute-results/html.json new file mode 100644 index 0000000..ed661e1 --- /dev/null +++ b/_freeze/modules/Module13-Iteration/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "12a353d15ef4018df2e7e11b0007bf59", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Module 13: Iteration in R\"\nformat:\n revealjs:\n toc: false\n---\n\n\n\n\n\n\n\n## Learning goals\n\n1. Replace repetitive code with a `for` loop\n1. 
Use vectorization to replace unnecessary loops\n\n## What is iteration?\n\n* Whenever you repeat something, that's iteration.\n* In `R`, this means running the same code multiple times in a row.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"penguins\", package = \"palmerpenguins\")\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Parts of a loop\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,9\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\n\n\nThe **header** declares how many times we will repeat the same code. The header\ncontains a **control variable** that changes in each repetition and a\n**sequence** of values for the control variable to take.\n\n## Parts of a loop\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2-8\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\n\n\nThe **body** of the loop contains code that will be repeated a number of times\nbased on the header instructions. 
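In schematic form (a generic, runnable sketch, not part of the penguin example above), the pieces fit together like this:\n\n```r\n# header: the control variable takes each value of the sequence in turn\nfor (control_variable in c(1, 2, 3)) {\n  # body: runs once per value of the control variable\n  print(control_variable * 10)\n}\n```\n\n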
In `R`, the body has to be surrounded by\ncurly braces.\n\n## Header parts\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n\n\n* `for`: keyword that declares we are doing a for loop.\n* `(...)`: parentheses after `for` declare the control variable and sequence.\n* `this_island`: the control variable.\n* `in`: keyword that separates the control varibale and sequence.\n* `levels(penguins$island)`: the sequence.\n* `{}`: curly braces will contain the body code.\n\n## Header parts\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n\n\n* Since `levels(penguins$island)` evaluates to\n`c(\"Biscoe\", \"Dream\", \"Torgersen\")`, our loop will repeat 3 times.\n\n| Iteration | `this_island` |\n|-----------|---------------|\n| 1 | \"Biscoe\" |\n| 2 | \"Dream\" |\n| 3 | \"Torgersen\" |\n\n* Everything inside of `{...}` will be repeated three times.\n\n## Loop iteration 1\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Biscoe\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Biscoe\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Loop iteration 2\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Dream\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Dream\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Dream Island was 18.34 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Loop iteration 3\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Torgersen\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## The loop structure automates this process for us so we don't have to copy and paste our code!\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Side note: the pipe operator `|>` {.scrollable}\n\n* This operator allows us to chain commands together so the output of the\nprevious statement is passed into the next statement.\n* E.g. 
the code\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n```\n:::\n\n\n\n\nwill be transformed by R into\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tround(\n\t\tmean(\n\t\t\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"],\n\t\t\tna.rm = TRUE\n\t\t),\n\t\tdigits = 2\n\t)\n```\n:::\n\n\n\n\nbefore it gets run. So using the pipe is a way to avoid deeply nested functions.\n\nNote that another alernative could be like this:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_data <- penguins$bill_depth_mm[penguins$island == \"Torgersen\"]\nisland_mean_raw <- mean(island_data, na.rm = TRUE)\nisland_mean <- round(island_mean_raw, digits = 2)\n```\n:::\n\n\n\n\nSo using `|>` can also help us to avoid a lot of assignments.\n\n* **Whichever style you prefer is fine!** Some people like the pipe, some\npeople like nesting, and some people like intermediate assignments. All three\nare perfectly fine as long as your code is neat and commented.\n* If you go on to the `tidyverse` class, you will use a lot of piping -- it\nis a very popular coding style in R these days thanks to the inventors of\nthe `tidyverse` packages.\n* Also note that you need R version 4.1.0 or better to use `|>`. If you are\non an older version of R, it will not be available.\n\n**Now, back to loops!**\n\n## Remember: write DRY code!\n\n* DRY = \"Don't Repeat Yourself\"\n* Instead of copying and pasting, write loops and functions.\n* Easier to debug and change in the future!\n\n. . .\n\n* Of course, we all copy and paste code sometimes. If you are running on a\ntight deadline or can't get a loop or function to work, you might need to.\n**DRY code is good, but working code is best!**\n\n## {#tweet-slide data-menu-title=\"Hadley tweet\" .center}\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](../images/hadley-tweet.PNG)\n:::\n:::\n\n\n\n\n## You try it!\n\nWrite a loop that goes from 1 to 10, squares each of the numbers, and prints\nthe squared number.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:10) {\n\tcat(i ^ 2, \"\\n\")\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1 \n4 \n9 \n16 \n25 \n36 \n49 \n64 \n81 \n100 \n```\n\n\n:::\n:::\n\n\n\n\n## Wait, did we need to do that? {.incremental}\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n## Wait, did we need to do that? {.scrollable}\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# No loop needed!\n(1:10)^2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Get the first 10 odd numbers, a common CS 101 loop problem on exams\n(1:20)[which((1:20 %% 2) == 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 3 5 7 9 11 13 15 17 19\n```\n\n\n:::\n:::\n\n\n\n\n. . 
.\n\n* So you should really try vectorization first, then use loops only when\nyou can't use vectorization.\n\n## Loop walkthrough\n\n* Let's walk through a complex but useful example where we can't use\nvectorization.\n* Load the cleaned measles dataset, and subset it so you only have MCV1 records.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\")) |>\n\tsubset(vaccine_antigen == \"MCV1\")\nstr(meas)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t7972 obs. of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ...\n```\n\n\n:::\n:::\n\n\n\n\n## Loop walkthrough\n\n* First, make an empty `list`. This is where we'll store our results. Make it\nthe same length as the number of countries in the dataset.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- vector(mode = \"list\", length = length(unique(meas$country)))\n```\n:::\n\n\n\n\n* This is called *preallocation* and it can make your loops much faster.\n\n## Loop walkthrough\n\n* Loop through every country in the dataset, and get the median, first and third\nquartiles, and range for each country. Store those summary statistics in a data frame.\n* What should the header look like?\n\n. . .\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncountries <- unique(meas$country)\nfor (i in 1:length(countries)) {...}\n```\n:::\n\n\n\n\n. . .\n\n* Note that we use the **index** as the control variable. When you need to\ndo complex operations inside a loop, this is easier than the **for-each**\nconstruction we used earlier.\n\n## Loop walkthrough {.scrollable}\n\n* Now write out the body of the code. First we need to subset the data, to get\nonly the data for the current country.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n}\n```\n:::\n\n\n\n\n. . .\n\n* Next we need to get the summary of the cases for that country.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n}\n```\n:::\n\n\n\n\n. . 
.\n\n* Next we save the summary statistics into a data frame.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n}\n```\n:::\n\n\n\n\n. . .\n\n* And finally, we save the data frame as the next element in our storage list.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n\t\n\t# Save the results to our container\n\tres[[i]] <- country_summary\n}\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n* Let's take a look at the results.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(res)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n country min Q1 median Q3 max\n1 Afghanistan 353 1154 2205 5166 31107\n\n[[2]]\n country min Q1 median Q3 max\n1 Angola 29 700 3271 14474 30067\n\n[[3]]\n country min Q1 median Q3 max\n1 Albania 0 1 12 29 136034\n\n[[4]]\n country min Q1 median Q3 max\n1 Andorra 0 0 1 2 5\n\n[[5]]\n country min Q1 median Q3 max\n1 United Arab Emirates 22 89.75 320 1128 2913\n\n[[6]]\n country min Q1 median Q3 max\n1 Argentina 0 0 17 4591.5 42093\n```\n\n\n:::\n:::\n\n\n\n\n* How do we deal with this to get it into a nice form?\n\n. . .\n\n* We can use a *vectorization* trick: the function `do.call()` seems like\nancient computer science magic. And it is. 
But it will actually help us a\nlot.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres_df <- do.call(rbind, res)\nhead(res_df)\n```\n\n::: {.cell-output-display}\n\n\n|country | min| Q1| median| Q3| max|\n|:--------------------|---:|-------:|------:|-------:|------:|\n|Afghanistan | 353| 1154.00| 2205| 5166.0| 31107|\n|Angola | 29| 700.00| 3271| 14474.0| 30067|\n|Albania | 0| 1.00| 12| 29.0| 136034|\n|Andorra | 0| 0.00| 1| 2.0| 5|\n|United Arab Emirates | 22| 89.75| 320| 1128.0| 2913|\n|Argentina | 0| 0.00| 17| 4591.5| 42093|\n:::\n:::\n\n\n\n\n* It combined our data frames together! Let's take a look at the `rbind` and\n`do.call()` help packages to see what happened.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?rbind\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nCombine R Objects by Rows or Columns\n\nDescription:\n\n Take a sequence of vector, matrix or data-frame arguments and\n combine by _c_olumns or _r_ows, respectively. These are generic\n functions with methods for other R classes.\n\nUsage:\n\n cbind(..., deparse.level = 1)\n rbind(..., deparse.level = 1)\n ## S3 method for class 'data.frame'\n rbind(..., deparse.level = 1, make.row.names = TRUE,\n stringsAsFactors = FALSE, factor.exclude = TRUE)\n \nArguments:\n\n ...: (generalized) vectors or matrices. These can be given as\n named arguments. Other R objects may be coerced as\n appropriate, or S4 methods may be used: see sections\n 'Details' and 'Value'. (For the '\"data.frame\"' method of\n 'cbind' these can be further arguments to 'data.frame' such\n as 'stringsAsFactors'.)\n\ndeparse.level: integer controlling the construction of labels in the\n case of non-matrix-like arguments (for the default method):\n 'deparse.level = 0' constructs no labels;\n the default 'deparse.level = 1' typically and 'deparse.level\n = 2' always construct labels from the argument names, see the\n 'Value' section below.\n\nmake.row.names: (only for data frame method:) logical indicating if\n unique and valid 'row.names' should be constructed from the\n arguments.\n\nstringsAsFactors: logical, passed to 'as.data.frame'; only has an\n effect when the '...' arguments contain a (non-'data.frame')\n 'character'.\n\nfactor.exclude: if the data frames contain factors, the default 'TRUE'\n ensures that 'NA' levels of factors are kept, see PR#17562\n and the 'Data frame methods'. In R versions up to 3.6.x,\n 'factor.exclude = NA' has been implicitly hardcoded (R <=\n 3.6.0) or the default (R = 3.6.x, x >= 1).\n\nDetails:\n\n The functions 'cbind' and 'rbind' are S3 generic, with methods for\n data frames. The data frame method will be used if at least one\n argument is a data frame and the rest are vectors or matrices.\n There can be other methods; in particular, there is one for time\n series objects. See the section on 'Dispatch' for how the method\n to be used is selected. If some of the arguments are of an S4\n class, i.e., 'isS4(.)' is true, S4 methods are sought also, and\n the hidden 'cbind' / 'rbind' functions from package 'methods'\n maybe called, which in turn build on 'cbind2' or 'rbind2',\n respectively. In that case, 'deparse.level' is obeyed, similarly\n to the default method.\n\n In the default method, all the vectors/matrices must be atomic\n (see 'vector') or lists. Expressions are not allowed. Language\n objects (such as formulae and calls) and pairlists will be coerced\n to lists: other objects (such as names and external pointers) will\n be included as elements in a list result. 
Any classes the inputs\n might have are discarded (in particular, factors are replaced by\n their internal codes).\n\n If there are several matrix arguments, they must all have the same\n number of columns (or rows) and this will be the number of columns\n (or rows) of the result. If all the arguments are vectors, the\n number of columns (rows) in the result is equal to the length of\n the longest vector. Values in shorter arguments are recycled to\n achieve this length (with a 'warning' if they are recycled only\n _fractionally_).\n\n When the arguments consist of a mix of matrices and vectors the\n number of columns (rows) of the result is determined by the number\n of columns (rows) of the matrix arguments. Any vectors have their\n values recycled or subsetted to achieve this length.\n\n For 'cbind' ('rbind'), vectors of zero length (including 'NULL')\n are ignored unless the result would have zero rows (columns), for\n S compatibility. (Zero-extent matrices do not occur in S3 and are\n not ignored in R.)\n\n Matrices are restricted to less than 2^31 rows and columns even on\n 64-bit systems. So input vectors have the same length\n restriction: as from R 3.2.0 input matrices with more elements\n (but meeting the row and column restrictions) are allowed.\n\nValue:\n\n For the default method, a matrix combining the '...' arguments\n column-wise or row-wise. (Exception: if there are no inputs or\n all the inputs are 'NULL', the value is 'NULL'.)\n\n The type of a matrix result determined from the highest type of\n any of the inputs in the hierarchy raw < logical < integer <\n double < complex < character < list .\n\n For 'cbind' ('rbind') the column (row) names are taken from the\n 'colnames' ('rownames') of the arguments if these are matrix-like.\n Otherwise from the names of the arguments or where those are not\n supplied and 'deparse.level > 0', by deparsing the expressions\n given, for 'deparse.level = 1' only if that gives a sensible name\n (a 'symbol', see 'is.symbol').\n\n For 'cbind' row names are taken from the first argument with\n appropriate names: rownames for a matrix, or names for a vector of\n length the number of rows of the result.\n\n For 'rbind' column names are taken from the first argument with\n appropriate names: colnames for a matrix, or names for a vector of\n length the number of columns of the result.\n\nData frame methods:\n\n The 'cbind' data frame method is just a wrapper for\n 'data.frame(..., check.names = FALSE)'. This means that it will\n split matrix columns in data frame arguments, and convert\n character columns to factors unless 'stringsAsFactors = FALSE' is\n specified.\n\n The 'rbind' data frame method first drops all zero-column and\n zero-row arguments. (If that leaves none, it returns the first\n argument with columns otherwise a zero-column zero-row data\n frame.) It then takes the classes of the columns from the first\n data frame, and matches columns by name (rather than by position).\n Factors have their levels expanded as necessary (in the order of\n the levels of the level sets of the factors encountered) and the\n result is an ordered factor if and only if all the components were\n ordered factors. 
Old-style categories (integer vectors with\n levels) are promoted to factors.\n\n Note that for result column 'j', 'factor(., exclude = X(j))' is\n applied, where\n\n X(j) := if(isTRUE(factor.exclude)) {\n if(!NA.lev[j]) NA # else NULL\n } else factor.exclude\n \n where 'NA.lev[j]' is true iff any contributing data frame has had\n a 'factor' in column 'j' with an explicit 'NA' level.\n\nDispatch:\n\n The method dispatching is _not_ done via 'UseMethod()', but by\n C-internal dispatching. Therefore there is no need for, e.g.,\n 'rbind.default'.\n\n The dispatch algorithm is described in the source file\n ('.../src/main/bind.c') as\n\n 1. For each argument we get the list of possible class\n memberships from the class attribute.\n\n 2. We inspect each class in turn to see if there is an\n applicable method.\n\n 3. If we find a method, we use it. Otherwise, if there was an\n S4 object among the arguments, we try S4 dispatch; otherwise,\n we use the default code.\n\n If you want to combine other objects with data frames, it may be\n necessary to coerce them to data frames first. (Note that this\n algorithm can result in calling the data frame method if all the\n arguments are either data frames or vectors, and this will result\n in the coercion of character vectors to factors.)\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'c' to combine vectors (and lists) as vectors, 'data.frame' to\n combine vectors and matrices as a data frame.\n\nExamples:\n\n m <- cbind(1, 1:7) # the '1' (= shorter vector) is recycled\n m\n m <- cbind(m, 8:14)[, c(1, 3, 2)] # insert a column\n m\n cbind(1:7, diag(3)) # vector is subset -> warning\n \n cbind(0, rbind(1, 1:3))\n cbind(I = 0, X = rbind(a = 1, b = 1:3)) # use some names\n xx <- data.frame(I = rep(0,2))\n cbind(xx, X = rbind(a = 1, b = 1:3)) # named differently\n \n cbind(0, matrix(1, nrow = 0, ncol = 4)) #> Warning (making sense)\n dim(cbind(0, matrix(1, nrow = 2, ncol = 0))) #-> 2 x 1\n \n ## deparse.level\n dd <- 10\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 0) # middle 2 rownames\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 1) # 3 rownames (default)\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 2) # 4 rownames\n \n ## cheap row names:\n b0 <- gl(3,4, labels=letters[1:3])\n bf <- setNames(b0, paste0(\"o\", seq_along(b0)))\n df <- data.frame(a = 1, B = b0, f = gl(4,3))\n df. <- data.frame(a = 1, B = bf, f = gl(4,3))\n new <- data.frame(a = 8, B =\"B\", f = \"1\")\n (df1 <- rbind(df , new))\n (df.1 <- rbind(df., new))\n stopifnot(identical(df1, rbind(df, new, make.row.names=FALSE)),\n identical(df1, rbind(df., new, make.row.names=FALSE)))\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?do.call\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nExecute a Function Call\n\nDescription:\n\n 'do.call' constructs and executes a function call from a name or a\n function and a list of arguments to be passed to it.\n\nUsage:\n\n do.call(what, args, quote = FALSE, envir = parent.frame())\n \nArguments:\n\n what: either a function or a non-empty character string naming the\n function to be called.\n\n args: a _list_ of arguments to the function call. The 'names'\n attribute of 'args' gives the argument names.\n\n quote: a logical value indicating whether to quote the arguments.\n\n envir: an environment within which to evaluate the call. 
This will\n be most useful if 'what' is a character string and the\n arguments are symbols or quoted expressions.\n\nDetails:\n\n If 'quote' is 'FALSE', the default, then the arguments are\n evaluated (in the calling environment, not in 'envir'). If\n 'quote' is 'TRUE' then each argument is quoted (see 'quote') so\n that the effect of argument evaluation is to remove the quotes -\n leaving the original arguments unevaluated when the call is\n constructed.\n\n The behavior of some functions, such as 'substitute', will not be\n the same for functions evaluated using 'do.call' as if they were\n evaluated from the interpreter. The precise semantics are\n currently undefined and subject to change.\n\nValue:\n\n The result of the (evaluated) function call.\n\nWarning:\n\n This should not be used to attempt to evade restrictions on the\n use of '.Internal' and other non-API calls.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'call' which creates an unevaluated call.\n\nExamples:\n\n do.call(\"complex\", list(imaginary = 1:3))\n \n ## if we already have a list (e.g., a data frame)\n ## we need c() to add further arguments\n tmp <- expand.grid(letters[1:2], 1:3, c(\"+\", \"-\"))\n do.call(\"paste\", c(tmp, sep = \"\"))\n \n do.call(paste, list(as.name(\"A\"), as.name(\"B\")), quote = TRUE)\n \n ## examples of where objects will be found.\n A <- 2\n f <- function(x) print(x^2)\n env <- new.env()\n assign(\"A\", 10, envir = env)\n assign(\"f\", f, envir = env)\n f <- function(x) print(x)\n f(A) # 2\n do.call(\"f\", list(A)) # 2\n do.call(\"f\", list(A), envir = env) # 4\n do.call( f, list(A), envir = env) # 2\n do.call(\"f\", list(quote(A)), envir = env) # 100\n do.call( f, list(quote(A)), envir = env) # 10\n do.call(\"f\", list(as.name(\"A\")), envir = env) # 100\n \n eval(call(\"f\", A)) # 2\n eval(call(\"f\", quote(A))) # 2\n eval(call(\"f\", A), envir = env) # 4\n eval(call(\"f\", quote(A)), envir = env) # 100\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n* OK, so basically what happened is that\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndo.call(rbind, list)\n```\n:::\n\n\n\n\n* Gets transformed into\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])\n```\n:::\n\n\n\n\n* That's vectorization magic!\n\n## You try it! (if we have time) {.smaller}\n\n* Use the code you wrote before the get the incidence per 1000 people on the\nentire measles data set (add a column for incidence to the full data).\n* Use the code `plot(NULL, NULL, ...)` to make a blank plot. You will need to\nset the `xlim` and `ylim` arguments to sensible values, and specify the axis\ntitles as \"Year\" and \"Incidence per 1000 people\".\n* Using a `for` loop and the `lines()` function, make a plot that shows all of\nthe incidence curves over time, overlapping on the plot.\n* HINT: use `col = adjustcolor(black, alpha.f = 0.25)` to make the curves\npartially transparent, so you can see the overlap.\n* BONUS PROBLEM: using the function `cumsum()`, make a plot of the cumulative\ncases (not standardized) over time for all of the countries. 
(Dealing with\nthe NA's here is tricky!!)\n\n## Main problem solution\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000\ncountries <- unique(meas$country)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(0, 50),\n\txlab = \"Year\",\n\tylab = \"Incidence per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cases_per_thousand,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n```\n:::\n\n\n\n\n## Main problem solution\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module13-Iteration_files/figure-revealjs/unnamed-chunk-32-1.png){width=960}\n:::\n:::\n\n\n\n\n## Bonus problem solution\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# First calculate the cumulative cases, treating NA as zeroes\ncumulative_cases <- ave(\n\tx = ifelse(is.na(meas$Cases), 0, meas$Cases),\n\tmeas$country,\n\tFUN = cumsum\n)\n\n# Now put the NAs back where they should be\nmeas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(1, 6.2e6),\n\txlab = \"Year\",\n\tylab = paste0(\"Cumulative cases since\", min(meas$time))\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cumulative_cases,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n\ntext(\n\tx = 2020,\n\ty = 6e6,\n\tlabels = \"China →\"\n)\n```\n:::\n\n\n\n\n## Bonus problem solution\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module13-Iteration_files/figure-revealjs/unnamed-chunk-34-1.png){width=960}\n:::\n:::\n\n\n\n\n## More practice on your own {.smaller}\n\n* Merge the `countries-regions.csv` data with the `measles_final.Rds` data.\nReshape the measles data so that `MCV1` and `MCV2` vaccine coverage are two\nseparate columns. Then use a loop to fit a poisson regression model for each\ncontinent where `Cases` is the outcome, and `MCV1 coverage` and `MCV2 coverage`\nare the predictors. Discuss your findings, and try adding an interation term.\n* Assess the impact of `age_months` as a confounder in the Diphtheria serology\ndata. First, write code to transform `age_months` into age ranges for each\nyear. Then, using a loop, calculate the crude odds ratio for the effect of\nvaccination on infection for each of the age ranges. How does the odds ratio\nchange as age increases? 
Can you formalize this analysis by fitting a logistic\nregression model with `age_months` and vaccination as predictors?\n\n\n", + "supporting": [ + "Module13-Iteration_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module13-Iteration/figure-revealjs/unnamed-chunk-32-1.png b/_freeze/modules/Module13-Iteration/figure-revealjs/unnamed-chunk-32-1.png new file mode 100644 index 0000000..84a077e Binary files /dev/null and b/_freeze/modules/Module13-Iteration/figure-revealjs/unnamed-chunk-32-1.png differ diff --git a/_freeze/modules/Module13-Iteration/figure-revealjs/unnamed-chunk-34-1.png b/_freeze/modules/Module13-Iteration/figure-revealjs/unnamed-chunk-34-1.png new file mode 100644 index 0000000..009fce5 Binary files /dev/null and b/_freeze/modules/Module13-Iteration/figure-revealjs/unnamed-chunk-34-1.png differ diff --git a/_quarto.yml b/_quarto.yml index f4808f9..a058234 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -1,23 +1,11 @@ project: type: website output-dir: docs - # render: - # - "*.qmd" - # - "!*/slides.qmd" + render: + - "*.qmd" website: title: "SISMID Module NUMBER Materials (2025)" - # subtitle: "Introduction to R" - # author: - # - name: "Zane Billings" - # url: https://wzbillings.com/ - # affiliation: University of Georgia - # orcid: 0000-0002-0184-6134 - # - name: "Amy Winter" - # url: https://publichealth.uga.edu/faculty-member/amy-k-winter/ - # affiliation: University of Georgia - # orcid: 0000-0003-2737-7003 - # date: "2024-07-15" sidebar: style: docked search: true @@ -34,18 +22,40 @@ website: href: references.qmd - section: "Day 1" contents: + - href: modules/Module00-Welcome.qmd + target: _blank - href: modules/Module01-Intro.qmd target: _blank - - href: modules/CaseStudy01.qmd + - href: modules/Module02-Functions.qmd + target: _blank + - href: modules/Module03-WorkingDirectories.qmd + target: _blank + - href: modules/Module04-RProject.qmd + target: _blank + - href: modules/Module05-DataImportExport.qmd + target: _blank + - href: modules/Module06-DataSubset.qmd + target: _blank + - href: modules/Module07-VarCreationClassesSummaries.qmd target: _blank - section: "Day 2" contents: - - href: modules/ModuleXX-Iteration.qmd + - href: modules/Module08-DataMergeReshape.qmd + target: _blank + - href: modules/Module09-DataAnalysis.qmd + target: _blank + - href: modules/Module095-DataAnalysisWalkthrough.qmd + target: _blank + - href: modules/Module10-DataVisualization.qmd + target: _blank + - href: modules/Module11-RMarkdown.qmd target: _blank - section: "Day 3" + contents: + - href: modules/Module13-Iteration.qmd + target: _blank repo-url: https://github.com/UGA-IDD/SISMID-2024 reader-mode: true - # favicon: ./images/sismid2023-logo-green.png # Default for table of contents toc: true @@ -73,10 +83,6 @@ author: # Default fields for citation license: "CC BY-NC" -# bibliography: SISMID-Module.bib -# nocite: | -# @* - format: html: theme: diff --git a/docs/archive/CaseStudy01.html b/docs/archive/CaseStudy01.html deleted file mode 100644 index 81d333a..0000000 --- a/docs/archive/CaseStudy01.html +++ /dev/null @@ -1,1008 +0,0 @@ - - - - - - - - - - - - - - - SISMID Module NUMBER Materials (2025) - Algorithmic Thinking Case Study 1 - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/docs/downloads/data.zip b/docs/downloads/data.zip new file mode 100644 index 0000000..a4b9c40 Binary files /dev/null and b/docs/downloads/data.zip differ diff --git a/docs/downloads/exercises.zip b/docs/downloads/exercises.zip new file mode 100644 index 0000000..d08c9d4 Binary files /dev/null and b/docs/downloads/exercises.zip differ diff --git a/docs/downloads/modules.zip b/docs/downloads/modules.zip new file mode 100644 index 0000000..f073460 Binary files /dev/null and b/docs/downloads/modules.zip differ diff --git a/docs/exercises/CaseStudy01.html b/docs/exercises/CaseStudy01.html deleted file mode 100644 index 82336f6..0000000 --- a/docs/exercises/CaseStudy01.html +++ /dev/null @@ -1,1058 +0,0 @@ - - - - - - - - - - - - - - - SISMID Module NUMBER Materials (2025) – Algorithmic Thinking Case Study 1 - - - - - - - - - - - - - - - -
-
- -
-

Algorithmic Thinking Case Study 1

-

SISMID 2024 – Introduction to R

- -
- - -
- -
-
-

Learning goals

-
    -
  • Use logical operators, subsetting functions, and math calculations in R
  • -
  • Translate human-understandable problem descriptions into instructions that R can understand.
  • -
-
-
-
-

Remember, R always does EXACTLY what you tell it to do!

- -
-
-

Instructions

-
    -
  • Make a new R script for this case study, and save it to your code folder.
  • -
  • We’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.
  • -
-
-
-

Instructions

-
    -
  • Make a new R script for this case study, and save it to your code folder.
  • -
  • We’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.
  • -
  • The str() of your dataset should look like this.
  • -
-
-
-
tibble [250 × 5] (S3: tbl_df/tbl/data.frame)
- $ age_months  : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...
- $ group       : chr [1:250] "urban" "rural" "urban" "urban" ...
- $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...
- $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...
- $ DP_vacc     : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...
-
-
-
-
-

Q1: Was the overall prevalence higher in urban or rural areas?

-
-
    -
  1. How do we calculate the prevalence from the data?
  2. -
  3. How do we calculate the prevalence separately for urban and rural areas?
  4. -
  5. How do we determine which prevalence is higher and if the difference is meaningful?
  6. -
-
-
-
-

Q1: How do we calculate the prevalence from the data?

-
-
    -
  • The variable DP_infection in our dataset is binary / dichotomous.
  • -
  • The prevalence is the proportion (or percent) of people who had the disease during the time period of interest.
  • -
  • The average of a binary variable gives the prevalence!
  • -
-
-
-
-
mean(diph$DP_infection)
-
-
[1] 0.8
-
-
-
-
-
-

Q1: How do we calculate the prevalence separately for urban and rural areas?

-
-
-
mean(diph[diph$group == "urban", ]$DP_infection)
-
-
[1] 0.8235294
-
-
mean(diph[diph$group == "rural", ]$DP_infection)
-
-
[1] 0.778626
-
-
-
-
-
    -
  • There are many ways you could write this code! You can use subset() or you can write the indices many ways.
  • -
  • Using tbl_df objects from haven uses different [[ rules than a base R data frame.
  • -
-
-
-
-

Q1: How do we calculate the prevalence separately for urban and rural areas?

-
    -
  • One easy way is to use the aggregate() function.
  • -
-
-
aggregate(DP_infection ~ group, data = diph, FUN = mean)
-
-
  group DP_infection
-1 rural    0.7786260
-2 urban    0.8235294
-
-
-
-
-

Q1: How do we determine which prevalence is higher and if the difference is meaningful?

-
-
    -
  • We probably need to include a confidence interval in our calculation.
  • -
  • This is actually not so easy without more advanced tools that we will learn in upcoming modules.
  • -
  • Right now the best options are to do it by hand or google a function.
  • -
-
-
-
-

Q1: By hand

-
-
p_urban <- mean(diph[diph$group == "urban", ]$DP_infection)
-p_rural <- mean(diph[diph$group == "rural", ]$DP_infection)
-se_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == "urban", ]))
-se_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == "rural", ])) 
-
-result_urban <- paste0(
-    "Urban: ", round(p_urban, 2), "; 95% CI: (",
-    round(p_urban - 1.96 * se_urban, 2), ", ",
-    round(p_urban + 1.96 * se_urban, 2), ")"
-)
-
-result_rural <- paste0(
-    "Rural: ", round(p_rural, 2), "; 95% CI: (",
-    round(p_rural - 1.96 * se_rural, 2), ", ",
-    round(p_rural + 1.96 * se_rural, 2), ")"
-)
-
-cat(result_urban, result_rural, sep = "\n")
-
-
Urban: 0.82; 95% CI: (0.76, 0.89)
-Rural: 0.78; 95% CI: (0.71, 0.85)
-
-
-
-
-

Q1: By hand

-
    -
  • We can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.
  • -
  • Some people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules.
  • -
-
-
-

Q1: Googling a package

-
-
# install.packages("DescTools")
-library(DescTools)
-
-aggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)
-
-
  group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci
-1 rural         0.7786260           0.7065872           0.8506647
-2 urban         0.8235294           0.7540334           0.8930254
-
-
-
-
-

Congratulations for finishing the first case study!

-
    -
  • What R functions and skills did you practice?
  • -
  • What other questions could you answer about the same dataset with the skills you know now?
  • -
- -
- -
-
-
-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/docs/images/map-floor-room.PNG b/docs/images/map-floor-room.PNG new file mode 100644 index 0000000..382beac Binary files /dev/null and b/docs/images/map-floor-room.PNG differ diff --git a/docs/images/map-floorplan.PNG b/docs/images/map-floorplan.PNG new file mode 100644 index 0000000..1875cbe Binary files /dev/null and b/docs/images/map-floorplan.PNG differ diff --git a/docs/images/map-wd.PNG b/docs/images/map-wd.PNG new file mode 100644 index 0000000..323b805 Binary files /dev/null and b/docs/images/map-wd.PNG differ diff --git a/docs/images/map.PNG b/docs/images/map.PNG new file mode 100644 index 0000000..1bbd3d8 Binary files /dev/null and b/docs/images/map.PNG differ diff --git a/docs/images/presentation4.webp b/docs/images/presentation4.webp new file mode 100644 index 0000000..eb6550f Binary files /dev/null and b/docs/images/presentation4.webp differ diff --git a/docs/images/repspectrum.JPG b/docs/images/repspectrum.JPG new file mode 100644 index 0000000..4595cad Binary files /dev/null and b/docs/images/repspectrum.JPG differ diff --git a/docs/index.html b/docs/index.html index 40e8f30..906407f 100644 --- a/docs/index.html +++ b/docs/index.html @@ -2,7 +2,7 @@ - + @@ -166,7 +166,7 @@ @@ -252,7 +331,7 @@

Page Items

Welcome to “Introduction to R”!

-

This website contains all of the slides and exercises for the 2024 Summer Institute in Modeling for Infectious Diseases (SISMID) Module “Introduction to R”.

+

This website contains all of the material for the 2024 Summer Institute in Modeling for Infectious Diseases (SISMID) Module “Introduction to R”.

Prerequisites

Familiarity with basic statistical concepts at the level of an introductory statistics class is assumed for our course

@@ -267,14 +346,14 @@

About the

-Instructor: Dr. Amy Winter +Co-Instructor: Dr. Amy Winter

Dr. Winter is an Assistant Professor of Epidemiology at the University of Georgia. She has been coding in R for 10 years, and uses R day-to-day to conduct her research addressing policy-relevant questions on the transmission and control of infectious diseases in human populations, particularly VPDs. She teaches a semester-long course titled Introduction to Coding in R for Public Health to graduate students at the University of Georgia.

-TA: Zane Billings +Co-Instructor: Zane Billings

Zane Billings is a PhD student in Epidemiology and Biostatistics at the University of Georgia, working with Andreas Handel. He has been using R since 2017, and uses R for nearly all of his statistics and data science practice. Zane’s research focuses on the immune response to influenza vaccination, and uses machine learning and multilevel regression modeling (in R!) to improve our understanding of influenza immunology.

@@ -908,7 +987,7 @@

Welcome to "Introduction to R"! -This website contains all of the slides and exercises for the [2024 +This website contains all of the material for the [2024 Summer Institute in Modeling for Infectious Diseases (SISMID) Module "Introduction to R"](https://sph.emory.edu/SISMID/modules/intro-to-r/index.html). @@ -931,7 +1010,7 @@

<div class="container"> <div class="box" style="--width: 50%"> -<h3 style="margin-top:0px; text-align: center">Instructor: [Dr. Amy Winter](https://publichealth.uga.edu/faculty-member/amy-k-winter/)</h3> +<h3 style="margin-top:0px; text-align: center">Co-Instructor: [Dr. Amy Winter](https://publichealth.uga.edu/faculty-member/amy-k-winter/)</h3> <img src="./images/amy.jpg" style="width: 100%"/> @@ -944,7 +1023,7 @@

</div> <div class="box" style="--width: 50%"> -<h3 style="margin-top:0px; text-align: center">TA: [Zane Billings](https://wzbillings.com/ )</h3> +<h3 style="margin-top:0px; text-align: center">Co-Instructor: [Zane Billings](https://wzbillings.com/ )</h3> <img src="./images/zane.jpg" style="width: 100%"/> diff --git a/docs/modules/Module00-Welcome.html b/docs/modules/Module00-Welcome.html index 1601c48..b107c9d 100644 --- a/docs/modules/Module00-Welcome.html +++ b/docs/modules/Module00-Welcome.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) - Welcome to SISMID Workshop: Introduction to R + SISMID Module NUMBER Materials (2025) – Welcome to SISMID Workshop: Introduction to R @@ -157,7 +157,8 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - margin-bottom: 0.5rem; + padding-bottom: 0.5rem; + margin-bottom: 0; } .callout.callout-titled .callout-icon::before { @@ -364,6 +365,15 @@

Introductions

  • Favorite guilty pleasure app
  • +
    +

    Course website

    +
      +
    • All of the materials for this course can be found online here: here.
    • +
    • This contains the schedule, course resources, and online versions of all of our slide decks.
    • +
    • The Course Resources page contains download links for all of the data, exercises, and slides for this class.
    • +
    • Please feel free to download these resources and share them – all of the course content is under the Creative Commons BY-NC 4.0 license.
    • +
    +

    What is R?

    -R logo
    +R logo

    What is R?

      @@ -426,7 +436,7 @@

      Is R Difficult?

    • Hadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore.
    -
    +

    Overall Workshop Objectives

    By the end of this workshop, you should be able to

      @@ -446,7 +456,7 @@

      This workshop differs from “Introduction to Tidyverse”

    1. more flexible for visualizing data
    2. -Tidyverse hex sticker
    +Tidyverse hex sticker

    Workshop Overview

    14 lecture blocks that will each:

    @@ -465,7 +475,14 @@

    Workshop Overview

    Reproducibility

    -

    xxzane slides

    +
      +
    • Reproducible research: the idea that other people should be able to verify the claims you make – usually by being able to see your data and run your code.
    • +
    + +
      +
    • 2023 was the US government’s year of open science – specific aspects of reproducibility will be mandated for federally funded research!
    • +
    • Sharing and documenting your code is a massive step towards making your work reproducible, and the R ecosystem can play a big role in that!
    • +

    Useful (+ Free) Resources

    @@ -500,8 +517,10 @@

    Installing R

  • Install RStudio
  • +
    @@ -530,7 +549,6 @@

    Installing R

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, -'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -716,43 +734,81 @@

    Installing R

    }); + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + +
    +

    This is an example R Markdown document

    +
      +
    • The top part of this document (between the ---) is +called the YAML header. You specify options here that +change the configuration of the document.
    • +
    • Text in the R Markdown body is formatted in the +Pandoc Markdown language. Most of the syntax can be found on the cheat +sheets in the references section.
    • +
    • To include a bibliography in your document, add the +bibliography option to your YAML header and include a +BIBTEX file. A bibtex file looks like this:
    • +
    +
    @Book{rmarkdown-cookbook,
    +  title = {R Markdown Cookbook},
    +  author = {Yihui Xie and Christophe Dervieux and Emily Riederer},
    +  publisher = {Chapman and Hall/CRC},
    +  address = {Boca Raton, Florida},
    +  year = {2020},
    +  isbn = {9780367563837},
    +  url = {https://bookdown.org/yihui/rmarkdown-cookbook},
    +}
    +
    +@Manual{rmarkdown-package,
    +  title = {rmarkdown: Dynamic Documents for R},
    +  author = {JJ Allaire and Yihui Xie and Christophe Dervieux and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},
    +  year = {2024},
    +  note = {R package version 2.27},
    +  url = {https://github.com/rstudio/rmarkdown},
    +}
    +
      +
    • You can then add citations from your bibliography by adding special +text in your R Markdown document: @rmarkdown-cookbook. +That’s how we can get this citation here (Xie, +Dervieux, and Riederer 2020).
    • +
    +
    +
    +
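For example (a sketch only; the title, output format, and the refs.bib file name are placeholders), the YAML header of the .Rmd could point at the bibliography like this:
---
title: "My report"
output: html_document
bibliography: refs.bib
---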

    Including R code in your Markdown document

    +

    You have to put all of your code in a “Code chunk” and tell +knitr that you are using R code.
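For reference, here is a minimal sketch of what such a chunk looks like in the raw .Rmd source (the summary(cars) call is just a placeholder):
```{r}
# Everything between the ```{r} and ``` fences is run as R code by knitr
summary(cars)
```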

    +
    meas <- readRDS(here::here("data", "measles_final.Rds"))
    +str(meas)
    +
    ## 'data.frame':    12438 obs. of  7 variables:
    +##  $ iso3c           : chr  "AFG" "AFG" "AFG" "AFG" ...
    +##  $ time            : int  1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
    +##  $ country         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
    +##  $ Cases           : int  2792 5166 2900 640 353 2012 1511 638 1154 492 ...
    +##  $ vaccine_antigen : chr  "MCV1" "MCV1" "MCV1" "MCV1" ...
    +##  $ vaccine_coverage: int  11 NA 8 9 14 14 14 31 34 22 ...
    +##  $ total_pop       : chr  "12486631" "11155195" "10088289" "9951449" ...
    +

    You can make plots and add captions in Markdown as well.

    +
    meas_plot <- subset(meas, country == "India" & vaccine_antigen == "MCV1")
    +plot(
    +    meas_plot$time, meas_plot$Cases,
    +    xlab = "Year",
    +    ylab = "Measles cases by year in India",
    +    type = "b"
    +)
    +
+Measles cases over time in India. +

+Measles cases over time in India. +

    +
    +

    Note that if you want to automatically reference your +figures like you would need to for a research paper, you will +also need to use the bookdown package, and you can read +about it here. +For this document, we would have to write out “Figure 1.” manually in +our text.

    +
    +
    +

    Including tables and figures from files

    +

    Including tables is a bit more complicated, because unlike +plot(), R cannot produce any tables on its own. Instead we +need to use another package. The easiest option is to use the +knitr package which has a function called +knitr::kable() that can make a table for us, like this.

    +
    meas_table <- data.frame(
    +    "Median cases" = median(meas_plot$Cases),
    +    "IQR cases" = IQR(meas_plot$Cases)
    +)
    +
    +knitr::kable(
    +    meas_table,
    +    caption = "Median and IQR number of measles cases across all years in India."
    +)
    + + + + + + + + + + + + + + +
    Median and IQR number of measles cases across all years in +India.
    Median.casesIQR.cases
    4707244015.5
    +

    You can also use the kableExtra package to format your +table more nicely. In general there are a lot of nice table making +packages in R, like we saw with the tinytable package in +the exercise.

    +
    tinytable::tt(meas_table)
    + + + + + + tinytable_o73g4xhg2p32dbgu79cj + + + + + + + +
    + + + + + + + + + + + + + + + +
    Median.casesIQR.cases
    4707244015.5
    +
    + + + + + + +

    Finally, if you want to include a figure that you already saved +somewhere, you can do that with knitr also.

    +
    knitr::include_graphics(here::here("images", "xkcd.png"))
    +

    +
    +
    +

    R Markdown resources

    + +
    +
    +

    References

    + + +
    +
    +Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R +Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook. +
    +
    +
    + + + + +
    + + + + + + + + + + + + + + + diff --git a/docs/modules/ModuleXX-Iteration.html b/docs/modules/Module13-Iteration.html similarity index 67% rename from docs/modules/ModuleXX-Iteration.html rename to docs/modules/Module13-Iteration.html index 8c66c94..72afa57 100644 --- a/docs/modules/ModuleXX-Iteration.html +++ b/docs/modules/Module13-Iteration.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) - Iteration in R + SISMID Module NUMBER Materials (2025) – Module 13: Iteration in R @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } + pre > code.sourceCode > span { line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; } /* ControlFlow */ + code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; } /* Keyword */ + code span.kw { color: #003b4f; font-weight: bold; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,7 +222,8 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - margin-bottom: 0.5rem; + padding-bottom: 0.5rem; + margin-bottom: 0; } .callout.callout-titled .callout-icon::before { @@ -391,7 +392,7 @@
    -

    Iteration in R

    +

    Module 13: Iteration in R

    @@ -421,16 +422,16 @@

    What is iteration?

  • In R, this means running the same code multiple times in a row.
  • -
    data("penguins", package = "palmerpenguins")
    -for (this_island in levels(penguins$island)) {
    -    island_mean <-
    -        penguins$bill_depth_mm[penguins$island == this_island] |>
    -        mean(na.rm = TRUE) |>
    -        round(digits = 2)
    -    
    -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    -                            "mm.\n"))
    -}
    +
    data("penguins", package = "palmerpenguins")
    +for (this_island in levels(penguins$island)) {
    +    island_mean <-
    +        penguins$bill_depth_mm[penguins$island == this_island] |>
    +        mean(na.rm = TRUE) |>
    +        round(digits = 2)
    +    
    +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    +                            "mm.\n"))
    +}
    The mean bill depth on Biscoe Island was 15.87 mm.
     The mean bill depth on Dream Island was 18.34 mm.
    @@ -441,37 +442,37 @@ 

    What is iteration?

    Parts of a loop

    -
    for (this_island in levels(penguins$island)) {
    -    island_mean <-
    -        penguins$bill_depth_mm[penguins$island == this_island] |>
    -        mean(na.rm = TRUE) |>
    -        round(digits = 2)
    -    
    -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    -                            "mm.\n"))
    -}
    +
    for (this_island in levels(penguins$island)) {
    +    island_mean <-
    +        penguins$bill_depth_mm[penguins$island == this_island] |>
    +        mean(na.rm = TRUE) |>
    +        round(digits = 2)
    +    
    +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    +                            "mm.\n"))
    +}

    The header declares how many times we will repeat the same code. The header contains a control variable that changes in each repetition and a sequence of values for the control variable to take.

    Parts of a loop

    -
    for (this_island in levels(penguins$island)) {
    -    island_mean <-
    -        penguins$bill_depth_mm[penguins$island == this_island] |>
    -        mean(na.rm = TRUE) |>
    -        round(digits = 2)
    -    
    -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    -                            "mm.\n"))
    -}
    +
    for (this_island in levels(penguins$island)) {
    +    island_mean <-
    +        penguins$bill_depth_mm[penguins$island == this_island] |>
    +        mean(na.rm = TRUE) |>
    +        round(digits = 2)
    +    
    +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    +                            "mm.\n"))
    +}

    The body of the loop contains code that will be repeated a number of times based on the header instructions. In R, the body has to be surrounded by curly braces.

    Header parts

    -
    for (this_island in levels(penguins$island)) {...}
    +
    for (this_island in levels(penguins$island)) {...}
    • for: keyword that declares we are doing a for loop.
    • @@ -485,12 +486,12 @@

      Header parts

      Header parts

      -
      for (this_island in levels(penguins$island)) {...}
      +
      for (this_island in levels(penguins$island)) {...}
      • Since levels(penguins$island) evaluates to c("Biscoe", "Dream", "Torgersen"), our loop will repeat 3 times.
      - +
      @@ -519,13 +520,13 @@

      Header parts

      Loop iteration 1

      -
      island_mean <-
      -    penguins$bill_depth_mm[penguins$island == "Biscoe"] |>
      -    mean(na.rm = TRUE) |>
      -    round(digits = 2)
      -
      -cat(paste("The mean bill depth on", "Biscoe", "Island was", island_mean,
      -                    "mm.\n"))
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Biscoe"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +cat(paste("The mean bill depth on", "Biscoe", "Island was", island_mean,
      +                    "mm.\n"))
      The mean bill depth on Biscoe Island was 15.87 mm.
      @@ -534,13 +535,13 @@

      Loop iteration 1

      Loop iteration 2

      -
      island_mean <-
      -    penguins$bill_depth_mm[penguins$island == "Dream"] |>
      -    mean(na.rm = TRUE) |>
      -    round(digits = 2)
      -
      -cat(paste("The mean bill depth on", "Dream", "Island was", island_mean,
      -                    "mm.\n"))
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Dream"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +cat(paste("The mean bill depth on", "Dream", "Island was", island_mean,
      +                    "mm.\n"))
      The mean bill depth on Dream Island was 18.34 mm.
      @@ -549,13 +550,13 @@

      Loop iteration 2

      Loop iteration 3

      -
      island_mean <-
      -    penguins$bill_depth_mm[penguins$island == "Torgersen"] |>
      -    mean(na.rm = TRUE) |>
      -    round(digits = 2)
      -
      -cat(paste("The mean bill depth on", "Torgersen", "Island was", island_mean,
      -                    "mm.\n"))
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Torgersen"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +cat(paste("The mean bill depth on", "Torgersen", "Island was", island_mean,
      +                    "mm.\n"))
      The mean bill depth on Torgersen Island was 18.43 mm.
      @@ -564,15 +565,15 @@

      Loop iteration 3

      The loop structure automates this process for us so we don’t have to copy and paste our code!

      -
      for (this_island in levels(penguins$island)) {
      -    island_mean <-
      -        penguins$bill_depth_mm[penguins$island == this_island] |>
      -        mean(na.rm = TRUE) |>
      -        round(digits = 2)
      -    
      -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
      -                            "mm.\n"))
      -}
      +
      for (this_island in levels(penguins$island)) {
      +    island_mean <-
      +        penguins$bill_depth_mm[penguins$island == this_island] |>
      +        mean(na.rm = TRUE) |>
      +        round(digits = 2)
      +    
      +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
      +                            "mm.\n"))
      +}
      The mean bill depth on Biscoe Island was 15.87 mm.
       The mean bill depth on Dream Island was 18.34 mm.
      @@ -580,6 +581,44 @@ 

      The loop structure automates this process for us so we don’t have to copy

      +
      +

      Side note: the pipe operator |>

      +
        +
      • This operator allows us to chain commands together so the output of the previous statement is passed into the next statement.
      • +
      • E.g. the code
      • +
      +
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Torgersen"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +

      will be transformed by R into

      +
      +
      island_mean <-
      +    round(
      +        mean(
      +            penguins$bill_depth_mm[penguins$island == "Torgersen"],
      +            na.rm = TRUE
      +        ),
      +        digits = 2
      +    )
      +
      +

      before it gets run. So using the pipe is a way to avoid deeply nested functions.

      +

Note that another alternative could look like this:

      +
      +
      island_data <- penguins$bill_depth_mm[penguins$island == "Torgersen"]
      +island_mean_raw <- mean(island_data, na.rm = TRUE)
      +island_mean <- round(island_mean_raw, digits = 2)
      +
      +

So using |> can also help us avoid a lot of intermediate assignments.

      +
        +
      • Whichever style you prefer is fine! Some people like the pipe, some people like nesting, and some people like intermediate assignments. All three are perfectly fine as long as your code is neat and commented.
      • +
      • If you go on to the tidyverse class, you will use a lot of piping – it is a very popular coding style in R these days thanks to the inventors of the tidyverse packages.
      • +
      • Also note that you need R version 4.1.0 or better to use |>. If you are on an older version of R, it will not be available.
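If you are not sure which version you are running, you can check from the R console (a minimal sketch using only base R):
R.version.string
getRversion() >= "4.1.0"  # TRUE means the native pipe |> is available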
      • +
      +

      Now, back to loops!

      +

      Remember: write DRY code!

        @@ -602,9 +641,9 @@

        You try it!

        Write a loop that goes from 1 to 10, squares each of the numbers, and prints the squared number.

        -
        for (i in 1:10) {
        -    cat(i ^ 2, "\n")
        -}
        +
        for (i in 1:10) {
        +    cat(i ^ 2, "\n")
        +}
        1 
         4 
        @@ -628,22 +667,7 @@ 

        Wait, did we need to do that?

      • Almost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.
      -
      -

      Wait, did we need to do that?

      -
        -
      • Well, yes, because you need to practice loops!
      • -
      • But technically no, because we can use vectorization.
      • -
      • Almost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.
      • -
      -
      -
      # No loop needed!
      -(1:10)^2
      -
      -
       [1]   1   4   9  16  25  36  49  64  81 100
      -
      -
      -
      -
      +

      Wait, did we need to do that?

      • Well, yes, because you need to practice loops!
      • @@ -651,22 +675,26 @@

        Wait, did we need to do that?

      • Almost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.
      -
      # No loop needed!
      -(1:10)^2
      +
      # No loop needed!
      +(1:10)^2
       [1]   1   4   9  16  25  36  49  64  81 100
      +
      -
      # Get the first 10 odd numbers, a common CS 101 loop problem on exams
      -(1:20)[which((1:20 %% 2) == 1)]
      +
      # Get the first 10 odd numbers, a common CS 101 loop problem on exams
      +(1:20)[which((1:20 %% 2) == 1)]
       [1]  1  3  5  7  9 11 13 15 17 19
      +
      +
      • So you should really try vectorization first, then use loops only when you can’t use vectorization.
      +

      Loop walkthrough

      @@ -676,9 +704,9 @@

      Loop walkthrough

      -
      meas <- readRDS(here::here("data", "measles_final.Rds")) |>
      -    subset(vaccine_antigen == "MCV1")
      -str(meas)
      +
      meas <- readRDS(here::here("data", "measles_final.Rds")) |>
      +    subset(vaccine_antigen == "MCV1")
      +str(meas)
      'data.frame':   7972 obs. of  7 variables:
        $ iso3c           : chr  "AFG" "AFG" "AFG" "AFG" ...
      @@ -699,7 +727,7 @@ 

      Loop walkthrough

      -
      res <- vector(mode = "list", length = length(unique(meas$country)))
      +
      res <- vector(mode = "list", length = length(unique(meas$country)))
      • This is called preallocation and it can make your loops much faster.
      • @@ -714,8 +742,8 @@

        Loop walkthrough

      -
      countries <- unique(meas$country)
      -for (i in 1:length(countries)) {...}
      +
      countries <- unique(meas$country)
      +for (i in 1:length(countries)) {...}
      @@ -731,10 +759,10 @@

      Loop walkthrough

      -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +}
      @@ -744,16 +772,17 @@

      Loop walkthrough

      -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -    
      -    # Get the summary statistics for this country
      -    country_cases <- country_data$Cases
      -    country_med <- median(country_cases, na.rm = TRUE)
      -    country_iqr <- IQR(country_cases, na.rm = TRUE)
      -    country_range <- range(country_cases, na.rm = TRUE)
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +    
      +    # Get the summary statistics for this country
      +    country_cases <- country_data$Cases
      +    country_quart <- quantile(
      +        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      +    )
      +    country_range <- range(country_cases, na.rm = TRUE)
      +}
      @@ -761,27 +790,27 @@

      Loop walkthrough

    • Next we save the summary statistics into a data frame.
    • -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -    
      -    # Get the summary statistics for this country
      -    country_cases <- country_data$Cases
      -    country_quart <- quantile(
      -        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      -    )
      -    country_range <- range(country_cases, na.rm = TRUE)
      -    
      -    # Save the summary statistics into a data frame
      -    country_summary <- data.frame(
      -        country = countries[[i]],
      -        min = country_range[[1]],
      -        Q1 = country_quart[[1]],
      -        median = country_quart[[2]],
      -        Q3 = country_quart[[3]],
      -        max = country_range[[2]]
      -    )
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +    
      +    # Get the summary statistics for this country
      +    country_cases <- country_data$Cases
      +    country_quart <- quantile(
      +        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      +    )
      +    country_range <- range(country_cases, na.rm = TRUE)
      +    
      +    # Save the summary statistics into a data frame
      +    country_summary <- data.frame(
      +        country = countries[[i]],
      +        min = country_range[[1]],
      +        Q1 = country_quart[[1]],
      +        median = country_quart[[2]],
      +        Q3 = country_quart[[3]],
      +        max = country_range[[2]]
      +    )
      +}
      @@ -789,30 +818,30 @@

      Loop walkthrough

    • And finally, we save the data frame as the next element in our storage list.
    • -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -    
      -    # Get the summary statistics for this country
      -    country_cases <- country_data$Cases
      -    country_quart <- quantile(
      -        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      -    )
      -    country_range <- range(country_cases, na.rm = TRUE)
      -    
      -    # Save the summary statistics into a data frame
      -    country_summary <- data.frame(
      -        country = countries[[i]],
      -        min = country_range[[1]],
      -        Q1 = country_quart[[1]],
      -        median = country_quart[[2]],
      -        Q3 = country_quart[[3]],
      -        max = country_range[[2]]
      -    )
      -    
      -    # Save the results to our container
      -    res[[i]] <- country_summary
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +    
      +    # Get the summary statistics for this country
      +    country_cases <- country_data$Cases
      +    country_quart <- quantile(
      +        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      +    )
      +    country_range <- range(country_cases, na.rm = TRUE)
      +    
      +    # Save the summary statistics into a data frame
      +    country_summary <- data.frame(
      +        country = countries[[i]],
      +        min = country_range[[1]],
      +        Q1 = country_quart[[1]],
      +        median = country_quart[[2]],
      +        Q3 = country_quart[[3]],
      +        max = country_range[[2]]
      +    )
      +    
      +    # Save the results to our container
      +    res[[i]] <- country_summary
      +}
      Warning in min(x): no non-missing arguments to min; returning Inf
      @@ -838,7 +867,7 @@

      Loop walkthrough

    • Let’s take a look at the results.
    • -
      head(res)
      +
      head(res)
      [[1]]
             country min   Q1 median   Q3   max
      @@ -874,10 +903,10 @@ 

      Loop walkthrough

    • We can use a vectorization trick: the function do.call() seems like ancient computer science magic. And it is. But it will actually help us a lot.
    • -
      res_df <- do.call(rbind, res)
      -head(res_df)
      +
      res_df <- do.call(rbind, res)
      +head(res_df)
      -
      Iteration
      +
      @@ -947,7 +976,7 @@

      Loop walkthrough

      -
      ?rbind
      +
      ?rbind
      Combine R Objects by Rows or Columns
       
      @@ -1081,8 +1110,8 @@ 

      Loop walkthrough

      Factors have their levels expanded as necessary (in the order of the levels of the level sets of the factors encountered) and the result is an ordered factor if and only if all the components were - ordered factors. (The last point differs from S-PLUS.) Old-style - categories (integer vectors with levels) are promoted to factors. + ordered factors. Old-style categories (integer vectors with + levels) are promoted to factors. Note that for result column 'j', 'factor(., exclude = X(j))' is applied, where @@ -1166,7 +1195,7 @@

      Loop walkthrough

      -
      ?do.call
      +
      ?do.call
      Execute a Function Call
       
      @@ -1263,13 +1292,13 @@ 

      Loop walkthrough

    • OK, so basically what happened is that
    • -
      do.call(rbind, list)
      +
      do.call(rbind, list)
      • Gets transformed into
      -
      rbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])
      +
      rbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])
      • That’s vectorization magic!
      • @@ -1282,79 +1311,79 @@

        You try it! (if we have time)

• Use the code you wrote before to get the incidence per 1000 people on the entire measles data set (add a column for incidence to the full data).
      • Use the code plot(NULL, NULL, ...) to make a blank plot. You will need to set the xlim and ylim arguments to sensible values, and specify the axis titles as “Year” and “Incidence per 1000 people”.
      • Using a for loop and the lines() function, make a plot that shows all of the incidence curves over time, overlapping on the plot.
      • -
      • HINT: use col = adjustcolor(black, alpha.f = 0.25) to make the curves transparent, so you can see the others.
      • -
      • BONUS PROBLEM: using the function cumsum(), make a plot of the cumulative incidence per 1000 people over time for all of the countries. (Dealing with the NA’s here is tricky!!)
      • +
• HINT: use col = adjustcolor("black", alpha.f = 0.25) to make the curves partially transparent, so you can see the overlap.
      • +
      • BONUS PROBLEM: using the function cumsum(), make a plot of the cumulative cases (not standardized) over time for all of the countries. (Dealing with the NA’s here is tricky!!)

      Main problem solution

      -
      meas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000
      -countries <- unique(meas$country)
      -
      -plot(
      -    NULL, NULL,
      -    xlim = c(1980, 2022),
      -    ylim = c(0, 50),
      -    xlab = "Year",
      -    ylab = "Incidence per 1000 people"
      -)
      -
      -for (i in 1:length(countries)) {
      -    country_data <- subset(meas, country == countries[[i]])
      -    lines(
      -        x = country_data$time,
      -        y = country_data$cases_per_thousand,
      -        col = adjustcolor("black", alpha.f = 0.25)
      -    )
      -}
      +
      meas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000
      +countries <- unique(meas$country)
      +
      +plot(
      +    NULL, NULL,
      +    xlim = c(1980, 2022),
      +    ylim = c(0, 50),
      +    xlab = "Year",
      +    ylab = "Incidence per 1000 people"
      +)
      +
      +for (i in 1:length(countries)) {
      +    country_data <- subset(meas, country == countries[[i]])
      +    lines(
      +        x = country_data$time,
      +        y = country_data$cases_per_thousand,
      +        col = adjustcolor("black", alpha.f = 0.25)
      +    )
      +}

      Main problem solution

      -
      +

      Bonus problem solution

      -
      # First calculate the cumulative cases, treating NA as zeroes
      -cumulative_cases <- ave(
      -    x = ifelse(is.na(meas$Cases), 0, meas$Cases),
      -    meas$country,
      -    FUN = cumsum
      -)
      -
      -# Now put the NAs back where they should be
      -meas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)
      -
      -plot(
      -    NULL, NULL,
      -    xlim = c(1980, 2022),
      -    ylim = c(1, 6.2e6),
      -    xlab = "Year",
      -    ylab = "Cumulative cases per 1000 people"
      -)
      -
      -for (i in 1:length(countries)) {
      -    country_data <- subset(meas, country == countries[[i]])
      -    lines(
      -        x = country_data$time,
      -        y = country_data$cumulative_cases,
      -        col = adjustcolor("black", alpha.f = 0.25)
      -    )
      -}
      -
      -text(
      -    x = 2020,
      -    y = 6e6,
      -    labels = "China →"
      -)
      +
      # First calculate the cumulative cases, treating NA as zeroes
      +cumulative_cases <- ave(
      +    x = ifelse(is.na(meas$Cases), 0, meas$Cases),
      +    meas$country,
      +    FUN = cumsum
      +)
      +
      +# Now put the NAs back where they should be
      +meas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)
      +
      +plot(
      +    NULL, NULL,
      +    xlim = c(1980, 2022),
      +    ylim = c(1, 6.2e6),
      +    xlab = "Year",
      +    ylab = paste0("Cumulative cases since", min(meas$time))
      +)
      +
      +for (i in 1:length(countries)) {
      +    country_data <- subset(meas, country == countries[[i]])
      +    lines(
      +        x = country_data$time,
      +        y = country_data$cumulative_cases,
      +        col = adjustcolor("black", alpha.f = 0.25)
      +    )
      +}
      +
      +text(
      +    x = 2020,
      +    y = 6e6,
      +    labels = "China →"
      +)

      Bonus problem solution

      -
      +

      More practice on your own

        @@ -1362,8 +1391,10 @@

        More practice on your own

      • Assess the impact of age_months as a confounder in the Diphtheria serology data. First, write code to transform age_months into age ranges for each year. Then, using a loop, calculate the crude odds ratio for the effect of vaccination on infection for each of the age ranges. How does the odds ratio change as age increases? Can you formalize this analysis by fitting a logistic regression model with age_months and vaccination as predictors?
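One possible sketch of that final step, assuming the diphtheria data are loaded as diph with the DP_infection, DP_vacc, and age_months columns used earlier in the course:
fit <- glm(DP_infection ~ age_months + DP_vacc, data = diph, family = binomial())
summary(fit)
exp(coef(fit))  # coefficients on the odds ratio scale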
      +
      @@ -1392,7 +1423,6 @@

      More practice on your own

      Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, -'smaller': false, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -1578,43 +1608,81 @@

      More practice on your own

      }); + + +
      diff --git a/docs/schedule.html b/docs/schedule.html index ab5d63b..2607d73 100644 --- a/docs/schedule.html +++ b/docs/schedule.html @@ -2,7 +2,7 @@ - + @@ -166,7 +166,7 @@
      @@ -287,32 +366,56 @@

      Day 01 – Monday

      - - + + - + + + + + + + + + - - + + + + + + + + + + - - + + - + + + + + + + + + - - + + @@ -336,36 +439,64 @@

      Day 02 – Tuesday

      - - + + - + + + + + + + + + - - + + + + + + + + + + - - + + + + + + + + + + - + - - + + - - + + + + + +
      country
      08:30 am - 10:00 amcontent08:30 am - 09:00 amModule 0 (Amy and Zane)
      10:00 am - 10:15 am09:00 am - 10:00 amModule 1 (Amy)
      10:00 am - 10:30 am Coffee break
      10:30 am - 11:15 amModule 2 (Amy)
      10:30 am - 12:00 pmcontent11:15 am - 11:30 amModule 3 (Zane)
      11:30 am - 12:00 pmModule 4 (Zane)
      12:00 pm - 01:30 pm Lunch (2nd floor lobby)
      01:30 pm - 02:15 pmModule 5 (Amy)
      01:30 pm - 03:00 pmcontent02:15 pm - 02:45 pmExercise 1
      03:00 pm - 03:15 pm02:45 pm - 03:00 pmStart Module 6 (Amy)
      03:00 pm - 03:30 pm Coffee break
      03:30 pm - 04:00 pmFinish Module 6 (Amy or Zane)
      03:00 pm - 05:00 pmcontent04:00 pm - 05:00 pmModule 7, exercise 2 in remaining time (Zane)
      05:00 pm - 07:00 pm
      08:30 am - 10:00 amcontent08:30 am - 09:00 amexercise review and questions / catchup
      10:00 am - 10:15 am09:00 am - 09:15 amModule 8
      09:15 am - 10:00 amExercise 3 work time
      10:00 am - 10:30 am Coffee break
      10:30 am - 12:00 pmcontent10:30 am - 10:45 amExercise review
      10:45 am - 11:15 amModule 9
      11:15 am - 12:00 pmData analysis walkthrough
      12:00 pm - 01:30 pm Lunch (2nd floor lobby); Lunch and Learn!
      01:30 pm - 03:00 pmcontent01:30 pm - 02:00 pmExercise 4
      02:00 pm - 02:30 pmExercise 4 review
      02:30 pm - 03:00 pmModule 10
      03:00 pm - 03:15 pm03:00 pm - 03:30 pm Coffee break
      03:00 pm - 05:00 pmcontent03:30 pm - 04:00 pmExercise 5
      05:00 pm - 07:00 pmNetworking night and poster session, Randal Rollins P0104:00 pm - 04:30 pmReview exercise 5
      04:30 pm - 05:00 pmModule 11
      @@ -386,7 +517,7 @@

      Day 03 – Wednesday

      08:30 am - 10:00 am -content +tbd; Modules 12 (Amy) and 13 (Zane) 10:00 am - 10:15 am @@ -394,7 +525,7 @@

      Day 03 – Wednesday

      10:30 am - 12:00 pm -content +tbd; Module 14, practice, questions, review @@ -1040,42 +1171,55 @@

      Day 03 – Wednesday

      | Time | Section | |:--------------------|:--------| -| 08:30 am - 10:00 am | content | -| 10:00 am - 10:15 am | Coffee break | -| 10:30 am - 12:00 pm | content | -| 12:00 pm - 01:30 pm | Lunch (2nd floor lobby) | -| 01:30 pm - 03:00 pm | content | -| 03:00 pm - 03:15 pm | Coffee break | -| 03:00 pm - 05:00 pm | content | -| 05:00 pm - 07:00 pm | **Networking night** and poster session, Randal Rollins P01 | - -: {.striped .hover tbl-colwidths="[25,75]"} - - -## Day 02 – Tuesday - -| Time | Section | -|:--------------------|:--------| -| 08:30 am - 10:00 am | content | -| 10:00 am - 10:15 am | Coffee break | -| 10:30 am - 12:00 pm | content | -| 12:00 pm - 01:30 pm | Lunch (2nd floor lobby); **Lunch and Learn!** | -| 01:30 pm - 03:00 pm | content | -| 03:00 pm - 03:15 pm | Coffee break | -| 03:00 pm - 05:00 pm | content | -| 05:00 pm - 07:00 pm | Networking night and poster session, Randal Rollins P01 | - -: {.striped .hover tbl-colwidths="[25,75]"} - -## Day 03 – Wednesday - -| Time | Section | -|:--------------------|:--------| -| 08:30 am - 10:00 am | content | -| 10:00 am - 10:15 am | Coffee break | -| 10:30 am - 12:00 pm | content | - -: {.striped .hover tbl-colwidths="[25,75]"}
    +| 08:30 am - 09:00 am | Module 0 (Amy and Zane) | +| 09:00 am - 10:00 am | Module 1 (Amy) | +| 10:00 am - 10:30 am | Coffee break | +| 10:30 am - 11:15 am | Module 2 (Amy) | +| 11:15 am - 11:30 am | Module 3 (Zane) | +| 11:30 am - 12:00 pm | Module 4 (Zane) | +| 12:00 pm - 01:30 pm | Lunch (2nd floor lobby) | +| 01:30 pm - 02:15 pm | Module 5 (Amy) | +| 02:15 pm - 02:45 pm | Exercise 1| +| 02:45 pm - 03:00 pm | Start Module 6 (Amy) | +| 03:00 pm - 03:30 pm | Coffee break | +| 03:30 pm - 04:00 pm | Finish Module 6 (Amy or Zane) | +| 04:00 pm - 05:00 pm | Module 7, exercise 2 in remaining time (Zane) | +| 05:00 pm - 07:00 pm | **Networking night** and poster session, Randal Rollins P01 | + +: {.striped .hover tbl-colwidths="[25,75]"} + + +## Day 02 – Tuesday + +| Time | Section | +|:--------------------|:--------| +| 08:30 am - 09:00 am | exercise review and questions / catchup | +| 09:00 am - 09:15 am | Module 8 | +| 09:15 am - 10:00 am | Exercise 3 work time | +| 10:00 am - 10:30 am | Coffee break | +| 10:30 am - 10:45 am | Exercise review | +| 10:45 am - 11:15 am | Module 9 | +| 11:15 am - 12:00 pm | Data analysis walkthrough | +| 12:00 pm - 01:30 pm | Lunch (2nd floor lobby); **Lunch and Learn!** | +| 01:30 pm - 02:00 pm | Exercise 4 | +| 02:00 pm - 02:30 pm | Exercise 4 review | +| 02:30 pm - 03:00 pm | Module 10 | +| 03:00 pm - 03:30 pm | Coffee break | +| 03:30 pm - 04:00 pm | Exercise 5 | +| 04:00 pm - 04:30 pm | Review exercise 5 | +| 04:30 pm - 05:00 pm | Module 11 | + +: {.striped .hover tbl-colwidths="[25,75]"} + +## Day 03 – Wednesday + +| Time | Section | +|:--------------------|:--------| +| 08:30 am - 10:00 am | tbd; Modules 12 (Amy) and 13 (Zane) | +| 10:00 am - 10:15 am | Coffee break | +| 10:30 am - 12:00 pm | tbd; Module 14, practice, questions, review | + +: {.striped .hover tbl-colwidths="[25,75]"}
    diff --git a/docs/search.json b/docs/search.json index 821fe9e..6274756 100644 --- a/docs/search.json +++ b/docs/search.json @@ -1,1006 +1,872 @@ [ { - "objectID": "schedule.html", - "href": "schedule.html", - "title": "Course Schedule", - "section": "", - "text": "Meeting times:\nLocation: Randal Rollins Building (RR) 201, Emory University", + "objectID": "modules/Module13-Iteration.html#learning-goals", + "href": "modules/Module13-Iteration.html#learning-goals", + "title": "Module 13: Iteration in R", + "section": "Learning goals", + "text": "Learning goals\n\nReplace repetitive code with a for loop\nUse vectorization to replace unnecessary loops", "crumbs": [ - "Course Schedule" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "schedule.html#day-01-monday", - "href": "schedule.html#day-01-monday", - "title": "Course Schedule", - "section": "Day 01 – Monday", - "text": "Day 01 – Monday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 10:00 am\ncontent\n\n\n10:00 am - 10:15 am\nCoffee break\n\n\n10:30 am - 12:00 pm\ncontent\n\n\n12:00 pm - 01:30 pm\nLunch (2nd floor lobby)\n\n\n01:30 pm - 03:00 pm\ncontent\n\n\n03:00 pm - 03:15 pm\nCoffee break\n\n\n03:00 pm - 05:00 pm\ncontent\n\n\n05:00 pm - 07:00 pm\nNetworking night and poster session, Randal Rollins P01", + "objectID": "modules/Module13-Iteration.html#what-is-iteration", + "href": "modules/Module13-Iteration.html#what-is-iteration", + "title": "Module 13: Iteration in R", + "section": "What is iteration?", + "text": "What is iteration?\n\nWhenever you repeat something, that’s iteration.\nIn R, this means running the same code multiple times in a row.\n\n\ndata(\"penguins\", package = \"palmerpenguins\")\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.", "crumbs": [ - "Course Schedule" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "schedule.html#day-02-tuesday", - "href": "schedule.html#day-02-tuesday", - "title": "Course Schedule", - "section": "Day 02 – Tuesday", - "text": "Day 02 – Tuesday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 10:00 am\ncontent\n\n\n10:00 am - 10:15 am\nCoffee break\n\n\n10:30 am - 12:00 pm\ncontent\n\n\n12:00 pm - 01:30 pm\nLunch (2nd floor lobby); Lunch and Learn!\n\n\n01:30 pm - 03:00 pm\ncontent\n\n\n03:00 pm - 03:15 pm\nCoffee break\n\n\n03:00 pm - 05:00 pm\ncontent\n\n\n05:00 pm - 07:00 pm\nNetworking night and poster session, Randal Rollins P01", + "objectID": "modules/Module13-Iteration.html#parts-of-a-loop", + "href": "modules/Module13-Iteration.html#parts-of-a-loop", + "title": "Module 13: Iteration in R", + "section": "Parts of a loop", + "text": "Parts of a loop\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe header declares how many times we will repeat the same code. 
The header contains a control variable that changes in each repetition and a sequence of values for the control variable to take.", "crumbs": [ - "Course Schedule" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "schedule.html#day-03-wednesday", - "href": "schedule.html#day-03-wednesday", - "title": "Course Schedule", - "section": "Day 03 – Wednesday", - "text": "Day 03 – Wednesday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 10:00 am\ncontent\n\n\n10:00 am - 10:15 am\nCoffee break\n\n\n10:30 am - 12:00 pm\ncontent", + "objectID": "modules/Module13-Iteration.html#parts-of-a-loop-1", + "href": "modules/Module13-Iteration.html#parts-of-a-loop-1", + "title": "Module 13: Iteration in R", + "section": "Parts of a loop", + "text": "Parts of a loop\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe body of the loop contains code that will be repeated a number of times based on the header instructions. In R, the body has to be surrounded by curly braces.", "crumbs": [ - "Course Schedule" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "modules/ModuleXX-Iteration.html#learning-goals", - "href": "modules/ModuleXX-Iteration.html#learning-goals", - "title": "Iteration in R", - "section": "Learning goals", - "text": "Learning goals\n\nReplace repetitive code with a for loop\nUse vectorization to replace unnecessary loops" - }, - { - "objectID": "index.html", - "href": "index.html", - "title": "Welcome", - "section": "", - "text": "Welcome to “Introduction to R”!\nThis website contains all of the slides and exercises for the 2024 Summer Institute in Modeling for Infectious Diseases (SISMID) Module “Introduction to R”.", + "objectID": "modules/Module13-Iteration.html#header-parts", + "href": "modules/Module13-Iteration.html#header-parts", + "title": "Module 13: Iteration in R", + "section": "Header parts", + "text": "Header parts\n\nfor (this_island in levels(penguins$island)) {...}\n\n\nfor: keyword that declares we are doing a for loop.\n(...): parentheses after for declare the control variable and sequence.\nthis_island: the control variable.\nin: keyword that separates the control varibale and sequence.\nlevels(penguins$island): the sequence.\n{}: curly braces will contain the body code.", "crumbs": [ - "Welcome!" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "index.html#prerequisities", - "href": "index.html#prerequisities", - "title": "Welcome", - "section": "Prerequisities", - "text": "Prerequisities\nFamiliary with basic statistical concepts on the level of an introductory statistics class is assumed for our course\nBefore the course begins, you should install R and RStudio on your laptop. If you are using an older version of R, you should update it before the course begins. 
You will need at least R version 4.3.0 for this course, but using the most recent version (4.4.1 at the time of writing) is always preferable.\n\nYou can install R from the CRAN website by clicking on the correct download link for your OS.\nYou can install RStudio from the Posit website.", + "objectID": "modules/Module13-Iteration.html#header-parts-1", + "href": "modules/Module13-Iteration.html#header-parts-1", + "title": "Module 13: Iteration in R", + "section": "Header parts", + "text": "Header parts\n\nfor (this_island in levels(penguins$island)) {...}\n\n\nSince levels(penguins$island) evaluates to c(\"Biscoe\", \"Dream\", \"Torgersen\"), our loop will repeat 3 times.\n\n\n\n\nIteration\nthis_island\n\n\n\n\n1\n“Biscoe”\n\n\n2\n“Dream”\n\n\n3\n“Torgersen”\n\n\n\n\nEverything inside of {...} will be repeated three times.", "crumbs": [ - "Welcome!" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "index.html#about-the-instructors", - "href": "index.html#about-the-instructors", - "title": "Welcome", - "section": "About the instructors", - "text": "About the instructors\n\n\n\nInstructor: Dr. Amy Winter\n\n\nDr. Winter is an Assistant Professor of Epidemiology at the University of Georgia. She has been coding in R for 10 years, and uses R day-to-day to conduct her research addressing policy-relevant questions on the transmission and control of infectious diseases in human populations, particularly VPDs. She teaches a semester-long course titled Introduction to Coding in R for Public Health to graduate students at the University of Georgia.\n\n\n\nTA: Zane Billings\n\n\nZane Billings is a PhD student in Epidemiology and Biostatistics at the University of Georgia, working with Andreas Handel. He has been using R since 2017, and uses R for nearly all of his statistics and data science practice. Zane’s research focuses on the immune response to influenza vaccination, and uses machine learning and multilevel regression modeling (in R!) to improve our understanding of influenza immunology.", + "objectID": "modules/Module13-Iteration.html#loop-iteration-1", + "href": "modules/Module13-Iteration.html#loop-iteration-1", + "title": "Module 13: Iteration in R", + "section": "Loop iteration 1", + "text": "Loop iteration 1\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Biscoe\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Biscoe\", \"Island was\", island_mean,\n \"mm.\\n\"))\n\nThe mean bill depth on Biscoe Island was 15.87 mm.", "crumbs": [ - "Welcome!" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "modules/Module01-Intro.html", - "href": "modules/Module01-Intro.html", - "title": "Intro to Modeling", - "section": "", - "text": "ReuseCC BY-NC 4.0", + "objectID": "modules/Module13-Iteration.html#loop-iteration-2", + "href": "modules/Module13-Iteration.html#loop-iteration-2", + "title": "Module 13: Iteration in R", + "section": "Loop iteration 2", + "text": "Loop iteration 2\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Dream\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Dream\", \"Island was\", island_mean,\n \"mm.\\n\"))\n\nThe mean bill depth on Dream Island was 18.34 mm.", "crumbs": [ - "Day 1", - "Intro to Modeling" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "references.html", - "href": "references.html", - "title": "", - "section": "", - "text": "Code\n\n\n\n\n\nReferences\n\n\nMatloff, Norman. 2011. The Art of R Programming. 
San Francisco, CA: No Starch Press.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. Sebastopol, CA: O’Reilly Media.\n\n\n\n\n\n\n\n\nReuseCC BY-NC 4.0", + "objectID": "modules/Module13-Iteration.html#loop-iteration-3", + "href": "modules/Module13-Iteration.html#loop-iteration-3", + "title": "Module 13: Iteration in R", + "section": "Loop iteration 3", + "text": "Loop iteration 3\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Torgersen\", \"Island was\", island_mean,\n \"mm.\\n\"))\n\nThe mean bill depth on Torgersen Island was 18.43 mm.", "crumbs": [ - "More References" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "exercises/CaseStudy01.html#learning-goals", - "href": "exercises/CaseStudy01.html#learning-goals", - "title": "Algorithmic Thinking Case Study 1", - "section": "Learning goals", - "text": "Learning goals\n\nUse logical operators, subsetting functions, and math calculations in R\nTranslate human-understandable problem descriptions into instructions that R can understand." + "objectID": "modules/Module13-Iteration.html#the-loop-structure-automates-this-process-for-us-so-we-dont-have-to-copy-and-paste-our-code", + "href": "modules/Module13-Iteration.html#the-loop-structure-automates-this-process-for-us-so-we-dont-have-to-copy-and-paste-our-code", + "title": "Module 13: Iteration in R", + "section": "The loop structure automates this process for us so we don’t have to copy and paste our code!", + "text": "The loop structure automates this process for us so we don’t have to copy and paste our code!\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#instructions", - "href": "exercises/CaseStudy01.html#instructions", - "title": "Algorithmic Thinking Case Study 1", - "section": "Instructions", - "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it." + "objectID": "modules/Module13-Iteration.html#side-note-the-pipe-operator", + "href": "modules/Module13-Iteration.html#side-note-the-pipe-operator", + "title": "Module 13: Iteration in R", + "section": "Side note: the pipe operator |>", + "text": "Side note: the pipe operator |>\n\nThis operator allows us to chain commands together so the output of the previous statement is passed into the next statement.\nE.g. the code\n\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\nwill be transformed by R into\n\nisland_mean <-\n round(\n mean(\n penguins$bill_depth_mm[penguins$island == \"Torgersen\"],\n na.rm = TRUE\n ),\n digits = 2\n )\n\nbefore it gets run. 
So using the pipe is a way to avoid deeply nested functions.\nNote that another alernative could be like this:\n\nisland_data <- penguins$bill_depth_mm[penguins$island == \"Torgersen\"]\nisland_mean_raw <- mean(island_data, na.rm = TRUE)\nisland_mean <- round(island_mean_raw, digits = 2)\n\nSo using |> can also help us to avoid a lot of assignments.\n\nWhichever style you prefer is fine! Some people like the pipe, some people like nesting, and some people like intermediate assignments. All three are perfectly fine as long as your code is neat and commented.\nIf you go on to the tidyverse class, you will use a lot of piping – it is a very popular coding style in R these days thanks to the inventors of the tidyverse packages.\nAlso note that you need R version 4.1.0 or better to use |>. If you are on an older version of R, it will not be available.\n\nNow, back to loops!", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#instructions-1", - "href": "exercises/CaseStudy01.html#instructions-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Instructions", - "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.\nThe str() of your dataset should look like this.\n\n\n\ntibble [250 × 5] (S3: tbl_df/tbl/data.frame)\n $ age_months : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...\n $ group : chr [1:250] \"urban\" \"rural\" \"urban\" \"urban\" ...\n $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...\n $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...\n $ DP_vacc : num [1:250] 0 1 1 1 1 1 1 1 1 1 ..." + "objectID": "modules/Module13-Iteration.html#remember-write-dry-code", + "href": "modules/Module13-Iteration.html#remember-write-dry-code", + "title": "Module 13: Iteration in R", + "section": "Remember: write DRY code!", + "text": "Remember: write DRY code!\n\nDRY = “Don’t Repeat Yourself”\nInstead of copying and pasting, write loops and functions.\nEasier to debug and change in the future!\n\n\n\nOf course, we all copy and paste code sometimes. If you are running on a tight deadline or can’t get a loop or function to work, you might need to. DRY code is good, but working code is best!", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "modules/Module01-Intro.html#welcome-to-class", - "href": "modules/Module01-Intro.html#welcome-to-class", - "title": "Intro to Modeling", - "section": "Welcome to class!", - "text": "Welcome to class!\n\n2 + 2\n\n[1] 4", + "objectID": "modules/Module13-Iteration.html#you-try-it", + "href": "modules/Module13-Iteration.html#you-try-it", + "title": "Module 13: Iteration in R", + "section": "You try it!", + "text": "You try it!\nWrite a loop that goes from 1 to 10, squares each of the numbers, and prints the squared number.\n\n\nfor (i in 1:10) {\n cat(i ^ 2, \"\\n\")\n}\n\n1 \n4 \n9 \n16 \n25 \n36 \n49 \n64 \n81 \n100", "crumbs": [ - "Day 1", - "Intro to Modeling" + "Day 3", + "Module 13: Iteration in R" ] }, { - "objectID": "exercises/CaseStudy01.html#part-1", - "href": "exercises/CaseStudy01.html#part-1", - "title": "Case Study 1", - "section": "Part 1", - "text": "Part 1\n\n\n\nWas the overall prevalence higher in urban or rural areas?" 
+ "objectID": "modules/Module13-Iteration.html#wait-did-we-need-to-do-that", + "href": "modules/Module13-Iteration.html#wait-did-we-need-to-do-that", + "title": "Module 13: Iteration in R", + "section": "Wait, did we need to do that?", + "text": "Wait, did we need to do that?\n\nWell, yes, because you need to practice loops!\nBut technically no, because we can use vectorization.\nAlmost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#what-is", - "href": "exercises/CaseStudy01.html#what-is", - "title": "Case Study 1", - "section": "What is", - "text": "What is\nWhat is" + "objectID": "modules/Module13-Iteration.html#wait-did-we-need-to-do-that-1", + "href": "modules/Module13-Iteration.html#wait-did-we-need-to-do-that-1", + "title": "Module 13: Iteration in R", + "section": "Wait, did we need to do that?", + "text": "Wait, did we need to do that?\n\nWell, yes, because you need to practice loops!\nBut technically no, because we can use vectorization.\nAlmost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.\n\n\n# No loop needed!\n(1:10)^2\n\n [1] 1 4 9 16 25 36 49 64 81 100\n\n\n\n\n# Get the first 10 odd numbers, a common CS 101 loop problem on exams\n(1:20)[which((1:20 %% 2) == 1)]\n\n [1] 1 3 5 7 9 11 13 15 17 19\n\n\n\n\n\nSo you should really try vectorization first, then use loops only when you can’t use vectorization.", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", - "href": "exercises/CaseStudy01.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: Was the overall prevalence higher in urban or rural areas?", - "text": "Q1: Was the overall prevalence higher in urban or rural areas?\n\n\nHow do we calculate the prevalence from the data?\nHow do we calculate the prevalence separately for urban and rural areas?\nHow do we determine which prevalence is higher and if the difference is meaningful?" + "objectID": "modules/Module13-Iteration.html#loop-walkthrough", + "href": "modules/Module13-Iteration.html#loop-walkthrough", + "title": "Module 13: Iteration in R", + "section": "Loop walkthrough", + "text": "Loop walkthrough\n\nLet’s walk through a complex but useful example where we can’t use vectorization.\nLoad the cleaned measles dataset, and subset it so you only have MCV1 records.\n\n\n\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\")) |>\n subset(vaccine_antigen == \"MCV1\")\nstr(meas)\n\n'data.frame': 7972 obs. 
of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ...", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-from-the-data", - "href": "exercises/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-from-the-data", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence from the data?", - "text": "Q1: How do we calculate the prevalence from the data?\n\n\nThe variable DP_infection in our dataset is binary / dichotomous.\nThe prevalence is the number or percent of people who had the disease over some duration.\nThe average of a binary variable gives the prevalence!\n\n\n\n\nmean(diph$DP_infection)\n\n[1] 0.8" + "objectID": "modules/Module13-Iteration.html#loop-walkthrough-1", + "href": "modules/Module13-Iteration.html#loop-walkthrough-1", + "title": "Module 13: Iteration in R", + "section": "Loop walkthrough", + "text": "Loop walkthrough\n\nFirst, make an empty list. This is where we’ll store our results. Make it the same length as the number of countries in the dataset.\n\n\n\nres <- vector(mode = \"list\", length = length(unique(meas$country)))\n\n\nThis is called preallocation and it can make your loops much faster.", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", - "href": "exercises/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", - "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n\nmean(diph[diph$group == \"urban\", ]$DP_infection)\n\n[1] 0.8235294\n\nmean(diph[diph$group == \"rural\", ]$DP_infection)\n\n[1] 0.778626\n\n\n\n\n\nThere are many ways you could write this code! You can use subset() or you can write the indices many ways.\nUsing tbl_df objects from haven uses different [[ rules than a base R data frame." + "objectID": "modules/Module13-Iteration.html#loop-walkthrough-2", + "href": "modules/Module13-Iteration.html#loop-walkthrough-2", + "title": "Module 13: Iteration in R", + "section": "Loop walkthrough", + "text": "Loop walkthrough\n\nLoop through every country in the dataset, and get the median, first and third quartiles, and range for each country. Store those summary statistics in a data frame.\nWhat should the header look like?\n\n\n\ncountries <- unique(meas$country)\nfor (i in 1:length(countries)) {...}\n\n\n\n\nNote that we use the index as the control variable. 
When you need to do complex operations inside a loop, this is easier than the for-each construction we used earlier.", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is", - "title": "Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is", - "text": "Q1: How do we determine which prevalence is higher and if the difference is\nmeaningful?\n\n\nWe probably need to include a confidence interval in our calculation.\nThis is actually not so easy without more advanced tools that we will learn in upcoming modules.\nRight now the best options are to do it by hand or google a function." + "objectID": "modules/Module13-Iteration.html#loop-walkthrough-3", + "href": "modules/Module13-Iteration.html#loop-walkthrough-3", + "title": "Module 13: Iteration in R", + "section": "Loop walkthrough", + "text": "Loop walkthrough\n\nNow write out the body of the code. First we need to subset the data, to get only the data for the current country.\n\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n}\n\n\n\n\nNext we need to get the summary of the cases for that country.\n\n\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n \n # Get the summary statistics for this country\n country_cases <- country_data$Cases\n country_quart <- quantile(\n country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n )\n country_range <- range(country_cases, na.rm = TRUE)\n}\n\n\n\n\nNext we save the summary statistics into a data frame.\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n \n # Get the summary statistics for this country\n country_cases <- country_data$Cases\n country_quart <- quantile(\n country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n )\n country_range <- range(country_cases, na.rm = TRUE)\n \n # Save the summary statistics into a data frame\n country_summary <- data.frame(\n country = countries[[i]],\n min = country_range[[1]],\n Q1 = country_quart[[1]],\n median = country_quart[[2]],\n Q3 = country_quart[[3]],\n max = country_range[[2]]\n )\n}\n\n\n\n\nAnd finally, we save the data frame as the next element in our storage list.\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n \n # Get the summary statistics for this country\n country_cases <- country_data$Cases\n country_quart <- quantile(\n country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n )\n country_range <- range(country_cases, na.rm = TRUE)\n \n # Save the summary statistics into a data frame\n country_summary <- data.frame(\n country = countries[[i]],\n min = country_range[[1]],\n Q1 = country_quart[[1]],\n median = country_quart[[2]],\n Q3 = country_quart[[3]],\n max = country_range[[2]]\n )\n \n # Save the results to our container\n res[[i]] <- country_summary\n}\n\nWarning in min(x): no non-missing arguments to min; returning Inf\n\n\nWarning in max(x): no non-missing arguments to max; returning -Inf\n\n\nWarning in min(x): no non-missing arguments to min; returning Inf\n\n\nWarning in 
max(x): no non-missing arguments to max; returning -Inf\n\n\nWarning in min(x): no non-missing arguments to min; returning Inf\n\n\nWarning in max(x): no non-missing arguments to max; returning -Inf\n\n\n\n\n\nLet’s take a look at the results.\n\n\nhead(res)\n\n[[1]]\n country min Q1 median Q3 max\n1 Afghanistan 353 1154 2205 5166 31107\n\n[[2]]\n country min Q1 median Q3 max\n1 Angola 29 700 3271 14474 30067\n\n[[3]]\n country min Q1 median Q3 max\n1 Albania 0 1 12 29 136034\n\n[[4]]\n country min Q1 median Q3 max\n1 Andorra 0 0 1 2 5\n\n[[5]]\n country min Q1 median Q3 max\n1 United Arab Emirates 22 89.75 320 1128 2913\n\n[[6]]\n country min Q1 median Q3 max\n1 Argentina 0 0 17 4591.5 42093\n\n\n\nHow do we deal with this to get it into a nice form?\n\n\n\n\nWe can use a vectorization trick: the function do.call() seems like ancient computer science magic. And it is. But it will actually help us a lot.\n\n\nres_df <- do.call(rbind, res)\nhead(res_df)\n\n\n\n\ncountry\nmin\nQ1\nmedian\nQ3\nmax\n\n\n\n\nAfghanistan\n353\n1154.00\n2205\n5166.0\n31107\n\n\nAngola\n29\n700.00\n3271\n14474.0\n30067\n\n\nAlbania\n0\n1.00\n12\n29.0\n136034\n\n\nAndorra\n0\n0.00\n1\n2.0\n5\n\n\nUnited Arab Emirates\n22\n89.75\n320\n1128.0\n2913\n\n\nArgentina\n0\n0.00\n17\n4591.5\n42093\n\n\n\n\n\n\nIt combined our data frames together! Let’s take a look at the rbind and do.call() help packages to see what happened.\n\n\n\n\n?rbind\n\nCombine R Objects by Rows or Columns\n\nDescription:\n\n Take a sequence of vector, matrix or data-frame arguments and\n combine by _c_olumns or _r_ows, respectively. These are generic\n functions with methods for other R classes.\n\nUsage:\n\n cbind(..., deparse.level = 1)\n rbind(..., deparse.level = 1)\n ## S3 method for class 'data.frame'\n rbind(..., deparse.level = 1, make.row.names = TRUE,\n stringsAsFactors = FALSE, factor.exclude = TRUE)\n \nArguments:\n\n ...: (generalized) vectors or matrices. These can be given as\n named arguments. Other R objects may be coerced as\n appropriate, or S4 methods may be used: see sections\n 'Details' and 'Value'. (For the '\"data.frame\"' method of\n 'cbind' these can be further arguments to 'data.frame' such\n as 'stringsAsFactors'.)\n\ndeparse.level: integer controlling the construction of labels in the\n case of non-matrix-like arguments (for the default method):\n 'deparse.level = 0' constructs no labels;\n the default 'deparse.level = 1' typically and 'deparse.level\n = 2' always construct labels from the argument names, see the\n 'Value' section below.\n\nmake.row.names: (only for data frame method:) logical indicating if\n unique and valid 'row.names' should be constructed from the\n arguments.\n\nstringsAsFactors: logical, passed to 'as.data.frame'; only has an\n effect when the '...' arguments contain a (non-'data.frame')\n 'character'.\n\nfactor.exclude: if the data frames contain factors, the default 'TRUE'\n ensures that 'NA' levels of factors are kept, see PR#17562\n and the 'Data frame methods'. In R versions up to 3.6.x,\n 'factor.exclude = NA' has been implicitly hardcoded (R <=\n 3.6.0) or the default (R = 3.6.x, x >= 1).\n\nDetails:\n\n The functions 'cbind' and 'rbind' are S3 generic, with methods for\n data frames. The data frame method will be used if at least one\n argument is a data frame and the rest are vectors or matrices.\n There can be other methods; in particular, there is one for time\n series objects. See the section on 'Dispatch' for how the method\n to be used is selected. 
If some of the arguments are of an S4\n class, i.e., 'isS4(.)' is true, S4 methods are sought also, and\n the hidden 'cbind' / 'rbind' functions from package 'methods'\n maybe called, which in turn build on 'cbind2' or 'rbind2',\n respectively. In that case, 'deparse.level' is obeyed, similarly\n to the default method.\n\n In the default method, all the vectors/matrices must be atomic\n (see 'vector') or lists. Expressions are not allowed. Language\n objects (such as formulae and calls) and pairlists will be coerced\n to lists: other objects (such as names and external pointers) will\n be included as elements in a list result. Any classes the inputs\n might have are discarded (in particular, factors are replaced by\n their internal codes).\n\n If there are several matrix arguments, they must all have the same\n number of columns (or rows) and this will be the number of columns\n (or rows) of the result. If all the arguments are vectors, the\n number of columns (rows) in the result is equal to the length of\n the longest vector. Values in shorter arguments are recycled to\n achieve this length (with a 'warning' if they are recycled only\n _fractionally_).\n\n When the arguments consist of a mix of matrices and vectors the\n number of columns (rows) of the result is determined by the number\n of columns (rows) of the matrix arguments. Any vectors have their\n values recycled or subsetted to achieve this length.\n\n For 'cbind' ('rbind'), vectors of zero length (including 'NULL')\n are ignored unless the result would have zero rows (columns), for\n S compatibility. (Zero-extent matrices do not occur in S3 and are\n not ignored in R.)\n\n Matrices are restricted to less than 2^31 rows and columns even on\n 64-bit systems. So input vectors have the same length\n restriction: as from R 3.2.0 input matrices with more elements\n (but meeting the row and column restrictions) are allowed.\n\nValue:\n\n For the default method, a matrix combining the '...' arguments\n column-wise or row-wise. (Exception: if there are no inputs or\n all the inputs are 'NULL', the value is 'NULL'.)\n\n The type of a matrix result determined from the highest type of\n any of the inputs in the hierarchy raw < logical < integer <\n double < complex < character < list .\n\n For 'cbind' ('rbind') the column (row) names are taken from the\n 'colnames' ('rownames') of the arguments if these are matrix-like.\n Otherwise from the names of the arguments or where those are not\n supplied and 'deparse.level > 0', by deparsing the expressions\n given, for 'deparse.level = 1' only if that gives a sensible name\n (a 'symbol', see 'is.symbol').\n\n For 'cbind' row names are taken from the first argument with\n appropriate names: rownames for a matrix, or names for a vector of\n length the number of rows of the result.\n\n For 'rbind' column names are taken from the first argument with\n appropriate names: colnames for a matrix, or names for a vector of\n length the number of columns of the result.\n\nData frame methods:\n\n The 'cbind' data frame method is just a wrapper for\n 'data.frame(..., check.names = FALSE)'. This means that it will\n split matrix columns in data frame arguments, and convert\n character columns to factors unless 'stringsAsFactors = FALSE' is\n specified.\n\n The 'rbind' data frame method first drops all zero-column and\n zero-row arguments. (If that leaves none, it returns the first\n argument with columns otherwise a zero-column zero-row data\n frame.) 
It then takes the classes of the columns from the first\n data frame, and matches columns by name (rather than by position).\n Factors have their levels expanded as necessary (in the order of\n the levels of the level sets of the factors encountered) and the\n result is an ordered factor if and only if all the components were\n ordered factors. Old-style categories (integer vectors with\n levels) are promoted to factors.\n\n Note that for result column 'j', 'factor(., exclude = X(j))' is\n applied, where\n\n X(j) := if(isTRUE(factor.exclude)) {\n if(!NA.lev[j]) NA # else NULL\n } else factor.exclude\n \n where 'NA.lev[j]' is true iff any contributing data frame has had\n a 'factor' in column 'j' with an explicit 'NA' level.\n\nDispatch:\n\n The method dispatching is _not_ done via 'UseMethod()', but by\n C-internal dispatching. Therefore there is no need for, e.g.,\n 'rbind.default'.\n\n The dispatch algorithm is described in the source file\n ('.../src/main/bind.c') as\n\n 1. For each argument we get the list of possible class\n memberships from the class attribute.\n\n 2. We inspect each class in turn to see if there is an\n applicable method.\n\n 3. If we find a method, we use it. Otherwise, if there was an\n S4 object among the arguments, we try S4 dispatch; otherwise,\n we use the default code.\n\n If you want to combine other objects with data frames, it may be\n necessary to coerce them to data frames first. (Note that this\n algorithm can result in calling the data frame method if all the\n arguments are either data frames or vectors, and this will result\n in the coercion of character vectors to factors.)\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'c' to combine vectors (and lists) as vectors, 'data.frame' to\n combine vectors and matrices as a data frame.\n\nExamples:\n\n m <- cbind(1, 1:7) # the '1' (= shorter vector) is recycled\n m\n m <- cbind(m, 8:14)[, c(1, 3, 2)] # insert a column\n m\n cbind(1:7, diag(3)) # vector is subset -> warning\n \n cbind(0, rbind(1, 1:3))\n cbind(I = 0, X = rbind(a = 1, b = 1:3)) # use some names\n xx <- data.frame(I = rep(0,2))\n cbind(xx, X = rbind(a = 1, b = 1:3)) # named differently\n \n cbind(0, matrix(1, nrow = 0, ncol = 4)) #> Warning (making sense)\n dim(cbind(0, matrix(1, nrow = 2, ncol = 0))) #-> 2 x 1\n \n ## deparse.level\n dd <- 10\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 0) # middle 2 rownames\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 1) # 3 rownames (default)\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 2) # 4 rownames\n \n ## cheap row names:\n b0 <- gl(3,4, labels=letters[1:3])\n bf <- setNames(b0, paste0(\"o\", seq_along(b0)))\n df <- data.frame(a = 1, B = b0, f = gl(4,3))\n df. <- data.frame(a = 1, B = bf, f = gl(4,3))\n new <- data.frame(a = 8, B =\"B\", f = \"1\")\n (df1 <- rbind(df , new))\n (df.1 <- rbind(df., new))\n stopifnot(identical(df1, rbind(df, new, make.row.names=FALSE)),\n identical(df1, rbind(df., new, make.row.names=FALSE)))\n\n\n\n\n\n?do.call\n\nExecute a Function Call\n\nDescription:\n\n 'do.call' constructs and executes a function call from a name or a\n function and a list of arguments to be passed to it.\n\nUsage:\n\n do.call(what, args, quote = FALSE, envir = parent.frame())\n \nArguments:\n\n what: either a function or a non-empty character string naming the\n function to be called.\n\n args: a _list_ of arguments to the function call. 
The 'names'\n attribute of 'args' gives the argument names.\n\n quote: a logical value indicating whether to quote the arguments.\n\n envir: an environment within which to evaluate the call. This will\n be most useful if 'what' is a character string and the\n arguments are symbols or quoted expressions.\n\nDetails:\n\n If 'quote' is 'FALSE', the default, then the arguments are\n evaluated (in the calling environment, not in 'envir'). If\n 'quote' is 'TRUE' then each argument is quoted (see 'quote') so\n that the effect of argument evaluation is to remove the quotes -\n leaving the original arguments unevaluated when the call is\n constructed.\n\n The behavior of some functions, such as 'substitute', will not be\n the same for functions evaluated using 'do.call' as if they were\n evaluated from the interpreter. The precise semantics are\n currently undefined and subject to change.\n\nValue:\n\n The result of the (evaluated) function call.\n\nWarning:\n\n This should not be used to attempt to evade restrictions on the\n use of '.Internal' and other non-API calls.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'call' which creates an unevaluated call.\n\nExamples:\n\n do.call(\"complex\", list(imaginary = 1:3))\n \n ## if we already have a list (e.g., a data frame)\n ## we need c() to add further arguments\n tmp <- expand.grid(letters[1:2], 1:3, c(\"+\", \"-\"))\n do.call(\"paste\", c(tmp, sep = \"\"))\n \n do.call(paste, list(as.name(\"A\"), as.name(\"B\")), quote = TRUE)\n \n ## examples of where objects will be found.\n A <- 2\n f <- function(x) print(x^2)\n env <- new.env()\n assign(\"A\", 10, envir = env)\n assign(\"f\", f, envir = env)\n f <- function(x) print(x)\n f(A) # 2\n do.call(\"f\", list(A)) # 2\n do.call(\"f\", list(A), envir = env) # 4\n do.call( f, list(A), envir = env) # 2\n do.call(\"f\", list(quote(A)), envir = env) # 100\n do.call( f, list(quote(A)), envir = env) # 10\n do.call(\"f\", list(as.name(\"A\")), envir = env) # 100\n \n eval(call(\"f\", A)) # 2\n eval(call(\"f\", quote(A))) # 2\n eval(call(\"f\", A), envir = env) # 4\n eval(call(\"f\", quote(A)), envir = env) # 100\n\n\n\n\n\nOK, so basically what happened is that\n\n\ndo.call(rbind, list)\n\n\nGets transformed into\n\n\nrbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])\n\n\nThat’s vectorization magic!", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-1", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-1", - "title": "Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is", - "text": "Q1: How do we determine which prevalence is higher and if the difference is\nmeaningful? (By hand)\n\n\nUrban: 0.82; 95% CI: (0.78, 0.87)\nRural: 0.78; 95% CI: (0.73, 0.83)\n\n\n\nWe can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.\nSome people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code!\nWe will learn techniques for writing this with less work and less repetition in upcoming modules." 
+ "objectID": "modules/Module13-Iteration.html#you-try-it-if-we-have-time", + "href": "modules/Module13-Iteration.html#you-try-it-if-we-have-time", + "title": "Module 13: Iteration in R", + "section": "You try it! (if we have time)", + "text": "You try it! (if we have time)\n\nUse the code you wrote before the get the incidence per 1000 people on the entire measles data set (add a column for incidence to the full data).\nUse the code plot(NULL, NULL, ...) to make a blank plot. You will need to set the xlim and ylim arguments to sensible values, and specify the axis titles as “Year” and “Incidence per 1000 people”.\nUsing a for loop and the lines() function, make a plot that shows all of the incidence curves over time, overlapping on the plot.\nHINT: use col = adjustcolor(black, alpha.f = 0.25) to make the curves partially transparent, so you can see the overlap.\nBONUS PROBLEM: using the function cumsum(), make a plot of the cumulative cases (not standardized) over time for all of the countries. (Dealing with the NA’s here is tricky!!)", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-2", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-2", - "title": "Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is", - "text": "Q1: How do we determine which prevalence is higher and if the difference is\nmeaningful? (Google a package)\n\n\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254\n\n\n\nNotice that the results are slightly different from what we did manually! One advantage of writing your own code is that you know exactly what it does!\nFiguring out the details of how that function works might take a lot of time." + "objectID": "modules/Module13-Iteration.html#main-problem-solution", + "href": "modules/Module13-Iteration.html#main-problem-solution", + "title": "Module 13: Iteration in R", + "section": "Main problem solution", + "text": "Main problem solution\n\nmeas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000\ncountries <- unique(meas$country)\n\nplot(\n NULL, NULL,\n xlim = c(1980, 2022),\n ylim = c(0, 50),\n xlab = \"Year\",\n ylab = \"Incidence per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n country_data <- subset(meas, country == countries[[i]])\n lines(\n x = country_data$time,\n y = country_data$cases_per_thousand,\n col = adjustcolor(\"black\", alpha.f = 0.25)\n )\n}", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?", - "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?\n\n\nWe probably need to include a confidence interval in our calculation.\nThis is actually not so easy without more advanced tools that we will learn in upcoming modules.\nRight now the best options are to do it by hand or google a function." 
+ "objectID": "modules/Module13-Iteration.html#main-problem-solution-1", + "href": "modules/Module13-Iteration.html#main-problem-solution-1", + "title": "Module 13: Iteration in R", + "section": "Main problem solution", + "text": "Main problem solution", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful-by-hand", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful-by-hand", - "title": "Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful? (By hand)", - "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful? (By hand)\n\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph))\n\nresult_urban <- paste0(\n \"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n round(p_urban - 1.96 * se_urban, 2), \", \",\n round(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n \"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n round(p_rural - 1.96 * se_rural, 2), \", \",\n round(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n\nUrban: 0.82; 95% CI: (0.78, 0.87)\nRural: 0.78; 95% CI: (0.73, 0.83)" + "objectID": "modules/Module13-Iteration.html#bonus-problem-solution", + "href": "modules/Module13-Iteration.html#bonus-problem-solution", + "title": "Module 13: Iteration in R", + "section": "Bonus problem solution", + "text": "Bonus problem solution\n\n# First calculate the cumulative cases, treating NA as zeroes\ncumulative_cases <- ave(\n x = ifelse(is.na(meas$Cases), 0, meas$Cases),\n meas$country,\n FUN = cumsum\n)\n\n# Now put the NAs back where they should be\nmeas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)\n\nplot(\n NULL, NULL,\n xlim = c(1980, 2022),\n ylim = c(1, 6.2e6),\n xlab = \"Year\",\n ylab = paste0(\"Cumulative cases since\", min(meas$time))\n)\n\nfor (i in 1:length(countries)) {\n country_data <- subset(meas, country == countries[[i]])\n lines(\n x = country_data$time,\n y = country_data$cumulative_cases,\n col = adjustcolor(\"black\", alpha.f = 0.25)\n )\n}\n\ntext(\n x = 2020,\n y = 6e6,\n labels = \"China →\"\n)", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful-google-a-package", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful-google-a-package", - "title": "Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful? (Google a package)", - "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful? (Google a package)\n\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254\n\n\n\nNotice that the results are slightly different from what we did manually! 
One advantage of writing your own code is that you know exactly what it does!\nFiguring out the details of how that function works might take a lot of time." + "objectID": "modules/Module13-Iteration.html#bonus-problem-solution-1", + "href": "modules/Module13-Iteration.html#bonus-problem-solution-1", + "title": "Module 13: Iteration in R", + "section": "Bonus problem solution", + "text": "Bonus problem solution", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", - "href": "exercises/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", - "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\nOne easy way is to use the aggregate() function.\n\n\naggregate(DP_infection ~ group, data = diph, FUN = mean)\n\n group DP_infection\n1 rural 0.7786260\n2 urban 0.8235294" + "objectID": "modules/Module13-Iteration.html#more-practice-on-your-own", + "href": "modules/Module13-Iteration.html#more-practice-on-your-own", + "title": "Module 13: Iteration in R", + "section": "More practice on your own", + "text": "More practice on your own\n\nMerge the countries-regions.csv data with the measles_final.Rds data. Reshape the measles data so that MCV1 and MCV2 vaccine coverage are two separate columns. Then use a loop to fit a poisson regression model for each continent where Cases is the outcome, and MCV1 coverage and MCV2 coverage are the predictors. Discuss your findings, and try adding an interation term.\nAssess the impact of age_months as a confounder in the Diphtheria serology data. First, write code to transform age_months into age ranges for each year. Then, using a loop, calculate the crude odds ratio for the effect of vaccination on infection for each of the age ranges. How does the odds ratio change as age increases? Can you formalize this analysis by fitting a logistic regression model with age_months and vaccination as predictors?", + "crumbs": [ + "Day 3", + "Module 13: Iteration in R" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful-by-hand-1", - "href": "exercises/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful-by-hand-1", - "title": "Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful? (By hand)", - "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful? (By hand)\n\nWe can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.\nSome people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules." 
+ "objectID": "modules/Module07-VarCreationClassesSummaries.html#learning-objectives", + "href": "modules/Module07-VarCreationClassesSummaries.html#learning-objectives", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Learning Objectives", + "text": "Learning Objectives\nAfter module 7, you should be able to…\n\nCreate new variables\nCharacterize variable classes\nManipulate the classes of variables\nConduct 1 variable data summaries", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-googling-a-package", - "href": "exercises/CaseStudy01.html#q1-googling-a-package", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: Googling a package", - "text": "Q1: Googling a package\n\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254" + "objectID": "modules/Module07-VarCreationClassesSummaries.html#import-data-for-this-module", + "href": "modules/Module07-VarCreationClassesSummaries.html#import-data-for-this-module", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Import data for this module", + "text": "Import data for this module\nLet’s first read in the data from the previous module and look at it briefly with a new function head(). head() allows us to look at the first n observations.\n\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-by-hand", - "href": "exercises/CaseStudy01.html#q1-by-hand", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: By hand", - "text": "Q1: By hand\n\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == \"urban\", ]))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == \"rural\", ])) \n\nresult_urban <- paste0(\n \"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n round(p_urban - 1.96 * se_urban, 2), \", \",\n round(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n \"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n round(p_rural - 1.96 * se_rural, 2), \", \",\n round(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n\nUrban: 0.82; 95% CI: (0.76, 0.89)\nRural: 0.78; 95% CI: (0.71, 0.85)" + "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-operator", + "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-operator", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Adding new columns with $ operator", + "text": "Adding new columns with $ operator\nYou can add a new column, called log_IgG to df, using the $ operator:\n\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n\n observation_id IgG_concentration age gender slum log_IgG\n1 5772 0.3176895 2 Female Non slum -1.146681\n2 8095 3.4368231 4 Female Non slum 1.234548\n3 9784 0.3000000 4 Male Non 
slum -1.203973\n\n\nNote, my use of the underscore in the variable name rather than a space. This is good coding practice and make calling variables much less prone to error.", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "exercises/CaseStudy01.html#q1-by-hand-1", - "href": "exercises/CaseStudy01.html#q1-by-hand-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: By hand", - "text": "Q1: By hand\n\nWe can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.\nSome people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform", + "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Adding new columns with transform()", + "text": "Adding new columns with transform()\nWe can also add a new column using the transform() function:\n\n?transform\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "exercises/CaseStudy01.html#congratulations-for-finishing-the-first-case-study", - "href": "exercises/CaseStudy01.html#congratulations-for-finishing-the-first-case-study", - "title": "Algorithmic Thinking Case Study 1", - "section": "Congratulations for finishing the first case study!", - "text": "Congratulations for finishing the first case study!\n\nWhat R functions and skills did you practice?\nWhat other questions could you answer about the same dataset with the skills you know now?" 
+ "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform-1", + "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform-1", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Adding new columns with transform()", + "text": "Adding new columns with transform()\nFor example, adding a binary column for seropositivity called seropos:\n\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n-1.1466807\nFALSE\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1.2345475\nFALSE\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n-1.2039728\nFALSE\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n4.9644957\nTRUE\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n-0.8037359\nFALSE\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n-3.6781074\nFALSE", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/CaseStudy01.html#learning-goals", - "href": "modules/CaseStudy01.html#learning-goals", - "title": "Algorithmic Thinking Case Study 1", - "section": "Learning goals", - "text": "Learning goals\n\nUse logical operators, subsetting functions, and math calculations in R\nTranslate human-understandable problem descriptions into instructions that R can understand.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#creating-conditional-variables", + "href": "modules/Module07-VarCreationClassesSummaries.html#creating-conditional-variables", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Creating conditional variables", + "text": "Creating conditional variables\nOne frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R ifelse() function, which “returns a value depending on whether the element of test is TRUE or FALSE.”\n\n?ifelse\n\nConditional Element Selection\nDescription:\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\nUsage:\n ifelse(test, yes, no)\n \nArguments:\ntest: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\nDetails:\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\nValue:\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\nWarning:\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'if'.\nExamples:\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... ==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#instructions", - "href": "modules/CaseStudy01.html#instructions", - "title": "Algorithmic Thinking Case Study 1", - "section": "Instructions", - "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. 
Load it into R and use the functions we’ve learned to look at it.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example", + "href": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "ifelse example", + "text": "ifelse example\nReminder of the first three arguments in the ifelse() function are ifelse(test, yes, no).\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\nage_group\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n-1.1466807\nFALSE\nyoung\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1.2345475\nFALSE\nyoung\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n-1.2039728\nFALSE\nyoung\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n4.9644957\nTRUE\nyoung\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n-0.8037359\nFALSE\nyoung\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n-3.6781074\nFALSE\nyoung", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#instructions-1", - "href": "modules/CaseStudy01.html#instructions-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Instructions", - "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.\nThe str() of your dataset should look like this.\n\n\n\ntibble [250 × 5] (S3: tbl_df/tbl/data.frame)\n $ age_months : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...\n $ group : chr [1:250] \"urban\" \"rural\" \"urban\" \"urban\" ...\n $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...\n $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...\n $ DP_vacc : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example-1", + "href": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example-1", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "ifelse example", + "text": "ifelse example\nLet’s delve into what is actually happening, with a focus on the NA values in age variable.\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\n\n\ndf$age <= 5\n\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE 
TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", - "href": "modules/CaseStudy01.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: Was the overall prevalence higher in urban or rural areas?", - "text": "Q1: Was the overall prevalence higher in urban or rural areas?\n\n\nHow do we calculate the prevalence from the data?\nHow do we calculate the prevalence separately for urban and rural areas?\nHow do we determine which prevalence is higher and if the 
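A toy illustration, not from the slides, of what those `NA` entries do downstream: a missing age yields a missing `age_group` rather than being silently classified.

```r
age <- c(2, NA, 12)
ifelse(age <= 5, "young", "old")
#> [1] "young" NA      "old"
```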
difference is meaningful?", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#nesting-two-ifelse-statements-example", + "href": "modules/Module07-VarCreationClassesSummaries.html#nesting-two-ifelse-statements-example", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Nesting two ifelse statements example", + "text": "Nesting two ifelse statements example\nifelse(test1, yes_to_test1, ifelse(test2, no_to_test2_yes_to_test2, no_to_test1_no_to_test2)).\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n\nLet’s use the table() function to check if it worked.\n\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n\n\n\n\nage/\nmiddle\nold\nyoung\nNA\n\n\n\n\n1\n0\n0\n44\n0\n\n\n2\n0\n0\n72\n0\n\n\n3\n0\n0\n79\n0\n\n\n4\n0\n0\n80\n0\n\n\n5\n0\n0\n41\n0\n\n\n6\n38\n0\n0\n0\n\n\n7\n38\n0\n0\n0\n\n\n8\n39\n0\n0\n0\n\n\n9\n20\n0\n0\n0\n\n\n10\n44\n0\n0\n0\n\n\n11\n0\n41\n0\n0\n\n\n12\n0\n23\n0\n0\n\n\n13\n0\n35\n0\n0\n\n\n14\n0\n37\n0\n0\n\n\n15\n0\n11\n0\n0\n\n\nNA\n0\n0\n0\n9\n\n\n\n\n\nNote, it puts the variable levels in alphabetical order, we will show how to change this later.", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-from-the-data", - "href": "modules/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-from-the-data", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence from the data?", - "text": "Q1: How do we calculate the prevalence from the data?\n\n\nThe variable DP_infection in our dataset is binary / dichotomous.\nThe prevalence is the number or percent of people who had the disease over some duration.\nThe average of a binary variable gives the prevalence!\n\n\n\n\nmean(diph$DP_infection)\n\n[1] 0.8", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#overview---data-classes", + "href": "modules/Module07-VarCreationClassesSummaries.html#overview---data-classes", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Overview - Data Classes", + "text": "Overview - Data Classes\n\nOne dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\nTwo dimensional types (e.g., matrix, data frame, tibble)\nSpecial data classes (e.g., lists, dates).", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", - "href": "modules/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", - "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n\nmean(diph[diph$group == \"urban\", ]$DP_infection)\n\n[1] 0.8235294\n\nmean(diph[diph$group == \"rural\", ]$DP_infection)\n\n[1] 0.778626\n\n\n\n\n\nThere are many ways you could write this code! 
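One base-R alternative to nesting several `ifelse()` calls is `cut()`, which bins a numeric vector at breakpoints and returns a factor. A sketch assuming the same `df$age` and `df$age_group` as above:

```r
# Sketch: equivalent age groups via cut(); (0,5] = young, (5,10] = middle, (10,Inf] = old
df$age_group2 <- cut(df$age,
                     breaks = c(0, 5, 10, Inf),
                     labels = c("young", "middle", "old"))
table(df$age_group, df$age_group2, useNA = "always")  # cross-check against the nested ifelse()
```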
You can use subset() or you can write the indices many ways.\nUsing tbl_df objects from haven uses different [[ rules than a base R data frame.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#class-function", + "href": "modules/Module07-VarCreationClassesSummaries.html#class-function", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "class() function", + "text": "class() function\nThe class() function allows you to evaluate the class of an object.\n\nclass(df$IgG_concentration)\n\n[1] \"numeric\"\n\nclass(df$age)\n\n[1] \"integer\"\n\nclass(df$gender)\n\n[1] \"character\"", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", - "href": "modules/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", - "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\nOne easy way is to use the aggregate() function.\n\n\naggregate(DP_infection ~ group, data = diph, FUN = mean)\n\n group DP_infection\n1 rural 0.7786260\n2 urban 0.8235294", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#one-dimensional-data-types", + "href": "modules/Module07-VarCreationClassesSummaries.html#one-dimensional-data-types", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "One dimensional data types", + "text": "One dimensional data types\n\nCharacter: strings or individual characters, quoted\nNumeric: any real number(s)\n\nDouble: contains fractional values (i.e., double precision) - default numeric\nInteger: any integer(s)/whole numbers\n\nLogical: variables composed of TRUE or FALSE\nFactor: categorical/qualitative variables", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", - "href": "modules/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?", - "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?\n\n\nWe probably need to include a confidence interval in our calculation.\nThis is actually not so easy without more advanced tools that we will learn in upcoming modules.\nRight now the best options are to do it by hand or google a function.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#character-and-numeric", + "href": "modules/Module07-VarCreationClassesSummaries.html#character-and-numeric", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Character and numeric", + "text": "Character and numeric\nThis can also be a bit tricky.\nIf only one character in the whole vector, the class is assumed to be character\n\nclass(c(1, 2, \"tree\")) \n\n[1] \"character\"\n\n\nHere because integers are in quotations, it is read as a character class by R.\n\nclass(c(\"1\", \"4\", \"7\")) \n\n[1] \"character\"\n\n\nNote, instead of creating a new vector 
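A small sketch on toy vectors, not module data, of the coercion order behind the examples above: when classes are mixed inside `c()`, R promotes everything to the most flexible class (logical < integer < double < character).

```r
class(c(TRUE, 2L))     # "integer":   TRUE is promoted to 1L
class(c(2L, 3.5))      # "numeric":   the integer is promoted to double
class(c(3.5, "tree"))  # "character": everything becomes text
```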
object (e.g., x <- c(\"1\", \"4\", \"7\")) and then feeding the vector object x into the first argument of the class() function (e.g., class(x)), we combined the two steps and directly fed a vector object into the class function.", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-by-hand", - "href": "modules/CaseStudy01.html#q1-by-hand", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: By hand", - "text": "Q1: By hand\n\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == \"urban\", ]))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == \"rural\", ])) \n\nresult_urban <- paste0(\n \"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n round(p_urban - 1.96 * se_urban, 2), \", \",\n round(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n \"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n round(p_rural - 1.96 * se_rural, 2), \", \",\n round(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n\nUrban: 0.82; 95% CI: (0.76, 0.89)\nRural: 0.78; 95% CI: (0.71, 0.85)", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-subclasses", + "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-subclasses", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Numeric Subclasses", + "text": "Numeric Subclasses\nThere are two major numeric subclasses\n\nDouble is a special subset of numeric that contains fractional values. Double stands for double-precision\nInteger is a special subset of numeric that contains only whole numbers.\n\ntypeof() identifies the vector type (double, integer, logical, or character), whereas class() identifies the root class. The difference between the two will be more clear when we look at two dimensional classes below.\n\nclass(df$IgG_concentration)\n\n[1] \"numeric\"\n\nclass(df$age)\n\n[1] \"integer\"\n\ntypeof(df$IgG_concentration)\n\n[1] \"double\"\n\ntypeof(df$age)\n\n[1] \"integer\"", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-by-hand-1", - "href": "modules/CaseStudy01.html#q1-by-hand-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: By hand", - "text": "Q1: By hand\n\nWe can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.\nSome people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#logical", + "href": "modules/Module07-VarCreationClassesSummaries.html#logical", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Logical", + "text": "Logical\nReminder logical is a type that only has three possible elements: TRUE and FALSE and NA\n\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n\n[1] \"logical\"\n\n\nNote that when creating logical object the TRUE and FALSE are NOT in quotes. 
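A sketch with toy objects (not module data) of the `class()` versus `typeof()` distinction, including the two-dimensional case referred to above:

```r
x <- 1:4
typeof(x)   # "integer"
class(x)    # "integer"
m <- matrix(x, nrow = 2)
typeof(m)   # "integer": the storage type of the entries is unchanged
class(m)    # "matrix" "array": the two-dimensional class
typeof(1)   # "double":  unsuffixed numbers are double precision by default
typeof(1L)  # "integer": the L suffix creates an integer literal
```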
Putting R special classes (e.g., NA or FALSE) in quotations turns them into character value.", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#q1-googling-a-package", - "href": "modules/CaseStudy01.html#q1-googling-a-package", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: Googling a package", - "text": "Q1: Googling a package\n\n\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#other-useful-functions-for-evaluatingsetting-classes", + "href": "modules/Module07-VarCreationClassesSummaries.html#other-useful-functions-for-evaluatingsetting-classes", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Other useful functions for evaluating/setting classes", + "text": "Other useful functions for evaluating/setting classes\nThere are two useful functions associated with practically all R classes:\n\nis.CLASS_NAME(x) to logically check whether or not x is of certain class. For example, is.integer or is.character or is.numeric\nas.CLASS_NAME(x) to coerce between classes x from current x class into a another class. For example, as.integer or as.character or as.numeric. This is particularly useful is maybe integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#you-try-it", - "href": "modules/CaseStudy01.html#you-try-it", - "title": "Algorithmic Thinking Case Study 1", - "section": "You try it!", - "text": "You try it!\n\nUsing any of the approaches you can think of, answer this question!\nHow many children under 5 were vaccinated? In children under 5, did vaccination lower the prevalence of infection?", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#examples-is.class_namex", + "href": "modules/Module07-VarCreationClassesSummaries.html#examples-is.class_namex", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Examples is.CLASS_NAME(x)", + "text": "Examples is.CLASS_NAME(x)\n\nis.numeric(df$IgG_concentration)\n\n[1] TRUE\n\nis.character(df$age)\n\n[1] FALSE\n\nis.character(df$gender)\n\n[1] TRUE", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#you-try-it-1", - "href": "modules/CaseStudy01.html#you-try-it-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "You try it!", - "text": "You try it!\n\n# How many children under 5 were vaccinated\nsum(diph$DP_vacc[diph$age_months < 60])\n\n[1] 91\n\n# Prevalence in both vaccine groups for children under 5\naggregate(\n DP_infection ~ DP_vacc,\n data = subset(diph, age_months < 60),\n FUN = DescTools::MeanCI\n)\n\n DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 0 0.4285714 0.1977457 0.6593972\n2 1 0.6373626 0.5366845 0.7380407\n\n\nIt appears that prevalence was HIGHER in the vaccine group? 
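A short sketch, with a hypothetical vector rather than module data, of the check-then-coerce workflow described above (for example, a numeric variable that was read in as character):

```r
x <- c("12", "7", "103")   # ages accidentally stored as text (hypothetical)
is.character(x)            # TRUE
x <- as.numeric(x)         # coerce back to numeric
is.numeric(x)              # TRUE
mean(x)                    # 40.66667: arithmetic now works
```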
That is counterintuitive, but the sample size for the unvaccinated group is too small to be sure.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#examples-as.class_namex", + "href": "modules/Module07-VarCreationClassesSummaries.html#examples-as.class_namex", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Examples as.CLASS_NAME(x)", + "text": "Examples as.CLASS_NAME(x)\nIn some cases, coercing is seamless\n\nas.character(c(1, 4, 7))\n\n[1] \"1\" \"4\" \"7\"\n\nas.numeric(c(\"1\", \"4\", \"7\"))\n\n[1] 1 4 7\n\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n\n[1] TRUE FALSE FALSE\n\n\nIn some cases the coercing is not possible; if executed, will return NA\n\nas.numeric(c(\"1\", \"4\", \"7a\"))\n\nWarning: NAs introduced by coercion\n\n\n[1] 1 4 NA\n\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n\n[1] TRUE FALSE NA", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/CaseStudy01.html#congratulations-for-finishing-the-first-case-study", - "href": "modules/CaseStudy01.html#congratulations-for-finishing-the-first-case-study", - "title": "Algorithmic Thinking Case Study 1", - "section": "Congratulations for finishing the first case study!", - "text": "Congratulations for finishing the first case study!\n\nWhat R functions and skills did you practice?\nWhat other questions could you answer about the same dataset with the skills you know now?", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#factors", + "href": "modules/Module07-VarCreationClassesSummaries.html#factors", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Factors", + "text": "Factors\nA factor is a special character vector where the elements have pre-defined groups or ‘levels’. You can think of these as qualitative or categorical variables. Use the factor() function to create factors from character values.\n\nclass(df$age_group)\n\n[1] \"character\"\n\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n\n[1] \"factor\"\n\nlevels(df$age_group_factor)\n\n[1] \"middle\" \"old\" \"young\" \n\n\nNote 1, that levels are, by default, set to alphanumerical order! And, the first is always the “reference” group. However, we often prefer a different reference group.\nNote 2, we can also make ordered factors using factor(... ordered=TRUE), but we won’t talk more about that.", "crumbs": [ "Day 1", - "Algorithmic Thinking Case Study 1" + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/ModuleXX-Iteration.html#what-is-iteration", - "href": "modules/ModuleXX-Iteration.html#what-is-iteration", - "title": "Iteration in R", - "section": "What is iteration?", - "text": "What is iteration?\n\nWhenever you repeat something, that’s iteration.\nIn R, this means running the same code multiple times in a row.\n\n\ndata(\"penguins\", package = \"palmerpenguins\")\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm." 
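A toy sketch of a related pitfall, taken from the warning in the `factor()` help page quoted later in this module: `as.numeric()` applied directly to a factor returns the internal level codes, not the original values.

```r
f <- factor(c("10", "20", "10", "30"))
as.numeric(f)                # 1 2 1 3: internal level codes, usually not what you want
as.numeric(levels(f))[f]     # 10 20 10 30: the recommended conversion
as.numeric(as.character(f))  # 10 20 10 30: equivalent, slightly less efficient
```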
+ "objectID": "modules/Module07-VarCreationClassesSummaries.html#reference-groups", + "href": "modules/Module07-VarCreationClassesSummaries.html#reference-groups", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Reference Groups", + "text": "Reference Groups\nWhy do we care about reference groups?\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being <5 years old is associated with higher IgG antibody concentrations\nBy default middle is the reference group therefore we will only generate beta coefficients comparing middle to young AND middle to old. But, we want young to be the reference group so we will generate beta coefficients comparing young to middle AND young to old.", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#parts-of-a-loop", - "href": "modules/ModuleXX-Iteration.html#parts-of-a-loop", - "title": "Iteration in R", - "section": "Parts of a loop", - "text": "Parts of a loop\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe header declares how many times we will repeat the same code. The header contains a control variable that changes in each repetition and a sequence of values for the control variable to take." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference", + "href": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Changing factor reference", + "text": "Changing factor reference\nChanging the reference group of a factor variable.\n\nIf the object is already a factor then use relevel() function and the ref argument to specify the reference.\nIf the object is a character then use factor() function and levels argument to specify the order of the values, the first being the reference.\n\nLet’s look at the relevel() help file\nReorder Levels of Factor\nDescription:\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\nUsage:\n relevel(x, ref, ...)\n \nArguments:\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\nDetails:\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\nValue:\n A factor of the same length as 'x'.\nSee Also:\n 'factor', 'contr.treatment', 'levels', 'reorder'.\nExamples:\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\nLet’s look at the factor() help file\nFactors\nDescription:\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. 
For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\nUsage:\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n x: a vector of data, usually taking a small number of distinct\n values.\nlevels: an optional vector of the unique values (as character strings) that ‘x’ might have taken. The default is the unique set of values taken by ‘as.character(x)’, sorted into increasing order of ‘x’. Note that this set can be specified as smaller than ‘sort(unique(x))’.\nlabels: either an optional character vector of labels for the levels (in the same order as ‘levels’ after removing those in ‘exclude’), or a character string of length 1. Duplicated values in ‘labels’ can be used to map different values of ‘x’ to the same factor level.\nexclude: a vector of values to be excluded when forming the set of levels. This may be factor with the same level set as ‘x’ or should be a ‘character’.\nordered: logical flag to determine if the levels should be regarded as ordered (in the order given).\nnmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\nifany: only add an ‘NA’ level if it is used, i.e. if ‘any(is.na(x))’.\nobject: an R object.\nDetails:\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and model-fitting functions may treat the two classes\n quite differently, see 'options(\"contrasts\")'.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '<NA>');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). 
Under those circumstances\n missing values are currently printed as '<NA>', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\nValue:\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject(<factor>)'.\nWarning:\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\nComparison operators and group generic methods:\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. 
(The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\nNote:\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\nReferences:\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\nSee Also:\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\nExamples:\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. <- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 <NA> <NA>\n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps <NA> , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = <character>\n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#parts-of-a-loop-1", - "href": "modules/ModuleXX-Iteration.html#parts-of-a-loop-1", - "title": "Iteration in R", - "section": "Parts of a loop", - "text": "Parts of a loop\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe body of the loop contains code that will be repeated a number of times based on the header instructions. 
In R, the body has to be surrounded by curly braces." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference-examples", + "href": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference-examples", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Changing factor reference examples", + "text": "Changing factor reference examples\n\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n\n[1] \"young\" \"middle\" \"old\" \n\n\nOR\n\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n\n[1] \"young\" \"middle\" \"old\" \n\n\nArranging, tabulating, and plotting the data will reflect the new order", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#header-parts", - "href": "modules/ModuleXX-Iteration.html#header-parts", - "title": "Iteration in R", - "section": "Header parts", - "text": "Header parts\n\nfor (this_island in levels(penguins$island)) {...}\n\n\nfor: keyword that declares we are doing a for loop.\n(...): parentheses after for declare the control variable and sequence.\nthis_island: the control variable.\nin: keyword that separates the control varibale and sequence.\nlevels(penguins$island): the sequence.\n{}: curly braces will contain the body code." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#two-dimensional-data-classes", + "href": "modules/Module07-VarCreationClassesSummaries.html#two-dimensional-data-classes", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Two-dimensional data classes", + "text": "Two-dimensional data classes\nTwo-dimensional classes are those we would often use to store data read from a file\n\na matrix (matrix class)\na data frame (data.frame or tibble classes)", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#header-parts-1", - "href": "modules/ModuleXX-Iteration.html#header-parts-1", - "title": "Iteration in R", - "section": "Header parts", - "text": "Header parts\n\nfor (this_island in levels(penguins$island)) {...}\n\n\nSince levels(penguins$island) evaluates to c(\"Biscoe\", \"Dream\", \"Torgersen\"), our loop will repeat 3 times.\n\n\n\n\nIteration\nthis_island\n\n\n\n\n1\n“Biscoe”\n\n\n2\n“Dream”\n\n\n3\n“Torgersen”\n\n\n\n\nEverything inside of {...} will be repeated three times." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#matrices", + "href": "modules/Module07-VarCreationClassesSummaries.html#matrices", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Matrices", + "text": "Matrices\nMatrices, like data frames are also composed of rows and columns. Matrices, unlike data.frame, the entire matrix is composed of one R class. For example: all entries are numeric, or all entries are character\nas.matrix() creates a matrix from a data frame (where all values are the same class). 
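A quick sketch, assuming the releveled `age_group_factor` from the examples above, showing that tabulation now follows the specified level order instead of alphabetical order:

```r
table(df$age_group_factor, useNA = "always")
#> young middle    old   <NA>    (counts as reported earlier in this module)
#>   316    179    147      9
```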
As a reminder, here is the matrix signature function to help remind us how to build a matrix\nmatrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)\n\nmatrix(data=1:6, ncol = 2) \n\n\n\n\n1\n4\n\n\n2\n5\n\n\n3\n6\n\n\n\n\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n\n\n\n\n1\n2\n\n\n3\n4\n\n\n5\n6\n\n\n\n\n\nNote, the first matrix filled in numbers 1-6 by columns first and then rows because default byrow argument is FALSE. In the second matrix, we changed the argument byrow to TRUE, and now numbers 1-6 are filled by rows first and then columns.", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-iteration-1", - "href": "modules/ModuleXX-Iteration.html#loop-iteration-1", - "title": "Iteration in R", - "section": "Loop iteration 1", - "text": "Loop iteration 1\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Biscoe\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Biscoe\", \"Island was\", island_mean,\n \"mm.\\n\"))\n\nThe mean bill depth on Biscoe Island was 15.87 mm." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#data-frame", + "href": "modules/Module07-VarCreationClassesSummaries.html#data-frame", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Data frame", + "text": "Data frame\nYou can transform an existing matrix into data frames using as.data.frame()\n\nas.data.frame(matrix(1:6, ncol = 2) ) \n\n\n\n\nV1\nV2\n\n\n\n\n1\n4\n\n\n2\n5\n\n\n3\n6\n\n\n\n\n\nYou can create a new data frame out of vectors (and potentially lists, but this is an advanced feature and unusual) by using the data.frame() function. Recall that all of the vectors that make up a data frame must be the same length.\n\nlotr <- \n data.frame(\n name = c(\"Frodo\", \"Sam\", \"Aragorn\", \"Legolas\", \"Gimli\"),\n race = c(\"Hobbit\", \"Hobbit\", \"Human\", \"Elf\", \"Dwarf\"),\n age = c(53, 38, 87, 2931, 139)\n )", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-iteration-2", - "href": "modules/ModuleXX-Iteration.html#loop-iteration-2", - "title": "Iteration in R", - "section": "Loop iteration 2", - "text": "Loop iteration 2\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Dream\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Dream\", \"Island was\", island_mean,\n \"mm.\\n\"))\n\nThe mean bill depth on Dream Island was 18.34 mm." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary", + "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Numeric variable data summary", + "text": "Numeric variable data summary\nData summarization on numeric vectors/variables:\n\nmean(): takes the mean of x\nsd(): takes the standard deviation of x\nmedian(): takes the median of x\nquantile(): displays sample quantiles of x. Default is min, IQR, max\nrange(): displays the range. 
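A toy contrast, not in the slides, between the two two-dimensional classes: a matrix forces a single class for every entry, while a data frame keeps one class per column (names reused from the lotr example above).

```r
m <- cbind(name = c("Frodo", "Sam"), age = c(53, 38))
class(m[, "age"])   # "character": the ages were coerced to text inside the matrix
d <- data.frame(name = c("Frodo", "Sam"), age = c(53, 38))
class(d$age)        # "numeric": preserved as its own column class
```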
Same as c(min(), max())\nsum(): sum of x\nmax(): maximum value in x\nmin(): minimum value in x\ncolSums(): get the columns sums of a data frame\nrowSums(): get the row sums of a data frame\ncolMeans(): get the columns means of a data frame\nrowMeans(): get the row means of a data frame\n\nNote, all of these functions have an na.rm argument for missing data.", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-iteration-3", - "href": "modules/ModuleXX-Iteration.html#loop-iteration-3", - "title": "Iteration in R", - "section": "Loop iteration 3", - "text": "Loop iteration 3\n\nisland_mean <-\n penguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Torgersen\", \"Island was\", island_mean,\n \"mm.\\n\"))\n\nThe mean bill depth on Torgersen Island was 18.43 mm." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-1", + "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-1", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Numeric variable data summary", + "text": "Numeric variable data summary\nLet’s look at a help file for mean() to make note of the na.rm argument\n\n?range\n\nRange of Values\nDescription:\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\nUsage:\n range(..., na.rm = FALSE)\n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n ## same for classes 'Date' and 'POSIXct'\n \n .rangeNum(..., na.rm, finite, isNumeric)\n \nArguments:\n ...: any 'numeric' or character objects.\nna.rm: logical, indicating if ‘NA’’s should be omitted.\nfinite: logical, indicating if all non-finite elements should be omitted.\nisNumeric: a ‘function’ returning ‘TRUE’ or ‘FALSE’ when called on ‘c(…, recursive = TRUE)’, ‘is.numeric()’ for the default ‘range()’ method.\nDetails:\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\nS4 methods:\n This is part of the S4 'Summary' group generic. Methods for it\n must use the signature 'x, ..., na.rm'.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
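A brief sketch, assuming the module's `df`, of why the `na.rm` argument matters and of the column-wise helpers listed above:

```r
mean(df$IgG_concentration)                 # NA: the variable contains missing values
mean(df$IgG_concentration, na.rm = TRUE)   # drops the NAs before averaging
colMeans(df[, c("IgG_concentration", "age")], na.rm = TRUE)  # column-wise means
```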
Wadsworth & Brooks/Cole.\nSee Also:\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\nExamples:\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#the-loop-structure-automates-this-process-for-us-so-we-dont-have-to-copy", - "href": "modules/ModuleXX-Iteration.html#the-loop-structure-automates-this-process-for-us-so-we-dont-have-to-copy", - "title": "Iteration in R", - "section": "The loop structure automates this process for us so we don’t have to copy", - "text": "The loop structure automates this process for us so we don’t have to copy\nand paste our code!\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-examples", + "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-examples", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Numeric variable data summary examples", + "text": "Numeric variable data summary examples\n\nsummary(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\nage_group\nage_group_factor\n\n\n\n\n\nMin. :5006\nMin. : 0.0054\nMin. : 1.000\nLength:651\nLength:651\nMin. :-5.2231\nMode :logical\nLength:651\nyoung :316\n\n\n\n1st Qu.:6306\n1st Qu.: 0.3000\n1st Qu.: 3.000\nClass :character\nClass :character\n1st Qu.:-1.2040\nFALSE:360\nClass :character\nmiddle:179\n\n\n\nMedian :7495\nMedian : 1.6658\nMedian : 6.000\nMode :character\nMode :character\nMedian : 0.5103\nTRUE :281\nMode :character\nold :147\n\n\n\nMean :7492\nMean : 87.3683\nMean : 6.606\nNA\nNA\nMean : 1.6074\nNA’s :10\nNA\nNA’s : 9\n\n\n\n3rd Qu.:8749\n3rd Qu.:141.4405\n3rd Qu.:10.000\nNA\nNA\n3rd Qu.: 4.9519\nNA\nNA\nNA\n\n\n\nMax. :9982\nMax. :916.4179\nMax. :15.000\nNA\nNA\nMax. 
: 6.8205\nNA\nNA\nNA\n\n\n\nNA\nNA’s :10\nNA’s :9\nNA\nNA\nNA’s :10\nNA\nNA\nNA\n\n\n\n\nrange(df$age)\n\n[1] NA NA\n\nrange(df$age, na.rm=TRUE)\n\n[1] 1 15\n\nmedian(df$IgG_concentration, na.rm=TRUE)\n\n[1] 1.665753", "crumbs": [ - "Day 2", - "Iteration in R" + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/ModuleXX-Iteration.html#the-loop-structure-automates-this-process-for-us-so-we-dont-have-to-copy-and-paste-our-code", - "href": "modules/ModuleXX-Iteration.html#the-loop-structure-automates-this-process-for-us-so-we-dont-have-to-copy-and-paste-our-code", - "title": "Iteration in R", - "section": "The loop structure automates this process for us so we don’t have to copy and paste our code!", - "text": "The loop structure automates this process for us so we don’t have to copy and paste our code!\n\nfor (this_island in levels(penguins$island)) {\n island_mean <-\n penguins$bill_depth_mm[penguins$island == this_island] |>\n mean(na.rm = TRUE) |>\n round(digits = 2)\n \n cat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n \"mm.\\n\"))\n}\n\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm." + "objectID": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summaries", + "href": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summaries", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Character variable data summaries", + "text": "Character variable data summaries\nData summarization on character or factor vectors/variables using table()\n\n?table\n\nCross Tabulation and Table Creation\nDescription:\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\nUsage:\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\nexclude: levels to remove for all factors in ‘…’. If it does not contain ‘NA’ and ‘useNA’ is not specified, it implies ‘useNA = “ifany”’. See ‘Details’ for its interpretation for non-factor arguments.\nuseNA: whether to include ‘NA’ values in the table. See ‘Details’. Can be abbreviated.\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\ndeparse.level: controls how the default ‘dnn’ is constructed. See ‘Details’.\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. 
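`quantile()` is listed above but not demonstrated; a short sketch assuming the same `df`:

```r
quantile(df$IgG_concentration, na.rm = TRUE)                       # min, quartiles, max
quantile(df$IgG_concentration, probs = c(0.1, 0.9), na.rm = TRUE)  # custom quantiles
```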
Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\nrow.names: a character vector giving the row names for the data frame.\nresponseName: the name to be used for the column of table entries, usually counts.\nstringsAsFactors: logical: should the classifying factors be returned as factors (the default) or character vectors?\nsep, base: passed to ‘provideDimnames’.\nDetails:\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\nValue:\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\nSee Also:\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\nExamples:\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), dnn = NULL) # dimnames() have no names\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? --- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#remember-write-dry-code", - "href": "modules/ModuleXX-Iteration.html#remember-write-dry-code", - "title": "Iteration in R", - "section": "Remember: write DRY code!", - "text": "Remember: write DRY code!\n\nDRY = “Don’t Repeat Yourself”\nInstead of copying and pasting, write loops and functions.\nEasier to debug and change in the future!\n\n\n\nOf course, we all copy and paste code sometimes. If you are running on a tight deadline or can’t get a loop or function to work, you might need to. DRY code is good, but working code is best!" 
+ "objectID": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summary-examples", + "href": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summary-examples", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Character variable data summary examples", + "text": "Character variable data summary examples\nNumber of observations in each category\n\ntable(df$gender)\n\n\n\n\nFemale\nMale\n\n\n\n\n325\n326\n\n\n\n\ntable(df$gender, useNA=\"always\")\n\n\n\n\nFemale\nMale\nNA\n\n\n\n\n325\n326\n0\n\n\n\n\ntable(df$age_group, useNA=\"always\")\n\n\n\n\nmiddle\nold\nyoung\nNA\n\n\n\n\n179\n147\n316\n9\n\n\n\n\n\n\ntable(df$gender)/nrow(df) #if no NA values\n\n\n\n\nFemale\nMale\n\n\n\n\n0.499232\n0.500768\n\n\n\n\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n\n\n\n\nmiddle\nold\nyoung\n\n\n\n\n0.2788162\n0.228972\n0.4922118\n\n\n\n\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n\n\n\n\nmiddle\nold\nyoung\n\n\n\n\n0.2788162\n0.228972\n0.4922118", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#tweet-slide", - "href": "modules/ModuleXX-Iteration.html#tweet-slide", - "title": "Iteration in R", - "section": "", - "text": "quart", + "objectID": "modules/Module07-VarCreationClassesSummaries.html#summary", + "href": "modules/Module07-VarCreationClassesSummaries.html#summary", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Summary", + "text": "Summary\n\nYou can create new columns/variable to a data frame by using $ or the transform() function\nOne useful function for creating new variables based on existing variables is the ifelse() function, which returns a value depending on whether the element of test is TRUE or FALSE\nThe class() function allows you to evaluate the class of an object.\nThere are two types of numeric class objects: integer and double\nLogical class objects only have TRUE or False (without quotes)\nis.CLASS_NAME(x) can be used to test the class of an object x\nas.CLASS_NAME(x) can be used to change the class of an object x\nFactors are a special character class that has levels\nThere are many fairly intuitive data summary functions you can perform on a vector (i.e., mean(), sd(), range()) or on rows or columns of a data frame (i.e., colSums(), colMeans(), rowSums())\nThe table() function builds frequency tables of the counts at each combination of categorical levels", "crumbs": [ - "Day 2", - "Iteration in R" + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" ] }, { - "objectID": "modules/ModuleXX-Iteration.html#you-try-it", - "href": "modules/ModuleXX-Iteration.html#you-try-it", - "title": "Iteration in R", - "section": "You try it!", - "text": "You try it!\nWrite a loop that goes from 1 to 10, squares each of the numbers, and prints the squared number.\n\n\nfor (i in 1:10) {\n cat(i ^ 2, \"\\n\")\n}\n\n1 \n4 \n9 \n16 \n25 \n36 \n49 \n64 \n81 \n100" + "objectID": "modules/Module07-VarCreationClassesSummaries.html#acknowledgements", + "href": "modules/Module07-VarCreationClassesSummaries.html#acknowledgements", + "title": "Module 7: Variable Creation, Classes, and Summaries", + "section": "Acknowledgements", + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns 
Hopkins University", + "crumbs": [ + "Day 1", + "Module 7: Variable Creation, Classes, and Summaries" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#wait-did-we-need-to-do-that", - "href": "modules/ModuleXX-Iteration.html#wait-did-we-need-to-do-that", - "title": "Iteration in R", - "section": "Wait, did we need to do that?", - "text": "Wait, did we need to do that?\n\nWell, yes, because you need to practice loops!\nBut technically no, because we can use vectorization.\nAlmost all basic operations in R are vectorized: they work on a vector of arguments all at the same time." + "objectID": "modules/Module09-DataAnalysis.html#learning-objectives", + "href": "modules/Module09-DataAnalysis.html#learning-objectives", + "title": "Module 9: Data Analysis", + "section": "Learning Objectives", + "text": "Learning Objectives\nAfter module 9, you should be able to…\n\nDescriptively assess association between two variables\nCompute basic statistics\nFit a generalized linear model", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#wait-did-we-need-to-do-that-1", - "href": "modules/ModuleXX-Iteration.html#wait-did-we-need-to-do-that-1", - "title": "Iteration in R", - "section": "Wait, did we need to do that?", - "text": "Wait, did we need to do that?\n\nWell, yes, because you need to practice loops!\nBut technically no, because we can use vectorization.\nAlmost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.\n\n\n# No loop needed!\n(1:10)^2\n\n [1] 1 4 9 16 25 36 49 64 81 100" + "objectID": "modules/Module09-DataAnalysis.html#import-data-for-this-module", + "href": "modules/Module09-DataAnalysis.html#import-data-for-this-module", + "title": "Module 9: Data Analysis", + "section": "Import data for this module", + "text": "Import data for this module\nLet’s read in our data (again) and take a quick look.\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#wait-did-we-need-to-do-that-2", - "href": "modules/ModuleXX-Iteration.html#wait-did-we-need-to-do-that-2", - "title": "Iteration in R", - "section": "Wait, did we need to do that?", - "text": "Wait, did we need to do that?\n\nWell, yes, because you need to practice loops!\nBut technically no, because we can use vectorization.\nAlmost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.\n\n\n# No loop needed!\n(1:10)^2\n\n [1] 1 4 9 16 25 36 49 64 81 100\n\n\n\n# Get the first 10 odd numbers, a common CS 101 loop problem on exams\n(1:20)[which((1:20 %% 2) == 1)]\n\n [1] 1 3 5 7 9 11 13 15 17 19\n\n\n\nSo you should really try vectorization first, then use loops only when you can’t use vectorization." 
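Right after an import like the read.csv() call above, it is worth confirming that each column arrived with the class you expect and counting missing values before creating new variables. A short sketch, assuming the same data/serodata.csv file and column names shown above:

```r
df <- read.csv(file = "data/serodata.csv") # same relative path as above

str(df)                       # class and dimensions of every column
summary(df$IgG_concentration) # quick numeric summary, including any NA count
colSums(is.na(df))            # number of missing values in each column
```

If a column that should be numeric shows up as character, that is usually the first thing to fix before any analysis.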
+ "objectID": "modules/Module09-DataAnalysis.html#prep-data", + "href": "modules/Module09-DataAnalysis.html#prep-data", + "title": "Module 9: Data Analysis", + "section": "Prep data", + "text": "Prep data\nCreate age_group three level factor variable\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n\nCreate seropos binary variable representing seropositivity if antibody concentrations are >10 IU/mL.\n\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-walkthrough", - "href": "modules/ModuleXX-Iteration.html#loop-walkthrough", - "title": "Iteration in R", - "section": "Loop walkthrough", - "text": "Loop walkthrough\n\nLet’s walk through a complex but useful example where we can’t use vectorization.\nLoad the cleaned measles dataset, and subset it so you only have MCV1 records.\n\n\n\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\")) |>\n subset(vaccine_antigen == \"MCV1\")\nstr(meas)\n\n'data.frame': 7972 obs. of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ..." + "objectID": "modules/Module09-DataAnalysis.html#grouped-analyses", + "href": "modules/Module09-DataAnalysis.html#grouped-analyses", + "title": "Module 9: Data Analysis", + "section": "Grouped analyses", + "text": "Grouped analyses\n\nMost of this module will discuss statistical analyses. But first we’ll discuss doing univariate analyses we’ve already used on multiple groups.\nWe can use the aggregate() function to do many analyses across groups.\n\n\n?aggregate\n\n\nlibrary(printr)\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n?aggregate\n\nCompute Summary Statistics of Data Subsets\n\nDescription:\n\n Splits the data into subsets, computes summary statistics for\n each, and returns the result in a convenient form.\n\nUsage:\n\n aggregate(x, ...)\n \n ## Default S3 method:\n aggregate(x, ...)\n \n ## S3 method for class 'data.frame'\n aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)\n \n ## S3 method for class 'formula'\n aggregate(x, data, FUN, ...,\n subset, na.action = na.omit)\n \n ## S3 method for class 'ts'\n aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,\n ts.eps = getOption(\"ts.eps\"), ...)\n \nArguments:\n\n x: an R object. For the 'formula' method a 'formula', such as\n 'y ~ x' or 'cbind(y1, y2) ~ x1 + x2', where the 'y' variables\n are numeric data to be split into groups according to the\n grouping 'x' variables (usually factors).\n\n by: a list of grouping elements, each as long as the variables in\n the data frame 'x', or a formula. 
The elements are coerced\n to factors before use.\n\n FUN: a function to compute the summary statistics which can be\n applied to all data subsets.\n\nsimplify: a logical indicating whether results should be simplified to\n a vector or matrix if possible.\n\n drop: a logical indicating whether to drop unused combinations of\n grouping values. The non-default case 'drop=FALSE' has been\n amended for R 3.5.0 to drop unused combinations.\n\n data: a data frame (or list) from which the variables in the\n formula should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\nnfrequency: new number of observations per unit of time; must be a\n divisor of the frequency of 'x'.\n\n ndeltat: new fraction of the sampling period between successive\n observations; must be a divisor of the sampling interval of\n 'x'.\n\n ts.eps: tolerance used to decide if 'nfrequency' is a sub-multiple of\n the original frequency.\n\n ...: further arguments passed to or used by methods.\n\nDetails:\n\n 'aggregate' is a generic function with methods for data frames and\n time series.\n\n The default method, 'aggregate.default', uses the time series\n method if 'x' is a time series, and otherwise coerces 'x' to a\n data frame and calls the data frame method.\n\n 'aggregate.data.frame' is the data frame method. If 'x' is not a\n data frame, it is coerced to one, which must have a non-zero\n number of rows. Then, each of the variables (columns) in 'x' is\n split into subsets of cases (rows) of identical combinations of\n the components of 'by', and 'FUN' is applied to each such subset\n with further arguments in '...' passed to it. The result is\n reformatted into a data frame containing the variables in 'by' and\n 'x'. The ones arising from 'by' contain the unique combinations\n of grouping values used for determining the subsets, and the ones\n arising from 'x' the corresponding summaries for the subset of the\n respective variables in 'x'. If 'simplify' is true, summaries are\n simplified to vectors or matrices if they have a common length of\n one or greater than one, respectively; otherwise, lists of summary\n results according to subsets are obtained. Rows with missing\n values in any of the 'by' variables will be omitted from the\n result. (Note that versions of R prior to 2.11.0 required 'FUN'\n to be a scalar function.)\n\n The formula method provides a standard formula interface to\n 'aggregate.data.frame'. The latter invokes the formula method if\n 'by' is a formula, in which case 'aggregate(x, by, FUN)' is the\n same as 'aggregate(by, x, FUN)' for a data frame 'x'.\n\n 'aggregate.ts' is the time series method, and requires 'FUN' to be\n a scalar function. If 'x' is not a time series, it is coerced to\n one. Then, the variables in 'x' are split into appropriate blocks\n of length 'frequency(x) / nfrequency', and 'FUN' is applied to\n each such block, with further (named) arguments in '...' passed to\n it. The result returned is a time series with frequency\n 'nfrequency' holding the aggregated values. 
Note that this make\n most sense for a quarterly or yearly result when the original\n series covers a whole number of quarters or years: in particular\n aggregating a monthly series to quarters starting in February does\n not give a conventional quarterly series.\n\n 'FUN' is passed to 'match.fun', and hence it can be a function or\n a symbol or character string naming a function.\n\nValue:\n\n For the time series method, a time series of class '\"ts\"' or class\n 'c(\"mts\", \"ts\")'.\n\n For the data frame method, a data frame with columns corresponding\n to the grouping variables in 'by' followed by aggregated columns\n from 'x'. If the 'by' has names, the non-empty times are used to\n label the columns in the results, with unnamed grouping variables\n being named 'Group.i' for 'by[[i]]'.\n\nWarning:\n\n The first argument of the '\"formula\"' method was named 'formula'\n rather than 'x' prior to R 4.2.0. Portable uses should not name\n that argument.\n\nAuthor(s):\n\n Kurt Hornik, with contributions by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'apply', 'lapply', 'tapply'.\n\nExamples:\n\n ## Compute the averages for the variables in 'state.x77', grouped\n ## according to the region (Northeast, South, North Central, West) that\n ## each state belongs to.\n aggregate(state.x77, list(Region = state.region), mean)\n \n ## Compute the averages according to region and the occurrence of more\n ## than 130 days of frost.\n aggregate(state.x77,\n list(Region = state.region,\n Cold = state.x77[,\"Frost\"] > 130),\n mean)\n ## (Note that no state in 'South' is THAT cold.)\n \n \n ## example with character variables and NAs\n testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),\n v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )\n by1 <- c(\"red\", \"blue\", 1, 2, NA, \"big\", 1, 2, \"red\", 1, NA, 12)\n by2 <- c(\"wet\", \"dry\", 99, 95, NA, \"damp\", 95, 99, \"red\", 99, NA, NA)\n aggregate(x = testDF, by = list(by1, by2), FUN = \"mean\")\n \n # and if you want to treat NAs as a group\n fby1 <- factor(by1, exclude = \"\")\n fby2 <- factor(by2, exclude = \"\")\n aggregate(x = testDF, by = list(fby1, fby2), FUN = \"mean\")\n \n \n ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:\n aggregate(weight ~ feed, data = chickwts, mean)\n aggregate(breaks ~ wool + tension, data = warpbreaks, mean)\n aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)\n aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)\n \n ## Dot notation:\n aggregate(. ~ Species, data = iris, mean)\n aggregate(len ~ ., data = ToothGrowth, mean)\n \n ## Often followed by xtabs():\n ag <- aggregate(len ~ ., data = ToothGrowth, mean)\n xtabs(len ~ ., data = ag)\n \n ## Formula interface via 'by' (for pipe operations)\n ToothGrowth |> aggregate(len ~ ., FUN = mean)\n \n ## Compute the average annual approval ratings for American presidents.\n aggregate(presidents, nfrequency = 1, FUN = mean)\n ## Give the summer less weight.\n aggregate(presidents, nfrequency = 1,\n FUN = weighted.mean, w = c(1, 1, 0.5, 1))", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-walkthrough-1", - "href": "modules/ModuleXX-Iteration.html#loop-walkthrough-1", - "title": "Iteration in R", - "section": "Loop walkthrough", - "text": "Loop walkthrough\n\nFirst, make an empty list. This is where we’ll store our results. 
Make it the same length as the number of countries in the dataset.\n\n\n\nres <- vector(mode = \"list\", length = length(unique(meas$country)))\n\n\nThis is called preallocation and it can make your loops much faster." + "objectID": "modules/Module09-DataAnalysis.html#grouped-analyses-1", + "href": "modules/Module09-DataAnalysis.html#grouped-analyses-1", + "title": "Module 9: Data Analysis", + "section": "Grouped analyses", + "text": "Grouped analyses\n\nLet's calculate the seropositivity rate across age groups using the variables we just created.\nThe easiest way to use aggregate() is with the formula option. The syntax is variable_of_interest ~ grouping_variables.\n\n\naggregate(\n # Formula specifies we are calculating statistics on seropos, separately for\n # each level of age_group\n seropos ~ age_group,\n data = df, # Data argument\n FUN = mean # function for our calculation WITHOUT PARENTHESES\n)\n\n\n\n\nage_group\nseropos\n\n\n\n\nyoung\n0.1832797\n\n\nmiddle\n0.6000000\n\n\nold\n0.7945205\n\n\n\n\n\n\nWe can add as many things as we want on the RHS of the formula.\n\n\naggregate(\n IgG_concentration ~ age_group + slum,\n data = df,\n FUN = sd # standard deviation\n)\n\n\n\n\nage_group\nslum\nIgG_concentration\n\n\n\n\nyoung\nMixed\n174.89797\n\n\nmiddle\nMixed\n162.08188\n\n\nold\nMixed\n150.07063\n\n\nyoung\nNon slum\n114.68422\n\n\nmiddle\nNon slum\n177.62113\n\n\nold\nNon slum\n141.22330\n\n\nyoung\nSlum\n61.85705\n\n\nmiddle\nSlum\n202.42018\n\n\nold\nSlum\n74.75217\n\n\n\n\n\n\nWe can also add multiple variables on the LHS at the same time using cbind() syntax.\n\n\naggregate(\n cbind(age, IgG_concentration) ~ gender + slum,\n data = df,\n FUN = median\n)\n\n\n\n\ngender\nslum\nage\nIgG_concentration\n\n\n\n\nFemale\nMixed\n5.0\n2.0117423\n\n\nMale\nMixed\n6.0\n2.2082192\n\n\nFemale\nNon slum\n6.0\n2.5040431\n\n\nMale\nNon slum\n5.0\n1.1245846\n\n\nFemale\nSlum\n3.0\n5.1482480\n\n\nMale\nSlum\n5.5\n0.7753834", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-walkthrough-2", - "href": "modules/ModuleXX-Iteration.html#loop-walkthrough-2", - "title": "Iteration in R", - "section": "Loop walkthrough", - "text": "Loop walkthrough\n\nLoop through every country in the dataset, and get the median, first and third quartiles, and range for each country. Store those summary statistics in a data frame.\nWhat should the header look like?\n\n\n\ncountries <- unique(meas$country)\nfor (i in 1:length(countries)) {...}\n\n\n\n\nNote that we use the index as the control variable. When you need to do complex operations inside a loop, this is easier than the for-each construction we used earlier." + "objectID": "modules/Module09-DataAnalysis.html#variable-contingency-tables", + "href": "modules/Module09-DataAnalysis.html#variable-contingency-tables", + "title": "Module 9: Data Analysis", + "section": "2 variable contingency tables", + "text": "2 variable contingency tables\nWe used table() earlier to look at one variable; now we can generate frequency tables for two or more variables. To get cell percentages, the prop.table() function is useful.\n\n?prop.table\n\n\nlibrary(printr)\n?prop.table\n\nExpress Table Entries as Fraction of Marginal Table\n\nDescription:\n\n Returns conditional proportions given 'margins', i.e. entries of\n 'x', divided by the appropriate marginal sums.\n\nUsage:\n\n proportions(x, margin = NULL)\n prop.table(x, margin = NULL)\n \nArguments:\n\n x: table\n\n margin: a vector giving the margins to split by. 
E.g., for a matrix\n '1' indicates rows, '2' indicates columns, 'c(1, 2)'\n indicates rows and columns. When 'x' has named dimnames, it\n can be a character vector selecting dimension names.\n\nValue:\n\n Table like 'x' expressed relative to 'margin'\n\nNote:\n\n 'prop.table' is an earlier name, retained for back-compatibility.\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'marginSums'. 'apply', 'sweep' are a more general mechanism for\n sweeping out marginal statistics.\n\nExamples:\n\n m <- matrix(1:4, 2)\n m\n proportions(m, 1)\n \n DF <- as.data.frame(UCBAdmissions)\n tbl <- xtabs(Freq ~ Gender + Admit, DF)\n \n proportions(tbl, \"Gender\")", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#loop-walkthrough-3", - "href": "modules/ModuleXX-Iteration.html#loop-walkthrough-3", - "title": "Iteration in R", - "section": "Loop walkthrough", - "text": "Loop walkthrough\n\nNow write out the body of the code. First we need to subset the data, to get only the data for the current country.\n\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n}\n\n\n\n\nNext we need to get the summary of the cases for that country.\n\n\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n \n # Get the summary statistics for this country\n country_cases <- country_data$Cases\n country_med <- median(country_cases, na.rm = TRUE)\n country_iqr <- IQR(country_cases, na.rm = TRUE)\n country_range <- range(country_cases, na.rm = TRUE)\n}\n\n\n\n\nNext we save the summary statistics into a data frame.\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n \n # Get the summary statistics for this country\n country_cases <- country_data$Cases\n country_quart <- quantile(\n country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n )\n country_range <- range(country_cases, na.rm = TRUE)\n \n # Save the summary statistics into a data frame\n country_summary <- data.frame(\n country = countries[[i]],\n min = country_range[[1]],\n Q1 = country_quart[[1]],\n median = country_quart[[2]],\n Q3 = country_quart[[3]],\n max = country_range[[2]]\n )\n}\n\n\n\n\nAnd finally, we save the data frame as the next element in our storage list.\n\n\nfor (i in 1:length(countries)) {\n # Get the data for the current country only\n country_data <- subset(meas, country == countries[i])\n \n # Get the summary statistics for this country\n country_cases <- country_data$Cases\n country_quart <- quantile(\n country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n )\n country_range <- range(country_cases, na.rm = TRUE)\n \n # Save the summary statistics into a data frame\n country_summary <- data.frame(\n country = countries[[i]],\n min = country_range[[1]],\n Q1 = country_quart[[1]],\n median = country_quart[[2]],\n Q3 = country_quart[[3]],\n max = country_range[[2]]\n )\n \n # Save the results to our container\n res[[i]] <- country_summary\n}\n\nWarning in min(x): no non-missing arguments to min; returning Inf\n\n\nWarning in max(x): no non-missing arguments to max; returning -Inf\n\n\nWarning in min(x): no non-missing arguments to min; returning Inf\n\n\nWarning in max(x): no non-missing arguments to max; returning -Inf\n\n\nWarning in min(x): no non-missing arguments to min; returning Inf\n\n\nWarning in max(x): no 
non-missing arguments to max; returning -Inf\n\n\n\n\n\nLet’s take a look at the results.\n\n\nhead(res)\n\n[[1]]\n country min Q1 median Q3 max\n1 Afghanistan 353 1154 2205 5166 31107\n\n[[2]]\n country min Q1 median Q3 max\n1 Angola 29 700 3271 14474 30067\n\n[[3]]\n country min Q1 median Q3 max\n1 Albania 0 1 12 29 136034\n\n[[4]]\n country min Q1 median Q3 max\n1 Andorra 0 0 1 2 5\n\n[[5]]\n country min Q1 median Q3 max\n1 United Arab Emirates 22 89.75 320 1128 2913\n\n[[6]]\n country min Q1 median Q3 max\n1 Argentina 0 0 17 4591.5 42093\n\n\n\nHow do we deal with this to get it into a nice form?\n\n\n\n\nWe can use a vectorization trick: the function do.call() seems like ancient computer science magic. And it is. But it will actually help us a lot.\n\n\nres_df <- do.call(rbind, res)\nhead(res_df)\n\n\n\n\ncountry\nmin\nQ1\nmedian\nQ3\nmax\n\n\n\n\nAfghanistan\n353\n1154.00\n2205\n5166.0\n31107\n\n\nAngola\n29\n700.00\n3271\n14474.0\n30067\n\n\nAlbania\n0\n1.00\n12\n29.0\n136034\n\n\nAndorra\n0\n0.00\n1\n2.0\n5\n\n\nUnited Arab Emirates\n22\n89.75\n320\n1128.0\n2913\n\n\nArgentina\n0\n0.00\n17\n4591.5\n42093\n\n\n\n\n\n\nIt combined our data frames together! Let’s take a look at the rbind and do.call() help packages to see what happened.\n\n\n\n\n?rbind\n\nCombine R Objects by Rows or Columns\n\nDescription:\n\n Take a sequence of vector, matrix or data-frame arguments and\n combine by _c_olumns or _r_ows, respectively. These are generic\n functions with methods for other R classes.\n\nUsage:\n\n cbind(..., deparse.level = 1)\n rbind(..., deparse.level = 1)\n ## S3 method for class 'data.frame'\n rbind(..., deparse.level = 1, make.row.names = TRUE,\n stringsAsFactors = FALSE, factor.exclude = TRUE)\n \nArguments:\n\n ...: (generalized) vectors or matrices. These can be given as\n named arguments. Other R objects may be coerced as\n appropriate, or S4 methods may be used: see sections\n 'Details' and 'Value'. (For the '\"data.frame\"' method of\n 'cbind' these can be further arguments to 'data.frame' such\n as 'stringsAsFactors'.)\n\ndeparse.level: integer controlling the construction of labels in the\n case of non-matrix-like arguments (for the default method):\n 'deparse.level = 0' constructs no labels;\n the default 'deparse.level = 1' typically and 'deparse.level\n = 2' always construct labels from the argument names, see the\n 'Value' section below.\n\nmake.row.names: (only for data frame method:) logical indicating if\n unique and valid 'row.names' should be constructed from the\n arguments.\n\nstringsAsFactors: logical, passed to 'as.data.frame'; only has an\n effect when the '...' arguments contain a (non-'data.frame')\n 'character'.\n\nfactor.exclude: if the data frames contain factors, the default 'TRUE'\n ensures that 'NA' levels of factors are kept, see PR#17562\n and the 'Data frame methods'. In R versions up to 3.6.x,\n 'factor.exclude = NA' has been implicitly hardcoded (R <=\n 3.6.0) or the default (R = 3.6.x, x >= 1).\n\nDetails:\n\n The functions 'cbind' and 'rbind' are S3 generic, with methods for\n data frames. The data frame method will be used if at least one\n argument is a data frame and the rest are vectors or matrices.\n There can be other methods; in particular, there is one for time\n series objects. See the section on 'Dispatch' for how the method\n to be used is selected. 
If some of the arguments are of an S4\n class, i.e., 'isS4(.)' is true, S4 methods are sought also, and\n the hidden 'cbind' / 'rbind' functions from package 'methods'\n maybe called, which in turn build on 'cbind2' or 'rbind2',\n respectively. In that case, 'deparse.level' is obeyed, similarly\n to the default method.\n\n In the default method, all the vectors/matrices must be atomic\n (see 'vector') or lists. Expressions are not allowed. Language\n objects (such as formulae and calls) and pairlists will be coerced\n to lists: other objects (such as names and external pointers) will\n be included as elements in a list result. Any classes the inputs\n might have are discarded (in particular, factors are replaced by\n their internal codes).\n\n If there are several matrix arguments, they must all have the same\n number of columns (or rows) and this will be the number of columns\n (or rows) of the result. If all the arguments are vectors, the\n number of columns (rows) in the result is equal to the length of\n the longest vector. Values in shorter arguments are recycled to\n achieve this length (with a 'warning' if they are recycled only\n _fractionally_).\n\n When the arguments consist of a mix of matrices and vectors the\n number of columns (rows) of the result is determined by the number\n of columns (rows) of the matrix arguments. Any vectors have their\n values recycled or subsetted to achieve this length.\n\n For 'cbind' ('rbind'), vectors of zero length (including 'NULL')\n are ignored unless the result would have zero rows (columns), for\n S compatibility. (Zero-extent matrices do not occur in S3 and are\n not ignored in R.)\n\n Matrices are restricted to less than 2^31 rows and columns even on\n 64-bit systems. So input vectors have the same length\n restriction: as from R 3.2.0 input matrices with more elements\n (but meeting the row and column restrictions) are allowed.\n\nValue:\n\n For the default method, a matrix combining the '...' arguments\n column-wise or row-wise. (Exception: if there are no inputs or\n all the inputs are 'NULL', the value is 'NULL'.)\n\n The type of a matrix result determined from the highest type of\n any of the inputs in the hierarchy raw < logical < integer <\n double < complex < character < list .\n\n For 'cbind' ('rbind') the column (row) names are taken from the\n 'colnames' ('rownames') of the arguments if these are matrix-like.\n Otherwise from the names of the arguments or where those are not\n supplied and 'deparse.level > 0', by deparsing the expressions\n given, for 'deparse.level = 1' only if that gives a sensible name\n (a 'symbol', see 'is.symbol').\n\n For 'cbind' row names are taken from the first argument with\n appropriate names: rownames for a matrix, or names for a vector of\n length the number of rows of the result.\n\n For 'rbind' column names are taken from the first argument with\n appropriate names: colnames for a matrix, or names for a vector of\n length the number of columns of the result.\n\nData frame methods:\n\n The 'cbind' data frame method is just a wrapper for\n 'data.frame(..., check.names = FALSE)'. This means that it will\n split matrix columns in data frame arguments, and convert\n character columns to factors unless 'stringsAsFactors = FALSE' is\n specified.\n\n The 'rbind' data frame method first drops all zero-column and\n zero-row arguments. (If that leaves none, it returns the first\n argument with columns otherwise a zero-column zero-row data\n frame.) 
It then takes the classes of the columns from the first\n data frame, and matches columns by name (rather than by position).\n Factors have their levels expanded as necessary (in the order of\n the levels of the level sets of the factors encountered) and the\n result is an ordered factor if and only if all the components were\n ordered factors. (The last point differs from S-PLUS.) Old-style\n categories (integer vectors with levels) are promoted to factors.\n\n Note that for result column 'j', 'factor(., exclude = X(j))' is\n applied, where\n\n X(j) := if(isTRUE(factor.exclude)) {\n if(!NA.lev[j]) NA # else NULL\n } else factor.exclude\n \n where 'NA.lev[j]' is true iff any contributing data frame has had\n a 'factor' in column 'j' with an explicit 'NA' level.\n\nDispatch:\n\n The method dispatching is _not_ done via 'UseMethod()', but by\n C-internal dispatching. Therefore there is no need for, e.g.,\n 'rbind.default'.\n\n The dispatch algorithm is described in the source file\n ('.../src/main/bind.c') as\n\n 1. For each argument we get the list of possible class\n memberships from the class attribute.\n\n 2. We inspect each class in turn to see if there is an\n applicable method.\n\n 3. If we find a method, we use it. Otherwise, if there was an\n S4 object among the arguments, we try S4 dispatch; otherwise,\n we use the default code.\n\n If you want to combine other objects with data frames, it may be\n necessary to coerce them to data frames first. (Note that this\n algorithm can result in calling the data frame method if all the\n arguments are either data frames or vectors, and this will result\n in the coercion of character vectors to factors.)\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'c' to combine vectors (and lists) as vectors, 'data.frame' to\n combine vectors and matrices as a data frame.\n\nExamples:\n\n m <- cbind(1, 1:7) # the '1' (= shorter vector) is recycled\n m\n m <- cbind(m, 8:14)[, c(1, 3, 2)] # insert a column\n m\n cbind(1:7, diag(3)) # vector is subset -> warning\n \n cbind(0, rbind(1, 1:3))\n cbind(I = 0, X = rbind(a = 1, b = 1:3)) # use some names\n xx <- data.frame(I = rep(0,2))\n cbind(xx, X = rbind(a = 1, b = 1:3)) # named differently\n \n cbind(0, matrix(1, nrow = 0, ncol = 4)) #> Warning (making sense)\n dim(cbind(0, matrix(1, nrow = 2, ncol = 0))) #-> 2 x 1\n \n ## deparse.level\n dd <- 10\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 0) # middle 2 rownames\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 1) # 3 rownames (default)\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 2) # 4 rownames\n \n ## cheap row names:\n b0 <- gl(3,4, labels=letters[1:3])\n bf <- setNames(b0, paste0(\"o\", seq_along(b0)))\n df <- data.frame(a = 1, B = b0, f = gl(4,3))\n df. <- data.frame(a = 1, B = bf, f = gl(4,3))\n new <- data.frame(a = 8, B =\"B\", f = \"1\")\n (df1 <- rbind(df , new))\n (df.1 <- rbind(df., new))\n stopifnot(identical(df1, rbind(df, new, make.row.names=FALSE)),\n identical(df1, rbind(df., new, make.row.names=FALSE)))\n\n\n\n\n\n?do.call\n\nExecute a Function Call\n\nDescription:\n\n 'do.call' constructs and executes a function call from a name or a\n function and a list of arguments to be passed to it.\n\nUsage:\n\n do.call(what, args, quote = FALSE, envir = parent.frame())\n \nArguments:\n\n what: either a function or a non-empty character string naming the\n function to be called.\n\n args: a _list_ of arguments to the function call. 
The 'names'\n attribute of 'args' gives the argument names.\n\n quote: a logical value indicating whether to quote the arguments.\n\n envir: an environment within which to evaluate the call. This will\n be most useful if 'what' is a character string and the\n arguments are symbols or quoted expressions.\n\nDetails:\n\n If 'quote' is 'FALSE', the default, then the arguments are\n evaluated (in the calling environment, not in 'envir'). If\n 'quote' is 'TRUE' then each argument is quoted (see 'quote') so\n that the effect of argument evaluation is to remove the quotes -\n leaving the original arguments unevaluated when the call is\n constructed.\n\n The behavior of some functions, such as 'substitute', will not be\n the same for functions evaluated using 'do.call' as if they were\n evaluated from the interpreter. The precise semantics are\n currently undefined and subject to change.\n\nValue:\n\n The result of the (evaluated) function call.\n\nWarning:\n\n This should not be used to attempt to evade restrictions on the\n use of '.Internal' and other non-API calls.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'call' which creates an unevaluated call.\n\nExamples:\n\n do.call(\"complex\", list(imaginary = 1:3))\n \n ## if we already have a list (e.g., a data frame)\n ## we need c() to add further arguments\n tmp <- expand.grid(letters[1:2], 1:3, c(\"+\", \"-\"))\n do.call(\"paste\", c(tmp, sep = \"\"))\n \n do.call(paste, list(as.name(\"A\"), as.name(\"B\")), quote = TRUE)\n \n ## examples of where objects will be found.\n A <- 2\n f <- function(x) print(x^2)\n env <- new.env()\n assign(\"A\", 10, envir = env)\n assign(\"f\", f, envir = env)\n f <- function(x) print(x)\n f(A) # 2\n do.call(\"f\", list(A)) # 2\n do.call(\"f\", list(A), envir = env) # 4\n do.call( f, list(A), envir = env) # 2\n do.call(\"f\", list(quote(A)), envir = env) # 100\n do.call( f, list(quote(A)), envir = env) # 10\n do.call(\"f\", list(as.name(\"A\")), envir = env) # 100\n \n eval(call(\"f\", A)) # 2\n eval(call(\"f\", quote(A))) # 2\n eval(call(\"f\", A), envir = env) # 4\n eval(call(\"f\", quote(A)), envir = env) # 100\n\n\n\n\n\nOK, so basically what happened is that\n\n\ndo.call(rbind, list)\n\n\nGets transformed into\n\n\nrbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])\n\n\nThat’s vectorization magic!" 
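Before the worked contingency-table example that follows, the margin argument of prop.table() is easiest to see on a small simulated table. A minimal sketch with made-up vectors (not the serology data):

```r
# Hypothetical two-way table, for illustration only
set.seed(1)
exposure <- sample(c("low", "high"), size = 50, replace = TRUE)
outcome  <- sample(c(0, 1), size = 50, replace = TRUE)
tab <- table(exposure, outcome)

prop.table(tab)             # cell proportions: all cells sum to 1
prop.table(tab, margin = 1) # row proportions: each row sums to 1
prop.table(tab, margin = 2) # column proportions: each column sums to 1
```

The margin = 2 call is the same pattern used for the column percentages in the next example.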
+ "objectID": "modules/Module09-DataAnalysis.html#variable-contingency-tables-1", + "href": "modules/Module09-DataAnalysis.html#variable-contingency-tables-1", + "title": "Module 9: Data Analysis", + "section": "2 variable contingency tables", + "text": "2 variable contingency tables\nLet’s practice\n\nfreq <- table(df$age_group, df$seropos)\nfreq\n\n\n\n\n/\n0\n1\n\n\n\n\nyoung\n254\n57\n\n\nmiddle\n70\n105\n\n\nold\n30\n116\n\n\n\n\n\nNow, lets move to percentages\n\nprop.cell.percentages <- prop.table(freq)\nprop.cell.percentages\n\n\n\n\n/\n0\n1\n\n\n\n\nyoung\n0.4018987\n0.0901899\n\n\nmiddle\n0.1107595\n0.1661392\n\n\nold\n0.0474684\n0.1835443\n\n\n\n\nprop.column.percentages <- prop.table(freq, margin=2)\nprop.column.percentages\n\n\n\n\n/\n0\n1\n\n\n\n\nyoung\n0.7175141\n0.2050360\n\n\nmiddle\n0.1977401\n0.3776978\n\n\nold\n0.0847458\n0.4172662", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#you-try-it-if-we-have-time", - "href": "modules/ModuleXX-Iteration.html#you-try-it-if-we-have-time", - "title": "Iteration in R", - "section": "You try it! (if we have time)", - "text": "You try it! (if we have time)\n\nUse the code you wrote before the get the incidence per 1000 people on the entire measles data set (add a column for incidence to the full data).\nUse the code plot(NULL, NULL, ...) to make a blank plot. You will need to set the xlim and ylim arguments to sensible values, and specify the axis titles as “Year” and “Incidence per 1000 people”.\nUsing a for loop and the lines() function, make a plot that shows all of the incidence curves over time, overlapping on the plot.\nHINT: use col = adjustcolor(black, alpha.f = 0.25) to make the curves transparent, so you can see the others.\nBONUS PROBLEM: using the function cumsum(), make a plot of the cumulative incidence per 1000 people over time for all of the countries. (Dealing with the NA’s here is tricky!!)" + "objectID": "modules/Module09-DataAnalysis.html#chi-square-test", + "href": "modules/Module09-DataAnalysis.html#chi-square-test", + "title": "Module 9: Data Analysis", + "section": "Chi-Square test", + "text": "Chi-Square test\nThe chisq.test() function test of independence of factor variables from stats package.\n\n?chisq.test\n\nPearson’s Chi-squared Test for Count Data\nDescription:\n 'chisq.test' performs chi-squared contingency table tests and\n goodness-of-fit tests.\nUsage:\n chisq.test(x, y = NULL, correct = TRUE,\n p = rep(1/length(x), length(x)), rescale.p = FALSE,\n simulate.p.value = FALSE, B = 2000)\n \nArguments:\n x: a numeric vector or matrix. 'x' and 'y' can also both be\n factors.\n\n y: a numeric vector; ignored if 'x' is a matrix. If 'x' is a\n factor, 'y' should be a factor of the same length.\ncorrect: a logical indicating whether to apply continuity correction when computing the test statistic for 2 by 2 tables: one half is subtracted from all |O - E| differences; however, the correction will not be bigger than the differences themselves. No correction is done if ‘simulate.p.value = TRUE’.\n p: a vector of probabilities of the same length as 'x'. An\n error is given if any entry of 'p' is negative.\nrescale.p: a logical scalar; if TRUE then ‘p’ is rescaled (if necessary) to sum to 1. 
If ‘rescale.p’ is FALSE, and ‘p’ does not sum to 1, an error is given.\nsimulate.p.value: a logical indicating whether to compute p-values by Monte Carlo simulation.\n B: an integer specifying the number of replicates used in the\n Monte Carlo test.\nDetails:\n If 'x' is a matrix with one row or column, or if 'x' is a vector\n and 'y' is not given, then a _goodness-of-fit test_ is performed\n ('x' is treated as a one-dimensional contingency table). The\n entries of 'x' must be non-negative integers. In this case, the\n hypothesis tested is whether the population probabilities equal\n those in 'p', or are all equal if 'p' is not given.\n\n If 'x' is a matrix with at least two rows and columns, it is taken\n as a two-dimensional contingency table: the entries of 'x' must be\n non-negative integers. Otherwise, 'x' and 'y' must be vectors or\n factors of the same length; cases with missing values are removed,\n the objects are coerced to factors, and the contingency table is\n computed from these. Then Pearson's chi-squared test is performed\n of the null hypothesis that the joint distribution of the cell\n counts in a 2-dimensional contingency table is the product of the\n row and column marginals.\n\n If 'simulate.p.value' is 'FALSE', the p-value is computed from the\n asymptotic chi-squared distribution of the test statistic;\n continuity correction is only used in the 2-by-2 case (if\n 'correct' is 'TRUE', the default). Otherwise the p-value is\n computed for a Monte Carlo test (Hope, 1968) with 'B' replicates.\n The default 'B = 2000' implies a minimum p-value of about 0.0005\n (1/(B+1)).\n\n In the contingency table case, simulation is done by random\n sampling from the set of all contingency tables with given\n marginals, and works only if the marginals are strictly positive.\n Continuity correction is never used, and the statistic is quoted\n without it. Note that this is not the usual sampling situation\n assumed for the chi-squared test but rather that for Fisher's\n exact test.\n\n In the goodness-of-fit case simulation is done by random sampling\n from the discrete distribution specified by 'p', each sample being\n of size 'n = sum(x)'. This simulation is done in R and may be\n slow.\nValue:\n A list with class '\"htest\"' containing the following components:\nstatistic: the value the chi-squared test statistic.\nparameter: the degrees of freedom of the approximate chi-squared distribution of the test statistic, ‘NA’ if the p-value is computed by Monte Carlo simulation.\np.value: the p-value for the test.\nmethod: a character string indicating the type of test performed, and whether Monte Carlo simulation or continuity correction was used.\ndata.name: a character string giving the name(s) of the data.\nobserved: the observed counts.\nexpected: the expected counts under the null hypothesis.\nresiduals: the Pearson residuals, ‘(observed - expected) / sqrt(expected)’.\nstdres: standardized residuals, ‘(observed - expected) / sqrt(V)’, where ‘V’ is the residual cell variance (Agresti, 2007, section 2.4.5 for the case where ‘x’ is a matrix, ‘n * p * (1 - p)’ otherwise).\nSource:\n The code for Monte Carlo simulation is a C translation of the\n Fortran algorithm of Patefield (1981).\nReferences:\n Hope, A. C. A. (1968). A simplified Monte Carlo significance test\n procedure. _Journal of the Royal Statistical Society Series B_,\n *30*, 582-598. doi:10.1111/j.2517-6161.1968.tb00759.x\n <https://doi.org/10.1111/j.2517-6161.1968.tb00759.x>.\n\n Patefield, W. M. (1981). 
Algorithm AS 159: An efficient method of\n generating r x c tables with given row and column totals.\n _Applied Statistics_, *30*, 91-97. doi:10.2307/2346669\n <https://doi.org/10.2307/2346669>.\n\n Agresti, A. (2007). _An Introduction to Categorical Data\n Analysis_, 2nd ed. New York: John Wiley & Sons. Page 38.\nSee Also:\n For goodness-of-fit testing, notably of continuous distributions,\n 'ks.test'.\nExamples:\n ## From Agresti(2007) p.39\n M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))\n dimnames(M) <- list(gender = c(\"F\", \"M\"),\n party = c(\"Democrat\",\"Independent\", \"Republican\"))\n (Xsq <- chisq.test(M)) # Prints test summary\n Xsq$observed # observed counts (same as M)\n Xsq$expected # expected counts under the null\n Xsq$residuals # Pearson residuals\n Xsq$stdres # standardized residuals\n \n \n ## Effect of simulating p-values\n x <- matrix(c(12, 5, 7, 7), ncol = 2)\n chisq.test(x)$p.value # 0.4233\n chisq.test(x, simulate.p.value = TRUE, B = 10000)$p.value\n # around 0.29!\n \n ## Testing for population probabilities\n ## Case A. Tabulated data\n x <- c(A = 20, B = 15, C = 25)\n chisq.test(x)\n chisq.test(as.table(x)) # the same\n x <- c(89,37,30,28,2)\n p <- c(40,20,20,15,5)\n try(\n chisq.test(x, p = p) # gives an error\n )\n chisq.test(x, p = p, rescale.p = TRUE)\n # works\n p <- c(0.40,0.20,0.20,0.19,0.01)\n # Expected count in category 5\n # is 1.86 < 5 ==> chi square approx.\n chisq.test(x, p = p) # maybe doubtful, but is ok!\n chisq.test(x, p = p, simulate.p.value = TRUE)\n \n ## Case B. Raw data\n x <- trunc(5 * runif(100))\n chisq.test(table(x)) # NOT 'chisq.test(x)'!", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#main-problem-solution", - "href": "modules/ModuleXX-Iteration.html#main-problem-solution", - "title": "Iteration in R", - "section": "Main problem solution", - "text": "Main problem solution\n\nmeas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000\ncountries <- unique(meas$country)\n\nplot(\n NULL, NULL,\n xlim = c(1980, 2022),\n ylim = c(0, 50),\n xlab = \"Year\",\n ylab = \"Incidence per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n country_data <- subset(meas, country == countries[[i]])\n lines(\n x = country_data$time,\n y = country_data$cases_per_thousand,\n col = adjustcolor(\"black\", alpha.f = 0.25)\n )\n}" + "objectID": "modules/Module09-DataAnalysis.html#chi-square-test-1", + "href": "modules/Module09-DataAnalysis.html#chi-square-test-1", + "title": "Module 9: Data Analysis", + "section": "Chi-Square test", + "text": "Chi-Square test\n\nchisq.test(freq)\n\n\n Pearson's Chi-squared test\n\ndata: freq\nX-squared = 175.85, df = 2, p-value < 2.2e-16\n\n\nWe reject the null hypothesis that the proportion of seropositive individuals in the young, middle, and old age groups are the same.", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#bonus-problem-solution", - "href": "modules/ModuleXX-Iteration.html#bonus-problem-solution", - "title": "Iteration in R", - "section": "Bonus problem solution", - "text": "Bonus problem solution\n\n# First calculate the cumulative cases, treating NA as zeroes\ncumulative_cases <- ave(\n x = ifelse(is.na(meas$Cases), 0, meas$Cases),\n meas$country,\n FUN = cumsum\n)\n\n# Now put the NAs back where they should be\nmeas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)\n\nplot(\n NULL, NULL,\n xlim = c(1980, 2022),\n ylim = c(1, 6.2e6),\n xlab = 
\"Year\",\n ylab = \"Cumulative cases per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n country_data <- subset(meas, country == countries[[i]])\n lines(\n x = country_data$time,\n y = country_data$cumulative_cases,\n col = adjustcolor(\"black\", alpha.f = 0.25)\n )\n}\n\ntext(\n x = 2020,\n y = 6e6,\n labels = \"China →\"\n)" + "objectID": "modules/Module09-DataAnalysis.html#correlation", + "href": "modules/Module09-DataAnalysis.html#correlation", + "title": "Module 9: Data Analysis", + "section": "Correlation", + "text": "Correlation\nFirst, we compute correlation by providing two vectors.\nLike other functions, if there are NAs, you get NA as the result. But if you specify use only the complete observations, then it will give you correlation using the non-missing data.\n\ncor(df$age, df$IgG_concentration, method=\"pearson\")\n\n[1] NA\n\ncor(df$age, df$IgG_concentration, method=\"pearson\", use = \"complete.obs\") #IF have missing data\n\n[1] 0.2604783\n\n\nSmall positive correlation between IgG concentration and age.", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#main-problem-solution-1", - "href": "modules/ModuleXX-Iteration.html#main-problem-solution-1", - "title": "Iteration in R", - "section": "Main problem solution", - "text": "Main problem solution" + "objectID": "modules/Module09-DataAnalysis.html#correlation-confidence-interval", + "href": "modules/Module09-DataAnalysis.html#correlation-confidence-interval", + "title": "Module 9: Data Analysis", + "section": "Correlation confidence interval", + "text": "Correlation confidence interval\nThe function cor.test() also gives you the confidence interval of the correlation statistic. Note, it uses complete observations by default.\n\ncor.test(df$age, df$IgG_concentration, method=\"pearson\")\n\n\n Pearson's product-moment correlation\n\ndata: df$age and df$IgG_concentration\nt = 6.7717, df = 630, p-value = 2.921e-11\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n 0.1862722 0.3317295\nsample estimates:\n cor \n0.2604783", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#bonus-problem-solution-1", - "href": "modules/ModuleXX-Iteration.html#bonus-problem-solution-1", - "title": "Iteration in R", - "section": "Bonus problem solution", - "text": "Bonus problem solution" + "objectID": "modules/Module09-DataAnalysis.html#t-test", + "href": "modules/Module09-DataAnalysis.html#t-test", + "title": "Module 9: Data Analysis", + "section": "T-test", + "text": "T-test\nThe commonly used are:\n\none-sample t-test – used to test mean of a variable in one group (to the null hypothesis mean)\ntwo-sample t-test – used to test difference in means of a variable between two groups (null hypothesis - the group means are the same)", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-Iteration.html#more-practice-on-your-own", - "href": "modules/ModuleXX-Iteration.html#more-practice-on-your-own", - "title": "Iteration in R", - "section": "More practice on your own", - "text": "More practice on your own\n\nMerge the countries-regions.csv data with the measles_final.Rds data. Reshape the measles data so that MCV1 and MCV2 vaccine coverage are two separate columns. Then use a loop to fit a poisson regression model for each continent where Cases is the outcome, and MCV1 coverage and MCV2 coverage are the predictors. 
Discuss your findings, and try adding an interation term.\nAssess the impact of age_months as a confounder in the Diphtheria serology data. First, write code to transform age_months into age ranges for each year. Then, using a loop, calculate the crude odds ratio for the effect of vaccination on infection for each of the age ranges. How does the odds ratio change as age increases? Can you formalize this analysis by fitting a logistic regression model with age_months and vaccination as predictors?" + "objectID": "modules/Module09-DataAnalysis.html#t-test-1", + "href": "modules/Module09-DataAnalysis.html#t-test-1", + "title": "Module 9: Data Analysis", + "section": "T-test", + "text": "T-test\nWe can use the t.test() function from the stats package.\n\n?t.test\n\nStudent’s t-Test\nDescription:\n Performs one and two sample t-tests on vectors of data.\nUsage:\n t.test(x, ...)\n \n ## Default S3 method:\n t.test(x, y = NULL,\n alternative = c(\"two.sided\", \"less\", \"greater\"),\n mu = 0, paired = FALSE, var.equal = FALSE,\n conf.level = 0.95, ...)\n \n ## S3 method for class 'formula'\n t.test(formula, data, subset, na.action, ...)\n \nArguments:\n x: a (non-empty) numeric vector of data values.\n\n y: an optional (non-empty) numeric vector of data values.\nalternative: a character string specifying the alternative hypothesis, must be one of ‘“two.sided”’ (default), ‘“greater”’ or ‘“less”’. You can specify just the initial letter.\n mu: a number indicating the true value of the mean (or difference\n in means if you are performing a two sample test).\npaired: a logical indicating whether you want a paired t-test.\nvar.equal: a logical variable indicating whether to treat the two variances as being equal. If ‘TRUE’ then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.\nconf.level: confidence level of the interval.\nformula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric variable giving the data values and ‘rhs’ either ‘1’ for a one-sample or paired test or a factor with two levels giving the corresponding groups. If ‘lhs’ is of class ‘“Pair”’ and ‘rhs’ is ‘1’, a paired test is done.\ndata: an optional matrix or data frame (or similar: see\n 'model.frame') containing the variables in the formula\n 'formula'. By default the variables are taken from\n 'environment(formula)'.\nsubset: an optional vector specifying a subset of observations to be used.\nna.action: a function which indicates what should happen when the data contain ‘NA’s. Defaults to ’getOption(“na.action”)’.\n ...: further arguments to be passed to or from methods.\nDetails:\n 'alternative = \"greater\"' is the alternative that 'x' has a larger\n mean than 'y'. For the one-sample case: that the mean is positive.\n\n If 'paired' is 'TRUE' then both 'x' and 'y' must be specified and\n they must be the same length. Missing values are silently removed\n (in pairs if 'paired' is 'TRUE'). If 'var.equal' is 'TRUE' then\n the pooled estimate of the variance is used. 
By default, if\n 'var.equal' is 'FALSE' then the variance is estimated separately\n for both groups and the Welch modification to the degrees of\n freedom is used.\n\n If the input data are effectively constant (compared to the larger\n of the two means) an error is generated.\nValue:\n A list with class '\"htest\"' containing the following components:\nstatistic: the value of the t-statistic.\nparameter: the degrees of freedom for the t-statistic.\np.value: the p-value for the test.\nconf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.\nestimate: the estimated mean or difference in means depending on whether it was a one-sample test or a two-sample test.\nnull.value: the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test.\nstderr: the standard error of the mean (difference), used as denominator in the t-statistic formula.\nalternative: a character string describing the alternative hypothesis.\nmethod: a character string indicating what type of t-test was performed.\ndata.name: a character string giving the name(s) of the data.\nSee Also:\n 'prop.test'\nExamples:\n require(graphics)\n \n t.test(1:10, y = c(7:20)) # P = .00001855\n t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore\n \n ## Classical example: Student's sleep data\n plot(extra ~ group, data = sleep)\n ## Traditional interface\n with(sleep, t.test(extra[group == 1], extra[group == 2]))\n \n ## Formula interface\n t.test(extra ~ group, data = sleep)\n \n ## Formula interface to one-sample test\n t.test(extra ~ 1, data = sleep)\n \n ## Formula interface to paired test\n ## The sleep data are actually paired, so could have been in wide format:\n sleep2 <- reshape(sleep, direction = \"wide\", \n idvar = \"ID\", timevar = \"group\")\n t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { + "objectID": "modules/Module09-DataAnalysis.html#running-two-sample-t-test", + "href": "modules/Module09-DataAnalysis.html#running-two-sample-t-test", + "title": "Module 9: Data Analysis", + "section": "Running two-sample t-test", + "text": "Running two-sample t-test\nThe base R t.test() function comes from the stats package. It tests the difference in means of a variable between two groups. 
By default:\n\ntests whether the difference in means of a variable is equal to 0 (default mu=0)\nuses a two-sided alternative (alternative = \"two.sided\")\nreturns results assuming a confidence level of 0.95 (conf.level = 0.95)\nassumes data are not paired (paired = FALSE)\nassumes the true variance in the two groups is not equal (var.equal = FALSE)", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { + "objectID": "modules/Module09-DataAnalysis.html#running-two-sample-t-test-1", + "href": "modules/Module09-DataAnalysis.html#running-two-sample-t-test-1", + "title": "Module 9: Data Analysis", + "section": "Running two-sample t-test", + "text": "Running two-sample t-test\n\nIgG_young <- df$IgG_concentration[df$age_group==\"young\"]\nIgG_old <- df$IgG_concentration[df$age_group==\"old\"]\n\nt.test(IgG_young, IgG_old)\n\n\n Welch Two Sample t-test\n\ndata: IgG_young and IgG_old\nt = -6.1969, df = 259.54, p-value = 2.25e-09\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -111.09281 -57.51515\nsample estimates:\nmean of x mean of y \n 45.05056 129.35454 \n\n\nThe mean IgG concentration of the young and old groups is 45.05 and 129.35 IU/mL, respectively. We reject the null hypothesis that the difference in the mean IgG concentration of the young and old groups is 0 IU/mL.", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { + "objectID": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r", + "href": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r", + "title": "Module 9: Data Analysis", + "section": "Linear regression fit in R", + "text": "Linear regression fit in R\nTo fit regression models in R, we use the function glm() (Generalized Linear Model).\n\n?glm\n\nFitting Generalized Linear Models\nDescription:\n 'glm' is used to fit generalized linear models, specified by\n giving a symbolic description of the linear predictor and a\n description of the error distribution.\nUsage:\n glm(formula, family = gaussian, data, weights, subset,\n na.action, start = NULL, etastart, mustart, offset,\n control = list(...), model = TRUE, method = \"glm.fit\",\n x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)\n \n glm.fit(x, y, weights = rep.int(1, nobs),\n start = NULL, etastart = NULL, mustart = NULL,\n offset = rep.int(0, nobs), family = gaussian(),\n control = list(), intercept = TRUE, singular.ok = 
TRUE)\n \n ## S3 method for class 'glm'\n weights(object, type = c(\"prior\", \"working\"), ...)\n \nArguments:\nformula: an object of class ‘“formula”’ (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.\nfamily: a description of the error distribution and link function to be used in the model. For ‘glm’ this can be a character string naming a family function, a family function or the result of a call to a family function. For ‘glm.fit’ only the third option is supported. (See ‘family’ for details of family functions.)\ndata: an optional data frame, list or environment (or object\n coercible by 'as.data.frame' to a data frame) containing the\n variables in the model. If not found in 'data', the\n variables are taken from 'environment(formula)', typically\n the environment from which 'glm' is called.\nweights: an optional vector of ‘prior weights’ to be used in the fitting process. Should be ‘NULL’ or a numeric vector.\nsubset: an optional vector specifying a subset of observations to be used in the fitting process.\nna.action: a function which indicates what should happen when the data contain ‘NA’s. The default is set by the ’na.action’ setting of ‘options’, and is ‘na.fail’ if that is unset. The ‘factory-fresh’ default is ‘na.omit’. Another possible value is ‘NULL’, no action. Value ‘na.exclude’ can be useful.\nstart: starting values for the parameters in the linear predictor.\netastart: starting values for the linear predictor.\nmustart: starting values for the vector of means.\noffset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be ‘NULL’ or a numeric vector of length equal to the number of cases. One or more ‘offset’ terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See ‘model.offset’.\ncontrol: a list of parameters for controlling the fitting process. For ‘glm.fit’ this is passed to ‘glm.control’.\nmodel: a logical value indicating whether model frame should be included as a component of the returned value.\nmethod: the method to be used in fitting the model. The default method ‘“glm.fit”’ uses iteratively reweighted least squares (IWLS): the alternative ‘“model.frame”’ returns the model frame and does no fitting.\n User-supplied fitting functions can be supplied either as a\n function or a character string naming a function, with a\n function which takes the same arguments as 'glm.fit'. If\n specified as a character string it is looked up from within\n the 'stats' namespace.\n\nx, y: For 'glm': logical values indicating whether the response\n vector and model matrix used in the fitting process should be\n returned as components of the returned value.\n\n For 'glm.fit': 'x' is a design matrix of dimension 'n * p',\n and 'y' is a vector of observations of length 'n'.\nsingular.ok: logical; if ‘FALSE’ a singular fit is an error.\ncontrasts: an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.\nintercept: logical. Should an intercept be included in the null model?\nobject: an object inheriting from class ‘“glm”’.\ntype: character, partial matching allowed. Type of weights to\n extract from the fitted model object. 
Can be abbreviated.\n\n ...: For 'glm': arguments to be used to form the default 'control'\n argument if it is not supplied directly.\n\n For 'weights': further arguments passed to or from other\n methods.\nDetails:\n A typical predictor has the form 'response ~ terms' where\n 'response' is the (numeric) response vector and 'terms' is a\n series of terms which specifies a linear predictor for 'response'.\n For 'binomial' and 'quasibinomial' families the response can also\n be specified as a 'factor' (when the first level denotes failure\n and all others success) or as a two-column matrix with the columns\n giving the numbers of successes and failures. A terms\n specification of the form 'first + second' indicates all the terms\n in 'first' together with all the terms in 'second' with any\n duplicates removed.\n\n A specification of the form 'first:second' indicates the set of\n terms obtained by taking the interactions of all terms in 'first'\n with all terms in 'second'. The specification 'first*second'\n indicates the _cross_ of 'first' and 'second'. This is the same\n as 'first + second + first:second'.\n\n The terms in the formula will be re-ordered so that main effects\n come first, followed by the interactions, all second-order, all\n third-order and so on: to avoid this pass a 'terms' object as the\n formula.\n\n Non-'NULL' 'weights' can be used to indicate that different\n observations have different dispersions (with the values in\n 'weights' being inversely proportional to the dispersions); or\n equivalently, when the elements of 'weights' are positive integers\n w_i, that each response y_i is the mean of w_i unit-weight\n observations. For a binomial GLM prior weights are used to give\n the number of trials when the response is the proportion of\n successes: they would rarely be used for a Poisson GLM.\n\n 'glm.fit' is the workhorse function: it is not normally called\n directly but can be more efficient where the response vector,\n design matrix and family have already been calculated.\n\n If more than one of 'etastart', 'start' and 'mustart' is\n specified, the first in the list will be used. It is often\n advisable to supply starting values for a 'quasi' family, and also\n for families with unusual links such as 'gaussian(\"log\")'.\n\n All of 'weights', 'subset', 'offset', 'etastart' and 'mustart' are\n evaluated in the same way as variables in 'formula', that is first\n in 'data' and then in the environment of 'formula'.\n\n For the background to warning messages about 'fitted probabilities\n numerically 0 or 1 occurred' for binomial GLMs, see Venables &\n Ripley (2002, pp. 197-8).\nValue:\n 'glm' returns an object of class inheriting from '\"glm\"' which\n inherits from the class '\"lm\"'. See later in this section. 
If a\n non-standard 'method' is used, the object will also inherit from\n the class (if any) returned by that function.\n\n The function 'summary' (i.e., 'summary.glm') can be used to obtain\n or print a summary of the results and the function 'anova' (i.e.,\n 'anova.glm') to produce an analysis of variance table.\n\n The generic accessor functions 'coefficients', 'effects',\n 'fitted.values' and 'residuals' can be used to extract various\n useful features of the value returned by 'glm'.\n\n 'weights' extracts a vector of weights, one for each case in the\n fit (after subsetting and 'na.action').\n\n An object of class '\"glm\"' is a list containing at least the\n following components:\ncoefficients: a named vector of coefficients\nresiduals: the working residuals, that is the residuals in the final iteration of the IWLS fit. Since cases with zero weights are omitted, their working residuals are ‘NA’.\nfitted.values: the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.\nrank: the numeric rank of the fitted linear model.\nfamily: the ‘family’ object used.\nlinear.predictors: the linear fit on link scale.\ndeviance: up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.\n aic: A version of Akaike's _An Information Criterion_, minus twice\n the maximized log-likelihood plus twice the number of\n parameters, computed via the 'aic' component of the family.\n For binomial and Poison families the dispersion is fixed at\n one and the number of parameters is the number of\n coefficients. For gaussian, Gamma and inverse gaussian\n families the dispersion is estimated from the residual\n deviance, and the number of parameters is the number of\n coefficients plus one. For a gaussian family the MLE of the\n dispersion is used so this is a valid value of AIC, but for\n Gamma and inverse gaussian families it is not. For families\n fitted by quasi-likelihood the value is 'NA'.\nnull.deviance: The deviance for the null model, comparable with ‘deviance’. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation.\niter: the number of iterations of IWLS used.\nweights: the working weights, that is the weights in the final iteration of the IWLS fit.\nprior.weights: the weights initially supplied, a vector of ’1’s if none were.\ndf.residual: the residual degrees of freedom.\ndf.null: the residual degrees of freedom for the null model.\n y: if requested (the default) the 'y' vector used. (It is a\n vector even for a binomial model.)\n\n x: if requested, the model matrix.\nmodel: if requested (the default), the model frame.\nconverged: logical. Was the IWLS algorithm judged to have converged?\nboundary: logical. 
Is the fitted value on the boundary of the attainable values?\ncall: the matched call.\nformula: the formula supplied.\nterms: the ‘terms’ object used.\ndata: the 'data argument'.\noffset: the offset vector used.\ncontrol: the value of the ‘control’ argument used.\nmethod: the name of the fitter function used (when provided as a ‘character’ string to ‘glm()’) or the fitter ‘function’ (when provided as that).\ncontrasts: (where relevant) the contrasts used.\nxlevels: (where relevant) a record of the levels of the factors used in fitting.\nna.action: (where relevant) information returned by ‘model.frame’ on the special handling of ’NA’s.\n In addition, non-empty fits will have components 'qr', 'R' and\n 'effects' relating to the final weighted linear fit.\n\n Objects of class '\"glm\"' are normally of class 'c(\"glm\", \"lm\")',\n that is inherit from class '\"lm\"', and well-designed methods for\n class '\"lm\"' will be applied to the weighted linear model at the\n final iteration of IWLS. However, care is needed, as extractor\n functions for class '\"glm\"' such as 'residuals' and 'weights' do\n *not* just pick out the component of the fit with the same name.\n\n If a 'binomial' 'glm' model was specified by giving a two-column\n response, the weights returned by 'prior.weights' are the total\n numbers of cases (factored by the supplied case weights) and the\n component 'y' of the result is the proportion of successes.\nFitting functions:\n The argument 'method' serves two purposes. One is to allow the\n model frame to be recreated with no fitting. The other is to\n allow the default fitting function 'glm.fit' to be replaced by a\n function which takes the same arguments and uses a different\n fitting algorithm. If 'glm.fit' is supplied as a character string\n it is used to search for a function of that name, starting in the\n 'stats' namespace.\n\n The class of the object return by the fitter (if any) will be\n prepended to the class returned by 'glm'.\nAuthor(s):\n The original R implementation of 'glm' was written by Simon Davies\n working for Ross Ihaka at the University of Auckland, but has\n since been extensively re-written by members of the R Core team.\n\n The design was inspired by the S function of the same name\n described in Hastie & Pregibon (1992).\nReferences:\n Dobson, A. J. (1990) _An Introduction to Generalized Linear\n Models._ London: Chapman and Hall.\n\n Hastie, T. J. and Pregibon, D. (1992) _Generalized linear models._\n Chapter 6 of _Statistical Models in S_ eds J. M. Chambers and T.\n J. Hastie, Wadsworth & Brooks/Cole.\n\n McCullagh P. and Nelder, J. A. (1989) _Generalized Linear Models._\n London: Chapman and Hall.\n\n Venables, W. N. and Ripley, B. D. (2002) _Modern Applied\n Statistics with S._ New York: Springer.\nSee Also:\n 'anova.glm', 'summary.glm', etc. 
for 'glm' methods, and the\n generic functions 'anova', 'summary', 'effects', 'fitted.values',\n and 'residuals'.\n\n 'lm' for non-generalized _linear_ models (which SAS calls GLMs,\n for 'general' linear models).\n\n 'loglin' and 'loglm' (package 'MASS') for fitting log-linear\n models (which binomial and Poisson GLMs are) to contingency\n tables.\n\n 'bigglm' in package 'biglm' for an alternative way to fit GLMs to\n large datasets (especially those with many cases).\n\n 'esoph', 'infert' and 'predict.glm' have examples of fitting\n binomial glms.\nExamples:\n ## Dobson (1990) Page 93: Randomized Controlled Trial :\n counts <- c(18,17,15,20,10,20,25,13,12)\n outcome <- gl(3,1,9)\n treatment <- gl(3,3)\n data.frame(treatment, outcome, counts) # showing data\n glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())\n anova(glm.D93)\n summary(glm.D93)\n ## Computing AIC [in many ways]:\n (A0 <- AIC(glm.D93))\n (ll <- logLik(glm.D93))\n A1 <- -2*c(ll) + 2*attr(ll, \"df\")\n A2 <- glm.D93$family$aic(counts, mu=fitted(glm.D93), wt=1) +\n 2 * length(coef(glm.D93))\n stopifnot(exprs = {\n all.equal(A0, A1)\n all.equal(A1, A2)\n all.equal(A1, glm.D93$aic)\n })\n \n \n ## an example with offsets from Venables & Ripley (2002, p.189)\n utils::data(anorexia, package = \"MASS\")\n \n anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),\n family = gaussian, data = anorexia)\n summary(anorex.1)\n \n \n # A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)\n clotting <- data.frame(\n u = c(5,10,15,20,30,40,60,80,100),\n lot1 = c(118,58,42,35,27,25,21,19,18),\n lot2 = c(69,35,26,21,18,16,13,12,12))\n summary(glm(lot1 ~ log(u), data = clotting, family = Gamma))\n summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))\n ## Aliased (\"S\"ingular) -> 1 NA coefficient\n (fS <- glm(lot2 ~ log(u) + log(u^2), data = clotting, family = Gamma))\n tools::assertError(update(fS, singular.ok=FALSE), verbose=interactive())\n ## -> .. \"singular fit encountered\"\n \n ## Not run:\n \n ## for an example of the use of a terms object as a formula\n demo(glm.vr)\n ## End(Not run)", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-RMarkdown.html#r-markdown-and-quarto", - "href": "modules/ModuleXX-RMarkdown.html#r-markdown-and-quarto", - "title": "Literate Programming", - "section": "R Markdown and Quarto", - "text": "R Markdown and Quarto\n\nR Markdown and Quarto are both implementations of literate programming using R, with the knitr package for the backend. Both are supported by RStudio.\nTo use R Markdown, you need to install.packages(\"rmarkdown\").\nQuarto comes with new versions of RStudio, but you can also install the latest version from the Quarto website.\nR Markdown is older and now very commonly used. Quarto is newer and so has many fancy new features, but more bugs that are constantly being found and fixed.\nIn this class, we will use R Markdown. But if you decide to use quarto, 90% of your knowledge will transfer since they are very similar.\n\nAdvantages of R Markdown: more online resources, most common bugs have been fixed over the years, many people are familiar with it.\nAdvantages of Quarto: supports other programming languages like Python and Julia, uses more modern syntax, less slapped together overall." 
+ "objectID": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-1", + "href": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-1", + "title": "Module 9: Data Analysis", + "section": "Linear regression fit in R", + "text": "Linear regression fit in R\nWe tend to focus on three arguments:\n\nformula – model formula written using names of columns in our data\ndata – our data frame\nfamily – error distribution and link function\n\n\nfit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())\nfit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = \"logit\"))", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-RMarkdown.html#what-is-literate-programming-1", - "href": "modules/ModuleXX-RMarkdown.html#what-is-literate-programming-1", - "title": "Literate Programming", - "section": "What is literate programming?", - "text": "What is literate programming?\n\nR markdown example, from https://rmarkdown.rstudio.com/authoring_quick_tour.html" + "objectID": "modules/Module09-DataAnalysis.html#summary.glm", + "href": "modules/Module09-DataAnalysis.html#summary.glm", + "title": "Module 9: Data Analysis", + "section": "summary.glm()", + "text": "summary.glm()\nThe summary() function when applied to a fit object based on a glm is technically the summary.glm() function and produces details of the model fit. Note on object oriented code.\n\nSummarizing Generalized Linear Model Fits\nDescription:\n These functions are all 'methods' for class 'glm' or 'summary.glm'\n objects.\nUsage:\n ## S3 method for class 'glm'\n summary(object, dispersion = NULL, correlation = FALSE,\n symbolic.cor = FALSE, ...)\n \n ## S3 method for class 'summary.glm'\n print(x, digits = max(3, getOption(\"digits\") - 3),\n symbolic.cor = x$symbolic.cor,\n signif.stars = getOption(\"show.signif.stars\"),\n show.residuals = FALSE, ...)\n \nArguments:\nobject: an object of class ‘“glm”’, usually, a result of a call to ‘glm’.\n x: an object of class '\"summary.glm\"', usually, a result of a\n call to 'summary.glm'.\ndispersion: the dispersion parameter for the family used. Either a single numerical value or ‘NULL’ (the default), when it is inferred from ‘object’ (see ‘Details’).\ncorrelation: logical; if ‘TRUE’, the correlation matrix of the estimated parameters is returned and printed.\ndigits: the number of significant digits to use when printing.\nsymbolic.cor: logical. If ‘TRUE’, print the correlations in a symbolic form (see ‘symnum’) rather than as numbers.\nsignif.stars: logical. If ‘TRUE’, ‘significance stars’ are printed for each coefficient.\nshow.residuals: logical. If ‘TRUE’ then a summary of the deviance residuals is printed at the head of the output.\n ...: further arguments passed to or from other methods.\nDetails:\n 'print.summary.glm' tries to be smart about formatting the\n coefficients, standard errors, etc. and additionally gives\n 'significance stars' if 'signif.stars' is 'TRUE'. The\n 'coefficients' component of the result gives the estimated\n coefficients and their estimated standard errors, together with\n their ratio. This third column is labelled 't ratio' if the\n dispersion is estimated, and 'z ratio' if the dispersion is known\n (or fixed by the family). A fourth column gives the two-tailed\n p-value corresponding to the t or z ratio based on a Student t or\n Normal reference distribution. 
(It is possible that the\n dispersion is not known and there are no residual degrees of\n freedom from which to estimate it. In that case the estimate is\n 'NaN'.)\n\n Aliased coefficients are omitted in the returned object but\n restored by the 'print' method.\n\n Correlations are printed to two decimal places (or symbolically):\n to see the actual correlations print 'summary(object)$correlation'\n directly.\n\n The dispersion of a GLM is not used in the fitting process, but it\n is needed to find standard errors. If 'dispersion' is not\n supplied or 'NULL', the dispersion is taken as '1' for the\n 'binomial' and 'Poisson' families, and otherwise estimated by the\n residual Chisquared statistic (calculated from cases with non-zero\n weights) divided by the residual degrees of freedom.\n\n 'summary' can be used with Gaussian 'glm' fits to handle the case\n of a linear regression with known error variance, something not\n handled by 'summary.lm'.\nValue:\n 'summary.glm' returns an object of class '\"summary.glm\"', a list\n with components\n\ncall: the component from 'object'.\nfamily: the component from ‘object’.\ndeviance: the component from ‘object’.\ncontrasts: the component from ‘object’.\ndf.residual: the component from ‘object’.\nnull.deviance: the component from ‘object’.\ndf.null: the component from ‘object’.\ndeviance.resid: the deviance residuals: see ‘residuals.glm’.\ncoefficients: the matrix of coefficients, standard errors, z-values and p-values. Aliased coefficients are omitted.\naliased: named logical vector showing if the original coefficients are aliased.\ndispersion: either the supplied argument or the inferred/estimated dispersion if the former is ‘NULL’.\n df: a 3-vector of the rank of the model and the number of\n residual degrees of freedom, plus number of coefficients\n (including aliased ones).\ncov.unscaled: the unscaled (‘dispersion = 1’) estimated covariance matrix of the estimated coefficients.\ncov.scaled: ditto, scaled by ‘dispersion’.\ncorrelation: (only if ‘correlation’ is true.) The estimated correlations of the estimated coefficients.\nsymbolic.cor: (only if ‘correlation’ is true.) The value of the argument ‘symbolic.cor’.\nSee Also:\n 'glm', 'summary'.\nExamples:\n ## For examples see example(glm)", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-RMarkdown.html#a-few-sticking-points", - "href": "modules/ModuleXX-RMarkdown.html#a-few-sticking-points", - "title": "Literate Programming", - "section": "A few sticking points", - "text": "A few sticking points\n\nKnitting to html format is really easy, but most scientist don’t like html format for some reason. If you want to knit to pdf, you should install the package tinytex and read the intro.\nIf you want to knit to word (what many journals in epidemiology require), you need to have Word installed on your computer. Note that with word, you are a bit more restricted in your formatting options, so if weird things happen you’ll have to try some other options.\nYou maybe noticed in the tutorial that I used the here::here() function for all of my file paths. This is because R Markdown and Quarto files use a different working directory from the R Project. Using here::here() translates relative paths into absolute paths based on your R Project, so it makes sure your R Markdown files can always find the right path!" 
+ "objectID": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-2", + "href": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-2", + "title": "Module 9: Data Analysis", + "section": "Linear regression fit in R", + "text": "Linear regression fit in R\nLets look at the output…\n\nsummary(fit1)\n\n\nCall:\nglm(formula = IgG_concentration ~ age + gender + slum, family = gaussian(), \n data = df)\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 46.132 16.774 2.750 0.00613 ** \nage 9.324 1.388 6.718 4.15e-11 ***\ngenderMale -9.655 11.543 -0.836 0.40321 \nslumNon slum -20.353 14.299 -1.423 0.15513 \nslumSlum -29.705 25.009 -1.188 0.23536 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for gaussian family taken to be 20918.39)\n\n Null deviance: 14141483 on 631 degrees of freedom\nResidual deviance: 13115831 on 627 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 8087.9\n\nNumber of Fisher Scoring iterations: 2\n\nsummary(fit2)\n\n\nCall:\nglm(formula = seropos ~ age_group + gender + slum, family = binomial(link = \"logit\"), \n data = df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.3220 0.2516 -5.254 1.49e-07 ***\nage_groupmiddle 1.9020 0.2133 8.916 < 2e-16 ***\nage_groupold 2.8443 0.2522 11.278 < 2e-16 ***\ngenderMale -0.1725 0.1895 -0.910 0.363 \nslumNon slum -0.1099 0.2329 -0.472 0.637 \nslumSlum -0.1073 0.4118 -0.261 0.794 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 866.98 on 631 degrees of freedom\nResidual deviance: 679.10 on 626 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 691.1\n\nNumber of Fisher Scoring iterations: 4", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/ModuleXX-RMarkdown.html#you-try-it", - "href": "modules/ModuleXX-RMarkdown.html#you-try-it", - "title": "Literate Programming", - "section": "You try it!", - "text": "You try it!\n\nCreate an R Markdown document. Write about either the measles or diphtheria example data sets, and include a figure and a table.\nBONUS EXERCISE: read the intro of the bookdown book, and create a bookdown document. Modify your writeup to have a few references with a bibliography, and cross-references with your figures and tables.\nBONUS: Try to structure your document like a report, with a section stating the questions you want to answer (intro), a section with your R code and results, and a section with your interpretations (discussion). This is a very open ended exercise but by now I believe you can do it, and you’ll have a nice document you can put on your portfolio or show employers!" 
+ "objectID": "modules/Module09-DataAnalysis.html#summary", + "href": "modules/Module09-DataAnalysis.html#summary", + "title": "Module 9: Data Analysis", + "section": "Summary", + "text": "Summary\n\nthe aggregate() function can be used to conduct analyses across groups (i.e., categorical variables in the data(\nthe table() function can generate frequency tables for 2 plus variables, but to get percentage tables, the prop.table() is useful\nthe chisq.test() function tests independence of factor variables\nthe cor() or cor.test() functions can be used to calculate correlation between two numeric vectors\nthe t.test() functions conducts one and two sample (paired or unpaired) t-tests\nthe function glm() fits generalized linear modules to data and returns a fit object that can be read with the summary() function\nchanging the family argument in the glm() function allows you to fit models with different link functions", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#learning-objectives", - "href": "modules/Module10-DataVisualization.html#learning-objectives", - "title": "Module 10: Data Visualization", + "objectID": "modules/Module09-DataAnalysis.html#acknowledgements", + "href": "modules/Module09-DataAnalysis.html#acknowledgements", + "title": "Module 9: Data Analysis", + "section": "Acknowledgements", + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University", + "crumbs": [ + "Day 2", + "Module 9: Data Analysis" + ] + }, + { + "objectID": "modules/Module01-Intro.html#learning-objectives", + "href": "modules/Module01-Intro.html#learning-objectives", + "title": "Module 1: Introduction to RStudio and R Basics", "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 10, you should be able to:\n\nCreate Base R plots" - }, - { - "objectID": "modules/Module10-DataVisualization.html#import-data-for-this-module", - "href": "modules/Module10-DataVisualization.html#import-data-for-this-module", - "title": "Module 10: Data Visualization", - "section": "Import data for this module", - "text": "Import data for this module\nLet’s read in our data (again) and take a quick look.\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum" - }, - { - "objectID": "modules/Module10-DataVisualization.html#prep-data", - "href": "modules/Module10-DataVisualization.html#prep-data", - "title": "Module 10: Data Visualization", - "section": "Prep data", - "text": "Prep data\nCreate age_group three level factor variable\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n\nCreate seropos binary variable representing seropositivity if antibody concentrations are >10 IU/mL.\n\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)" - }, - { - "objectID": "modules/Module10-DataVisualization.html#base-r-data-visualizattion-functions", - "href": "modules/Module10-DataVisualization.html#base-r-data-visualizattion-functions", - "title": "Module 10: Data Visualization", - "section": "Base R data visualizattion functions", - "text": "Base R data visualizattion functions\nThe Base R 
‘graphics’ package has a ton of graphics options.\n\nhelp(package = \"graphics\")\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\n Information on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.3.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team <do-use-Contact-address@r-project.org>\nContact: R-help mailing list <r-help@r-project.org>\nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.3.1\nNeedsCompilation: yes\nBuilt: R 4.3.1; aarch64-apple-darwin20; 2023-06-16\n 21:53:01 UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline" - }, - { - "objectID": "modules/Module10-DataVisualization.html#focus-on-a-handful-here-today", - "href": "modules/Module10-DataVisualization.html#focus-on-a-handful-here-today", - "title": "Module 10: Data Visualization", - "section": "Focus on a 
handful here today", - "text": "Focus on a handful here today\n\n `hist()` displays histogram of one variable\n `plot()` displays x-y plot of two variables\n `boxplot()` displays boxplot \n `barplot()` displays barplot" - }, - { - "objectID": "modules/Module10-DataVisualization.html#histogram-help-file", - "href": "modules/Module10-DataVisualization.html#histogram-help-file", - "title": "Module 10: Data Visualization", - "section": "histogram() Help File", - "text": "histogram() Help File\n\n?hist\n\nHistograms\nDescription:\n The generic function 'hist' computes a histogram of the given data\n values. If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\nUsage:\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n x: a vector of values for which the histogram is desired.\nbreaks: one of:\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\nfreq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\nprobability: an alias for ‘!freq’, for S compatibility.\ninclude.lowest: logical; if ‘TRUE’, an ‘x[i]’ equal to the ‘breaks’ value will be included in the first (or last, for ‘right = FALSE’) bar. This will be ignored (with a warning) unless ‘breaks’ is a vector.\nright: logical; if ‘TRUE’, the histogram cells are right-closed (left open) intervals.\nfuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\ndensity: the density of shading lines, in lines per inch. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise).\n col: a colour to be used to fill the bars.\nborder: the color of the border around the bars. The default is to use the standard foreground color.\nmain, xlab, ylab: main title and axis labels: these arguments to ‘title()’ get “smart” defaults here, e.g., the default ‘ylab’ is ‘“Frequency”’ iff ‘freq’ is true.\nxlim, ylim: the range of x and y values with sensible defaults. 
Note that ‘xlim’ is not used to define the histogram (breaks), but only for plotting (when ‘plot = TRUE’).\naxes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\nplot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\nlabels: logical or character string. Additionally draw labels on top of bars, if not ‘FALSE’; see ‘plot.histogram’.\nnclass: numeric (integer). For S(-PLUS) compatibility only, ‘nclass’ is equivalent to ‘breaks’ for a scalar or character argument.\nwarn.unused: logical. If ‘plot = FALSE’ and ‘warn.unused = TRUE’, a warning will be issued when graphical parameters are passed to ‘hist.default()’.\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\nDetails:\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\nValue:\n an object of class '\"histogram\"' which is a list with components:\nbreaks: the n+1 cell boundaries (= ‘breaks’ if that was a vector). These are the nominal breaks, not with the boundary fuzz.\ncounts: n integers; for each cell, the number of ‘x[]’ inside.\ndensity: values f^(x[i]), as estimated density values. If ‘all(diff(breaks) == 1)’, they are the relative frequencies ‘counts/n’ and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = ‘breaks[i]’.\nmids: the n cell midpoints.\nxname: a character string with the actual ‘x’ argument name.\nequidist: logical, indicating if the distances between ‘breaks’ are all the same.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. 
Springer.\nSee Also:\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\nExamples:\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... :\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)" - }, - { - "objectID": "modules/Module10-DataVisualization.html#histogram-example", - "href": "modules/Module10-DataVisualization.html#histogram-example", - "title": "Module 10: Data Visualization", - "section": "histogram() example", - "text": "histogram() example\nReminder function signature\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\nLet’s practice\n\nhist(df$age)\n\n\n\nhist(\n df$age, \n freq=FALSE, \n main=\"Histogram\", \n xlab=\"Age (years)\"\n )" - }, - { - "objectID": "modules/Module10-DataVisualization.html#plot-help-file", - "href": "modules/Module10-DataVisualization.html#plot-help-file", - "title": "Module 10: Data Visualization", - "section": "plot() Help File", - "text": "plot() Help File\n\n?plot\n\nGeneric X-Y Plotting\nDescription:\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. 
Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\nUsage:\n plot(x, y, ...)\n \nArguments:\n x: the coordinates of points in the plot. Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: Arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\nDetails:\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\nNote:\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\nSee Also:\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\nExamples:\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")" - }, - { - "objectID": "modules/Module10-DataVisualization.html#plot-help-file-2-par", - "href": "modules/Module10-DataVisualization.html#plot-help-file-2-par", - "title": "Module 10: Data Visualization", - "section": "plot() Help File 2 – par", - "text": "plot() Help File 2 – par\npar can be used to set or query graphical parameters - basically the plot options\n\n?par\n\nSet or Query Graphical Parameters\nDescription:\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\nUsage:\n par(..., no.readonly = FALSE)\n \n <highlevel plot> (...., <tag> = <value>)\n \nArguments:\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\nno.readonly: logical; if ‘TRUE’ and there are no other arguments, only parameters are returned which can be set by a subsequent ‘par()’ call on the same device.\nDetails:\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. 
However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\nValue:\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\nGraphical Parameters:\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. 
A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) 
This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. 
Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. 
(The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\nColor Specification:\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. 
Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\nLine Type Specification:\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\nNote:\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\nExamples:\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) 
calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) {\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))" - }, - { - "objectID": "modules/Module10-DataVisualization.html#plot-example", - "href": "modules/Module10-DataVisualization.html#plot-example", - "title": "Module 10: Data Visualization", - "section": "plot() example", - "text": "plot() example\n\nplot(df$age, df$IgG_concentration)\n\n\n\n\n\n\n\nplot(\n df$age, \n df$IgG_concentration, \n type=\"p\", \n main=\"Age by IgG Concentrations\", \n xlab=\"Age (years)\", \n ylab=\"IgG Concentration (IU/mL)\", \n pch=16, \n cex=0.9,\n col=\"lightblue\")" - }, - { - "objectID": "modules/Module10-DataVisualization.html#boxplot-help-file", - "href": "modules/Module10-DataVisualization.html#boxplot-help-file", - "title": "Module 10: Data Visualization", - "section": "boxplot() Help File", - "text": "boxplot() Help File\n\n?boxplot\n\nBox Plots\nDescription:\n Produce box-and-whisker plot(s) of the given (grouped) values.\nUsage:\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\nformula: a formula, such as ‘y ~ grp’, where ‘y’ is a numeric vector of data values to be split into groups according to the grouping variable ‘grp’ (usually a factor). 
Note that ‘~ g1 + g2’ is equivalent to ‘g1:g2’.\ndata: a data.frame (or list) from which the variables in 'formula'\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used for plotting.\nna.action: a function which indicates what should happen when the data contain ’NA’s. The default is to ignore missing values in either the response or the group.\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ‘ann=FALSE’.\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\ndrop, sep, lex.order: passed to ‘split.default’, see there.\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\nrange: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.\nwidth: a vector giving the relative widths of the boxes making up the plot.\nvarwidth: if ‘varwidth’ is ‘TRUE’, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.\nnotch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See ‘boxplot.stats’ for the calculations used.\noutline: if ‘outline’ is not true, the outliers are not drawn (as points whereas S+ uses lines).\nnames: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).\nboxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.\nstaplewex: staple line width expansion, proportional to box width.\noutwex: outlier line width expansion, proportional to box width.\nplot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\nborder: an optional vector of colors for the outlines of the boxplots. The values in ‘border’ are recycled if the length of ‘border’ is less than the number of plots.\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\npars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\nhorizontal: logical indicating if the boxplots should be horizontal; default ‘FALSE’ means vertical boxes.\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\nDetails:\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\nValue:\n List with the following components:\nstats: a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.\n n: a vector with the number of (non-'NA') observations in each\n group.\n\nconf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\ngroup: a vector of the same length as ‘out’ whose elements indicate to which group the outlier belongs.\nnames: a vector of names for the groups.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). _Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\nSee Also:\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\nExamples:\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)" - }, - { - "objectID": "modules/Module10-DataVisualization.html#boxplot-example", - "href": "modules/Module10-DataVisualization.html#boxplot-example", - "title": "Module 10: Data Visualization", - "section": "boxplot() example", - "text": "boxplot() example\nReminder function signature\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\nLet’s practice\n\nboxplot(IgG_concentration~age_group, data=df)\n\n\n\n\n\n\n\nboxplot(\n log(df$IgG_concentration)~df$age_group, \n main=\"Age by IgG Concentrations\", \n xlab=\"Age Group (years)\", \n ylab=\"log IgG Concentration (mIU/mL)\", \n names=c(\"1-5\",\"6-10\", \"11-15\"), \n varwidth=T\n )" - }, - { - "objectID": "modules/Module10-DataVisualization.html#barplot-help-file", - "href": "modules/Module10-DataVisualization.html#barplot-help-file", - "title": "Module 10: Data Visualization", - "section": "barplot() Help File", - "text": "barplot() Help File\n\n?barplot\n\nBar Plots\nDescription:\n Creates a bar plot with vertical or horizontal bars.\nUsage:\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\nheight: either a vector or matrix of values describing the bars which make up the plot. If ‘height’ is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If ‘height’ is a matrix and ‘beside’ is ‘FALSE’ then each bar of the plot corresponds to a column of ‘height’, with the values in the column giving the heights of stacked sub-bars making up the bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, then the values in each column are juxtaposed rather than stacked.\nwidth: optional vector of bar widths. 
Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless ‘xlim’ is specified.\nspace: the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, ‘space’ may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to ‘c(0,1)’ if ‘height’ is a matrix and ‘beside’ is ‘TRUE’, and to 0.2 otherwise.\nnames.arg: a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the ‘names’ attribute of ‘height’ if this is a vector, or the column names if it is a matrix.\nlegend.text: a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when ‘height’ is a matrix. In that case given legend labels should correspond to the rows of ‘height’; if ‘legend.text’ is true, the row names of ‘height’ will be used as labels if they are non-null.\nbeside: a logical value. If ‘FALSE’, the columns of ‘height’ are portrayed as stacked bars, and if ‘TRUE’ the columns are portrayed as juxtaposed bars.\nhoriz: a logical value. If ‘FALSE’, the bars are drawn vertically with the first bar to the left. If ‘TRUE’, the bars are drawn horizontally with the first at the bottom.\ndensity: a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\nborder: the color to be used for the border of the bars. Use ‘border = NA’ to omit borders. If there are shading lines, ‘border = TRUE’ means use the same colour for the border as for the shading lines.\nmain,sub: main title and subtitle for the plot.\nxlab: a label for the x axis.\n\nylab: a label for the y axis.\n\nxlim: limits for the x axis.\n\nylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\naxes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\naxisnames: logical. If ‘TRUE’, and if there are ‘names.arg’ (see above), the other axis is drawn (with ‘lty = 0’) and labeled.\ncex.axis: expansion factor for numeric axis labels (see ‘par(’cex’)’).\ncex.names: expansion factor for axis names (bar labels).\ninside: logical. If ‘TRUE’, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when ‘space = 0’ (which it partly is when ‘beside = TRUE’).\nplot: logical. If 'FALSE', nothing is plotted.\naxis.lty: the graphics parameter ‘lty’ (see ‘par(’lty’)’) applied to the axis and tick marks of the categorical (default horizontal) axis. 
Note that by default the axis is suppressed.\noffset: a vector indicating how much the bars should be shifted relative to the x axis.\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\nargs.legend: list of additional arguments to pass to ‘legend()’; names of the list are used as argument names. Only used if ‘legend.text’ is supplied.\nformula: a formula where the ‘y’ variables are numeric data to plot against the categorical ‘x’ variables. The formula can have one of three forms:\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\ndata: a data frame (or list) from which the variables in formula\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used.\nna.action: a function which indicates what should happen when the data contain ‘NA’ values. The default is to ignore missing values in the given variables.\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\nValue:\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\nAuthor(s):\n R Core, with a contribution by Arni Magnusson.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\nExamples:\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))" - }, - { - "objectID": "modules/Module10-DataVisualization.html#barplot-example", - "href": "modules/Module10-DataVisualization.html#barplot-example", - "title": "Module 10: Data Visualization", - "section": "barplot() example", - "text": "barplot() example\nThe function takes a lot of arguments to control the way our data is plotted.\nReminder function signature\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, 
axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n\n\n\n\n\n\n\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)" - }, - { - "objectID": "modules/Module10-DataVisualization.html#legend", - "href": "modules/Module10-DataVisualization.html#legend", - "title": "Module 10: Data Visualization", - "section": "3. Legend!", - "text": "3. Legend!\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n?legend\n\n\n\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. 
Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. 
If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, 
lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)" - }, - { - "objectID": "modules/Module10-DataVisualization.html#barplot-example-1", - "href": "modules/Module10-DataVisualization.html#barplot-example-1", - "title": "Module 10: Data Visualization", - "section": "barplot() example", - "text": "barplot() example\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). 
Also, the age groups need more meaningful names.\n\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))" - }, - { - "objectID": "modules/Module10-DataVisualization.html#summary", - "href": "modules/Module10-DataVisualization.html#summary", - "title": "Module 10: Data Visualization", - "section": "Summary", - "text": "Summary\n\nthe Base R ‘graphics’ package has a ton of graphics options that allow for ultimate flexibility\nBase R plots typically include setting plot options (par()), mapping data to the plot (e.g., plot(), barplot(), points(), lines()), and creating a legend (legend()).\nthe functions points() or lines() add additional points or additional lines to an existing plot, but must be called after a plot()-style function\nin Base R plotting the legend is not automatically generated, so be careful when creating it" - }, - { - "objectID": "modules/Module10-DataVisualization.html#acknowledgements", - "href": "modules/Module10-DataVisualization.html#acknowledgements", - "title": "Module 10: Data Visualization", - "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Base Plotting in R” on Medium\n“Base R margins: a cheatsheet” (https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)" - }, - { - "objectID": "modules/Module10-DataVisualization.html#base-r-plotting", - "href": "modules/Module10-DataVisualization.html#base-r-plotting", - "title": "Module 10: Data Visualization", - "section": "Base R Plotting", - "text": "Base R Plotting\nTo make a plot you often need to specify the following features:\n\nParameters\nPlot attributes\nThe legend" - }, - { - "objectID": "modules/Module10-DataVisualization.html#parameters", - "href": "modules/Module10-DataVisualization.html#parameters", - "title": "Module 10: Data Visualization", - "section": "1. Parameters", - "text": "1. Parameters\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via par() before you call the plot creates ‘global’ settings for your plot.\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\nThe mfrow attribute specifies that we have one row and two columns of plots — that is, two plots side by side.\nThe mar attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third indicating the margin above the plot (4), and the fourth indicating the margin to the right of the plot (1).\n\npar(mfrow = c(1,2), mar = c(5,5,4,1))" - }, - { - "objectID": "modules/Module10-DataVisualization.html#plot-attributes", - "href": "modules/Module10-DataVisualization.html#plot-attributes", - "title": "Module 10: Data Visualization", - "section": "2. Plot Attributes", - "text": "2. Plot Attributes\nPlot attributes are those that map your data to the plot. 
This means this is where you specify which variables in the data frame you want to plot.\nWe will only look at four types of plots today:\n\nhist() displays a histogram of one variable\nplot() displays an x-y plot of two variables\nboxplot() displays a boxplot\nbarplot() displays a barplot" - }, - { - "objectID": "modules/Module10-DataVisualization.html#barplot-example-2", - "href": "modules/Module10-DataVisualization.html#barplot-example-2", - "title": "Module 10: Data Visualization", - "section": "barplot() example", - "text": "barplot() example" - }, - { - "objectID": "modules/Module10-DataVisualization.html#add-legend-to-the-plot", - "href": "modules/Module10-DataVisualization.html#add-legend-to-the-plot", - "title": "Module 10: Data Visualization", - "section": "Add legend to the plot", - "text": "Add legend to the plot\nReminder function signature\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\nLet’s practice\n\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age Group\")\nlegend(x=2.5, y=0.5,\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))" - }, - { - "objectID": "modules/Module10-DataVisualization.html#lots-of-parameters-options", - "href": "modules/Module10-DataVisualization.html#lots-of-parameters-options", - "title": "Module 10: Data Visualization", - "section": "Lots of parameters options", - "text": "Lots of parameters options\nHowever, there are many more parameter options that can be specified in the ‘global’ settings or for a specific plot.\n\n?par\n\nSet or Query Graphical Parameters\nDescription:\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\nUsage:\n par(..., no.readonly = FALSE)\n \n <highlevel plot> (...., <tag> = <value>)\n \nArguments:\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\nno.readonly: logical; if ‘TRUE’ and there are no other arguments, only parameters are returned which can be set by a subsequent ‘par()’ call on the same device.\nDetails:\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. 
('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\nValue:\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\nGraphical Parameters:\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. 
It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. 
See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. 
This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. 
Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. 
Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. (The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\nColor Specification:\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). 
A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\nLine Type Specification:\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\nNote:\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\nExamples:\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) 
calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) {\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))" - }, - { - "objectID": "modules/Module10-DataVisualization.html#common-parameter-options", - "href": "modules/Module10-DataVisualization.html#common-parameter-options", - "title": "Module 10: Data Visualization", - "section": "Common parameter options", - "text": "Common parameter options\nEight useful parameter arguments help improve the readability of the plot:\n\nxlab: specifies the x-axis label of the plot\nylab: specifies the y-axis label\nmain: titles your graph\npch: specifies the symbology of your graph\nlty: specifies the line type of your graph\nlwd: specifies line thickness\ncex : specifies size\ncol: specifies the colors for your graph.\n\nWe will explore use of these arguments below." 
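A minimal sketch (not taken from the module) that pulls the eight common parameter arguments listed above into a single Base R plot; the x and y vectors are made up for illustration.

x <- 1:10
y <- x^2
plot(x, y,
     xlab = "x value",                   # x-axis label
     ylab = "x squared",                 # y-axis label
     main = "Common parameter options",  # plot title
     pch  = 16,                          # filled-circle plotting symbol
     cex  = 1.5,                         # point size
     col  = "darkblue")                  # point color
lines(x, y, lty = 2, lwd = 2, col = "red")  # dashed, thicker red line added to the same plot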
- }, - { - "objectID": "modules/Module10-DataVisualization.html#common-parameter-options-1", - "href": "modules/Module10-DataVisualization.html#common-parameter-options-1", - "title": "Module 10: Data Visualization", - "section": "Common parameter options", - "text": "Common parameter options" - }, - { - "objectID": "modules/Module00-Welcome.html#welcome-to-sismid-workshop-introduction-to-r", - "href": "modules/Module00-Welcome.html#welcome-to-sismid-workshop-introduction-to-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Welcome to SISMID Workshop: Introduction to R!", - "text": "Welcome to SISMID Workshop: Introduction to R!\nAmy Winter (she/her)\nAssistant Professor, Department of Epidemiology and Biostatistics\nEmail: awinter@uga.edu\n\nZane Billings (he/him)\nPhD Candidate, Department of Epidemiology and Biostatistics\nEmail: Wesley.Billings@uga.edu" - }, - { - "objectID": "modules/Module00-Welcome.html#introductions", - "href": "modules/Module00-Welcome.html#introductions", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Introductions", - "text": "Introductions\n\nName?\nCurrent position / institution?\nPast experience with other statistical programs, including R?\nWhy do you want to learn R?\nFavorite useful app\nFavorite guilty pleasure app" - }, - { - "objectID": "modules/Module00-Welcome.html#what-is-r", - "href": "modules/Module00-Welcome.html#what-is-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "What is R?", - "text": "What is R?\n\nR is a language and environment for statistical computing and graphics developed in 1991\nR is the open source implementation of the S language, which was developed by Bell laboratories in the 70s.\nThe aim of the S language, as expressed by John Chambers, is “to turn ideas into software, quickly and faithfully”" - }, - { - "objectID": "modules/Module00-Welcome.html#what-is-r-1", - "href": "modules/Module00-Welcome.html#what-is-r-1", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "What is R?", - "text": "What is R?\n\nRoss Ihaka and Robert Gentleman at the University of Auckland, New Zealand developed R\nR is both open source and open development" - }, - { - "objectID": "modules/Module00-Welcome.html#what-is-r-2", - "href": "modules/Module00-Welcome.html#what-is-r-2", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "What is R?", - "text": "What is R?\n\nR possesses an extensive catalog of statistical and graphical methods\n\nincludes machine learning algorithm, linear regression, time series, statistical inference to name a few.\n\nData analysis with R is done in a series of steps; programming, transforming, discovering, modeling and communicate the results" - }, - { - "objectID": "modules/Module00-Welcome.html#what-is-r-3", - "href": "modules/Module00-Welcome.html#what-is-r-3", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "What is R?", - "text": "What is R?\n\nProgram: R is a clear and accessible programming tool\nTransform: R is made up of a collection of packages/libraries designed specifically for statistical computing\nDiscover: Investigate the data, refine your hypothesis and analyze them\nModel: R provides a wide array of tools to capture the right model for your data\nCommunicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world" - }, - { - "objectID": "modules/Module00-Welcome.html#why-r", - "href": 
"modules/Module00-Welcome.html#why-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Why R?", - "text": "Why R?\n\nFree (open source)\nHigh level language designed for statistical computing\nPowerful and flexible - especially for data wrangling and visualization\nExtensive add-on software (packages)\nStrong community" - }, - { - "objectID": "modules/Module00-Welcome.html#why-not-r", - "href": "modules/Module00-Welcome.html#why-not-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Why not R?", - "text": "Why not R?\n\nLittle centralized support, relies on online community and package developers\nAnnoying to update\nSlower, and more memory intensive, than the more traditional programming languages (C, Perl, Python)" - }, - { - "objectID": "modules/Module00-Welcome.html#who-uses-r", - "href": "modules/Module00-Welcome.html#who-uses-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Who uses R?", - "text": "Who uses R?\n\nData scientist can use two excellent tools: R and Python.\nA programming language is a tool to compute and communicate your discovery.\nHow do you deal with data (Most important aspect for this course!)\n\nimport, clean, prep, analyze. This should be your primary focus.\n\nData scientist are not programmers. Their job is to understand the data, manipulate it and expose the best approach." - }, - { - "objectID": "modules/Module00-Welcome.html#is-r-difficult", - "href": "modules/Module00-Welcome.html#is-r-difficult", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Is R Difficult?", - "text": "Is R Difficult?\n\nShort answer – It has a steep learning curve, like all programming languages\nYears ago, R was a difficult language to master.\nHadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore." - }, - { - "objectID": "modules/Module00-Welcome.html#workshop-objective", - "href": "modules/Module00-Welcome.html#workshop-objective", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Workshop Objective", - "text": "Workshop Objective\nWe hope to teach you to fish – Note in a full semester course some of these more abstract themes are sprinkled in throughout the semester while we continue to learn all sorts of functions. Here we need you to get some of these main themes as quickly as possible … What you can expect at the end How this differs from Tidy course\nWe will focus this class on using base R because some resources online and R users will use this." - }, - { - "objectID": "modules/Module00-Welcome.html#tidyverse-and-base-r", - "href": "modules/Module00-Welcome.html#tidyverse-and-base-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Tidyverse and Base R", - "text": "Tidyverse and Base R\nWe will mostly show you how to use tidyverse packages and functions.\nThis is a newer set of packages designed for data science that can make your code more intuitive as compared to the original older Base R.\nTidyverse advantages:\n- consistent structure - making it easier to learn how to use different packages\n- particularly good for wrangling (manipulating, cleaning, joining) data\n- more flexible for visualizing data\nPackages for the tidyverse are managed by a team of respected data scientists at RStudio.\n\nSee this article for more info." 
- }, - { - "objectID": "modules/Module00-Welcome.html#workshop-overview", - "href": "modules/Module00-Welcome.html#workshop-overview", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Workshop Overview", - "text": "Workshop Overview\n14 lecture blocks that will each:\n\nStart with learning objectives\nEnd with summary slides\nInclude mini-exercise(s) or a full exercise\n\nThemes that will show up throughout the workshop:\n\nReproducibility\nGood coding techniques\nThinking algorithmically\nBasic terms / R jargon" - }, - { - "objectID": "modules/Module00-Welcome.html#course-format", - "href": "modules/Module00-Welcome.html#course-format", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Course Format", - "text": "Course Format\n\nLecture with slides (possibly “Interactive”)\nLab/Practical experience\nOne 10 min breaks each day - timing may vary" - }, - { - "objectID": "modules/Module00-Welcome.html#reproducibility", - "href": "modules/Module00-Welcome.html#reproducibility", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Reproducibility", - "text": "Reproducibility\nxxzane slides" - }, - { - "objectID": "modules/Module00-Welcome.html#good-coding-techniques", - "href": "modules/Module00-Welcome.html#good-coding-techniques", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Good coding techniques", - "text": "Good coding techniques" - }, - { - "objectID": "modules/Module00-Welcome.html#thinking-algorithmically", - "href": "modules/Module00-Welcome.html#thinking-algorithmically", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Thinking algorithmically", - "text": "Thinking algorithmically" - }, - { - "objectID": "modules/Module00-Welcome.html#useful-free-resources", - "href": "modules/Module00-Welcome.html#useful-free-resources", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Useful (+ Free) Resources", - "text": "Useful (+ Free) Resources\nWant more?\n\nR for Data Science: http://r4ds.had.co.nz/\n(great general information)\nFundamentals of Data Visualization: https://clauswilke.com/dataviz/\nR for Epidemiology: https://www.r4epi.com/\nThe Epidemiologist R Handbook: https://epirhandbook.com/en/\nR basics by Rafael A. 
Irizarry: https://rafalab.github.io/dsbook/r-basics.html (great general information)\nOpen Case Studies: https://www.opencasestudies.org/\n(resource for specific public health cases with statistical implementation and interpretation)" - }, - { - "objectID": "modules/Module00-Welcome.html#useful-free-resources-1", - "href": "modules/Module00-Welcome.html#useful-free-resources-1", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Useful (+Free) Resources", - "text": "Useful (+Free) Resources\nNeed help?\n\nVarious “Cheat Sheets”: https://github.com/rstudio/cheatsheets/\nR reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf\nR jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf" - }, - { - "objectID": "modules/Module00-Welcome.html#thank-you", - "href": "modules/Module00-Welcome.html#thank-you", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Thank you", - "text": "Thank you\nrevised from Carrie Wright and Ava Hoffman [https://jhudatascience.org/intro_to_r/]" - }, - { - "objectID": "modules/Module00-Welcome.html#installing-r", - "href": "modules/Module00-Welcome.html#installing-r", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Installing R", - "text": "Installing R\nHopefully everyone has pre-installed R and RStudio. We will take a moment to go around and make sure everyone is ready to go. Please open up your RStudio and leave it open as we check everyone’s laptops.\n\nInstall the latest version from: http://cran.r-project.org/\nInstall RStudio" - }, - { - "objectID": "modules/Module00-Welcome.html#overall-workshop-objectives", - "href": "modules/Module00-Welcome.html#overall-workshop-objectives", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "Overall Workshop Objectives", - "text": "Overall Workshop Objectives\nBy the end of this workshop, you should be able to\n\nstart a new project, read in data, and conduct basic data manipulation, analysis, and visualization\nknow how to use and find packages/functions that we did not specifically learn in class\ntroubleshoot errors" - }, - { - "objectID": "modules/Module00-Welcome.html#this-workshop-differs-from-introduction-to-tidervyse", - "href": "modules/Module00-Welcome.html#this-workshop-differs-from-introduction-to-tidervyse", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "This workshop differs from “Introduction to Tidervyse”", - "text": "This workshop differs from “Introduction to Tidervyse”\nWe will focus this class on using Base R functions and packages, i.e., pre-installed into R and the basis for most other functions and packages! If you know Base R then are will be more equipped to use all the other useful/pretty packages that exit.\nthe Tidyverse is one set of useful/pretty packages, designed to can make your code more intuitive as compared to the original older Base R. 
Tidyverse advantages:\n\nconsistent structure - making it easier to learn how to use different packages\nparticularly good for wrangling (manipulating, cleaning, joining) data\n\nmore flexible for visualizing data" - }, - { - "objectID": "modules/Module00-Welcome.html#this-workshop-differs-from-introduction-to-tidyverse", - "href": "modules/Module00-Welcome.html#this-workshop-differs-from-introduction-to-tidyverse", - "title": "Welcome to SISMID Workshop: Introduction to R", - "section": "This workshop differs from “Introduction to Tidyverse”", - "text": "This workshop differs from “Introduction to Tidyverse”\nWe will focus this class on using Base R functions and packages, i.e., pre-installed into R and the basis for most other functions and packages! If you know Base R then are will be more equipped to use all the other useful/pretty packages that exit.\nThe Tidyverse is one set of useful/pretty sets of packages, designed to can make your code more intuitive as compared to the original older Base R. Tidyverse advantages:\n\nconsistent structure - making it easier to learn how to use different packages\nparticularly good for wrangling (manipulating, cleaning, joining) data\n\nmore flexible for visualizing data" - }, - { - "objectID": "modules/Module01-Intro.html#learning-objectives", - "href": "modules/Module01-Intro.html#learning-objectives", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 1, you should be able to…\n\nCreate and save an R script\nDescribe the utility and differences b/w the Console and the Source panes\nModify R Studio panes\nCreate objects\nDescribe the difference b/w character, numeric, list, and matrix objects\nReference objects in the RStudio Environment pane\nUse basic arithmetic operators in R\nUse comments within an R script to create header, sections, and make notes", - "crumbs": [ - "Day 1", - "Module 1: Introduction to RStudio and R Basics" - ] + "text": "Learning Objectives\nAfter module 1, you should be able to…\n\nCreate and save an R script\nDescribe the utility and differences b/w the Console and the Source panes\nModify R Studio panes\nCreate objects\nDescribe the difference b/w character, numeric, list, and matrix objects\nReference objects in the RStudio Environment pane\nUse basic arithmetic operators in R\nUse comments within an R script to create header, sections, and make notes", + "crumbs": [ + "Day 1", + "Module 1: Introduction to RStudio and R Basics" + ] }, { "objectID": "modules/Module01-Intro.html#working-with-r-rstudio", @@ -1244,13 +1110,6 @@ "Module 1: Introduction to RStudio and R Basics" ] }, - { - "objectID": "modules/Module01-Intro.html#commenting-to-explain-code-1", - "href": "modules/Module01-Intro.html#commenting-to-explain-code-1", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Commenting to explain code", - "text": "Commenting to explain code\nI tend to use:\n\nOne hash tag with a space to describe what is happening in the following few lines of code\nOne hastag with no space after a command to list specifics\n\n\n# Practicing my arithmetic\n5+2\n3*5\n9/8\n\n5+2 #5 plus 2" - }, { "objectID": "modules/Module01-Intro.html#object---basic-terms", "href": "modules/Module01-Intro.html#object---basic-terms", @@ -1263,11 +1122,55 @@ ] }, { - "objectID": "modules/Module01-Intro.html#objects", - "href": "modules/Module01-Intro.html#objects", + "objectID": "modules/Module01-Intro.html#objects", + "href": 
"modules/Module01-Intro.html#objects", + "title": "Module 1: Introduction to RStudio and R Basics", + "section": "Objects", + "text": "Objects\n\nYou can create objects from within the R environment and from files on your computer\nR uses <- to assign values to an object name\nNote: Object names are case-sensitive, i.e. X and x are different\nHere are examples of creating five different objects:\n\n\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- paste0(c(\"b\", \"t\", \"u\"), c(8,4,2))\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n\nNote, c(), paste0(), and matrix() are functions, which we will talk more about in module 2.", + "crumbs": [ + "Day 1", + "Module 1: Introduction to RStudio and R Basics" + ] + }, + { + "objectID": "modules/Module01-Intro.html#object-names---good-coding", + "href": "modules/Module01-Intro.html#object-names---good-coding", + "title": "Module 1: Introduction to RStudio and R Basics", + "section": "Object names - Good coding", + "text": "Object names - Good coding\n\nIn general, any object name can be typed into R.\nHowever, only some are considered “valid”. If you use a non-valid object name, you will have to enclose it in backticks `like this` for R to recognize it.\nFrom the R documentation:\n\n\nA syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.\n\n\nReserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_Complex_, _NA_Character, ..., ..1, ..2, ..3, and so on.", + "crumbs": [ + "Day 1", + "Module 1: Introduction to RStudio and R Basics" + ] + }, + { + "objectID": "modules/Module01-Intro.html#object-names---good-coding-1", + "href": "modules/Module01-Intro.html#object-names---good-coding-1", + "title": "Module 1: Introduction to RStudio and R Basics", + "section": "Object names - Good coding", + "text": "Object names - Good coding\n\n\n\nValid\nInvalid\n\n\n\n\nmy_object\nmy-data\n\n\nthe.vector\n2data\n\n\nnum12\nfor\n\n\nmeasles_data\n.9data\n\n\n.calc\nxX~mŷ_δätą~Xx", + "crumbs": [ + "Day 1", + "Module 1: Introduction to RStudio and R Basics" + ] + }, + { + "objectID": "modules/Module01-Intro.html#object-assingment---good-coding", + "href": "modules/Module01-Intro.html#object-assingment---good-coding", + "title": "Module 1: Introduction to RStudio and R Basics", + "section": "Object assingment - Good coding", + "text": "Object assingment - Good coding\n= and <- can both be used for assignment, but <- is better coding practice, because sometimes = doesn’t work and we want to distinguish between the logical operator ==. We will talk about this more, later.", + "crumbs": [ + "Day 1", + "Module 1: Introduction to RStudio and R Basics" + ] + }, + { + "objectID": "modules/Module01-Intro.html#mini-exercise-1", + "href": "modules/Module01-Intro.html#mini-exercise-1", "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Objects", - "text": "Objects\n\nYou can create objects from within the R environment and from files on your computer\nR uses <- to assign values to an object name\nNote: Object names are case-sensitive, i.e. 
X and x are different\nHere are examples of creating five different objects:\n\n\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- paste0(c(\"b\", \"t\", \"u\"), c(8,4,2))\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n\nNote, c(), paste0(), and matrix() are functions, which we will talk more about in module 2.", + "section": "Mini Exercise", + "text": "Mini Exercise\nTry creating one or two of these objects in your R script\n\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- paste0(c(\"b\", \"t\", \"u\"), c(8,4,2))\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)", "crumbs": [ "Day 1", "Module 1: Introduction to RStudio and R Basics" @@ -1284,13 +1187,6 @@ "Module 1: Introduction to RStudio and R Basics" ] }, - { - "objectID": "modules/Module01-Intro.html#assignment---good-coding", - "href": "modules/Module01-Intro.html#assignment---good-coding", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Assignment - Good coding", - "text": "Assignment - Good coding\n= and <- can both be used for assignment, but <- is better coding practice, because == is a logical operator. We will talk about this more, later." - }, { "objectID": "modules/Module01-Intro.html#lists", "href": "modules/Module01-Intro.html#lists", @@ -1336,11 +1232,11 @@ ] }, { - "objectID": "modules/Module01-Intro.html#mini-exercise-1", - "href": "modules/Module01-Intro.html#mini-exercise-1", + "objectID": "modules/Module01-Intro.html#mini-exercise-2", + "href": "modules/Module01-Intro.html#mini-exercise-2", "title": "Module 1: Introduction to RStudio and R Basics", "section": "Mini Exercise", - "text": "Mini Exercise\nTry creating one or two of these objects in your R script\n\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- paste0(c(\"b\", \"t\", \"u\"), c(8,4,2))\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)", + "text": "Mini Exercise\n\nCreate a new number object and name it my.object\nCreate a vector of 4 numbers and name it my.vector using the c() function\nAdd my.object and my.vector together using an arithmetic operator", "crumbs": [ "Day 1", "Module 1: Introduction to RStudio and R Basics" @@ -1357,1456 +1253,2064 @@ "Module 1: Introduction to RStudio and R Basics" ] }, - { - "objectID": "modules/Module01-Intro.html#mini-exercise-2", - "href": "modules/Module01-Intro.html#mini-exercise-2", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Mini Exercise", - "text": "Mini Exercise\n\nCreate a new number object and name it my.object\nCreate a vector of 4 numbers and name it my.vector using the c() function\nAdd my.object and my.vector together using an arithmetic operator", - "crumbs": [ - "Day 1", - "Module 1: Introduction to RStudio and R Basics" - ] - }, { "objectID": "modules/Module02-Functions.html#learning-objectives", "href": "modules/Module02-Functions.html#learning-objectives", "title": "Module 2: Functions", "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 2, you should be able to…\n\nDescribe and execute functions in R\nModify default behavior of functions using arguments in R\nUse R-specific sources of help to get more information about functions and packages\nDifferentiate between Base R functions and functions that come from other packages" + "text": "Learning Objectives\nAfter module 2, you should be able 
to…\n\nDescribe and execute functions in R\nModify default behavior of functions using arguments in R\nUse R-specific sources of help to get more information about functions and packages\nDifferentiate between Base R functions and functions that come from other packages", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#function---basic-term", "href": "modules/Module02-Functions.html#function---basic-term", "title": "Module 2: Functions", "section": "Function - Basic term", - "text": "Function - Basic term\nFunction - Functions are “self contained” modules of code that accomplish specific tasks. Functions usually take in some sort of object (e.g., vector, list), process it, and return a result. You can write your own, use functions that come directly from installing R (i.e., Base R functions), or use functions from external packages.\nA function might help you add numbers together, create a plot, or organize your data. In fact, we have already used three functions in the Module 1, including c(), matrix(), list(). Here is another one, sum()\n\nsum(1, 20234)\n\n[1] 20235" + "text": "Function - Basic term\nFunction - Functions are “self contained” modules of code that accomplish specific tasks. Functions usually take in some sort of object (e.g., vector, list), process it, and return a result. You can write your own, use functions that come directly from installing R (i.e., Base R functions), or use functions from external packages.\nA function might help you add numbers together, create a plot, or organize your data. In fact, we have already used three functions in the Module 1, including c(), matrix(), list(). Here is another one, sum()\n\nsum(1, 20234)\n\n[1] 20235", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#function", "href": "modules/Module02-Functions.html#function", "title": "Module 2: Functions", "section": "Function", - "text": "Function\nThe general usage for a function is the name of the function followed by parentheses (i.e., the function signature). Within the parentheses are arguments.\n\nfunction_name(argument1, argument2, ...)" + "text": "Function\nThe general usage for a function is the name of the function followed by parentheses (i.e., the function signature). 
Within the parentheses are arguments.\n\nfunction_name(argument1, argument2, ...)", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#arguments---basic-term", "href": "modules/Module02-Functions.html#arguments---basic-term", "title": "Module 2: Functions", "section": "Arguments - Basic term", - "text": "Arguments - Basic term\nArguments are what you pass to the function and can include:\n\nthe physical object on which the function carries out a task (e.g., can be data such as a number 1 or 20234)\n\n\nsum(1, 20234)\n\n[1] 20235\n\n\n\noptions that alter the way the function operates (e.g., such as the base argument in the function log())\n\n\nlog(10, base = 10)\n\n[1] 1\n\nlog(10, base = 2)\n\n[1] 3.321928\n\nlog(10, base=exp(1))\n\n[1] 2.302585" + "text": "Arguments - Basic term\nArguments are what you pass to the function and can include:\n\nthe physical object on which the function carries out a task (e.g., can be data such as a number 1 or 20234)\n\n\nsum(1, 20234)\n\n[1] 20235\n\n\n\noptions that alter the way the function operates (e.g., such as the base argument in the function log())\n\n\nlog(10, base = 10)\n\n[1] 1\n\nlog(10, base = 2)\n\n[1] 3.321928\n\nlog(10, base=exp(1))\n\n[1] 2.302585", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#arguments", "href": "modules/Module02-Functions.html#arguments", "title": "Module 2: Functions", "section": "Arguments", - "text": "Arguments\nMost functions are created with default argument options. The defaults represent standard values that the author of the function specified as being “good enough in standard cases”. This means if you don’t specify an argument when calling the function, it will use a default.\n\nIf you want something specific, simply change the argument yourself with a value of your choice.\nIf an argument is required but you did not specify it and there is no default argument specified when the function was created, you will receive an error." + "text": "Arguments\nMost functions are created with default argument options. The defaults represent standard values that the author of the function specified as being “good enough in standard cases”. 
This means if you don’t specify an argument when calling the function, it will use a default.\n\nIf you want something specific, simply change the argument yourself with a value of your choice.\nIf an argument is required but you did not specify it and there is no default argument specified when the function was created, you will receive an error.", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#example", "href": "modules/Module02-Functions.html#example", "title": "Module 2: Functions", "section": "Example", - "text": "Example\nWhat is the default in the base argument of the log() function?\n\nlog(10)\n\n[1] 2.302585" + "text": "Example\nWhat is the default in the base argument of the log() function?\n\nlog(10)\n\n[1] 2.302585", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] + }, + { + "objectID": "modules/Module02-Functions.html#sure-that-is-easy-enough-but-how-do-you-know", + "href": "modules/Module02-Functions.html#sure-that-is-easy-enough-but-how-do-you-know", + "title": "Module 2: Functions", + "section": "Sure that is easy enough, but how do you know", + "text": "Sure that is easy enough, but how do you know\n\nthe purpose of a function?\nwhat arguments a function includes?\nhow to specify the arguments?", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#seeking-help-for-using-functions", "href": "modules/Module02-Functions.html#seeking-help-for-using-functions", "title": "Module 2: Functions", "section": "Seeking help for using functions (*)", - "text": "Seeking help for using functions (*)\nThe best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom RStudio Help panel. It provides a description of the function, usage, arguments, details, and examples. Lets look at the help file for the function round()" + "text": "Seeking help for using functions (*)\nThe best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom RStudio Help panel. It provides a description of the function, usage, arguments, details, and examples. 
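A hedged sketch of checking a default in a help file and then overriding it, using round() (whose Usage section shows round(x, digits = 0)); the number 3.14159 is arbitrary.

?round                        # opens the help page in the Help pane
round(3.14159)                # digits falls back to its default of 0; returns 3
round(3.14159, digits = 2)    # override the default; returns 3.14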
Lets look at the help file for the function round()", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#how-to-specify-arguments", "href": "modules/Module02-Functions.html#how-to-specify-arguments", "title": "Module 2: Functions", "section": "How to specify arguments", - "text": "How to specify arguments\n\nArguments are separated with a comma\nYou can specify arguments by either including them in the correct order OR by assigning the argument within the function parentheses.\n\n\n\nlog(10, 2)\n\n[1] 3.321928\n\nlog(base=2, x=10)\n\n[1] 3.321928\n\nlog(x=10, 2)\n\n[1] 3.321928\n\nlog(10, base=2)\n\n[1] 3.321928" + "text": "How to specify arguments\n\nArguments are separated with a comma\nYou can specify arguments by either including them in the correct order OR by assigning the argument within the function parentheses.\n\n\n\nlog(10, 2)\n\n[1] 3.321928\n\nlog(base=2, x=10)\n\n[1] 3.321928\n\nlog(x=10, 2)\n\n[1] 3.321928\n\nlog(10, base=2)\n\n[1] 3.321928", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#package---basic-term", "href": "modules/Module02-Functions.html#package---basic-term", "title": "Module 2: Functions", "section": "Package - Basic term", - "text": "Package - Basic term\nWhen you download R, it has a “base” set of functions, that are associated with a “base” set of packages including: ‘base’, ‘datasets’, ‘graphics’, ‘grDevices’, ‘methods’, ‘stats’ (typically just referred to as Base R).\n\ne.g., the log() function comes from the ‘base’ package\n\nPackage - a package in R is a bundle or “package” of code (and or possibly data) that can be loaded together for easy repeated use or for sharing with others.\nPackages are analogous to software applications like Microsoft Word. After installation, your operating system allows you to use it, just like having Word installed allows you to use it." + "text": "Package - Basic term\nWhen you download R, it has a “base” set of functions, that are associated with a “base” set of packages including: ‘base’, ‘datasets’, ‘graphics’, ‘grDevices’, ‘methods’, ‘stats’ (typically just referred to as Base R).\n\ne.g., the log() function comes from the ‘base’ package\n\nPackage - a package in R is a bundle or “package” of code (and or possibly data) that can be loaded together for easy repeated use or for sharing with others.\nPackages are analogous to software applications like Microsoft Word. After installation, your operating system allows you to use it, just like having Word installed allows you to use it.", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#packages", "href": "modules/Module02-Functions.html#packages", "title": "Module 2: Functions", "section": "Packages", - "text": "Packages\nThe Packages pane in RStudio can help you identify what have been installed (listed), and which one have been attached (check mark).\nLets go look at the Packages pane, find the base package and find the log() function. It automatically loads the help file that we looked at earlier using ?log." + "text": "Packages\nThe Packages pane in RStudio can help you identify what have been installed (listed), and which one have been attached (check mark).\nLets go look at the Packages pane, find the base package and find the log() function. 
It automatically loads the help file that we looked at earlier using ?log.", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#additional-packages", "href": "modules/Module02-Functions.html#additional-packages", "title": "Module 2: Functions", "section": "Additional Packages", - "text": "Additional Packages\nYou can install additional packages for your use from CRAN or GitHub. These additional packages are written by RStudio or R users/developers (like us)\n\nNot all packages available on CRAN or GitHub are trustworthy\nRStudio (the company) makes a lot of great packages\nWho wrote it? Hadley Wickham is a major authority on R (Employee and Developer at RStudio)\nHow to trust an R package" + "text": "Additional Packages\nYou can install additional packages for your use from CRAN or GitHub. These additional packages are written by RStudio or R users/developers (like us)\n\nNot all packages available on CRAN or GitHub are trustworthy\nRStudio (the company) makes a lot of great packages\nWho wrote it? Hadley Wickham is a major authority on R (Employee and Developer at RStudio)\nHow to trust an R package", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] + }, + { + "objectID": "modules/Module02-Functions.html#installing-and-attaching-packages", + "href": "modules/Module02-Functions.html#installing-and-attaching-packages", + "title": "Module 2: Functions", + "section": "Installing and attaching packages", + "text": "Installing and attaching packages\nTo use the bundle or “package” of code (and or possibly data) from a package, you need to install and also attach the package.\nTo install a package you can\n\ngo to R Studio Menu Bar Tools Menu —> Install Packages in the RStudio header\n\nOR\n\nuse the following code:\n\n\ninstall.packages(\"package_name\")", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { - "objectID": "modules/Module02-Functions.html#installing-and-calling-packages", - "href": "modules/Module02-Functions.html#installing-and-calling-packages", + "objectID": "modules/Module02-Functions.html#installing-and-attaching-packages-1", + "href": "modules/Module02-Functions.html#installing-and-attaching-packages-1", "title": "Module 2: Functions", - "section": "Installing and calling packages", - "text": "Installing and calling packages\nTo use the bundle or “package” of code (and or possibly data) from a package, you need to install and also call the package.\nTo install a package you can\n\ngo to Tools —> Install Packages in the RStudio header\n\nOR\n\nuse the following code:\n\n\ninstall.packages(\"package_name\")" + "section": "Installing and attaching packages", + "text": "Installing and attaching packages\nTo attach (i.e., be able to use the package) you can use the following code:\n\nrequire(package_name) #library(package_name) also works\n\nMore on installing and attaching packages later…", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#mini-exercise", "href": "modules/Module02-Functions.html#mini-exercise", "title": "Module 2: Functions", "section": "Mini exercise", - "text": "Mini exercise\nFind and execute a Base R function that will round the number 0.86424 to two digits." 
+ "text": "Mini exercise\nFind and execute a Base R function that will round the number 0.86424 to two digits.", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#functions-from-module-1", "href": "modules/Module02-Functions.html#functions-from-module-1", "title": "Module 2: Functions", "section": "Functions from Module 1", - "text": "Functions from Module 1\nThe combine function c() concatenate/collects/combines single R objects into a vector of R objects. It is mostly used for creating vectors of numbers, character strings, and other data types.\n\n?c\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\nCombine Values into a Vector or List\n\nDescription:\n\n This is a generic function which combines its arguments.\n\n The default method combines its arguments to form a vector. All\n arguments are coerced to a common type which is the type of the\n returned value, and all attributes except names are removed.\n\nUsage:\n\n ## S3 Generic function\n c(...)\n \n ## Default S3 method:\n c(..., recursive = FALSE, use.names = TRUE)\n \nArguments:\n\n ...: objects to be concatenated. All 'NULL' entries are dropped\n before method dispatch unless at the very beginning of the\n argument list.\n\nrecursive: logical. If 'recursive = TRUE', the function recursively\n descends through lists (and pairlists) combining all their\n elements into a vector.\n\nuse.names: logical indicating if 'names' should be preserved.\n\nDetails:\n\n The output type is determined from the highest type of the\n components in the hierarchy NULL < raw < logical < integer <\n double < complex < character < list < expression. Pairlists are\n treated as lists, whereas non-vector components (such as 'name's /\n 'symbol's and 'call's) are treated as one-element 'list's which\n cannot be unlisted even if 'recursive = TRUE'.\n\n There is a 'c.factor' method which combines factors into a factor.\n\n 'c' is sometimes used for its side effect of removing attributes\n except names, for example to turn an 'array' into a vector.\n 'as.vector' is a more intuitive way to do this, but also drops\n names. Note that methods other than the default are not required\n to do this (and they will almost certainly preserve a class\n attribute).\n\n This is a primitive function.\n\nValue:\n\n 'NULL' or an expression or a vector of an appropriate mode. (With\n no arguments the value is 'NULL'.)\n\nS4 methods:\n\n This function is S4 generic, but with argument list '(x, ...)'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'unlist' and 'as.vector' to produce attribute-free vectors.\n\nExamples:\n\n c(1,7:9)\n c(1:5, 10.5, \"next\")\n \n ## uses with a single argument to drop attributes\n x <- 1:4\n names(x) <- letters[1:4]\n x\n c(x) # has names\n as.vector(x) # no names\n dim(x) <- c(2,2)\n x\n c(x)\n as.vector(x)\n \n ## append to a list:\n ll <- list(A = 1, c = \"C\")\n ## do *not* use\n c(ll, d = 1:3) # which is == c(ll, as.list(c(d = 1:3)))\n ## but rather\n c(ll, d = list(1:3)) # c() combining two lists\n \n c(list(A = c(B = 1)), recursive = TRUE)\n \n c(options(), recursive = TRUE)\n c(list(A = c(B = 1, C = 2), B = c(E = 7)), recursive = TRUE)" + "text": "Functions from Module 1\nThe combine function c() concatenate/collects/combines single R objects into a vector of R objects. 
It is mostly used for creating vectors of numbers, character strings, and other data types.\n\n?c\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\nCombine Values into a Vector or List\n\nDescription:\n\n This is a generic function which combines its arguments.\n\n The default method combines its arguments to form a vector. All\n arguments are coerced to a common type which is the type of the\n returned value, and all attributes except names are removed.\n\nUsage:\n\n ## S3 Generic function\n c(...)\n \n ## Default S3 method:\n c(..., recursive = FALSE, use.names = TRUE)\n \nArguments:\n\n ...: objects to be concatenated. All 'NULL' entries are dropped\n before method dispatch unless at the very beginning of the\n argument list.\n\nrecursive: logical. If 'recursive = TRUE', the function recursively\n descends through lists (and pairlists) combining all their\n elements into a vector.\n\nuse.names: logical indicating if 'names' should be preserved.\n\nDetails:\n\n The output type is determined from the highest type of the\n components in the hierarchy NULL < raw < logical < integer <\n double < complex < character < list < expression. Pairlists are\n treated as lists, whereas non-vector components (such as 'name's /\n 'symbol's and 'call's) are treated as one-element 'list's which\n cannot be unlisted even if 'recursive = TRUE'.\n\n There is a 'c.factor' method which combines factors into a factor.\n\n 'c' is sometimes used for its side effect of removing attributes\n except names, for example to turn an 'array' into a vector.\n 'as.vector' is a more intuitive way to do this, but also drops\n names. Note that methods other than the default are not required\n to do this (and they will almost certainly preserve a class\n attribute).\n\n This is a primitive function.\n\nValue:\n\n 'NULL' or an expression or a vector of an appropriate mode. (With\n no arguments the value is 'NULL'.)\n\nS4 methods:\n\n This function is S4 generic, but with argument list '(x, ...)'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'unlist' and 'as.vector' to produce attribute-free vectors.\n\nExamples:\n\n c(1,7:9)\n c(1:5, 10.5, \"next\")\n \n ## uses with a single argument to drop attributes\n x <- 1:4\n names(x) <- letters[1:4]\n x\n c(x) # has names\n as.vector(x) # no names\n dim(x) <- c(2,2)\n x\n c(x)\n as.vector(x)\n \n ## append to a list:\n ll <- list(A = 1, c = \"C\")\n ## do *not* use\n c(ll, d = 1:3) # which is == c(ll, as.list(c(d = 1:3)))\n ## but rather\n c(ll, d = list(1:3)) # c() combining two lists\n \n c(list(A = c(B = 1)), recursive = TRUE)\n \n c(options(), recursive = TRUE)\n c(list(A = c(B = 1, C = 2), B = c(E = 7)), recursive = TRUE)", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#functions-from-module-1-1", "href": "modules/Module02-Functions.html#functions-from-module-1-1", "title": "Module 2: Functions", "section": "Functions from Module 1", - "text": "Functions from Module 1\nThe paste0() function concatenate/combines vectors after converting to character.\n\nvector.object2 <- paste0(c(\"b\", \"t\", \"u\"), c(8,4,2))\nvector.object2\n?paste0\n\n\n\nConcatenate Strings\n\nDescription:\n\n Concatenate vectors after converting to character.\n\nUsage:\n\n paste (..., sep = \" \", collapse = NULL, recycle0 = FALSE)\n paste0(..., collapse = NULL, recycle0 = FALSE)\n \nArguments:\n\n ...: one or more R objects, to be converted to character vectors.\n\n sep: a character string to separate the terms. Not\n 'NA_character_'.\n\ncollapse: an optional character string to separate the results. Not\n 'NA_character_'.\n\nrecycle0: 'logical' indicating if zero-length character arguments\n should lead to the zero-length 'character(0)' after the\n 'sep'-phase (which turns into '\"\"' in the 'collapse'-phase,\n i.e., when 'collapse' is not 'NULL').\n\nDetails:\n\n 'paste' converts its arguments (_via_ 'as.character') to character\n strings, and concatenates them (separating them by the string\n given by 'sep'). If the arguments are vectors, they are\n concatenated term-by-term to give a character vector result.\n Vector arguments are recycled as needed, with zero-length\n arguments being recycled to '\"\"' only if 'recycle0' is not true\n _or_ 'collapse' is not 'NULL'.\n\n Note that 'paste()' coerces 'NA_character_', the character missing\n value, to '\"NA\"' which may seem undesirable, e.g., when pasting\n two character vectors, or very desirable, e.g. in 'paste(\"the\n value of p is \", p)'.\n\n 'paste0(..., collapse)' is equivalent to 'paste(..., sep = \"\",\n collapse)', slightly more efficiently.\n\n If a value is specified for 'collapse', the values in the result\n are then concatenated into a single string, with the elements\n being separated by the value of 'collapse'.\n\nValue:\n\n A character vector of the concatenated values. 
This will be of\n length zero if all the objects are, unless 'collapse' is non-NULL,\n in which case it is '\"\"' (a single empty string).\n\n If any input into an element of the result is in UTF-8 (and none\n are declared with encoding '\"bytes\"', see 'Encoding'), that\n element will be in UTF-8, otherwise in the current encoding in\n which case the encoding of the element is declared if the current\n locale is either Latin-1 or UTF-8, at least one of the\n corresponding inputs (including separators) had a declared\n encoding and all inputs were either ASCII or declared.\n\n If an input into an element is declared with encoding '\"bytes\"',\n no translation will be done of any of the elements and the\n resulting element will have encoding '\"bytes\"'. If 'collapse' is\n non-NULL, this applies also to the second, collapsing, phase, but\n some translation may have been done in pasting object together in\n the first phase.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'toString' typically calls 'paste(*, collapse=\", \")'. String\n manipulation with 'as.character', 'substr', 'nchar', 'strsplit';\n further, 'cat' which concatenates and writes to a file, and\n 'sprintf' for C like string construction.\n\n 'plotmath' for the use of 'paste' in plot annotation.\n\nExamples:\n\n ## When passing a single vector, paste0 and paste work like as.character.\n paste0(1:12)\n paste(1:12) # same\n as.character(1:12) # same\n \n ## If you pass several vectors to paste0, they are concatenated in a\n ## vectorized way.\n (nth <- paste0(1:12, c(\"st\", \"nd\", \"rd\", rep(\"th\", 9))))\n \n ## paste works the same, but separates each input with a space.\n ## Notice that the recycling rules make every input as long as the longest input.\n paste(month.abb, \"is the\", nth, \"month of the year.\")\n paste(month.abb, letters)\n \n ## You can change the separator by passing a sep argument\n ## which can be multiple characters.\n paste(month.abb, \"is the\", nth, \"month of the year.\", sep = \"_*_\")\n \n ## To collapse the output into a single string, pass a collapse argument.\n paste0(nth, collapse = \", \")\n \n ## For inputs of length 1, use the sep argument rather than collapse\n paste(\"1st\", \"2nd\", \"3rd\", collapse = \", \") # probably not what you wanted\n paste(\"1st\", \"2nd\", \"3rd\", sep = \", \")\n \n ## You can combine the sep and collapse arguments together.\n paste(month.abb, nth, sep = \": \", collapse = \"; \")\n \n ## Using paste() in combination with strwrap() can be useful\n ## for dealing with long strings.\n (title <- paste(strwrap(\n \"Stopping distance of cars (ft) vs. speed (mph) from Ezekiel (1930)\",\n width = 30), collapse = \"\\n\"))\n plot(dist ~ speed, cars, main = title)\n \n ## 'recycle0 = TRUE' allows more vectorized behaviour, i.e. 
zero-length recycling :\n valid <- FALSE\n val <- pi\n paste(\"The value is\", val[valid], \"-- not so good!\")\n paste(\"The value is\", val[valid], \"-- good: empty!\", recycle0=TRUE) # -> character(0)\n ## When 'collapse = <string>', the result is a length-1 string :\n paste(\"foo\", {}, \"bar\", collapse=\"|\") # |--> \"foo bar\"\n paste(\"foo\", {}, \"bar\", collapse=\"|\", recycle0 = TRUE) # |--> \"\"\n ## all empty args\n paste( collapse=\"|\") # |--> \"\" as do all these:\n paste( collapse=\"|\", recycle0 = TRUE)\n paste({}, collapse=\"|\")\n paste({}, collapse=\"|\", recycle0 = TRUE)" + "text": "Functions from Module 1\nThe paste0() function concatenate/combines vectors after converting to character.\n\nvector.object2 <- paste0(c(\"b\", \"t\", \"u\"), c(8,4,2))\nvector.object2\n?paste0\n\n\n\nConcatenate Strings\n\nDescription:\n\n Concatenate vectors after converting to character.\n\nUsage:\n\n paste (..., sep = \" \", collapse = NULL, recycle0 = FALSE)\n paste0(..., collapse = NULL, recycle0 = FALSE)\n \nArguments:\n\n ...: one or more R objects, to be converted to character vectors.\n\n sep: a character string to separate the terms. Not\n 'NA_character_'.\n\ncollapse: an optional character string to separate the results. Not\n 'NA_character_'.\n\nrecycle0: 'logical' indicating if zero-length character arguments\n should lead to the zero-length 'character(0)' after the\n 'sep'-phase (which turns into '\"\"' in the 'collapse'-phase,\n i.e., when 'collapse' is not 'NULL').\n\nDetails:\n\n 'paste' converts its arguments (_via_ 'as.character') to character\n strings, and concatenates them (separating them by the string\n given by 'sep'). If the arguments are vectors, they are\n concatenated term-by-term to give a character vector result.\n Vector arguments are recycled as needed, with zero-length\n arguments being recycled to '\"\"' only if 'recycle0' is not true\n _or_ 'collapse' is not 'NULL'.\n\n Note that 'paste()' coerces 'NA_character_', the character missing\n value, to '\"NA\"' which may seem undesirable, e.g., when pasting\n two character vectors, or very desirable, e.g. in 'paste(\"the\n value of p is \", p)'.\n\n 'paste0(..., collapse)' is equivalent to 'paste(..., sep = \"\",\n collapse)', slightly more efficiently.\n\n If a value is specified for 'collapse', the values in the result\n are then concatenated into a single string, with the elements\n being separated by the value of 'collapse'.\n\nValue:\n\n A character vector of the concatenated values. This will be of\n length zero if all the objects are, unless 'collapse' is non-NULL,\n in which case it is '\"\"' (a single empty string).\n\n If any input into an element of the result is in UTF-8 (and none\n are declared with encoding '\"bytes\"', see 'Encoding'), that\n element will be in UTF-8, otherwise in the current encoding in\n which case the encoding of the element is declared if the current\n locale is either Latin-1 or UTF-8, at least one of the\n corresponding inputs (including separators) had a declared\n encoding and all inputs were either ASCII or declared.\n\n If an input into an element is declared with encoding '\"bytes\"',\n no translation will be done of any of the elements and the\n resulting element will have encoding '\"bytes\"'. If 'collapse' is\n non-NULL, this applies also to the second, collapsing, phase, but\n some translation may have been done in pasting object together in\n the first phase.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'toString' typically calls 'paste(*, collapse=\", \")'. String\n manipulation with 'as.character', 'substr', 'nchar', 'strsplit';\n further, 'cat' which concatenates and writes to a file, and\n 'sprintf' for C like string construction.\n\n 'plotmath' for the use of 'paste' in plot annotation.\n\nExamples:\n\n ## When passing a single vector, paste0 and paste work like as.character.\n paste0(1:12)\n paste(1:12) # same\n as.character(1:12) # same\n \n ## If you pass several vectors to paste0, they are concatenated in a\n ## vectorized way.\n (nth <- paste0(1:12, c(\"st\", \"nd\", \"rd\", rep(\"th\", 9))))\n \n ## paste works the same, but separates each input with a space.\n ## Notice that the recycling rules make every input as long as the longest input.\n paste(month.abb, \"is the\", nth, \"month of the year.\")\n paste(month.abb, letters)\n \n ## You can change the separator by passing a sep argument\n ## which can be multiple characters.\n paste(month.abb, \"is the\", nth, \"month of the year.\", sep = \"_*_\")\n \n ## To collapse the output into a single string, pass a collapse argument.\n paste0(nth, collapse = \", \")\n \n ## For inputs of length 1, use the sep argument rather than collapse\n paste(\"1st\", \"2nd\", \"3rd\", collapse = \", \") # probably not what you wanted\n paste(\"1st\", \"2nd\", \"3rd\", sep = \", \")\n \n ## You can combine the sep and collapse arguments together.\n paste(month.abb, nth, sep = \": \", collapse = \"; \")\n \n ## Using paste() in combination with strwrap() can be useful\n ## for dealing with long strings.\n (title <- paste(strwrap(\n \"Stopping distance of cars (ft) vs. speed (mph) from Ezekiel (1930)\",\n width = 30), collapse = \"\\n\"))\n plot(dist ~ speed, cars, main = title)\n \n ## 'recycle0 = TRUE' allows more vectorized behaviour, i.e. zero-length recycling :\n valid <- FALSE\n val <- pi\n paste(\"The value is\", val[valid], \"-- not so good!\")\n paste(\"The value is\", val[valid], \"-- good: empty!\", recycle0=TRUE) # -> character(0)\n ## When 'collapse = <string>', the result is a length-1 string :\n paste(\"foo\", {}, \"bar\", collapse=\"|\") # |--> \"foo bar\"\n paste(\"foo\", {}, \"bar\", collapse=\"|\", recycle0 = TRUE) # |--> \"\"\n ## all empty args\n paste( collapse=\"|\") # |--> \"\" as do all these:\n paste( collapse=\"|\", recycle0 = TRUE)\n paste({}, collapse=\"|\")\n paste({}, collapse=\"|\", recycle0 = TRUE)", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] + }, + { + "objectID": "modules/Module02-Functions.html#functions-from-module-1-2", + "href": "modules/Module02-Functions.html#functions-from-module-1-2", + "title": "Module 2: Functions", + "section": "Functions from Module 1", + "text": "Functions from Module 1\nThe matrix() function creates a matrix from the given set of values.\n\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\nmatrix.object\n?matrix\n\n\n\nMatrices\n\nDescription:\n\n 'matrix' creates a matrix from the given set of values.\n\n 'as.matrix' attempts to turn its argument into a matrix.\n\n 'is.matrix' tests if its argument is a (strict) matrix.\n\nUsage:\n\n matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,\n dimnames = NULL)\n \n as.matrix(x, ...)\n ## S3 method for class 'data.frame'\n as.matrix(x, rownames.force = NA, ...)\n \n is.matrix(x)\n \nArguments:\n\n data: an optional data vector (including a list or 'expression'\n vector). 
Non-atomic classed R objects are coerced by\n 'as.vector' and all attributes discarded.\n\n nrow: the desired number of rows.\n\n ncol: the desired number of columns.\n\n byrow: logical. If 'FALSE' (the default) the matrix is filled by\n columns, otherwise the matrix is filled by rows.\n\ndimnames: A 'dimnames' attribute for the matrix: 'NULL' or a 'list' of\n length 2 giving the row and column names respectively. An\n empty list is treated as 'NULL', and a list of length one as\n row names. The list can be named, and the list names will be\n used as names for the dimensions.\n\n x: an R object.\n\n ...: additional arguments to be passed to or from methods.\n\nrownames.force: logical indicating if the resulting matrix should have\n character (rather than 'NULL') 'rownames'. The default,\n 'NA', uses 'NULL' rownames if the data frame has 'automatic'\n row.names or for a zero-row data frame.\n\nDetails:\n\n If one of 'nrow' or 'ncol' is not given, an attempt is made to\n infer it from the length of 'data' and the other parameter. If\n neither is given, a one-column matrix is returned.\n\n If there are too few elements in 'data' to fill the matrix, then\n the elements in 'data' are recycled. If 'data' has length zero,\n 'NA' of an appropriate type is used for atomic vectors ('0' for\n raw vectors) and 'NULL' for lists.\n\n 'is.matrix' returns 'TRUE' if 'x' is a vector and has a '\"dim\"'\n attribute of length 2 and 'FALSE' otherwise. Note that a\n 'data.frame' is *not* a matrix by this test. The function is\n generic: you can write methods to handle specific classes of\n objects, see InternalMethods.\n\n 'as.matrix' is a generic function. The method for data frames\n will return a character matrix if there is only atomic columns and\n any non-(numeric/logical/complex) column, applying 'as.vector' to\n factors and 'format' to other non-character columns. Otherwise,\n the usual coercion hierarchy (logical < integer < double <\n complex) will be used, e.g., all-logical data frames will be\n coerced to a logical matrix, mixed logical-integer will give a\n integer matrix, etc.\n\n The default method for 'as.matrix' calls 'as.vector(x)', and hence\n e.g. coerces factors to character vectors.\n\n When coercing a vector, it produces a one-column matrix, and\n promotes the names (if any) of the vector to the rownames of the\n matrix.\n\n 'is.matrix' is a primitive function.\n\n The 'print' method for a matrix gives a rectangular layout with\n dimnames or indices. For a list matrix, the entries of length not\n one are printed in the form 'integer,7' indicating the type and\n length.\n\nNote:\n\n If you just want to convert a vector to a matrix, something like\n\n dim(x) <- c(nx, ny)\n dimnames(x) <- list(row_names, col_names)\n \n will avoid duplicating 'x' _and_ preserve 'class(x)' which may be\n useful, e.g., for 'Date' objects.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'data.matrix', which attempts to convert to a numeric matrix.\n\n A matrix is the special case of a two-dimensional 'array'.\n 'inherits(m, \"array\")' is true for a 'matrix' 'm'.\n\nExamples:\n\n is.matrix(as.matrix(1:10))\n !is.matrix(warpbreaks) # data.frame, NOT matrix!\n warpbreaks[1:10,]\n as.matrix(warpbreaks[1:10,]) # using as.matrix.data.frame(.) 
method\n \n ## Example of setting row and column names\n mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,\n dimnames = list(c(\"row1\", \"row2\"),\n c(\"C.1\", \"C.2\", \"C.3\")))\n mdat", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#summary", "href": "modules/Module02-Functions.html#summary", "title": "Module 2: Functions", "section": "Summary", - "text": "Summary\n\nFunctions are “self contained” modules of code that accomplish specific tasks.\nArguments are what you pass to functions (e.g., objects on which you carry out the task or options for how to carry out the task)\nArguments may include defaults that the author of the function specified as being “good enough in standard cases”, but that can be changed.\nAn R Package is a bundle or “package” of code (and or possibly data) that can be used by installing it once and attaching it (using require()`) each time R/Rstudio is opened\nThe Help pane in RStudio is useful for to get more information about functions and packages" + "text": "Summary\n\nFunctions are “self contained” modules of code that accomplish specific tasks.\nArguments are what you pass to functions (e.g., objects on which you carry out the task or options for how to carry out the task)\nArguments may include defaults that the author of the function specified as being “good enough in standard cases”, but that can be changed.\nAn R Package is a bundle or “package” of code (and or possibly data) that can be used by installing it once and attaching it (using require()`) each time R/Rstudio is opened\nThe Help pane in RStudio is useful for to get more information about functions and packages", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { "objectID": "modules/Module02-Functions.html#acknowledgements", "href": "modules/Module02-Functions.html#acknowledgements", "title": "Module 2: Functions", "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R - ARCHIVED” from Harvard Chan Bioinformatics Core (HBC)" - }, - { - "objectID": "modules/Module02-Functions.html#sure-that-is-easy-enough-but-how-do-you-know", - "href": "modules/Module02-Functions.html#sure-that-is-easy-enough-but-how-do-you-know", - "title": "Module 2: Functions", - "section": "Sure that is easy enough, but how do you know", - "text": "Sure that is easy enough, but how do you know\n\nthe purpose of a function?\nwhat arguments a function includes?\nhow to specify the arguments?" 
- }, - { - "objectID": "modules/Module02-Functions.html#installing-and-calling-packages-1", - "href": "modules/Module02-Functions.html#installing-and-calling-packages-1", - "title": "Module 2: Functions", - "section": "Installing and calling packages", - "text": "Installing and calling packages\nTo call (i.e., be able to use the package) you can use the following code:\n\nlibrary(package_name)\n\nMore on installing and calling packages later…" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#learning-objectives", - "href": "modules/Module03-WorkingDirectories.html#learning-objectives", - "title": "Module 3: Working Directories", - "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 3, you should be able to…\n\nUnderstand your own systems’ file structure and the purpose of the working directory\nDetermine the working directory\nChange the working directory" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#file-structure", - "href": "modules/Module03-WorkingDirectories.html#file-structure", - "title": "Module 3: Working Directories", - "section": "File Structure", - "text": "File Structure\nxxzane slide(s)" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#working-directory-basic-term", - "href": "modules/Module03-WorkingDirectories.html#working-directory-basic-term", - "title": "Module 3: Working Directories", - "section": "Working Directory – Basic term", - "text": "Working Directory – Basic term\n\nR “looks” for files on your computer relative to the “working” directory\nFor example, if you want to load data into R or save a figure, you will need to tell R where to look for or store the file\nMany people recommend not setting a directory in the scripts, rather assume you’re in the directory the script is in" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#getting-and-setting-the-working-directory-using-code", - "href": "modules/Module03-WorkingDirectories.html#getting-and-setting-the-working-directory-using-code", - "title": "Module 3: Working Directories", - "section": "Getting and setting the working directory using code", - "text": "Getting and setting the working directory using code\n\n## get the working directory\ngetwd()\nsetwd(\"~/\")" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#setting-a-working-directory", - "href": "modules/Module03-WorkingDirectories.html#setting-a-working-directory", - "title": "Module 3: Working Directories", - "section": "Setting a working directory", - "text": "Setting a working directory\n\nSetting the directory can sometimes (almost always when new to R) be finicky\n\nWindows: Default directory structure involves single backslashes (“\\”), but R interprets these as”escape” characters. So you must replace the backslash with forward slashes (“/”) or two backslashes (“\\\\”)\nMac/Linux: Default is forward slashes, so you are okay\n\nTypical directory structure syntax applies\n\n“..” - goes up one level\n“./” - is the current directory\n“~” - is your “home” directory" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#absoluate-vs.-relative-paths", - "href": "modules/Module03-WorkingDirectories.html#absoluate-vs.-relative-paths", - "title": "Module 3: Working Directories", - "section": "Absoluate vs. relative paths", - "text": "Absoluate vs. relative paths\nFrom Wiki\n\nAn absolute or full path points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. 
Absolute path is specific to your system alone. This means if I try your code, and you use absolute paths, it won’t work unless we have the exact same folder structure where R is looking (bad).\nBy contrast, a relative path starts from some given working directory, avoiding the need to provide the full absolute path." - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#relative-path", - "href": "modules/Module03-WorkingDirectories.html#relative-path", - "title": "Module 3: Working Directories", - "section": "Relative path", - "text": "Relative path\nYou want to set you code up based on relative paths. This allows sharing of code, and also, allows you to modify your own file structure (above the working directory) without breaking your own code." - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#setting-the-working-directory-using-your-cursor", - "href": "modules/Module03-WorkingDirectories.html#setting-the-working-directory-using-your-cursor", - "title": "Module 3: Working Directories", - "section": "Setting the working directory using your cursor", - "text": "Setting the working directory using your cursor\nRemember above “Many people recommend not setting a directory in the scripts, rather assume you’re in the directory the script is in.” To do so, go to Session –> Set Working Directory –> To Source File Location\nRStudio will show the code in the Console for the action you took with your cursor. This is a good way to learn about your file system how to set a correct working directory!\n\nsetwd(\"~/Dropbox/Git/SISMID-2024\")" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#setting-the-working-directory", - "href": "modules/Module03-WorkingDirectories.html#setting-the-working-directory", - "title": "Module 3: Working Directories", - "section": "Setting the Working Directory", - "text": "Setting the Working Directory\nIf you have not yet saved a “source” file, it will set working directory to the default location.Find the Tool Menu in the Menu Bar -> Global Opsions -> General for default location.\nTo change the working directory to another location, find Session Menu in the Menu Bar –> Set Working Directory –> Choose Directory`\nAgain, RStudio will show the code in the Console for the action you took with your cursor." - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#summary", - "href": "modules/Module03-WorkingDirectories.html#summary", - "title": "Module 3: Working Directories", - "section": "Summary", - "text": "Summary\n\nR “looks” for files on your computer relative to the “working” directory\nAbsolute path points to the same location in a file system - it is specific to your system and your system alone\nRelative path points is based on the current working directory\nTwo functions, setwd() and getwd() are useful for identifying and manipulating the working directory." 
- }, - { - "objectID": "modules/Module03-WorkingDirectories.html#acknowledgements", - "href": "modules/Module03-WorkingDirectories.html#acknowledgements", - "title": "Module 3: Working Directories", - "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University" - }, - { - "objectID": "modules/Module03-WorkingDirectories.html#absolute-vs.-relative-paths", - "href": "modules/Module03-WorkingDirectories.html#absolute-vs.-relative-paths", - "title": "Module 3: Working Directories", - "section": "Absolute vs. relative paths", - "text": "Absolute vs. relative paths\nFrom Wiki\n\nAn absolute or full path points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. Absolute path is specific to your system alone. This means if I try your code, and you use absolute paths, it won’t work unless we have the exact same folder structure where R is looking (bad).\nBy contrast, a relative path starts from some given working directory, avoiding the need to provide the full absolute path." - }, - { - "objectID": "modules/Module04-RProject.html#learning-objectives", - "href": "modules/Module04-RProject.html#learning-objectives", - "title": "Module 4: R Project", - "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 4, you should be able to…\n\nCreate an R Project\nCheck you are in the desired R Project\nReference the Files pane in RStudio\nDescribe “good” R Project organization" - }, - { - "objectID": "modules/Module04-RProject.html#rstudio-project", - "href": "modules/Module04-RProject.html#rstudio-project", - "title": "Module 4: R Project", - "section": "RStudio Project", - "text": "RStudio Project\nRStudio “Project” is one highly recommended strategy to build organized and reproducible code in R.\n\nHelps with working directories by easily incorporating relative paths only.\nHelps you organize your code, data, and output.\nAllows you to open multiple RStudio sessions at once!" - }, - { - "objectID": "modules/Module04-RProject.html#rstudio-project-creation", - "href": "modules/Module04-RProject.html#rstudio-project-creation", - "title": "Module 4: R Project", - "section": "RStudio Project Creation", - "text": "RStudio Project Creation\nLet’s create a new RStudio Project.\nFind the File Menu in the Menu Bar –> New Project –> New Directory –> New Project\nName your Project “IntroToR_RProject”" + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R - ARCHIVED” from Harvard Chan Bioinformatics Core (HBC)", + "crumbs": [ + "Day 1", + "Module 2: Functions" + ] }, { - "objectID": "modules/Module04-RProject.html#rstudio-project-organization", - "href": "modules/Module04-RProject.html#rstudio-project-organization", - "title": "Module 4: R Project", - "section": "RStudio Project Organization", - "text": "RStudio Project Organization\nThis is my personal preference for organizing an R Project. But, for this workshop it will be mandatory as it will help us help you. A critical component of conducting any data analysis is being able to reproduce it! 
Organizing your code, data, output, and figures is a necessary (although not sufficient) condition for reproducibility.\nCreate 4 sub-directories with the following names within your “SISMID_IntroToR_RProject” folder:\n\ncode\ndata\noutput\nfigures\n\nWe will be working from this directory for the remainder of the Workshop. Take a moment to move any R scripts you have already created to the ‘code’ sub-directory." + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#learning-goals", + "href": "modules/Module095-DataAnalysisWalkthrough.html#learning-goals", + "title": "Data Analysis Walkthrough", + "section": "Learning goals", + "text": "Learning goals\n\nUse logical operators, subsetting functions, and math calculations in R\nTranslate human-understandable problem descriptions into instructions that R can understand.", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module04-RProject.html#some-things-to-notice-in-an-r-project", - "href": "modules/Module04-RProject.html#some-things-to-notice-in-an-r-project", - "title": "Module 4: R Project", - "section": "Some things to notice in an R Project", - "text": "Some things to notice in an R Project\n\nThe name of the R Project will be shown at the top of the RStudio Window\nIf you check the working directory using getwd() you will find the working directory is set to the location where the R Project was saved.\nThe Files pane in RStudio is also set to the location where the R Project was saved, making it easy to navigate to sub-directories directly from RStudio." + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#instructions", + "href": "modules/Module095-DataAnalysisWalkthrough.html#instructions", + "title": "Data Analysis Walkthrough", + "section": "Instructions", + "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module04-RProject.html#r-project---common-issues", - "href": "modules/Module04-RProject.html#r-project---common-issues", - "title": "Module 4: R Project", - "section": "R Project - Common issues", - "text": "R Project - Common issues\nIf you simply open RStudio, it will not automatically open your R Project. As a result, when you say run a function to import data using the relative path based on your working directory, it won’t be able to find the data.\nTo open a previously created R Project, you need to open the R Project (i.e., double click on SISMID_IntroToR_RProject.RProj)" + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#instructions-1", + "href": "modules/Module095-DataAnalysisWalkthrough.html#instructions-1", + "title": "Data Analysis Walkthrough", + "section": "Instructions", + "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. 
Load it into R and use the functions we’ve learned to look at it.\nThe str() of your dataset should look like this.\n\n\n\ntibble [250 × 5] (S3: tbl_df/tbl/data.frame)\n $ age_months : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...\n $ group : chr [1:250] \"urban\" \"rural\" \"urban\" \"urban\" ...\n $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...\n $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...\n $ DP_vacc : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module04-RProject.html#summary", - "href": "modules/Module04-RProject.html#summary", - "title": "Module 4: R Project", - "section": "Summary", - "text": "Summary\n\nR Projects are really helpful for lots of reasons, including to improve the reproducibility of your work\nConsistently set up your R Project’s sub-directories so that you can easily navigate the project\nIf you get an error that a file can’t be found, make sure you correctly opened the R Project by looking for the Project name at the top of the RStudio application window." + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", + "title": "Data Analysis Walkthrough", + "section": "Q1: Was the overall prevalence higher in urban or rural areas?", + "text": "Q1: Was the overall prevalence higher in urban or rural areas?\n\n\nHow do we calculate the prevalence from the data?\nHow do we calculate the prevalence separately for urban and rural areas?\nHow do we determine which prevalence is higher and if the difference is meaningful?", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module04-RProject.html#mini-exercise", - "href": "modules/Module04-RProject.html#mini-exercise", - "title": "Module 4: R Project", - "section": "Mini Exercise", - "text": "Mini Exercise\n\nClose R Studio\nReopen your R Project\nCheck that you are actually in the R Project\nCreate a new R script and save it in your ‘code’ subdirectory\nCreate a vector of numbers\nCreate a vector a character values\nAdd comment(s) to your R script to explain your code." + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-calculate-the-prevalence-from-the-data", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-calculate-the-prevalence-from-the-data", + "title": "Data Analysis Walkthrough", + "section": "Q1: How do we calculate the prevalence from the data?", + "text": "Q1: How do we calculate the prevalence from the data?\n\n\nThe variable DP_infection in our dataset is binary / dichotomous.\nThe prevalence is the number or percent of people who had the disease over some duration.\nThe average of a binary variable gives the prevalence!\n\n\n\n\nmean(diph$DP_infection)\n\n[1] 0.8", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, - { - "objectID": "modules/Module04-RProject.html#acknowledgements", - "href": "modules/Module04-RProject.html#acknowledgements", - "title": "Module 4: R Project", - "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture." 
+ { + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", + "title": "Data Analysis Walkthrough", + "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", + "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n\nmean(diph[diph$group == \"urban\", ]$DP_infection)\n\n[1] 0.8235294\n\nmean(diph[diph$group == \"rural\", ]$DP_infection)\n\n[1] 0.778626\n\n\n\n\n\nThere are many ways you could write this code! You can use subset() or you can write the indices many ways.\nUsing tbl_df objects from haven uses different [[ rules than a base R data frame.", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#learning-objectives", - "href": "modules/Module05-DataImportExport.html#learning-objectives", - "title": "Module 5: Data Import and Export", - "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 5, you should be able to…\n\nUse Base R functions to load data\nInstall and attach external R Packages to extend R’s functionality\nLoad any type of data into R\nFind loaded data in the Environment pane of RStudio\nReading and writing R .Rds and .Rda/.RData files" + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", + "title": "Data Analysis Walkthrough", + "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", + "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\nOne easy way is to use the aggregate() function.\n\n\naggregate(DP_infection ~ group, data = diph, FUN = mean)\n\n group DP_infection\n1 rural 0.7786260\n2 urban 0.8235294", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#import-read-data", - "href": "modules/Module05-DataImportExport.html#import-read-data", - "title": "Module 5: Data Import and Export", - "section": "Import (read) Data", - "text": "Import (read) Data\n\nImporting or ‘Reading in’ data are the first step of any real project / data analysis\nR can read almost any file format, especially with external, non-Base R, packages\nWe are going to focus on simple delimited files first.\n\ncomma separated (e.g. ‘.csv’)\ntab delimited (e.g. ‘.txt’)\n\n\nA delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. 
Within each row, individual fields are separated by column delimiters (IBM.com definition)" + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", + "title": "Data Analysis Walkthrough", + "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?", + "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?\n\n\nWe probably need to include a confidence interval in our calculation.\nThis is actually not so easy without more advanced tools that we will learn in upcoming modules.\nRight now the best options are to do it by hand or google a function.", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#mini-exercise", - "href": "modules/Module05-DataImportExport.html#mini-exercise", - "title": "Module 5: Data Import and Export", - "section": "Mini exercise", - "text": "Mini exercise\n\nDownload Module 5 data from the website and save the data to your data subdirectory – specifically SISMID_IntroToR_RProject/data\nOpen the ‘.csv’ and ‘.txt’ data files in a text editor application and familiarize yourself with the data (i.e., Notepad for Windows and TextEdit for Mac)\nOpen the ‘.xlsx’ data file in excel and familiarize yourself with the data - if you use a Mac do not open in Numbers, it can corrupt the file - if you do not have excel, you can upload it to Google Sheets\nDetermine the delimiter of the two ‘.txt’ files" + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-by-hand", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-by-hand", + "title": "Data Analysis Walkthrough", + "section": "Q1: By hand", + "text": "Q1: By hand\n\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == \"urban\", ]))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == \"rural\", ])) \n\nresult_urban <- paste0(\n \"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n round(p_urban - 1.96 * se_urban, 2), \", \",\n round(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n \"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n round(p_rural - 1.96 * se_rural, 2), \", \",\n round(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n\nUrban: 0.82; 95% CI: (0.76, 0.89)\nRural: 0.78; 95% CI: (0.71, 0.85)", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#import-delimited-data", - "href": "modules/Module05-DataImportExport.html#import-delimited-data", - "title": "Module 5: Data Import and Export", - "section": "Import delimited data", - "text": "Import delimited data\nWithin the Base R ‘util’ package we can find a handful of useful functions including read.csv() and read.delim() to importing data.\n\n?read.csv\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\nData Input\n\nDescription:\n\n Reads a file in table format and creates a data frame from it,\n with cases corresponding to lines and variables to fields in the\n file.\n\nUsage:\n\n read.table(file, header = FALSE, sep = \"\", quote = 
\"\\\"'\",\n dec = \".\", numerals = c(\"allow.loss\", \"warn.loss\", \"no.loss\"),\n row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,\n na.strings = \"NA\", colClasses = NA, nrows = -1,\n skip = 0, check.names = TRUE, fill = !blank.lines.skip,\n strip.white = FALSE, blank.lines.skip = TRUE,\n comment.char = \"#\",\n allowEscapes = FALSE, flush = FALSE,\n stringsAsFactors = FALSE,\n fileEncoding = \"\", encoding = \"unknown\", text, skipNul = FALSE)\n \n read.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.csv2(file, header = TRUE, sep = \";\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim2(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \nArguments:\n\n file: the name of the file which the data are to be read from.\n Each row of the table appears as one line of the file. If it\n does not contain an _absolute_ path, the file name is\n _relative_ to the current working directory, 'getwd()'.\n Tilde-expansion is performed where supported. This can be a\n compressed file (see 'file').\n\n Alternatively, 'file' can be a readable text-mode connection\n (which will be opened for reading if necessary, and if so\n 'close'd (and hence destroyed) at the end of the function\n call). (If 'stdin()' is used, the prompts for lines may be\n somewhat confusing. Terminate input with a blank line or an\n EOF signal, 'Ctrl-D' on Unix and 'Ctrl-Z' on Windows. Any\n pushback on 'stdin()' will be cleared before return.)\n\n 'file' can also be a complete URL. (For the supported URL\n schemes, see the 'URLs' section of the help for 'url'.)\n\n header: a logical value indicating whether the file contains the\n names of the variables as its first line. If missing, the\n value is determined from the file format: 'header' is set to\n 'TRUE' if and only if the first row contains one fewer field\n than the number of columns.\n\n sep: the field separator character. Values on each line of the\n file are separated by this character. If 'sep = \"\"' (the\n default for 'read.table') the separator is 'white space',\n that is one or more spaces, tabs, newlines or carriage\n returns.\n\n quote: the set of quoting characters. To disable quoting altogether,\n use 'quote = \"\"'. See 'scan' for the behaviour on quotes\n embedded in quotes. Quoting is only considered for columns\n read as character, which is all of them unless 'colClasses'\n is specified.\n\n dec: the character used in the file for decimal points.\n\nnumerals: string indicating how to convert numbers whose conversion to\n double precision would lose accuracy, see 'type.convert'.\n Can be abbreviated. (Applies also to complex-number inputs.)\n\nrow.names: a vector of row names. This can be a vector giving the\n actual row names, or a single number giving the column of the\n table which contains the row names, or character string\n giving the name of the table column containing the row names.\n\n If there is a header and the first row contains one fewer\n field than the number of columns, the first column in the\n input is used for the row names. Otherwise if 'row.names' is\n missing, the rows are numbered.\n\n Using 'row.names = NULL' forces row numbering. 
Missing or\n 'NULL' 'row.names' generate row names that are considered to\n be 'automatic' (and not preserved by 'as.matrix').\n\ncol.names: a vector of optional names for the variables. The default\n is to use '\"V\"' followed by the column number.\n\n as.is: controls conversion of character variables (insofar as they\n are not converted to logical, numeric or complex) to factors,\n if not otherwise specified by 'colClasses'. Its value is\n either a vector of logicals (values are recycled if\n necessary), or a vector of numeric or character indices which\n specify which columns should not be converted to factors.\n\n Note: to suppress all conversions including those of numeric\n columns, set 'colClasses = \"character\"'.\n\n Note that 'as.is' is specified per column (not per variable)\n and so includes the column of row names (if any) and any\n columns to be skipped.\n\ntryLogical: a 'logical' determining if columns consisting entirely of\n '\"F\"', '\"T\"', '\"FALSE\"', and '\"TRUE\"' should be converted to\n 'logical'; passed to 'type.convert', true by default.\n\nna.strings: a character vector of strings which are to be interpreted\n as 'NA' values. Blank fields are also considered to be\n missing values in logical, integer, numeric and complex\n fields. Note that the test happens _after_ white space is\n stripped from the input, so 'na.strings' values may need\n their own white space stripped in advance.\n\ncolClasses: character. A vector of classes to be assumed for the\n columns. If unnamed, recycled as necessary. If named, names\n are matched with unspecified values being taken to be 'NA'.\n\n Possible values are 'NA' (the default, when 'type.convert' is\n used), '\"NULL\"' (when the column is skipped), one of the\n atomic vector classes (logical, integer, numeric, complex,\n character, raw), or '\"factor\"', '\"Date\"' or '\"POSIXct\"'.\n Otherwise there needs to be an 'as' method (from package\n 'methods') for conversion from '\"character\"' to the specified\n formal class.\n\n Note that 'colClasses' is specified per column (not per\n variable) and so includes the column of row names (if any).\n\n nrows: integer: the maximum number of rows to read in. Negative and\n other invalid values are ignored.\n\n skip: integer: the number of lines of the data file to skip before\n beginning to read data.\n\ncheck.names: logical. If 'TRUE' then the names of the variables in the\n data frame are checked to ensure that they are syntactically\n valid variable names. If necessary they are adjusted (by\n 'make.names') so that they are, and also to ensure that there\n are no duplicates.\n\n fill: logical. If 'TRUE' then in case the rows have unequal length,\n blank fields are implicitly added. See 'Details'.\n\nstrip.white: logical. Used only when 'sep' has been specified, and\n allows the stripping of leading and trailing white space from\n unquoted 'character' fields ('numeric' fields are always\n stripped). See 'scan' for further details (including the\n exact meaning of 'white space'), remembering that the columns\n may include the row names.\n\nblank.lines.skip: logical: if 'TRUE' blank lines in the input are\n ignored.\n\ncomment.char: character: a character vector of length one containing a\n single character or an empty string. Use '\"\"' to turn off\n the interpretation of comments altogether.\n\nallowEscapes: logical. Should C-style escapes such as '\\n' be\n processed or read verbatim (the default)? 
Note that if not\n within quotes these could be interpreted as a delimiter (but\n not as a comment character). For more details see 'scan'.\n\n flush: logical: if 'TRUE', 'scan' will flush to the end of the line\n after reading the last of the fields requested. This allows\n putting comments after the last field.\n\nstringsAsFactors: logical: should character vectors be converted to\n factors? Note that this is overridden by 'as.is' and\n 'colClasses', both of which allow finer control.\n\nfileEncoding: character string: if non-empty declares the encoding used\n on a file (not a connection) so the character data can be\n re-encoded. See the 'Encoding' section of the help for\n 'file', the 'R Data Import/Export' manual and 'Note'.\n\nencoding: encoding to be assumed for input strings. It is used to mark\n character strings as known to be in Latin-1 or UTF-8 (see\n 'Encoding'): it is not used to re-encode the input, but\n allows R to handle encoded strings in their native encoding\n (if one of those two). See 'Value' and 'Note'.\n\n text: character string: if 'file' is not supplied and this is, then\n data are read from the value of 'text' via a text connection.\n Notice that a literal string can be used to include (small)\n data sets within R code.\n\n skipNul: logical: should nuls be skipped?\n\n ...: Further arguments to be passed to 'read.table'.\n\nDetails:\n\n This function is the principal means of reading tabular data into\n R.\n\n Unless 'colClasses' is specified, all columns are read as\n character columns and then converted using 'type.convert' to\n logical, integer, numeric, complex or (depending on 'as.is')\n factor as appropriate. Quotes are (by default) interpreted in all\n fields, so a column of values like '\"42\"' will result in an\n integer column.\n\n A field or line is 'blank' if it contains nothing (except\n whitespace if no separator is specified) before a comment\n character or the end of the field or line.\n\n If 'row.names' is not specified and the header line has one less\n entry than the number of columns, the first column is taken to be\n the row names. This allows data frames to be read in from the\n format in which they are printed. If 'row.names' is specified and\n does not refer to the first column, that column is discarded from\n such files.\n\n The number of data columns is determined by looking at the first\n five lines of input (or the whole input if it has less than five\n lines), or from the length of 'col.names' if it is specified and\n is longer. This could conceivably be wrong if 'fill' or\n 'blank.lines.skip' are true, so specify 'col.names' if necessary\n (as in the 'Examples').\n\n 'read.csv' and 'read.csv2' are identical to 'read.table' except\n for the defaults. They are intended for reading 'comma separated\n value' files ('.csv') or ('read.csv2') the variant used in\n countries that use a comma as decimal point and a semicolon as\n field separator. Similarly, 'read.delim' and 'read.delim2' are\n for reading delimited files, defaulting to the TAB character for\n the delimiter. Notice that 'header = TRUE' and 'fill = TRUE' in\n these variants, and that the comment character is disabled.\n\n The rest of the line after a comment character is skipped; quotes\n are not processed in comments. 
Complete comment lines are allowed\n provided 'blank.lines.skip = TRUE'; however, comment lines prior\n to the header must have the comment character in the first\n non-blank column.\n\n Quoted fields with embedded newlines are supported except after a\n comment character. Embedded nuls are unsupported: skipping them\n (with 'skipNul = TRUE') may work.\n\nValue:\n\n A data frame ('data.frame') containing a representation of the\n data in the file.\n\n Empty input is an error unless 'col.names' is specified, when a\n 0-row data frame is returned: similarly giving just a header line\n if 'header = TRUE' results in a 0-row data frame. Note that in\n either case the columns will be logical unless 'colClasses' was\n supplied.\n\n Character strings in the result (including factor levels) will\n have a declared encoding if 'encoding' is '\"latin1\"' or '\"UTF-8\"'.\n\nCSV files:\n\n See the help on 'write.csv' for the various conventions for '.csv'\n files. The commonest form of CSV file with row names needs to be\n read with 'read.csv(..., row.names = 1)' to use the names in the\n first column of the file as row names.\n\nMemory usage:\n\n These functions can use a surprising amount of memory when reading\n large files. There is extensive discussion in the 'R Data\n Import/Export' manual, supplementing the notes here.\n\n Less memory will be used if 'colClasses' is specified as one of\n the six atomic vector classes. This can be particularly so when\n reading a column that takes many distinct numeric values, as\n storing each distinct value as a character string can take up to\n 14 times as much memory as storing it as an integer.\n\n Using 'nrows', even as a mild over-estimate, will help memory\n usage.\n\n Using 'comment.char = \"\"' will be appreciably faster than the\n 'read.table' default.\n\n 'read.table' is not the right tool for reading large matrices,\n especially those with many columns: it is designed to read _data\n frames_ which may have columns of very different classes. Use\n 'scan' instead for matrices.\n\nNote:\n\n The columns referred to in 'as.is' and 'colClasses' include the\n column of row names (if any).\n\n There are two approaches for reading input that is not in the\n local encoding. If the input is known to be UTF-8 or Latin1, use\n the 'encoding' argument to declare that. If the input is in some\n other encoding, then it may be translated on input. The\n 'fileEncoding' argument achieves this by setting up a connection\n to do the re-encoding into the current locale. Note that on\n Windows or other systems not running in a UTF-8 locale, this may\n not be possible.\n\nReferences:\n\n Chambers, J. M. (1992) _Data for models._ Chapter 3 of\n _Statistical Models in S_ eds J. M. Chambers and T. J. 
Hastie,\n Wadsworth & Brooks/Cole.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'scan', 'type.convert', 'read.fwf' for reading _f_ixed _w_idth\n _f_ormatted input; 'write.table'; 'data.frame'.\n\n 'count.fields' can be useful to determine problems with reading\n files which result in reports of incorrect record lengths (see the\n 'Examples' below).\n\n <https://www.rfc-editor.org/rfc/rfc4180> for the IANA definition\n of CSV files (which requires comma as separator and CRLF line\n endings).\n\nExamples:\n\n ## using count.fields to handle unknown maximum number of fields\n ## when fill = TRUE\n test1 <- c(1:5, \"6,7\", \"8,9,10\")\n tf <- tempfile()\n writeLines(test1, tf)\n \n read.csv(tf, fill = TRUE) # 1 column\n ncol <- max(count.fields(tf, sep = \",\"))\n read.csv(tf, fill = TRUE, header = FALSE,\n col.names = paste0(\"V\", seq_len(ncol)))\n unlink(tf)\n \n ## \"Inline\" data set, using text=\n ## Notice that leading and trailing empty lines are auto-trimmed\n \n read.table(header = TRUE, text = \"\n a b\n 1 2\n 3 4\n \")" + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-by-hand-1", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-by-hand-1", + "title": "Data Analysis Walkthrough", + "section": "Q1: By hand", + "text": "Q1: By hand\n\nWe can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.\nSome people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules.", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#import-.csv-files", - "href": "modules/Module05-DataImportExport.html#import-.csv-files", - "title": "Module 5: Data Import and Export", - "section": "Import .csv files", - "text": "Import .csv files\nFunction signature reminder\nread.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n\nNote #1, I assigned the data frame to an object called df. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it as an object.\nNote #2, If the data is imported correct, you can expect to see the df object ready to be used." 
+ "objectID": "modules/Module095-DataAnalysisWalkthrough.html#q1-googling-a-package", + "href": "modules/Module095-DataAnalysisWalkthrough.html#q1-googling-a-package", + "title": "Data Analysis Walkthrough", + "section": "Q1: Googling a package", + "text": "Q1: Googling a package\n\n\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#import-.csv-files-1", - "href": "modules/Module05-DataImportExport.html#import-.csv-files-1", - "title": "Module 5: Data Import and Export", - "section": "Import .csv files", - "text": "Import .csv files\nLets import a new data file\n\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n\nNote #1, I assigned the data frame to an object called df. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it as an object.\nNote #2, Look to the Environment pane, you will see the df object ready to be used." + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#you-try-it", + "href": "modules/Module095-DataAnalysisWalkthrough.html#you-try-it", + "title": "Data Analysis Walkthrough", + "section": "You try it!", + "text": "You try it!\n\nUsing any of the approaches you can think of, answer this question!\nHow many children under 5 were vaccinated? In children under 5, did vaccination lower the prevalence of infection?", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#import-.txt-files", - "href": "modules/Module05-DataImportExport.html#import-.txt-files", - "title": "Module 5: Data Import and Export", - "section": "Import .txt files", - "text": "Import .txt files\nread.csv() is a special case of read.delim() – a general function to read a delimited file into a data frame\nReminder function signature\nread.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n - `file` is the path to your file, in quotes \n - `delim` is what separates the fields within a record. The default for csv is comma\nWe can import the ‘.txt’ files given that we know that ‘serodata1.txt’ uses a tab delimiter and ‘serodata2.txt’ uses a semicolon delimiter.\n\n## Examples\ndf <- read.delim(file = \"data/serodata.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata.txt\", sep = \";\")\n\nThe dataset is now successfully read into your R workspace, many times actually. Notice, that each time we imported the data we assigned the data to the df object, meaning we replaced it each time we reassigned the df object." 
+ "objectID": "modules/Module095-DataAnalysisWalkthrough.html#you-try-it-1", + "href": "modules/Module095-DataAnalysisWalkthrough.html#you-try-it-1", + "title": "Data Analysis Walkthrough", + "section": "You try it!", + "text": "You try it!\n\n# How many children under 5 were vaccinated\nsum(diph$DP_vacc[diph$age_months < 60])\n\n[1] 91\n\n# Prevalence in both vaccine groups for children under 5\naggregate(\n DP_infection ~ DP_vacc,\n data = subset(diph, age_months < 60),\n FUN = DescTools::MeanCI\n)\n\n DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 0 0.4285714 0.1977457 0.6593972\n2 1 0.6373626 0.5366845 0.7380407\n\n\nIt appears that prevalence was HIGHER in the vaccine group? That is counterintuitive, but the sample size for the unvaccinated group is too small to be sure.", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#import-.txt-files-1", - "href": "modules/Module05-DataImportExport.html#import-.txt-files-1", - "title": "Module 5: Data Import and Export", - "section": "Import .txt files", - "text": "Import .txt files\nLets first import ‘serodata1.txt’ which uses a tab delimiter and ‘serodata2.txt’ which uses a semicolon delimiter.\n\n## Examples\ndf <- read.delim(file = \"data/serodata.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata.txt\", sep = \";\")\n\nThe dataset is now successfully read into your R workspace, many times actually. Notice, that each time we imported the data we assigned the data to the df object, meaning we replaced it each time we reassinged the df object." + "objectID": "modules/Module095-DataAnalysisWalkthrough.html#congratulations-for-finishing-the-first-case-study", + "href": "modules/Module095-DataAnalysisWalkthrough.html#congratulations-for-finishing-the-first-case-study", + "title": "Data Analysis Walkthrough", + "section": "Congratulations for finishing the first case study!", + "text": "Congratulations for finishing the first case study!\n\nWhat R functions and skills did you practice?\nWhat other questions could you answer about the same dataset with the skills you know now?", + "crumbs": [ + "Day 2", + "Data Analysis Walkthrough" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#what-if-we-have-a-.xlsx-file---what-do-we-do", - "href": "modules/Module05-DataImportExport.html#what-if-we-have-a-.xlsx-file---what-do-we-do", - "title": "Module 5: Data Import and Export", - "section": "What if we have a .xlsx file - what do we do?", - "text": "What if we have a .xlsx file - what do we do?\n\nAsk Google / ChatGPT\nFind and vet function and package you want\nInstall package\nAttach package\nUse function" + "objectID": "modules/Module00-Welcome.html#welcome-to-sismid-workshop-introduction-to-r", + "href": "modules/Module00-Welcome.html#welcome-to-sismid-workshop-introduction-to-r", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Welcome to SISMID Workshop: Introduction to R!", + "text": "Welcome to SISMID Workshop: Introduction to R!\nAmy Winter (she/her)\nAssistant Professor, Department of Epidemiology and Biostatistics\nEmail: awinter@uga.edu\n\nZane Billings (he/him)\nPhD Candidate, Department of Epidemiology and Biostatistics\nEmail: Wesley.Billings@uga.edu", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#internet-search", - "href": "modules/Module05-DataImportExport.html#internet-search", - "title": "Module 5: Data Import 
and Export", - "section": "1. Internet Search", - "text": "1. Internet Search" + "objectID": "modules/Module00-Welcome.html#introductions", + "href": "modules/Module00-Welcome.html#introductions", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Introductions", + "text": "Introductions\n\nName?\nCurrent position / institution?\nPast experience with other statistical programs, including R?\nWhy do you want to learn R?\nFavorite useful app\nFavorite guilty pleasure app", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#find-and-vet-function-and-package-you-want", - "href": "modules/Module05-DataImportExport.html#find-and-vet-function-and-package-you-want", - "title": "Module 5: Data Import and Export", - "section": "2. Find and vet function and package you want", - "text": "2. Find and vet function and package you want\nI am getting consistent message to use the the read_excel() function found in the readxl package. This package was developed by Hadley Wickham, who we know is reputable. Also, you can check that data was read in correctly, b/c this is a straightforward task." + "objectID": "modules/Module00-Welcome.html#course-website", + "href": "modules/Module00-Welcome.html#course-website", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Course website", + "text": "Course website\n\nAll of the materials for this course can be found online here: here.\nThis contains the schedule, course resources, and online versions of all of our slide decks.\nThe Course Resources page contains download links for all of the data, exercises, and slides for this class.\nPlease feel free to download these resources and share them – all of the course content is under the Creative Commons BY-NC 4.0 license.", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#install-package", - "href": "modules/Module05-DataImportExport.html#install-package", - "title": "Module 5: Data Import and Export", - "section": "3. Install Package", - "text": "3. Install Package\nTo use the bundle or “package” of code (and or possibly data) from a package, you need to install and also attach the package.\nTo install a package you can\n\ngo to Tools —> Install Packages in the RStudio header\n\nOR\n\nuse the following code:\n\n\ninstall.packages(\"package_name\")\n\nTherefore,\n\ninstall.packages(\"readxl\")" + "objectID": "modules/Module00-Welcome.html#what-is-r", + "href": "modules/Module00-Welcome.html#what-is-r", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "What is R?", + "text": "What is R?\n\nR is a language and environment for statistical computing and graphics developed in 1991\nR is the open source implementation of the S language, which was developed by Bell laboratories in the 70s.\nThe aim of the S language, as expressed by John Chambers, is “to turn ideas into software, quickly and faithfully”", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#call-package", - "href": "modules/Module05-DataImportExport.html#call-package", - "title": "Module 5: Data Import and Export", - "section": "4. Call Package", - "text": "4. 
Call Package\nReminder – Installing and calling packages\nTo call (i.e., be able to use the package) you can use the following code:\n\nlibrary(package_name)\n\nTherefore,\n\nlibrary(readxl)" + "objectID": "modules/Module00-Welcome.html#what-is-r-1", + "href": "modules/Module00-Welcome.html#what-is-r-1", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "What is R?", + "text": "What is R?\n\nRoss Ihaka and Robert Gentleman at the University of Auckland, New Zealand developed R\nR is both open source and open development", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#use-function", - "href": "modules/Module05-DataImportExport.html#use-function", - "title": "Module 5: Data Import and Export", - "section": "5. Use Function", - "text": "5. Use Function\n\n?read_excel\n\nRead xls and xlsx files\nDescription:\n Read xls and xlsx files\n\n 'read_excel()' calls 'excel_format()' to determine if 'path' is\n xls or xlsx, based on the file extension and the file itself, in\n that order. Use 'read_xls()' and 'read_xlsx()' directly if you\n know better and want to prevent such guessing.\nUsage:\n read_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xls(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xlsx(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \nArguments:\npath: Path to the xls/xlsx file.\nsheet: Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Ignored if the sheet is specified via ‘range’. If neither argument specifies the sheet, defaults to the first sheet.\nrange: A cell range to read from, as described in cell-specification. Includes typical Excel ranges like “B3:D87”, possibly including the sheet name like “Budget!B2:G14”, and more. Interpreted strictly, even if the range forces the inclusion of leading or trailing empty rows or columns. Takes precedence over ‘skip’, ‘n_max’ and ‘sheet’.\ncol_names: ‘TRUE’ to use the first row as column names, ‘FALSE’ to get default names, or a character vector giving a name for each column. If user provides ‘col_types’ as a vector, ‘col_names’ can have one entry per column, i.e. have the same length as ‘col_types’, or one entry per unskipped column.\ncol_types: Either ‘NULL’ to guess all from the spreadsheet or a character vector containing one entry per column from these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”. If exactly one ‘col_type’ is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. 
A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from ‘col_types = NULL’, but on a cell-by-cell basis.\n na: Character vector of strings to interpret as missing values.\n By default, readxl treats blank cells as missing data.\ntrim_ws: Should leading and trailing whitespace be trimmed?\nskip: Minimum number of rows to skip before reading anything, be it\n column names or data. Leading empty rows are automatically\n skipped, so this is a lower bound. Ignored if 'range' is\n given.\nn_max: Maximum number of data rows to read. Trailing empty rows are automatically skipped, so this is an upper bound on the number of rows in the returned tibble. Ignored if ‘range’ is given.\nguess_max: Maximum number of data rows to use for guessing column types.\nprogress: Display a progress spinner? By default, the spinner appears only in an interactive session, outside the context of knitting a document, and when the call is likely to run for several seconds or more. See ‘readxl_progress()’ for more details.\n.name_repair: Handling of column names. Passed along to ‘tibble::as_tibble()’. readxl’s default is `.name_repair = “unique”, which ensures column names are not empty and are unique.\nValue:\n A tibble\nSee Also:\n cell-specification for more details on targetting cells with the\n 'range' argument\nExamples:\n datasets <- readxl_example(\"datasets.xlsx\")\n read_excel(datasets)\n \n # Specify sheet either by position or by name\n read_excel(datasets, 2)\n read_excel(datasets, \"mtcars\")\n \n # Skip rows and use default column names\n read_excel(datasets, skip = 148, col_names = FALSE)\n \n # Recycle a single column type\n read_excel(datasets, col_types = \"text\")\n \n # Specify some col_types and guess others\n read_excel(datasets, col_types = c(\"text\", \"guess\", \"numeric\", \"guess\", \"guess\"))\n \n # Accomodate a column with disparate types via col_type = \"list\"\n df <- read_excel(readxl_example(\"clippy.xlsx\"), col_types = c(\"text\", \"list\"))\n df\n df$value\n sapply(df$value, class)\n \n # Limit the number of data rows read\n read_excel(datasets, n_max = 3)\n \n # Read from an Excel range using A1 or R1C1 notation\n read_excel(datasets, range = \"C1:E7\")\n read_excel(datasets, range = \"R1C2:R2C5\")\n \n # Specify the sheet as part of the range\n read_excel(datasets, range = \"mtcars!B1:D5\")\n \n # Read only specific rows or columns\n read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)\n read_excel(datasets, range = cell_cols(\"B:D\"))\n \n # Get a preview of column names\n names(read_excel(readxl_example(\"datasets.xlsx\"), n_max = 0))\n \n # exploit full .name_repair flexibility from tibble\n \n # \"universal\" names are unique and syntactic\n read_excel(\n readxl_example(\"deaths.xlsx\"),\n range = \"arts!A5:F15\",\n .name_repair = \"universal\"\n )\n \n # specify name repair as a built-in function\n read_excel(readxl_example(\"clippy.xlsx\"), .name_repair = toupper)\n \n # specify name repair as a custom function\n my_custom_name_repair <- function(nms) tolower(gsub(\"[.]\", \"_\", nms))\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n .name_repair = my_custom_name_repair\n )\n \n # specify name repair as an anonymous function\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n sheet = \"chickwts\",\n .name_repair = ~ substr(.x, start = 1, stop = 3)\n )" + "objectID": "modules/Module00-Welcome.html#what-is-r-2", + "href": "modules/Module00-Welcome.html#what-is-r-2", + "title": "Welcome to SISMID Workshop: 
Introduction to R", + "section": "What is R?", + "text": "What is R?\n\nR possesses an extensive catalog of statistical and graphical methods\n\nincludes machine learning algorithm, linear regression, time series, statistical inference to name a few.\n\nData analysis with R is done in a series of steps; programming, transforming, discovering, modeling and communicate the results", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#use-function-1", - "href": "modules/Module05-DataImportExport.html#use-function-1", - "title": "Module 5: Data Import and Export", - "section": "5. Use Function", - "text": "5. Use Function\nReminder of function signature\nread_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n)\nLet’s practice\n\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")" + "objectID": "modules/Module00-Welcome.html#what-is-r-3", + "href": "modules/Module00-Welcome.html#what-is-r-3", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "What is R?", + "text": "What is R?\n\nProgram: R is a clear and accessible programming tool\nTransform: R is made up of a collection of packages/libraries designed specifically for statistical computing\nDiscover: Investigate the data, refine your hypothesis and analyze them\nModel: R provides a wide array of tools to capture the right model for your data\nCommunicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#lets-make-some-mistakes", - "href": "modules/Module05-DataImportExport.html#lets-make-some-mistakes", - "title": "Module 5: Data Import and Export", - "section": "Lets make some mistakes", - "text": "Lets make some mistakes\n\nWhat if we read in the data without assigning it to an object (i.e., read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\"))?\nWhat if we forget to specify the sheet argument? (i.e., dd <- read_excel(path = \"data/serodata.xlsx\"))?" 
+ "objectID": "modules/Module00-Welcome.html#why-r", + "href": "modules/Module00-Welcome.html#why-r", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Why R?", + "text": "Why R?\n\nFree (open source)\nHigh level language designed for statistical computing\nPowerful and flexible - especially for data wrangling and visualization\nExtensive add-on software (packages)\nStrong community", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#installing-and-calling-packages---common-confusion", - "href": "modules/Module05-DataImportExport.html#installing-and-calling-packages---common-confusion", - "title": "Module 5: Data Import and Export", - "section": "Installing and calling packages - Common confusion", - "text": "Installing and calling packages - Common confusion\n\nYou only need to install a package once (unless you update R or want to update the package), but you will need to call or load a package each time you want to use it.\n\nThe exception to this rule are the “base” set of packages (i.e., Base R) that are installed automatically when you install R and that automatically called whenever you open R or RStudio." + "objectID": "modules/Module00-Welcome.html#why-not-r", + "href": "modules/Module00-Welcome.html#why-not-r", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Why not R?", + "text": "Why not R?\n\nLittle centralized support, relies on online community and package developers\nAnnoying to update\nSlower, and more memory intensive, than the more traditional programming languages (C, Perl, Python)", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#common-error", - "href": "modules/Module05-DataImportExport.html#common-error", - "title": "Module 5: Data Import and Export", - "section": "Common Error", - "text": "Common Error\nBe prepared to see this error\n\nError: could not find function \"some_function_name\"\n\nThis usually means that either\n\nyou called the function by the wrong name\nyou have not installed a package that contains the function\nyou have installed a package but you forgot to attach it (i.e., require(package_name)) – most likely" + "objectID": "modules/Module00-Welcome.html#is-r-difficult", + "href": "modules/Module00-Welcome.html#is-r-difficult", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Is R Difficult?", + "text": "Is R Difficult?\n\nShort answer – It has a steep learning curve, like all programming languages\nYears ago, R was a difficult language to master.\nHadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. 
Creating a graph was not so difficult anymore.", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#export-write-data", - "href": "modules/Module05-DataImportExport.html#export-write-data", - "title": "Module 5: Data Import and Export", - "section": "Export (write) Data", - "text": "Export (write) Data\n\nExporting or ‘Writing out’ data allows you to save modified files for future use or sharing\nR can write almost any file format, especially with external, non-Base R, packages\nWe are going to focus again on writing delimited files" + "objectID": "modules/Module00-Welcome.html#overall-workshop-objectives", + "href": "modules/Module00-Welcome.html#overall-workshop-objectives", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Overall Workshop Objectives", + "text": "Overall Workshop Objectives\nBy the end of this workshop, you should be able to\n\nstart a new project, read in data, and conduct basic data manipulation, analysis, and visualization\nknow how to use and find packages/functions that we did not specifically learn in class\ntroubleshoot errors", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#export-delimited-data", - "href": "modules/Module05-DataImportExport.html#export-delimited-data", - "title": "Module 5: Data Import and Export", - "section": "Export delimited data", - "text": "Export delimited data\nWithin the Base R ‘util’ package we can find a handful of useful functions including write.csv() and write.table() to exporting data.\n\n\nData Output\n\nDescription:\n\n 'write.table' prints its required argument 'x' (after converting\n it to a data frame if it is not one nor a matrix) to a file or\n connection.\n\nUsage:\n\n write.table(x, file = \"\", append = FALSE, quote = TRUE, sep = \" \",\n eol = \"\\n\", na = \"NA\", dec = \".\", row.names = TRUE,\n col.names = TRUE, qmethod = c(\"escape\", \"double\"),\n fileEncoding = \"\")\n \n write.csv(...)\n write.csv2(...)\n \nArguments:\n\n x: the object to be written, preferably a matrix or data frame.\n If not, it is attempted to coerce 'x' to a data frame.\n\n file: either a character string naming a file or a connection open\n for writing. '\"\"' indicates output to the console.\n\n append: logical. Only relevant if 'file' is a character string. If\n 'TRUE', the output is appended to the file. If 'FALSE', any\n existing file of the name is destroyed.\n\n quote: a logical value ('TRUE' or 'FALSE') or a numeric vector. If\n 'TRUE', any character or factor columns will be surrounded by\n double quotes. If a numeric vector, its elements are taken\n as the indices of columns to quote. In both cases, row and\n column names are quoted if they are written. If 'FALSE',\n nothing is quoted.\n\n sep: the field separator string. Values within each row of 'x'\n are separated by this string.\n\n eol: the character(s) to print at the end of each line (row). 
For\n example, 'eol = \"\\r\\n\"' will produce Windows' line endings on\n a Unix-alike OS, and 'eol = \"\\r\"' will produce files as\n expected by Excel:mac 2004.\n\n na: the string to use for missing values in the data.\n\n dec: the string to use for decimal points in numeric or complex\n columns: must be a single character.\n\nrow.names: either a logical value indicating whether the row names of\n 'x' are to be written along with 'x', or a character vector\n of row names to be written.\n\ncol.names: either a logical value indicating whether the column names\n of 'x' are to be written along with 'x', or a character\n vector of column names to be written. See the section on\n 'CSV files' for the meaning of 'col.names = NA'.\n\n qmethod: a character string specifying how to deal with embedded\n double quote characters when quoting strings. Must be one of\n '\"escape\"' (default for 'write.table'), in which case the\n quote character is escaped in C style by a backslash, or\n '\"double\"' (default for 'write.csv' and 'write.csv2'), in\n which case it is doubled. You can specify just the initial\n letter.\n\nfileEncoding: character string: if non-empty declares the encoding to\n be used on a file (not a connection) so the character data\n can be re-encoded as they are written. See 'file'.\n\n ...: arguments to 'write.table': 'append', 'col.names', 'sep',\n 'dec' and 'qmethod' cannot be altered.\n\nDetails:\n\n If the table has no columns the rownames will be written only if\n 'row.names = TRUE', and _vice versa_.\n\n Real and complex numbers are written to the maximal possible\n precision.\n\n If a data frame has matrix-like columns these will be converted to\n multiple columns in the result (_via_ 'as.matrix') and so a\n character 'col.names' or a numeric 'quote' should refer to the\n columns in the result, not the input. Such matrix-like columns\n are unquoted by default.\n\n Any columns in a data frame which are lists or have a class (e.g.,\n dates) will be converted by the appropriate 'as.character' method:\n such columns are unquoted by default. On the other hand, any\n class information for a matrix is discarded and non-atomic (e.g.,\n list) matrices are coerced to character.\n\n Only columns which have been converted to character will be quoted\n if specified by 'quote'.\n\n The 'dec' argument only applies to columns that are not subject to\n conversion to character because they have a class or are part of a\n matrix-like column (or matrix), in particular to columns protected\n by 'I()'. Use 'options(\"OutDec\")' to control such conversions.\n\n In almost all cases the conversion of numeric quantities is\n governed by the option '\"scipen\"' (see 'options'), but with the\n internal equivalent of 'digits = 15'. For finer control, use\n 'format' to make a character matrix/data frame, and call\n 'write.table' on that.\n\n These functions check for a user interrupt every 1000 lines of\n output.\n\n If 'file' is a non-open connection, an attempt is made to open it\n and then close it after use.\n\n To write a Unix-style file on Windows, use a binary connection\n e.g. 'file = file(\"filename\", \"wb\")'.\n\nCSV files:\n\n By default there is no column name for a column of row names. If\n 'col.names = NA' and 'row.names = TRUE' a blank column name is\n added, which is the convention used for CSV files to be read by\n spreadsheets. 
Note that such CSV files can be read in R by\n\n read.csv(file = \"<filename>\", row.names = 1)\n \n 'write.csv' and 'write.csv2' provide convenience wrappers for\n writing CSV files. They set 'sep' and 'dec' (see below), 'qmethod\n = \"double\"', and 'col.names' to 'NA' if 'row.names = TRUE' (the\n default) and to 'TRUE' otherwise.\n\n 'write.csv' uses '\".\"' for the decimal point and a comma for the\n separator.\n\n 'write.csv2' uses a comma for the decimal point and a semicolon\n for the separator, the Excel convention for CSV files in some\n Western European locales.\n\n These wrappers are deliberately inflexible: they are designed to\n ensure that the correct conventions are used to write a valid\n file. Attempts to change 'append', 'col.names', 'sep', 'dec' or\n 'qmethod' are ignored, with a warning.\n\n CSV files do not record an encoding, and this causes problems if\n they are not ASCII for many other applications. Windows Excel\n 2007/10 will open files (e.g., by the file association mechanism)\n correctly if they are ASCII or UTF-16 (use 'fileEncoding =\n \"UTF-16LE\"') or perhaps in the current Windows codepage (e.g.,\n '\"CP1252\"'), but the 'Text Import Wizard' (from the 'Data' tab)\n allows far more choice of encodings. Excel:mac 2004/8 can\n _import_ only 'Macintosh' (which seems to mean Mac Roman),\n 'Windows' (perhaps Latin-1) and 'PC-8' files. OpenOffice 3.x asks\n for the character set when opening the file.\n\n There is an IETF RFC4180\n (<https://www.rfc-editor.org/rfc/rfc4180>) for CSV files, which\n mandates comma as the separator and CRLF line endings.\n 'write.csv' writes compliant files on Windows: use 'eol = \"\\r\\n\"'\n on other platforms.\n\nNote:\n\n 'write.table' can be slow for data frames with large numbers\n (hundreds or more) of columns: this is inevitable as each column\n could be of a different class and so must be handled separately.\n If they are all of the same class, consider using a matrix\n instead.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'read.table', 'write'.\n\n 'write.matrix' in package 'MASS'.\n\nExamples:\n\n x <- data.frame(a = I(\"a \\\" quote\"), b = pi)\n tf <- tempfile(fileext = \".csv\")\n \n ## To write a CSV file for input to Excel one might use\n write.table(x, file = tf, sep = \",\", col.names = NA,\n qmethod = \"double\")\n file.show(tf)\n ## and to read this file back into R one needs\n read.table(tf, header = TRUE, sep = \",\", row.names = 1)\n ## NB: you do need to specify a separator if qmethod = \"double\".\n \n ### Alternatively\n write.csv(x, file = tf)\n read.csv(tf, row.names = 1)\n ## or without row names\n write.csv(x, file = tf, row.names = FALSE)\n read.csv(tf)\n \n ## Not run:\n \n ## To write a file in Mac Roman for simple use in Mac Excel 2004/8\n write.csv(x, file = \"foo.csv\", fileEncoding = \"macroman\")\n ## or for Windows Excel 2007/10\n write.csv(x, file = \"foo.csv\", fileEncoding = \"UTF-16LE\")\n ## End(Not run)" + "objectID": "modules/Module00-Welcome.html#this-workshop-differs-from-introduction-to-tidyverse", + "href": "modules/Module00-Welcome.html#this-workshop-differs-from-introduction-to-tidyverse", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "This workshop differs from “Introduction to Tidyverse”", + "text": "This workshop differs from “Introduction to Tidyverse”\nWe will focus this class on using Base R functions and packages, i.e., pre-installed into R and the basis for most other functions and packages! 
If you know Base R, then you will be better equipped to use all the other useful/pretty packages that exist.\nThe Tidyverse is one such set of useful/pretty packages, designed to make your code more intuitive compared to the original, older Base R. Tidyverse advantages:\n\nconsistent structure - making it easier to learn how to use different packages\nparticularly good for wrangling (manipulating, cleaning, joining) data\n\nmore flexible for visualizing data",
+    "crumbs": [
+      "Day 1",
+      "Welcome to SISMID Workshop: Introduction to R"
+    ]
  },
  {
+    "objectID": "modules/Module00-Welcome.html#workshop-overview",
+    "href": "modules/Module00-Welcome.html#workshop-overview",
+    "title": "Welcome to SISMID Workshop: Introduction to R",
+    "section": "Workshop Overview",
+    "text": "Workshop Overview\n14 lecture blocks that will each:\n\nStart with learning objectives\nEnd with summary slides\nInclude mini-exercise(s) or a full exercise\n\nThemes that will show up throughout the workshop:\n\nReproducibility\nGood coding techniques\nThinking algorithmically\nBasic terms / R jargon",
+    "crumbs": [
+      "Day 1",
+      "Welcome to SISMID Workshop: Introduction to R"
+    ]
  },
  {
+ "objectID": "modules/Module00-Welcome.html#reproducibility", + "href": "modules/Module00-Welcome.html#reproducibility", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Reproducibility", + "text": "Reproducibility\n\nReproducible research: the idea that other people should be able to verify the claims you make – usually by being able to see your data and run your code.\n\n\n\n2023 was the US government’s year of open science – specific aspects of reproducibility will be mandated for federally funded research!\nSharing and documenting your code is a massive step towards making your work reproducible, and the R ecosystem can play a big role in that!", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#rds-binary-file", - "href": "modules/Module05-DataImportExport.html#rds-binary-file", - "title": "Module 5: Data Import and Export", - "section": ".rds binary file", - "text": ".rds binary file\nSaving datasets in .rds format can save time if you have to read it back in later.\nwrite_rds() and read_rds() from readr package can be used to write/read a single R object to/from file.\nrequire(readr)\nwrite_rds(object1, file = \"filename.rds\")\nobject1 <- read_rds(file = \"filename.rds\")" + "objectID": "modules/Module00-Welcome.html#useful-free-resources", + "href": "modules/Module00-Welcome.html#useful-free-resources", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Useful (+ Free) Resources", + "text": "Useful (+ Free) Resources\nWant more?\n\nR for Data Science: http://r4ds.had.co.nz/\n(great general information)\nFundamentals of Data Visualization: https://clauswilke.com/dataviz/\nR for Epidemiology: https://www.r4epi.com/\nThe Epidemiologist R Handbook: https://epirhandbook.com/en/\nR basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html (great general information)\nOpen Case Studies: https://www.opencasestudies.org/\n(resource for specific public health cases with statistical implementation and interpretation)", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#rdardata-files", - "href": "modules/Module05-DataImportExport.html#rdardata-files", - "title": "Module 5: Data Import and Export", - "section": ".rda/RData files", - "text": ".rda/RData files\nThe Base R functions save() and load() can be used to save and load multiple R objects.\nsave() writes an external representation of R objects to the specified file, and can by loaded back into the environment using load(). A nice feature about using save and load is that the R object(s) is directly imported into the environment and you don’t have to specify the name. The files can be saved as .RData or .Rda files.\nFunction signature\nsave(object1, object2, file = \"filename.RData\")\nload(\"filename.RData\")\nNote, that you separate the objects you want to save with commas." 
+ "objectID": "modules/Module00-Welcome.html#useful-free-resources-1", + "href": "modules/Module00-Welcome.html#useful-free-resources-1", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Useful (+Free) Resources", + "text": "Useful (+Free) Resources\nNeed help?\n\nVarious “Cheat Sheets”: https://github.com/rstudio/cheatsheets/\nR reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf\nR jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf", + "crumbs": [ + "Day 1", + "Welcome to SISMID Workshop: Introduction to R" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#summary", - "href": "modules/Module05-DataImportExport.html#summary", - "title": "Module 5: Data Import and Export", - "section": "Summary", - "text": "Summary\n\nImporting or ‘Reading in’ data are the first step of any real project / data analysis\nThe Base R ‘util’ package has useful functions including read.csv() and read.delim() to importing/reading data or write.csv() and write.table() for exporting/writing data\nWhen importing data (exception is object from .RData), you must assign it to an object, otherwise it cannot be used\nIf data are imported correctly, they can be found in the Environment pane of RStudio\nYou only need to install a package once (unless you update R or the package), but you will need to attach a package each time you want to use it.\nTo complete a task you don’t know how to do (e.g., reading in an excel data file) use the following steps: 1. Asl Google / ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Attach package, 5. Use function" + "objectID": "modules/Module00-Welcome.html#installing-r", + "href": "modules/Module00-Welcome.html#installing-r", + "title": "Welcome to SISMID Workshop: Introduction to R", + "section": "Installing R", + "text": "Installing R\nHopefully everyone has pre-installed R and RStudio. We will take a moment to go around and make sure everyone is ready to go. 
Please open up your RStudio and leave it open as we check everyone’s laptops.\n\nInstall the latest version from: http://cran.r-project.org/\nInstall RStudio",
+    "crumbs": [
+      "Day 1",
+      "Welcome to SISMID Workshop: Introduction to R"
+    ]
  },
  {
+    "objectID": "references.html",
+    "href": "references.html",
+    "title": "Course Resources",
+    "section": "",
+    "text": "Data and Exercise downloads\n\nDownload all datasets here: click to download.\nDownload all exercises and solution files here: click to download\nDownload all slide decks here: click to download\nGet the example R Markdown document for Module 11 here: click to download\n\nAnd the sample bibliography “bib” file is here: click to download\nAnd the rendered HTML file is here: click to download\n\nCourse GitHub where all materials can be found (to download the entire course as a zip file click the green “Code” button): https://github.com/UGA-IDD/SISMID-2024.\n\n\n\nNeed help?\n\nVarious “Cheat Sheets”: https://github.com/rstudio/cheatsheets/\nR reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf\n\nR jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n\nOther references\n\n\nBatra, Neale, Alex Spina, Paula Blomquist, Finlay Campbell, Henry Laurenson-Schafer, Florence Isaac, Natalie Fischer, et al. 2021. epiR Handbook. Edited by Neale Batra. https://epirhandbook.com/; Applied Epi Incorporated.\n\n\nCarchedi, Nick, and Sean Kross. 2024. “Learn r, in r.” Swirl. https://swirlstats.com/.\n\n\nKeyes, David. 2024. R for the Rest of Us: A Statistics-Free Introduction. San Francisco, CA: No Starch Press.\n\n\nMatloff, Norman. 2011. The Art of R Programming. San Francisco, CA: No Starch Press.\n\n\nR Core team. 2024. An Introduction to R. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. Sebastopol, CA: https://r4ds.hadley.nz/; O’Reilly Media.\n\n\n\n\n\n\n\n\nReuseCC BY-NC 4.0",
+    "crumbs": [
+      "Course Resources"
+    ]
  },
  {
+    "objectID": "index.html",
+    "href": "index.html",
+    "title": "Welcome",
+    "section": "",
+    "text": "Welcome to “Introduction to R”!\nThis website contains all of the material for the 2024 Summer Institute in Modeling for Infectious Diseases (SISMID) Module “Introduction to R”.",
+    "crumbs": [
+      "Welcome!"
+    ]
  },
  {
+    "objectID": "index.html#prerequisities",
+    "href": "index.html#prerequisities",
+    "title": "Welcome",
+    "section": "Prerequisities",
+    "text": "Prerequisities\nFamiliarity with basic statistical concepts at the level of an introductory statistics class is assumed for our course.\nBefore the course begins, you should install R and RStudio on your laptop. If you are using an older version of R, you should update it before the course begins. You will need at least R version 4.3.0 for this course, but using the most recent version (4.4.1 at the time of writing) is always preferable.\n\nYou can install R from the CRAN website by clicking on the correct download link for your OS.\nYou can install RStudio from the Posit website.",
+    "crumbs": [
+      "Welcome!"
+    ]
  },
  {
+    "objectID": "index.html#about-the-instructors",
+    "href": "index.html#about-the-instructors",
+    "title": "Welcome",
+    "section": "About the instructors",
+    "text": "About the instructors\n\n\n\nCo-Instructor: Dr. Amy Winter\n\n\nDr. Winter is an Assistant Professor of Epidemiology at the University of Georgia. She has been coding in R for 10 years, and uses R day-to-day to conduct her research addressing policy-relevant questions on the transmission and control of infectious diseases in human populations, particularly VPDs. She teaches a semester-long course titled Introduction to Coding in R for Public Health to graduate students at the University of Georgia.\n\n\n\nCo-Instructor: Zane Billings\n\n\nZane Billings is a PhD student in Epidemiology and Biostatistics at the University of Georgia, working with Andreas Handel. He has been using R since 2017, and uses R for nearly all of his statistics and data science practice. Zane’s research focuses on the immune response to influenza vaccination, and uses machine learning and multilevel regression modeling (in R!) to improve our understanding of influenza immunology.",
+    "crumbs": [
+      "Welcome!"
+ ] }, { - "objectID": "modules/Module06-DataSubset.html#getting-to-know-our-data", - "href": "modules/Module06-DataSubset.html#getting-to-know-our-data", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Getting to know our data", - "text": "Getting to know our data\nThe dim(), nrow(), and ncol() functions are good options to check the dimensions of your data before moving forward.\nLet’s first read in the data from the previous module.\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n\n\ndim(df) # rows, columns\n\n[1] 651 5\n\nnrow(df) # number of rows\n\n[1] 651\n\nncol(df) # number of columns\n\n[1] 5" + "objectID": "modules/Module04-RProject.html#learning-objectives", + "href": "modules/Module04-RProject.html#learning-objectives", + "title": "Module 4: R Project", + "section": "Learning Objectives", + "text": "Learning Objectives\nAfter module 4, you should be able to…\n\nCreate an R Project\nCheck you are in the desired R Project\nReference the Files pane in RStudio\nDescribe “good” R Project organization", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#quick-summary-of-data", - "href": "modules/Module06-DataSubset.html#quick-summary-of-data", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Quick summary of data", - "text": "Quick summary of data\nThe colnames(), str() and summary()functions from Base R are great functions to assess the data type and some summary statistics.\n\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\nstr(df)\n\n'data.frame': 651 obs. of 5 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.318 3.437 0.3 143.236 0.448 ...\n $ age : int 2 4 4 4 1 4 4 NA 4 2 ...\n $ gender : chr \"Female\" \"Female\" \"Male\" \"Male\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n\nsummary(df)\n\n observation_id IgG_concentration age gender \n Min. :5006 Min. : 0.0054 Min. : 1.000 Length:651 \n 1st Qu.:6306 1st Qu.: 0.3000 1st Qu.: 3.000 Class :character \n Median :7495 Median : 1.6658 Median : 6.000 Mode :character \n Mean :7492 Mean : 87.3683 Mean : 6.606 \n 3rd Qu.:8749 3rd Qu.:141.4405 3rd Qu.:10.000 \n Max. :9982 Max. :916.4179 Max. :15.000 \n NA's :10 NA's :9 \n slum \n Length:651 \n Class :character \n Mode :character \n \n \n \n \n\n\nNote, if you have a very large dataset with 15+ variables, summary() is not so efficient." + "objectID": "modules/Module04-RProject.html#rstudio-project", + "href": "modules/Module04-RProject.html#rstudio-project", + "title": "Module 4: R Project", + "section": "RStudio Project", + "text": "RStudio Project\nRStudio “Project” is one highly recommended strategy to build organized and reproducible code in R.\n\nHelps with working directories by easily incorporating relative paths only.\nHelps you organize your code, data, and output.\nAllows you to open multiple RStudio sessions at once!", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#description-of-data", - "href": "modules/Module06-DataSubset.html#description-of-data", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Description of data", - "text": "Description of data\nThis is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. 
Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. We will use this dataset for modules throughout the Workshop." + "objectID": "modules/Module04-RProject.html#rstudio-project-creation", + "href": "modules/Module04-RProject.html#rstudio-project-creation", + "title": "Module 4: R Project", + "section": "RStudio Project Creation", + "text": "RStudio Project Creation\nLet’s create a new RStudio Project.\nFind the File Menu in the Menu Bar –> New Project –> New Directory –> New Project\nName your Project “SISMID_IntroToR_RProject”", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe", - "href": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "View the data as a whole dataframe", - "text": "View the data as a whole dataframe\nThe View() function, one of the few Base R functions with a capital letter, can be used to open the data in a new tab and view it as you would in Excel.\n\nView(df)" + "objectID": "modules/Module04-RProject.html#rstudio-project-organization", + "href": "modules/Module04-RProject.html#rstudio-project-organization", + "title": "Module 4: R Project", + "section": "RStudio Project Organization", + "text": "RStudio Project Organization\nThis is my personal preference for organizing an R Project. But for this workshop it will be mandatory, as it will help us help you. A critical component of conducting any data analysis is being able to reproduce it! Organizing your code, data, output, and figures is a necessary (although not sufficient) condition for reproducibility.\nCreate 4 sub-directories with the following names within your “SISMID_IntroToR_RProject” folder:\n\ncode\ndata\noutput\nfigures\n\nWe will be working from this directory for the remainder of the Workshop. Take a moment to move any R scripts you have already created to the ‘code’ sub-directory.", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe-1", - "href": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe-1", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "View the data as a whole dataframe", - "text": "View the data as a whole dataframe\nYou can also open a new tab of the data by clicking on the data icon beside the object in the Environment pane.\n\nYou can also hold down Cmd or CTRL and click on the name of a data frame in your code."
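The slides describe creating the four project sub-directories by pointing and clicking; if you prefer to script it, a minimal base-R sketch such as the following (an illustrative addition, not part of the original instructions) creates the same folders inside the open R Project.

```r
# Create the four workshop sub-directories inside the R Project folder;
# showWarnings = FALSE silently skips any folder that already exists
for (folder in c("code", "data", "output", "figures")) {
  dir.create(folder, showWarnings = FALSE)
}

# Confirm that the folders now exist
list.files()
```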
+ "objectID": "modules/Module04-RProject.html#some-things-to-notice-in-an-r-project", + "href": "modules/Module04-RProject.html#some-things-to-notice-in-an-r-project", + "title": "Module 4: R Project", + "section": "Some things to notice in an R Project", + "text": "Some things to notice in an R Project\n\nThe name of the R Project will be shown at the top of the RStudio Window\nIf you check the working directory using getwd() you will find the working directory is set to the location where the R Project was saved.\nThe Files pane in RStudio is also set to the location where the R Project was saved, making it easy to navigate to sub-directories directly from RStudio.", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#indexing", - "href": "modules/Module06-DataSubset.html#indexing", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Indexing", - "text": "Indexing\nR contains several operators which allow access to individual elements or subsets through indexing. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: [, [[ and $.\n\nx[i] #if x is a vector\nx[i, j] #if x is a matrix/data frame\nx[[i]] #if x is a list\nx$a #if x is a data frame or list\nx$\"a\" #if x is a data frame or list" + "objectID": "modules/Module04-RProject.html#r-project---common-issues", + "href": "modules/Module04-RProject.html#r-project---common-issues", + "title": "Module 4: R Project", + "section": "R Project - Common issues", + "text": "R Project - Common issues\nIf you simply open RStudio, it will not automatically open your R Project. As a result, when you say run a function to import data using the relative path based on your working directory, it won’t be able to find the data.\nTo open a previously created R Project, you need to open the R Project (i.e., double click on SISMID_IntroToR_RProject.RProj)", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#vectors-and-multi-dimensional-objects", - "href": "modules/Module06-DataSubset.html#vectors-and-multi-dimensional-objects", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Vectors and multi-dimensional objects", - "text": "Vectors and multi-dimensional objects\nTo index a vector, vector[i] select the ith element. 
To index a multi-dimensional object such as a matrix, matrix[i, j] selects the element in row i and column j, whereas for a three-dimensional array, array[i, j, k] selects the element in row i and column j of matrix (slice) k.\nLet’s practice by first creating the same objects as we did in Module 1.\n\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n\nHere is a reminder of what these objects look like.\n\nvector.object1\n\n[1] 2 3 4 5\n\nmatrix.object\n\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nFinally, let’s use indexing to pull out elements of the objects.\n\nvector.object1[2] #pulling the second element\n\n[1] 3\n\nmatrix.object[1,2] #pulling the element in row 1 column 2\n\n[1] 3" + "objectID": "modules/Module04-RProject.html#summary", + "href": "modules/Module04-RProject.html#summary", + "title": "Module 4: R Project", + "section": "Summary", + "text": "Summary\n\nR Projects are really helpful for lots of reasons, including improving the reproducibility of your work\nConsistently set up your R Project’s sub-directories so that you can easily navigate the project\nIf you get an error that a file can’t be found, make sure you correctly opened the R Project by looking for the Project name at the top of the RStudio application window.", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#list-objects", - "href": "modules/Module06-DataSubset.html#list-objects", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "List objects", - "text": "List objects\nFor lists, one generally uses list[[p]] to select any single element p.\nLet’s practice by creating the same list as we did in Module 1.\n\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nNow we use indexing to pull out the 3rd element in the list.\n\nlist.object[[3]]\n\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nWhat happens if we use a single square bracket?\n\nlist.object[3]\n\n[[1]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nThe [[ operator is called the “extract” operator and gives us the element from the list. The [ operator is called the “subset” operator and gives us a subset of the list, which is still a list."
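To make the array indexing order mentioned above concrete, here is a small sketch using a new object, array.object, which is not one of the Module 1 objects.

```r
# A 3-dimensional array: 2 rows, 2 columns, 2 "matrices" (slices),
# filled with the numbers 1 through 8
array.object <- array(data = 1:8, dim = c(2, 2, 2))

# Index order is [row, column, slice]: row 1, column 2 of the 2nd slice
array.object[1, 2, 2]
#> [1] 7
```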
+ "objectID": "modules/Module04-RProject.html#mini-exercise", + "href": "modules/Module04-RProject.html#mini-exercise", + "title": "Module 4: R Project", + "section": "Mini Exercise", + "text": "Mini Exercise\n\nClose R Studio\nReopen your R Project\nCheck that you are actually in the R Project\nCreate a new R script and save it in your ‘code’ subdirectory\nCreate a vector of numbers\nCreate a vector a character values\nAdd comment(s) to your R script to explain your code.", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#for-indexing", - "href": "modules/Module06-DataSubset.html#for-indexing", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "$ for indexing", - "text": "$ for indexing\n$ allows only a literal character string or a symbol as the index.\n\ndf$IgG_concentration\n\n [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01\n [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00\n [11] 4.574007e+00 1.583904e+02 NA 1.065068e+02 1.113870e+02\n [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02\n [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01\n [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01\n [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01\n [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01\n [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00\n [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01\n [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01 NA\n [56] 2.104693e+00 NA 3.826715e-01 3.926366e+01 1.129964e+00\n [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01\n [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02\n [71] 1.642979e+02 NA 1.542808e+02 6.033058e-01 2.809917e-01\n [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01\n [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02\n [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01\n [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00\n [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01\n[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01\n[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02\n[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01\n[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02\n[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01\n[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01\n[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02\n[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02\n[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00\n[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02\n[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02\n[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02\n[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01\n[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02\n[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02\n[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01\n[181] 8.767123e-02 NA 6.794521e-01 5.808219e-01 1.369863e-01\n[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 
4.601770e+02\n[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02\n[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01\n[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01\n[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01\n[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02\n[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02\n[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03\n[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01\n[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02\n[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02\n[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00\n[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01\n[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01\n[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02\n[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00\n[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02\n[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02\n[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01\n[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01\n[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02\n[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02\n[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01\n[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01\n[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02\n[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02\n[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02\n[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01\n[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00\n[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01\n[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01\n[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00\n[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01\n[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02\n[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01\n[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01\n[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02\n[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01\n[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01\n[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01\n[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02\n[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01\n[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02\n[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01\n[406] NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01\n[411] 3.000000e-01 1.786638e+02 3.279570e+02 NA 1.903017e+02\n[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01\n[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00\n[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01\n[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01\n[436] 1.292026e+02 4.425150e+01 
3.193548e+02 1.860991e+02 6.614420e-01\n[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02\n[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00\n[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01\n[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01\n[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01\n[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01\n[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01\n[476] 1.458599e+00 1.247678e+02 NA 4.334545e+02 3.000000e-01\n[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02\n[486] 5.127389e-01 1.171053e+02 NA 2.547771e-02 1.707430e+02\n[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01\n[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01\n[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02\n[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02\n[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01\n[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02\n[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02\n[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02\n[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01\n[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02\n[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02\n[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02\n[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01\n[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02\n[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02\n[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01\n[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01\n[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01\n[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02 NA\n[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01\n[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01\n[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02\n[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02\n[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01\n[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02\n[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01\n[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01\n[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01\n[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02\n[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01\n[651] 2.251656e-01\n\n\nNote, if you have spaces in your variable name, you will need to use back ticks ` after the $. This is a good reason to not create variables / column names with spaces." 
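To illustrate the back-tick point above, here is a small sketch with a made-up data frame (bad.names.df, not the workshop data) whose column name contains a space.

```r
# A small illustrative data frame with a space in its column name;
# check.names = FALSE stops data.frame() from "fixing" the name
bad.names.df <- data.frame("IgG concentration" = c(0.3, 3.4), check.names = FALSE)

# Back ticks are required around a name that contains spaces
bad.names.df$`IgG concentration`

# A syntactic name needs no back ticks, e.g., df$IgG_concentration
```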
+ "objectID": "modules/Module04-RProject.html#acknowledgements", + "href": "modules/Module04-RProject.html#acknowledgements", + "title": "Module 4: R Project", + "section": "Acknowledgements", + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.", + "crumbs": [ + "Day 1", + "Module 4: R Project" + ] }, { - "objectID": "modules/Module06-DataSubset.html#for-indexing-with-lists", - "href": "modules/Module06-DataSubset.html#for-indexing-with-lists", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "$ for indexing with lists", - "text": "$ for indexing with lists\n$ allows only a literal character string or a symbol as the index. For a list it extracts a named element.\nList elements can be named\n\nlist.object.named <- list(\n emory = number.object,\n uga = vector.object2,\n gsu = matrix.object\n)\nlist.object.named\n\n$emory\n[1] 3\n\n$uga\n[1] \"blue\" \"red\" \"yellow\"\n\n$gsu\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nIf list elements are named, than you can reference data from list using $ or using double square brackets, [[\n\nlist.object.named$uga \n\n[1] \"blue\" \"red\" \"yellow\"\n\nlist.object.named[[\"uga\"]] \n\n[1] \"blue\" \"red\" \"yellow\"" + "objectID": "schedule.html", + "href": "schedule.html", + "title": "Course Schedule", + "section": "", + "text": "Meeting times:\nLocation: Randal Rollins Building (RR) 201, Emory University", + "crumbs": [ + "Course Schedule" + ] }, { - "objectID": "modules/Module06-DataSubset.html#using-indexing-to-rename-columns", - "href": "modules/Module06-DataSubset.html#using-indexing-to-rename-columns", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Using indexing to rename columns", - "text": "Using indexing to rename columns\nAs mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).\n\ncolnames(df) \n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\ncolnames(df)[2:3] <- c(\"IgG_concentration_IU/mL\", \"age_year\") # reassigns\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age_year\" \"gender\" \n[5] \"slum\" \n\n\n\nFor the sake of the module, I am going to reassign them back to the original variable names\n\ncolnames(df)[2:3] <- c(\"IgG_concentration\", \"age\") #reset" + "objectID": "schedule.html#day-01-monday", + "href": "schedule.html#day-01-monday", + "title": "Course Schedule", + "section": "Day 01 – Monday", + "text": "Day 01 – Monday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 09:00 am\nModule 0 (Amy and Zane)\n\n\n09:00 am - 10:00 am\nModule 1 (Amy)\n\n\n10:00 am - 10:30 am\nCoffee break\n\n\n10:30 am - 11:15 am\nModule 2 (Amy)\n\n\n11:15 am - 11:30 am\nModule 3 (Zane)\n\n\n11:30 am - 12:00 pm\nModule 4 (Zane)\n\n\n12:00 pm - 01:30 pm\nLunch (2nd floor lobby)\n\n\n01:30 pm - 02:15 pm\nModule 5 (Amy)\n\n\n02:15 pm - 02:45 pm\nExercise 1\n\n\n02:45 pm - 03:00 pm\nStart Module 6 (Amy)\n\n\n03:00 pm - 03:30 pm\nCoffee break\n\n\n03:30 pm - 04:00 pm\nFinish Module 6 (Amy or Zane)\n\n\n04:00 pm - 05:00 pm\nModule 7, exercise 2 in remaining time (Zane)\n\n\n05:00 pm - 07:00 pm\nNetworking night and poster session, Randal Rollins P01", + "crumbs": [ + "Course Schedule" + ] }, { - "objectID": "modules/Module06-DataSubset.html#using-indexing-to-subset-data", - "href": "modules/Module06-DataSubset.html#using-indexing-to-subset-data", - "title": "Module 6: Get to Know Your Data and 
Subsetting", - "section": "Using indexing to subset data", - "text": "Using indexing to subset data\nWe can also subset a data frames and matrices (2-dimensional objects) using the bracket [, ].\nWe can subset by columns and pull the x column using the index of the column or the column name (“age”)\n\ndf[, \"age\"] #same as df[, 2]\n\n [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01\n [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00\n [11] 4.574007e+00 1.583904e+02 NA 1.065068e+02 1.113870e+02\n [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02\n [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01\n [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01\n [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01\n [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01\n [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00\n [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01\n [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01 NA\n [56] 2.104693e+00 NA 3.826715e-01 3.926366e+01 1.129964e+00\n [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01\n [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02\n [71] 1.642979e+02 NA 1.542808e+02 6.033058e-01 2.809917e-01\n [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01\n [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02\n [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01\n [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00\n [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01\n[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01\n[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02\n[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01\n[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02\n[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01\n[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01\n[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02\n[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02\n[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00\n[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02\n[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02\n[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02\n[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01\n[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02\n[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02\n[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01\n[181] 8.767123e-02 NA 6.794521e-01 5.808219e-01 1.369863e-01\n[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02\n[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02\n[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01\n[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01\n[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01\n[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02\n[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02\n[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03\n[226] 4.177898e-01 
3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01\n[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02\n[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02\n[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00\n[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01\n[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01\n[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02\n[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00\n[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02\n[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02\n[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01\n[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01\n[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02\n[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02\n[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01\n[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01\n[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02\n[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02\n[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02\n[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01\n[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00\n[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01\n[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01\n[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00\n[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01\n[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02\n[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01\n[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01\n[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02\n[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01\n[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01\n[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01\n[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02\n[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01\n[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02\n[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01\n[406] NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01\n[411] 3.000000e-01 1.786638e+02 3.279570e+02 NA 1.903017e+02\n[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01\n[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00\n[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01\n[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01\n[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01\n[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02\n[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00\n[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01\n[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01\n[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01\n[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01\n[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 
3.000000e-01\n[476] 1.458599e+00 1.247678e+02 NA 4.334545e+02 3.000000e-01\n[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02\n[486] 5.127389e-01 1.171053e+02 NA 2.547771e-02 1.707430e+02\n[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01\n[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01\n[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02\n[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02\n[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01\n[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02\n[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02\n[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02\n[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01\n[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02\n[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02\n[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02\n[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01\n[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02\n[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02\n[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01\n[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01\n[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01\n[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02 NA\n[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01\n[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01\n[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02\n[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02\n[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01\n[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02\n[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01\n[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01\n[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01\n[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02\n[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01\n[651] 2.251656e-01\n\n\nWe can select multiple columns using multiple column names:\n\ndf[, c(\"age\", \"gender\")]\n\n age gender\n1 3.176895e-01 Female\n2 3.436823e+00 Female\n3 3.000000e-01 Male\n4 1.432363e+02 Male\n5 4.476534e-01 Male\n6 2.527076e-02 Male\n7 6.101083e-01 Female\n8 3.000000e-01 Female\n9 2.916968e+00 Male\n10 1.649819e+00 Male\n11 4.574007e+00 Male\n12 1.583904e+02 Female\n13 NA Male\n14 1.065068e+02 Male\n15 1.113870e+02 Male\n16 4.144893e+01 Male\n17 3.000000e-01 Male\n18 2.527076e-01 Female\n19 8.159247e+01 Female\n20 1.825342e+02 Male\n21 4.244656e+01 Male\n22 1.193493e+02 Female\n23 3.000000e-01 Male\n24 3.000000e-01 Female\n25 9.025271e-01 Female\n26 3.501805e-01 Male\n27 3.000000e-01 Male\n28 1.227437e+00 Female\n29 1.702055e+02 Female\n30 3.000000e-01 Female\n31 4.801444e-01 Male\n32 2.527076e-02 Male\n33 3.000000e-01 Female\n34 5.776173e-02 Male\n35 4.801444e-01 Female\n36 3.826715e-01 Female\n37 3.000000e-01 Male\n38 4.048558e+02 Male\n39 3.000000e-01 Male\n40 5.451264e-01 Male\n41 3.000000e-01 
Female\n42 5.590753e+01 Male\n43 2.202166e-01 Female\n44 1.709760e+02 Male\n45 1.227437e+00 Male\n46 4.567527e+02 Male\n47 4.838480e+01 Male\n48 1.227437e-01 Female\n49 1.877256e-01 Female\n50 3.000000e-01 Female\n51 3.501805e-01 Male\n52 3.339350e+00 Male\n53 3.000000e-01 Female\n54 5.451264e-01 Female\n55 NA Male\n56 2.104693e+00 Male\n57 NA Male\n58 3.826715e-01 Female\n59 3.926366e+01 Female\n60 1.129964e+00 Male\n61 3.501805e+00 Female\n62 7.542808e+01 Female\n63 4.800475e+01 Female\n64 1.000000e+00 Male\n65 4.068884e+01 Male\n66 3.000000e-01 Female\n67 4.377672e+01 Female\n68 1.193493e+02 Male\n69 6.977740e+01 Male\n70 1.373288e+02 Female\n71 1.642979e+02 Male\n72 NA Female\n73 1.542808e+02 Male\n74 6.033058e-01 Male\n75 2.809917e-01 Male\n76 1.966942e+00 Male\n77 2.041322e+00 Male\n78 2.115702e+00 Female\n79 4.663043e+02 Male\n80 3.000000e-01 Male\n81 1.500796e+02 Male\n82 1.543790e+02 Female\n83 2.561983e-01 Female\n84 1.596338e+02 Male\n85 1.732484e+02 Female\n86 4.641304e+02 Female\n87 3.736364e+01 Male\n88 1.572452e+02 Female\n89 3.000000e-01 Male\n90 3.000000e-01 Male\n91 8.264463e-02 Male\n92 6.776859e-01 Female\n93 7.272727e-01 Male\n94 2.066116e-01 Female\n95 1.966942e+00 Male\n96 3.000000e-01 Male\n97 3.000000e-01 Male\n98 2.809917e-01 Female\n99 8.016529e-01 Female\n100 1.818182e-01 Female\n101 1.818182e-01 Male\n102 8.264463e-02 Female\n103 3.422727e+01 Female\n104 8.743506e+00 Male\n105 3.000000e-01 Male\n106 1.641720e+02 Female\n107 4.049587e-01 Male\n108 1.001592e+02 Male\n109 4.489130e+02 Female\n110 1.101911e+02 Female\n111 4.440909e+01 Male\n112 1.288217e+02 Female\n113 2.840909e+01 Male\n114 1.003981e+02 Female\n115 8.512397e-01 Female\n116 1.322314e-01 Male\n117 1.297521e+00 Female\n118 1.570248e-01 Male\n119 1.966942e+00 Female\n120 1.536624e+02 Male\n121 3.000000e-01 Female\n122 3.000000e-01 Female\n123 1.074380e+00 Male\n124 1.099174e+00 Female\n125 3.057851e-01 Female\n126 3.000000e-01 Female\n127 5.785124e-02 Female\n128 4.391304e+02 Female\n129 6.130435e+02 Female\n130 1.074380e-01 Male\n131 7.125796e+01 Male\n132 4.222727e+01 Male\n133 1.620223e+02 Female\n134 3.750000e+01 Female\n135 1.534236e+02 Female\n136 6.239130e+02 Female\n137 5.521739e+02 Male\n138 5.785124e-02 Female\n139 6.547945e-01 Female\n140 8.767123e-02 Female\n141 3.000000e-01 Male\n142 2.849315e+00 Female\n143 3.835616e-02 Male\n144 2.849315e-01 Male\n145 4.649315e+00 Male\n146 1.369863e-01 Female\n147 3.589041e-01 Male\n148 1.049315e+00 Male\n149 4.668998e+01 Female\n150 1.473510e+02 Female\n151 4.589744e+01 Male\n152 2.109589e-01 Male\n153 1.741722e+02 Female\n154 2.496503e+01 Female\n155 1.850993e+02 Male\n156 1.863014e-01 Male\n157 1.863014e-01 Male\n158 4.589744e+01 Female\n159 1.942881e+02 Female\n160 5.079646e+02 Female\n161 8.767123e-01 Male\n162 2.750685e+00 Male\n163 1.503311e+02 Female\n164 3.000000e-01 Male\n165 3.095890e-01 Male\n166 3.000000e-01 Male\n167 6.371681e+02 Female\n168 6.054795e-01 Female\n169 1.955298e+02 Female\n170 1.786424e+02 Male\n171 1.120861e+02 Female\n172 1.331954e+02 Male\n173 2.159292e+02 Male\n174 5.628319e+02 Male\n175 1.900662e+02 Female\n176 6.547945e-01 Male\n177 1.665753e+00 Male\n178 1.739238e+02 Male\n179 9.991722e+01 Male\n180 9.321192e+01 Male\n181 8.767123e-02 Female\n182 NA Male\n183 6.794521e-01 Female\n184 5.808219e-01 Male\n185 1.369863e-01 Female\n186 2.060274e+00 Female\n187 1.610099e+02 Male\n188 4.082192e-01 Female\n189 8.273973e-01 Male\n190 4.601770e+02 Female\n191 1.389073e+02 Female\n192 3.867133e+01 Female\n193 9.260274e-01 
Female\n194 5.918874e+01 Female\n195 1.870861e+02 Female\n196 4.328767e-01 Male\n197 6.301370e-02 Male\n198 3.000000e-01 Female\n199 1.548013e+02 Male\n200 5.819536e+01 Female\n201 1.724338e+02 Female\n202 1.932401e+01 Female\n203 2.164420e+00 Female\n204 9.757412e-01 Female\n205 1.509434e-01 Male\n206 1.509434e-01 Female\n207 7.766571e+01 Male\n208 4.319563e+01 Female\n209 1.752022e-01 Male\n210 3.094775e+01 Female\n211 1.266846e-01 Male\n212 2.919806e+01 Male\n213 9.545455e+00 Female\n214 2.735115e+01 Female\n215 1.314841e+02 Female\n216 3.643985e+01 Male\n217 1.498559e+02 Female\n218 9.363636e+00 Female\n219 2.479784e-01 Male\n220 5.390836e-02 Female\n221 8.787062e-01 Female\n222 1.994609e-01 Male\n223 3.000000e-01 Female\n224 3.000000e-01 Male\n225 5.390836e-03 Female\n226 4.177898e-01 Female\n227 3.000000e-01 Female\n228 2.479784e-01 Male\n229 2.964960e-02 Male\n230 2.964960e-01 Male\n231 5.148248e+00 Female\n232 1.994609e-01 Male\n233 3.000000e-01 Male\n234 1.779539e+02 Male\n235 3.290210e+02 Female\n236 3.000000e-01 Male\n237 1.809798e+02 Female\n238 4.905660e-01 Male\n239 1.266846e-01 Male\n240 1.543948e+02 Female\n241 1.379683e+02 Female\n242 6.153846e+02 Male\n243 1.474784e+02 Male\n244 3.000000e-01 Female\n245 1.024259e+00 Male\n246 4.444056e+02 Female\n247 3.000000e-01 Male\n248 2.504043e+00 Female\n249 3.000000e-01 Female\n250 3.000000e-01 Female\n251 7.816712e-02 Female\n252 3.000000e-01 Female\n253 5.390836e-02 Male\n254 1.494236e+02 Female\n255 5.972622e+01 Male\n256 6.361186e-01 Female\n257 1.837896e+02 Female\n258 1.320809e+02 Female\n259 1.571906e-01 Male\n260 1.520231e+02 Male\n261 3.000000e-01 Female\n262 3.000000e-01 Female\n263 1.823699e+02 Male\n264 3.000000e-01 Male\n265 2.173913e+00 Male\n266 2.142202e+01 Male\n267 3.000000e-01 Female\n268 3.408027e+00 Male\n269 4.155963e+01 Male\n270 9.698997e-02 Male\n271 1.238532e+01 Female\n272 9.528926e+00 Male\n273 1.916185e+02 Female\n274 1.060201e+00 Male\n275 3.679104e+02 Female\n276 4.288991e+01 Male\n277 9.971098e+01 Male\n278 3.000000e-01 Male\n279 1.208092e+02 Male\n280 3.000000e-01 Male\n281 6.688963e-03 Female\n282 2.505017e+00 Female\n283 1.481605e+00 Male\n284 3.000000e-01 Female\n285 5.183946e-01 Female\n286 3.000000e-01 Female\n287 1.872910e-01 Male\n288 3.678930e-01 Female\n289 3.000000e-01 Male\n290 4.529851e+02 Female\n291 3.169725e+01 Female\n292 3.000000e-01 Male\n293 4.922018e+01 Male\n294 2.548507e+02 Male\n295 1.661850e+02 Male\n296 9.164179e+02 Male\n297 3.678930e-01 Female\n298 1.236994e+02 Male\n299 6.705202e+01 Male\n300 3.834862e+01 Male\n301 1.963211e+00 Female\n302 3.000000e-01 Male\n303 2.474916e-01 Male\n304 3.000000e-01 Female\n305 2.173913e-01 Male\n306 8.193980e-01 Male\n307 2.444816e+00 Female\n308 3.000000e-01 Male\n309 1.571906e-01 Female\n310 1.849711e+02 Male\n311 6.119403e+02 Female\n312 3.000000e-01 Female\n313 4.280936e-01 Female\n314 9.698997e-02 Male\n315 3.678930e-02 Female\n316 4.832090e+02 Male\n317 1.390173e+02 Female\n318 3.000000e-01 Male\n319 6.555970e+02 Female\n320 1.526012e+02 Female\n321 3.000000e-01 Female\n322 7.222222e-01 Male\n323 7.724426e+01 Male\n324 3.000000e-01 Male\n325 6.111111e-01 Female\n326 1.555556e+00 Female\n327 3.055556e-01 Male\n328 1.500000e+00 Male\n329 1.470772e+02 Male\n330 1.694444e+00 Female\n331 3.138298e+02 Female\n332 1.414405e+02 Female\n333 1.990605e+02 Female\n334 4.212766e+02 Male\n335 3.000000e-01 Male\n336 3.000000e-01 Male\n337 6.478723e+02 Male\n338 3.000000e-01 Male\n339 2.222222e+00 Female\n340 3.000000e-01 Male\n341 2.055556e+00 
Male\n342 2.777778e-02 Female\n343 8.333333e-02 Male\n344 1.032359e+02 Female\n345 1.611111e+00 Female\n346 8.333333e-02 Female\n347 2.333333e+00 Female\n348 5.755319e+02 Male\n349 1.686848e+02 Female\n350 1.111111e-01 Male\n351 3.000000e-01 Male\n352 8.372340e+02 Female\n353 3.000000e-01 Male\n354 3.784504e+01 Male\n355 3.819149e+02 Male\n356 5.555556e-02 Female\n357 3.000000e+02 Female\n358 1.855950e+02 Male\n359 1.944444e-01 Female\n360 3.000000e-01 Male\n361 5.555556e-02 Female\n362 1.138889e+00 Male\n363 4.254237e+01 Female\n364 3.000000e-01 Male\n365 3.000000e-01 Male\n366 3.000000e-01 Female\n367 3.000000e-01 Female\n368 3.138298e+02 Female\n369 1.235908e+02 Male\n370 4.159574e+02 Male\n371 3.009685e+01 Female\n372 1.567850e+02 Female\n373 1.367432e+02 Female\n374 3.731235e+01 Female\n375 9.164927e+01 Male\n376 2.936170e+02 Female\n377 8.820459e+01 Female\n378 1.035491e+02 Male\n379 7.379958e+01 Female\n380 3.000000e-01 Male\n381 1.718750e+02 Male\n382 2.128527e+00 Male\n383 1.253918e+00 Female\n384 2.382445e-01 Male\n385 4.639498e-01 Female\n386 1.253918e-01 Male\n387 1.253918e-01 Male\n388 3.000000e-01 Female\n389 1.000000e+00 Male\n390 1.570043e+02 Male\n391 4.344086e+02 Female\n392 2.184953e+00 Male\n393 1.507837e+00 Female\n394 3.228840e-01 Female\n395 4.588024e+01 Male\n396 1.660560e+02 Male\n397 3.000000e-01 Male\n398 3.043011e+02 Male\n399 2.612903e+02 Female\n400 1.621767e+02 Male\n401 3.228840e-01 Male\n402 4.639498e-01 Female\n403 2.495298e+00 Female\n404 3.257053e+00 Female\n405 3.793103e-01 Female\n406 NA Male\n407 6.896552e-02 Female\n408 3.000000e-01 Male\n409 1.423197e+00 Female\n410 3.000000e-01 Female\n411 3.000000e-01 Female\n412 1.786638e+02 Male\n413 3.279570e+02 Male\n414 NA Female\n415 1.903017e+02 Male\n416 1.654095e+02 Female\n417 4.639498e-01 Female\n418 1.815733e+02 Male\n419 1.366771e+00 Male\n420 1.536050e-01 Female\n421 1.306587e+01 Male\n422 2.129032e+02 Female\n423 1.925647e+02 Male\n424 3.000000e-01 Female\n425 1.028213e+00 Female\n426 3.793103e-01 Female\n427 8.025078e-01 Female\n428 4.860215e+02 Female\n429 3.000000e-01 Female\n430 2.100313e-01 Male\n431 2.767665e+01 Female\n432 1.592476e+00 Male\n433 9.717868e-02 Female\n434 1.028213e+00 Female\n435 3.793103e-01 Male\n436 1.292026e+02 Male\n437 4.425150e+01 Female\n438 3.193548e+02 Female\n439 1.860991e+02 Female\n440 6.614420e-01 Female\n441 5.203762e-01 Male\n442 1.330819e+02 Male\n443 1.673491e+02 Female\n444 3.000000e-01 Male\n445 1.117457e+02 Male\n446 3.045509e+01 Female\n447 3.000000e-01 Male\n448 8.280255e-02 Female\n449 3.000000e-01 Female\n450 1.200637e+00 Female\n451 1.687898e-01 Male\n452 7.367273e+02 Female\n453 8.280255e-02 Male\n454 5.127389e-01 Male\n455 1.974522e-01 Male\n456 7.993631e-01 Female\n457 3.000000e-01 Male\n458 3.298182e+02 Male\n459 9.736842e+01 Female\n460 3.000000e-01 Female\n461 3.000000e-01 Female\n462 4.214545e+02 Female\n463 3.000000e-01 Male\n464 2.578182e+02 Female\n465 2.261147e-01 Male\n466 3.000000e-01 Female\n467 1.883901e+02 Male\n468 9.458204e+01 Female\n469 3.000000e-01 Female\n470 3.000000e-01 Male\n471 7.707006e-01 Female\n472 5.032727e+02 Male\n473 1.544586e+00 Female\n474 1.431115e+02 Female\n475 3.000000e-01 Male\n476 1.458599e+00 Male\n477 1.247678e+02 Female\n478 NA Female\n479 4.334545e+02 Male\n480 3.000000e-01 Female\n481 6.156364e+02 Female\n482 9.574303e+01 Male\n483 1.928019e+02 Male\n484 1.888545e+02 Male\n485 1.598297e+02 Female\n486 5.127389e-01 Male\n487 1.171053e+02 Female\n488 NA Male\n489 2.547771e-02 Female\n490 1.707430e+02 
Female\n491 3.000000e-01 Male\n492 1.869969e+02 Male\n493 4.731481e+01 Male\n494 1.988390e+02 Female\n495 3.000000e-01 Male\n496 8.808050e+01 Male\n497 2.003185e+00 Female\n498 3.000000e-01 Male\n499 3.509259e+01 Female\n500 9.365325e+01 Female\n501 3.000000e-01 Male\n502 3.736111e+01 Female\n503 1.674923e+02 Female\n504 8.808050e+01 Male\n505 1.656347e+02 Female\n506 3.722222e+01 Female\n507 6.756364e+02 Female\n508 3.000000e-01 Male\n509 1.698142e+02 Male\n510 1.628483e+02 Female\n511 5.985130e-01 Male\n512 1.903346e+00 Female\n513 3.000000e-01 Male\n514 3.000000e-01 Male\n515 8.996283e-01 Male\n516 3.977695e-01 Female\n517 3.000000e-01 Male\n518 3.000000e-01 Male\n519 3.000000e-01 Male\n520 3.000000e-01 Female\n521 7.446809e+02 Male\n522 6.095745e+02 Female\n523 1.427445e+02 Male\n524 3.000000e-01 Female\n525 2.973978e-02 Male\n526 3.977695e-01 Female\n527 4.095745e+02 Female\n528 4.595745e+02 Male\n529 3.000000e-01 Female\n530 1.976341e+02 Female\n531 3.776596e+02 Female\n532 1.777603e+02 Female\n533 4.312268e-01 Male\n534 6.765957e+02 Female\n535 7.978723e+02 Male\n536 9.665427e-02 Male\n537 1.879338e+02 Male\n538 4.358670e+01 Female\n539 3.000000e-01 Female\n540 3.000000e-01 Male\n541 2.638955e+01 Male\n542 3.180523e+01 Female\n543 1.746845e+02 Male\n544 1.876972e+02 Male\n545 1.044164e+02 Male\n546 1.202681e+02 Male\n547 1.630915e+02 Female\n548 1.276025e+02 Female\n549 8.880126e+01 Male\n550 3.563830e+02 Male\n551 2.212766e+02 Male\n552 1.969121e+01 Female\n553 3.755319e+02 Female\n554 1.214511e+02 Male\n555 1.034700e+02 Female\n556 3.000000e-01 Female\n557 3.643123e-01 Female\n558 6.319703e-02 Female\n559 3.000000e-01 Male\n560 3.000000e-01 Male\n561 3.000000e-01 Female\n562 3.000000e-01 Female\n563 3.000000e-01 Male\n564 3.000000e-01 Male\n565 3.000000e-01 Female\n566 3.000000e-01 Male\n567 1.664038e+02 Female\n568 2.946809e+02 Female\n569 4.391924e+01 Male\n570 1.874606e+02 Female\n571 1.143533e+02 Male\n572 1.600158e+02 Male\n573 1.635688e-01 Male\n574 8.809148e+01 Female\n575 1.337539e+02 Male\n576 1.985804e+02 Male\n577 1.578864e+02 Female\n578 3.000000e-01 Female\n579 3.000000e-01 Male\n580 1.953642e-01 Female\n581 1.119205e+00 Male\n582 2.523636e+02 Male\n583 3.000000e-01 Male\n584 4.844371e+00 Female\n585 3.000000e-01 Male\n586 1.492553e+02 Female\n587 1.993617e+02 Male\n588 2.847682e-01 Female\n589 3.145695e-01 Female\n590 3.000000e-01 Male\n591 3.406429e+01 Female\n592 6.595745e+01 Male\n593 3.000000e-01 Male\n594 2.174545e+02 Male\n595 NA Female\n596 5.957447e+01 Female\n597 7.236364e+02 Female\n598 3.000000e-01 Male\n599 3.000000e-01 Female\n600 3.000000e-01 Male\n601 2.676364e+02 Male\n602 1.891489e+02 Male\n603 3.036364e+02 Female\n604 3.000000e-01 Female\n605 3.000000e-01 Male\n606 3.000000e-01 Male\n607 3.000000e-01 Female\n608 3.000000e-01 Male\n609 1.447020e+00 Male\n610 2.130909e+02 Female\n611 1.357616e-01 Female\n612 3.000000e-01 Female\n613 3.000000e-01 Female\n614 5.534545e+02 Female\n615 1.891489e+02 Female\n616 7.202128e+01 Female\n617 3.250287e+01 Male\n618 1.655629e-02 Male\n619 3.123636e+02 Male\n620 3.000000e-01 Male\n621 7.138298e+01 Male\n622 3.000000e-01 Female\n623 6.946809e+01 Female\n624 4.012629e+01 Male\n625 1.629787e+02 Female\n626 1.508511e+02 Female\n627 1.655629e-02 Male\n628 3.000000e-01 Male\n629 4.635762e-02 Male\n630 3.000000e-01 Female\n631 3.000000e-01 Female\n632 3.000000e-01 Male\n633 1.942553e+02 Male\n634 3.690909e+02 Male\n635 3.000000e-01 Female\n636 3.000000e-01 Female\n637 2.847682e+00 Male\n638 1.435106e+02 Female\n639 
3.000000e-01 Male\n640 4.752009e+01 Female\n641 2.621125e+01 Female\n642 1.055319e+02 Female\n643 3.000000e-01 Female\n644 1.149007e+00 Male\n645 2.927273e+02 Female\n646 3.000000e-01 Female\n647 3.000000e-01 Female\n648 4.839265e+01 Male\n649 3.000000e-01 Male\n650 3.000000e-01 Female\n651 2.251656e-01 Female\n\n\nWe can remove select columns using column names as well: (xxzane - why - c(“slum”) not working)\n\ndf[, -3] #remove column 3, \"slum\" variable\n\n IgG_concentration age gender slum\n1 5772 3.176895e-01 Female Non slum\n2 8095 3.436823e+00 Female Non slum\n3 9784 3.000000e-01 Male Non slum\n4 9338 1.432363e+02 Male Non slum\n5 6369 4.476534e-01 Male Non slum\n6 6885 2.527076e-02 Male Non slum\n7 6252 6.101083e-01 Female Non slum\n8 8913 3.000000e-01 Female Non slum\n9 7332 2.916968e+00 Male Non slum\n10 6941 1.649819e+00 Male Non slum\n11 5104 4.574007e+00 Male Non slum\n12 9078 1.583904e+02 Female Non slum\n13 9960 NA Male Non slum\n14 9651 1.065068e+02 Male Non slum\n15 9229 1.113870e+02 Male Non slum\n16 5210 4.144893e+01 Male Non slum\n17 5105 3.000000e-01 Male Non slum\n18 7607 2.527076e-01 Female Non slum\n19 7582 8.159247e+01 Female Non slum\n20 8179 1.825342e+02 Male Non slum\n21 5660 4.244656e+01 Male Non slum\n22 6696 1.193493e+02 Female Non slum\n23 7842 3.000000e-01 Male Mixed\n24 6578 3.000000e-01 Female Mixed\n25 9619 9.025271e-01 Female Mixed\n26 9838 3.501805e-01 Male Mixed\n27 6935 3.000000e-01 Male Mixed\n28 5885 1.227437e+00 Female Mixed\n29 9657 1.702055e+02 Female Mixed\n30 9146 3.000000e-01 Female Mixed\n31 7056 4.801444e-01 Male Mixed\n32 9144 2.527076e-02 Male Mixed\n33 8696 3.000000e-01 Female Mixed\n34 7042 5.776173e-02 Male Mixed\n35 5278 4.801444e-01 Female Mixed\n36 6541 3.826715e-01 Female Mixed\n37 6070 3.000000e-01 Male Mixed\n38 5490 4.048558e+02 Male Mixed\n39 6527 3.000000e-01 Male Mixed\n40 5389 5.451264e-01 Male Mixed\n41 9003 3.000000e-01 Female Mixed\n42 6682 5.590753e+01 Male Mixed\n43 7844 2.202166e-01 Female Mixed\n44 8257 1.709760e+02 Male Mixed\n45 7767 1.227437e+00 Male Mixed\n46 8391 4.567527e+02 Male Mixed\n47 8317 4.838480e+01 Male Mixed\n48 7397 1.227437e-01 Female Mixed\n49 8495 1.877256e-01 Female Non slum\n50 8093 3.000000e-01 Female Non slum\n51 7375 3.501805e-01 Male Non slum\n52 5255 3.339350e+00 Male Non slum\n53 8445 3.000000e-01 Female Non slum\n54 8959 5.451264e-01 Female Non slum\n55 8400 NA Male Non slum\n56 7420 2.104693e+00 Male Non slum\n57 5206 NA Male Non slum\n58 7431 3.826715e-01 Female Non slum\n59 7230 3.926366e+01 Female Non slum\n60 8208 1.129964e+00 Male Non slum\n61 8538 3.501805e+00 Female Non slum\n62 6125 7.542808e+01 Female Non slum\n63 5767 4.800475e+01 Female Non slum\n64 5487 1.000000e+00 Male Non slum\n65 5539 4.068884e+01 Male Non slum\n66 5759 3.000000e-01 Female Non slum\n67 6845 4.377672e+01 Female Non slum\n68 7170 1.193493e+02 Male Non slum\n69 6588 6.977740e+01 Male Non slum\n70 7939 1.373288e+02 Female Non slum\n71 5006 1.642979e+02 Male Non slum\n72 9180 NA Female Non slum\n73 9638 1.542808e+02 Male Non slum\n74 7781 6.033058e-01 Male Non slum\n75 6932 2.809917e-01 Male Non slum\n76 8120 1.966942e+00 Male Non slum\n77 9292 2.041322e+00 Male Non slum\n78 9228 2.115702e+00 Female Non slum\n79 8185 4.663043e+02 Male Non slum\n80 6797 3.000000e-01 Male Non slum\n81 5970 1.500796e+02 Male Non slum\n82 7219 1.543790e+02 Female Non slum\n83 6870 2.561983e-01 Female Non slum\n84 7653 1.596338e+02 Male Non slum\n85 8824 1.732484e+02 Female Non slum\n86 8311 4.641304e+02 Female Non slum\n87 9458 
3.736364e+01 Male Non slum\n88 8275 1.572452e+02 Female Non slum\n89 6786 3.000000e-01 Male Non slum\n90 6595 3.000000e-01 Male Non slum\n91 5264 8.264463e-02 Male Non slum\n92 9188 6.776859e-01 Female Non slum\n93 6611 7.272727e-01 Male Non slum\n94 6840 2.066116e-01 Female Non slum\n95 5663 1.966942e+00 Male Non slum\n96 9611 3.000000e-01 Male Non slum\n97 7717 3.000000e-01 Male Non slum\n98 8374 2.809917e-01 Female Non slum\n99 5134 8.016529e-01 Female Non slum\n100 8122 1.818182e-01 Female Non slum\n101 6192 1.818182e-01 Male Non slum\n102 9668 8.264463e-02 Female Non slum\n103 9577 3.422727e+01 Female Non slum\n104 6403 8.743506e+00 Male Non slum\n105 9464 3.000000e-01 Male Non slum\n106 8157 1.641720e+02 Female Non slum\n107 9451 4.049587e-01 Male Non slum\n108 6615 1.001592e+02 Male Non slum\n109 9074 4.489130e+02 Female Non slum\n110 7479 1.101911e+02 Female Non slum\n111 8946 4.440909e+01 Male Non slum\n112 5296 1.288217e+02 Female Non slum\n113 6238 2.840909e+01 Male Non slum\n114 6303 1.003981e+02 Female Non slum\n115 6662 8.512397e-01 Female Non slum\n116 6251 1.322314e-01 Male Non slum\n117 9110 1.297521e+00 Female Non slum\n118 8480 1.570248e-01 Male Non slum\n119 5229 1.966942e+00 Female Non slum\n120 9173 1.536624e+02 Male Non slum\n121 9896 3.000000e-01 Female Non slum\n122 5057 3.000000e-01 Female Non slum\n123 7732 1.074380e+00 Male Non slum\n124 6882 1.099174e+00 Female Non slum\n125 9587 3.057851e-01 Female Non slum\n126 9930 3.000000e-01 Female Non slum\n127 6960 5.785124e-02 Female Non slum\n128 6335 4.391304e+02 Female Non slum\n129 6286 6.130435e+02 Female Non slum\n130 9035 1.074380e-01 Male Non slum\n131 5720 7.125796e+01 Male Non slum\n132 7368 4.222727e+01 Male Non slum\n133 5170 1.620223e+02 Female Non slum\n134 6691 3.750000e+01 Female Non slum\n135 6173 1.534236e+02 Female Non slum\n136 8170 6.239130e+02 Female Non slum\n137 9637 5.521739e+02 Male Non slum\n138 9482 5.785124e-02 Female Non slum\n139 7880 6.547945e-01 Female Non slum\n140 6307 8.767123e-02 Female Non slum\n141 8822 3.000000e-01 Male Non slum\n142 8190 2.849315e+00 Female Non slum\n143 7554 3.835616e-02 Male Non slum\n144 6519 2.849315e-01 Male Non slum\n145 9764 4.649315e+00 Male Non slum\n146 8792 1.369863e-01 Female Non slum\n147 6721 3.589041e-01 Male Non slum\n148 9042 1.049315e+00 Male Non slum\n149 7407 4.668998e+01 Female Non slum\n150 7229 1.473510e+02 Female Non slum\n151 7532 4.589744e+01 Male Non slum\n152 6516 2.109589e-01 Male Non slum\n153 7941 1.741722e+02 Female Non slum\n154 8124 2.496503e+01 Female Non slum\n155 7869 1.850993e+02 Male Non slum\n156 5647 1.863014e-01 Male Non slum\n157 9120 1.863014e-01 Male Non slum\n158 6608 4.589744e+01 Female Non slum\n159 8635 1.942881e+02 Female Mixed\n160 9341 5.079646e+02 Female Mixed\n161 9982 8.767123e-01 Male Mixed\n162 6976 2.750685e+00 Male Mixed\n163 6008 1.503311e+02 Female Mixed\n164 5432 3.000000e-01 Male Mixed\n165 5749 3.095890e-01 Male Mixed\n166 6428 3.000000e-01 Male Mixed\n167 5947 6.371681e+02 Female Mixed\n168 6027 6.054795e-01 Female Mixed\n169 5064 1.955298e+02 Female Mixed\n170 5861 1.786424e+02 Male Mixed\n171 6702 1.120861e+02 Female Mixed\n172 7851 1.331954e+02 Male Mixed\n173 8310 2.159292e+02 Male Mixed\n174 5897 5.628319e+02 Male Mixed\n175 9249 1.900662e+02 Female Mixed\n176 9163 6.547945e-01 Male Mixed\n177 6550 1.665753e+00 Male Mixed\n178 5859 1.739238e+02 Male Mixed\n179 5607 9.991722e+01 Male Mixed\n180 8746 9.321192e+01 Male Mixed\n181 5274 8.767123e-02 Female Mixed\n182 9412 NA Male Mixed\n183 5691 
6.794521e-01 Female Mixed\n184 9016 5.808219e-01 Male Mixed\n185 9128 1.369863e-01 Female Mixed\n186 8539 2.060274e+00 Female Mixed\n187 5703 1.610099e+02 Male Mixed\n188 9573 4.082192e-01 Female Mixed\n189 5852 8.273973e-01 Male Mixed\n190 5971 4.601770e+02 Female Mixed\n191 7015 1.389073e+02 Female Mixed\n192 8221 3.867133e+01 Female Mixed\n193 6752 9.260274e-01 Female Mixed\n194 7436 5.918874e+01 Female Mixed\n195 6869 1.870861e+02 Female Mixed\n196 8947 4.328767e-01 Male Mixed\n197 7360 6.301370e-02 Male Mixed\n198 7494 3.000000e-01 Female Mixed\n199 8243 1.548013e+02 Male Mixed\n200 6176 5.819536e+01 Female Mixed\n201 6818 1.724338e+02 Female Mixed\n202 8083 1.932401e+01 Female Mixed\n203 6711 2.164420e+00 Female Non slum\n204 8890 9.757412e-01 Female Non slum\n205 5576 1.509434e-01 Male Non slum\n206 8396 1.509434e-01 Female Non slum\n207 5986 7.766571e+01 Male Non slum\n208 9758 4.319563e+01 Female Non slum\n209 5444 1.752022e-01 Male Non slum\n210 6394 3.094775e+01 Female Non slum\n211 5694 1.266846e-01 Male Non slum\n212 9604 2.919806e+01 Male Non slum\n213 7895 9.545455e+00 Female Non slum\n214 5141 2.735115e+01 Female Non slum\n215 8034 1.314841e+02 Female Non slum\n216 6566 3.643985e+01 Male Non slum\n217 6827 1.498559e+02 Female Non slum\n218 7400 9.363636e+00 Female Non slum\n219 9094 2.479784e-01 Male Non slum\n220 9474 5.390836e-02 Female Non slum\n221 7984 8.787062e-01 Female Slum\n222 9524 1.994609e-01 Male Slum\n223 9598 3.000000e-01 Female Slum\n224 9664 3.000000e-01 Male Slum\n225 9910 5.390836e-03 Female Slum\n226 9216 4.177898e-01 Female Slum\n227 9706 3.000000e-01 Female Slum\n228 5320 2.479784e-01 Male Slum\n229 5256 2.964960e-02 Male Slum\n230 9006 2.964960e-01 Male Slum\n231 6413 5.148248e+00 Female Slum\n232 8717 1.994609e-01 Male Slum\n233 9873 3.000000e-01 Male Slum\n234 6699 1.779539e+02 Male Slum\n235 8228 3.290210e+02 Female Slum\n236 6494 3.000000e-01 Male Slum\n237 9294 1.809798e+02 Female Slum\n238 7680 4.905660e-01 Male Slum\n239 7534 1.266846e-01 Male Slum\n240 9920 1.543948e+02 Female Slum\n241 9814 1.379683e+02 Female Slum\n242 5363 6.153846e+02 Male Slum\n243 5842 1.474784e+02 Male Slum\n244 7992 3.000000e-01 Female Non slum\n245 5565 1.024259e+00 Male Non slum\n246 5258 4.444056e+02 Female Non slum\n247 8200 3.000000e-01 Male Non slum\n248 8795 2.504043e+00 Female Non slum\n249 7676 3.000000e-01 Female Non slum\n250 7029 3.000000e-01 Female Non slum\n251 7535 7.816712e-02 Female Non slum\n252 5026 3.000000e-01 Female Non slum\n253 8630 5.390836e-02 Male Non slum\n254 6989 1.494236e+02 Female Non slum\n255 8454 5.972622e+01 Male Non slum\n256 9741 6.361186e-01 Female Non slum\n257 6418 1.837896e+02 Female Non slum\n258 9922 1.320809e+02 Female Slum\n259 8504 1.571906e-01 Male Slum\n260 6491 1.520231e+02 Male Slum\n261 6002 3.000000e-01 Female Slum\n262 7127 3.000000e-01 Female Slum\n263 8540 1.823699e+02 Male Slum\n264 7115 3.000000e-01 Male Slum\n265 7268 2.173913e+00 Male Slum\n266 8279 2.142202e+01 Male Slum\n267 8880 3.000000e-01 Female Slum\n268 8076 3.408027e+00 Male Slum\n269 6250 4.155963e+01 Male Slum\n270 8542 9.698997e-02 Male Slum\n271 5393 1.238532e+01 Female Slum\n272 9197 9.528926e+00 Male Slum\n273 6651 1.916185e+02 Female Slum\n274 7473 1.060201e+00 Male Slum\n275 6589 3.679104e+02 Female Slum\n276 6867 4.288991e+01 Male Slum\n277 5413 9.971098e+01 Male Slum\n278 6765 3.000000e-01 Male Slum\n279 8933 1.208092e+02 Male Slum\n280 6294 3.000000e-01 Male Non slum\n281 8688 6.688963e-03 Female Non slum\n282 8108 2.505017e+00 Female Non 
slum\n283 6926 1.481605e+00 Male Non slum\n284 5880 3.000000e-01 Female Non slum\n285 5529 5.183946e-01 Female Non slum\n286 8963 3.000000e-01 Female Non slum\n287 9594 1.872910e-01 Male Non slum\n288 8075 3.678930e-01 Female Non slum\n289 5680 3.000000e-01 Male Non slum\n290 5617 4.529851e+02 Female Non slum\n291 5080 3.169725e+01 Female Non slum\n292 7719 3.000000e-01 Male Non slum\n293 6780 4.922018e+01 Male Non slum\n294 8768 2.548507e+02 Male Non slum\n295 7031 1.661850e+02 Male Non slum\n296 7740 9.164179e+02 Male Non slum\n297 8855 3.678930e-01 Female Non slum\n298 7241 1.236994e+02 Male Non slum\n299 8156 6.705202e+01 Male Non slum\n300 7333 3.834862e+01 Male Non slum\n301 6906 1.963211e+00 Female Mixed\n302 9511 3.000000e-01 Male Mixed\n303 9336 2.474916e-01 Male Mixed\n304 6644 3.000000e-01 Female Mixed\n305 5554 2.173913e-01 Male Mixed\n306 8094 8.193980e-01 Male Mixed\n307 8836 2.444816e+00 Female Mixed\n308 7147 3.000000e-01 Male Mixed\n309 7745 1.571906e-01 Female Mixed\n310 9345 1.849711e+02 Male Mixed\n311 5606 6.119403e+02 Female Mixed\n312 9766 3.000000e-01 Female Mixed\n313 6666 4.280936e-01 Female Mixed\n314 9965 9.698997e-02 Male Mixed\n315 7927 3.678930e-02 Female Mixed\n316 6266 4.832090e+02 Male Mixed\n317 9487 1.390173e+02 Female Mixed\n318 7089 3.000000e-01 Male Mixed\n319 5731 6.555970e+02 Female Mixed\n320 7962 1.526012e+02 Female Mixed\n321 9532 3.000000e-01 Female Mixed\n322 6687 7.222222e-01 Male Non slum\n323 6570 7.724426e+01 Male Non slum\n324 5781 3.000000e-01 Male Non slum\n325 8935 6.111111e-01 Female Non slum\n326 5780 1.555556e+00 Female Non slum\n327 9029 3.055556e-01 Male Non slum\n328 5668 1.500000e+00 Male Non slum\n329 8203 1.470772e+02 Male Non slum\n330 7381 1.694444e+00 Female Non slum\n331 7734 3.138298e+02 Female Non slum\n332 7257 1.414405e+02 Female Non slum\n333 8418 1.990605e+02 Female Non slum\n334 8259 4.212766e+02 Male Non slum\n335 5587 3.000000e-01 Male Non slum\n336 8499 3.000000e-01 Male Non slum\n337 7897 6.478723e+02 Male Non slum\n338 8300 3.000000e-01 Male Non slum\n339 9691 2.222222e+00 Female Non slum\n340 5873 3.000000e-01 Male Non slum\n341 6690 2.055556e+00 Male Non slum\n342 9970 2.777778e-02 Female Non slum\n343 8978 8.333333e-02 Male Non slum\n344 6181 1.032359e+02 Female Non slum\n345 8218 1.611111e+00 Female Non slum\n346 5387 8.333333e-02 Female Non slum\n347 7850 2.333333e+00 Female Non slum\n348 7326 5.755319e+02 Male Non slum\n349 8448 1.686848e+02 Female Non slum\n350 7264 1.111111e-01 Male Non slum\n351 8361 3.000000e-01 Male Non slum\n352 7497 8.372340e+02 Female Non slum\n353 5559 3.000000e-01 Male Non slum\n354 7321 3.784504e+01 Male Non slum\n355 8372 3.819149e+02 Male Non slum\n356 5030 5.555556e-02 Female Non slum\n357 6936 3.000000e+02 Female Non slum\n358 9628 1.855950e+02 Male Non slum\n359 8558 1.944444e-01 Female Non slum\n360 7840 3.000000e-01 Male Non slum\n361 5100 5.555556e-02 Female Non slum\n362 8244 1.138889e+00 Male Non slum\n363 9115 4.254237e+01 Female Non slum\n364 5489 3.000000e-01 Male Non slum\n365 5766 3.000000e-01 Male Non slum\n366 5024 3.000000e-01 Female Non slum\n367 8599 3.000000e-01 Female Non slum\n368 8895 3.138298e+02 Female Non slum\n369 7708 1.235908e+02 Male Non slum\n370 7646 4.159574e+02 Male Non slum\n371 6640 3.009685e+01 Female Non slum\n372 8958 1.567850e+02 Female Non slum\n373 6477 1.367432e+02 Female Non slum\n374 7910 3.731235e+01 Female Non slum\n375 7829 9.164927e+01 Male Non slum\n376 7503 2.936170e+02 Female Non slum\n377 5209 8.820459e+01 Female Non 
slum\n378 6763 1.035491e+02 Male Non slum\n379 8976 7.379958e+01 Female Non slum\n380 9223 3.000000e-01 Male Non slum\n381 7692 1.718750e+02 Male Non slum\n382 7453 2.128527e+00 Male Non slum\n383 9775 1.253918e+00 Female Non slum\n384 9662 2.382445e-01 Male Non slum\n385 8733 4.639498e-01 Female Non slum\n386 5695 1.253918e-01 Male Non slum\n387 7714 1.253918e-01 Male Non slum\n388 9224 3.000000e-01 Female Non slum\n389 7635 1.000000e+00 Male Non slum\n390 7176 1.570043e+02 Male Non slum\n391 6102 4.344086e+02 Female Non slum\n392 7817 2.184953e+00 Male Non slum\n393 9719 1.507837e+00 Female Non slum\n394 9740 3.228840e-01 Female Non slum\n395 9528 4.588024e+01 Male Non slum\n396 7142 1.660560e+02 Male Non slum\n397 5689 3.000000e-01 Male Non slum\n398 5439 3.043011e+02 Male Non slum\n399 6718 2.612903e+02 Female Non slum\n400 6569 1.621767e+02 Male Non slum\n401 9444 3.228840e-01 Male Mixed\n402 6964 4.639498e-01 Female Mixed\n403 6420 2.495298e+00 Female Mixed\n404 9189 3.257053e+00 Female Mixed\n405 9368 3.793103e-01 Female Mixed\n406 6360 NA Male Mixed\n407 8196 6.896552e-02 Female Mixed\n408 8297 3.000000e-01 Male Mixed\n409 6674 1.423197e+00 Female Mixed\n410 5269 3.000000e-01 Female Mixed\n411 6599 3.000000e-01 Female Mixed\n412 7713 1.786638e+02 Male Mixed\n413 8644 3.279570e+02 Male Mixed\n414 9680 NA Female Mixed\n415 6305 1.903017e+02 Male Mixed\n416 8493 1.654095e+02 Female Mixed\n417 5297 4.639498e-01 Female Mixed\n418 7723 1.815733e+02 Male Mixed\n419 7510 1.366771e+00 Male Mixed\n420 5102 1.536050e-01 Female Mixed\n421 7816 1.306587e+01 Male Mixed\n422 5143 2.129032e+02 Female Mixed\n423 7414 1.925647e+02 Male Mixed\n424 5127 3.000000e-01 Female Non slum\n425 5830 1.028213e+00 Female Non slum\n426 8929 3.793103e-01 Female Non slum\n427 7993 8.025078e-01 Female Non slum\n428 8092 4.860215e+02 Female Non slum\n429 9750 3.000000e-01 Female Non slum\n430 6660 2.100313e-01 Male Non slum\n431 8054 2.767665e+01 Female Non slum\n432 6086 1.592476e+00 Male Non slum\n433 6878 9.717868e-02 Female Non slum\n434 8125 1.028213e+00 Female Non slum\n435 9500 3.793103e-01 Male Non slum\n436 8105 1.292026e+02 Male Non slum\n437 9593 4.425150e+01 Female Non slum\n438 5202 3.193548e+02 Female Non slum\n439 7207 1.860991e+02 Female Non slum\n440 5518 6.614420e-01 Female Non slum\n441 9820 5.203762e-01 Male Non slum\n442 6958 1.330819e+02 Male Non slum\n443 9445 1.673491e+02 Female Non slum\n444 8774 3.000000e-01 Male Non slum\n445 9614 1.117457e+02 Male Non slum\n446 9810 3.045509e+01 Female Non slum\n447 7271 3.000000e-01 Male Non slum\n448 8031 8.280255e-02 Female Non slum\n449 7232 3.000000e-01 Female Non slum\n450 7452 1.200637e+00 Female Non slum\n451 5921 1.687898e-01 Male Non slum\n452 8136 7.367273e+02 Female Non slum\n453 6605 8.280255e-02 Male Non slum\n454 5125 5.127389e-01 Male Non slum\n455 5911 1.974522e-01 Male Non slum\n456 9644 7.993631e-01 Female Non slum\n457 5760 3.000000e-01 Male Non slum\n458 7055 3.298182e+02 Male Non slum\n459 9064 9.736842e+01 Female Non slum\n460 6925 3.000000e-01 Female Non slum\n461 7757 3.000000e-01 Female Non slum\n462 8527 4.214545e+02 Female Non slum\n463 8521 3.000000e-01 Male Non slum\n464 6260 2.578182e+02 Female Non slum\n465 9578 2.261147e-01 Male Non slum\n466 9570 3.000000e-01 Female Non slum\n467 6246 1.883901e+02 Male Non slum\n468 9622 9.458204e+01 Female Non slum\n469 7661 3.000000e-01 Female Non slum\n470 9374 3.000000e-01 Male Non slum\n471 8446 7.707006e-01 Female Non slum\n472 8332 5.032727e+02 Male Non slum\n473 8008 1.544586e+00 
Female Non slum\n474 9365 1.431115e+02 Female Non slum\n475 9819 3.000000e-01 Male Non slum\n476 5173 1.458599e+00 Male Non slum\n477 6722 1.247678e+02 Female Non slum\n478 7668 NA Female Non slum\n479 8980 4.334545e+02 Male Non slum\n480 5204 3.000000e-01 Female Non slum\n481 6412 6.156364e+02 Female Non slum\n482 6404 9.574303e+01 Male Non slum\n483 5693 1.928019e+02 Male Non slum\n484 8100 1.888545e+02 Male Non slum\n485 9760 1.598297e+02 Female Non slum\n486 6377 5.127389e-01 Male Non slum\n487 6012 1.171053e+02 Female Non slum\n488 6224 NA Male Non slum\n489 6561 2.547771e-02 Female Non slum\n490 8475 1.707430e+02 Female Non slum\n491 6629 3.000000e-01 Male Non slum\n492 7200 1.869969e+02 Male Non slum\n493 9453 4.731481e+01 Male Non slum\n494 6449 1.988390e+02 Female Non slum\n495 9452 3.000000e-01 Male Non slum\n496 7162 8.808050e+01 Male Non slum\n497 8962 2.003185e+00 Female Non slum\n498 7328 3.000000e-01 Male Non slum\n499 9097 3.509259e+01 Female Non slum\n500 9131 9.365325e+01 Female Non slum\n501 7280 3.000000e-01 Male Non slum\n502 5783 3.736111e+01 Female Non slum\n503 9895 1.674923e+02 Female Non slum\n504 7986 8.808050e+01 Male Non slum\n505 7146 1.656347e+02 Female Non slum\n506 8671 3.722222e+01 Female Non slum\n507 5273 6.756364e+02 Female Non slum\n508 5063 3.000000e-01 Male Non slum\n509 6729 1.698142e+02 Male Non slum\n510 9085 1.628483e+02 Female Non slum\n511 9929 5.985130e-01 Male Non slum\n512 8479 1.903346e+00 Female Non slum\n513 7395 3.000000e-01 Male Non slum\n514 6374 3.000000e-01 Male Non slum\n515 7878 8.996283e-01 Male Non slum\n516 9603 3.977695e-01 Female Non slum\n517 7994 3.000000e-01 Male Non slum\n518 5277 3.000000e-01 Male Non slum\n519 5054 3.000000e-01 Male Non slum\n520 5440 3.000000e-01 Female Non slum\n521 6551 7.446809e+02 Male Non slum\n522 5281 6.095745e+02 Female Non slum\n523 7145 1.427445e+02 Male Non slum\n524 5275 3.000000e-01 Female Non slum\n525 9542 2.973978e-02 Male Non slum\n526 9371 3.977695e-01 Female Non slum\n527 5598 4.095745e+02 Female Non slum\n528 7148 4.595745e+02 Male Non slum\n529 5624 3.000000e-01 Female Non slum\n530 6998 1.976341e+02 Female Non slum\n531 9286 3.776596e+02 Female Non slum\n532 7589 1.777603e+02 Female Non slum\n533 7095 4.312268e-01 Male Mixed\n534 5455 6.765957e+02 Female Mixed\n535 6257 7.978723e+02 Male Mixed\n536 8627 9.665427e-02 Male Mixed\n537 9786 1.879338e+02 Male Mixed\n538 8176 4.358670e+01 Female Mixed\n539 9198 3.000000e-01 Female Mixed\n540 6586 3.000000e-01 Male Mixed\n541 8850 2.638955e+01 Male Mixed\n542 9560 3.180523e+01 Female Mixed\n543 7144 1.746845e+02 Male Mixed\n544 8230 1.876972e+02 Male Mixed\n545 7559 1.044164e+02 Male Mixed\n546 5312 1.202681e+02 Male Mixed\n547 6560 1.630915e+02 Female Mixed\n548 6091 1.276025e+02 Female Mixed\n549 5578 8.880126e+01 Male Mixed\n550 5837 3.563830e+02 Male Mixed\n551 8347 2.212766e+02 Male Mixed\n552 6453 1.969121e+01 Female Mixed\n553 5758 3.755319e+02 Female Mixed\n554 5569 1.214511e+02 Male Non slum\n555 8766 1.034700e+02 Female Non slum\n556 8002 3.000000e-01 Female Non slum\n557 7839 3.643123e-01 Female Non slum\n558 5434 6.319703e-02 Female Non slum\n559 7636 3.000000e-01 Male Non slum\n560 6164 3.000000e-01 Male Non slum\n561 9243 3.000000e-01 Female Non slum\n562 5872 3.000000e-01 Female Non slum\n563 8079 3.000000e-01 Male Non slum\n564 9762 3.000000e-01 Male Non slum\n565 9476 3.000000e-01 Female Non slum\n566 8345 3.000000e-01 Male Non slum\n567 8128 1.664038e+02 Female Non slum\n568 7956 2.946809e+02 Female Non slum\n569 8677 
4.391924e+01 Male Non slum\n570 5881 1.874606e+02 Female Non slum\n571 7498 1.143533e+02 Male Non slum\n572 8134 1.600158e+02 Male Non slum\n573 7748 1.635688e-01 Male Non slum\n574 7990 8.809148e+01 Female Non slum\n575 6184 1.337539e+02 Male Non slum\n576 6339 1.985804e+02 Male Non slum\n577 5113 1.578864e+02 Female Non slum\n578 9449 3.000000e-01 Female Non slum\n579 8110 3.000000e-01 Male Non slum\n580 9307 1.953642e-01 Female Non slum\n581 5555 1.119205e+00 Male Non slum\n582 9152 2.523636e+02 Male Non slum\n583 7969 3.000000e-01 Male Non slum\n584 6116 4.844371e+00 Female Non slum\n585 8294 3.000000e-01 Male Non slum\n586 8938 1.492553e+02 Female Non slum\n587 9539 1.993617e+02 Male Non slum\n588 9470 2.847682e-01 Female Non slum\n589 6677 3.145695e-01 Female Non slum\n590 8752 3.000000e-01 Male Non slum\n591 5574 3.406429e+01 Female Non slum\n592 5989 6.595745e+01 Male Non slum\n593 9813 3.000000e-01 Male Non slum\n594 6150 2.174545e+02 Male Non slum\n595 5730 NA Female Non slum\n596 8038 5.957447e+01 Female Non slum\n597 5964 7.236364e+02 Female Non slum\n598 9043 3.000000e-01 Male Non slum\n599 5095 3.000000e-01 Female Non slum\n600 8922 3.000000e-01 Male Non slum\n601 5469 2.676364e+02 Male Non slum\n602 6726 1.891489e+02 Male Non slum\n603 7495 3.036364e+02 Female Non slum\n604 8159 3.000000e-01 Female Non slum\n605 6709 3.000000e-01 Male Non slum\n606 5855 3.000000e-01 Male Non slum\n607 6058 3.000000e-01 Female Non slum\n608 7292 3.000000e-01 Male Non slum\n609 6437 1.447020e+00 Male Non slum\n610 9326 2.130909e+02 Female Non slum\n611 8222 1.357616e-01 Female Non slum\n612 6789 3.000000e-01 Female Non slum\n613 6348 3.000000e-01 Female Non slum\n614 5958 5.534545e+02 Female Non slum\n615 9211 1.891489e+02 Female Non slum\n616 9450 7.202128e+01 Female Non slum\n617 6540 3.250287e+01 Male Non slum\n618 8796 1.655629e-02 Male Non slum\n619 7971 3.123636e+02 Male Non slum\n620 7549 3.000000e-01 Male Non slum\n621 9799 7.138298e+01 Male Non slum\n622 7013 3.000000e-01 Female Non slum\n623 5599 6.946809e+01 Female Non slum\n624 8601 4.012629e+01 Male Non slum\n625 7383 1.629787e+02 Female Non slum\n626 6656 1.508511e+02 Female Non slum\n627 5641 1.655629e-02 Male Non slum\n628 6222 3.000000e-01 Male Non slum\n629 7674 4.635762e-02 Male Non slum\n630 5293 3.000000e-01 Female Non slum\n631 6715 3.000000e-01 Female Non slum\n632 7057 3.000000e-01 Male Non slum\n633 7072 1.942553e+02 Male Non slum\n634 6380 3.690909e+02 Male Non slum\n635 6762 3.000000e-01 Female Non slum\n636 5799 3.000000e-01 Female Non slum\n637 6681 2.847682e+00 Male Non slum\n638 8755 1.435106e+02 Female Non slum\n639 6896 3.000000e-01 Male Non slum\n640 5945 4.752009e+01 Female Non slum\n641 5035 2.621125e+01 Female Non slum\n642 6776 1.055319e+02 Female Non slum\n643 7863 3.000000e-01 Female Non slum\n644 9836 1.149007e+00 Male Non slum\n645 7860 2.927273e+02 Female Non slum\n646 5248 3.000000e-01 Female Non slum\n647 5677 3.000000e-01 Female Non slum\n648 9576 4.839265e+01 Male Non slum\n649 5824 3.000000e-01 Male Non slum\n650 9184 3.000000e-01 Female Non slum\n651 5397 2.251656e-01 Female Non slum\n\n#Note df$slum <- NULL would also work\n\nWe can also grab the age column using the $ operator.\n\ndf$age\n\n [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01\n [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00\n [11] 4.574007e+00 1.583904e+02 NA 1.065068e+02 1.113870e+02\n [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02\n [21] 4.244656e+01 1.193493e+02 
3.000000e-01 3.000000e-01 9.025271e-01\n [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01\n [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01\n [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01\n [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00\n [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01\n [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01 NA\n [56] 2.104693e+00 NA 3.826715e-01 3.926366e+01 1.129964e+00\n [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01\n [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02\n [71] 1.642979e+02 NA 1.542808e+02 6.033058e-01 2.809917e-01\n [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01\n [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02\n [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01\n [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00\n [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01\n[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01\n[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02\n[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01\n[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02\n[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01\n[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01\n[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02\n[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02\n[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00\n[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02\n[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02\n[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02\n[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01\n[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02\n[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02\n[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01\n[181] 8.767123e-02 NA 6.794521e-01 5.808219e-01 1.369863e-01\n[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02\n[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02\n[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01\n[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01\n[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01\n[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02\n[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02\n[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03\n[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01\n[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02\n[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02\n[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00\n[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01\n[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01\n[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02\n[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00\n[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02\n[271] 1.238532e+01 
9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02\n[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01\n[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01\n[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02\n[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02\n[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01\n[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01\n[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02\n[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02\n[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02\n[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01\n[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00\n[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01\n[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01\n[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00\n[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01\n[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02\n[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01\n[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01\n[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02\n[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01\n[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01\n[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01\n[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02\n[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01\n[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02\n[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01\n[406] NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01\n[411] 3.000000e-01 1.786638e+02 3.279570e+02 NA 1.903017e+02\n[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01\n[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00\n[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01\n[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01\n[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01\n[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02\n[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00\n[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01\n[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01\n[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01\n[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01\n[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01\n[476] 1.458599e+00 1.247678e+02 NA 4.334545e+02 3.000000e-01\n[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02\n[486] 5.127389e-01 1.171053e+02 NA 2.547771e-02 1.707430e+02\n[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01\n[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01\n[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02\n[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02\n[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01\n[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[521] 
7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02\n[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02\n[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02\n[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01\n[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02\n[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02\n[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02\n[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01\n[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02\n[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02\n[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01\n[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01\n[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01\n[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02 NA\n[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01\n[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01\n[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02\n[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02\n[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01\n[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02\n[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01\n[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01\n[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01\n[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02\n[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01\n[651] 2.251656e-01\n\n\nOr we can subset by rows and pull the 100th observation/row.\n\ndf[100,] \n\n IgG_concentration age age gender slum\n100 8122 0.1818182 5 Female Non slum\n\n\nor maybe the age of the 100th observation/row.\n\ndf[100,\"age\"] \n\n[1] 0.1818182" + "objectID": "schedule.html#day-02-tuesday", + "href": "schedule.html#day-02-tuesday", + "title": "Course Schedule", + "section": "Day 02 – Tuesday", + "text": "Day 02 – Tuesday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 09:00 am\nexercise review and questions / catchup\n\n\n09:00 am - 09:15 am\nModule 8\n\n\n09:15 am - 10:00 am\nExercise 3 work time\n\n\n10:00 am - 10:30 am\nCoffee break\n\n\n10:30 am - 10:45 am\nExercise review\n\n\n10:45 am - 11:15 am\nModule 9\n\n\n11:15 am - 12:00 pm\nData analysis walkthrough\n\n\n12:00 pm - 01:30 pm\nLunch (2nd floor lobby); Lunch and Learn!\n\n\n01:30 pm - 02:00 pm\nExercise 4\n\n\n02:00 pm - 02:30 pm\nExercise 4 review\n\n\n02:30 pm - 03:00 pm\nModule 10\n\n\n03:00 pm - 03:30 pm\nCoffee break\n\n\n03:30 pm - 04:00 pm\nExercise 5\n\n\n04:00 pm - 04:30 pm\nReview exercise 5\n\n\n04:30 pm - 05:00 pm\nModule 11", + "crumbs": [ + "Course Schedule" + ] }, { - "objectID": "modules/Module06-DataSubset.html#logical-operators", - "href": "modules/Module06-DataSubset.html#logical-operators", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Logical operators", - "text": "Logical operators\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE\n\n\n\noperator\noperator option\ndescription\n\n\n\n\n<\n%l%\nless than\n\n\n<=\n%le%\nless than or equal to\n\n\n>\n%g%\ngreater than\n\n\n>=\n%ge%\ngreater 
than or equal to\n\n\n==\n\nequal to\n\n\n!=\n\nnot equal to\n\n\nx&y\n\nx and y\n\n\nx|y\n\nx or y\n\n\n%in%\n\nmatch\n\n\n%!in%\n\ndo not match" + "objectID": "schedule.html#day-03-wednesday", + "href": "schedule.html#day-03-wednesday", + "title": "Course Schedule", + "section": "Day 03 – Wednesday", + "text": "Day 03 – Wednesday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 10:00 am\ntbd; Modules 12 (Amy) and 13 (Zane)\n\n\n10:00 am - 10:15 am\nCoffee break\n\n\n10:30 am - 12:00 pm\ntbd; Module 14, practice, questions, review", + "crumbs": [ + "Course Schedule" + ] }, { - "objectID": "modules/Module06-DataSubset.html#logical-operators-examples", - "href": "modules/Module06-DataSubset.html#logical-operators-examples", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Logical operators examples", - "text": "Logical operators examples\nLet’s practice. First, here is a reminder of what the number.object contains.\n\nnumber.object\n\n[1] 3\n\n\nNow, we will use logical operators to evaluate the object.\n\nnumber.object<4\n\n[1] TRUE\n\nnumber.object>=3\n\n[1] TRUE\n\nnumber.object!=5\n\n[1] TRUE\n\nnumber.object %in% c(6,7,2)\n\n[1] FALSE\n\n\nWe can use any of these logical operators to subset our data.\n\n# Overall mean\nmean(df$IgG_concentration, na.rm=TRUE)\n\n[1] 87.36826\n\n# Mean for all children who are not age 3\nmean(df$IgG_concentration[df$age != 3], na.rm=TRUE)\n\n[1] 90.32824\n\n# Mean for all children who are between 0 and 3 or between 7 and 10 years old\nmean(df$IgG_concentration[df$age %in% c(0:3, 7:10)], na.rm=TRUE)\n\n[1] 74.0914" + "objectID": "modules/Module11-RMarkdown.html#learning-goals", + "href": "modules/Module11-RMarkdown.html#learning-goals", + "title": "Module 11: Literate Programming", + "section": "Learning goals", + "text": "Learning goals\n\nDefine literate programming\nImplement literate programming in R using knitr and either R Markdown or Quarto\nInclude plots, tables, and references along with your code in a written report.\nLocate additional resources for literate programming with R Markdown or Quarto.", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-rename-columns", - "href": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-rename-columns", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Using indexing and logical operators to rename columns", - "text": "Using indexing and logical operators to rename columns\n\nWe can assign the column names from data frame df to an object cn, then we can modify cn directly using indexing and logical operators, finally we reassign the column names, cn, back to the data frame df:\n\n\ncn <- colnames(df)\ncn\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\ncn==\"IgG_concentration\"\n\n[1] FALSE TRUE FALSE FALSE FALSE\n\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_mIU\" #rename cn to \"IgG_concentration_mIU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration_mIU\" \"age\" \n[4] \"gender\" \"slum\" \n\n\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\ncolnames(df)[colnames(df)==\"IgG_concentration_mIU\"] <- \"IgG_concentration\" #reset" + "objectID": "modules/Module11-RMarkdown.html#what-is-literate-programming", + "href": 
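To make the logical-operator subsetting above concrete, here is a minimal sketch using a small made-up data frame (the kids object and its columns are hypothetical, not the workshop dataset):

kids <- data.frame(
  age = c(2, 5, 9, 12, NA),
  gender = c("Male", "Female", "Female", "Male", "Female")
)
kids$age <= 5                              # element-wise test: TRUE / FALSE / NA
kids[kids$age <= 5 & !is.na(kids$age), ]   # keep rows with age <= 5, dropping the NA row
kids$age %in% c(2, 12)                     # %in% never returns NA, so it is handy for matching
mean(kids$age[kids$gender == "Female"], na.rm = TRUE)   # summary on a logical subset

The same pattern, a logical vector placed before the comma, is what the df$age examples in this module rely on.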
"modules/Module11-RMarkdown.html#what-is-literate-programming", + "title": "Module 11: Literate Programming", + "section": "What is literate programming?", + "text": "What is literate programming?\n\nProgramming files contain code along with text, code results, and other supporting information.\nInstead of having separate code and text, that you glue together in Word, we have one document which combines code and text.", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-subset-data", - "href": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-subset-data", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Using indexing and logical operators to subset data", - "text": "Using indexing and logical operators to subset data\nIn this example, we subset by rows and pull only observations with an age of less than or equal to 10 and then saved the subset data to df_lt10. Note that the logical operators df$age<=10 is before the comma because I want to subset by rows (the first dimension).\n\ndf_lte10 <- df[df$age<=10, ]\n\nLets check that my subsets worked using the summary() function.\n\nsummary(df_lte10$age)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.0 3.0 4.0 4.8 7.0 10.0 9 \n\n\n\nIn the next example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.\n\ndf_lte5_gt10 <- df[df$age<=5 | df$age>10, ]\n\nLets check that my subsets worked using the summary() function.\n\nsummary(df_lte5_gt10$age)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.00 2.50 4.00 6.08 11.00 15.00 9" + "objectID": "modules/Module11-RMarkdown.html#what-is-literate-programming-1", + "href": "modules/Module11-RMarkdown.html#what-is-literate-programming-1", + "title": "Module 11: Literate Programming", + "section": "What is literate programming?", + "text": "What is literate programming?\n\nR markdown example, from https://rmarkdown.rstudio.com/authoring_quick_tour.html", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#missing-values", - "href": "modules/Module06-DataSubset.html#missing-values", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Missing values", - "text": "Missing values\nMissing data need to be carefully described and dealt with in data analysis. 
Understanding the different types of missing data and how you can identify them, is the first step to data cleaning.\nTypes of “missing” values:\n\nNA - Not Applicable general missing data\nNaN - stands for “Not a Number”, happens when you do 0/0.\nInf and -Inf - Infinity, happens when you divide a positive number (or negative number) by 0.\nblank space - sometimes when data is read it, there is a blank space left\nan empty string (e.g., \"\")\nNULL- undefined value that represents something that does not exist" + "objectID": "modules/Module11-RMarkdown.html#literate-programming-examples", + "href": "modules/Module11-RMarkdown.html#literate-programming-examples", + "title": "Module 11: Literate Programming", + "section": "Literate programming examples", + "text": "Literate programming examples\n\nWriting a research paper with R Markdown: https://github.com/wzbillings/Patient-vs-Clinician-Symptom-Reports\nWriting a book with R Markdown: https://github.com/moderndive/ModernDive_book\nPersonal websites (like my tutorial!): https://jadeyryan.com/blog/2024-02-19_beginner-quarto-netlify/\nOther examples: https://bookdown.org/yihui/rmarkdown/basics-examples.html", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#more-logical-operators", - "href": "modules/Module06-DataSubset.html#more-logical-operators", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "More Logical Operators", - "text": "More Logical Operators\n\n\n\noperator\noperator option\ndescription\n\n\n\n\nis.na\n\nis NAN or NA\n\n\nis.nan\n\nis NAN\n\n\n!is.na\n\nis not NAN or NA\n\n\n!is.nan\n\nis not NAN\n\n\nis.infinite\n\nis infinite\n\n\nany\n\nare any TRUE" + "objectID": "modules/Module11-RMarkdown.html#r-markdown-and-quarto", + "href": "modules/Module11-RMarkdown.html#r-markdown-and-quarto", + "title": "Module 11: Literate Programming", + "section": "R Markdown and Quarto", + "text": "R Markdown and Quarto\n\nR Markdown and Quarto are both implementations of literate programming using R, with the knitr package for the backend. Both are supported by RStudio.\nTo use R Markdown, you need to install.packages(\"rmarkdown\").\nQuarto comes with new versions of RStudio, but you can also install the latest version from the Quarto website.\nR Markdown is older and now very commonly used. Quarto is newer and so has many fancy new features, but more bugs that are constantly being found and fixed.\nIn this class, we will use R Markdown. 
But if you decide to use quarto, 90% of your knowledge will transfer since they are very similar.\n\nAdvantages of R Markdown: more online resources, most common bugs have been fixed over the years, many people are familiar with it.\nAdvantages of Quarto: supports other programming languages like Python and Julia, uses more modern syntax, less slapped together overall.", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#more-logical-operators-examples", - "href": "modules/Module06-DataSubset.html#more-logical-operators-examples", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "More logical operators examples", - "text": "More logical operators examples\n\ntest <- c(0,NA, -1)/0\ntest\n\n[1] NaN NA -Inf\n\nis.na(test)\n\n[1] TRUE TRUE FALSE\n\nis.nan(test)\n\n[1] TRUE FALSE FALSE\n\nis.infinite(test)\n\n[1] FALSE FALSE TRUE" + "objectID": "modules/Module11-RMarkdown.html#a-few-sticking-points", + "href": "modules/Module11-RMarkdown.html#a-few-sticking-points", + "title": "Module 11: Literate Programming", + "section": "A few sticking points", + "text": "A few sticking points\n\nKnitting to html format is really easy, but most scientist don’t like html format for some reason. If you want to knit to pdf, you should install the package tinytex and read the intro.\nIf you want to knit to word (what many journals in epidemiology require), you need to have Word installed on your computer. Note that with word, you are a bit more restricted in your formatting options, so if weird things happen you’ll have to try some other options.\nYou maybe noticed in the tutorial that I used the here::here() function for all of my file paths. This is because R Markdown and Quarto files use a different working directory from the R Project. Using here::here() translates relative paths into absolute paths based on your R Project, so it makes sure your R Markdown files can always find the right path!", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#more-logical-operators-examples-1", - "href": "modules/Module06-DataSubset.html#more-logical-operators-examples-1", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "More logical operators examples", - "text": "More logical operators examples\nany(is.na(x)) means do we have any NA’s in the object x?\n\nany(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE\n\n[1] TRUE\n\nany(is.na(df$slum)) # are there any NAs- NO/FALSE\n\n[1] FALSE\n\n\nwhich(is.na(x)) means which of the elements in object x are NA’s?\n\nwhich(is.na(df$IgG_concentration)) \n\n [1] 13 55 57 72 182 406 414 478 488 595\n\nwhich(is.na(df$slum)) \n\ninteger(0)" + "objectID": "modules/Module11-RMarkdown.html#you-try-it", + "href": "modules/Module11-RMarkdown.html#you-try-it", + "title": "Module 11: Literate Programming", + "section": "You try it!", + "text": "You try it!\n\nCreate an R Markdown document. Write about either the measles or diphtheria example data sets, and include a figure and a table.\nBONUS EXERCISE: read the intro of the bookdown book, and create a bookdown document. Modify your writeup to have a few references with a bibliography, and cross-references with your figures and tables.\nBONUS: Try to structure your document like a report, with a section stating the questions you want to answer (intro), a section with your R code and results, and a section with your interpretations (discussion). 
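For the knitting-format issues mentioned above, the output format can also be chosen at render time; a minimal sketch (the file name is hypothetical):

rmarkdown::render("my_report.Rmd", output_format = "html_document")  # quickest format to get working
rmarkdown::render("my_report.Rmd", output_format = "word_document")  # .docx output; see the note above about Word
rmarkdown::render("my_report.Rmd", output_format = "pdf_document")   # needs a LaTeX install such as tinytex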
This is a very open ended exercise but by now I believe you can do it, and you’ll have a nice document you can put on your portfolio or show employers!", + "crumbs": [ + "Day 2", + "Module 11: Literate Programming" + ] }, { - "objectID": "modules/Module06-DataSubset.html#subset-function", - "href": "modules/Module06-DataSubset.html#subset-function", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "subset() function", - "text": "subset() function\nThe Base R subset() function is a slightly easier way to select variables and observations.\n\n?subset\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\nSubsetting Vectors, Matrices and Data Frames\nDescription:\n Return subsets of vectors, matrices or data frames which meet\n conditions.\nUsage:\n subset(x, ...)\n \n ## Default S3 method:\n subset(x, subset, ...)\n \n ## S3 method for class 'matrix'\n subset(x, subset, select, drop = FALSE, ...)\n \n ## S3 method for class 'data.frame'\n subset(x, subset, select, drop = FALSE, ...)\n \nArguments:\n x: object to be subsetted.\nsubset: logical expression indicating elements or rows to keep: missing values are taken as false.\nselect: expression, indicating columns to select from a data frame.\ndrop: passed on to '[' indexing operator.\n\n ...: further arguments to be passed to or from other methods.\nDetails:\n This is a generic function, with methods supplied for matrices,\n data frames and vectors (including lists). Packages and users can\n add further methods.\n\n For ordinary vectors, the result is simply 'x[subset &\n !is.na(subset)]'.\n\n For data frames, the 'subset' argument works on the rows. Note\n that 'subset' will be evaluated in the data frame, so columns can\n be referred to (by name) as variables in the expression (see the\n examples).\n\n The 'select' argument exists only for the methods for data frames\n and matrices. It works by first replacing column names in the\n selection expression with the corresponding column numbers in the\n data frame and then using the resulting integer vector to index\n the columns. This allows the use of the standard indexing\n conventions so that for example ranges of columns can be specified\n easily, or single columns can be dropped (see the examples).\n\n The 'drop' argument is passed on to the indexing method for\n matrices and data frames: note that the default for matrices is\n different from that for indexing.\n\n Factors may have empty levels after subsetting; unused levels are\n not automatically removed. 
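For the here::here() point above, a small sketch of how it is typically used inside an R Markdown file (the data/ folder and file name are made up for illustration):

# install.packages("here")                     # once per machine
library(here)
here()                                         # shows the project root that paths are built from
dat <- read.csv(here("data", "measles.csv"))   # builds <project root>/data/measles.csv
head(dat)                                      # quick check that the file was found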
See 'droplevels' for a way to drop all\n unused levels from a data frame.\nValue:\n An object similar to 'x' contain just the selected elements (for a\n vector), rows and columns (for a matrix or data frame), and so on.\nWarning:\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n functions like '[', and in particular the non-standard evaluation\n of argument 'subset' can have unanticipated consequences.\nAuthor(s):\n Peter Dalgaard and Brian Ripley\nSee Also:\n '[', 'transform' 'droplevels'\nExamples:\n subset(airquality, Temp > 80, select = c(Ozone, Temp))\n subset(airquality, Day == 1, select = -Temp)\n subset(airquality, select = Ozone:Wind)\n \n with(airquality, subset(Ozone, Temp > 80))\n \n ## sometimes requiring a logical 'subset' argument is a nuisance\n nm <- rownames(state.x77)\n start_with_M <- nm %in% grep(\"^M\", nm, value = TRUE)\n subset(state.x77, start_with_M, Illiteracy:Murder)\n # but in recent versions of R this can simply be\n subset(state.x77, grepl(\"^M\", nm), Illiteracy:Murder)" + "objectID": "modules/Module03-WorkingDirectories.html#learning-objectives", + "href": "modules/Module03-WorkingDirectories.html#learning-objectives", + "title": "Module 3: Working Directories", + "section": "Learning Objectives", + "text": "Learning Objectives\nAfter module 3, you should be able to…\n\nUnderstand your own systems’ file structure and the purpose of the working directory\nDetermine the working directory\nChange the working directory", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#subsetting-use-the-subset-function", - "href": "modules/Module06-DataSubset.html#subsetting-use-the-subset-function", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Subsetting use the subset() function", - "text": "Subsetting use the subset() function\nHere are a few examples using the subset() function\n\ndf_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))\ndf_lt5_f <- subset(df, df$age<=5 & gender==\"Female\", select=c(IgG_concentration, slum))" + "objectID": "modules/Module03-WorkingDirectories.html#file-structure", + "href": "modules/Module03-WorkingDirectories.html#file-structure", + "title": "Module 3: Working Directories", + "section": "File Structure", + "text": "File Structure\nThe internal file structure of the computer is completely nested!\n\nknitr::include_graphics(here::here(\"images\", \"presentation4.webp\"))\n\n\nComputer scientists call this the “file tree”.", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#subset-function-vs-logical-operators", - "href": "modules/Module06-DataSubset.html#subset-function-vs-logical-operators", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "subset() function vs logical operators", - "text": "subset() function vs logical operators\nsubset() automatically removes NAs, which is a different behavior from doing logical operations on NAs.\n\nsummary(df_lte10$age) #created with indexing\n\n\n\n\nMin.\n1st Qu.\nMedian\nMean\n3rd Qu.\nMax.\nNA’s\n\n\n\n\n1\n3\n4\n4.8\n7\n10\n9\n\n\n\n\nsummary(df_lte10_v2$age) #created with the subset function\n\n\n\n\nMin.\n1st Qu.\nMedian\nMean\n3rd Qu.\nMax.\n\n\n\n\n1\n3\n4\n4.8\n7\n10\n\n\n\n\n\nWe can also see this by looking at the number or rows in each dataset.\n\nnrow(df_lte10)\n\n[1] 504\n\nnrow(df_lte10_v2)\n\n[1] 495" + 
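To complement the help page, here is a minimal sketch of subset() next to bracket indexing, using a tiny made-up data frame (not the workshop data), including how the two treat missing values differently:

dat <- data.frame(age = c(1, NA, 8, 12),
                  gender = c("Male", "Female", "Female", "Male"))
subset(dat, age <= 8)                    # rows with age <= 8; the NA age is silently dropped
subset(dat, age <= 8, select = gender)   # keep matching rows, gender column only
dat[dat$age <= 8, ]                      # bracket indexing instead returns an all-NA row for the NA age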
"objectID": "modules/Module03-WorkingDirectories.html#working-directory-basic-term", + "href": "modules/Module03-WorkingDirectories.html#working-directory-basic-term", + "title": "Module 3: Working Directories", + "section": "Working Directory – Basic term", + "text": "Working Directory – Basic term\n\nR “looks” for files on your computer relative to the “working” directory\nFor example, if you want to load data into R or save a figure, you will need to tell R where to look for or store the file\nMany people recommend not setting a directory in the scripts, rather assume you’re in the directory the script is in", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#summary", - "href": "modules/Module06-DataSubset.html#summary", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Summary", - "text": "Summary\n\ncolnames(), str() and summary()functions from Base R are functions to assess the data type and some summary statistics\nThere are three basic indexing syntax: [, [[ and $\nIndexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing\nThere are 7 “types” of missing values, the most common being “NA”\nLogical operators meant to determine missing values are very helpful for data cleaning\nThe Base R subset() function is a slightly easier way to select variables and observations." + "objectID": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory", + "href": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory", + "title": "Module 3: Working Directories", + "section": "Understanding the working directory", + "text": "Understanding the working directory", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#acknowledgements", - "href": "modules/Module06-DataSubset.html#acknowledgements", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University\n“Indexing” CRAN Project\n“Logical operators” CRAN Project" + "objectID": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory-1", + "href": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory-1", + "title": "Module 3: Working Directories", + "section": "Understanding the working directory", + "text": "Understanding the working directory", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-columns", - "href": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-columns", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Using indexing to subset by columns", - "text": "Using indexing to subset by columns\nWe can also subset data frames and matrices (2-dimensional objects) using the bracket [ row , column ]. We can subset by columns and pull the x column using the index of the column or the column name. 
Leaving either row or column dimension blank means to select all of them.\nFor example, here I am pulling the 3rd column, which has the variable name age, for all of rows.\n\ndf[ , \"age\"] #same as df[ , 3]\n\nWe can select multiple columns using multiple column names, again this is selecting these variables for all of the rows.\n\ndf[, c(\"age\", \"gender\")] #same as df[ , c(3,4)]\n\n age gender\n1 2 Female\n2 4 Female\n3 4 Male\n4 4 Male\n5 1 Male\n6 4 Male\n7 4 Female\n8 NA Female\n9 4 Male\n10 2 Male\n11 3 Male\n12 15 Female\n13 8 Male\n14 12 Male\n15 15 Male\n16 9 Male\n17 8 Male\n18 7 Female\n19 11 Female\n20 10 Male\n21 8 Male\n22 11 Female\n23 2 Male\n24 2 Female\n25 3 Female\n26 5 Male\n27 1 Male\n28 3 Female\n29 5 Female\n30 5 Female\n31 3 Male\n32 1 Male\n33 4 Female\n34 3 Male\n35 2 Female\n36 11 Female\n37 7 Male\n38 8 Male\n39 6 Male\n40 6 Male\n41 11 Female\n42 10 Male\n43 6 Female\n44 12 Male\n45 11 Male\n46 10 Male\n47 11 Male\n48 13 Female\n49 3 Female\n50 4 Female\n51 3 Male\n52 1 Male\n53 2 Female\n54 2 Female\n55 4 Male\n56 2 Male\n57 2 Male\n58 3 Female\n59 3 Female\n60 4 Male\n61 1 Female\n62 13 Female\n63 13 Female\n64 6 Male\n65 13 Male\n66 5 Female\n67 13 Female\n68 14 Male\n69 13 Male\n70 8 Female\n71 7 Male\n72 6 Female\n73 13 Male\n74 3 Male\n75 4 Male\n76 2 Male\n77 NA Male\n78 5 Female\n79 3 Male\n80 3 Male\n81 14 Male\n82 11 Female\n83 7 Female\n84 7 Male\n85 11 Female\n86 9 Female\n87 14 Male\n88 13 Female\n89 1 Male\n90 1 Male\n91 4 Male\n92 1 Female\n93 2 Male\n94 3 Female\n95 2 Male\n96 1 Male\n97 2 Male\n98 2 Female\n99 4 Female\n100 5 Female\n101 5 Male\n102 6 Female\n103 14 Female\n104 14 Male\n105 10 Male\n106 6 Female\n107 6 Male\n108 8 Male\n109 6 Female\n110 12 Female\n111 12 Male\n112 14 Female\n113 15 Male\n114 12 Female\n115 4 Female\n116 4 Male\n117 3 Female\n118 NA Male\n119 2 Female\n120 3 Male\n121 NA Female\n122 3 Female\n123 3 Male\n124 2 Female\n125 4 Female\n126 10 Female\n127 7 Female\n128 11 Female\n129 6 Female\n130 11 Male\n131 9 Male\n132 6 Male\n133 13 Female\n134 10 Female\n135 6 Female\n136 11 Female\n137 7 Male\n138 6 Female\n139 4 Female\n140 4 Female\n141 4 Male\n142 4 Female\n143 4 Male\n144 4 Male\n145 3 Male\n146 4 Female\n147 3 Male\n148 3 Male\n149 13 Female\n150 7 Female\n151 10 Male\n152 6 Male\n153 10 Female\n154 12 Female\n155 10 Male\n156 10 Male\n157 13 Male\n158 13 Female\n159 5 Female\n160 3 Female\n161 4 Male\n162 1 Male\n163 3 Female\n164 4 Male\n165 4 Male\n166 1 Male\n167 5 Female\n168 6 Female\n169 14 Female\n170 6 Male\n171 13 Female\n172 9 Male\n173 11 Male\n174 10 Male\n175 5 Female\n176 14 Male\n177 7 Male\n178 10 Male\n179 6 Male\n180 5 Male\n181 3 Female\n182 4 Male\n183 2 Female\n184 3 Male\n185 3 Female\n186 2 Female\n187 3 Male\n188 5 Female\n189 2 Male\n190 3 Female\n191 14 Female\n192 9 Female\n193 14 Female\n194 9 Female\n195 8 Female\n196 7 Male\n197 13 Male\n198 8 Female\n199 6 Male\n200 12 Female\n201 14 Female\n202 15 Female\n203 2 Female\n204 4 Female\n205 3 Male\n206 3 Female\n207 3 Male\n208 4 Female\n209 3 Male\n210 14 Female\n211 8 Male\n212 7 Male\n213 14 Female\n214 13 Female\n215 13 Female\n216 7 Male\n217 8 Female\n218 10 Female\n219 9 Male\n220 9 Female\n221 3 Female\n222 4 Male\n223 4 Female\n224 4 Male\n225 2 Female\n226 1 Female\n227 3 Female\n228 2 Male\n229 3 Male\n230 5 Male\n231 2 Female\n232 2 Male\n233 9 Male\n234 13 Male\n235 10 Female\n236 6 Male\n237 13 Female\n238 11 Male\n239 10 Male\n240 8 Female\n241 9 Female\n242 10 Male\n243 14 Male\n244 1 Female\n245 2 Male\n246 
3 Female\n247 2 Male\n248 3 Female\n249 2 Female\n250 3 Female\n251 5 Female\n252 10 Female\n253 7 Male\n254 13 Female\n255 15 Male\n256 11 Female\n257 10 Female\n258 3 Female\n259 2 Male\n260 3 Male\n261 3 Female\n262 3 Female\n263 4 Male\n264 3 Male\n265 2 Male\n266 4 Male\n267 2 Female\n268 8 Male\n269 11 Male\n270 6 Male\n271 14 Female\n272 14 Male\n273 5 Female\n274 5 Male\n275 10 Female\n276 13 Male\n277 6 Male\n278 5 Male\n279 12 Male\n280 2 Male\n281 3 Female\n282 1 Female\n283 1 Male\n284 1 Female\n285 2 Female\n286 5 Female\n287 5 Male\n288 4 Female\n289 2 Male\n290 NA Female\n291 6 Female\n292 8 Male\n293 15 Male\n294 11 Male\n295 14 Male\n296 6 Male\n297 10 Female\n298 12 Male\n299 14 Male\n300 10 Male\n301 1 Female\n302 3 Male\n303 2 Male\n304 3 Female\n305 4 Male\n306 3 Male\n307 4 Female\n308 4 Male\n309 1 Female\n310 7 Male\n311 11 Female\n312 7 Female\n313 5 Female\n314 10 Male\n315 9 Female\n316 13 Male\n317 11 Female\n318 13 Male\n319 9 Female\n320 15 Female\n321 7 Female\n322 4 Male\n323 1 Male\n324 1 Male\n325 2 Female\n326 2 Female\n327 3 Male\n328 2 Male\n329 3 Male\n330 4 Female\n331 7 Female\n332 11 Female\n333 10 Female\n334 5 Male\n335 8 Male\n336 15 Male\n337 14 Male\n338 2 Male\n339 2 Female\n340 2 Male\n341 5 Male\n342 4 Female\n343 3 Male\n344 5 Female\n345 4 Female\n346 2 Female\n347 1 Female\n348 7 Male\n349 8 Female\n350 NA Male\n351 9 Male\n352 8 Female\n353 5 Male\n354 14 Male\n355 14 Male\n356 7 Female\n357 13 Female\n358 2 Male\n359 1 Female\n360 1 Male\n361 4 Female\n362 3 Male\n363 4 Female\n364 3 Male\n365 1 Male\n366 5 Female\n367 4 Female\n368 4 Female\n369 4 Male\n370 11 Male\n371 15 Female\n372 12 Female\n373 11 Female\n374 8 Female\n375 13 Male\n376 10 Female\n377 10 Female\n378 15 Male\n379 8 Female\n380 14 Male\n381 4 Male\n382 1 Male\n383 5 Female\n384 2 Male\n385 2 Female\n386 4 Male\n387 4 Male\n388 2 Female\n389 3 Male\n390 11 Male\n391 10 Female\n392 6 Male\n393 12 Female\n394 10 Female\n395 8 Male\n396 8 Male\n397 13 Male\n398 10 Male\n399 13 Female\n400 10 Male\n401 2 Male\n402 4 Female\n403 3 Female\n404 2 Female\n405 1 Female\n406 3 Male\n407 3 Female\n408 4 Male\n409 5 Female\n410 5 Female\n411 1 Female\n412 11 Male\n413 6 Male\n414 14 Female\n415 8 Male\n416 8 Female\n417 9 Female\n418 7 Male\n419 6 Male\n420 12 Female\n421 8 Male\n422 11 Female\n423 14 Male\n424 3 Female\n425 1 Female\n426 5 Female\n427 2 Female\n428 3 Female\n429 4 Female\n430 2 Male\n431 3 Female\n432 4 Male\n433 1 Female\n434 7 Female\n435 10 Male\n436 11 Male\n437 7 Female\n438 10 Female\n439 14 Female\n440 7 Female\n441 11 Male\n442 12 Male\n443 10 Female\n444 6 Male\n445 13 Male\n446 8 Female\n447 2 Male\n448 3 Female\n449 1 Female\n450 2 Female\n451 NA Male\n452 NA Female\n453 4 Male\n454 4 Male\n455 1 Male\n456 2 Female\n457 2 Male\n458 12 Male\n459 12 Female\n460 8 Female\n461 14 Female\n462 13 Female\n463 6 Male\n464 11 Female\n465 11 Male\n466 10 Female\n467 12 Male\n468 14 Female\n469 11 Female\n470 1 Male\n471 2 Female\n472 3 Male\n473 3 Female\n474 5 Female\n475 3 Male\n476 1 Male\n477 4 Female\n478 4 Female\n479 4 Male\n480 2 Female\n481 5 Female\n482 7 Male\n483 8 Male\n484 10 Male\n485 6 Female\n486 7 Male\n487 10 Female\n488 6 Male\n489 6 Female\n490 15 Female\n491 5 Male\n492 3 Male\n493 5 Male\n494 3 Female\n495 5 Male\n496 5 Male\n497 1 Female\n498 1 Male\n499 7 Female\n500 14 Female\n501 9 Male\n502 10 Female\n503 10 Female\n504 11 Male\n505 11 Female\n506 12 Female\n507 11 Female\n508 12 Male\n509 12 Male\n510 10 Female\n511 1 Male\n512 2 
Female\n513 4 Male\n514 2 Male\n515 3 Male\n516 3 Female\n517 2 Male\n518 4 Male\n519 3 Male\n520 1 Female\n521 4 Male\n522 12 Female\n523 6 Male\n524 7 Female\n525 7 Male\n526 13 Female\n527 8 Female\n528 7 Male\n529 8 Female\n530 8 Female\n531 11 Female\n532 14 Female\n533 3 Male\n534 2 Female\n535 2 Male\n536 3 Male\n537 2 Male\n538 2 Female\n539 3 Female\n540 2 Male\n541 5 Male\n542 10 Female\n543 14 Male\n544 9 Male\n545 6 Male\n546 7 Male\n547 14 Female\n548 7 Female\n549 7 Male\n550 9 Male\n551 14 Male\n552 10 Female\n553 13 Female\n554 5 Male\n555 4 Female\n556 4 Female\n557 5 Female\n558 4 Female\n559 4 Male\n560 4 Male\n561 3 Female\n562 1 Female\n563 4 Male\n564 1 Male\n565 1 Female\n566 7 Male\n567 13 Female\n568 10 Female\n569 14 Male\n570 12 Female\n571 14 Male\n572 8 Male\n573 7 Male\n574 11 Female\n575 8 Male\n576 12 Male\n577 9 Female\n578 5 Female\n579 4 Male\n580 3 Female\n581 2 Male\n582 2 Male\n583 3 Male\n584 4 Female\n585 4 Male\n586 4 Female\n587 5 Male\n588 3 Female\n589 6 Female\n590 3 Male\n591 11 Female\n592 11 Male\n593 7 Male\n594 8 Male\n595 6 Female\n596 10 Female\n597 8 Female\n598 8 Male\n599 9 Female\n600 8 Male\n601 13 Male\n602 11 Male\n603 8 Female\n604 2 Female\n605 4 Male\n606 2 Male\n607 2 Female\n608 4 Male\n609 2 Male\n610 4 Female\n611 2 Female\n612 4 Female\n613 1 Female\n614 4 Female\n615 12 Female\n616 7 Female\n617 11 Male\n618 6 Male\n619 8 Male\n620 14 Male\n621 11 Male\n622 7 Female\n623 14 Female\n624 6 Male\n625 13 Female\n626 13 Female\n627 3 Male\n628 1 Male\n629 3 Male\n630 1 Female\n631 1 Female\n632 2 Male\n633 4 Male\n634 4 Male\n635 2 Female\n636 4 Female\n637 5 Male\n638 3 Female\n639 3 Male\n640 6 Female\n641 11 Female\n642 9 Female\n643 7 Female\n644 8 Male\n645 NA Female\n646 8 Female\n647 14 Female\n648 10 Male\n649 10 Male\n650 11 Female\n651 13 Female\n\n\nWe can remove select columns using indexing as well, OR by simply changing the column to NULL\n\ndf[, -5] #remove column 5, \"slum\" variable\n\n\ndf$slum <- NULL # this is the same as above\n\nWe can also grab the age column using the $ operator, again this is selecting the variable for all of the rows.\n\ndf$age" + "objectID": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory-2", + "href": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory-2", + "title": "Module 3: Working Directories", + "section": "Understanding the working directory", + "text": "Understanding the working directory", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-rows", - "href": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-rows", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Using indexing to subset by rows", - "text": "Using indexing to subset by rows\nWe can use indexing to also subset by rows. 
For example, here we pull the 100th observation/row.\n\ndf[100,] \n\n observation_id IgG_concentration age gender slum\n100 8122 0.1818182 5 Female Non slum\n\n\nAnd, here we pull the age of the 100th observation/row.\n\ndf[100,\"age\"] \n\n[1] 5" + "objectID": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory-3", + "href": "modules/Module03-WorkingDirectories.html#understanding-the-working-directory-3", + "title": "Module 3: Working Directories", + "section": "Understanding the working directory", + "text": "Understanding the working directory", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#logical-operators-to-help-identify-and-missing-data", - "href": "modules/Module06-DataSubset.html#logical-operators-to-help-identify-and-missing-data", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "Logical operators to help identify and missing data", - "text": "Logical operators to help identify and missing data\n\n\n\noperator\ndescription\n\n\n\n\n\nis.na\nis NAN or NA\n\n\n\nis.nan\nis NAN\n\n\n\n!is.na\nis not NAN or NA\n\n\n\n!is.nan\nis not NAN\n\n\n\nis.infinite\nis infinite\n\n\n\nany\nare any TRUE\n\n\n\nall\nall are TRUE\n\n\n\nwhich\nwhich are TRUE" + "objectID": "modules/Module03-WorkingDirectories.html#getting-and-setting-the-working-directory-using-code", + "href": "modules/Module03-WorkingDirectories.html#getting-and-setting-the-working-directory-using-code", + "title": "Module 3: Working Directories", + "section": "Getting and setting the working directory using code", + "text": "Getting and setting the working directory using code\n\n## get the working directory\ngetwd()\nsetwd(\"~/\")", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module02-Functions.html#installing-and-attaching-packages", - "href": "modules/Module02-Functions.html#installing-and-attaching-packages", - "title": "Module 2: Functions", - "section": "Installing and attaching packages", - "text": "Installing and attaching packages\nTo use the bundle or “package” of code (and or possibly data) from a package, you need to install and also attach the package.\nTo install a package you can\n\ngo to R Studio Menu Bar Tools Menu —> Install Packages in the RStudio header\n\nOR\n\nuse the following code:\n\n\ninstall.packages(\"package_name\")" + "objectID": "modules/Module03-WorkingDirectories.html#setting-a-working-directory", + "href": "modules/Module03-WorkingDirectories.html#setting-a-working-directory", + "title": "Module 3: Working Directories", + "section": "Setting a working directory", + "text": "Setting a working directory\n\nSetting the directory can sometimes (almost always when new to R) be finicky\n\nWindows: Default directory structure involves single backslashes (“\\”), but R interprets these as”escape” characters. 
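A small sketch of the missing-value helpers in action on a made-up vector:

x <- c(0.3, NA, 5, NaN, Inf)
is.na(x)               # TRUE for both NA and NaN
which(is.na(x))        # positions of the missing entries
any(is.na(x))          # quick check before summarising
mean(x[is.finite(x)])  # is.finite() drops NA, NaN, Inf, and -Inf in one go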
So you must replace the backslash with forward slashes (“/”) or two backslashes (“\\\\”)\nMac/Linux: Default is forward slashes, so you are okay\n\nTypical directory structure syntax applies\n\n“..” - goes up one level\n“./” - is the current directory\n“~” - is your “home” directory", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module02-Functions.html#installing-and-attaching-packages-1", - "href": "modules/Module02-Functions.html#installing-and-attaching-packages-1", - "title": "Module 2: Functions", - "section": "Installing and attaching packages", - "text": "Installing and attaching packages\nTo attach (i.e., be able to use the package) you can use the following code:\n\nrequire(package_name) #library(package_name) also works\n\nMore on installing and attaching packages later…" + "objectID": "modules/Module03-WorkingDirectories.html#absolute-vs.-relative-paths", + "href": "modules/Module03-WorkingDirectories.html#absolute-vs.-relative-paths", + "title": "Module 3: Working Directories", + "section": "Absolute vs. relative paths", + "text": "Absolute vs. relative paths\nFrom Wiki\n\nAn absolute or full path points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. Absolute path is specific to your system alone. This means if I try your code, and you use absolute paths, it won’t work unless we have the exact same folder structure where R is looking (bad).\nBy contrast, a relative path starts from some given working directory, avoiding the need to provide the full absolute path.", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#attach-package", - "href": "modules/Module05-DataImportExport.html#attach-package", - "title": "Module 5: Data Import and Export", - "section": "4. Attach Package", - "text": "4. Attach Package\nReminder - To attach (i.e., be able to use the package) you can use the following code:\n\nrequire(package_name)\n\nTherefore,\n\nrequire(readxl)" + "objectID": "modules/Module03-WorkingDirectories.html#relative-path", + "href": "modules/Module03-WorkingDirectories.html#relative-path", + "title": "Module 3: Working Directories", + "section": "Relative path", + "text": "Relative path\nYou want to set you code up based on relative paths. This allows sharing of code, and also, allows you to modify your own file structure (above the working directory) without breaking your own code.", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module06-DataSubset.html#for-indexing-for-data-frame", - "href": "modules/Module06-DataSubset.html#for-indexing-for-data-frame", - "title": "Module 6: Get to Know Your Data and Subsetting", - "section": "$ for indexing for data frame", - "text": "$ for indexing for data frame\n$ allows only a literal character string or a symbol as the index. For a data frame it extracts a variable.\n\ndf$IgG_concentration\n\nNote, if you have spaces in your variable name, you will need to use back ticks ` after the $. This is a good reason to not create variables / column names with spaces." 
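As a small illustration of that last point (the column name with a space below is hypothetical), the backticks wrap the name after the $:
df$IgG_concentration      # no spaces in the name, so plain $ indexing works
# df$`IgG concentration`  # a name containing a space would have to be wrapped in backticks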
+ "objectID": "modules/Module03-WorkingDirectories.html#setting-the-working-directory-using-your-cursor", + "href": "modules/Module03-WorkingDirectories.html#setting-the-working-directory-using-your-cursor", + "title": "Module 3: Working Directories", + "section": "Setting the working directory using your cursor", + "text": "Setting the working directory using your cursor\nRemember above “Many people recommend not setting a directory in the scripts, rather assume you’re in the directory the script is in.” To do so, go to Session –> Set Working Directory –> To Source File Location\nRStudio will show the code in the Console for the action you took with your cursor. This is a good way to learn about your file system and how to set a correct working directory!\n\nsetwd(\"~/Dropbox/Git/SISMID-2024\")", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#learning-objectives", - "href": "modules/Module07-VarCreationClassesSummaries.html#learning-objectives", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 7, you should be able to…\n\nCreate new variables\nCharacterize variable classes\nManipulate the classes of variables\nConduct 1 variable data summaries" }, { + "objectID": "modules/Module03-WorkingDirectories.html#setting-the-working-directory", + "href": "modules/Module03-WorkingDirectories.html#setting-the-working-directory", + "title": "Module 3: Working Directories", + "section": "Setting the Working Directory", + "text": "Setting the Working Directory\nIf you have not yet saved a “source” file, it will set the working directory to the default location. Find the Tools Menu in the Menu Bar -> Global Options -> General for the default location.\nTo change the working directory to another location, find the Session Menu in the Menu Bar –> Set Working Directory –> Choose Directory\nAgain, RStudio will show the code in the Console for the action you took with your cursor.", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#import-data-for-this-module", - "href": "modules/Module07-VarCreationClassesSummaries.html#import-data-for-this-module", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Import data for this module", - "text": "Import data for this module\nLet’s first read in the data from the previous module and look at it briefly with a new function head(). 
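Before reading the file in, it can help to confirm that the working directory is set so the relative path will resolve; a quick hedged check, assuming the project layout used in this workshop with a data/ subdirectory:
getwd()             # where is R currently looking?
list.files("data")  # serodata.csv should be listed here if the relative path below will work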
head() allows us to look at the first n observations.\n\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum" }, { + "objectID": "modules/Module03-WorkingDirectories.html#summary", + "href": "modules/Module03-WorkingDirectories.html#summary", + "title": "Module 3: Working Directories", + "section": "Summary", + "text": "Summary\n\nR “looks” for files on your computer relative to the “working” directory\nAn absolute path points to the same location in a file system - it is specific to your system and your system alone\nA relative path is based on the current working directory\nTwo functions, setwd() and getwd(), are useful for identifying and manipulating the working directory.", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns", - "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Adding new columns", - "text": "Adding new columns\nYou can add a new column, called log_IgG, to df using the $ operator:\n\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n-1.146681\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1.234547\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n-1.203973\n\n\n\n\n\nNote, my use of the underscore in the variable name rather than a space. This is good coding practice and makes calling variables much less prone to error." }, { + "objectID": "modules/Module03-WorkingDirectories.html#acknowledgements", + "href": "modules/Module03-WorkingDirectories.html#acknowledgements", + "title": "Module 3: Working Directories", + "section": "Acknowledgements", + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University", + "crumbs": [ + "Day 1", + "Module 3: Working Directories" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#creating-conditional-variables", - "href": "modules/Module07-VarCreationClassesSummaries.html#creating-conditional-variables", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Creating conditional variables", - "text": "Creating conditional variables\nOne frequently used tool is creating variables with conditions. 
A general function for creating new variables based on existing variables is the Base R ifelse() function, which “returns a value depending on whether the element of test is TRUE or FALSE.”\n\n?ifelse\n\nConditional Element Selection\nDescription:\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\nUsage:\n ifelse(test, yes, no)\n \nArguments:\ntest: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\nDetails:\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\nValue:\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\nWarning:\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'if'.\nExamples:\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1" + "objectID": "modules/Module05-DataImportExport.html#learning-objectives", + "href": "modules/Module05-DataImportExport.html#learning-objectives", + "title": "Module 5: Data Import and Export", + "section": "Learning Objectives", + "text": "Learning Objectives\nAfter module 5, you should be able to…\n\nUse Base R functions to load data\nInstall and attach external R Packages to extend R’s functionality\nLoad any type of data into R\nFind loaded data in the Environment pane of RStudio\nReading and writing R .Rds and .Rda/.RData files", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example", - "href": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "ifelse example", - "text": "ifelse example\nReminder of the first three arguments in the ifelse() function are ifelse(test, yes, no).\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\nage_group\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n-1.1466807\nFALSE\nyoung\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1.2345475\nFALSE\nyoung\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n-1.2039728\nFALSE\nyoung\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n4.9644957\nTRUE\nyoung\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n-0.8037359\nFALSE\nyoung\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n-3.6781074\nFALSE\nyoung" + "objectID": "modules/Module05-DataImportExport.html#import-read-data", + "href": "modules/Module05-DataImportExport.html#import-read-data", + "title": "Module 5: Data Import and Export", + "section": "Import (read) Data", + "text": "Import (read) Data\n\nImporting or ‘Reading in’ data are the first step of any real project / data analysis\nR can read almost any file format, especially with external, non-Base R, packages\nWe are going to focus on simple delimited files first.\n\ncomma separated (e.g. ‘.csv’)\ntab delimited (e.g. ‘.txt’)\n\n\nA delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. 
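As a tiny illustration of a delimited file (values made up), the text argument of read.csv() lets us parse an inline comma-delimited snippet without creating a file:
read.csv(text = "id,age,gender\n1,5,Female\n2,9,Male")  # 2 records, each with 3 comma-separated fields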
Within each row, individual fields are separated by column delimiters (IBM.com definition)", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#nesting-ifelse-statements-example", - "href": "modules/Module07-VarCreationClassesSummaries.html#nesting-ifelse-statements-example", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Nesting ifelse statements example", - "text": "Nesting ifelse statements example\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n\nLet’s use the table() function to check if it worked.\n\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n\n\n\n\nage/\nmiddle\nold\nyoung\nNA\n\n\n\n\n1\n0\n0\n44\n0\n\n\n2\n0\n0\n72\n0\n\n\n3\n0\n0\n79\n0\n\n\n4\n0\n0\n80\n0\n\n\n5\n0\n0\n41\n0\n\n\n6\n38\n0\n0\n0\n\n\n7\n38\n0\n0\n0\n\n\n8\n39\n0\n0\n0\n\n\n9\n20\n0\n0\n0\n\n\n10\n44\n0\n0\n0\n\n\n11\n0\n41\n0\n0\n\n\n12\n0\n23\n0\n0\n\n\n13\n0\n35\n0\n0\n\n\n14\n0\n37\n0\n0\n\n\n15\n0\n11\n0\n0\n\n\nNA\n0\n0\n0\n9\n\n\n\n\n\nNote, it puts the variable levels in alphabetical order, we will show how to change this later." + "objectID": "modules/Module05-DataImportExport.html#mini-exercise", + "href": "modules/Module05-DataImportExport.html#mini-exercise", + "title": "Module 5: Data Import and Export", + "section": "Mini exercise", + "text": "Mini exercise\n\nDownload Module 5 data from the website and save the data to your data subdirectory – specifically SISMID_IntroToR_RProject/data\nOpen the ‘.csv’ and ‘.txt’ data files in a text editor application and familiarize yourself with the data (i.e., Notepad for Windows and TextEdit for Mac)\nOpen the ‘.xlsx’ data file in excel and familiarize yourself with the data - if you use a Mac do not open in Numbers, it can corrupt the file - if you do not have excel, you can upload it to Google Sheets\nDetermine the delimiter of the two ‘.txt’ files", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#overview---data-classes", - "href": "modules/Module07-VarCreationClassesSummaries.html#overview---data-classes", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Overview - Data Classes", - "text": "Overview - Data Classes\n\nOne dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\nTwo dimensional types (e.g., matrix, data frame, tibble)\nSpecial data classes (e.g., lists, dates)." 
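A quick hedged sketch of those groups, using throwaway objects rather than the workshop data:
class(c(1.5, 2, 3))                  # "numeric"    (one-dimensional vector)
class(data.frame(x = 1:2, y = 3:4))  # "data.frame" (two-dimensional)
class(list(1, "a", TRUE))            # "list"       (special data class)
class(as.Date("2024-07-01"))         # "Date"       (special data class)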
+ "objectID": "modules/Module05-DataImportExport.html#mini-exercise-1", + "href": "modules/Module05-DataImportExport.html#mini-exercise-1", + "title": "Module 5: Data Import and Export", + "section": "Mini exercise", + "text": "Mini exercise", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#class-function", - "href": "modules/Module07-VarCreationClassesSummaries.html#class-function", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "class() function", - "text": "class() function\nThe class() function allows you to evaluate the class of an object.\n\nclass(df$IgG_concentration)\n\n[1] \"numeric\"\n\nclass(df$age)\n\n[1] \"integer\"\n\nclass(df$gender)\n\n[1] \"character\"" + "objectID": "modules/Module05-DataImportExport.html#import-delimited-data", + "href": "modules/Module05-DataImportExport.html#import-delimited-data", + "title": "Module 5: Data Import and Export", + "section": "Import delimited data", + "text": "Import delimited data\nWithin the Base R ‘util’ package we can find a handful of useful functions including read.csv() and read.delim() to importing data.\n\n?read.csv\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\nData Input\n\nDescription:\n\n Reads a file in table format and creates a data frame from it,\n with cases corresponding to lines and variables to fields in the\n file.\n\nUsage:\n\n read.table(file, header = FALSE, sep = \"\", quote = \"\\\"'\",\n dec = \".\", numerals = c(\"allow.loss\", \"warn.loss\", \"no.loss\"),\n row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,\n na.strings = \"NA\", colClasses = NA, nrows = -1,\n skip = 0, check.names = TRUE, fill = !blank.lines.skip,\n strip.white = FALSE, blank.lines.skip = TRUE,\n comment.char = \"#\",\n allowEscapes = FALSE, flush = FALSE,\n stringsAsFactors = FALSE,\n fileEncoding = \"\", encoding = \"unknown\", text, skipNul = FALSE)\n \n read.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.csv2(file, header = TRUE, sep = \";\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim2(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \nArguments:\n\n file: the name of the file which the data are to be read from.\n Each row of the table appears as one line of the file. If it\n does not contain an _absolute_ path, the file name is\n _relative_ to the current working directory, 'getwd()'.\n Tilde-expansion is performed where supported. This can be a\n compressed file (see 'file').\n\n Alternatively, 'file' can be a readable text-mode connection\n (which will be opened for reading if necessary, and if so\n 'close'd (and hence destroyed) at the end of the function\n call). (If 'stdin()' is used, the prompts for lines may be\n somewhat confusing. Terminate input with a blank line or an\n EOF signal, 'Ctrl-D' on Unix and 'Ctrl-Z' on Windows. Any\n pushback on 'stdin()' will be cleared before return.)\n\n 'file' can also be a complete URL. (For the supported URL\n schemes, see the 'URLs' section of the help for 'url'.)\n\n header: a logical value indicating whether the file contains the\n names of the variables as its first line. 
If missing, the\n value is determined from the file format: 'header' is set to\n 'TRUE' if and only if the first row contains one fewer field\n than the number of columns.\n\n sep: the field separator character. Values on each line of the\n file are separated by this character. If 'sep = \"\"' (the\n default for 'read.table') the separator is 'white space',\n that is one or more spaces, tabs, newlines or carriage\n returns.\n\n quote: the set of quoting characters. To disable quoting altogether,\n use 'quote = \"\"'. See 'scan' for the behaviour on quotes\n embedded in quotes. Quoting is only considered for columns\n read as character, which is all of them unless 'colClasses'\n is specified.\n\n dec: the character used in the file for decimal points.\n\nnumerals: string indicating how to convert numbers whose conversion to\n double precision would lose accuracy, see 'type.convert'.\n Can be abbreviated. (Applies also to complex-number inputs.)\n\nrow.names: a vector of row names. This can be a vector giving the\n actual row names, or a single number giving the column of the\n table which contains the row names, or character string\n giving the name of the table column containing the row names.\n\n If there is a header and the first row contains one fewer\n field than the number of columns, the first column in the\n input is used for the row names. Otherwise if 'row.names' is\n missing, the rows are numbered.\n\n Using 'row.names = NULL' forces row numbering. Missing or\n 'NULL' 'row.names' generate row names that are considered to\n be 'automatic' (and not preserved by 'as.matrix').\n\ncol.names: a vector of optional names for the variables. The default\n is to use '\"V\"' followed by the column number.\n\n as.is: controls conversion of character variables (insofar as they\n are not converted to logical, numeric or complex) to factors,\n if not otherwise specified by 'colClasses'. Its value is\n either a vector of logicals (values are recycled if\n necessary), or a vector of numeric or character indices which\n specify which columns should not be converted to factors.\n\n Note: to suppress all conversions including those of numeric\n columns, set 'colClasses = \"character\"'.\n\n Note that 'as.is' is specified per column (not per variable)\n and so includes the column of row names (if any) and any\n columns to be skipped.\n\ntryLogical: a 'logical' determining if columns consisting entirely of\n '\"F\"', '\"T\"', '\"FALSE\"', and '\"TRUE\"' should be converted to\n 'logical'; passed to 'type.convert', true by default.\n\nna.strings: a character vector of strings which are to be interpreted\n as 'NA' values. Blank fields are also considered to be\n missing values in logical, integer, numeric and complex\n fields. Note that the test happens _after_ white space is\n stripped from the input, so 'na.strings' values may need\n their own white space stripped in advance.\n\ncolClasses: character. A vector of classes to be assumed for the\n columns. If unnamed, recycled as necessary. 
If named, names\n are matched with unspecified values being taken to be 'NA'.\n\n Possible values are 'NA' (the default, when 'type.convert' is\n used), '\"NULL\"' (when the column is skipped), one of the\n atomic vector classes (logical, integer, numeric, complex,\n character, raw), or '\"factor\"', '\"Date\"' or '\"POSIXct\"'.\n Otherwise there needs to be an 'as' method (from package\n 'methods') for conversion from '\"character\"' to the specified\n formal class.\n\n Note that 'colClasses' is specified per column (not per\n variable) and so includes the column of row names (if any).\n\n nrows: integer: the maximum number of rows to read in. Negative and\n other invalid values are ignored.\n\n skip: integer: the number of lines of the data file to skip before\n beginning to read data.\n\ncheck.names: logical. If 'TRUE' then the names of the variables in the\n data frame are checked to ensure that they are syntactically\n valid variable names. If necessary they are adjusted (by\n 'make.names') so that they are, and also to ensure that there\n are no duplicates.\n\n fill: logical. If 'TRUE' then in case the rows have unequal length,\n blank fields are implicitly added. See 'Details'.\n\nstrip.white: logical. Used only when 'sep' has been specified, and\n allows the stripping of leading and trailing white space from\n unquoted 'character' fields ('numeric' fields are always\n stripped). See 'scan' for further details (including the\n exact meaning of 'white space'), remembering that the columns\n may include the row names.\n\nblank.lines.skip: logical: if 'TRUE' blank lines in the input are\n ignored.\n\ncomment.char: character: a character vector of length one containing a\n single character or an empty string. Use '\"\"' to turn off\n the interpretation of comments altogether.\n\nallowEscapes: logical. Should C-style escapes such as '\\n' be\n processed or read verbatim (the default)? Note that if not\n within quotes these could be interpreted as a delimiter (but\n not as a comment character). For more details see 'scan'.\n\n flush: logical: if 'TRUE', 'scan' will flush to the end of the line\n after reading the last of the fields requested. This allows\n putting comments after the last field.\n\nstringsAsFactors: logical: should character vectors be converted to\n factors? Note that this is overridden by 'as.is' and\n 'colClasses', both of which allow finer control.\n\nfileEncoding: character string: if non-empty declares the encoding used\n on a file (not a connection) so the character data can be\n re-encoded. See the 'Encoding' section of the help for\n 'file', the 'R Data Import/Export' manual and 'Note'.\n\nencoding: encoding to be assumed for input strings. It is used to mark\n character strings as known to be in Latin-1 or UTF-8 (see\n 'Encoding'): it is not used to re-encode the input, but\n allows R to handle encoded strings in their native encoding\n (if one of those two). 
See 'Value' and 'Note'.\n\n text: character string: if 'file' is not supplied and this is, then\n data are read from the value of 'text' via a text connection.\n Notice that a literal string can be used to include (small)\n data sets within R code.\n\n skipNul: logical: should nuls be skipped?\n\n ...: Further arguments to be passed to 'read.table'.\n\nDetails:\n\n This function is the principal means of reading tabular data into\n R.\n\n Unless 'colClasses' is specified, all columns are read as\n character columns and then converted using 'type.convert' to\n logical, integer, numeric, complex or (depending on 'as.is')\n factor as appropriate. Quotes are (by default) interpreted in all\n fields, so a column of values like '\"42\"' will result in an\n integer column.\n\n A field or line is 'blank' if it contains nothing (except\n whitespace if no separator is specified) before a comment\n character or the end of the field or line.\n\n If 'row.names' is not specified and the header line has one less\n entry than the number of columns, the first column is taken to be\n the row names. This allows data frames to be read in from the\n format in which they are printed. If 'row.names' is specified and\n does not refer to the first column, that column is discarded from\n such files.\n\n The number of data columns is determined by looking at the first\n five lines of input (or the whole input if it has less than five\n lines), or from the length of 'col.names' if it is specified and\n is longer. This could conceivably be wrong if 'fill' or\n 'blank.lines.skip' are true, so specify 'col.names' if necessary\n (as in the 'Examples').\n\n 'read.csv' and 'read.csv2' are identical to 'read.table' except\n for the defaults. They are intended for reading 'comma separated\n value' files ('.csv') or ('read.csv2') the variant used in\n countries that use a comma as decimal point and a semicolon as\n field separator. Similarly, 'read.delim' and 'read.delim2' are\n for reading delimited files, defaulting to the TAB character for\n the delimiter. Notice that 'header = TRUE' and 'fill = TRUE' in\n these variants, and that the comment character is disabled.\n\n The rest of the line after a comment character is skipped; quotes\n are not processed in comments. Complete comment lines are allowed\n provided 'blank.lines.skip = TRUE'; however, comment lines prior\n to the header must have the comment character in the first\n non-blank column.\n\n Quoted fields with embedded newlines are supported except after a\n comment character. Embedded nuls are unsupported: skipping them\n (with 'skipNul = TRUE') may work.\n\nValue:\n\n A data frame ('data.frame') containing a representation of the\n data in the file.\n\n Empty input is an error unless 'col.names' is specified, when a\n 0-row data frame is returned: similarly giving just a header line\n if 'header = TRUE' results in a 0-row data frame. Note that in\n either case the columns will be logical unless 'colClasses' was\n supplied.\n\n Character strings in the result (including factor levels) will\n have a declared encoding if 'encoding' is '\"latin1\"' or '\"UTF-8\"'.\n\nCSV files:\n\n See the help on 'write.csv' for the various conventions for '.csv'\n files. The commonest form of CSV file with row names needs to be\n read with 'read.csv(..., row.names = 1)' to use the names in the\n first column of the file as row names.\n\nMemory usage:\n\n These functions can use a surprising amount of memory when reading\n large files. 
There is extensive discussion in the 'R Data\n Import/Export' manual, supplementing the notes here.\n\n Less memory will be used if 'colClasses' is specified as one of\n the six atomic vector classes. This can be particularly so when\n reading a column that takes many distinct numeric values, as\n storing each distinct value as a character string can take up to\n 14 times as much memory as storing it as an integer.\n\n Using 'nrows', even as a mild over-estimate, will help memory\n usage.\n\n Using 'comment.char = \"\"' will be appreciably faster than the\n 'read.table' default.\n\n 'read.table' is not the right tool for reading large matrices,\n especially those with many columns: it is designed to read _data\n frames_ which may have columns of very different classes. Use\n 'scan' instead for matrices.\n\nNote:\n\n The columns referred to in 'as.is' and 'colClasses' include the\n column of row names (if any).\n\n There are two approaches for reading input that is not in the\n local encoding. If the input is known to be UTF-8 or Latin1, use\n the 'encoding' argument to declare that. If the input is in some\n other encoding, then it may be translated on input. The\n 'fileEncoding' argument achieves this by setting up a connection\n to do the re-encoding into the current locale. Note that on\n Windows or other systems not running in a UTF-8 locale, this may\n not be possible.\n\nReferences:\n\n Chambers, J. M. (1992) _Data for models._ Chapter 3 of\n _Statistical Models in S_ eds J. M. Chambers and T. J. Hastie,\n Wadsworth & Brooks/Cole.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'scan', 'type.convert', 'read.fwf' for reading _f_ixed _w_idth\n _f_ormatted input; 'write.table'; 'data.frame'.\n\n 'count.fields' can be useful to determine problems with reading\n files which result in reports of incorrect record lengths (see the\n 'Examples' below).\n\n <https://www.rfc-editor.org/rfc/rfc4180> for the IANA definition\n of CSV files (which requires comma as separator and CRLF line\n endings).\n\nExamples:\n\n ## using count.fields to handle unknown maximum number of fields\n ## when fill = TRUE\n test1 <- c(1:5, \"6,7\", \"8,9,10\")\n tf <- tempfile()\n writeLines(test1, tf)\n \n read.csv(tf, fill = TRUE) # 1 column\n ncol <- max(count.fields(tf, sep = \",\"))\n read.csv(tf, fill = TRUE, header = FALSE,\n col.names = paste0(\"V\", seq_len(ncol)))\n unlink(tf)\n \n ## \"Inline\" data set, using text=\n ## Notice that leading and trailing empty lines are auto-trimmed\n \n read.table(header = TRUE, text = \"\n a b\n 1 2\n 3 4\n \")", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#one-dimensional-data-types", - "href": "modules/Module07-VarCreationClassesSummaries.html#one-dimensional-data-types", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "One dimensional data types", - "text": "One dimensional data types\n\nCharacter: strings or individual characters, quoted\nNumeric: any real number(s)\n\nDouble: contains fractional values (i.e., double precision) - default numeric\nInteger: any integer(s)/whole numbers\n\nLogical: variables composed of TRUE or FALSE\nFactor: categorical/qualitative variables" + "objectID": "modules/Module05-DataImportExport.html#import-.csv-files", + "href": "modules/Module05-DataImportExport.html#import-.csv-files", + "title": "Module 5: Data Import and Export", + "section": "Import .csv files", + "text": "Import .csv files\nFunction 
signature reminder\nread.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n\nNote #1, I assigned the data frame to an object called df. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it as an object.\nNote #2, If the data is imported correct, you can expect to see the df object ready to be used.", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#character-and-numeric", - "href": "modules/Module07-VarCreationClassesSummaries.html#character-and-numeric", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Character and numeric", - "text": "Character and numeric\nThis can also be a bit tricky.\nIf only one character in the whole vector, the class is assumed to be character\n\nclass(c(1, 2, \"tree\")) \n\n[1] \"character\"\n\n\nHere because integers are in quotations, it is read as a character class by R.\n\nclass(c(\"1\", \"4\", \"7\")) \n\n[1] \"character\"\n\n\nNote, instead of creating a new vector object (e.g., x <- c(\"1\", \"4\", \"7\")) and then feeding the vector object x into the first argument of the class() function (e.g., class(x)), we combined the two steps and directly fed a vector object into the class function." + "objectID": "modules/Module05-DataImportExport.html#import-.txt-files", + "href": "modules/Module05-DataImportExport.html#import-.txt-files", + "title": "Module 5: Data Import and Export", + "section": "Import .txt files", + "text": "Import .txt files\nread.csv() is a special case of read.delim() – a general function to read a delimited file into a data frame\nReminder function signature\nread.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n - `file` is the path to your file, in quotes \n - `delim` is what separates the fields within a record. The default for csv is comma\nWe can import the ‘.txt’ files given that we know that ‘serodata1.txt’ uses a tab delimiter and ‘serodata2.txt’ uses a semicolon delimiter.\n\n## Examples\ndf <- read.delim(file = \"data/serodata.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata.txt\", sep = \";\")\n\nThe dataset is now successfully read into your R workspace, many times actually. Notice, that each time we imported the data we assigned the data to the df object, meaning we replaced it each time we reassigned the df object.", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-subclasses", - "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-subclasses", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Numeric Subclasses", - "text": "Numeric Subclasses\nThere are two major numeric subclasses\n\nDouble is a special subset of numeric that contains fractional values. Double stands for double-precision\nInteger is a special subset of numeric that contains only whole numbers.\n\ntypeof() identifies the vector type (double, integer, logical, or character), whereas class() identifies the root class. 
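A minimal illustration of that distinction, with throwaway values:
class(1L); typeof(1L)    # both report "integer"
class(1.5); typeof(1.5)  # class() says "numeric", typeof() says "double"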
The difference between the two will be more clear when we look at two dimensional classes below.\n\nclass(df$IgG_concentration)\n\n[1] \"numeric\"\n\nclass(df$age)\n\n[1] \"integer\"\n\ntypeof(df$IgG_concentration)\n\n[1] \"double\"\n\ntypeof(df$age)\n\n[1] \"integer\"" + "objectID": "modules/Module05-DataImportExport.html#what-if-we-have-a-.xlsx-file---what-do-we-do", + "href": "modules/Module05-DataImportExport.html#what-if-we-have-a-.xlsx-file---what-do-we-do", + "title": "Module 5: Data Import and Export", + "section": "What if we have a .xlsx file - what do we do?", + "text": "What if we have a .xlsx file - what do we do?\n\nAsk Google / ChatGPT\nFind and vet function and package you want\nInstall package\nAttach package\nUse function", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#logical", - "href": "modules/Module07-VarCreationClassesSummaries.html#logical", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Logical", - "text": "Logical\nReminder logical is a type that only has three possible elements: TRUE and FALSE and NA\n\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n\n[1] \"logical\"\n\n\nNote that when creating logical object the TRUE and FALSE are NOT in quotes. Putting R special classes (e.g., NA or FALSE) in quotations turns them into character value." + "objectID": "modules/Module05-DataImportExport.html#internet-search", + "href": "modules/Module05-DataImportExport.html#internet-search", + "title": "Module 5: Data Import and Export", + "section": "1. Internet Search", + "text": "1. Internet Search", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#other-useful-functions-for-evaluatingsetting-classes", - "href": "modules/Module07-VarCreationClassesSummaries.html#other-useful-functions-for-evaluatingsetting-classes", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Other useful functions for evaluating/setting classes", - "text": "Other useful functions for evaluating/setting classes\nThere are two useful functions associated with practically all R classes:\n\nis.CLASS_NAME(x) to logically check whether or not x is of certain class. For example, is.integer or is.character or is.numeric\nas.CLASS_NAME(x) to coerce between classes x from current x class into a another class. For example, as.integer or as.character or as.numeric. This is particularly useful is maybe integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later)." + "objectID": "modules/Module05-DataImportExport.html#find-and-vet-function-and-package-you-want", + "href": "modules/Module05-DataImportExport.html#find-and-vet-function-and-package-you-want", + "title": "Module 5: Data Import and Export", + "section": "2. Find and vet function and package you want", + "text": "2. Find and vet function and package you want\nI am getting consistent message to use the the read_excel() function found in the readxl package. This package was developed by Hadley Wickham, who we know is reputable. 
Also, you can check that data was read in correctly, b/c this is a straightforward task.", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#examples-is.class_namex", - "href": "modules/Module07-VarCreationClassesSummaries.html#examples-is.class_namex", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Examples is.CLASS_NAME(x)", - "text": "Examples is.CLASS_NAME(x)\n\nis.numeric(df$IgG_concentration)\n\n[1] TRUE\n\nis.character(df$age)\n\n[1] FALSE\n\nis.character(df$gender)\n\n[1] TRUE" + "objectID": "modules/Module05-DataImportExport.html#install-package", + "href": "modules/Module05-DataImportExport.html#install-package", + "title": "Module 5: Data Import and Export", + "section": "3. Install Package", + "text": "3. Install Package\nTo use the bundle or “package” of code (and or possibly data) from a package, you need to install and also attach the package.\nTo install a package you can\n\ngo to Tools —> Install Packages in the RStudio header\n\nOR\n\nuse the following code:\n\n\ninstall.packages(\"package_name\")\n\nTherefore,\n\ninstall.packages(\"readxl\")", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#examples-as.class_namex", - "href": "modules/Module07-VarCreationClassesSummaries.html#examples-as.class_namex", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Examples as.CLASS_NAME(x)", - "text": "Examples as.CLASS_NAME(x)\nIn some cases, coercing is seamless\n\nas.character(c(1, 4, 7))\n\n[1] \"1\" \"4\" \"7\"\n\nas.numeric(c(\"1\", \"4\", \"7\"))\n\n[1] 1 4 7\n\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n\n[1] TRUE FALSE FALSE\n\n\nIn some cases the coercing is not possible; if executed, will return NA\n\nas.numeric(c(\"1\", \"4\", \"7a\"))\n\nWarning: NAs introduced by coercion\n\n\n[1] 1 4 NA\n\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n\n[1] TRUE FALSE NA" + "objectID": "modules/Module05-DataImportExport.html#attach-package", + "href": "modules/Module05-DataImportExport.html#attach-package", + "title": "Module 5: Data Import and Export", + "section": "4. Attach Package", + "text": "4. Attach Package\nReminder - To attach (i.e., be able to use the package) you can use the following code:\n\nrequire(package_name)\n\nTherefore,\n\nrequire(readxl)", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#factors", - "href": "modules/Module07-VarCreationClassesSummaries.html#factors", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Factors", - "text": "Factors\nA factor is a special character vector where the elements have pre-defined groups or ‘levels’. You can think of these as qualitative or categorical variables. Use the factor() function to create factors from character values.\n\nclass(df$age_group)\n\n[1] \"character\"\n\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n\n[1] \"factor\"\n\nlevels(df$age_group_factor)\n\n[1] \"middle\" \"old\" \"young\" \n\n\nNote 1, that levels are, by default, set to alphanumerical order! And, the first is always the “reference” group. However, we often prefer a different reference group.\nNote 2, we can also make ordered factors using factor(... ordered=TRUE), but we won’t talk more about that." 
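For instance, a small sketch with a made-up character vector:
x <- c("young", "old", "middle", "young")
xf <- factor(x)
levels(xf)                                       # alphabetical by default: "middle" "old" "young"
factor(x, levels = c("young", "middle", "old"))  # or set the level order (reference first) yourself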
+ "objectID": "modules/Module05-DataImportExport.html#use-function", + "href": "modules/Module05-DataImportExport.html#use-function", + "title": "Module 5: Data Import and Export", + "section": "5. Use Function", + "text": "5. Use Function\n\n?read_excel\n\nRead xls and xlsx files\nDescription:\n Read xls and xlsx files\n\n 'read_excel()' calls 'excel_format()' to determine if 'path' is\n xls or xlsx, based on the file extension and the file itself, in\n that order. Use 'read_xls()' and 'read_xlsx()' directly if you\n know better and want to prevent such guessing.\nUsage:\n read_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xls(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xlsx(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \nArguments:\npath: Path to the xls/xlsx file.\nsheet: Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Ignored if the sheet is specified via ‘range’. If neither argument specifies the sheet, defaults to the first sheet.\nrange: A cell range to read from, as described in cell-specification. Includes typical Excel ranges like “B3:D87”, possibly including the sheet name like “Budget!B2:G14”, and more. Interpreted strictly, even if the range forces the inclusion of leading or trailing empty rows or columns. Takes precedence over ‘skip’, ‘n_max’ and ‘sheet’.\ncol_names: ‘TRUE’ to use the first row as column names, ‘FALSE’ to get default names, or a character vector giving a name for each column. If user provides ‘col_types’ as a vector, ‘col_names’ can have one entry per column, i.e. have the same length as ‘col_types’, or one entry per unskipped column.\ncol_types: Either ‘NULL’ to guess all from the spreadsheet or a character vector containing one entry per column from these options: “skip”, “guess”, “logical”, “numeric”, “date”, “text” or “list”. If exactly one ‘col_type’ is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from ‘col_types = NULL’, but on a cell-by-cell basis.\n na: Character vector of strings to interpret as missing values.\n By default, readxl treats blank cells as missing data.\ntrim_ws: Should leading and trailing whitespace be trimmed?\nskip: Minimum number of rows to skip before reading anything, be it\n column names or data. Leading empty rows are automatically\n skipped, so this is a lower bound. Ignored if 'range' is\n given.\nn_max: Maximum number of data rows to read. Trailing empty rows are automatically skipped, so this is an upper bound on the number of rows in the returned tibble. Ignored if ‘range’ is given.\nguess_max: Maximum number of data rows to use for guessing column types.\nprogress: Display a progress spinner? 
By default, the spinner appears only in an interactive session, outside the context of knitting a document, and when the call is likely to run for several seconds or more. See ‘readxl_progress()’ for more details.\n.name_repair: Handling of column names. Passed along to ‘tibble::as_tibble()’. readxl’s default is `.name_repair = “unique”, which ensures column names are not empty and are unique.\nValue:\n A tibble\nSee Also:\n cell-specification for more details on targetting cells with the\n 'range' argument\nExamples:\n datasets <- readxl_example(\"datasets.xlsx\")\n read_excel(datasets)\n \n # Specify sheet either by position or by name\n read_excel(datasets, 2)\n read_excel(datasets, \"mtcars\")\n \n # Skip rows and use default column names\n read_excel(datasets, skip = 148, col_names = FALSE)\n \n # Recycle a single column type\n read_excel(datasets, col_types = \"text\")\n \n # Specify some col_types and guess others\n read_excel(datasets, col_types = c(\"text\", \"guess\", \"numeric\", \"guess\", \"guess\"))\n \n # Accomodate a column with disparate types via col_type = \"list\"\n df <- read_excel(readxl_example(\"clippy.xlsx\"), col_types = c(\"text\", \"list\"))\n df\n df$value\n sapply(df$value, class)\n \n # Limit the number of data rows read\n read_excel(datasets, n_max = 3)\n \n # Read from an Excel range using A1 or R1C1 notation\n read_excel(datasets, range = \"C1:E7\")\n read_excel(datasets, range = \"R1C2:R2C5\")\n \n # Specify the sheet as part of the range\n read_excel(datasets, range = \"mtcars!B1:D5\")\n \n # Read only specific rows or columns\n read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)\n read_excel(datasets, range = cell_cols(\"B:D\"))\n \n # Get a preview of column names\n names(read_excel(readxl_example(\"datasets.xlsx\"), n_max = 0))\n \n # exploit full .name_repair flexibility from tibble\n \n # \"universal\" names are unique and syntactic\n read_excel(\n readxl_example(\"deaths.xlsx\"),\n range = \"arts!A5:F15\",\n .name_repair = \"universal\"\n )\n \n # specify name repair as a built-in function\n read_excel(readxl_example(\"clippy.xlsx\"), .name_repair = toupper)\n \n # specify name repair as a custom function\n my_custom_name_repair <- function(nms) tolower(gsub(\"[.]\", \"_\", nms))\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n .name_repair = my_custom_name_repair\n )\n \n # specify name repair as an anonymous function\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n sheet = \"chickwts\",\n .name_repair = ~ substr(.x, start = 1, stop = 3)\n )", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#reference-groups", - "href": "modules/Module07-VarCreationClassesSummaries.html#reference-groups", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Reference Groups", - "text": "Reference Groups\nWhy do we care about reference groups?\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being <5 years old is associated with higher IgG antibody concentrations\nBy default middle is the reference group therefore we will only generate beta coefficients comparing middle to young AND middle to old. But, we want young to be the reference group so we will generate beta coefficients comparing young to middle AND young to old." 
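A hedged sketch of why the reference level matters; the model below is illustrative only, reusing the df columns created earlier in these slides:
df$age_group_factor <- relevel(df$age_group_factor, ref = "young")  # make "young" the reference
fit <- glm(IgG_concentration ~ age_group_factor, data = df)
summary(fit)  # coefficients age_group_factormiddle and age_group_factorold are contrasts vs "young"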
+ "objectID": "modules/Module05-DataImportExport.html#use-function-1", + "href": "modules/Module05-DataImportExport.html#use-function-1", + "title": "Module 5: Data Import and Export", + "section": "5. Use Function", + "text": "5. Use Function\nReminder of function signature\nread_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n)\nLet’s practice\n\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference", - "href": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Changing factor reference", - "text": "Changing factor reference\nChanging the reference group of a factor variable.\n\nIf the object is already a factor then use relevel() function and the ref argument to specify the reference.\nIf the object is a character then use factor() function and levels argument to specify the order of the values, the first being the reference.\n\nLet’s look at the relevel() help file\nReorder Levels of Factor\nDescription:\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\nUsage:\n relevel(x, ref, ...)\n \nArguments:\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\nDetails:\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\nValue:\n A factor of the same length as 'x'.\nSee Also:\n 'factor', 'contr.treatment', 'levels', 'reorder'.\nExamples:\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\nLet’s look at the factor() help file\nFactors\nDescription:\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\nUsage:\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n x: a vector of data, usually taking a small number of distinct\n values.\nlevels: an optional vector of the unique values (as character strings) that ‘x’ might have taken. The default is the unique set of values taken by ‘as.character(x)’, sorted into increasing order of ‘x’. Note that this set can be specified as smaller than ‘sort(unique(x))’.\nlabels: either an optional character vector of labels for the levels (in the same order as ‘levels’ after removing those in ‘exclude’), or a character string of length 1. 
Duplicated values in ‘labels’ can be used to map different values of ‘x’ to the same factor level.\nexclude: a vector of values to be excluded when forming the set of levels. This may be factor with the same level set as ‘x’ or should be a ‘character’.\nordered: logical flag to determine if the levels should be regarded as ordered (in the order given).\nnmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\nifany: only add an ‘NA’ level if it is used, i.e. if ‘any(is.na(x))’.\nobject: an R object.\nDetails:\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and the model-fitting functions treat the two classes\n quite differently.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '<NA>');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). Under those circumstances\n missing values are currently printed as '<NA>', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. 
Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\nValue:\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject(<factor>)'.\nWarning:\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\nComparison operators and group generic methods:\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. 
(The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\nNote:\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\nReferences:\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\nSee Also:\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\nExamples:\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. <- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 <NA> <NA>\n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps <NA> , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = <character>\n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))" + "objectID": "modules/Module05-DataImportExport.html#what-would-happen-if-we-made-these-mistakes", + "href": "modules/Module05-DataImportExport.html#what-would-happen-if-we-made-these-mistakes", + "title": "Module 5: Data Import and Export", + "section": "What would happen if we made these mistakes (*)", + "text": "What would happen if we made these mistakes (*)\n\nWhat do you think would happen if I had imported the data without assigning it to an object\n\n\nread_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n\n\nWhat do you think would happen if I forgot to specify the sheet argument?\n\n\ndd <- read_excel(path = \"data/serodata.xlsx\")", + "crumbs": [ + "Day 1", + "Module 5: Data Import and 
Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference-examples", - "href": "modules/Module07-VarCreationClassesSummaries.html#changing-factor-reference-examples", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Changing factor reference examples", - "text": "Changing factor reference examples\n\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n\n[1] \"young\" \"middle\" \"old\" \n\n\nOR\n\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n\n[1] \"young\" \"middle\" \"old\" \n\n\nArranging, tabulating, and plotting the data will reflect the new order" + "objectID": "modules/Module05-DataImportExport.html#installing-and-attaching-packages---common-confusion", + "href": "modules/Module05-DataImportExport.html#installing-and-attaching-packages---common-confusion", + "title": "Module 5: Data Import and Export", + "section": "Installing and attaching packages - Common confusion", + "text": "Installing and attaching packages - Common confusion\n\nYou only need to install a package once (unless you update R or want to update the package), but you will need to attach a package each time you want to use it.\n\nThe exception to this rule are the “base” set of packages (i.e., Base R) that are installed automatically when you install R and that automatically attached whenever you open R or RStudio.", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#two-dimensional-data-classes", - "href": "modules/Module07-VarCreationClassesSummaries.html#two-dimensional-data-classes", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Two-dimensional data classes", - "text": "Two-dimensional data classes\nTwo-dimensional classes are those we would often use to store data read from a file\n\na matrix (matrix class)\na data frame (data.frame or tibble classes)" + "objectID": "modules/Module05-DataImportExport.html#common-error", + "href": "modules/Module05-DataImportExport.html#common-error", + "title": "Module 5: Data Import and Export", + "section": "Common Error", + "text": "Common Error\nBe prepared to see this error\n\nError: could not find function \"some_function_name\"\n\nThis usually means that either\n\nyou called the function by the wrong name\nyou have not installed a package that contains the function\nyou have installed a package but you forgot to attach it (i.e., require(package_name)) – most likely", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#matrices", - "href": "modules/Module07-VarCreationClassesSummaries.html#matrices", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Matrices", - "text": "Matrices\nMatrices, like data frames are also composed of rows and columns. Matrices, unlike data.frame, the entire matrix is composed of one R class. For example: all entries are numeric, or all entries are character\nas.matrix() creates a matrix from a data frame (where all values are the same class). 
As a reminder, here is the matrix signature function to help remind us how to build a matrix\nmatrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)\n\nmatrix(data=1:6, ncol = 2) \n\n\n\n\n1\n4\n\n\n2\n5\n\n\n3\n6\n\n\n\n\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n\n\n\n\n1\n2\n\n\n3\n4\n\n\n5\n6\n\n\n\n\n\nNote, the first matrix filled in numbers 1-6 by columns first and then rows because default byrow argument is FALSE. In the second matrix, we changed the argument byrow to TRUE, and now numbers 1-6 are filled by rows first and then columns." + "objectID": "modules/Module05-DataImportExport.html#export-write-data", + "href": "modules/Module05-DataImportExport.html#export-write-data", + "title": "Module 5: Data Import and Export", + "section": "Export (write) Data", + "text": "Export (write) Data\n\nExporting or ‘Writing out’ data allows you to save modified files for future use or sharing\nR can write almost any file format, especially with external, non-Base R, packages\nWe are going to focus again on writing delimited files", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#data-frame", - "href": "modules/Module07-VarCreationClassesSummaries.html#data-frame", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Data frame", - "text": "Data frame\nYou can transform an existing matrix into data frames using as.data.frame()\n\nas.data.frame(matrix(1:6, ncol = 2) ) \n\n\n\n\nV1\nV2\n\n\n\n\n1\n4\n\n\n2\n5\n\n\n3\n6" + "objectID": "modules/Module05-DataImportExport.html#export-delimited-data", + "href": "modules/Module05-DataImportExport.html#export-delimited-data", + "title": "Module 5: Data Import and Export", + "section": "Export delimited data", + "text": "Export delimited data\nWithin the Base R ‘util’ package we can find a handful of useful functions including write.csv() and write.table() to exporting data.\n\n\nData Output\n\nDescription:\n\n 'write.table' prints its required argument 'x' (after converting\n it to a data frame if it is not one nor a matrix) to a file or\n connection.\n\nUsage:\n\n write.table(x, file = \"\", append = FALSE, quote = TRUE, sep = \" \",\n eol = \"\\n\", na = \"NA\", dec = \".\", row.names = TRUE,\n col.names = TRUE, qmethod = c(\"escape\", \"double\"),\n fileEncoding = \"\")\n \n write.csv(...)\n write.csv2(...)\n \nArguments:\n\n x: the object to be written, preferably a matrix or data frame.\n If not, it is attempted to coerce 'x' to a data frame.\n\n file: either a character string naming a file or a connection open\n for writing. '\"\"' indicates output to the console.\n\n append: logical. Only relevant if 'file' is a character string. If\n 'TRUE', the output is appended to the file. If 'FALSE', any\n existing file of the name is destroyed.\n\n quote: a logical value ('TRUE' or 'FALSE') or a numeric vector. If\n 'TRUE', any character or factor columns will be surrounded by\n double quotes. If a numeric vector, its elements are taken\n as the indices of columns to quote. In both cases, row and\n column names are quoted if they are written. If 'FALSE',\n nothing is quoted.\n\n sep: the field separator string. Values within each row of 'x'\n are separated by this string.\n\n eol: the character(s) to print at the end of each line (row). 
For\n example, 'eol = \"\\r\\n\"' will produce Windows' line endings on\n a Unix-alike OS, and 'eol = \"\\r\"' will produce files as\n expected by Excel:mac 2004.\n\n na: the string to use for missing values in the data.\n\n dec: the string to use for decimal points in numeric or complex\n columns: must be a single character.\n\nrow.names: either a logical value indicating whether the row names of\n 'x' are to be written along with 'x', or a character vector\n of row names to be written.\n\ncol.names: either a logical value indicating whether the column names\n of 'x' are to be written along with 'x', or a character\n vector of column names to be written. See the section on\n 'CSV files' for the meaning of 'col.names = NA'.\n\n qmethod: a character string specifying how to deal with embedded\n double quote characters when quoting strings. Must be one of\n '\"escape\"' (default for 'write.table'), in which case the\n quote character is escaped in C style by a backslash, or\n '\"double\"' (default for 'write.csv' and 'write.csv2'), in\n which case it is doubled. You can specify just the initial\n letter.\n\nfileEncoding: character string: if non-empty declares the encoding to\n be used on a file (not a connection) so the character data\n can be re-encoded as they are written. See 'file'.\n\n ...: arguments to 'write.table': 'append', 'col.names', 'sep',\n 'dec' and 'qmethod' cannot be altered.\n\nDetails:\n\n If the table has no columns the rownames will be written only if\n 'row.names = TRUE', and _vice versa_.\n\n Real and complex numbers are written to the maximal possible\n precision.\n\n If a data frame has matrix-like columns these will be converted to\n multiple columns in the result (_via_ 'as.matrix') and so a\n character 'col.names' or a numeric 'quote' should refer to the\n columns in the result, not the input. Such matrix-like columns\n are unquoted by default.\n\n Any columns in a data frame which are lists or have a class (e.g.,\n dates) will be converted by the appropriate 'as.character' method:\n such columns are unquoted by default. On the other hand, any\n class information for a matrix is discarded and non-atomic (e.g.,\n list) matrices are coerced to character.\n\n Only columns which have been converted to character will be quoted\n if specified by 'quote'.\n\n The 'dec' argument only applies to columns that are not subject to\n conversion to character because they have a class or are part of a\n matrix-like column (or matrix), in particular to columns protected\n by 'I()'. Use 'options(\"OutDec\")' to control such conversions.\n\n In almost all cases the conversion of numeric quantities is\n governed by the option '\"scipen\"' (see 'options'), but with the\n internal equivalent of 'digits = 15'. For finer control, use\n 'format' to make a character matrix/data frame, and call\n 'write.table' on that.\n\n These functions check for a user interrupt every 1000 lines of\n output.\n\n If 'file' is a non-open connection, an attempt is made to open it\n and then close it after use.\n\n To write a Unix-style file on Windows, use a binary connection\n e.g. 'file = file(\"filename\", \"wb\")'.\n\nCSV files:\n\n By default there is no column name for a column of row names. If\n 'col.names = NA' and 'row.names = TRUE' a blank column name is\n added, which is the convention used for CSV files to be read by\n spreadsheets. 
Note that such CSV files can be read in R by\n\n read.csv(file = \"<filename>\", row.names = 1)\n \n 'write.csv' and 'write.csv2' provide convenience wrappers for\n writing CSV files. They set 'sep' and 'dec' (see below), 'qmethod\n = \"double\"', and 'col.names' to 'NA' if 'row.names = TRUE' (the\n default) and to 'TRUE' otherwise.\n\n 'write.csv' uses '\".\"' for the decimal point and a comma for the\n separator.\n\n 'write.csv2' uses a comma for the decimal point and a semicolon\n for the separator, the Excel convention for CSV files in some\n Western European locales.\n\n These wrappers are deliberately inflexible: they are designed to\n ensure that the correct conventions are used to write a valid\n file. Attempts to change 'append', 'col.names', 'sep', 'dec' or\n 'qmethod' are ignored, with a warning.\n\n CSV files do not record an encoding, and this causes problems if\n they are not ASCII for many other applications. Windows Excel\n 2007/10 will open files (e.g., by the file association mechanism)\n correctly if they are ASCII or UTF-16 (use 'fileEncoding =\n \"UTF-16LE\"') or perhaps in the current Windows codepage (e.g.,\n '\"CP1252\"'), but the 'Text Import Wizard' (from the 'Data' tab)\n allows far more choice of encodings. Excel:mac 2004/8 can\n _import_ only 'Macintosh' (which seems to mean Mac Roman),\n 'Windows' (perhaps Latin-1) and 'PC-8' files. OpenOffice 3.x asks\n for the character set when opening the file.\n\n There is an IETF RFC4180\n (<https://www.rfc-editor.org/rfc/rfc4180>) for CSV files, which\n mandates comma as the separator and CRLF line endings.\n 'write.csv' writes compliant files on Windows: use 'eol = \"\\r\\n\"'\n on other platforms.\n\nNote:\n\n 'write.table' can be slow for data frames with large numbers\n (hundreds or more) of columns: this is inevitable as each column\n could be of a different class and so must be handled separately.\n If they are all of the same class, consider using a matrix\n instead.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'read.table', 'write'.\n\n 'write.matrix' in package 'MASS'.\n\nExamples:\n\n x <- data.frame(a = I(\"a \\\" quote\"), b = pi)\n tf <- tempfile(fileext = \".csv\")\n \n ## To write a CSV file for input to Excel one might use\n write.table(x, file = tf, sep = \",\", col.names = NA,\n qmethod = \"double\")\n file.show(tf)\n ## and to read this file back into R one needs\n read.table(tf, header = TRUE, sep = \",\", row.names = 1)\n ## NB: you do need to specify a separator if qmethod = \"double\".\n \n ### Alternatively\n write.csv(x, file = tf)\n read.csv(tf, row.names = 1)\n ## or without row names\n write.csv(x, file = tf, row.names = FALSE)\n read.csv(tf)\n \n ## Not run:\n \n ## To write a file in Mac Roman for simple use in Mac Excel 2004/8\n write.csv(x, file = \"foo.csv\", fileEncoding = \"macroman\")\n ## or for Windows Excel 2007/10\n write.csv(x, file = \"foo.csv\", fileEncoding = \"UTF-16LE\")\n ## End(Not run)", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary", - "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Numeric variable data summary", - "text": "Numeric variable data summary\nData summarization on numeric vectors/variables:\n\nmean(): takes the mean of x\nsd(): takes the standard deviation of x\nmedian(): takes the median of x\nquantile(): 
displays sample quantiles of x. Default is min, IQR, max\nrange(): displays the range. Same as c(min(), max())\nsum(): sum of x\nmax(): maximum value in x\nmin(): minimum value in x\ncolSums(): get the columns sums of a data frame\nrowSums(): get the row sums of a data frame\ncolMeans(): get the columns means of a data frame\nrowMeans()`: get the row means of a data frame\n\nNote, the top 8 functions have an na.rm argument for missing data" + "objectID": "modules/Module05-DataImportExport.html#export-delimited-data-1", + "href": "modules/Module05-DataImportExport.html#export-delimited-data-1", + "title": "Module 5: Data Import and Export", + "section": "Export delimited data", + "text": "Export delimited data\nLet’s practice exporting the data as three files with three different delimiters (comma, tab, semicolon)\n\nwrite.csv(df, file=\"data/serodata_new.csv\", row.names = FALSE) #comma delimited\nwrite.table(df, file=\"data/serodata1_new.txt\", sep=\"\\t\", row.names = FALSE) #tab delimited\nwrite.table(df, file=\"data/serodata2_new.txt\", sep=\";\", row.names = FALSE) #semicolon delimited\n\nNote, I wrote the data to new file names. Even though we didn’t change the data at all in this module, it is good practice to keep raw data raw, and not to write over it.", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-examples", - "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-examples", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Numeric variable data summary examples", - "text": "Numeric variable data summary examples\n\nsummary(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\nage_group\nage_group_factor\n\n\n\n\n\nMin. :5006\nMin. : 0.0054\nMin. : 1.000\nLength:651\nLength:651\nMin. :-5.2231\nMode :logical\nLength:651\nyoung :316\n\n\n\n1st Qu.:6306\n1st Qu.: 0.3000\n1st Qu.: 3.000\nClass :character\nClass :character\n1st Qu.:-1.2040\nFALSE:360\nClass :character\nmiddle:179\n\n\n\nMedian :7495\nMedian : 1.6658\nMedian : 6.000\nMode :character\nMode :character\nMedian : 0.5103\nTRUE :281\nMode :character\nold :147\n\n\n\nMean :7492\nMean : 87.3683\nMean : 6.606\nNA\nNA\nMean : 1.6074\nNA’s :10\nNA\nNA’s : 9\n\n\n\n3rd Qu.:8749\n3rd Qu.:141.4405\n3rd Qu.:10.000\nNA\nNA\n3rd Qu.: 4.9519\nNA\nNA\nNA\n\n\n\nMax. :9982\nMax. :916.4179\nMax. :15.000\nNA\nNA\nMax. : 6.8205\nNA\nNA\nNA\n\n\n\nNA\nNA’s :10\nNA’s :9\nNA\nNA\nNA’s :10\nNA\nNA\nNA\n\n\n\n\nrange(df$age)\n\n[1] NA NA\n\nrange(df$age, na.rm=TRUE)\n\n[1] 1 15\n\nmedian(df$IgG_concentration, na.rm=TRUE)\n\n[1] 1.665753" + "objectID": "modules/Module05-DataImportExport.html#r-.rds-and-.rdardata-files", + "href": "modules/Module05-DataImportExport.html#r-.rds-and-.rdardata-files", + "title": "Module 5: Data Import and Export", + "section": "R .rds and .rda/RData files", + "text": "R .rds and .rda/RData files\nThere are two file extensions worth discussing.\nR has two native data formats—‘Rdata’ (sometimes shortened to ‘Rda’) and ‘Rds’. These formats are used when R objects are saved for later use. ‘Rdata’ is used to save multiple R objects, while ‘Rds’ is used to save a single R object. 
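(As a hedged aside, base R itself also provides saveRDS() and readRDS() for a single object; the readr wrappers shown on a later slide do the same job. The file name below is illustrative only:)

     saveRDS(df, file = "data/serodata.rds")   # one object per .rds file
     df2 <- readRDS("data/serodata.rds")       # assign to any name you like on the way back in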
‘Rds’ is fast to write/read and is very small.", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summaries", - "href": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summaries", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Character variable data summaries", - "text": "Character variable data summaries\nData summarization on character or factor vectors/variables using table()\n\n?table\n\nCross Tabulation and Table Creation\nDescription:\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\nUsage:\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\nexclude: levels to remove for all factors in ‘…’. If it does not contain ‘NA’ and ‘useNA’ is not specified, it implies ‘useNA = “ifany”’. See ‘Details’ for its interpretation for non-factor arguments.\nuseNA: whether to include ‘NA’ values in the table. See ‘Details’. Can be abbreviated.\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\ndeparse.level: controls how the default ‘dnn’ is constructed. See ‘Details’.\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\nrow.names: a character vector giving the row names for the data frame.\nresponseName: The name to be used for the column of table entries, usually counts.\nstringsAsFactors: logical: should the classifying factors be returned as factors (the default) or character vectors?\nsep, base: passed to ‘provideDimnames’.\nDetails:\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. 
If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\nValue:\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\nExamples:\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? --- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. 
The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))" + "objectID": "modules/Module05-DataImportExport.html#rds-binary-file", + "href": "modules/Module05-DataImportExport.html#rds-binary-file", + "title": "Module 5: Data Import and Export", + "section": ".rds binary file", + "text": ".rds binary file\nSaving datasets in .rds format can save time if you have to read it back in later.\nwrite_rds() and read_rds() from readr package can be used to write/read a single R object to/from file.\nrequire(readr)\nwrite_rds(object1, file = \"filename.rds\")\nobject1 <- read_rds(file = \"filename.rds\")", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summary-examples", - "href": "modules/Module07-VarCreationClassesSummaries.html#character-variable-data-summary-examples", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Character variable data summary examples", - "text": "Character variable data summary examples\nNumber of observations in each category\n\ntable(df$gender)\n\n\n\n\nFemale\nMale\n\n\n\n\n325\n326\n\n\n\n\ntable(df$gender, useNA=\"always\")\n\n\n\n\nFemale\nMale\nNA\n\n\n\n\n325\n326\n0\n\n\n\n\ntable(df$age_group, useNA=\"always\")\n\n\n\n\nmiddle\nold\nyoung\nNA\n\n\n\n\n179\n147\n316\n9\n\n\n\n\n\n\ntable(df$gender)/nrow(df) #if no NA values\n\n\n\n\nFemale\nMale\n\n\n\n\n0.499232\n0.500768\n\n\n\n\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n\n\n\n\nmiddle\nold\nyoung\n\n\n\n\n0.2788162\n0.228972\n0.4922118\n\n\n\n\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n\n\n\n\nmiddle\nold\nyoung\n\n\n\n\n0.2788162\n0.228972\n0.4922118" + "objectID": "modules/Module05-DataImportExport.html#rdardata-files", + "href": "modules/Module05-DataImportExport.html#rdardata-files", + "title": "Module 5: Data Import and Export", + "section": ".rda/RData files", + "text": ".rda/RData files\nThe Base R functions save() and load() can be used to save and load multiple R objects.\nsave() writes an external representation of R objects to the specified file, and can by loaded back into the environment using load(). A nice feature about using save and load is that the R object(s) is directly imported into the environment and you don’t have to specify the name. 
The files can be saved as .RData or .Rda files.\nFunction signature\nsave(object1, object2, file = \"filename.RData\")\nload(\"filename.RData\")\nNote that you separate the objects you want to save with commas.",
    "crumbs": [
      "Day 1",
      "Module 5: Data Import and Export"
    ]
  },
  {
    "objectID": "modules/Module05-DataImportExport.html#summary",
    "href": "modules/Module05-DataImportExport.html#summary",
    "title": "Module 5: Data Import and Export",
    "section": "Summary",
    "text": "Summary\n\nImporting or ‘Reading in’ data is the first step of any real project / data analysis\nThe Base R ‘utils’ package has useful functions including read.csv() and read.delim() for importing/reading data and write.csv() and write.table() for exporting/writing data\nWhen importing data (the exception is an object loaded from .RData), you must assign it to an object, otherwise it cannot be used\nIf data are imported correctly, they can be found in the Environment pane of RStudio\nYou only need to install a package once (unless you update R or the package), but you will need to attach a package each time you want to use it.\nTo complete a task you don’t know how to do (e.g., reading in an Excel data file) use the following steps: 1. Ask Google / ChatGPT, 2. Find and vet the function and package you want, 3. Install the package, 4. Attach the package, 5. 
Use function", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#acknowledgements", - "href": "modules/Module07-VarCreationClassesSummaries.html#acknowledgements", - "title": "Module 7: Variable Creation, Classes, and Summaries", + "objectID": "modules/Module05-DataImportExport.html#acknowledgements", + "href": "modules/Module05-DataImportExport.html#acknowledgements", + "title": "Module 5: Data Import and Export", "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University" - }, - { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-1", - "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-1", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Adding new columns", - "text": "Adding new columns\nWe can also add a new column using the transform() function:\n\n\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)\n\n\nFor example, adding a binary column for seropositivity called seropos:\n\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n-1.1466807\nFALSE\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1.2345475\nFALSE\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n-1.2039728\nFALSE\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n4.9644957\nTRUE\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n-0.8037359\nFALSE\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n-3.6781074\nFALSE" + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University", + "crumbs": [ + "Day 1", + "Module 5: Data Import and Export" + ] }, { "objectID": 
"modules/Module08-DataMergeReshape.html#learning-objectives", "href": "modules/Module08-DataMergeReshape.html#learning-objectives", "title": "Module 8: Data Merging and Reshaping", "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 8, you should be able to…\n\nMerge/join data together\nReshape data from wide to long\nReshape data from long to wide" + "text": "Learning Objectives\nAfter module 8, you should be able to…\n\nMerge/join data together\nReshape data from wide to long\nReshape data from long to wide", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#joining-types", "href": "modules/Module08-DataMergeReshape.html#joining-types", "title": "Module 8: Data Merging and Reshaping", "section": "Joining types", - "text": "Joining types\nPay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:\n\n1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)\n1:m merge (one-to-many merge) – More complex (things often go wrong)\n\nThe “one” suggests that one dataset has the merging variable (e.g., id) each represented once and the “many” implies that one dataset has the merging variable represented multiple times\n\nm:m merge (many-to-many merge) – Danger zone (can be unpredictable)" + "text": "Joining types\nPay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:\n\n1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)\n1:m merge (one-to-many merge) – More complex (things often go wrong)\n\nThe “one” suggests that one dataset has the merging variable (e.g., id) each represented once and the “many” implies that one dataset has the merging variable represented multiple times\n\nm:m merge (many-to-many merge) – Danger zone (can be unpredictable)", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#one-to-one-merge", "href": "modules/Module08-DataMergeReshape.html#one-to-one-merge", "title": "Module 8: Data Merging and Reshaping", "section": "one-to-one merge", - "text": "one-to-one merge\n\nThis means that each row of data represents a unique unit of analysis that exists in another dataset (e.g,. id variable)\nWill likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)\nThe merging variable (e.g., id) each represented a single time\nYou should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong." + "text": "one-to-one merge\n\nThis means that each row of data represents a unique unit of analysis that exists in another dataset (e.g,. 
id variable)\nWill likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)\nThe merging variable (e.g., id) each represented a single time\nYou should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong.", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#merge-function", "href": "modules/Module08-DataMergeReshape.html#merge-function", "title": "Module 8: Data Merging and Reshaping", "section": "merge() function", - "text": "merge() function\nWe will use the merge() function to conduct one-to-one merge\n\n?merge\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\nMerge Two Data Frames\nDescription:\n Merge two data frames by common columns or row names, or do other\n versions of database _join_ operations.\nUsage:\n merge(x, y, ...)\n \n ## Default S3 method:\n merge(x, y, ...)\n \n ## S3 method for class 'data.frame'\n merge(x, y, by = intersect(names(x), names(y)),\n by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,\n sort = TRUE, suffixes = c(\".x\",\".y\"), no.dups = TRUE,\n incomparables = NULL, ...)\n \nArguments:\nx, y: data frames, or objects to be coerced to one.\nby, by.x, by.y: specifications of the columns used for merging. See ‘Details’.\n all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =\n L', where 'L' is either 'TRUE' or 'FALSE'.\nall.x: logical; if ‘TRUE’, then extra rows will be added to the output, one for each row in ‘x’ that has no matching row in ‘y’. These rows will have ‘NA’s in those columns that are usually filled with values from ’y’. The default is ‘FALSE’, so that only rows with data from both ‘x’ and ‘y’ are included in the output.\nall.y: logical; analogous to ‘all.x’.\nsort: logical. Should the result be sorted on the 'by' columns?\nsuffixes: a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result which are not used for merging (appearing in ‘by’ etc).\nno.dups: logical indicating that ‘suffixes’ are appended in more cases to avoid duplicated column names in the result. This was implicitly false before R version 3.5.0.\nincomparables: values which cannot be matched. See ‘match’. This is intended to be used for merging on one column, so these are incomparable values of that column.\n ...: arguments to be passed to or from methods.\nDetails:\n 'merge' is a generic function whose principal method is for data\n frames: the default method coerces its arguments to data frames\n and calls the '\"data.frame\"' method.\n\n By default the data frames are merged on the columns with names\n they both have, but separate specifications of the columns can be\n given by 'by.x' and 'by.y'. The rows in the two data frames that\n match on the specified columns are extracted, and joined together.\n If there is more than one match, all possible matches contribute\n one row each. For the precise meaning of 'match', see 'match'.\n\n Columns to merge on can be specified by name, number or by a\n logical vector: the name '\"row.names\"' or the number '0' specifies\n the row names. 
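(A small hedged illustration of merging on row names; the toy data frames are invented:)

     a <- data.frame(x = 1:2, row.names = c("r1", "r2"))
     b <- data.frame(y = 3:4, row.names = c("r2", "r1"))
     merge(a, b, by = "row.names")   # result gains a 'Row.names' column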
If specified by name it must correspond uniquely\n to a named column in the input.\n\n If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero\n vector or 'NULL'), the result, 'r', is the _Cartesian product_ of\n 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +\n ncol(y))'.\n\n If 'all.x' is true, all the non matching cases of 'x' are appended\n to the result as well, with 'NA' filled in the corresponding\n columns of 'y'; analogously for 'all.y'.\n\n If the columns in the data frames not used in merging have any\n common names, these have 'suffixes' ('\".x\"' and '\".y\"' by default)\n appended to try to make the names of the result unique. If this\n is not possible, an error is thrown.\n\n If a 'by.x' column name matches one of 'y', and if 'no.dups' is\n true (as by default), the y version gets suffixed as well,\n avoiding duplicate column names in the result.\n\n The complexity of the algorithm used is proportional to the length\n of the answer.\n\n In SQL database terminology, the default value of 'all = FALSE'\n gives a _natural join_, a special case of an _inner join_.\n Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =\n TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)\n outer join_. DBMSes do not match 'NULL' records, equivalent to\n 'incomparables = NA' in R.\nValue:\n A data frame. The rows are by default lexicographically sorted on\n the common columns, but for 'sort = FALSE' are in an unspecified\n order. The columns are the common columns followed by the\n remaining columns in 'x' and then those in 'y'. If the matching\n involved row names, an extra character column called 'Row.names'\n is added at the left, and in all cases the result has 'automatic'\n row names.\nNote:\n This is intended to work with data frames with vector-like\n columns: some aspects work with data frames containing matrices,\n but not all.\n\n Currently long vectors are not accepted for inputs, which are thus\n restricted to less than 2^31 rows. 
That restriction also applies\n to the result for 32-bit platforms.\nSee Also:\n 'data.frame', 'by', 'cbind'.\n\n 'dendrogram' for a class which has a 'merge' method.\nExamples:\n authors <- data.frame(\n ## I(*) : use character columns of names to get sensible sort order\n surname = I(c(\"Tukey\", \"Venables\", \"Tierney\", \"Ripley\", \"McNeil\")),\n nationality = c(\"US\", \"Australia\", \"US\", \"UK\", \"Australia\"),\n deceased = c(\"yes\", rep(\"no\", 4)))\n authorN <- within(authors, { name <- surname; rm(surname) })\n books <- data.frame(\n name = I(c(\"Tukey\", \"Venables\", \"Tierney\",\n \"Ripley\", \"Ripley\", \"McNeil\", \"R Core\")),\n title = c(\"Exploratory Data Analysis\",\n \"Modern Applied Statistics ...\",\n \"LISP-STAT\",\n \"Spatial Statistics\", \"Stochastic Simulation\",\n \"Interactive Data Analysis\",\n \"An Introduction to R\"),\n other.author = c(NA, \"Ripley\", NA, NA, NA, NA,\n \"Venables & Smith\"))\n \n (m0 <- merge(authorN, books))\n (m1 <- merge(authors, books, by.x = \"surname\", by.y = \"name\"))\n m2 <- merge(books, authors, by.x = \"name\", by.y = \"surname\")\n stopifnot(exprs = {\n identical(m0, m2[, names(m0)])\n as.character(m1[, 1]) == as.character(m2[, 1])\n all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])\n identical(dim(merge(m1, m2, by = NULL)),\n c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))\n })\n \n ## \"R core\" is missing from authors and appears only here :\n merge(authors, books, by.x = \"surname\", by.y = \"name\", all = TRUE)\n \n \n ## example of using 'incomparables'\n x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)\n y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)\n merge(x, y, by = c(\"k1\",\"k2\")) # NA's match\n merge(x, y, by = \"k1\") # NA's match, so 6 rows\n merge(x, y, by = \"k2\", incomparables = NA) # 2 rows" + "text": "merge() function\nWe will use the merge() function to conduct one-to-one merge\n\n?merge\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\nMerge Two Data Frames\nDescription:\n Merge two data frames by common columns or row names, or do other\n versions of database _join_ operations.\nUsage:\n merge(x, y, ...)\n \n ## Default S3 method:\n merge(x, y, ...)\n \n ## S3 method for class 'data.frame'\n merge(x, y, by = intersect(names(x), names(y)),\n by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,\n sort = TRUE, suffixes = c(\".x\",\".y\"), no.dups = TRUE,\n incomparables = NULL, ...)\n \nArguments:\nx, y: data frames, or objects to be coerced to one.\nby, by.x, by.y: specifications of the columns used for merging. See ‘Details’.\n all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =\n L', where 'L' is either 'TRUE' or 'FALSE'.\nall.x: logical; if ‘TRUE’, then extra rows will be added to the output, one for each row in ‘x’ that has no matching row in ‘y’. These rows will have ‘NA’s in those columns that are usually filled with values from ’y’. The default is ‘FALSE’, so that only rows with data from both ‘x’ and ‘y’ are included in the output.\nall.y: logical; analogous to ‘all.x’.\nsort: logical. Should the result be sorted on the 'by' columns?\nsuffixes: a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result which are not used for merging (appearing in ‘by’ etc).\nno.dups: logical indicating that ‘suffixes’ are appended in more cases to avoid duplicated column names in the result. 
This was implicitly false before R version 3.5.0.\nincomparables: values which cannot be matched. See ‘match’. This is intended to be used for merging on one column, so these are incomparable values of that column.\n ...: arguments to be passed to or from methods.\nDetails:\n 'merge' is a generic function whose principal method is for data\n frames: the default method coerces its arguments to data frames\n and calls the '\"data.frame\"' method.\n\n By default the data frames are merged on the columns with names\n they both have, but separate specifications of the columns can be\n given by 'by.x' and 'by.y'. The rows in the two data frames that\n match on the specified columns are extracted, and joined together.\n If there is more than one match, all possible matches contribute\n one row each. For the precise meaning of 'match', see 'match'.\n\n Columns to merge on can be specified by name, number or by a\n logical vector: the name '\"row.names\"' or the number '0' specifies\n the row names. If specified by name it must correspond uniquely\n to a named column in the input.\n\n If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero\n vector or 'NULL'), the result, 'r', is the _Cartesian product_ of\n 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +\n ncol(y))'.\n\n If 'all.x' is true, all the non matching cases of 'x' are appended\n to the result as well, with 'NA' filled in the corresponding\n columns of 'y'; analogously for 'all.y'.\n\n If the columns in the data frames not used in merging have any\n common names, these have 'suffixes' ('\".x\"' and '\".y\"' by default)\n appended to try to make the names of the result unique. If this\n is not possible, an error is thrown.\n\n If a 'by.x' column name matches one of 'y', and if 'no.dups' is\n true (as by default), the y version gets suffixed as well,\n avoiding duplicate column names in the result.\n\n The complexity of the algorithm used is proportional to the length\n of the answer.\n\n In SQL database terminology, the default value of 'all = FALSE'\n gives a _natural join_, a special case of an _inner join_.\n Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =\n TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)\n outer join_. DBMSes do not match 'NULL' records, equivalent to\n 'incomparables = NA' in R.\nValue:\n A data frame. The rows are by default lexicographically sorted on\n the common columns, but for 'sort = FALSE' are in an unspecified\n order. The columns are the common columns followed by the\n remaining columns in 'x' and then those in 'y'. If the matching\n involved row names, an extra character column called 'Row.names'\n is added at the left, and in all cases the result has 'automatic'\n row names.\nNote:\n This is intended to work with data frames with vector-like\n columns: some aspects work with data frames containing matrices,\n but not all.\n\n Currently long vectors are not accepted for inputs, which are thus\n restricted to less than 2^31 rows. 
That restriction also applies\n to the result for 32-bit platforms.\nSee Also:\n 'data.frame', 'by', 'cbind'.\n\n 'dendrogram' for a class which has a 'merge' method.\nExamples:\n authors <- data.frame(\n ## I(*) : use character columns of names to get sensible sort order\n surname = I(c(\"Tukey\", \"Venables\", \"Tierney\", \"Ripley\", \"McNeil\")),\n nationality = c(\"US\", \"Australia\", \"US\", \"UK\", \"Australia\"),\n deceased = c(\"yes\", rep(\"no\", 4)))\n authorN <- within(authors, { name <- surname; rm(surname) })\n books <- data.frame(\n name = I(c(\"Tukey\", \"Venables\", \"Tierney\",\n \"Ripley\", \"Ripley\", \"McNeil\", \"R Core\")),\n title = c(\"Exploratory Data Analysis\",\n \"Modern Applied Statistics ...\",\n \"LISP-STAT\",\n \"Spatial Statistics\", \"Stochastic Simulation\",\n \"Interactive Data Analysis\",\n \"An Introduction to R\"),\n other.author = c(NA, \"Ripley\", NA, NA, NA, NA,\n \"Venables & Smith\"))\n \n (m0 <- merge(authorN, books))\n (m1 <- merge(authors, books, by.x = \"surname\", by.y = \"name\"))\n m2 <- merge(books, authors, by.x = \"name\", by.y = \"surname\")\n stopifnot(exprs = {\n identical(m0, m2[, names(m0)])\n as.character(m1[, 1]) == as.character(m2[, 1])\n all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])\n identical(dim(merge(m1, m2, by = NULL)),\n c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))\n })\n \n ## \"R core\" is missing from authors and appears only here :\n merge(authors, books, by.x = \"surname\", by.y = \"name\", all = TRUE)\n \n \n ## example of using 'incomparables'\n x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)\n y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)\n merge(x, y, by = c(\"k1\",\"k2\")) # NA's match\n merge(x, y, by = \"k1\") # NA's match, so 6 rows\n merge(x, y, by = \"k2\", incomparables = NA) # 2 rows", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#lets-import-the-new-data-we-want-to-merge-and-take-a-look", "href": "modules/Module08-DataMergeReshape.html#lets-import-the-new-data-we-want-to-merge-and-take-a-look", "title": "Module 8: Data Merging and Reshaping", "section": "Lets import the new data we want to merge and take a look", - "text": "Lets import the new data we want to merge and take a look\nThe new data serodata_new.csv represents a follow-up serological survey four years later. At this follow-up individuals were retested for IgG antibody concentrations and their ages were collected.\n\ndf_new <- read.csv(\"data/serodata_new.csv\")\nstr(df_new)\n\n'data.frame': 636 obs. of 3 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.261 2.981 0.282 136.638 0.381 ...\n $ age : int 6 8 8 8 5 8 8 NA 8 6 ...\n\nsummary(df_new)\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\n\n\n\n\n\nMin. :5006\nMin. : 0.0051\nMin. : 5.00\n\n\n\n1st Qu.:6328\n1st Qu.: 0.2751\n1st Qu.: 7.00\n\n\n\nMedian :7494\nMedian : 1.5477\nMedian :10.00\n\n\n\nMean :7490\nMean : 82.7684\nMean :10.63\n\n\n\n3rd Qu.:8736\n3rd Qu.:129.6389\n3rd Qu.:14.00\n\n\n\nMax. :9982\nMax. :950.6590\nMax. :19.00\n\n\n\nNA\nNA\nNA’s :9" + "text": "Lets import the new data we want to merge and take a look\nThe new data serodata_new.csv represents a follow-up serological survey four years later. 
At this follow-up individuals were retested for IgG antibody concentrations and their ages were collected.\n\ndf_new <- read.csv(\"data/serodata_new.csv\")\nstr(df_new)\n\n'data.frame': 636 obs. of 3 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.261 2.981 0.282 136.638 0.381 ...\n $ age : int 6 8 8 8 5 8 8 NA 8 6 ...\n\nsummary(df_new)\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\n\n\n\n\n\nMin. :5006\nMin. : 0.0051\nMin. : 5.00\n\n\n\n1st Qu.:6328\n1st Qu.: 0.2751\n1st Qu.: 7.00\n\n\n\nMedian :7494\nMedian : 1.5477\nMedian :10.00\n\n\n\nMean :7490\nMean : 82.7684\nMean :10.63\n\n\n\n3rd Qu.:8736\n3rd Qu.:129.6389\n3rd Qu.:14.00\n\n\n\nMax. :9982\nMax. :950.6590\nMax. :19.00\n\n\n\nNA\nNA\nNA’s :9", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#merge-the-new-data-with-the-original-data", "href": "modules/Module08-DataMergeReshape.html#merge-the-new-data-with-the-original-data", "title": "Module 8: Data Merging and Reshaping", "section": "Merge the new data with the original data", - "text": "Merge the new data with the original data\nLets load the old data as well and look for a variable, or variables, to merge by.\n\ndf <- read.csv(\"data/serodata.csv\")\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\n\nWe notice that observation_id seems to be the obvious variable by which to merge. However, we also realize that IgG_concentration and age are the exact same names. If we merge now we see that R has forced the IgG_concentration and age to have a .x or .y to make sure that these variables are different.\n\nhead(merge(df, df_new, all.x=T, all.y=T, by=c('observation_id')))\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration.x\nage.x\ngender\nslum\nIgG_concentration.y\nage.y\n\n\n\n\n5006\n164.2979452\n7\nMale\nNon slum\n155.5811325\n11\n\n\n5024\n0.3000000\n5\nFemale\nNon slum\n0.2918605\n9\n\n\n5026\n0.3000000\n10\nFemale\nNon slum\n0.2542945\n14\n\n\n5030\n0.0555556\n7\nFemale\nNon slum\n0.0533262\n11\n\n\n5035\n26.2112514\n11\nFemale\nNon slum\n22.0159300\n15\n\n\n5054\n0.3000000\n3\nMale\nNon slum\n0.2709671\n7" + "text": "Merge the new data with the original data\nLets load the old data as well and look for a variable, or variables, to merge by.\n\ndf <- read.csv(\"data/serodata.csv\")\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\n\nWe notice that observation_id seems to be the obvious variable by which to merge. However, we also realize that IgG_concentration and age are the exact same names. 
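(As a hedged aside that we do not use below: merge() also has a suffixes argument, seen in the help page above, so the clashing column names could be labelled directly at merge time, for example:)

     merge(df, df_new, all.x = TRUE, all.y = TRUE, by = "observation_id",
           suffixes = c("_time1", "_time2"))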
If we merge now we see that R has forced the IgG_concentration and age to have a .x or .y to make sure that these variables are different.\n\nhead(merge(df, df_new, all.x=T, all.y=T, by=c('observation_id')))\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration.x\nage.x\ngender\nslum\nIgG_concentration.y\nage.y\n\n\n\n\n5006\n164.2979452\n7\nMale\nNon slum\n155.5811325\n11\n\n\n5024\n0.3000000\n5\nFemale\nNon slum\n0.2918605\n9\n\n\n5026\n0.3000000\n10\nFemale\nNon slum\n0.2542945\n14\n\n\n5030\n0.0555556\n7\nFemale\nNon slum\n0.0533262\n11\n\n\n5035\n26.2112514\n11\nFemale\nNon slum\n22.0159300\n15\n\n\n5054\n0.3000000\n3\nMale\nNon slum\n0.2709671\n7", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#merge-the-new-data-with-the-original-data-1", "href": "modules/Module08-DataMergeReshape.html#merge-the-new-data-with-the-original-data-1", "title": "Module 8: Data Merging and Reshaping", "section": "Merge the new data with the original data", - "text": "Merge the new data with the original data\nWhat do we do?\nThe first option is to rename the IgG_concentration and age variables before the merge, so that it is clear which is time point 1 and time point 2.\n\ndf$IgG_concentration_time1 <- df$IgG_concentration\ndf$age_time1 <- df$age\ndf$IgG_concentration <- df$age <- NULL #remove the original variables\n\ndf_new$IgG_concentration_time2 <- df_new$IgG_concentration\ndf_new$age_time2 <- df_new$age\ndf_new$IgG_concentration <- df_new$age <- NULL #remove the original variables\n\nNow, lets merge.\n\ndf_all_wide <- merge(df, df_new, all.x=T, all.y=T, by=c('observation_id'))\nstr(df_all_wide)\n\n'data.frame': 651 obs. of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time1 : int 7 5 10 7 11 3 3 12 14 6 ...\n $ IgG_concentration_time2: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time2 : int 11 9 14 11 15 7 7 16 18 10 ..." + "text": "Merge the new data with the original data\nWhat do we do?\nThe first option is to rename the IgG_concentration and age variables before the merge, so that it is clear which is time point 1 and time point 2.\n\ndf$IgG_concentration_time1 <- df$IgG_concentration\ndf$age_time1 <- df$age\ndf$IgG_concentration <- df$age <- NULL #remove the original variables\n\ndf_new$IgG_concentration_time2 <- df_new$IgG_concentration\ndf_new$age_time2 <- df_new$age\ndf_new$IgG_concentration <- df_new$age <- NULL #remove the original variables\n\nNow, lets merge.\n\ndf_all_wide <- merge(df, df_new, all.x=T, all.y=T, by=c('observation_id'))\nstr(df_all_wide)\n\n'data.frame': 651 obs. 
of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time1 : int 7 5 10 7 11 3 3 12 14 6 ...\n $ IgG_concentration_time2: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time2 : int 11 9 14 11 15 7 7 16 18 10 ...", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#merge-the-new-data-with-the-original-data-2", "href": "modules/Module08-DataMergeReshape.html#merge-the-new-data-with-the-original-data-2", "title": "Module 8: Data Merging and Reshaping", "section": "Merge the new data with the original data", - "text": "Merge the new data with the original data\nThe second option is to add a time variable to the two data sets and then merge by observation_id, time, age, and IgG_concentration. Note, I need to read in the data again b/c I removed the IgG_concentration and age variables.\n\ndf <- read.csv(\"data/serodata.csv\")\ndf_new <- read.csv(\"data/serodata_new.csv\")\n\n\ndf$time <- 1 #you can put in one number and it will repeat it\ndf_new$time <- 2\nhead(df)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\ntime\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n1\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n1\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n1\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n1\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n1\n\n\n\n\nhead(df_new)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ntime\n\n\n\n\n5772\n0.2612388\n6\n2\n\n\n8095\n2.9809049\n8\n2\n\n\n9784\n0.2819489\n8\n2\n\n\n9338\n136.6382260\n8\n2\n\n\n6369\n0.3810119\n5\n2\n\n\n6885\n0.0245951\n8\n2\n\n\n\n\n\nNow, lets merge. Note, “By default the data frames are merged on the columns with names they both have” therefore if I don’t specify the by argument it will merge on all matching variables.\n\ndf_all_long <- merge(df, df_new, all.x=T, all.y=T)\nhead(df_all_long)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ntime\ngender\nslum\n\n\n\n\n5006\n155.5811325\n11\n2\nNA\nNA\n\n\n5006\n164.2979452\n7\n1\nMale\nNon slum\n\n\n5024\n0.2918605\n9\n2\nNA\nNA\n\n\n5024\n0.3000000\n5\n1\nFemale\nNon slum\n\n\n5026\n0.2542945\n14\n2\nNA\nNA\n\n\n5026\n0.3000000\n10\n1\nFemale\nNon slum\n\n\n\n\n\nNote, there are 1287 rows, which is the sum of the number of rows of df (651 rows) and df_new (636 rows)\nNotice that there are some missing values though, because df_new doesn’t have the gender or slum variables. If we assume that those are constant and don’t change between the two study points, we can fill in the data points before merging for an easy solution. One easy way to make a new dataframe from df_new with extra columns is to use the transform() function, which lets us make multiple column changes to a data frame at one time. 
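As an alternative to renaming the time-varying columns by hand before merging, `merge()` can label the clashing columns itself through its `suffixes` argument. This is a sketch assuming `df` and `df_new` as originally read in (before the columns above were renamed or dropped); `df_both` is an illustrative name:

```r
df_both <- merge(df, df_new,
                 by = "observation_id",
                 all = TRUE,                       # keep unmatched rows from both surveys
                 suffixes = c("_time1", "_time2")) # instead of the default .x/.y
colnames(df_both)
```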
We just need to make sure to match the correct observation_id values together, using the match() function.\n\ndf_new_filled <- transform(\n df_new,\n gender = df[match(df_new$observation_id, df$observation_id), \"gender\"],\n slum = df[match(df_new$observation_id, df$observation_id), \"slum\"]\n)\n\nNow we can redo the merge.\n\ndf_all_long <- merge(df, df_new_filled, all.x=T, all.y=T)\nhead(df_all_long)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\ntime\n\n\n\n\n5006\n155.5811325\n11\nMale\nNon slum\n2\n\n\n5006\n164.2979452\n7\nMale\nNon slum\n1\n\n\n5024\n0.2918605\n9\nFemale\nNon slum\n2\n\n\n5024\n0.3000000\n5\nFemale\nNon slum\n1\n\n\n5026\n0.2542945\n14\nFemale\nNon slum\n2\n\n\n5026\n0.3000000\n10\nFemale\nNon slum\n1\n\n\n\n\n\nLooks good now! Another solution would be to edit the data file, or use a function that can actually fill in missing values for the same individual, like zoo::na.locf()." + "text": "Merge the new data with the original data\nThe second option is to add a time variable to the two data sets and then merge by observation_id, time, age, and IgG_concentration. Note, I need to read in the data again b/c I removed the IgG_concentration and age variables.\n\ndf <- read.csv(\"data/serodata.csv\")\ndf_new <- read.csv(\"data/serodata_new.csv\")\n\n\ndf$time <- 1 #you can put in one number and it will repeat it\ndf_new$time <- 2\nhead(df)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\ntime\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n1\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n1\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n1\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n1\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n1\n\n\n\n\nhead(df_new)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ntime\n\n\n\n\n5772\n0.2612388\n6\n2\n\n\n8095\n2.9809049\n8\n2\n\n\n9784\n0.2819489\n8\n2\n\n\n9338\n136.6382260\n8\n2\n\n\n6369\n0.3810119\n5\n2\n\n\n6885\n0.0245951\n8\n2\n\n\n\n\n\nNow, lets merge. Note, “By default the data frames are merged on the columns with names they both have” therefore if I don’t specify the by argument it will merge on all matching variables.\n\ndf_all_long <- merge(df, df_new, all.x=T, all.y=T)\nhead(df_all_long)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ntime\ngender\nslum\n\n\n\n\n5006\n155.5811325\n11\n2\nNA\nNA\n\n\n5006\n164.2979452\n7\n1\nMale\nNon slum\n\n\n5024\n0.2918605\n9\n2\nNA\nNA\n\n\n5024\n0.3000000\n5\n1\nFemale\nNon slum\n\n\n5026\n0.2542945\n14\n2\nNA\nNA\n\n\n5026\n0.3000000\n10\n1\nFemale\nNon slum\n\n\n\n\n\nNote, there are 1287 rows, which is the sum of the number of rows of df (651 rows) and df_new (636 rows)\nNotice that there are some missing values though, because df_new doesn’t have the gender or slum variables. If we assume that those are constant and don’t change between the two study points, we can fill in the data points before merging for an easy solution. One easy way to make a new dataframe from df_new with extra columns is to use the transform() function, which lets us make multiple column changes to a data frame at one time. 
We just need to make sure to match the correct observation_id values together, using the match() function.\n\ndf_new_filled <- transform(\n df_new,\n gender = df[match(df_new$observation_id, df$observation_id), \"gender\"],\n slum = df[match(df_new$observation_id, df$observation_id), \"slum\"]\n)\n\nNow we can redo the merge.\n\ndf_all_long <- merge(df, df_new_filled, all.x=T, all.y=T)\nhead(df_all_long)\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\ntime\n\n\n\n\n5006\n155.5811325\n11\nMale\nNon slum\n2\n\n\n5006\n164.2979452\n7\nMale\nNon slum\n1\n\n\n5024\n0.2918605\n9\nFemale\nNon slum\n2\n\n\n5024\n0.3000000\n5\nFemale\nNon slum\n1\n\n\n5026\n0.2542945\n14\nFemale\nNon slum\n2\n\n\n5026\n0.3000000\n10\nFemale\nNon slum\n1\n\n\n\n\n\nLooks good now! Another solution would be to edit the data file, or use a function that can actually fill in missing values for the same individual, like zoo::na.locf().", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#what-is-widelong-data", "href": "modules/Module08-DataMergeReshape.html#what-is-widelong-data", "title": "Module 8: Data Merging and Reshaping", "section": "What is wide/long data?", - "text": "What is wide/long data?\nAbove, we actually created a wide and long version of the data.\nWide: has many columns\n\nmultiple columns per individual, values spread across multiple columns\neasier for humans to read\n\nLong: has many rows\n\ncolumn names become data\nmultiple rows per observation, a single column contains the values\neasier for R to make plots & do analysis" + "text": "What is wide/long data?\nAbove, we actually created a wide and long version of the data.\nWide: has many columns\n\nmultiple columns per individual, values spread across multiple columns\neasier for humans to read\n\nLong: has many rows\n\ncolumn names become data\nmultiple rows per observation, a single column contains the values\neasier for R to make plots & do analysis", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#reshape-function", "href": "modules/Module08-DataMergeReshape.html#reshape-function", "title": "Module 8: Data Merging and Reshaping", "section": "reshape() function", - "text": "reshape() function\nThe reshape() function allows you to toggle between wide and long data\n\n?reshape\n\nReshape Grouped Data\nDescription:\n This function reshapes a data frame between 'wide' format (with\n repeated measurements in separate columns of the same row) and\n 'long' format (with the repeated measurements in separate rows).\nUsage:\n reshape(data, varying = NULL, v.names = NULL, timevar = \"time\",\n idvar = \"id\", ids = 1:NROW(data),\n times = seq_along(varying[[1]]),\n drop = NULL, direction, new.row.names = NULL,\n sep = \".\",\n split = if (sep == \"\") {\n list(regexp = \"[A-Za-z][0-9]\", include = TRUE)\n } else {\n list(regexp = sep, include = FALSE, fixed = TRUE)}\n )\n \n ### Typical usage for converting from long to wide format:\n \n # reshape(data, direction = \"wide\",\n # idvar = \"___\", timevar = \"___\", # mandatory\n # v.names = c(___), # time-varying variables\n # varying = list(___)) # auto-generated if missing\n \n ### Typical usage for converting from wide to long format:\n \n ### If names of wide-format variables are in a 'nice' format\n \n # reshape(data, direction = \"long\",\n # varying = c(___), # vector \n # sep) # to help guess 'v.names' and 'times'\n \n ### To 
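After the `transform()`/`match()` fill above, a quick sanity check shows whether any follow-up record failed to find a match in the original data. A sketch; the counts depend on how many IDs overlap:

```r
# Number of follow-up rows whose gender/slum could not be carried over
sum(is.na(df_new_filled$gender))
sum(is.na(df_new_filled$slum))
```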
specify long-format variable names explicitly\n \n # reshape(data, direction = \"long\",\n # varying = ___, # list / matrix / vector (use with care)\n # v.names = ___, # vector of variable names in long format\n # timevar, times, # name / values of constructed time variable\n # idvar, ids) # name / values of constructed id variable\n \nArguments:\ndata: a data frame\nvarying: names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, when ‘direction = “long”’, the names can be replaced by indices which are interpreted as referring to ‘names(data)’. See ‘Details’ for more details and options.\nv.names: names of variables in the long format that correspond to multiple variables in the wide format. See ‘Details’.\ntimevar: the variable in long format that differentiates multiple records from the same group or individual. If more than one record matches, the first will be taken (with a warning).\nidvar: Names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.\n ids: the values to use for a newly created 'idvar' variable in\n long format.\ntimes: the values to use for a newly created ‘timevar’ variable in long format. See ‘Details’.\ndrop: a vector of names of variables to drop before reshaping.\ndirection: character string, partially matched to either ‘“wide”’ to reshape to wide format, or ‘“long”’ to reshape to long format.\nnew.row.names: character or ‘NULL’: a non-null value will be used for the row names of the result.\n sep: A character vector of length 1, indicating a separating\n character in the variable names in the wide format. This is\n used for guessing 'v.names' and 'times' arguments based on\n the names in 'varying'. If 'sep == \"\"', the split is just\n before the first numeral that follows an alphabetic\n character. This is also used to create variable names when\n reshaping to wide format.\nsplit: A list with three components, ‘regexp’, ‘include’, and (optionally) ‘fixed’. This allows an extended interface to variable name splitting. See ‘Details’.\nDetails:\n Although 'reshape()' can be used in a variety of contexts, the\n motivating application is data from longitudinal studies, and the\n arguments of this function are named and described in those terms.\n A longitudinal study is characterized by repeated measurements of\n the same variable(s), e.g., height and weight, on each unit being\n studied (e.g., individual persons) at different time points (which\n are assumed to be the same for all units). These variables are\n called time-varying variables. The study may include other\n variables that are measured only once for each unit and do not\n vary with time (e.g., gender and race); these are called\n time-constant variables.\n\n A 'wide' format representation of a longitudinal dataset will have\n one record (row) for each unit, typically with some time-constant\n variables that occupy single columns, and some time-varying\n variables that occupy multiple columns (one column for each time\n point). A 'long' format representation of the same dataset will\n have multiple records (rows) for each individual, with the\n time-constant variables being constant across these records and\n the time-varying variables varying across the records. 
The 'long'\n format dataset will have two additional variables: a 'time'\n variable identifying which time point each record comes from, and\n an 'id' variable showing which records refer to the same unit.\n\n The type of conversion (long to wide or wide to long) is\n determined by the 'direction' argument, which is mandatory unless\n the 'data' argument is the result of a previous call to 'reshape'.\n In that case, the operation can be reversed simply using\n 'reshape(data)' (the other arguments are stored as attributes on\n the data frame).\n\n Conversion from long to wide format with 'direction = \"wide\"' is\n the simpler operation, and is mainly useful in the context of\n multivariate analysis where data is often expected as a\n wide-format matrix. In this case, the time variable 'timevar' and\n id variable 'idvar' must be specified. All other variables are\n assumed to be time-varying, unless the time-varying variables are\n explicitly specified via the 'v.names' argument. A warning is\n issued if time-constant variables are not actually constant.\n\n Each time-varying variable is expanded into multiple variables in\n the wide format. The names of these expanded variables are\n generated automatically, unless they are specified as the\n 'varying' argument in the form of a list (or matrix) with one\n component (or row) for each time-varying variable. If 'varying' is\n a vector of names, it is implicitly converted into a matrix, with\n one row for each time-varying variable. Use this option with care\n if there are multiple time-varying variables, as the ordering (by\n column, the default in the 'matrix' constructor) may be\n unintuitive, whereas the explicit list or matrix form is\n unambiguous.\n\n Conversion from wide to long with 'direction = \"long\"' is the more\n common operation as most (univariate) statistical modeling\n functions expect data in the long format. In the simpler case\n where there is only one time-varying variable, the corresponding\n columns in the wide format input can be specified as the 'varying'\n argument, which can be either a vector of column names or the\n corresponding column indices. The name of the corresponding\n variable in the long format output combining these columns can be\n optionally specified as the 'v.names' argument, and the name of\n the time variables as the 'timevar' argument. The values to use as\n the time values corresponding to the different columns in the wide\n format can be specified as the 'times' argument. If 'v.names' is\n unspecified, the function will attempt to guess 'v.names' and\n 'times' from 'varying' (an explicitly specified 'times' argument\n is unused in that case). The default expects variable names like\n 'x.1', 'x.2', where 'sep = \".\"' specifies to split at the dot and\n drop it from the name. To have alphabetic followed by numeric\n times use 'sep = \"\"'.\n\n Multiple time-varying variables can be specified in two ways,\n either with 'varying' as an atomic vector as above, or as a list\n (or a matrix). The first form is useful (and mandatory) if the\n automatic variable name splitting as described above is used; this\n requires the names of all time-varying variables to be suitably\n formatted in the same manner, and 'v.names' to be unspecified. 
If\n 'varying' is a list (with one component for each time-varying\n variable) or a matrix (one row for each time-varying variable),\n variable name splitting is not attempted, and 'v.names' and\n 'times' will generally need to be specified, although they will\n default to, respectively, the first variable name in each set, and\n sequential times.\n\n Also, guessing is not attempted if 'v.names' is given explicitly,\n even if 'varying' is an atomic vector. In that case, the number of\n time-varying variables is taken to be the length of 'v.names', and\n 'varying' is implicitly converted into a matrix, with one row for\n each time-varying variable. As in the case of long to wide\n conversion, the matrix is filled up by column, so careful\n attention needs to be paid to the order of variable names (or\n indices) in 'varying', which is taken to be like 'x.1', 'y.1',\n 'x.2', 'y.2' (i.e., variables corresponding to the same time point\n need to be grouped together).\n\n The 'split' argument should not usually be necessary. The\n 'split$regexp' component is passed to either 'strsplit' or\n 'regexpr', where the latter is used if 'split$include' is 'TRUE',\n in which case the splitting occurs after the first character of\n the matched string. In the 'strsplit' case, the separator is not\n included in the result, and it is possible to specify fixed-string\n matching using 'split$fixed'.\nValue:\n The reshaped data frame with added attributes to simplify\n reshaping back to the original form.\nSee Also:\n 'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.\n 'xtabs' and 'as.data.frame.table' for creating contingency tables\n and converting them back to data frames.\nExamples:\n summary(Indometh) # data in long format\n \n ## long to wide (direction = \"wide\") requires idvar and timevar at a minimum\n reshape(Indometh, direction = \"wide\", idvar = \"Subject\", timevar = \"time\")\n \n ## can also explicitly specify name of combined variable\n wide <- reshape(Indometh, direction = \"wide\", idvar = \"Subject\",\n timevar = \"time\", v.names = \"conc\", sep= \"_\")\n wide\n \n ## reverse transformation\n reshape(wide, direction = \"long\")\n reshape(wide, idvar = \"Subject\", varying = list(2:12),\n v.names = \"conc\", direction = \"long\")\n \n ## times need not be numeric\n df <- data.frame(id = rep(1:4, rep(2,4)),\n visit = I(rep(c(\"Before\",\"After\"), 4)),\n x = rnorm(4), y = runif(4))\n df\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n ## warns that y is really varying\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\", v.names = \"x\")\n \n \n ## unbalanced 'long' data leads to NA fill in 'wide' form\n df2 <- df[1:7, ]\n df2\n reshape(df2, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n \n ## Alternative regular expressions for guessing names\n df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),\n dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))\n reshape(df3, direction = \"long\", varying = 3:5, sep = \"\")\n \n \n ## an example that isn't longitudinal data\n state.x77 <- as.data.frame(state.x77)\n long <- reshape(state.x77, idvar = \"state\", ids = row.names(state.x77),\n times = names(state.x77), timevar = \"Characteristic\",\n varying = list(names(state.x77)), direction = \"long\")\n \n reshape(long, direction = \"wide\")\n \n reshape(long, direction = \"wide\", new.row.names = unique(long$state))\n \n ## multiple id variables\n df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),\n time = 
rep(c(1,1,2,2), 3), score = rnorm(12))\n wide <- reshape(df3, idvar = c(\"school\", \"class\"), direction = \"wide\")\n wide\n ## transform back\n reshape(wide)" + "text": "reshape() function\nThe reshape() function allows you to toggle between wide and long data\n\n?reshape\n\nReshape Grouped Data\nDescription:\n This function reshapes a data frame between 'wide' format (with\n repeated measurements in separate columns of the same row) and\n 'long' format (with the repeated measurements in separate rows).\nUsage:\n reshape(data, varying = NULL, v.names = NULL, timevar = \"time\",\n idvar = \"id\", ids = 1:NROW(data),\n times = seq_along(varying[[1]]),\n drop = NULL, direction, new.row.names = NULL,\n sep = \".\",\n split = if (sep == \"\") {\n list(regexp = \"[A-Za-z][0-9]\", include = TRUE)\n } else {\n list(regexp = sep, include = FALSE, fixed = TRUE)}\n )\n \n ### Typical usage for converting from long to wide format:\n \n # reshape(data, direction = \"wide\",\n # idvar = \"___\", timevar = \"___\", # mandatory\n # v.names = c(___), # time-varying variables\n # varying = list(___)) # auto-generated if missing\n \n ### Typical usage for converting from wide to long format:\n \n ### If names of wide-format variables are in a 'nice' format\n \n # reshape(data, direction = \"long\",\n # varying = c(___), # vector \n # sep) # to help guess 'v.names' and 'times'\n \n ### To specify long-format variable names explicitly\n \n # reshape(data, direction = \"long\",\n # varying = ___, # list / matrix / vector (use with care)\n # v.names = ___, # vector of variable names in long format\n # timevar, times, # name / values of constructed time variable\n # idvar, ids) # name / values of constructed id variable\n \nArguments:\ndata: a data frame\nvarying: names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, when ‘direction = “long”’, the names can be replaced by indices which are interpreted as referring to ‘names(data)’. See ‘Details’ for more details and options.\nv.names: names of variables in the long format that correspond to multiple variables in the wide format. See ‘Details’.\ntimevar: the variable in long format that differentiates multiple records from the same group or individual. If more than one record matches, the first will be taken (with a warning).\nidvar: Names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.\n ids: the values to use for a newly created 'idvar' variable in\n long format.\ntimes: the values to use for a newly created ‘timevar’ variable in long format. See ‘Details’.\ndrop: a vector of names of variables to drop before reshaping.\ndirection: character string, partially matched to either ‘“wide”’ to reshape to wide format, or ‘“long”’ to reshape to long format.\nnew.row.names: character or ‘NULL’: a non-null value will be used for the row names of the result.\n sep: A character vector of length 1, indicating a separating\n character in the variable names in the wide format. This is\n used for guessing 'v.names' and 'times' arguments based on\n the names in 'varying'. If 'sep == \"\"', the split is just\n before the first numeral that follows an alphabetic\n character. 
This is also used to create variable names when\n reshaping to wide format.\nsplit: A list with three components, ‘regexp’, ‘include’, and (optionally) ‘fixed’. This allows an extended interface to variable name splitting. See ‘Details’.\nDetails:\n Although 'reshape()' can be used in a variety of contexts, the\n motivating application is data from longitudinal studies, and the\n arguments of this function are named and described in those terms.\n A longitudinal study is characterized by repeated measurements of\n the same variable(s), e.g., height and weight, on each unit being\n studied (e.g., individual persons) at different time points (which\n are assumed to be the same for all units). These variables are\n called time-varying variables. The study may include other\n variables that are measured only once for each unit and do not\n vary with time (e.g., gender and race); these are called\n time-constant variables.\n\n A 'wide' format representation of a longitudinal dataset will have\n one record (row) for each unit, typically with some time-constant\n variables that occupy single columns, and some time-varying\n variables that occupy multiple columns (one column for each time\n point). A 'long' format representation of the same dataset will\n have multiple records (rows) for each individual, with the\n time-constant variables being constant across these records and\n the time-varying variables varying across the records. The 'long'\n format dataset will have two additional variables: a 'time'\n variable identifying which time point each record comes from, and\n an 'id' variable showing which records refer to the same unit.\n\n The type of conversion (long to wide or wide to long) is\n determined by the 'direction' argument, which is mandatory unless\n the 'data' argument is the result of a previous call to 'reshape'.\n In that case, the operation can be reversed simply using\n 'reshape(data)' (the other arguments are stored as attributes on\n the data frame).\n\n Conversion from long to wide format with 'direction = \"wide\"' is\n the simpler operation, and is mainly useful in the context of\n multivariate analysis where data is often expected as a\n wide-format matrix. In this case, the time variable 'timevar' and\n id variable 'idvar' must be specified. All other variables are\n assumed to be time-varying, unless the time-varying variables are\n explicitly specified via the 'v.names' argument. A warning is\n issued if time-constant variables are not actually constant.\n\n Each time-varying variable is expanded into multiple variables in\n the wide format. The names of these expanded variables are\n generated automatically, unless they are specified as the\n 'varying' argument in the form of a list (or matrix) with one\n component (or row) for each time-varying variable. If 'varying' is\n a vector of names, it is implicitly converted into a matrix, with\n one row for each time-varying variable. Use this option with care\n if there are multiple time-varying variables, as the ordering (by\n column, the default in the 'matrix' constructor) may be\n unintuitive, whereas the explicit list or matrix form is\n unambiguous.\n\n Conversion from wide to long with 'direction = \"long\"' is the more\n common operation as most (univariate) statistical modeling\n functions expect data in the long format. 
In the simpler case\n where there is only one time-varying variable, the corresponding\n columns in the wide format input can be specified as the 'varying'\n argument, which can be either a vector of column names or the\n corresponding column indices. The name of the corresponding\n variable in the long format output combining these columns can be\n optionally specified as the 'v.names' argument, and the name of\n the time variables as the 'timevar' argument. The values to use as\n the time values corresponding to the different columns in the wide\n format can be specified as the 'times' argument. If 'v.names' is\n unspecified, the function will attempt to guess 'v.names' and\n 'times' from 'varying' (an explicitly specified 'times' argument\n is unused in that case). The default expects variable names like\n 'x.1', 'x.2', where 'sep = \".\"' specifies to split at the dot and\n drop it from the name. To have alphabetic followed by numeric\n times use 'sep = \"\"'.\n\n Multiple time-varying variables can be specified in two ways,\n either with 'varying' as an atomic vector as above, or as a list\n (or a matrix). The first form is useful (and mandatory) if the\n automatic variable name splitting as described above is used; this\n requires the names of all time-varying variables to be suitably\n formatted in the same manner, and 'v.names' to be unspecified. If\n 'varying' is a list (with one component for each time-varying\n variable) or a matrix (one row for each time-varying variable),\n variable name splitting is not attempted, and 'v.names' and\n 'times' will generally need to be specified, although they will\n default to, respectively, the first variable name in each set, and\n sequential times.\n\n Also, guessing is not attempted if 'v.names' is given explicitly,\n even if 'varying' is an atomic vector. In that case, the number of\n time-varying variables is taken to be the length of 'v.names', and\n 'varying' is implicitly converted into a matrix, with one row for\n each time-varying variable. As in the case of long to wide\n conversion, the matrix is filled up by column, so careful\n attention needs to be paid to the order of variable names (or\n indices) in 'varying', which is taken to be like 'x.1', 'y.1',\n 'x.2', 'y.2' (i.e., variables corresponding to the same time point\n need to be grouped together).\n\n The 'split' argument should not usually be necessary. The\n 'split$regexp' component is passed to either 'strsplit' or\n 'regexpr', where the latter is used if 'split$include' is 'TRUE',\n in which case the splitting occurs after the first character of\n the matched string. 
In the 'strsplit' case, the separator is not\n included in the result, and it is possible to specify fixed-string\n matching using 'split$fixed'.\nValue:\n The reshaped data frame with added attributes to simplify\n reshaping back to the original form.\nSee Also:\n 'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.\n 'xtabs' and 'as.data.frame.table' for creating contingency tables\n and converting them back to data frames.\nExamples:\n summary(Indometh) # data in long format\n \n ## long to wide (direction = \"wide\") requires idvar and timevar at a minimum\n reshape(Indometh, direction = \"wide\", idvar = \"Subject\", timevar = \"time\")\n \n ## can also explicitly specify name of combined variable\n wide <- reshape(Indometh, direction = \"wide\", idvar = \"Subject\",\n timevar = \"time\", v.names = \"conc\", sep= \"_\")\n wide\n \n ## reverse transformation\n reshape(wide, direction = \"long\")\n reshape(wide, idvar = \"Subject\", varying = list(2:12),\n v.names = \"conc\", direction = \"long\")\n \n ## times need not be numeric\n df <- data.frame(id = rep(1:4, rep(2,4)),\n visit = I(rep(c(\"Before\",\"After\"), 4)),\n x = rnorm(4), y = runif(4))\n df\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n ## warns that y is really varying\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\", v.names = \"x\")\n \n \n ## unbalanced 'long' data leads to NA fill in 'wide' form\n df2 <- df[1:7, ]\n df2\n reshape(df2, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n \n ## Alternative regular expressions for guessing names\n df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),\n dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))\n reshape(df3, direction = \"long\", varying = 3:5, sep = \"\")\n \n \n ## an example that isn't longitudinal data\n state.x77 <- as.data.frame(state.x77)\n long <- reshape(state.x77, idvar = \"state\", ids = row.names(state.x77),\n times = names(state.x77), timevar = \"Characteristic\",\n varying = list(names(state.x77)), direction = \"long\")\n \n reshape(long, direction = \"wide\")\n \n reshape(long, direction = \"wide\", new.row.names = unique(long$state))\n \n ## multiple id variables\n df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),\n time = rep(c(1,1,2,2), 3), score = rnorm(12))\n wide <- reshape(df3, idvar = c(\"school\", \"class\"), direction = \"wide\")\n wide\n ## transform back\n reshape(wide)", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] + }, + { + "objectID": "modules/Module08-DataMergeReshape.html#wide-to-long-data", + "href": "modules/Module08-DataMergeReshape.html#wide-to-long-data", + "title": "Module 8: Data Merging and Reshaping", + "section": "wide to long data", + "text": "wide to long data\nReminder: “typical usage for converting from long to wide format”\n\n### If names of wide-format variables are in a 'nice' format\n\nreshape(data, direction = \"long\",\n varying = c(___), # vector \n sep) # to help guess 'v.names' and 'times'\n\n### To specify long-format variable names explicitly\n\nreshape(data, direction = \"long\",\n varying = ___, # list / matrix / vector (use with care)\n v.names = ___, # vector of variable names in long format\n timevar, times, # name / values of constructed time variable\n idvar, ids) # name / values of constructed id variable\n\nWe can try to apply that to our data.\n\ndf_wide_to_long <-\n reshape(\n # First argument is the wide-format data frame to be reshaped\n df_all_wide,\n # We are inputting wide 
data and expect long format as output\n direction = \"long\",\n # \"varying\" argument is a list of vectors. Each vector in the list is a\n # group of time-varying (or grouping-factor-varying) variables which\n # should become one variable after reformat. We want two variables after\n # reformating, so we need two vectors in a list.\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n ),\n # \"v.names\" is a vector of names for the new long-format variables, it\n # should have the same length as the list for varying and the names will\n # be assigned in order.\n v.names = c(\"IgG_concentration\", \"age\"),\n # Name of the variable for the time index that will be created\n timevar = \"time\",\n # Values of the time variable that should be created. Note that if you\n # have any missing observations over time, they NEED to be in the dataset\n # as NAs or your times will get messed up.\n times = 1:2,\n # 'idvar' is a variable that marks which records belong to each\n # observational unit, for us that is the ID marking individuals.\n idvar = \"observation_id\"\n )\n\nNotice that this has exactly twice as many rows as our wide data format, and doesn’t appear to have any systematic missingness, so it seems correct.\n\nstr(df_wide_to_long)\n\n'data.frame': 1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n\nnrow(df_wide_to_long)\n\n[1] 1302\n\nnrow(df_all_wide)\n\n[1] 651", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#long-to-wide-data", "href": "modules/Module08-DataMergeReshape.html#long-to-wide-data", "title": "Module 8: Data Merging and Reshaping", "section": "long to wide data", - "text": "long to wide data\nReminder: “typical usage for converting from long to wide format”\n\nreshape(data, direction = \"wide\",\n idvar = \"___\", timevar = \"___\", # mandatory\n v.names = c(___), # time-varying variables\n varying = list(___)) # auto-generated if missing\n\nWe can try to apply that to our data. Note that the arguments are the same as in the wide to long case, but we don’t need to specify the times argument because they are in the data already. The varying argument is optional also, and R will auto-generate names for the wide variables if it is left empty.\n\ndf_long_to_wide <-\n reshape(\n df_all_long,\n direction = \"wide\",\n idvar = \"observation_id\",\n timevar = \"time\",\n v.names = c(\"IgG_concentration\", \"age\"),\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n )\n )\n\nWe can do the same checks to make sure we pivoted correctly.\n\nstr(df_long_to_wide)\n\n'data.frame': 651 obs. 
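One small convenience after the wide-to-long reshape above: `reshape()` returns the long data stacked in time blocks, so sorting by individual and time makes it easier to compare against `df_all_long`. A sketch:

```r
# Reorder so the two records for each individual sit next to each other
df_wide_to_long <- df_wide_to_long[order(df_wide_to_long$observation_id,
                                         df_wide_to_long$time), ]
head(df_wide_to_long)
```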
of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time1 : int 11 9 14 11 15 7 7 16 18 10 ...\n $ IgG_concentration_time2: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time2 : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeWide\")=List of 5\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ timevar: chr \"time\"\n ..$ idvar : chr \"observation_id\"\n ..$ times : num [1:2] 2 1\n ..$ varying: chr [1:2, 1:2] \"IgG_concentration_time1\" \"age_time1\" \"IgG_concentration_time2\" \"age_time2\"\n\nnrow(df_long_to_wide)\n\n[1] 651\n\nnrow(df_all_long)\n\n[1] 1287\n\n\nNote that this time we don’t have exactly twice as many records because of some quirks in how reshape() works. When we go from wide to long, R will create new records with NA values at the second time point for the individuals who were not in the second study – it won’t do that when we go from long to wide data. This is why it can be important to make sure all of your missing data are explicit rather than implicit.\n\n# For the original long dataset, we can see that not all individuals have 2\n# time points\nall(table(df_all_long$observation_id) == 2)\n\n[1] FALSE\n\n# But for the reshaped version they do all have 2 time points\nall(table(df_wide_to_long$observation_id) == 2)\n\n[1] TRUE" + "text": "long to wide data\nReminder: “typical usage for converting from long to wide format”\n\nreshape(data, direction = \"wide\",\n idvar = \"___\", timevar = \"___\", # mandatory\n v.names = c(___), # time-varying variables\n varying = list(___)) # auto-generated if missing\n\nWe can try to apply that to our data. Note that the arguments are the same as in the wide to long case, but we don’t need to specify the times argument because they are in the data already. The varying argument is optional also, and R will auto-generate names for the wide variables if it is left empty.\n\ndf_long_to_wide <-\n reshape(\n df_all_long,\n direction = \"wide\",\n idvar = \"observation_id\",\n timevar = \"time\",\n v.names = c(\"IgG_concentration\", \"age\"),\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n )\n )\n\nWe can do the same checks to make sure we pivoted correctly.\n\nstr(df_long_to_wide)\n\n'data.frame': 651 obs. of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time1 : int 11 9 14 11 15 7 7 16 18 10 ...\n $ IgG_concentration_time2: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time2 : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeWide\")=List of 5\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ timevar: chr \"time\"\n ..$ idvar : chr \"observation_id\"\n ..$ times : num [1:2] 2 1\n ..$ varying: chr [1:2, 1:2] \"IgG_concentration_time1\" \"age_time1\" \"IgG_concentration_time2\" \"age_time2\"\n\nnrow(df_long_to_wide)\n\n[1] 651\n\nnrow(df_all_long)\n\n[1] 1287\n\n\nNote that this time we don’t have exactly twice as many records because of some quirks in how reshape() works. 
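The `attr(*, "reshapeWide")` and `attr(*, "reshapeLong")` entries in the `str()` output are `reshape()` storing the arguments it was called with; you can inspect them directly. A sketch:

```r
# Metadata left behind by the two reshapes above
attr(df_long_to_wide, "reshapeWide")
attr(df_wide_to_long, "reshapeLong")
```

That stored metadata is what lets a bare `reshape()` call reverse the operation without re-specifying the arguments.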
When we go from wide to long, R will create new records with NA values at the second time point for the individuals who were not in the second study – it won’t do that when we go from long to wide data. This is why it can be important to make sure all of your missing data are explicit rather than implicit.\n\n# For the original long dataset, we can see that not all individuals have 2\n# time points\nall(table(df_all_long$observation_id) == 2)\n\n[1] FALSE\n\n# But for the reshaped version they do all have 2 time points\nall(table(df_wide_to_long$observation_id) == 2)\n\n[1] TRUE", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { - "objectID": "modules/Module08-DataMergeReshape.html#wide-to-long-data", - "href": "modules/Module08-DataMergeReshape.html#wide-to-long-data", + "objectID": "modules/Module08-DataMergeReshape.html#reshape-metadata", + "href": "modules/Module08-DataMergeReshape.html#reshape-metadata", "title": "Module 8: Data Merging and Reshaping", - "section": "wide to long data", - "text": "wide to long data\nReminder: “typical usage for converting from long to wide format”\n\n### If names of wide-format variables are in a 'nice' format\n\nreshape(data, direction = \"long\",\n varying = c(___), # vector \n sep) # to help guess 'v.names' and 'times'\n\n### To specify long-format variable names explicitly\n\nreshape(data, direction = \"long\",\n varying = ___, # list / matrix / vector (use with care)\n v.names = ___, # vector of variable names in long format\n timevar, times, # name / values of constructed time variable\n idvar, ids) # name / values of constructed id variable\n\nWe can try to apply that to our data.\n\ndf_wide_to_long <-\n reshape(\n # First argument is the wide-format data frame to be reshaped\n df_all_wide,\n # We are inputting wide data and expect long format as output\n direction = \"long\",\n # \"varying\" argument is a list of vectors. Each vector in the list is a\n # group of time-varying (or grouping-factor-varying) variables which\n # should become one variable after reformat. We want two variables after\n # reformating, so we need two vectors in a list.\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n ),\n # \"v.names\" is a vector of names for the new long-format variables, it\n # should have the same length as the list for varying and the names will\n # be assigned in order.\n v.names = c(\"IgG_concentration\", \"age\"),\n # Name of the variable for the time index that will be created\n timevar = \"time\",\n # Values of the time variable that should be created. Note that if you\n # have any missing observations over time, they NEED to be in the dataset\n # as NAs or your times will get messed up.\n times = 1:2,\n # 'idvar' is a variable that marks which records belong to each\n # observational unit, for us that is the ID marking individuals.\n idvar = \"observation_id\"\n )\n\nNotice that this has exactly twice as many rows as our wide data format, and doesn’t appear to have any systematic missingness, so it seems correct.\n\nstr(df_wide_to_long)\n\n'data.frame': 1302 obs. 
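If you want every individual to have an explicit row at both time points before reshaping, one base R approach is to build the full id-by-time grid and outer-merge it in. A sketch; `complete_grid` and `df_all_long_explicit` are illustrative names:

```r
complete_grid <- expand.grid(
  observation_id = unique(df_all_long$observation_id),
  time = 1:2
)
# Full outer merge fills the never-observed combinations with NA rows
df_all_long_explicit <- merge(df_all_long, complete_grid, all = TRUE)
nrow(df_all_long_explicit)  # 651 unique IDs x 2 time points = 1302 rows
```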
of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n\nnrow(df_wide_to_long)\n\n[1] 1302\n\nnrow(df_all_wide)\n\n[1] 651" + "section": "reshape metadata", + "text": "reshape metadata\nWhenever you use reshape() to change the data format, it leaves behind some metadata on our new data frame, as an attr.\n\nstr(df_wide_to_long)\n\n'data.frame': 1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n\n\nThis stores information so we can reshape() back to the other format and we don’t have to specify arguments again.\n\ndf_back_to_wide <- reshape(df_wide_to_long)", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] + }, + { + "objectID": "modules/Module08-DataMergeReshape.html#lets-get-real", + "href": "modules/Module08-DataMergeReshape.html#lets-get-real", + "title": "Module 8: Data Merging and Reshaping", + "section": "Let’s get real", + "text": "Let’s get real\nUse the pivot_wider() and pivot_longer() from the tidyr package!", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#summary", "href": "modules/Module08-DataMergeReshape.html#summary", "title": "Module 8: Data Merging and Reshaping", "section": "Summary", - "text": "Summary\n\nthe merge() function can be used to marge datasets.\npay close attention to the number of rows in your data set before and after a merge\nwide data has many columns and has many columns per observation\nlong data has many rows and can have multiple rows per observation\nthe reshape() function allows you to toggle between wide and long data. although we highly recommend using pivot_wider() and pivot_longer() from the tidyr package instead" + "text": "Summary\n\nthe merge() function can be used to marge datasets.\npay close attention to the number of rows in your data set before and after a merge\nwide data has many columns and has many columns per observation\nlong data has many rows and can have multiple rows per observation\nthe reshape() function allows you to toggle between wide and long data. 
although we highly recommend using pivot_wider() and pivot_longer() from the tidyr package instead", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { "objectID": "modules/Module08-DataMergeReshape.html#acknowledgements", "href": "modules/Module08-DataMergeReshape.html#acknowledgements", "title": "Module 8: Data Merging and Reshaping", "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University" - }, - { - "objectID": "modules/Module08-DataMergeReshape.html#lets-get-real", - "href": "modules/Module08-DataMergeReshape.html#lets-get-real", - "title": "Module 8: Data Merging and Reshaping", - "section": "Let’s get real", - "text": "Let’s get real\nUse the pivot_wider() and pivot_longer() from the tidyr package!" + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University", + "crumbs": [ + "Day 2", + "Module 8: Data Merging and Reshaping" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#learning-objectives", - "href": "modules/Module09-DataAnalysis.html#learning-objectives", - "title": "Module 9: Data Analysis", + "objectID": "modules/Module10-DataVisualization.html#learning-objectives", + "href": "modules/Module10-DataVisualization.html#learning-objectives", + "title": "Module 10: Data Visualization", "section": "Learning Objectives", - "text": "Learning Objectives\nAfter module 9, you should be able to…\n\nDescriptively assess association between two variables\nCompute basic statistics\nFit a generalized linear model" + "text": "Learning Objectives\nAfter module 10, you should be able to:\n\nCreate Base R plots", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#import-data-for-this-module", - "href": "modules/Module09-DataAnalysis.html#import-data-for-this-module", - "title": "Module 9: Data Analysis", + "objectID": "modules/Module10-DataVisualization.html#import-data-for-this-module", + "href": "modules/Module10-DataVisualization.html#import-data-for-this-module", + "title": "Module 10: Data Visualization", "section": "Import data for this module", - "text": "Import data for this module\nLet’s read in our data (again) and take a quick look.\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum" + "text": "Import data for this module\nLet’s read in our data (again) and take a quick look.\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#prep-data", - "href": "modules/Module09-DataAnalysis.html#prep-data", - "title": "Module 9: Data Analysis", + "objectID": "modules/Module10-DataVisualization.html#prep-data", + "href": "modules/Module10-DataVisualization.html#prep-data", + "title": "Module 10: Data Visualization", "section": "Prep data", - "text": "Prep data\nCreate age_group three 
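Since the Summary points to tidyr for day-to-day reshaping, roughly the same wide/long round trip looks like this with `pivot_longer()` and `pivot_wider()`. A sketch, assuming the `df_all_wide` and `df_all_long` data frames built earlier and that tidyr is installed; `df_long_tidyr` and `df_wide_tidyr` are illustrative names:

```r
library(tidyr)

# wide -> long: ".value" keeps IgG_concentration and age as separate columns
df_long_tidyr <- pivot_longer(
  df_all_wide,
  cols      = c(IgG_concentration_time1, IgG_concentration_time2,
                age_time1, age_time2),
  names_to  = c(".value", "time"),
  names_sep = "_time"
)

# long -> wide
df_wide_tidyr <- pivot_wider(
  df_all_long,
  names_from  = time,
  values_from = c(IgG_concentration, age),
  names_sep   = "_time"
)
```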
level factor variable\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n\nCreate seropos binary variable representing seropositivity if antibody concentrations are >10 IU/mL.\n\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)" + "text": "Prep data\nCreate age_group three level factor variable\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n\nCreate seropos binary variable representing seropositivity if antibody concentrations are >10 IU/mL.\n\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#variable-contingency-tables", - "href": "modules/Module09-DataAnalysis.html#variable-contingency-tables", - "title": "Module 9: Data Analysis", - "section": "2 variable contingency tables", - "text": "2 variable contingency tables\nWe use table() prior to look at one variable, now we can generate frequency tables for 2 plus variables. To get cell percentages, the prop.table() is useful.\n\n?prop.table\n\n\nlibrary(printr)\n?prop.table\n\nExpress Table Entries as Fraction of Marginal Table\n\nDescription:\n\n Returns conditional proportions given 'margins', i.e. entries of\n 'x', divided by the appropriate marginal sums.\n\nUsage:\n\n proportions(x, margin = NULL)\n prop.table(x, margin = NULL)\n \nArguments:\n\n x: table\n\n margin: a vector giving the margins to split by. E.g., for a matrix\n '1' indicates rows, '2' indicates columns, 'c(1, 2)'\n indicates rows and columns. When 'x' has named dimnames, it\n can be a character vector selecting dimension names.\n\nValue:\n\n Table like 'x' expressed relative to 'margin'\n\nNote:\n\n 'prop.table' is an earlier name, retained for back-compatibility.\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'marginSums'. 
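The nested `ifelse()` calls above work, but `cut()` expresses the same three age groups in one step and returns a factor directly. A sketch of the equivalent recode:

```r
# (-Inf,5] -> young, (5,10] -> middle, (10,Inf) -> old
df$age_group <- cut(df$age,
                    breaks = c(-Inf, 5, 10, Inf),
                    labels = c("young", "middle", "old"))
table(df$age_group, useNA = "ifany")
```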
'apply', 'sweep' are a more general mechanism for\n sweeping out marginal statistics.\n\nExamples:\n\n m <- matrix(1:4, 2)\n m\n proportions(m, 1)\n \n DF <- as.data.frame(UCBAdmissions)\n tbl <- xtabs(Freq ~ Gender + Admit, DF)\n \n proportions(tbl, \"Gender\")" + "objectID": "modules/Module10-DataVisualization.html#base-r-data-visualizattion-functions", + "href": "modules/Module10-DataVisualization.html#base-r-data-visualizattion-functions", + "title": "Module 10: Data Visualization", + "section": "Base R data visualizattion functions", + "text": "Base R data visualizattion functions\nThe Base R ‘graphics’ package has a ton of graphics options.\n\nhelp(package = \"graphics\")\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\n Information on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.3.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team <do-use-Contact-address@r-project.org>\nContact: R-help mailing list <r-help@r-project.org>\nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.3.1\nNeedsCompilation: yes\nBuilt: R 4.3.1; aarch64-apple-darwin20; 2023-06-16\n 21:53:01 UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment 
Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] + }, + { + "objectID": "modules/Module10-DataVisualization.html#base-r-plotting", + "href": "modules/Module10-DataVisualization.html#base-r-plotting", + "title": "Module 10: Data Visualization", + "section": "Base R Plotting", + "text": "Base R Plotting\nTo make a plot you often need to specify the following features:\n\nParameters\nPlot attributes\nThe legend", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] + }, + { + "objectID": "modules/Module10-DataVisualization.html#parameters", + "href": "modules/Module10-DataVisualization.html#parameters", + "title": "Module 10: Data Visualization", + "section": "1. Parameters", + "text": "1. Parameters\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via par() before you call the plot creates ‘global’ settings for your plot.\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\nThe mfrow specifies that we have one row and two columns of plots — that is, two plots side by side.\nThe mar attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third, the top of the plot(4), and the fourth to the left (1).\n\npar(mfrow = c(1,2), mar = c(5,5,4,1))", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#chi-square-test", - "href": "modules/Module09-DataAnalysis.html#chi-square-test", - "title": "Module 9: Data Analysis", - "section": "Chi-Square test", - "text": "Chi-Square test\nThe chisq.test() function test of independence of factor variables from stats package.\n\n?chisq.test\n\nPearson’s Chi-squared Test for Count Data\nDescription:\n 'chisq.test' performs chi-squared contingency table tests and\n goodness-of-fit tests.\nUsage:\n chisq.test(x, y = NULL, correct = TRUE,\n p = rep(1/length(x), length(x)), rescale.p = FALSE,\n simulate.p.value = FALSE, B = 2000)\n \nArguments:\n x: a numeric vector or matrix. 'x' and 'y' can also both be\n factors.\n\n y: a numeric vector; ignored if 'x' is a matrix. If 'x' is a\n factor, 'y' should be a factor of the same length.\ncorrect: a logical indicating whether to apply continuity correction when computing the test statistic for 2 by 2 tables: one half is subtracted from all |O - E| differences; however, the correction will not be bigger than the differences themselves. No correction is done if ‘simulate.p.value = TRUE’.\n p: a vector of probabilities of the same length as 'x'. An\n error is given if any entry of 'p' is negative.\nrescale.p: a logical scalar; if TRUE then ‘p’ is rescaled (if necessary) to sum to 1. 
If ‘rescale.p’ is FALSE, and ‘p’ does not sum to 1, an error is given.\nsimulate.p.value: a logical indicating whether to compute p-values by Monte Carlo simulation.\n B: an integer specifying the number of replicates used in the\n Monte Carlo test.\nDetails:\n If 'x' is a matrix with one row or column, or if 'x' is a vector\n and 'y' is not given, then a _goodness-of-fit test_ is performed\n ('x' is treated as a one-dimensional contingency table). The\n entries of 'x' must be non-negative integers. In this case, the\n hypothesis tested is whether the population probabilities equal\n those in 'p', or are all equal if 'p' is not given.\n\n If 'x' is a matrix with at least two rows and columns, it is taken\n as a two-dimensional contingency table: the entries of 'x' must be\n non-negative integers. Otherwise, 'x' and 'y' must be vectors or\n factors of the same length; cases with missing values are removed,\n the objects are coerced to factors, and the contingency table is\n computed from these. Then Pearson's chi-squared test is performed\n of the null hypothesis that the joint distribution of the cell\n counts in a 2-dimensional contingency table is the product of the\n row and column marginals.\n\n If 'simulate.p.value' is 'FALSE', the p-value is computed from the\n asymptotic chi-squared distribution of the test statistic;\n continuity correction is only used in the 2-by-2 case (if\n 'correct' is 'TRUE', the default). Otherwise the p-value is\n computed for a Monte Carlo test (Hope, 1968) with 'B' replicates.\n The default 'B = 2000' implies a minimum p-value of about 0.0005\n (1/(B+1)).\n\n In the contingency table case, simulation is done by random\n sampling from the set of all contingency tables with given\n marginals, and works only if the marginals are strictly positive.\n Continuity correction is never used, and the statistic is quoted\n without it. Note that this is not the usual sampling situation\n assumed for the chi-squared test but rather that for Fisher's\n exact test.\n\n In the goodness-of-fit case simulation is done by random sampling\n from the discrete distribution specified by 'p', each sample being\n of size 'n = sum(x)'. This simulation is done in R and may be\n slow.\nValue:\n A list with class '\"htest\"' containing the following components:\nstatistic: the value the chi-squared test statistic.\nparameter: the degrees of freedom of the approximate chi-squared distribution of the test statistic, ‘NA’ if the p-value is computed by Monte Carlo simulation.\np.value: the p-value for the test.\nmethod: a character string indicating the type of test performed, and whether Monte Carlo simulation or continuity correction was used.\ndata.name: a character string giving the name(s) of the data.\nobserved: the observed counts.\nexpected: the expected counts under the null hypothesis.\nresiduals: the Pearson residuals, ‘(observed - expected) / sqrt(expected)’.\nstdres: standardized residuals, ‘(observed - expected) / sqrt(V)’, where ‘V’ is the residual cell variance (Agresti, 2007, section 2.4.5 for the case where ‘x’ is a matrix, ‘n * p * (1 - p)’ otherwise).\nSource:\n The code for Monte Carlo simulation is a C translation of the\n Fortran algorithm of Patefield (1981).\nReferences:\n Hope, A. C. A. (1968). A simplified Monte Carlo significance test\n procedure. _Journal of the Royal Statistical Society Series B_,\n *30*, 582-598. doi:10.1111/j.2517-6161.1968.tb00759.x\n <https://doi.org/10.1111/j.2517-6161.1968.tb00759.x>.\n\n Patefield, W. M. (1981). 
Algorithm AS 159: An efficient method of\n generating r x c tables with given row and column totals.\n _Applied Statistics_, *30*, 91-97. doi:10.2307/2346669\n <https://doi.org/10.2307/2346669>.\n\n Agresti, A. (2007). _An Introduction to Categorical Data\n Analysis_, 2nd ed. New York: John Wiley & Sons. Page 38.\nSee Also:\n For goodness-of-fit testing, notably of continuous distributions,\n 'ks.test'.\nExamples:\n ## From Agresti(2007) p.39\n M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))\n dimnames(M) <- list(gender = c(\"F\", \"M\"),\n party = c(\"Democrat\",\"Independent\", \"Republican\"))\n (Xsq <- chisq.test(M)) # Prints test summary\n Xsq$observed # observed counts (same as M)\n Xsq$expected # expected counts under the null\n Xsq$residuals # Pearson residuals\n Xsq$stdres # standardized residuals\n \n \n ## Effect of simulating p-values\n x <- matrix(c(12, 5, 7, 7), ncol = 2)\n chisq.test(x)$p.value # 0.4233\n chisq.test(x, simulate.p.value = TRUE, B = 10000)$p.value\n # around 0.29!\n \n ## Testing for population probabilities\n ## Case A. Tabulated data\n x <- c(A = 20, B = 15, C = 25)\n chisq.test(x)\n chisq.test(as.table(x)) # the same\n x <- c(89,37,30,28,2)\n p <- c(40,20,20,15,5)\n try(\n chisq.test(x, p = p) # gives an error\n )\n chisq.test(x, p = p, rescale.p = TRUE)\n # works\n p <- c(0.40,0.20,0.20,0.19,0.01)\n # Expected count in category 5\n # is 1.86 < 5 ==> chi square approx.\n chisq.test(x, p = p) # maybe doubtful, but is ok!\n chisq.test(x, p = p, simulate.p.value = TRUE)\n \n ## Case B. Raw data\n x <- trunc(5 * runif(100))\n chisq.test(table(x)) # NOT 'chisq.test(x)'!" + "objectID": "modules/Module10-DataVisualization.html#parameters-1", + "href": "modules/Module10-DataVisualization.html#parameters-1", + "title": "Module 10: Data Visualization", + "section": "1. Parameters", + "text": "1. Parameters", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#chi-square-test-1", - "href": "modules/Module09-DataAnalysis.html#chi-square-test-1", - "title": "Module 9: Data Analysis", - "section": "Chi-Square test", - "text": "Chi-Square test\n\nchisq.test(freq)\n\n\n Pearson's Chi-squared test\n\ndata: freq\nX-squared = 175.85, df = 2, p-value < 2.2e-16\n\n\nWe reject the null hypothesis that the proportion of seropositive individuals in the young, middle, and old age groups are the same." + "objectID": "modules/Module10-DataVisualization.html#lots-of-parameters-options", + "href": "modules/Module10-DataVisualization.html#lots-of-parameters-options", + "title": "Module 10: Data Visualization", + "section": "Lots of parameters options", + "text": "Lots of parameters options\nHowever, there are many more parameter options that can be specified in the ‘global’ settings or specific to a certain plot option.\n\n?par\n\nSet or Query Graphical Parameters\nDescription:\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\nUsage:\n par(..., no.readonly = FALSE)\n \n <highlevel plot> (...., <tag> = <value>)\n \nArguments:\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. 
Supported\n parameters are described in the 'Graphical Parameters'\n section.\nno.readonly: logical; if ‘TRUE’ and there are no other arguments, only parameters are returned which can be set by a subsequent ‘par()’ call on the same device.\nDetails:\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\nValue:\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\nGraphical Parameters:\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. 
(Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). 
Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 
'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. 
Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). 
Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. (The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. 
Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\nColor Specification:\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\nLine Type Specification:\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. 
The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\nNote:\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\nExamples:\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#correlation", - "href": "modules/Module09-DataAnalysis.html#correlation", - "title": "Module 9: Data Analysis", - "section": "Correlation", - "text": "Correlation\nFirst, we compute correlation by providing two vectors.\nLike other functions, if there are NAs, you get NA as the result. But if you specify use only the complete observations, then it will give you correlation using the non-missing data.\n\ncor(df$age, df$IgG_concentration, method=\"pearson\")\n\n[1] NA\n\ncor(df$age, df$IgG_concentration, method=\"pearson\", use = \"complete.obs\") #IF have missing data\n\n[1] 0.2604783\n\n\nSmall positive correlation between IgG concentration and age." + "objectID": "modules/Module10-DataVisualization.html#common-parameter-options", + "href": "modules/Module10-DataVisualization.html#common-parameter-options", + "title": "Module 10: Data Visualization", + "section": "Common parameter options", + "text": "Common parameter options\nEight useful parameter arguments help improve the readability of the plot:\n\nxlab: specifies the x-axis label of the plot\nylab: specifies the y-axis label\nmain: titles your graph\npch: specifies the symbology of your graph\nlty: specifies the line type of your graph\nlwd: specifies line thickness\ncex : specifies size\ncol: specifies the colors for your graph.\n\nWe will explore use of these arguments below.", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#t-test", - "href": "modules/Module09-DataAnalysis.html#t-test", - "title": "Module 9: Data Analysis", - "section": "T-test", - "text": "T-test\nThe commonly used are:\n\none-sample t-test – used to test mean of a variable in one group (to the null hypothesis mean)\ntwo-sample t-test – used to test difference in means of a variable between two groups (null hypothesis - the group means are the same)" + "objectID": "modules/Module10-DataVisualization.html#common-parameter-options-1", + "href": "modules/Module10-DataVisualization.html#common-parameter-options-1", + "title": "Module 10: Data Visualization", + "section": "Common parameter options", + "text": "Common parameter options", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#t-test-1", - "href": "modules/Module09-DataAnalysis.html#t-test-1", - "title": "Module 9: Data Analysis", - "section": "T-test", - "text": "T-test\nWe can use the t.test() function from the stats package.\n\n?t.test\n\nStudent’s t-Test\nDescription:\n Performs one and two sample t-tests on vectors of data.\nUsage:\n t.test(x, ...)\n \n ## Default S3 method:\n t.test(x, y = NULL,\n alternative = 
c(\"two.sided\", \"less\", \"greater\"),\n mu = 0, paired = FALSE, var.equal = FALSE,\n conf.level = 0.95, ...)\n \n ## S3 method for class 'formula'\n t.test(formula, data, subset, na.action, ...)\n \nArguments:\n x: a (non-empty) numeric vector of data values.\n\n y: an optional (non-empty) numeric vector of data values.\nalternative: a character string specifying the alternative hypothesis, must be one of ‘“two.sided”’ (default), ‘“greater”’ or ‘“less”’. You can specify just the initial letter.\n mu: a number indicating the true value of the mean (or difference\n in means if you are performing a two sample test).\npaired: a logical indicating whether you want a paired t-test.\nvar.equal: a logical variable indicating whether to treat the two variances as being equal. If ‘TRUE’ then the pooled variance is used to estimate the variance otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.\nconf.level: confidence level of the interval.\nformula: a formula of the form ‘lhs ~ rhs’ where ‘lhs’ is a numeric variable giving the data values and ‘rhs’ either ‘1’ for a one-sample or paired test or a factor with two levels giving the corresponding groups. If ‘lhs’ is of class ‘“Pair”’ and ‘rhs’ is ‘1’, a paired test is done.\ndata: an optional matrix or data frame (or similar: see\n 'model.frame') containing the variables in the formula\n 'formula'. By default the variables are taken from\n 'environment(formula)'.\nsubset: an optional vector specifying a subset of observations to be used.\nna.action: a function which indicates what should happen when the data contain ‘NA’s. Defaults to ’getOption(“na.action”)’.\n ...: further arguments to be passed to or from methods.\nDetails:\n 'alternative = \"greater\"' is the alternative that 'x' has a larger\n mean than 'y'. For the one-sample case: that the mean is positive.\n\n If 'paired' is 'TRUE' then both 'x' and 'y' must be specified and\n they must be the same length. Missing values are silently removed\n (in pairs if 'paired' is 'TRUE'). If 'var.equal' is 'TRUE' then\n the pooled estimate of the variance is used. 
By default, if\n 'var.equal' is 'FALSE' then the variance is estimated separately\n for both groups and the Welch modification to the degrees of\n freedom is used.\n\n If the input data are effectively constant (compared to the larger\n of the two means) an error is generated.\nValue:\n A list with class '\"htest\"' containing the following components:\nstatistic: the value of the t-statistic.\nparameter: the degrees of freedom for the t-statistic.\np.value: the p-value for the test.\nconf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.\nestimate: the estimated mean or difference in means depending on whether it was a one-sample test or a two-sample test.\nnull.value: the specified hypothesized value of the mean or mean difference depending on whether it was a one-sample test or a two-sample test.\nstderr: the standard error of the mean (difference), used as denominator in the t-statistic formula.\nalternative: a character string describing the alternative hypothesis.\nmethod: a character string indicating what type of t-test was performed.\ndata.name: a character string giving the name(s) of the data.\nSee Also:\n 'prop.test'\nExamples:\n require(graphics)\n \n t.test(1:10, y = c(7:20)) # P = .00001855\n t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore\n \n ## Classical example: Student's sleep data\n plot(extra ~ group, data = sleep)\n ## Traditional interface\n with(sleep, t.test(extra[group == 1], extra[group == 2]))\n \n ## Formula interface\n t.test(extra ~ group, data = sleep)\n \n ## Formula interface to one-sample test\n t.test(extra ~ 1, data = sleep)\n \n ## Formula interface to paired test\n ## The sleep data are actually paired, so could have been in wide format:\n sleep2 <- reshape(sleep, direction = \"wide\", \n idvar = \"ID\", timevar = \"group\")\n t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)" + "objectID": "modules/Module10-DataVisualization.html#plot-attributes", + "href": "modules/Module10-DataVisualization.html#plot-attributes", + "title": "Module 10: Data Visualization", + "section": "2. Plot Attributes", + "text": "2. Plot Attributes\nPlot attributes are those that map your data to the plot. This mean this is where you specify what variables in the data frame you want to plot.\nWe will only look at four types of plots today:\n\nhist() displays histogram of one variable\nplot() displays x-y plot of two variables\nboxplot() displays boxplot\nbarplot() displays barplot", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#running-two-sample-t-test", - "href": "modules/Module09-DataAnalysis.html#running-two-sample-t-test", - "title": "Module 9: Data Analysis", - "section": "Running two-sample t-test", - "text": "Running two-sample t-test\nThe base R - t.test() function from the stats package. It tests test difference in means of a variable between two groups. 
By default:\n\ntests whether difference in means of a variable is equal to 0 (default mu=0)\nuses “two sided” alternative (alternative = \"two.sided\")\nreturns result assuming confidence level 0.95 (conf.level = 0.95)\nassumes data are not paired (paired = FALSE)\nassumes true variance in the two groups is not equal (var.equal = FALSE)" + "objectID": "modules/Module10-DataVisualization.html#hist-help-file", + "href": "modules/Module10-DataVisualization.html#hist-help-file", + "title": "Module 10: Data Visualization", + "section": "hist() Help File", + "text": "hist() Help File\n\n?hist\n\nHistograms\nDescription:\n The generic function 'hist' computes a histogram of the given data\n values. If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\nUsage:\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n x: a vector of values for which the histogram is desired.\nbreaks: one of:\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\nfreq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\nprobability: an alias for ‘!freq’, for S compatibility.\ninclude.lowest: logical; if ‘TRUE’, an ‘x[i]’ equal to the ‘breaks’ value will be included in the first (or last, for ‘right = FALSE’) bar. This will be ignored (with a warning) unless ‘breaks’ is a vector.\nright: logical; if ‘TRUE’, the histogram cells are right-closed (left open) intervals.\nfuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\ndensity: the density of shading lines, in lines per inch. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise).\n col: a colour to be used to fill the bars.\nborder: the color of the border around the bars. 
The default is to use the standard foreground color.\nmain, xlab, ylab: main title and axis labels: these arguments to ‘title()’ get “smart” defaults here, e.g., the default ‘ylab’ is ‘“Frequency”’ iff ‘freq’ is true.\nxlim, ylim: the range of x and y values with sensible defaults. Note that ‘xlim’ is not used to define the histogram (breaks), but only for plotting (when ‘plot = TRUE’).\naxes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\nplot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\nlabels: logical or character string. Additionally draw labels on top of bars, if not ‘FALSE’; see ‘plot.histogram’.\nnclass: numeric (integer). For S(-PLUS) compatibility only, ‘nclass’ is equivalent to ‘breaks’ for a scalar or character argument.\nwarn.unused: logical. If ‘plot = FALSE’ and ‘warn.unused = TRUE’, a warning will be issued when graphical parameters are passed to ‘hist.default()’.\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\nDetails:\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\nValue:\n an object of class '\"histogram\"' which is a list with components:\nbreaks: the n+1 cell boundaries (= ‘breaks’ if that was a vector). These are the nominal breaks, not with the boundary fuzz.\ncounts: n integers; for each cell, the number of ‘x[]’ inside.\ndensity: values f^(x[i]), as estimated density values. If ‘all(diff(breaks) == 1)’, they are the relative frequencies ‘counts/n’ and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = ‘breaks[i]’.\nmids: the n cell midpoints.\nxname: a character string with the actual ‘x’ argument name.\nequidist: logical, indicating if the distances between ‘breaks’ are all the same.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. 
(1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\nSee Also:\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\nExamples:\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... :\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#running-two-sample-t-test-1", - "href": "modules/Module09-DataAnalysis.html#running-two-sample-t-test-1", - "title": "Module 9: Data Analysis", - "section": "Running two-sample t-test", - "text": "Running two-sample t-test\n\nIgG_young <- df$IgG_concentration[df$age_group==\"young\"]\nIgG_old <- df$IgG_concentration[df$age_group==\"old\"]\n\nt.test(IgG_young, IgG_old)\n\n\n Welch Two Sample t-test\n\ndata: IgG_young and IgG_old\nt = -6.1969, df = 259.54, p-value = 2.25e-09\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -111.09281 -57.51515\nsample estimates:\nmean of x mean of y \n 45.05056 129.35454 \n\n\nThe mean IgG concenration of young and old is 45.05 and 129.35 IU/mL, respectively. We reject null hypothesis that the difference in the mean IgG concentration of young and old is 0 IU/mL." 
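Before the worked hist() example that follows, a small sketch (not from the original module) combining the earlier par() settings with hist(). It assumes the workshop data frame df with age and IgG_concentration columns, draws the two histograms side by side using the mfrow and mar values from the Parameters section and the xlab, main, and col arguments from the common parameter options, then restores the previous settings.

op <- par(mfrow = c(1, 2), mar = c(5, 5, 4, 1))   # 1 row x 2 columns of plots, custom margins
hist(df$age, main = "Histogram", xlab = "Age (years)", col = "lightgray")
hist(df$IgG_concentration, main = "Histogram", xlab = "IgG concentration (IU/mL)", col = "lightgray")
par(op)                                           # restore the previous graphical parameters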
+ "objectID": "modules/Module10-DataVisualization.html#hist-example", + "href": "modules/Module10-DataVisualization.html#hist-example", + "title": "Module 10: Data Visualization", + "section": "hist() example", + "text": "hist() example\nReminder function signature\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\nLet’s practice\n\nhist(df$age)\n\n\n\n\n\n\n\nhist(\n df$age, \n freq=FALSE, \n main=\"Histogram\", \n xlab=\"Age (years)\"\n )", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r", - "href": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r", - "title": "Module 9: Data Analysis", - "section": "Linear regression fit in R", - "text": "Linear regression fit in R\nTo fit regression models in R, we use the function glm() (Generalized Linear Model).\n\n?glm\n\nFitting Generalized Linear Models\nDescription:\n 'glm' is used to fit generalized linear models, specified by\n giving a symbolic description of the linear predictor and a\n description of the error distribution.\nUsage:\n glm(formula, family = gaussian, data, weights, subset,\n na.action, start = NULL, etastart, mustart, offset,\n control = list(...), model = TRUE, method = \"glm.fit\",\n x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)\n \n glm.fit(x, y, weights = rep.int(1, nobs),\n start = NULL, etastart = NULL, mustart = NULL,\n offset = rep.int(0, nobs), family = gaussian(),\n control = list(), intercept = TRUE, singular.ok = TRUE)\n \n ## S3 method for class 'glm'\n weights(object, type = c(\"prior\", \"working\"), ...)\n \nArguments:\nformula: an object of class ‘“formula”’ (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.\nfamily: a description of the error distribution and link function to be used in the model. For ‘glm’ this can be a character string naming a family function, a family function or the result of a call to a family function. For ‘glm.fit’ only the third option is supported. (See ‘family’ for details of family functions.)\ndata: an optional data frame, list or environment (or object\n coercible by 'as.data.frame' to a data frame) containing the\n variables in the model. If not found in 'data', the\n variables are taken from 'environment(formula)', typically\n the environment from which 'glm' is called.\nweights: an optional vector of ‘prior weights’ to be used in the fitting process. Should be ‘NULL’ or a numeric vector.\nsubset: an optional vector specifying a subset of observations to be used in the fitting process.\nna.action: a function which indicates what should happen when the data contain ‘NA’s. The default is set by the ’na.action’ setting of ‘options’, and is ‘na.fail’ if that is unset. The ‘factory-fresh’ default is ‘na.omit’. Another possible value is ‘NULL’, no action. 
Value ‘na.exclude’ can be useful.\nstart: starting values for the parameters in the linear predictor.\netastart: starting values for the linear predictor.\nmustart: starting values for the vector of means.\noffset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be ‘NULL’ or a numeric vector of length equal to the number of cases. One or more ‘offset’ terms can be included in the formula instead or as well, and if more than one is specified their sum is used. See ‘model.offset’.\ncontrol: a list of parameters for controlling the fitting process. For ‘glm.fit’ this is passed to ‘glm.control’.\nmodel: a logical value indicating whether model frame should be included as a component of the returned value.\nmethod: the method to be used in fitting the model. The default method ‘“glm.fit”’ uses iteratively reweighted least squares (IWLS): the alternative ‘“model.frame”’ returns the model frame and does no fitting.\n User-supplied fitting functions can be supplied either as a\n function or a character string naming a function, with a\n function which takes the same arguments as 'glm.fit'. If\n specified as a character string it is looked up from within\n the 'stats' namespace.\n\nx, y: For 'glm': logical values indicating whether the response\n vector and model matrix used in the fitting process should be\n returned as components of the returned value.\n\n For 'glm.fit': 'x' is a design matrix of dimension 'n * p',\n and 'y' is a vector of observations of length 'n'.\nsingular.ok: logical; if ‘FALSE’ a singular fit is an error.\ncontrasts: an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.\nintercept: logical. Should an intercept be included in the null model?\nobject: an object inheriting from class ‘“glm”’.\ntype: character, partial matching allowed. Type of weights to\n extract from the fitted model object. Can be abbreviated.\n\n ...: For 'glm': arguments to be used to form the default 'control'\n argument if it is not supplied directly.\n\n For 'weights': further arguments passed to or from other\n methods.\nDetails:\n A typical predictor has the form 'response ~ terms' where\n 'response' is the (numeric) response vector and 'terms' is a\n series of terms which specifies a linear predictor for 'response'.\n For 'binomial' and 'quasibinomial' families the response can also\n be specified as a 'factor' (when the first level denotes failure\n and all others success) or as a two-column matrix with the columns\n giving the numbers of successes and failures. A terms\n specification of the form 'first + second' indicates all the terms\n in 'first' together with all the terms in 'second' with any\n duplicates removed.\n\n A specification of the form 'first:second' indicates the set of\n terms obtained by taking the interactions of all terms in 'first'\n with all terms in 'second'. The specification 'first*second'\n indicates the _cross_ of 'first' and 'second'. 
This is the same\n as 'first + second + first:second'.\n\n The terms in the formula will be re-ordered so that main effects\n come first, followed by the interactions, all second-order, all\n third-order and so on: to avoid this pass a 'terms' object as the\n formula.\n\n Non-'NULL' 'weights' can be used to indicate that different\n observations have different dispersions (with the values in\n 'weights' being inversely proportional to the dispersions); or\n equivalently, when the elements of 'weights' are positive integers\n w_i, that each response y_i is the mean of w_i unit-weight\n observations. For a binomial GLM prior weights are used to give\n the number of trials when the response is the proportion of\n successes: they would rarely be used for a Poisson GLM.\n\n 'glm.fit' is the workhorse function: it is not normally called\n directly but can be more efficient where the response vector,\n design matrix and family have already been calculated.\n\n If more than one of 'etastart', 'start' and 'mustart' is\n specified, the first in the list will be used. It is often\n advisable to supply starting values for a 'quasi' family, and also\n for families with unusual links such as 'gaussian(\"log\")'.\n\n All of 'weights', 'subset', 'offset', 'etastart' and 'mustart' are\n evaluated in the same way as variables in 'formula', that is first\n in 'data' and then in the environment of 'formula'.\n\n For the background to warning messages about 'fitted probabilities\n numerically 0 or 1 occurred' for binomial GLMs, see Venables &\n Ripley (2002, pp. 197-8).\nValue:\n 'glm' returns an object of class inheriting from '\"glm\"' which\n inherits from the class '\"lm\"'. See later in this section. If a\n non-standard 'method' is used, the object will also inherit from\n the class (if any) returned by that function.\n\n The function 'summary' (i.e., 'summary.glm') can be used to obtain\n or print a summary of the results and the function 'anova' (i.e.,\n 'anova.glm') to produce an analysis of variance table.\n\n The generic accessor functions 'coefficients', 'effects',\n 'fitted.values' and 'residuals' can be used to extract various\n useful features of the value returned by 'glm'.\n\n 'weights' extracts a vector of weights, one for each case in the\n fit (after subsetting and 'na.action').\n\n An object of class '\"glm\"' is a list containing at least the\n following components:\ncoefficients: a named vector of coefficients\nresiduals: the working residuals, that is the residuals in the final iteration of the IWLS fit. Since cases with zero weights are omitted, their working residuals are ‘NA’.\nfitted.values: the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.\nrank: the numeric rank of the fitted linear model.\nfamily: the ‘family’ object used.\nlinear.predictors: the linear fit on link scale.\ndeviance: up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero.\n aic: A version of Akaike's _An Information Criterion_, minus twice\n the maximized log-likelihood plus twice the number of\n parameters, computed via the 'aic' component of the family.\n For binomial and Poison families the dispersion is fixed at\n one and the number of parameters is the number of\n coefficients. For gaussian, Gamma and inverse gaussian\n families the dispersion is estimated from the residual\n deviance, and the number of parameters is the number of\n coefficients plus one. 
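The formula operators described in the Details above ('+', ':', '*') are easiest to check directly. A minimal sketch, assuming the built-in mtcars data rather than the workshop data set:

# 'a*b' expands to 'a + b + a:b'; compare the term labels.
f_main  <- glm(mpg ~ wt + factor(cyl), data = mtcars, family = gaussian())  # main effects only
f_inter <- glm(mpg ~ wt * factor(cyl), data = mtcars, family = gaussian())  # adds wt:factor(cyl)

attr(terms(mpg ~ wt * factor(cyl)), "term.labels")
# should list "wt", "factor(cyl)", and "wt:factor(cyl)": main effects first,
# then the interaction, matching the re-ordering rule described above.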
For a gaussian family the MLE of the\n dispersion is used so this is a valid value of AIC, but for\n Gamma and inverse gaussian families it is not. For families\n fitted by quasi-likelihood the value is 'NA'.\nnull.deviance: The deviance for the null model, comparable with ‘deviance’. The null model will include the offset, and an intercept if there is one in the model. Note that this will be incorrect if the link function depends on the data other than through the fitted mean: specify a zero offset to force a correct calculation.\niter: the number of iterations of IWLS used.\nweights: the working weights, that is the weights in the final iteration of the IWLS fit.\nprior.weights: the weights initially supplied, a vector of ’1’s if none were.\ndf.residual: the residual degrees of freedom.\ndf.null: the residual degrees of freedom for the null model.\n y: if requested (the default) the 'y' vector used. (It is a\n vector even for a binomial model.)\n\n x: if requested, the model matrix.\nmodel: if requested (the default), the model frame.\nconverged: logical. Was the IWLS algorithm judged to have converged?\nboundary: logical. Is the fitted value on the boundary of the attainable values?\ncall: the matched call.\nformula: the formula supplied.\nterms: the ‘terms’ object used.\ndata: the 'data argument'.\noffset: the offset vector used.\ncontrol: the value of the ‘control’ argument used.\nmethod: the name of the fitter function used (when provided as a ‘character’ string to ‘glm()’) or the fitter ‘function’ (when provided as that).\ncontrasts: (where relevant) the contrasts used.\nxlevels: (where relevant) a record of the levels of the factors used in fitting.\nna.action: (where relevant) information returned by ‘model.frame’ on the special handling of ’NA’s.\n In addition, non-empty fits will have components 'qr', 'R' and\n 'effects' relating to the final weighted linear fit.\n\n Objects of class '\"glm\"' are normally of class 'c(\"glm\", \"lm\")',\n that is inherit from class '\"lm\"', and well-designed methods for\n class '\"lm\"' will be applied to the weighted linear model at the\n final iteration of IWLS. However, care is needed, as extractor\n functions for class '\"glm\"' such as 'residuals' and 'weights' do\n *not* just pick out the component of the fit with the same name.\n\n If a 'binomial' 'glm' model was specified by giving a two-column\n response, the weights returned by 'prior.weights' are the total\n numbers of cases (factored by the supplied case weights) and the\n component 'y' of the result is the proportion of successes.\nFitting functions:\n The argument 'method' serves two purposes. One is to allow the\n model frame to be recreated with no fitting. The other is to\n allow the default fitting function 'glm.fit' to be replaced by a\n function which takes the same arguments and uses a different\n fitting algorithm. If 'glm.fit' is supplied as a character string\n it is used to search for a function of that name, starting in the\n 'stats' namespace.\n\n The class of the object return by the fitter (if any) will be\n prepended to the class returned by 'glm'.\nAuthor(s):\n The original R implementation of 'glm' was written by Simon Davies\n working for Ross Ihaka at the University of Auckland, but has\n since been extensively re-written by members of the R Core team.\n\n The design was inspired by the S function of the same name\n described in Hastie & Pregibon (1992).\nReferences:\n Dobson, A. J. 
(1990) _An Introduction to Generalized Linear\n Models._ London: Chapman and Hall.\n\n Hastie, T. J. and Pregibon, D. (1992) _Generalized linear models._\n Chapter 6 of _Statistical Models in S_ eds J. M. Chambers and T.\n J. Hastie, Wadsworth & Brooks/Cole.\n\n McCullagh P. and Nelder, J. A. (1989) _Generalized Linear Models._\n London: Chapman and Hall.\n\n Venables, W. N. and Ripley, B. D. (2002) _Modern Applied\n Statistics with S._ New York: Springer.\nSee Also:\n 'anova.glm', 'summary.glm', etc. for 'glm' methods, and the\n generic functions 'anova', 'summary', 'effects', 'fitted.values',\n and 'residuals'.\n\n 'lm' for non-generalized _linear_ models (which SAS calls GLMs,\n for 'general' linear models).\n\n 'loglin' and 'loglm' (package 'MASS') for fitting log-linear\n models (which binomial and Poisson GLMs are) to contingency\n tables.\n\n 'bigglm' in package 'biglm' for an alternative way to fit GLMs to\n large datasets (especially those with many cases).\n\n 'esoph', 'infert' and 'predict.glm' have examples of fitting\n binomial glms.\nExamples:\n ## Dobson (1990) Page 93: Randomized Controlled Trial :\n counts <- c(18,17,15,20,10,20,25,13,12)\n outcome <- gl(3,1,9)\n treatment <- gl(3,3)\n data.frame(treatment, outcome, counts) # showing data\n glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())\n anova(glm.D93)\n summary(glm.D93)\n ## Computing AIC [in many ways]:\n (A0 <- AIC(glm.D93))\n (ll <- logLik(glm.D93))\n A1 <- -2*c(ll) + 2*attr(ll, \"df\")\n A2 <- glm.D93$family$aic(counts, mu=fitted(glm.D93), wt=1) +\n 2 * length(coef(glm.D93))\n stopifnot(exprs = {\n all.equal(A0, A1)\n all.equal(A1, A2)\n all.equal(A1, glm.D93$aic)\n })\n \n \n ## an example with offsets from Venables & Ripley (2002, p.189)\n utils::data(anorexia, package = \"MASS\")\n \n anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),\n family = gaussian, data = anorexia)\n summary(anorex.1)\n \n \n # A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)\n clotting <- data.frame(\n u = c(5,10,15,20,30,40,60,80,100),\n lot1 = c(118,58,42,35,27,25,21,19,18),\n lot2 = c(69,35,26,21,18,16,13,12,12))\n summary(glm(lot1 ~ log(u), data = clotting, family = Gamma))\n summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))\n ## Aliased (\"S\"ingular) -> 1 NA coefficient\n (fS <- glm(lot2 ~ log(u) + log(u^2), data = clotting, family = Gamma))\n tools::assertError(update(fS, singular.ok=FALSE), verbose=interactive())\n ## -> .. \"singular fit encountered\"\n \n ## Not run:\n \n ## for an example of the use of a terms object as a formula\n demo(glm.vr)\n ## End(Not run)" + "objectID": "modules/Module10-DataVisualization.html#plot-help-file", + "href": "modules/Module10-DataVisualization.html#plot-help-file", + "title": "Module 10: Data Visualization", + "section": "plot() Help File", + "text": "plot() Help File\n\n?plot\n\nGeneric X-Y Plotting\nDescription:\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\nUsage:\n plot(x, y, ...)\n \nArguments:\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: Arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\nDetails:\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\nNote:\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\nSee Also:\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
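The 'type' values listed above are quickest to compare side by side. A small sketch using a toy vector (not the workshop data):

# Draw a few of the 'type' values from the list above on the same toy data.
x <- 1:10
y <- x^2
op <- par(mfrow = c(2, 2))
plot(x, y, type = "p", main = "type = 'p' (points)")
plot(x, y, type = "l", main = "type = 'l' (lines)")
plot(x, y, type = "b", main = "type = 'b' (both)")
plot(x, y, type = "h", main = "type = 'h' (vertical lines)")
par(op)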
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\nExamples:\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-1", - "href": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-1", - "title": "Module 9: Data Analysis", - "section": "Linear regression fit in R", - "text": "Linear regression fit in R\nWe tend to focus on three arguments:\n\nformula – model formula written using names of columns in our data\ndata – our data frame\nfamily – error distribution and link function\n\n\nfit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())\nfit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = \"logit\"))" + "objectID": "modules/Module10-DataVisualization.html#plot-example", + "href": "modules/Module10-DataVisualization.html#plot-example", + "title": "Module 10: Data Visualization", + "section": "plot() example", + "text": "plot() example\n\nplot(df$age, df$IgG_concentration)\n\n\n\n\n\n\n\nplot(\n df$age, \n df$IgG_concentration, \n type=\"p\", \n main=\"Age by IgG Concentrations\", \n xlab=\"Age (years)\", \n ylab=\"IgG Concentration (IU/mL)\", \n pch=16, \n cex=0.9,\n col=\"lightblue\")", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#summary.glm", - "href": "modules/Module09-DataAnalysis.html#summary.glm", - "title": "Module 9: Data Analysis", - "section": "summary.glm()", - "text": "summary.glm()\nThe summary() function when applied to a fit object based on a glm is technically the summary.glm() function and produces details of the model fit. Note on object oriented code.\n\nSummarizing Generalized Linear Model Fits\nDescription:\n These functions are all 'methods' for class 'glm' or 'summary.glm'\n objects.\nUsage:\n ## S3 method for class 'glm'\n summary(object, dispersion = NULL, correlation = FALSE,\n symbolic.cor = FALSE, ...)\n \n ## S3 method for class 'summary.glm'\n print(x, digits = max(3, getOption(\"digits\") - 3),\n symbolic.cor = x$symbolic.cor,\n signif.stars = getOption(\"show.signif.stars\"),\n show.residuals = FALSE, ...)\n \nArguments:\nobject: an object of class ‘“glm”’, usually, a result of a call to ‘glm’.\n x: an object of class '\"summary.glm\"', usually, a result of a\n call to 'summary.glm'.\ndispersion: the dispersion parameter for the family used. Either a single numerical value or ‘NULL’ (the default), when it is inferred from ‘object’ (see ‘Details’).\ncorrelation: logical; if ‘TRUE’, the correlation matrix of the estimated parameters is returned and printed.\ndigits: the number of significant digits to use when printing.\nsymbolic.cor: logical. If ‘TRUE’, print the correlations in a symbolic form (see ‘symnum’) rather than as numbers.\nsignif.stars: logical. If ‘TRUE’, ‘significance stars’ are printed for each coefficient.\nshow.residuals: logical. 
If ‘TRUE’ then a summary of the deviance residuals is printed at the head of the output.\n ...: further arguments passed to or from other methods.\nDetails:\n 'print.summary.glm' tries to be smart about formatting the\n coefficients, standard errors, etc. and additionally gives\n 'significance stars' if 'signif.stars' is 'TRUE'. The\n 'coefficients' component of the result gives the estimated\n coefficients and their estimated standard errors, together with\n their ratio. This third column is labelled 't ratio' if the\n dispersion is estimated, and 'z ratio' if the dispersion is known\n (or fixed by the family). A fourth column gives the two-tailed\n p-value corresponding to the t or z ratio based on a Student t or\n Normal reference distribution. (It is possible that the\n dispersion is not known and there are no residual degrees of\n freedom from which to estimate it. In that case the estimate is\n 'NaN'.)\n\n Aliased coefficients are omitted in the returned object but\n restored by the 'print' method.\n\n Correlations are printed to two decimal places (or symbolically):\n to see the actual correlations print 'summary(object)$correlation'\n directly.\n\n The dispersion of a GLM is not used in the fitting process, but it\n is needed to find standard errors. If 'dispersion' is not\n supplied or 'NULL', the dispersion is taken as '1' for the\n 'binomial' and 'Poisson' families, and otherwise estimated by the\n residual Chisquared statistic (calculated from cases with non-zero\n weights) divided by the residual degrees of freedom.\n\n 'summary' can be used with Gaussian 'glm' fits to handle the case\n of a linear regression with known error variance, something not\n handled by 'summary.lm'.\nValue:\n 'summary.glm' returns an object of class '\"summary.glm\"', a list\n with components\n\ncall: the component from 'object'.\nfamily: the component from ‘object’.\ndeviance: the component from ‘object’.\ncontrasts: the component from ‘object’.\ndf.residual: the component from ‘object’.\nnull.deviance: the component from ‘object’.\ndf.null: the component from ‘object’.\ndeviance.resid: the deviance residuals: see ‘residuals.glm’.\ncoefficients: the matrix of coefficients, standard errors, z-values and p-values. Aliased coefficients are omitted.\naliased: named logical vector showing if the original coefficients are aliased.\ndispersion: either the supplied argument or the inferred/estimated dispersion if the former is ‘NULL’.\n df: a 3-vector of the rank of the model and the number of\n residual degrees of freedom, plus number of coefficients\n (including aliased ones).\ncov.unscaled: the unscaled (‘dispersion = 1’) estimated covariance matrix of the estimated coefficients.\ncov.scaled: ditto, scaled by ‘dispersion’.\ncorrelation: (only if ‘correlation’ is true.) The estimated correlations of the estimated coefficients.\nsymbolic.cor: (only if ‘correlation’ is true.) 
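The components listed in this Value section can be pulled out of the summary object by name. A hedged sketch, using a small model on the built-in mtcars data because fit1 and fit2 from the slides are not re-created here:

# Extract pieces of a 'summary.glm' object directly.
fit <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
s <- summary(fit)
s$coefficients   # matrix of estimates, std. errors, t values, p-values
s$dispersion     # estimated dispersion (residual variance for a gaussian fit)
s$cov.scaled     # dispersion-scaled covariance matrix of the coefficients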
The value of the argument ‘symbolic.cor’.\nSee Also:\n 'glm', 'summary'.\nExamples:\n ## For examples see example(glm)" + "objectID": "modules/Module10-DataVisualization.html#adding-more-stuff-to-the-same-plot", + "href": "modules/Module10-DataVisualization.html#adding-more-stuff-to-the-same-plot", + "title": "Module 10: Data Visualization", + "section": "Adding more stuff to the same plot", + "text": "Adding more stuff to the same plot\n\nWe can use the functions points() or lines() to add additional points or additional lines to an existing plot.\n\n\nplot(\n df$age[df$slum == \"Non slum\"],\n df$IgG_concentration[df$slum == \"Non slum\"],\n type = \"p\",\n main = \"IgG Concentration vs Age\",\n xlab = \"Age (years)\",\n ylab = \"IgG Concentration (IU/mL)\",\n pch = 16,\n cex = 0.9,\n col = \"lightblue\",\n xlim = range(df$age, na.rm = TRUE),\n ylim = range(df$IgG_concentration, na.rm = TRUE)\n)\npoints(\n df$age[df$slum == \"Mixed\"],\n df$IgG_concentration[df$slum == \"Mixed\"],\n pch = 16,\n cex = 0.9,\n col = \"blue\"\n)\npoints(\n df$age[df$slum == \"Slum\"],\n df$IgG_concentration[df$slum == \"Slum\"],\n pch = 16,\n cex = 0.9,\n col = \"darkblue\"\n)\n\n\n\nThe lines() function works similarly for connected lines.\nNote that the points() or lines() functions must be called with a plot()-style function\nWe will show how we could draw a legend() in a future section.", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-2", - "href": "modules/Module09-DataAnalysis.html#linear-regression-fit-in-r-2", - "title": "Module 9: Data Analysis", - "section": "Linear regression fit in R", - "text": "Linear regression fit in R\nLets look at the output…\n\nsummary(fit1)\n\n\nCall:\nglm(formula = IgG_concentration ~ age + gender + slum, family = gaussian(), \n data = df)\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 46.132 16.774 2.750 0.00613 ** \nage 9.324 1.388 6.718 4.15e-11 ***\ngenderMale -9.655 11.543 -0.836 0.40321 \nslumNon slum -20.353 14.299 -1.423 0.15513 \nslumSlum -29.705 25.009 -1.188 0.23536 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for gaussian family taken to be 20918.39)\n\n Null deviance: 14141483 on 631 degrees of freedom\nResidual deviance: 13115831 on 627 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 8087.9\n\nNumber of Fisher Scoring iterations: 2\n\nsummary(fit2)\n\n\nCall:\nglm(formula = seropos ~ age_group + gender + slum, family = binomial(link = \"logit\"), \n data = df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.3220 0.2516 -5.254 1.49e-07 ***\nage_groupmiddle 1.9020 0.2133 8.916 < 2e-16 ***\nage_groupold 2.8443 0.2522 11.278 < 2e-16 ***\ngenderMale -0.1725 0.1895 -0.910 0.363 \nslumNon slum -0.1099 0.2329 -0.472 0.637 \nslumSlum -0.1073 0.4118 -0.261 0.794 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 866.98 on 631 degrees of freedom\nResidual deviance: 679.10 on 626 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 691.1\n\nNumber of Fisher Scoring iterations: 4" + "objectID": "modules/Module10-DataVisualization.html#boxplot-help-file", + "href": "modules/Module10-DataVisualization.html#boxplot-help-file", + "title": "Module 10: Data Visualization", + "section": "boxplot() Help File", + "text": "boxplot() Help File\n\n?boxplot\n\nBox Plots\nDescription:\n Produce box-and-whisker plot(s) of the given (grouped) values.\nUsage:\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\nformula: a formula, such as ‘y ~ grp’, where ‘y’ is a numeric vector of data values to be split into groups according to the grouping variable ‘grp’ (usually a factor). Note that ‘~ g1 + g2’ is equivalent to ‘g1:g2’.\ndata: a data.frame (or list) from which the variables in 'formula'\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used for plotting.\nna.action: a function which indicates what should happen when the data contain ’NA’s. The default is to ignore missing values in either the response or the group.\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ‘ann=FALSE’.\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\ndrop, sep, lex.order: passed to ‘split.default’, see there.\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\nrange: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.\nwidth: a vector giving the relative widths of the boxes making up the plot.\nvarwidth: if ‘varwidth’ is ‘TRUE’, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.\nnotch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. 
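The 'range', 'varwidth', and 'notch' arguments just described are easier to see than to read about. A minimal sketch, assuming the built-in InsectSprays data rather than the workshop data frame:

# 'varwidth' scales box widths by group size; 'notch' adds median notches.
op <- par(mfrow = c(1, 2))
boxplot(count ~ spray, data = InsectSprays, varwidth = TRUE,
        col = "lightgray", main = "varwidth = TRUE")
boxplot(count ~ spray, data = InsectSprays, notch = TRUE,
        col = "lightgray", main = "notch = TRUE")  # may warn that notches extend outside the hinges
par(op)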
If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See ‘boxplot.stats’ for the calculations used.\noutline: if ‘outline’ is not true, the outliers are not drawn (as points whereas S+ uses lines).\nnames: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).\nboxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.\nstaplewex: staple line width expansion, proportional to box width.\noutwex: outlier line width expansion, proportional to box width.\nplot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\nborder: an optional vector of colors for the outlines of the boxplots. The values in ‘border’ are recycled if the length of ‘border’ is less than the number of plots.\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\npars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\nhorizontal: logical indicating if the boxplots should be horizontal; default ‘FALSE’ means vertical boxes.\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\nDetails:\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\nValue:\n List with the following components:\nstats: a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.\n n: a vector with the number of (non-'NA') observations in each\n group.\n\nconf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\ngroup: a vector of the same length as ‘out’ whose elements indicate to which group the outlier belongs.\nnames: a vector of names for the groups.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). _Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\nSee Also:\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\nExamples:\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. 
outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. <- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#summary", - "href": "modules/Module09-DataAnalysis.html#summary", - "title": "Module 9: Data Analysis", - "section": "Summary", - "text": "Summary\n\nthe aggregate() function can be used to conduct analyses across groups (i.e., categorical variables in the data(\nthe table() function can generate frequency tables for 2 plus variables, but to get percentage tables, the prop.table() is useful\nthe chisq.test() function tests independence of factor variables\nthe cor() or cor.test() functions can be used to calculate correlation between two numeric vectors\nthe t.test() functions conducts one and two sample (paired or unpaired) t-tests\nthe function glm() fits generalized linear modules to data and returns a fit object that can be read with the summary() function\nchanging the family argument in the glm() function allows you to fit models with different link functions" + "objectID": "modules/Module10-DataVisualization.html#boxplot-example", + "href": "modules/Module10-DataVisualization.html#boxplot-example", + "title": "Module 10: Data Visualization", + "section": "boxplot() example", + "text": "boxplot() example\nReminder function signature\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, 
horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\nLet’s practice\n\nboxplot(IgG_concentration~age_group, data=df)\n\n\n\n\n\n\n\nboxplot(\n log(df$IgG_concentration)~df$age_group, \n main=\"Age by IgG Concentrations\", \n xlab=\"Age Group (years)\", \n ylab=\"log IgG Concentration (mIU/mL)\", \n names=c(\"1-5\",\"6-10\", \"11-15\"), \n varwidth=T\n )", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#acknowledgements", - "href": "modules/Module09-DataAnalysis.html#acknowledgements", - "title": "Module 9: Data Analysis", - "section": "Acknowledgements", - "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University" + "objectID": "modules/Module10-DataVisualization.html#barplot-help-file", + "href": "modules/Module10-DataVisualization.html#barplot-help-file", + "title": "Module 10: Data Visualization", + "section": "barplot() Help File", + "text": "barplot() Help File\n\n?barplot\n\nBar Plots\nDescription:\n Creates a bar plot with vertical or horizontal bars.\nUsage:\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\nheight: either a vector or matrix of values describing the bars which make up the plot. If ‘height’ is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If ‘height’ is a matrix and ‘beside’ is ‘FALSE’ then each bar of the plot corresponds to a column of ‘height’, with the values in the column giving the heights of stacked sub-bars making up the bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, then the values in each column are juxtaposed rather than stacked.\nwidth: optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless ‘xlim’ is specified.\nspace: the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, ‘space’ may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to ‘c(0,1)’ if ‘height’ is a matrix and ‘beside’ is ‘TRUE’, and to 0.2 otherwise.\nnames.arg: a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the ‘names’ attribute of ‘height’ if this is a vector, or the column names if it is a matrix.\nlegend.text: a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when ‘height’ is a matrix. 
In that case given legend labels should correspond to the rows of ‘height’; if ‘legend.text’ is true, the row names of ‘height’ will be used as labels if they are non-null.\nbeside: a logical value. If ‘FALSE’, the columns of ‘height’ are portrayed as stacked bars, and if ‘TRUE’ the columns are portrayed as juxtaposed bars.\nhoriz: a logical value. If ‘FALSE’, the bars are drawn vertically with the first bar to the left. If ‘TRUE’, the bars are drawn horizontally with the first at the bottom.\ndensity: a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\nborder: the color to be used for the border of the bars. Use ‘border = NA’ to omit borders. If there are shading lines, ‘border = TRUE’ means use the same colour for the border as for the shading lines.\nmain,sub: main title and subtitle for the plot.\nxlab: a label for the x axis.\n\nylab: a label for the y axis.\n\nxlim: limits for the x axis.\n\nylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\naxes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\naxisnames: logical. If ‘TRUE’, and if there are ‘names.arg’ (see above), the other axis is drawn (with ‘lty = 0’) and labeled.\ncex.axis: expansion factor for numeric axis labels (see ‘par(’cex’)’).\ncex.names: expansion factor for axis names (bar labels).\ninside: logical. If ‘TRUE’, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when ‘space = 0’ (which it partly is when ‘beside = TRUE’).\nplot: logical. If 'FALSE', nothing is plotted.\naxis.lty: the graphics parameter ‘lty’ (see ‘par(’lty’)’) applied to the axis and tick marks of the categorical (default horizontal) axis. Note that by default the axis is suppressed.\noffset: a vector indicating how much the bars should be shifted relative to the x axis.\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\nargs.legend: list of additional arguments to pass to ‘legend()’; names of the list are used as argument names. Only used if ‘legend.text’ is supplied.\nformula: a formula where the ‘y’ variables are numeric data to plot against the categorical ‘x’ variables. The formula can have one of three forms:\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\ndata: a data frame (or list) from which the variables in formula\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used.\nna.action: a function which indicates what should happen when the data contain ‘NA’ values. The default is to ignore missing values in the given variables.\n ...: arguments to be passed to/from other methods. 
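When 'height' is a matrix, the 'beside' and 'legend.text' arguments described above control stacking and labelling. A short sketch, assuming the built-in VADeaths matrix rather than the workshop data:

# Stacked versus juxtaposed bars from the same matrix, with a legend.
barplot(VADeaths, beside = FALSE, legend.text = TRUE,
        main = "beside = FALSE (stacked)")   # row names of VADeaths label the legend
barplot(VADeaths, beside = TRUE, legend.text = rownames(VADeaths),
        main = "beside = TRUE (juxtaposed)")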
For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\nValue:\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\nAuthor(s):\n R Core, with a contribution by Arni Magnusson.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\nExamples:\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n 
beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#variable-contingency-tables-1", - "href": "modules/Module09-DataAnalysis.html#variable-contingency-tables-1", - "title": "Module 9: Data Analysis", - "section": "2 variable contingency tables", - "text": "2 variable contingency tables\nLet’s practice\n\nfreq <- table(df$age_group, df$seropos)\nfreq\n\n\n\n\n/\n0\n1\n\n\n\n\nyoung\n254\n57\n\n\nmiddle\n70\n105\n\n\nold\n30\n116\n\n\n\n\n\nNow, lets move to percentages\n\nprop.cell.percentages <- prop.table(freq)\nprop.cell.percentages\n\n\n\n\n/\n0\n1\n\n\n\n\nyoung\n0.4018987\n0.0901899\n\n\nmiddle\n0.1107595\n0.1661392\n\n\nold\n0.0474684\n0.1835443\n\n\n\n\nprop.column.percentages <- prop.table(freq, margin=2)\nprop.column.percentages\n\n\n\n\n/\n0\n1\n\n\n\n\nyoung\n0.7175141\n0.2050360\n\n\nmiddle\n0.1977401\n0.3776978\n\n\nold\n0.0847458\n0.4172662" + "objectID": "modules/Module10-DataVisualization.html#barplot-example", + "href": "modules/Module10-DataVisualization.html#barplot-example", + "title": "Module 10: Data Visualization", + "section": "barplot() example", + "text": "barplot() example\nThe function takes the a lot of arguments to control the way the way our data is plotted.\nReminder function signature\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n\n\n\n\n\n\n\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#correlation-confidence-interval", - "href": "modules/Module09-DataAnalysis.html#correlation-confidence-interval", - "title": "Module 9: Data Analysis", - "section": "Correlation confidence interval", - "text": "Correlation confidence interval\nThe function cor.test() also gives you the confidence interval of the correlation statistic. Note, it uses complete observations by default.\n\ncor.test(df$age, df$IgG_concentration, method=\"pearson\")\n\n\n Pearson's product-moment correlation\n\ndata: df$age and df$IgG_concentration\nt = 6.7717, df = 630, p-value = 2.921e-11\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n 0.1862722 0.3317295\nsample estimates:\n cor \n0.2604783" + "objectID": "modules/Module10-DataVisualization.html#legend", + "href": "modules/Module10-DataVisualization.html#legend", + "title": "Module 10: Data Visualization", + "section": "3. Legend!", + "text": "3. Legend!\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n?legend\n\n\n\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. 
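Because base R never draws a legend for you, the labels, colours, and symbols passed to legend() have to match the plotting calls by hand. A hedged sketch, using the built-in iris data in place of the workshop data frame `df`:

# Manual legend: the order of 'legend' labels must match the order of 'col'.
cols <- c("lightblue", "blue", "darkblue")
plot(iris$Sepal.Length, iris$Sepal.Width,
     pch = 16, col = cols[as.integer(iris$Species)],
     xlab = "Sepal length", ylab = "Sepal width")
legend("topright", legend = levels(iris$Species),
       col = cols, pch = 16)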
Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) 
Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. 
Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y 
= 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, 
ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#parameters-1", - "href": "modules/Module10-DataVisualization.html#parameters-1", + "objectID": "modules/Module10-DataVisualization.html#add-legend-to-the-plot", + "href": "modules/Module10-DataVisualization.html#add-legend-to-the-plot", "title": "Module 10: Data Visualization", - "section": "1. Parameters", - "text": "1. Parameters" + "section": "Add legend to the plot", + "text": "Add legend to the plot\nReminder function signature\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\nLet’s practice\n\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age Group\")\nlegend(x=2.5, y=0.5,\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { "objectID": "modules/Module10-DataVisualization.html#add-legend-to-the-plot-1", "href": "modules/Module10-DataVisualization.html#add-legend-to-the-plot-1", "title": "Module 10: Data Visualization", "section": "Add legend to the plot", - "text": "Add legend to the plot" + "text": "Add legend to the plot", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#barplot-example-3", - "href": "modules/Module10-DataVisualization.html#barplot-example-3", + "objectID": "modules/Module10-DataVisualization.html#barplot-example-1", + "href": "modules/Module10-DataVisualization.html#barplot-example-1", "title": "Module 10: Data Visualization", "section": "barplot() example", - "text": "barplot() example\nNow, let look at seropositivity by two individual level characteristics in the same plot.\n\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))" + "text": "barplot() example\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). 
Also, the age groups need more meaningful names.\n\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#barplot-example-4", - "href": "modules/Module10-DataVisualization.html#barplot-example-4", + "objectID": "modules/Module10-DataVisualization.html#barplot-example-2", + "href": "modules/Module10-DataVisualization.html#barplot-example-2", "title": "Module 10: Data Visualization", "section": "barplot() example", - "text": "barplot() example" - }, - { - "objectID": "archive/CaseStudy01.html#learning-goals", - "href": "archive/CaseStudy01.html#learning-goals", - "title": "Algorithmic Thinking Case Study 1", - "section": "Learning goals", - "text": "Learning goals\n\nUse logical operators, subsetting functions, and math calculations in R\nTranslate human-understandable problem descriptions into instructions that R can understand." - }, - { - "objectID": "archive/CaseStudy01.html#instructions", - "href": "archive/CaseStudy01.html#instructions", - "title": "Algorithmic Thinking Case Study 1", - "section": "Instructions", - "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it." - }, - { - "objectID": "archive/CaseStudy01.html#instructions-1", - "href": "archive/CaseStudy01.html#instructions-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Instructions", - "text": "Instructions\n\nMake a new R script for this case study, and save it to your code folder.\nWe’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.\nThe str() of your dataset should look like this.\n\n\n\ntibble [250 × 5] (S3: tbl_df/tbl/data.frame)\n $ age_months : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...\n $ group : chr [1:250] \"urban\" \"rural\" \"urban\" \"urban\" ...\n $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...\n $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...\n $ DP_vacc : num [1:250] 0 1 1 1 1 1 1 1 1 1 ..." + "text": "barplot() example", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", - "href": "archive/CaseStudy01.html#q1-was-the-overall-prevalence-higher-in-urban-or-rural-areas", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: Was the overall prevalence higher in urban or rural areas?", - "text": "Q1: Was the overall prevalence higher in urban or rural areas?\n\n\nHow do we calculate the prevalence from the data?\nHow do we calculate the prevalence separately for urban and rural areas?\nHow do we determine which prevalence is higher and if the difference is meaningful?" 
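To keep the barplot()/legend() workflow from these Module 10 examples reproducible on its own, here is a minimal sketch that builds a small simulated seropositivity table and plots column proportions with a matching legend. The toy data frame and its seropos/age_group columns are invented for illustration (they are not the workshop's serodata); the point is only the table() -> prop.table() -> barplot() -> legend() chain used above.

# Minimal, self-contained sketch with invented data
set.seed(1)
toy <- data.frame(
  seropos   = sample(c(FALSE, TRUE), 100, replace = TRUE),
  age_group = sample(c("young", "middle", "old"), 100, replace = TRUE)
)
freq     <- table(toy$seropos, toy$age_group)   # counts: rows = serostatus, columns = age group
prop.col <- prop.table(freq, margin = 2)        # margin = 2 -> each column sums to 1
barplot(prop.col, col = c("darkblue", "red"),
        ylim = c(0, 1.35), main = "Seropositivity by Age Group (toy data)")
legend("topright", fill = c("darkblue", "red"),
       legend = c("seronegative", "seropositive"))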
+ "objectID": "modules/Module10-DataVisualization.html#barplot-example-3", + "href": "modules/Module10-DataVisualization.html#barplot-example-3", + "title": "Module 10: Data Visualization", + "section": "barplot() example", + "text": "barplot() example\nNow, let look at seropositivity by two individual level characteristics in the same plot.\n\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-from-the-data", - "href": "archive/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-from-the-data", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence from the data?", - "text": "Q1: How do we calculate the prevalence from the data?\n\n\nThe variable DP_infection in our dataset is binary / dichotomous.\nThe prevalence is the number or percent of people who had the disease over some duration.\nThe average of a binary variable gives the prevalence!\n\n\n\n\nmean(diph$DP_infection)\n\n[1] 0.8" + "objectID": "modules/Module10-DataVisualization.html#barplot-example-4", + "href": "modules/Module10-DataVisualization.html#barplot-example-4", + "title": "Module 10: Data Visualization", + "section": "barplot() example", + "text": "barplot() example", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", - "href": "archive/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", - "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n\nmean(diph[diph$group == \"urban\", ]$DP_infection)\n\n[1] 0.8235294\n\nmean(diph[diph$group == \"rural\", ]$DP_infection)\n\n[1] 0.778626\n\n\n\n\n\nThere are many ways you could write this code! You can use subset() or you can write the indices many ways.\nUsing tbl_df objects from haven uses different [[ rules than a base R data frame." 
+ "objectID": "modules/Module10-DataVisualization.html#base-r-plots-vs-the-tidyverse-ggplot2-package", + "href": "modules/Module10-DataVisualization.html#base-r-plots-vs-the-tidyverse-ggplot2-package", + "title": "Module 10: Data Visualization", + "section": "Base R plots vs the Tidyverse ggplot2 package", + "text": "Base R plots vs the Tidyverse ggplot2 package\nIt is good to know both b/c they each have their strengths", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", - "href": "archive/CaseStudy01.html#q1-how-do-we-calculate-the-prevalence-separately-for-urban-and-rural-areas-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we calculate the prevalence separately for urban and rural areas?", - "text": "Q1: How do we calculate the prevalence separately for urban and rural areas?\n\nOne easy way is to use the aggregate() function.\n\n\naggregate(DP_infection ~ group, data = diph, FUN = mean)\n\n group DP_infection\n1 rural 0.7786260\n2 urban 0.8235294" + "objectID": "modules/Module10-DataVisualization.html#summary", + "href": "modules/Module10-DataVisualization.html#summary", + "title": "Module 10: Data Visualization", + "section": "Summary", + "text": "Summary\n\nthe Base R ‘graphics’ package has a ton of graphics options that allow for ultimate flexibility\nBase R plots typically include setting plot options (par()), mapping data to the plot (e.g., plot(), barplot(), points(), lines()), and creating a legend (legend()).\nthe functions points() or lines() add additional points or additional lines to an existing plot, but must be called with a plot()-style function\nin Base R plotting the legend is not automatically generated, so be careful when creating it", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", - "href": "archive/CaseStudy01.html#q1-how-do-we-determine-which-prevalence-is-higher-and-if-the-difference-is-meaningful", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?", - "text": "Q1: How do we determine which prevalence is higher and if the difference is meaningful?\n\n\nWe probably need to include a confidence interval in our calculation.\nThis is actually not so easy without more advanced tools that we will learn in upcoming modules.\nRight now the best options are to do it by hand or google a function." 
+ "objectID": "modules/Module10-DataVisualization.html#acknowledgements", + "href": "modules/Module10-DataVisualization.html#acknowledgements", + "title": "Module 10: Data Visualization", + "section": "Acknowledgements", + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Base Plotting in R” by Medium\n [\"Base R margins: a cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-by-hand", - "href": "archive/CaseStudy01.html#q1-by-hand", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: By hand", - "text": "Q1: By hand\n\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == \"urban\", ]))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == \"rural\", ])) \n\nresult_urban <- paste0(\n \"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n round(p_urban - 1.96 * se_urban, 2), \", \",\n round(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n \"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n round(p_rural - 1.96 * se_rural, 2), \", \",\n round(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n\nUrban: 0.82; 95% CI: (0.76, 0.89)\nRural: 0.78; 95% CI: (0.71, 0.85)" + "objectID": "modules/Module06-DataSubset.html#learning-objectives", + "href": "modules/Module06-DataSubset.html#learning-objectives", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Learning Objectives", + "text": "Learning Objectives\nAfter module 6, you should be able to…\n\nUse basic functions to get to know you data\nUse three indexing approaches\nRely on indexing to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\nDescribe what logical operators are and how to use them\nUse on the subset() function to subset data", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-by-hand-1", - "href": "archive/CaseStudy01.html#q1-by-hand-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: By hand", - "text": "Q1: By hand\n\nWe can see that the 95% CI’s overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.\nSome people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules." 
+ "objectID": "modules/Module06-DataSubset.html#getting-to-know-our-data", + "href": "modules/Module06-DataSubset.html#getting-to-know-our-data", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Getting to know our data", + "text": "Getting to know our data\nThe dim(), nrow(), and ncol() functions are good options to check the dimensions of your data before moving forward.\nLet’s first read in the data from the previous module.\n\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n\n\ndim(df) # rows, columns\n\n[1] 651 5\n\nnrow(df) # number of rows\n\n[1] 651\n\nncol(df) # number of columns\n\n[1] 5", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "archive/CaseStudy01.html#q1-googling-a-package", - "href": "archive/CaseStudy01.html#q1-googling-a-package", - "title": "Algorithmic Thinking Case Study 1", - "section": "Q1: Googling a package", - "text": "Q1: Googling a package\n\n\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254" + "objectID": "modules/Module06-DataSubset.html#quick-summary-of-data", + "href": "modules/Module06-DataSubset.html#quick-summary-of-data", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Quick summary of data", + "text": "Quick summary of data\nThe colnames(), str() and summary()functions from Base R are great functions to assess the data type and some summary statistics.\n\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\nstr(df)\n\n'data.frame': 651 obs. of 5 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.318 3.437 0.3 143.236 0.448 ...\n $ age : int 2 4 4 4 1 4 4 NA 4 2 ...\n $ gender : chr \"Female\" \"Female\" \"Male\" \"Male\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n\nsummary(df)\n\n observation_id IgG_concentration age gender \n Min. :5006 Min. : 0.0054 Min. : 1.000 Length:651 \n 1st Qu.:6306 1st Qu.: 0.3000 1st Qu.: 3.000 Class :character \n Median :7495 Median : 1.6658 Median : 6.000 Mode :character \n Mean :7492 Mean : 87.3683 Mean : 6.606 \n 3rd Qu.:8749 3rd Qu.:141.4405 3rd Qu.:10.000 \n Max. :9982 Max. :916.4179 Max. :15.000 \n NA's :10 NA's :9 \n slum \n Length:651 \n Class :character \n Mode :character \n \n \n \n \n\n\nNote, if you have a very large dataset with 15+ variables, summary() is not so efficient.", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "archive/CaseStudy01.html#you-try-it", - "href": "archive/CaseStudy01.html#you-try-it", - "title": "Algorithmic Thinking Case Study 1", - "section": "You try it!", - "text": "You try it!\n\nUsing any of the approaches you can think of, answer this question!\nHow many children under 5 were vaccinated? In children under 5, did vaccination lower the prevalence of infection?" + "objectID": "modules/Module06-DataSubset.html#description-of-data", + "href": "modules/Module06-DataSubset.html#description-of-data", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Description of data", + "text": "Description of data\nThis is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. 
Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. We will use this dataset for modules throughout the Workshop.", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "archive/CaseStudy01.html#you-try-it-1", - "href": "archive/CaseStudy01.html#you-try-it-1", - "title": "Algorithmic Thinking Case Study 1", - "section": "You try it!", - "text": "You try it!\n\n# How many children under 5 were vaccinated\nsum(diph$DP_vacc[diph$age_months < 60])\n\n[1] 91\n\n# Prevalence in both vaccine groups for children under 5\naggregate(\n DP_infection ~ DP_vacc,\n data = subset(diph, age_months < 60),\n FUN = DescTools::MeanCI\n)\n\n DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 0 0.4285714 0.1977457 0.6593972\n2 1 0.6373626 0.5366845 0.7380407\n\n\nIt appears that prevalence was HIGHER in the vaccine group? That is counterintuitive, but the sample size for the unvaccinated group is too small to be sure." + "objectID": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe", + "href": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "View the data as a whole dataframe", + "text": "View the data as a whole dataframe\nThe View() function, one of the few Base R functions with a capital letter, and can be used to open a new tab in the Console and view the data as you would in excel.\n\nView(df)", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "archive/CaseStudy01.html#congratulations-for-finishing-the-first-case-study", - "href": "archive/CaseStudy01.html#congratulations-for-finishing-the-first-case-study", - "title": "Algorithmic Thinking Case Study 1", - "section": "Congratulations for finishing the first case study!", - "text": "Congratulations for finishing the first case study!\n\nWhat R functions and skills did you practice?\nWhat other questions could you answer about the same dataset with the skills you know now?" + "objectID": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe-1", + "href": "modules/Module06-DataSubset.html#view-the-data-as-a-whole-dataframe-1", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "View the data as a whole dataframe", + "text": "View the data as a whole dataframe\nYou can also open a new tab of the data by clicking on the data icon beside the object in the Environment pane\n\nYou can also hold down Cmd or CTRL and click on the name of a data frame in your code.", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-operator", - "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-operator", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Adding new columns with $ operator", - "text": "Adding new columns with $ operator\nYou can add a new column, called log_IgG to df, using the $ operator:\n\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n\n observation_id IgG_concentration age gender slum log_IgG\n1 5772 0.3176895 2 Female Non slum -1.146681\n2 8095 3.4368231 4 Female Non slum 1.234548\n3 9784 0.3000000 4 Male Non slum -1.203973\n\n\nNote, my use of the underscore in the variable name rather than a space. 
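If you are working outside RStudio, or just want a quick look in the console instead of the View() tab described here, head() and tail() print the first or last few rows. A small sketch using a built-in dataset, so it does not depend on the workshop files:

# Quick console alternatives to View(), shown on a built-in dataset
head(iris, n = 3)   # first 3 rows
tail(iris, n = 3)   # last 3 rows
dim(iris)           # rows and columns, as checked for the workshop data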
This is good coding practice and make calling variables much less prone to error." + "objectID": "modules/Module06-DataSubset.html#indexing", + "href": "modules/Module06-DataSubset.html#indexing", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Indexing", + "text": "Indexing\nR contains several operators which allow access to individual elements or subsets through indexing. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: [, [[ and $.\n\nx[i] #if x is a vector\nx[i, j] #if x is a matrix/data frame\nx[[i]] #if x is a list\nx$a #if x is a data frame or list\nx$\"a\" #if x is a data frame or list", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform", - "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Adding new columns with transform()", - "text": "Adding new columns with transform()\nWe can also add a new column using the transform() function:\n\n?transform\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)" + "objectID": "modules/Module06-DataSubset.html#vectors-and-multi-dimensional-objects", + "href": "modules/Module06-DataSubset.html#vectors-and-multi-dimensional-objects", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Vectors and multi-dimensional objects", + "text": "Vectors and multi-dimensional objects\nTo index a vector, vector[i] select the ith element. 
To index a multi-dimensional objects such as a matrix, matrix[i, j] selects the element in row i and column j, where as in a three dimensional array[k, i, j] selects the element in matrix k, row i, and column j.\nLet’s practice by first creating the same objects as we did in Module 1.\n\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n\nHere is a reminder of what these objects look like.\n\nvector.object1\n\n[1] 2 3 4 5\n\nmatrix.object\n\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nFinally, let’s use indexing to pull out elements of the objects.\n\nvector.object1[2] #pulling the second element\n\n[1] 3\n\nmatrix.object[1,2] #pulling the element in row 1 column 2\n\n[1] 3", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform-1", - "href": "modules/Module07-VarCreationClassesSummaries.html#adding-new-columns-with-transform-1", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Adding new columns with transform()", - "text": "Adding new columns with transform()\nFor example, adding a binary column for seropositivity called seropos:\n\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n\n\n\n\n\n\n\n\n\n\n\n\n\nobservation_id\nIgG_concentration\nage\ngender\nslum\nlog_IgG\nseropos\n\n\n\n\n5772\n0.3176895\n2\nFemale\nNon slum\n-1.1466807\nFALSE\n\n\n8095\n3.4368231\n4\nFemale\nNon slum\n1.2345475\nFALSE\n\n\n9784\n0.3000000\n4\nMale\nNon slum\n-1.2039728\nFALSE\n\n\n9338\n143.2363014\n4\nMale\nNon slum\n4.9644957\nTRUE\n\n\n6369\n0.4476534\n1\nMale\nNon slum\n-0.8037359\nFALSE\n\n\n6885\n0.0252708\n4\nMale\nNon slum\n-3.6781074\nFALSE" + "objectID": "modules/Module06-DataSubset.html#list-objects", + "href": "modules/Module06-DataSubset.html#list-objects", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "List objects", + "text": "List objects\nFor lists, one generally uses list[[p]] to select any single element p.\nLet’s practice by creating the same list as we did in Module 1.\n\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nNow we use indexing to pull out the 3rd element in the list.\n\nlist.object[[3]]\n\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nWhat happens if we use a single square bracket?\n\nlist.object[3]\n\n[[1]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nThe [[ operator is called the “extract” operator and gives us the element from the list. 
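A quick check on three-dimensional indexing, using a throwaway array rather than a workshop object: in base R the first two indices are still row and column, and the third index picks which matrix (slice) you are in.

a <- array(1:24, dim = c(2, 3, 4))  # 4 slices, each a 2-row by 3-column matrix
a[, , 2]      # the whole second slice
a[2, 3, 1]    # row 2, column 3 of the first slice -> 6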
The [ operator is called the “subset” operator and gives us a subset of the list, that is still a list.", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example-1", - "href": "modules/Module07-VarCreationClassesSummaries.html#ifelse-example-1", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "ifelse example", - "text": "ifelse example\nLet’s delve into what is actually happening, with a focus on the NA values in age variable.\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\n\n\ndf$age <= 5\n\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] 
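To make the back-tick point concrete, here is a tiny invented data frame where one column name has been forced to contain a space; the $ form then needs back ticks, while [[ takes the name as a quoted string either way.

bad <- data.frame(igg = c(0.3, 3.4), age = c(2, 4))
names(bad)[1] <- "IgG concentration"   # force a column name with a space (for illustration only)
bad$`IgG concentration`                # back ticks required because of the space
bad[["IgG concentration"]]             # [[ with a quoted name also works
bad$age                                # no back ticks needed for a syntactic name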
TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE" + "objectID": "modules/Module06-DataSubset.html#for-indexing-for-data-frame", + "href": "modules/Module06-DataSubset.html#for-indexing-for-data-frame", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "$ for indexing for data frame", + "text": "$ for indexing for data frame\n$ allows only a literal character string or a symbol as the index. For a data frame it extracts a variable.\n\ndf$IgG_concentration\n\nNote, if you have spaces in your variable name, you will need to use back ticks ` after the $. This is a good reason to not create variables / column names with spaces.", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#nesting-two-ifelse-statements-example", - "href": "modules/Module07-VarCreationClassesSummaries.html#nesting-two-ifelse-statements-example", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Nesting two ifelse statements example", - "text": "Nesting two ifelse statements example\nifelse(test1, yes_to_test1, ifelse(test2, no_to_test2_yes_to_test2, no_to_test1_no_to_test2)).\n\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n\nLet’s use the table() function to check if it worked.\n\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n\n\n\n\nage/\nmiddle\nold\nyoung\nNA\n\n\n\n\n1\n0\n0\n44\n0\n\n\n2\n0\n0\n72\n0\n\n\n3\n0\n0\n79\n0\n\n\n4\n0\n0\n80\n0\n\n\n5\n0\n0\n41\n0\n\n\n6\n38\n0\n0\n0\n\n\n7\n38\n0\n0\n0\n\n\n8\n39\n0\n0\n0\n\n\n9\n20\n0\n0\n0\n\n\n10\n44\n0\n0\n0\n\n\n11\n0\n41\n0\n0\n\n\n12\n0\n23\n0\n0\n\n\n13\n0\n35\n0\n0\n\n\n14\n0\n37\n0\n0\n\n\n15\n0\n11\n0\n0\n\n\nNA\n0\n0\n0\n9\n\n\n\n\n\nNote, it puts the variable levels in alphabetical order, we will show how to change this later." + "objectID": "modules/Module06-DataSubset.html#for-indexing-with-lists", + "href": "modules/Module06-DataSubset.html#for-indexing-with-lists", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "$ for indexing with lists", + "text": "$ for indexing with lists\n$ allows only a literal character string or a symbol as the index. 
For a list it extracts a named element.\nList elements can be named\n\nlist.object.named <- list(\n emory = number.object,\n uga = vector.object2,\n gsu = matrix.object\n)\nlist.object.named\n\n$emory\n[1] 3\n\n$uga\n[1] \"blue\" \"red\" \"yellow\"\n\n$gsu\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n\n\nIf list elements are named, than you can reference data from list using $ or using double square brackets, [[\n\nlist.object.named$uga \n\n[1] \"blue\" \"red\" \"yellow\"\n\nlist.object.named[[\"uga\"]] \n\n[1] \"blue\" \"red\" \"yellow\"", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-1", - "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-1", - "title": "Module 7: Variable Creation, Classes, and Summaries", - "section": "Numeric variable data summary", - "text": "Numeric variable data summary\nLet’s look at a help file for mean() to make note of the na.rm argument\n\n?range\n\nRange of Values\nDescription:\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\nUsage:\n range(..., na.rm = FALSE)\n \n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n \nArguments:\n ...: any 'numeric' or character objects.\nna.rm: logical, indicating if ‘NA’’s should be omitted.\nfinite: logical, indicating if all non-finite elements should be omitted.\nDetails:\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\nS4 methods:\n This is part of the S4 'Summary' group generic. Methods for it\n must use the signature 'x, ..., na.rm'.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\nSee Also:\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\nExamples:\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)" + "objectID": "modules/Module06-DataSubset.html#using-indexing-to-rename-columns", + "href": "modules/Module06-DataSubset.html#using-indexing-to-rename-columns", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Using indexing to rename columns", + "text": "Using indexing to rename columns\nAs mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).\n\ncolnames(df) \n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\ncolnames(df)[2:3] <- c(\"IgG_concentration_IU/mL\", \"age_year\") # reassigns\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age_year\" \"gender\" \n[5] \"slum\" \n\n\n\nFor the sake of the module, I am going to reassign them back to the original variable names\n\ncolnames(df)[2:3] <- c(\"IgG_concentration\", \"age\") #reset", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#installing-and-attaching-packages---common-confusion", - "href": "modules/Module05-DataImportExport.html#installing-and-attaching-packages---common-confusion", - "title": "Module 5: Data Import and Export", - "section": "Installing and attaching packages - Common confusion", - "text": "Installing and attaching packages - Common confusion\n\nYou only need to install a package once (unless you update R or want to update the package), but you will need to attach a package each time you want to use it.\n\nThe exception to this rule are the “base” set of packages (i.e., Base R) that are installed automatically when you install R and that automatically attached whenever you open R or RStudio." + "objectID": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-columns", + "href": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-columns", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Using indexing to subset by columns", + "text": "Using indexing to subset by columns\nWe can also subset data frames and matrices (2-dimensional objects) using the bracket [ row , column ]. We can subset by columns and pull the x column using the index of the column or the column name. 
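Since the na.rm argument comes up for range() and most other numeric summary functions, a short check with an invented vector shows the difference it makes:

x <- c(2, 5, NA, 9)
range(x)                # NA NA -- a single missing value poisons the result
range(x, na.rm = TRUE)  # 2 9
mean(x, na.rm = TRUE)   # 5.333...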
Leaving either row or column dimension blank means to select all of them.\nFor example, here I am pulling the 3rd column, which has the variable name age, for all of rows.\n\ndf[ , \"age\"] #same as df[ , 3]\n\nWe can select multiple columns using multiple column names, again this is selecting these variables for all of the rows.\n\ndf[, c(\"age\", \"gender\")] #same as df[ , c(3,4)]\n\n age gender\n1 2 Female\n2 4 Female\n3 4 Male\n4 4 Male\n5 1 Male\n6 4 Male\n7 4 Female\n8 NA Female\n9 4 Male\n10 2 Male\n11 3 Male\n12 15 Female\n13 8 Male\n14 12 Male\n15 15 Male\n16 9 Male\n17 8 Male\n18 7 Female\n19 11 Female\n20 10 Male\n21 8 Male\n22 11 Female\n23 2 Male\n24 2 Female\n25 3 Female\n26 5 Male\n27 1 Male\n28 3 Female\n29 5 Female\n30 5 Female\n31 3 Male\n32 1 Male\n33 4 Female\n34 3 Male\n35 2 Female\n36 11 Female\n37 7 Male\n38 8 Male\n39 6 Male\n40 6 Male\n41 11 Female\n42 10 Male\n43 6 Female\n44 12 Male\n45 11 Male\n46 10 Male\n47 11 Male\n48 13 Female\n49 3 Female\n50 4 Female\n51 3 Male\n52 1 Male\n53 2 Female\n54 2 Female\n55 4 Male\n56 2 Male\n57 2 Male\n58 3 Female\n59 3 Female\n60 4 Male\n61 1 Female\n62 13 Female\n63 13 Female\n64 6 Male\n65 13 Male\n66 5 Female\n67 13 Female\n68 14 Male\n69 13 Male\n70 8 Female\n71 7 Male\n72 6 Female\n73 13 Male\n74 3 Male\n75 4 Male\n76 2 Male\n77 NA Male\n78 5 Female\n79 3 Male\n80 3 Male\n81 14 Male\n82 11 Female\n83 7 Female\n84 7 Male\n85 11 Female\n86 9 Female\n87 14 Male\n88 13 Female\n89 1 Male\n90 1 Male\n91 4 Male\n92 1 Female\n93 2 Male\n94 3 Female\n95 2 Male\n96 1 Male\n97 2 Male\n98 2 Female\n99 4 Female\n100 5 Female\n101 5 Male\n102 6 Female\n103 14 Female\n104 14 Male\n105 10 Male\n106 6 Female\n107 6 Male\n108 8 Male\n109 6 Female\n110 12 Female\n111 12 Male\n112 14 Female\n113 15 Male\n114 12 Female\n115 4 Female\n116 4 Male\n117 3 Female\n118 NA Male\n119 2 Female\n120 3 Male\n121 NA Female\n122 3 Female\n123 3 Male\n124 2 Female\n125 4 Female\n126 10 Female\n127 7 Female\n128 11 Female\n129 6 Female\n130 11 Male\n131 9 Male\n132 6 Male\n133 13 Female\n134 10 Female\n135 6 Female\n136 11 Female\n137 7 Male\n138 6 Female\n139 4 Female\n140 4 Female\n141 4 Male\n142 4 Female\n143 4 Male\n144 4 Male\n145 3 Male\n146 4 Female\n147 3 Male\n148 3 Male\n149 13 Female\n150 7 Female\n151 10 Male\n152 6 Male\n153 10 Female\n154 12 Female\n155 10 Male\n156 10 Male\n157 13 Male\n158 13 Female\n159 5 Female\n160 3 Female\n161 4 Male\n162 1 Male\n163 3 Female\n164 4 Male\n165 4 Male\n166 1 Male\n167 5 Female\n168 6 Female\n169 14 Female\n170 6 Male\n171 13 Female\n172 9 Male\n173 11 Male\n174 10 Male\n175 5 Female\n176 14 Male\n177 7 Male\n178 10 Male\n179 6 Male\n180 5 Male\n181 3 Female\n182 4 Male\n183 2 Female\n184 3 Male\n185 3 Female\n186 2 Female\n187 3 Male\n188 5 Female\n189 2 Male\n190 3 Female\n191 14 Female\n192 9 Female\n193 14 Female\n194 9 Female\n195 8 Female\n196 7 Male\n197 13 Male\n198 8 Female\n199 6 Male\n200 12 Female\n201 14 Female\n202 15 Female\n203 2 Female\n204 4 Female\n205 3 Male\n206 3 Female\n207 3 Male\n208 4 Female\n209 3 Male\n210 14 Female\n211 8 Male\n212 7 Male\n213 14 Female\n214 13 Female\n215 13 Female\n216 7 Male\n217 8 Female\n218 10 Female\n219 9 Male\n220 9 Female\n221 3 Female\n222 4 Male\n223 4 Female\n224 4 Male\n225 2 Female\n226 1 Female\n227 3 Female\n228 2 Male\n229 3 Male\n230 5 Male\n231 2 Female\n232 2 Male\n233 9 Male\n234 13 Male\n235 10 Female\n236 6 Male\n237 13 Female\n238 11 Male\n239 10 Male\n240 8 Female\n241 9 Female\n242 10 Male\n243 14 Male\n244 1 Female\n245 2 Male\n246 
3 Female\n247 2 Male\n248 3 Female\n249 2 Female\n250 3 Female\n251 5 Female\n252 10 Female\n253 7 Male\n254 13 Female\n255 15 Male\n256 11 Female\n257 10 Female\n258 3 Female\n259 2 Male\n260 3 Male\n261 3 Female\n262 3 Female\n263 4 Male\n264 3 Male\n265 2 Male\n266 4 Male\n267 2 Female\n268 8 Male\n269 11 Male\n270 6 Male\n271 14 Female\n272 14 Male\n273 5 Female\n274 5 Male\n275 10 Female\n276 13 Male\n277 6 Male\n278 5 Male\n279 12 Male\n280 2 Male\n281 3 Female\n282 1 Female\n283 1 Male\n284 1 Female\n285 2 Female\n286 5 Female\n287 5 Male\n288 4 Female\n289 2 Male\n290 NA Female\n291 6 Female\n292 8 Male\n293 15 Male\n294 11 Male\n295 14 Male\n296 6 Male\n297 10 Female\n298 12 Male\n299 14 Male\n300 10 Male\n301 1 Female\n302 3 Male\n303 2 Male\n304 3 Female\n305 4 Male\n306 3 Male\n307 4 Female\n308 4 Male\n309 1 Female\n310 7 Male\n311 11 Female\n312 7 Female\n313 5 Female\n314 10 Male\n315 9 Female\n316 13 Male\n317 11 Female\n318 13 Male\n319 9 Female\n320 15 Female\n321 7 Female\n322 4 Male\n323 1 Male\n324 1 Male\n325 2 Female\n326 2 Female\n327 3 Male\n328 2 Male\n329 3 Male\n330 4 Female\n331 7 Female\n332 11 Female\n333 10 Female\n334 5 Male\n335 8 Male\n336 15 Male\n337 14 Male\n338 2 Male\n339 2 Female\n340 2 Male\n341 5 Male\n342 4 Female\n343 3 Male\n344 5 Female\n345 4 Female\n346 2 Female\n347 1 Female\n348 7 Male\n349 8 Female\n350 NA Male\n351 9 Male\n352 8 Female\n353 5 Male\n354 14 Male\n355 14 Male\n356 7 Female\n357 13 Female\n358 2 Male\n359 1 Female\n360 1 Male\n361 4 Female\n362 3 Male\n363 4 Female\n364 3 Male\n365 1 Male\n366 5 Female\n367 4 Female\n368 4 Female\n369 4 Male\n370 11 Male\n371 15 Female\n372 12 Female\n373 11 Female\n374 8 Female\n375 13 Male\n376 10 Female\n377 10 Female\n378 15 Male\n379 8 Female\n380 14 Male\n381 4 Male\n382 1 Male\n383 5 Female\n384 2 Male\n385 2 Female\n386 4 Male\n387 4 Male\n388 2 Female\n389 3 Male\n390 11 Male\n391 10 Female\n392 6 Male\n393 12 Female\n394 10 Female\n395 8 Male\n396 8 Male\n397 13 Male\n398 10 Male\n399 13 Female\n400 10 Male\n401 2 Male\n402 4 Female\n403 3 Female\n404 2 Female\n405 1 Female\n406 3 Male\n407 3 Female\n408 4 Male\n409 5 Female\n410 5 Female\n411 1 Female\n412 11 Male\n413 6 Male\n414 14 Female\n415 8 Male\n416 8 Female\n417 9 Female\n418 7 Male\n419 6 Male\n420 12 Female\n421 8 Male\n422 11 Female\n423 14 Male\n424 3 Female\n425 1 Female\n426 5 Female\n427 2 Female\n428 3 Female\n429 4 Female\n430 2 Male\n431 3 Female\n432 4 Male\n433 1 Female\n434 7 Female\n435 10 Male\n436 11 Male\n437 7 Female\n438 10 Female\n439 14 Female\n440 7 Female\n441 11 Male\n442 12 Male\n443 10 Female\n444 6 Male\n445 13 Male\n446 8 Female\n447 2 Male\n448 3 Female\n449 1 Female\n450 2 Female\n451 NA Male\n452 NA Female\n453 4 Male\n454 4 Male\n455 1 Male\n456 2 Female\n457 2 Male\n458 12 Male\n459 12 Female\n460 8 Female\n461 14 Female\n462 13 Female\n463 6 Male\n464 11 Female\n465 11 Male\n466 10 Female\n467 12 Male\n468 14 Female\n469 11 Female\n470 1 Male\n471 2 Female\n472 3 Male\n473 3 Female\n474 5 Female\n475 3 Male\n476 1 Male\n477 4 Female\n478 4 Female\n479 4 Male\n480 2 Female\n481 5 Female\n482 7 Male\n483 8 Male\n484 10 Male\n485 6 Female\n486 7 Male\n487 10 Female\n488 6 Male\n489 6 Female\n490 15 Female\n491 5 Male\n492 3 Male\n493 5 Male\n494 3 Female\n495 5 Male\n496 5 Male\n497 1 Female\n498 1 Male\n499 7 Female\n500 14 Female\n501 9 Male\n502 10 Female\n503 10 Female\n504 11 Male\n505 11 Female\n506 12 Female\n507 11 Female\n508 12 Male\n509 12 Male\n510 10 Female\n511 1 Male\n512 2 
Female\n513 4 Male\n514 2 Male\n515 3 Male\n516 3 Female\n517 2 Male\n518 4 Male\n519 3 Male\n520 1 Female\n521 4 Male\n522 12 Female\n523 6 Male\n524 7 Female\n525 7 Male\n526 13 Female\n527 8 Female\n528 7 Male\n529 8 Female\n530 8 Female\n531 11 Female\n532 14 Female\n533 3 Male\n534 2 Female\n535 2 Male\n536 3 Male\n537 2 Male\n538 2 Female\n539 3 Female\n540 2 Male\n541 5 Male\n542 10 Female\n543 14 Male\n544 9 Male\n545 6 Male\n546 7 Male\n547 14 Female\n548 7 Female\n549 7 Male\n550 9 Male\n551 14 Male\n552 10 Female\n553 13 Female\n554 5 Male\n555 4 Female\n556 4 Female\n557 5 Female\n558 4 Female\n559 4 Male\n560 4 Male\n561 3 Female\n562 1 Female\n563 4 Male\n564 1 Male\n565 1 Female\n566 7 Male\n567 13 Female\n568 10 Female\n569 14 Male\n570 12 Female\n571 14 Male\n572 8 Male\n573 7 Male\n574 11 Female\n575 8 Male\n576 12 Male\n577 9 Female\n578 5 Female\n579 4 Male\n580 3 Female\n581 2 Male\n582 2 Male\n583 3 Male\n584 4 Female\n585 4 Male\n586 4 Female\n587 5 Male\n588 3 Female\n589 6 Female\n590 3 Male\n591 11 Female\n592 11 Male\n593 7 Male\n594 8 Male\n595 6 Female\n596 10 Female\n597 8 Female\n598 8 Male\n599 9 Female\n600 8 Male\n601 13 Male\n602 11 Male\n603 8 Female\n604 2 Female\n605 4 Male\n606 2 Male\n607 2 Female\n608 4 Male\n609 2 Male\n610 4 Female\n611 2 Female\n612 4 Female\n613 1 Female\n614 4 Female\n615 12 Female\n616 7 Female\n617 11 Male\n618 6 Male\n619 8 Male\n620 14 Male\n621 11 Male\n622 7 Female\n623 14 Female\n624 6 Male\n625 13 Female\n626 13 Female\n627 3 Male\n628 1 Male\n629 3 Male\n630 1 Female\n631 1 Female\n632 2 Male\n633 4 Male\n634 4 Male\n635 2 Female\n636 4 Female\n637 5 Male\n638 3 Female\n639 3 Male\n640 6 Female\n641 11 Female\n642 9 Female\n643 7 Female\n644 8 Male\n645 NA Female\n646 8 Female\n647 14 Female\n648 10 Male\n649 10 Male\n650 11 Female\n651 13 Female\n\n\nWe can remove select columns using indexing as well, OR by simply changing the column to NULL\n\ndf[, -5] #remove column 5, \"slum\" variable\n\n\ndf$slum <- NULL # this is the same as above\n\nWe can also grab the age column using the $ operator, again this is selecting the variable for all of the rows.\n\ndf$age", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module05-DataImportExport.html#what-would-happen-if-we-made-these-mistakes", - "href": "modules/Module05-DataImportExport.html#what-would-happen-if-we-made-these-mistakes", - "title": "Module 5: Data Import and Export", - "section": "What would happen if we made these mistakes (*)", - "text": "What would happen if we made these mistakes (*)\n\nWhat do you think would happen if I had imported the data without assigning it to an object\n\n\nread_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n\n\nWhat do you think would happen if I forgot to specify the sheet argument?\n\n\ndd <- read_excel(path = \"data/serodata.xlsx\")" + "objectID": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-rows", + "href": "modules/Module06-DataSubset.html#using-indexing-to-subset-by-rows", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Using indexing to subset by rows", + "text": "Using indexing to subset by rows\nWe can use indexing to also subset by rows. 
For example, here we pull the 100th observation/row.\n\ndf[100,] \n\n observation_id IgG_concentration age gender slum\n100 8122 0.1818182 5 Female Non slum\n\n\nAnd, here we pull the age of the 100th observation/row.\n\ndf[100,\"age\"] \n\n[1] 5", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module08-DataMergeReshape.html#reshape-metadata", - "href": "modules/Module08-DataMergeReshape.html#reshape-metadata", - "title": "Module 8: Data Merging and Reshaping", - "section": "reshape metadata", - "text": "reshape metadata\nWhenever you use reshape() to change the data format, it leaves behind some metadata on our new data frame, as an attr.\n\nstr(df_wide_to_long)\n\n'data.frame': 1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n\n\nThis stores information so we can reshape() back to the other format and we don’t have to specify arguments again.\n\ndf_back_to_wide <- reshape(df_wide_to_long)" + "objectID": "modules/Module06-DataSubset.html#logical-operators", + "href": "modules/Module06-DataSubset.html#logical-operators", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Logical operators", + "text": "Logical operators\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE\n\n\n\noperator\noperator option\ndescription\n\n\n\n\n<\n%l%\nless than\n\n\n<=\n%le%\nless than or equal to\n\n\n>\n%g%\ngreater than\n\n\n>=\n%ge%\ngreater than or equal to\n\n\n==\n\nequal to\n\n\n!=\n\nnot equal to\n\n\nx&y\n\nx and y\n\n\nx|y\n\nx or y\n\n\n%in%\n\nmatch\n\n\n%!in%\n\ndo not match", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module01-Intro.html#object-names", - "href": "modules/Module01-Intro.html#object-names", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Object names", - "text": "Object names\n\nIn general, any object name can be typed into R.\nHowever, only some are considered “valid”. If you use a non-valid object name, you will have to enclose it in backticks `like this` for R to recognize it.\nFrom the R documentation:\n\n\nA syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.\n\n\nReserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_Complex_, _NA_Character, ..., ..1, ..2, ..3, and so on.", + "objectID": "modules/Module06-DataSubset.html#logical-operators-examples", + "href": "modules/Module06-DataSubset.html#logical-operators-examples", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Logical operators examples", + "text": "Logical operators examples\nLet’s practice. 
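The same comparison operators drive row subsetting. Here is a compact sketch with an invented toy data frame (not the workshop's serodata), including the subset() helper named in the module's learning objectives:

kids <- data.frame(age = c(2, 7, 12, 4, 9), igg = c(0.3, 12, 140, 1.6, 88))
kids[kids$age <= 5, ]                # bracket indexing with a logical test on the rows
kids[kids$age %in% c(7, 9), "igg"]   # %in% to match a set of values
subset(kids, age > 5 & igg >= 10)    # subset() expresses the same row filter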
First, here is a reminder of what the number.object contains.\n\nnumber.object\n\n[1] 3\n\n\nNow, we will use logical operators to evaluate the object.\n\nnumber.object<4\n\n[1] TRUE\n\nnumber.object>=3\n\n[1] TRUE\n\nnumber.object!=5\n\n[1] TRUE\n\nnumber.object %in% c(6,7,2)\n\n[1] FALSE\n\n\nWe can use any of these logical operators to subset our data.\n\n# Overall mean\nmean(df$IgG_concentration, na.rm=TRUE)\n\n[1] 87.36826\n\n# Mean for all children who are not age 3\nmean(df$IgG_concentration[df$age != 3], na.rm=TRUE)\n\n[1] 90.32824\n\n# Mean for all children who are between 0 and 3 or between 7 and 10 years old\nmean(df$IgG_concentration[df$age %in% c(0:3, 7:10)], na.rm=TRUE)\n\n[1] 74.0914", "crumbs": [ "Day 1", - "Module 1: Introduction to RStudio and R Basics" + "Module 6: Get to Know Your Data and Subsetting" ] }, { - "objectID": "modules/Module01-Intro.html#object-names-1", - "href": "modules/Module01-Intro.html#object-names-1", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Object names", - "text": "Object names\n\n\n\nValid\nInvalid\n\n\n\n\nmy_object\nmy-data\n\n\nthe.vector\n2data\n\n\nnum12\nfor\n\n\nmeasles_data\n.9data\n\n\n.calc\nxX~mŷ_δätą~Xx", + "objectID": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-rename-columns", + "href": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-rename-columns", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Using indexing and logical operators to rename columns", + "text": "Using indexing and logical operators to rename columns\n\nWe can assign the column names from data frame df to an object cn, then we can modify cn directly using indexing and logical operators, finally we reassign the column names, cn, back to the data frame df:\n\n\ncn <- colnames(df)\ncn\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\ncn==\"IgG_concentration\"\n\n[1] FALSE TRUE FALSE FALSE FALSE\n\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_mIU\" #rename cn to \"IgG_concentration_mIU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration_mIU\" \"age\" \n[4] \"gender\" \"slum\" \n\n\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\ncolnames(df)[colnames(df)==\"IgG_concentration_mIU\"] <- \"IgG_concentration\" #reset", "crumbs": [ "Day 1", - "Module 1: Introduction to RStudio and R Basics" + "Module 6: Get to Know Your Data and Subsetting" ] }, { - "objectID": "modules/Module01-Intro.html#object-names---good-coding", - "href": "modules/Module01-Intro.html#object-names---good-coding", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Object names - Good coding", - "text": "Object names - Good coding\n\nIn general, any object name can be typed into R.\nHowever, only some are considered “valid”. If you use a non-valid object name, you will have to enclose it in backticks `like this` for R to recognize it.\nFrom the R documentation:\n\n\nA syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. 
Names such as “.2way” are not valid, and neither are the reserved words.\n\n\nReserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_Complex_, _NA_Character, ..., ..1, ..2, ..3, and so on.", + "objectID": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-subset-data", + "href": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-subset-data", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Using indexing and logical operators to subset data", + "text": "Using indexing and logical operators to subset data\nIn this example, we subset by rows and pull only observations with an age of less than or equal to 10 and then saved the subset data to df_lt10. Note that the logical operators df$age<=10 is before the comma because I want to subset by rows (the first dimension).\n\ndf_lte10 <- df[df$age<=10, ]\n\nLets check that my subsets worked using the summary() function.\n\nsummary(df_lte10$age)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.0 3.0 4.0 4.8 7.0 10.0 9 \n\n\n\nIn the next example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.\n\ndf_lte5_gt10 <- df[df$age<=5 | df$age>10, ]\n\nLets check that my subsets worked using the summary() function.\n\nsummary(df_lte5_gt10$age)\n\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.00 2.50 4.00 6.08 11.00 15.00 9", "crumbs": [ "Day 1", - "Module 1: Introduction to RStudio and R Basics" + "Module 6: Get to Know Your Data and Subsetting" ] }, { - "objectID": "modules/Module01-Intro.html#object-names---good-coding-1", - "href": "modules/Module01-Intro.html#object-names---good-coding-1", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Object names - Good coding", - "text": "Object names - Good coding\n\n\n\nValid\nInvalid\n\n\n\n\nmy_object\nmy-data\n\n\nthe.vector\n2data\n\n\nnum12\nfor\n\n\nmeasles_data\n.9data\n\n\n.calc\nxX~mŷ_δätą~Xx", + "objectID": "modules/Module06-DataSubset.html#missing-values", + "href": "modules/Module06-DataSubset.html#missing-values", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Missing values", + "text": "Missing values\nMissing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data and how you can identify them, is the first step to data cleaning.\nTypes of “missing” values:\n\nNA - Not Applicable general missing data\nNaN - stands for “Not a Number”, happens when you do 0/0.\nInf and -Inf - Infinity, happens when you divide a positive number (or negative number) by 0.\nblank space - sometimes when data is read it, there is a blank space left\nan empty string (e.g., \"\")\nNULL- undefined value that represents something that does not exist", "crumbs": [ "Day 1", - "Module 1: Introduction to RStudio and R Basics" + "Module 6: Get to Know Your Data and Subsetting" ] }, { - "objectID": "modules/Module01-Intro.html#object-assingment---good-coding", - "href": "modules/Module01-Intro.html#object-assingment---good-coding", - "title": "Module 1: Introduction to RStudio and R Basics", - "section": "Object assingment - Good coding", - "text": "Object assingment - Good coding\n= and <- can both be used for assignment, but <- is better coding practice, because sometimes = doesn’t work and we want to distinguish between the logical operator ==. 
We will talk about this more, later.", + "objectID": "modules/Module06-DataSubset.html#logical-operators-to-help-identify-and-missing-data", + "href": "modules/Module06-DataSubset.html#logical-operators-to-help-identify-and-missing-data", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Logical operators to help identify and missing data", + "text": "Logical operators to help identify and missing data\n\n\n\noperator\ndescription\n\n\n\n\n\nis.na\nis NAN or NA\n\n\n\nis.nan\nis NAN\n\n\n\n!is.na\nis not NAN or NA\n\n\n\n!is.nan\nis not NAN\n\n\n\nis.infinite\nis infinite\n\n\n\nany\nare any TRUE\n\n\n\nall\nall are TRUE\n\n\n\nwhich\nwhich are TRUE", "crumbs": [ "Day 1", - "Module 1: Introduction to RStudio and R Basics" + "Module 6: Get to Know Your Data and Subsetting" ] }, { - "objectID": "modules/Module02-Functions.html#functions-from-module-1-2", - "href": "modules/Module02-Functions.html#functions-from-module-1-2", - "title": "Module 2: Functions", - "section": "Functions from Module 1", - "text": "Functions from Module 1\nThe matrix() function creates a matrix from the given set of values.\n\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\nmatrix.object\n?matrix\n\n\n\nMatrices\n\nDescription:\n\n 'matrix' creates a matrix from the given set of values.\n\n 'as.matrix' attempts to turn its argument into a matrix.\n\n 'is.matrix' tests if its argument is a (strict) matrix.\n\nUsage:\n\n matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,\n dimnames = NULL)\n \n as.matrix(x, ...)\n ## S3 method for class 'data.frame'\n as.matrix(x, rownames.force = NA, ...)\n \n is.matrix(x)\n \nArguments:\n\n data: an optional data vector (including a list or 'expression'\n vector). Non-atomic classed R objects are coerced by\n 'as.vector' and all attributes discarded.\n\n nrow: the desired number of rows.\n\n ncol: the desired number of columns.\n\n byrow: logical. If 'FALSE' (the default) the matrix is filled by\n columns, otherwise the matrix is filled by rows.\n\ndimnames: A 'dimnames' attribute for the matrix: 'NULL' or a 'list' of\n length 2 giving the row and column names respectively. An\n empty list is treated as 'NULL', and a list of length one as\n row names. The list can be named, and the list names will be\n used as names for the dimensions.\n\n x: an R object.\n\n ...: additional arguments to be passed to or from methods.\n\nrownames.force: logical indicating if the resulting matrix should have\n character (rather than 'NULL') 'rownames'. The default,\n 'NA', uses 'NULL' rownames if the data frame has 'automatic'\n row.names or for a zero-row data frame.\n\nDetails:\n\n If one of 'nrow' or 'ncol' is not given, an attempt is made to\n infer it from the length of 'data' and the other parameter. If\n neither is given, a one-column matrix is returned.\n\n If there are too few elements in 'data' to fill the matrix, then\n the elements in 'data' are recycled. If 'data' has length zero,\n 'NA' of an appropriate type is used for atomic vectors ('0' for\n raw vectors) and 'NULL' for lists.\n\n 'is.matrix' returns 'TRUE' if 'x' is a vector and has a '\"dim\"'\n attribute of length 2 and 'FALSE' otherwise. Note that a\n 'data.frame' is *not* a matrix by this test. The function is\n generic: you can write methods to handle specific classes of\n objects, see InternalMethods.\n\n 'as.matrix' is a generic function. 
The method for data frames\n will return a character matrix if there is only atomic columns and\n any non-(numeric/logical/complex) column, applying 'as.vector' to\n factors and 'format' to other non-character columns. Otherwise,\n the usual coercion hierarchy (logical < integer < double <\n complex) will be used, e.g., all-logical data frames will be\n coerced to a logical matrix, mixed logical-integer will give a\n integer matrix, etc.\n\n The default method for 'as.matrix' calls 'as.vector(x)', and hence\n e.g. coerces factors to character vectors.\n\n When coercing a vector, it produces a one-column matrix, and\n promotes the names (if any) of the vector to the rownames of the\n matrix.\n\n 'is.matrix' is a primitive function.\n\n The 'print' method for a matrix gives a rectangular layout with\n dimnames or indices. For a list matrix, the entries of length not\n one are printed in the form 'integer,7' indicating the type and\n length.\n\nNote:\n\n If you just want to convert a vector to a matrix, something like\n\n dim(x) <- c(nx, ny)\n dimnames(x) <- list(row_names, col_names)\n \n will avoid duplicating 'x' _and_ preserve 'class(x)' which may be\n useful, e.g., for 'Date' objects.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'data.matrix', which attempts to convert to a numeric matrix.\n\n A matrix is the special case of a two-dimensional 'array'.\n 'inherits(m, \"array\")' is true for a 'matrix' 'm'.\n\nExamples:\n\n is.matrix(as.matrix(1:10))\n !is.matrix(warpbreaks) # data.frame, NOT matrix!\n warpbreaks[1:10,]\n as.matrix(warpbreaks[1:10,]) # using as.matrix.data.frame(.) method\n \n ## Example of setting row and column names\n mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,\n dimnames = list(c(\"row1\", \"row2\"),\n c(\"C.1\", \"C.2\", \"C.3\")))\n mdat" + "objectID": "modules/Module06-DataSubset.html#more-logical-operators-examples", + "href": "modules/Module06-DataSubset.html#more-logical-operators-examples", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "More logical operators examples", + "text": "More logical operators examples\n\ntest <- c(0,NA, -1)/0\ntest\n\n[1] NaN NA -Inf\n\nis.na(test)\n\n[1] TRUE TRUE FALSE\n\nis.nan(test)\n\n[1] TRUE FALSE FALSE\n\nis.infinite(test)\n\n[1] FALSE FALSE TRUE", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#grouped-analyses", - "href": "modules/Module09-DataAnalysis.html#grouped-analyses", - "title": "Module 9: Data Analysis", - "section": "Grouped analyses", - "text": "Grouped analyses\n\nMost of this module will discuss statistical analyses. 
But first we’ll discuss doing univariate analyses we’ve already used on multiple groups.\nWe can use the aggregate() function to do many analyses across groups.\n\n\n?aggregate\n\n\nlibrary(printr)\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n?aggregate\n\nCompute Summary Statistics of Data Subsets\n\nDescription:\n\n Splits the data into subsets, computes summary statistics for\n each, and returns the result in a convenient form.\n\nUsage:\n\n aggregate(x, ...)\n \n ## Default S3 method:\n aggregate(x, ...)\n \n ## S3 method for class 'data.frame'\n aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)\n \n ## S3 method for class 'formula'\n aggregate(x, data, FUN, ...,\n subset, na.action = na.omit)\n \n ## S3 method for class 'ts'\n aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,\n ts.eps = getOption(\"ts.eps\"), ...)\n \nArguments:\n\n x: an R object. For the 'formula' method a 'formula', such as\n 'y ~ x' or 'cbind(y1, y2) ~ x1 + x2', where the 'y' variables\n are numeric data to be split into groups according to the\n grouping 'x' variables (usually factors).\n\n by: a list of grouping elements, each as long as the variables in\n the data frame 'x', or a formula. The elements are coerced\n to factors before use.\n\n FUN: a function to compute the summary statistics which can be\n applied to all data subsets.\n\nsimplify: a logical indicating whether results should be simplified to\n a vector or matrix if possible.\n\n drop: a logical indicating whether to drop unused combinations of\n grouping values. The non-default case 'drop=FALSE' has been\n amended for R 3.5.0 to drop unused combinations.\n\n data: a data frame (or list) from which the variables in the\n formula should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\nnfrequency: new number of observations per unit of time; must be a\n divisor of the frequency of 'x'.\n\n ndeltat: new fraction of the sampling period between successive\n observations; must be a divisor of the sampling interval of\n 'x'.\n\n ts.eps: tolerance used to decide if 'nfrequency' is a sub-multiple of\n the original frequency.\n\n ...: further arguments passed to or used by methods.\n\nDetails:\n\n 'aggregate' is a generic function with methods for data frames and\n time series.\n\n The default method, 'aggregate.default', uses the time series\n method if 'x' is a time series, and otherwise coerces 'x' to a\n data frame and calls the data frame method.\n\n 'aggregate.data.frame' is the data frame method. If 'x' is not a\n data frame, it is coerced to one, which must have a non-zero\n number of rows. Then, each of the variables (columns) in 'x' is\n split into subsets of cases (rows) of identical combinations of\n the components of 'by', and 'FUN' is applied to each such subset\n with further arguments in '...' passed to it. The result is\n reformatted into a data frame containing the variables in 'by' and\n 'x'. The ones arising from 'by' contain the unique combinations\n of grouping values used for determining the subsets, and the ones\n arising from 'x' the corresponding summaries for the subset of the\n respective variables in 'x'. 
If 'simplify' is true, summaries are\n simplified to vectors or matrices if they have a common length of\n one or greater than one, respectively; otherwise, lists of summary\n results according to subsets are obtained. Rows with missing\n values in any of the 'by' variables will be omitted from the\n result. (Note that versions of R prior to 2.11.0 required 'FUN'\n to be a scalar function.)\n\n The formula method provides a standard formula interface to\n 'aggregate.data.frame'. The latter invokes the formula method if\n 'by' is a formula, in which case 'aggregate(x, by, FUN)' is the\n same as 'aggregate(by, x, FUN)' for a data frame 'x'.\n\n 'aggregate.ts' is the time series method, and requires 'FUN' to be\n a scalar function. If 'x' is not a time series, it is coerced to\n one. Then, the variables in 'x' are split into appropriate blocks\n of length 'frequency(x) / nfrequency', and 'FUN' is applied to\n each such block, with further (named) arguments in '...' passed to\n it. The result returned is a time series with frequency\n 'nfrequency' holding the aggregated values. Note that this make\n most sense for a quarterly or yearly result when the original\n series covers a whole number of quarters or years: in particular\n aggregating a monthly series to quarters starting in February does\n not give a conventional quarterly series.\n\n 'FUN' is passed to 'match.fun', and hence it can be a function or\n a symbol or character string naming a function.\n\nValue:\n\n For the time series method, a time series of class '\"ts\"' or class\n 'c(\"mts\", \"ts\")'.\n\n For the data frame method, a data frame with columns corresponding\n to the grouping variables in 'by' followed by aggregated columns\n from 'x'. If the 'by' has names, the non-empty times are used to\n label the columns in the results, with unnamed grouping variables\n being named 'Group.i' for 'by[[i]]'.\n\nWarning:\n\n The first argument of the '\"formula\"' method was named 'formula'\n rather than 'x' prior to R 4.2.0. Portable uses should not name\n that argument.\n\nAuthor(s):\n\n Kurt Hornik, with contributions by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'apply', 'lapply', 'tapply'.\n\nExamples:\n\n ## Compute the averages for the variables in 'state.x77', grouped\n ## according to the region (Northeast, South, North Central, West) that\n ## each state belongs to.\n aggregate(state.x77, list(Region = state.region), mean)\n \n ## Compute the averages according to region and the occurrence of more\n ## than 130 days of frost.\n aggregate(state.x77,\n list(Region = state.region,\n Cold = state.x77[,\"Frost\"] > 130),\n mean)\n ## (Note that no state in 'South' is THAT cold.)\n \n \n ## example with character variables and NAs\n testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),\n v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )\n by1 <- c(\"red\", \"blue\", 1, 2, NA, \"big\", 1, 2, \"red\", 1, NA, 12)\n by2 <- c(\"wet\", \"dry\", 99, 95, NA, \"damp\", 95, 99, \"red\", 99, NA, NA)\n aggregate(x = testDF, by = list(by1, by2), FUN = \"mean\")\n \n # and if you want to treat NAs as a group\n fby1 <- factor(by1, exclude = \"\")\n fby2 <- factor(by2, exclude = \"\")\n aggregate(x = testDF, by = list(fby1, fby2), FUN = \"mean\")\n \n \n ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:\n aggregate(weight ~ feed, data = chickwts, mean)\n aggregate(breaks ~ wool + tension, data = warpbreaks, mean)\n aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)\n aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)\n \n ## Dot notation:\n aggregate(. ~ Species, data = iris, mean)\n aggregate(len ~ ., data = ToothGrowth, mean)\n \n ## Often followed by xtabs():\n ag <- aggregate(len ~ ., data = ToothGrowth, mean)\n xtabs(len ~ ., data = ag)\n \n ## Formula interface via 'by' (for pipe operations)\n ToothGrowth |> aggregate(len ~ ., FUN = mean)\n \n ## Compute the average annual approval ratings for American presidents.\n aggregate(presidents, nfrequency = 1, FUN = mean)\n ## Give the summer less weight.\n aggregate(presidents, nfrequency = 1,\n FUN = weighted.mean, w = c(1, 1, 0.5, 1))" + "objectID": "modules/Module06-DataSubset.html#more-logical-operators-examples-1", + "href": "modules/Module06-DataSubset.html#more-logical-operators-examples-1", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "More logical operators examples", + "text": "More logical operators examples\nany(is.na(x)) means do we have any NA’s in the object x?\n\nany(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE\n\n[1] TRUE\n\nany(is.na(df$slum)) # are there any NAs- NO/FALSE\n\n[1] FALSE\n\n\nwhich(is.na(x)) means which of the elements in object x are NA’s?\n\nwhich(is.na(df$IgG_concentration)) \n\n [1] 13 55 57 72 182 406 414 478 488 595\n\nwhich(is.na(df$slum)) \n\ninteger(0)", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module09-DataAnalysis.html#grouped-analyses-1", - "href": "modules/Module09-DataAnalysis.html#grouped-analyses-1", - "title": "Module 9: Data Analysis", - "section": "Grouped analyses", - "text": "Grouped analyses\n\nLet’s calculate seropositivity rate across age groups using the variables we just created.\nThe easiest way to use aggregate() is with the formula option. 
The syntax is variable_of_intest ~ grouping_variables.\n\n\naggregate(\n # Formula specifies we are calculating statistics on seropos, separately for\n # each level of age_group\n seropos ~ age_group,\n data = df, # Data argument\n FUN = mean # function for our calculation WITHOUT PARENTHESES\n)\n\n\n\n\nage_group\nseropos\n\n\n\n\nyoung\n0.1832797\n\n\nmiddle\n0.6000000\n\n\nold\n0.7945205\n\n\n\n\n\n\nWe can add as many things as we want on the RHS of the formula.\n\n\naggregate(\n IgG_concentration ~ age_group + slum,\n data = df,\n FUN = sd # standard deviation\n)\n\n\n\n\nage_group\nslum\nIgG_concentration\n\n\n\n\nyoung\nMixed\n174.89797\n\n\nmiddle\nMixed\n162.08188\n\n\nold\nMixed\n150.07063\n\n\nyoung\nNon slum\n114.68422\n\n\nmiddle\nNon slum\n177.62113\n\n\nold\nNon slum\n141.22330\n\n\nyoung\nSlum\n61.85705\n\n\nmiddle\nSlum\n202.42018\n\n\nold\nSlum\n74.75217\n\n\n\n\n\n\nWe can also add multiple variables on the LHS at the same time using cbind() syntax.\n\n\naggregate(\n cbind(age, IgG_concentration) ~ gender + slum,\n data = df,\n FUN = median\n)\n\n\n\n\ngender\nslum\nage\nIgG_concentration\n\n\n\n\nFemale\nMixed\n5.0\n2.0117423\n\n\nMale\nMixed\n6.0\n2.2082192\n\n\nFemale\nNon slum\n6.0\n2.5040431\n\n\nMale\nNon slum\n5.0\n1.1245846\n\n\nFemale\nSlum\n3.0\n5.1482480\n\n\nMale\nSlum\n5.5\n0.7753834" + "objectID": "modules/Module06-DataSubset.html#subset-function", + "href": "modules/Module06-DataSubset.html#subset-function", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "subset() function", + "text": "subset() function\nThe Base R subset() function is a slightly easier way to select variables and observations.\n\n?subset\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\nSubsetting Vectors, Matrices and Data Frames\nDescription:\n Return subsets of vectors, matrices or data frames which meet\n conditions.\nUsage:\n subset(x, ...)\n \n ## Default S3 method:\n subset(x, subset, ...)\n \n ## S3 method for class 'matrix'\n subset(x, subset, select, drop = FALSE, ...)\n \n ## S3 method for class 'data.frame'\n subset(x, subset, select, drop = FALSE, ...)\n \nArguments:\n x: object to be subsetted.\nsubset: logical expression indicating elements or rows to keep: missing values are taken as false.\nselect: expression, indicating columns to select from a data frame.\ndrop: passed on to '[' indexing operator.\n\n ...: further arguments to be passed to or from other methods.\nDetails:\n This is a generic function, with methods supplied for matrices,\n data frames and vectors (including lists). Packages and users can\n add further methods.\n\n For ordinary vectors, the result is simply 'x[subset &\n !is.na(subset)]'.\n\n For data frames, the 'subset' argument works on the rows. Note\n that 'subset' will be evaluated in the data frame, so columns can\n be referred to (by name) as variables in the expression (see the\n examples).\n\n The 'select' argument exists only for the methods for data frames\n and matrices. It works by first replacing column names in the\n selection expression with the corresponding column numbers in the\n data frame and then using the resulting integer vector to index\n the columns. 
This allows the use of the standard indexing\n conventions so that for example ranges of columns can be specified\n easily, or single columns can be dropped (see the examples).\n\n The 'drop' argument is passed on to the indexing method for\n matrices and data frames: note that the default for matrices is\n different from that for indexing.\n\n Factors may have empty levels after subsetting; unused levels are\n not automatically removed. See 'droplevels' for a way to drop all\n unused levels from a data frame.\nValue:\n An object similar to 'x' contain just the selected elements (for a\n vector), rows and columns (for a matrix or data frame), and so on.\nWarning:\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n functions like '[', and in particular the non-standard evaluation\n of argument 'subset' can have unanticipated consequences.\nAuthor(s):\n Peter Dalgaard and Brian Ripley\nSee Also:\n '[', 'transform' 'droplevels'\nExamples:\n subset(airquality, Temp > 80, select = c(Ozone, Temp))\n subset(airquality, Day == 1, select = -Temp)\n subset(airquality, select = Ozone:Wind)\n \n with(airquality, subset(Ozone, Temp > 80))\n \n ## sometimes requiring a logical 'subset' argument is a nuisance\n nm <- rownames(state.x77)\n start_with_M <- nm %in% grep(\"^M\", nm, value = TRUE)\n subset(state.x77, start_with_M, Illiteracy:Murder)\n # but in recent versions of R this can simply be\n subset(state.x77, grepl(\"^M\", nm), Illiteracy:Murder)", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#hist-help-file", - "href": "modules/Module10-DataVisualization.html#hist-help-file", - "title": "Module 10: Data Visualization", - "section": "hist() Help File", - "text": "hist() Help File\n\n?hist\n\nHistograms\nDescription:\n The generic function 'hist' computes a histogram of the given data\n values. If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\nUsage:\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n x: a vector of values for which the histogram is desired.\nbreaks: one of:\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). 
If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\nfreq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\nprobability: an alias for ‘!freq’, for S compatibility.\ninclude.lowest: logical; if ‘TRUE’, an ‘x[i]’ equal to the ‘breaks’ value will be included in the first (or last, for ‘right = FALSE’) bar. This will be ignored (with a warning) unless ‘breaks’ is a vector.\nright: logical; if ‘TRUE’, the histogram cells are right-closed (left open) intervals.\nfuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\ndensity: the density of shading lines, in lines per inch. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise).\n col: a colour to be used to fill the bars.\nborder: the color of the border around the bars. The default is to use the standard foreground color.\nmain, xlab, ylab: main title and axis labels: these arguments to ‘title()’ get “smart” defaults here, e.g., the default ‘ylab’ is ‘“Frequency”’ iff ‘freq’ is true.\nxlim, ylim: the range of x and y values with sensible defaults. Note that ‘xlim’ is not used to define the histogram (breaks), but only for plotting (when ‘plot = TRUE’).\naxes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\nplot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\nlabels: logical or character string. Additionally draw labels on top of bars, if not ‘FALSE’; see ‘plot.histogram’.\nnclass: numeric (integer). For S(-PLUS) compatibility only, ‘nclass’ is equivalent to ‘breaks’ for a scalar or character argument.\nwarn.unused: logical. If ‘plot = FALSE’ and ‘warn.unused = TRUE’, a warning will be issued when graphical parameters are passed to ‘hist.default()’.\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\nDetails:\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. 
Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\nValue:\n an object of class '\"histogram\"' which is a list with components:\nbreaks: the n+1 cell boundaries (= ‘breaks’ if that was a vector). These are the nominal breaks, not with the boundary fuzz.\ncounts: n integers; for each cell, the number of ‘x[]’ inside.\ndensity: values f^(x[i]), as estimated density values. If ‘all(diff(breaks) == 1)’, they are the relative frequencies ‘counts/n’ and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = ‘breaks[i]’.\nmids: the n cell midpoints.\nxname: a character string with the actual ‘x’ argument name.\nequidist: logical, indicating if the distances between ‘breaks’ are all the same.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\nSee Also:\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\nExamples:\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... :\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)" + "objectID": "modules/Module06-DataSubset.html#subsetting-use-the-subset-function", + "href": "modules/Module06-DataSubset.html#subsetting-use-the-subset-function", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Subsetting use the subset() function", + "text": "Subsetting use the subset() function\nHere are a few examples using the subset() function\n\ndf_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))\ndf_lt5_f <- subset(df, df$age<=5 & gender==\"Female\", select=c(IgG_concentration, slum))", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#hist-example", - "href": "modules/Module10-DataVisualization.html#hist-example", - "title": "Module 10: Data Visualization", - "section": "hist() example", - "text": "hist() example\nReminder function signature\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\nLet’s practice\n\nhist(df$age)\n\n\n\n\n\n\n\nhist(\n df$age, \n freq=FALSE, \n main=\"Histogram\", \n xlab=\"Age (years)\"\n )" + "objectID": "modules/Module06-DataSubset.html#subset-function-vs-logical-operators", + "href": "modules/Module06-DataSubset.html#subset-function-vs-logical-operators", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "subset() function vs logical operators", + "text": "subset() function vs logical operators\nsubset() 
automatically removes NAs, which is a different behavior from doing logical operations on NAs.\n\nsummary(df_lte10$age) #created with indexing\n\n\n\n\nMin.\n1st Qu.\nMedian\nMean\n3rd Qu.\nMax.\nNA’s\n\n\n\n\n1\n3\n4\n4.8\n7\n10\n9\n\n\n\n\nsummary(df_lte10_v2$age) #created with the subset function\n\n\n\n\nMin.\n1st Qu.\nMedian\nMean\n3rd Qu.\nMax.\n\n\n\n\n1\n3\n4\n4.8\n7\n10\n\n\n\n\n\nWe can also see this by looking at the number or rows in each dataset.\n\nnrow(df_lte10)\n\n[1] 504\n\nnrow(df_lte10_v2)\n\n[1] 495", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#adding-more-stuff-to-the-same-plot", - "href": "modules/Module10-DataVisualization.html#adding-more-stuff-to-the-same-plot", - "title": "Module 10: Data Visualization", - "section": "Adding more stuff to the same plot", - "text": "Adding more stuff to the same plot\n\nWe can use the functions points() or lines() to add additional points or additional lines to an existing plot.\n\n\nplot(\n df$age[df$slum == \"Non slum\"],\n df$IgG_concentration[df$slum == \"Non slum\"],\n type = \"p\",\n main = \"IgG Concentration vs Age\",\n xlab = \"Age (years)\",\n ylab = \"IgG Concentration (IU/mL)\",\n pch = 16,\n cex = 0.9,\n col = \"lightblue\",\n xlim = range(df$age, na.rm = TRUE),\n ylim = range(df$IgG_concentration, na.rm = TRUE)\n)\npoints(\n df$age[df$slum == \"Mixed\"],\n df$IgG_concentration[df$slum == \"Mixed\"],\n pch = 16,\n cex = 0.9,\n col = \"blue\"\n)\npoints(\n df$age[df$slum == \"Slum\"],\n df$IgG_concentration[df$slum == \"Slum\"],\n pch = 16,\n cex = 0.9,\n col = \"darkblue\"\n)\n\n\n\nThe lines() function works similarly for connected lines.\nNote that the points() or lines() functions must be called with a plot()-style function\nWe will show how we could draw a legend() in a future section." 
+ "objectID": "modules/Module06-DataSubset.html#summary", + "href": "modules/Module06-DataSubset.html#summary", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Summary", + "text": "Summary\n\ncolnames(), str() and summary()functions from Base R are functions to assess the data type and some summary statistics\nThere are three basic indexing syntax: [, [[ and $\nIndexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing\nThere are 7 “types” of missing values, the most common being “NA”\nLogical operators meant to determine missing values are very helpful for data cleaning\nThe Base R subset() function is a slightly easier way to select variables and observations.", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] }, { - "objectID": "modules/Module10-DataVisualization.html#base-r-plots-vs-the-tidyverse-ggplot2-package", - "href": "modules/Module10-DataVisualization.html#base-r-plots-vs-the-tidyverse-ggplot2-package", - "title": "Module 10: Data Visualization", - "section": "Base R plots vs the Tidyverse ggplot2 package", - "text": "Base R plots vs the Tidyverse ggplot2 package\nIt is good to know both b/c they each have their strengths" + "objectID": "modules/Module06-DataSubset.html#acknowledgements", + "href": "modules/Module06-DataSubset.html#acknowledgements", + "title": "Module 6: Get to Know Your Data and Subsetting", + "section": "Acknowledgements", + "text": "Acknowledgements\nThese are the materials we looked through, modified, or extracted to complete this module’s lecture.\n\n“Introduction to R for Public Health Researchers” Johns Hopkins University\n“Indexing” CRAN Project\n“Logical operators” CRAN Project", + "crumbs": [ + "Day 1", + "Module 6: Get to Know Your Data and Subsetting" + ] } ] \ No newline at end of file diff --git a/docs/site_libs/quarto-html/quarto-syntax-highlighting.css b/docs/site_libs/quarto-html/quarto-syntax-highlighting.css index d9fd98f..b30ce57 100644 --- a/docs/site_libs/quarto-html/quarto-syntax-highlighting.css +++ b/docs/site_libs/quarto-html/quarto-syntax-highlighting.css @@ -85,6 +85,7 @@ code span.st { code span.cf { color: #003B4F; + font-weight: bold; font-style: inherit; } @@ -193,6 +194,7 @@ code span.dv { code span.kw { color: #003B4F; + font-weight: bold; font-style: inherit; } diff --git a/downloads/data.zip b/downloads/data.zip index 3f6a7c1..a4b9c40 100644 Binary files a/downloads/data.zip and b/downloads/data.zip differ diff --git a/downloads/exercises.zip b/downloads/exercises.zip new file mode 100644 index 0000000..d08c9d4 Binary files /dev/null and b/downloads/exercises.zip differ diff --git a/downloads/modules.zip b/downloads/modules.zip new file mode 100644 index 0000000..f073460 Binary files /dev/null and b/downloads/modules.zip differ diff --git a/images/.keep b/images/.keep deleted file mode 100644 index e69de29..0000000 diff --git a/index.qmd b/index.qmd index 64007e2..ddda2a2 100644 --- a/index.qmd +++ b/index.qmd @@ -4,7 +4,7 @@ title: "Welcome" Welcome to "Introduction to R"! 
-This website contains all of the slides and exercises for the [2024 +This website contains all of the material for the [2024 Summer Institute in Modeling for Infectious Diseases (SISMID) Module "Introduction to R"](https://sph.emory.edu/SISMID/modules/intro-to-r/index.html). @@ -27,7 +27,7 @@ by clicking on the correct download link for your OS.
    -

    Instructor: [Dr. Amy Winter](https://publichealth.uga.edu/faculty-member/amy-k-winter/)

    +

    Co-Instructor: [Dr. Amy Winter](https://publichealth.uga.edu/faculty-member/amy-k-winter/)

    @@ -40,7 +40,7 @@ Health to graduate students at the University of Georgia.
    -

    TA: [Zane Billings](https://wzbillings.com/ )

    +

Co-Instructor: [Zane Billings](https://wzbillings.com/)

    diff --git a/modules/ModuleXX-Data-Analysis-Example.qmd b/modules/Module095-DataAnalysisWalkthrough.qmd similarity index 99% rename from modules/ModuleXX-Data-Analysis-Example.qmd rename to modules/Module095-DataAnalysisWalkthrough.qmd index f1b2973..bf41a90 100644 --- a/modules/ModuleXX-Data-Analysis-Example.qmd +++ b/modules/Module095-DataAnalysisWalkthrough.qmd @@ -1,5 +1,5 @@ --- -title: "Data Analysis Example" +title: "Data Analysis Walkthrough" format: revealjs: toc: false diff --git a/modules/Module11-RMarkdown.qmd b/modules/Module11-RMarkdown.qmd index afb6f63..314ae2a 100644 --- a/modules/Module11-RMarkdown.qmd +++ b/modules/Module11-RMarkdown.qmd @@ -1,5 +1,5 @@ --- -title: "Literate Programming" +title: "Module 11: Literate Programming" format: revealjs: toc: false diff --git a/modules/Module11-Rmarkdown-Demo.html b/modules/Module11-Rmarkdown-Demo.html new file mode 100644 index 0000000..90e31ef --- /dev/null +++ b/modules/Module11-Rmarkdown-Demo.html @@ -0,0 +1,680 @@ + + + + + + + + + + + + + + + +R Markdown Notes + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + + + + + +
    +

    This is an example R Markdown document

    +
      +
    • The top part of this document (between the ---) is +called the YAML header. You specify options here that +change the configuration of the document.
    • +
    • Text in the R Markdown body is formatted in the +Pandoc Markdown language. Most of the syntax can be found on the cheat +sheets in the references section.
    • +
    • To include a bibliography in your document, add the +bibliography option to your YAML header and include a +BIBTEX file. A bibtex file looks like this:
    • +
    +
    @Book{rmarkdown-cookbook,
    +  title = {R Markdown Cookbook},
    +  author = {Yihui Xie and Christophe Dervieux and Emily Riederer},
    +  publisher = {Chapman and Hall/CRC},
    +  address = {Boca Raton, Florida},
    +  year = {2020},
    +  isbn = {9780367563837},
    +  url = {https://bookdown.org/yihui/rmarkdown-cookbook},
    +}
    +
    +@Manual{rmarkdown-package,
    +  title = {rmarkdown: Dynamic Documents for R},
    +  author = {JJ Allaire and Yihui Xie and Christophe Dervieux and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone},
    +  year = {2024},
    +  note = {R package version 2.27},
    +  url = {https://github.com/rstudio/rmarkdown},
    +}
    +
      +
    • You can then add citations from your bibliography by adding special +text in your R Markdown document: @rmarkdown-cookbook. +That’s how we can get this citation here (Xie, +Dervieux, and Riederer 2020).
    • +
    +
    +
    +

    Including R code in your Markdown document

    +

    You have to put all of your code in a “Code chunk” and tell +knitr that you are using R code.

    +
    meas <- readRDS(here::here("data", "measles_final.Rds"))
    +str(meas)
    +
    ## 'data.frame':    12438 obs. of  7 variables:
    +##  $ iso3c           : chr  "AFG" "AFG" "AFG" "AFG" ...
    +##  $ time            : int  1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
    +##  $ country         : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
    +##  $ Cases           : int  2792 5166 2900 640 353 2012 1511 638 1154 492 ...
    +##  $ vaccine_antigen : chr  "MCV1" "MCV1" "MCV1" "MCV1" ...
    +##  $ vaccine_coverage: int  11 NA 8 9 14 14 14 31 34 22 ...
    +##  $ total_pop       : chr  "12486631" "11155195" "10088289" "9951449" ...
    +

    You can make plots and add captions in Markdown as well.

    +
    meas_plot <- subset(meas, country == "India" & vaccine_antigen == "MCV1")
    +plot(
    +    meas_plot$time, meas_plot$Cases,
    +    xlab = "Year",
    +    ylab = "Measles cases by year in India",
    +    type = "b"
    +)
    +
    +Meases cases over time in India. +

    +Meases cases over time in India. +

    +
    +

    Note that if you want to automatically reference your +figures like you would need to for a research paper, you will +also need to use the bookdown package, and you can read +about it here. +For this document, we would have to write out “Figure 1.” manually in +our text.

    +
    +
    +

    Including tables and figures from files

    +

    Including tables is a bit more complicated, because unlike +plot(), R cannot produce any tables on its own. Instead we +need to use another package. The easiest option is to use the +knitr package which has a function called +knitr::kable() that can make a table for us, like this.

    +
    meas_table <- data.frame(
    +    "Median cases" = median(meas_plot$Cases),
    +    "IQR cases" = IQR(meas_plot$Cases)
    +)
    +
    +knitr::kable(
    +    meas_table,
    +    caption = "Median and IQR number of measles cases across all years in India."
    +)
    + + + + + + + + + + + + + + +
    Median and IQR number of measles cases across all years in +India.
    Median.casesIQR.cases
    4707244015.5
    +

    You can also use the kableExtra package to format your +table more nicely. In general there are a lot of nice table making +packages in R, like we saw with the tinytable package in +the exercise.

    +
    tinytable::tt(meas_table)
    + + + + + + tinytable_o73g4xhg2p32dbgu79cj + + + + + + + +
    + + + + + + + + + + + + + + + +
    Median.casesIQR.cases
    4707244015.5
    +
    + + + + + + +

    Finally, if you want to include a figure that you already saved +somewhere, you can do that with knitr also.

    +
    knitr::include_graphics(here::here("images", "xkcd.png"))
    +

    +
    +
    +

    R Markdown resources

    + +
    +
    +

    References

    + + +
    +
    +Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R +Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook. +
    +
    +
    + + + + +
    + + + + + + + + + + + + + + + diff --git a/modules/Module13-Iteration.qmd b/modules/Module13-Iteration.qmd index 4cb69aa..42baf50 100644 --- a/modules/Module13-Iteration.qmd +++ b/modules/Module13-Iteration.qmd @@ -1,5 +1,5 @@ --- -title: "Iteration in R" +title: "Module 13: Iteration in R" format: revealjs: toc: false diff --git a/references.qmd b/references.qmd index e597e6b..5da13e4 100644 --- a/references.qmd +++ b/references.qmd @@ -1,4 +1,5 @@ --- +title: Course Resources bibliography: SISMID-Module.bib nocite: "@*" --- @@ -6,9 +7,12 @@ nocite: "@*" # Data and Exercise downloads * Download all datasets here: [click to download](./downloads/data.zip). -* Download all exercises and solution files here: -* Download all slide decks here: -* Course GitHub where all materials can be found: [https://github.com/UGA-IDD/SISMID-2024](https://github.com/UGA-IDD/SISMID-2024). +* Download all exercises and solution files here: [click to download](./downloads/exercises.zip) +* Download all slide decks here: [click to download](./downloads/modules.zip) +* Get the example R Markdown document for Module 11 here: [click to download](./modules/Module11-Rmarkdown-Demo.Rmd){target="_blank"} + - And the sample bibligraphy "bib" file is here: [click to download](./modules/example-bib.bib){target="_blank"} + - And the rendered HTML file is here: [click to download](./modules/Module11-Rmarkdown-Demo.html){target="_blank"} +* Course GitHub where all materials can be found (to download the entire course as a zip file click the green "Code" button): [https://github.com/UGA-IDD/SISMID-2024](https://github.com/UGA-IDD/SISMID-2024){target="_blank"}. # Need help? diff --git a/renv.lock b/renv.lock index 4801f87..14ae8c3 100644 --- a/renv.lock +++ b/renv.lock @@ -9,6 +9,81 @@ ] }, "Packages": { + "DescTools": { + "Package": "DescTools", + "Version": "0.99.54", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Exact", + "MASS", + "R", + "Rcpp", + "base", + "boot", + "cli", + "data.table", + "expm", + "gld", + "grDevices", + "graphics", + "httr", + "methods", + "mvtnorm", + "readxl", + "rstudioapi", + "stats", + "utils", + "withr" + ], + "Hash": "cdd76cdd712d77020083cf669af8b3f3" + }, + "Exact": { + "Package": "Exact", + "Version": "3.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "rootSolve", + "stats", + "utils" + ], + "Hash": "1a43175d291899a4b2965b5d8db260e0" + }, + "MASS": { + "Package": "MASS", + "Version": "7.3-60.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats", + "utils" + ], + "Hash": "2f342c46163b0b54d7b64d1f798e2c78" + }, + "Matrix": { + "Package": "Matrix", + "Version": "1.7-0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "lattice", + "methods", + "stats", + "utils" + ], + "Hash": "1920b2f11133b12350024297d8a4ff4a" + }, "R6": { "Package": "R6", "Version": "2.5.1", @@ -19,6 +94,27 @@ ], "Hash": "470851b6d5d0ac559e9d01bb352b4021" }, + "Rcpp": { + "Package": "Rcpp", + "Version": "1.0.12", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "utils" + ], + "Hash": "5ea2700d21e038ace58269ecdbeb9ec0" + }, + "askpass": { + "Package": "askpass", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "sys" + ], + "Hash": "cad6cf7f1d5f6e906700b9d3e718c796" + }, "base64enc": { "Package": "base64enc", "Version": 
"0.1-3", @@ -53,6 +149,18 @@ ], "Hash": "9fe98599ca456d6552421db0d6772d8f" }, + "boot": { + "Package": "boot", + "Version": "1.3-30", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "stats" + ], + "Hash": "96abeed416a286d4a0f52e550b612343" + }, "box": { "Package": "box", "Version": "1.2.0", @@ -107,6 +215,31 @@ ], "Hash": "cd9a672193789068eb5a2aad65a0dedf" }, + "cellranger": { + "Package": "cellranger", + "Version": "1.1.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "rematch", + "tibble" + ], + "Hash": "f61dbaec772ccd2e17705c1e872e9e7c" + }, + "class": { + "Package": "class", + "Version": "7.3-22", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "R", + "stats", + "utils" + ], + "Hash": "f91f6b29f38b8c280f2b9477787d4bb2" + }, "cli": { "Package": "cli", "Version": "3.6.3", @@ -157,6 +290,27 @@ ], "Hash": "859d96e65ef198fd43e82b9628d593ef" }, + "curl": { + "Package": "curl", + "Version": "5.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "411ca2c03b1ce5f548345d2fc2685f7a" + }, + "data.table": { + "Package": "data.table", + "Version": "1.15.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "8ee9ac56ef633d0c7cab8b2ca87d683e" + }, "desc": { "Package": "desc", "Version": "1.4.3", @@ -201,6 +355,46 @@ ], "Hash": "45a6a596bf0108ee1ff16a040a2df897" }, + "e1071": { + "Package": "e1071", + "Version": "1.7-14", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "class", + "grDevices", + "graphics", + "methods", + "proxy", + "stats", + "utils" + ], + "Hash": "4ef372b716824753719a8a38b258442d" + }, + "epiDisplay": { + "Package": "epiDisplay", + "Version": "3.5.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "R", + "foreign", + "nnet", + "survival" + ], + "Hash": "2aa3e670bdb041ce8308e0dee5c4e672" + }, + "epitools": { + "Package": "epitools", + "Version": "0.5-10.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "b1eef914dc165a4974012e43efb58da0" + }, "evaluate": { "Package": "evaluate", "Version": "0.24.0", @@ -212,6 +406,17 @@ ], "Hash": "a1066cbc05caee9a4bf6d90f194ff4da" }, + "expm": { + "Package": "expm", + "Version": "0.999-9", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "methods" + ], + "Hash": "a9cfdee9645dd6b09ba8d4b9a9befa77" + }, "fansi": { "Package": "fansi", "Version": "1.0.6", @@ -259,6 +464,19 @@ ], "Hash": "1a0a9a3d5083d0d573c4214576f1e690" }, + "foreign": { + "Package": "foreign", + "Version": "0.8-86", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods", + "stats", + "utils" + ], + "Hash": "550170303dbb19d07b2bcc288068e7dc" + }, "fs": { "Package": "fs", "Version": "1.6.4", @@ -270,6 +488,19 @@ ], "Hash": "15aeb8c27f5ea5161f9f6a641fafd93a" }, + "gld": { + "Package": "gld", + "Version": "2.6.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "e1071", + "graphics", + "lmom", + "stats" + ], + "Hash": "71173258033324618dc8a09b3e27269e" + }, "glue": { "Package": "glue", "Version": "1.7.0", @@ -353,6 +584,21 @@ ], "Hash": "81d371a9cc60640e74e4ab6ac46dcedc" }, + "httr": { + "Package": "httr", + "Version": "1.4.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "curl", + "jsonlite", + "mime", + "openssl" + ], + "Hash": 
"ac107251d9d9fd72f0ca8049988f1d7f" + }, "jquerylib": { "Package": "jquerylib", "Version": "0.1.4", @@ -389,6 +635,21 @@ ], "Hash": "acf380f300c721da9fde7df115a5f86f" }, + "lattice": { + "Package": "lattice", + "Version": "0.22-6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "stats", + "utils" + ], + "Hash": "cc5ac1ba4c238c7ca9fa6a87ca11a7e2" + }, "lifecycle": { "Package": "lifecycle", "Version": "1.0.4", @@ -402,6 +663,18 @@ ], "Hash": "b8552d117e1b808b09a832f589b79035" }, + "lmom": { + "Package": "lmom", + "Version": "3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "stats" + ], + "Hash": "a69348cee0766082223f1c7e2a545505" + }, "magrittr": { "Package": "magrittr", "Version": "2.0.3", @@ -446,6 +719,39 @@ ], "Hash": "18e9c28c1d3ca1560ce30658b22ce104" }, + "mvtnorm": { + "Package": "mvtnorm", + "Version": "1.2-5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats" + ], + "Hash": "4d1891e59ac7a12b4e7e8a69349125f1" + }, + "nnet": { + "Package": "nnet", + "Version": "7.3-19", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats", + "utils" + ], + "Hash": "2c797b46eea7fb58ede195bc0b1f1138" + }, + "openssl": { + "Package": "openssl", + "Version": "2.2.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "askpass" + ], + "Hash": "2bcca3848e4734eb3b16103bc9aa4b8e" + }, "pillar": { "Package": "pillar", "Version": "1.9.0", @@ -483,6 +789,16 @@ ], "Hash": "6b01fc98b1e86c4f705ce9dcfd2f57c7" }, + "printr": { + "Package": "printr", + "Version": "0.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "knitr" + ], + "Hash": "03e0d4cc8152eed9515f517a8153c085" + }, "progress": { "Package": "progress", "Version": "1.2.3", @@ -497,6 +813,18 @@ ], "Hash": "f4625e061cb2865f111b47ff163a5ca6" }, + "proxy": { + "Package": "proxy", + "Version": "0.4-27", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "stats", + "utils" + ], + "Hash": "e0ef355c12942cf7a6b91a6cfaea8b3e" + }, "rappdirs": { "Package": "rappdirs", "Version": "0.3.3", @@ -530,6 +858,28 @@ ], "Hash": "9de96463d2117f6ac49980577939dfb3" }, + "readxl": { + "Package": "readxl", + "Version": "1.4.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cellranger", + "cpp11", + "progress", + "tibble", + "utils" + ], + "Hash": "8cf9c239b96df1bbb133b74aef77ad0a" + }, + "rematch": { + "Package": "rematch", + "Version": "2.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "cbff1b666c6fa6d21202f07e2318d4f1" + }, "renv": { "Package": "renv", "Version": "1.0.7", @@ -574,6 +924,19 @@ ], "Hash": "27f9502e1cdbfa195f94e03b0f517484" }, + "rootSolve": { + "Package": "rootSolve", + "Version": "1.8.2.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "stats" + ], + "Hash": "c6fa270a97604238a5ce5fe5d327fdad" + }, "rprojroot": { "Package": "rprojroot", "Version": "2.0.4", @@ -584,6 +947,13 @@ ], "Hash": "4c8415e0ec1e29f3f4f6fc108bef0144" }, + "rstudioapi": { + "Package": "rstudioapi", + "Version": "0.16.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "96710351d642b70e8f02ddeb237c46a7" + }, "sass": { "Package": "sass", "Version": "0.4.9", @@ -628,6 +998,29 @@ ], "Hash": "960e2ae9e09656611e0b8214ad543207" }, + "survival": { + "Package": "survival", + "Version": "3.6-4", + "Source": "Repository", 
+ "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "graphics", + "methods", + "splines", + "stats", + "utils" + ], + "Hash": "e6e3071f471513e4b85f98ca041303c7" + }, + "sys": { + "Package": "sys", + "Version": "3.4.2", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "3a1be13d68d47a8cd0bfd74739ca1555" + }, "tibble": { "Package": "tibble", "Version": "3.2.1", @@ -663,6 +1056,17 @@ ], "Hash": "829f27b9c4919c16b593794a6344d6c0" }, + "tinytable": { + "Package": "tinytable", + "Version": "0.3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "f6e686513a32bf7bc645c3c14d7984ad" + }, "tinytex": { "Package": "tinytex", "Version": "0.51", diff --git a/schedule.qmd b/schedule.qmd index a906d89..d289e7f 100644 --- a/schedule.qmd +++ b/schedule.qmd @@ -18,19 +18,19 @@ All times are in Eastern Daylight Time (EDT; UTC-4) | Time | Section | |:--------------------|:--------| -| 08:30 am - 09:00 am | Module 0 Amy and Zane | -| 09:00 am - 10:00 am | Module 1 Amy | -| 10:00 am - 10:3- am | Coffee break | -| 10:30 am - 11:15 am | Module 2 Amy | -| 11:15 am - 11:30 am | Module 3 Zane | -| 11:30 am - 12:00 pm | Module 4 Zane | +| 08:30 am - 09:00 am | Module 0 (Amy and Zane) | +| 09:00 am - 10:00 am | Module 1 (Amy) | +| 10:00 am - 10:30 am | Coffee break | +| 10:30 am - 11:15 am | Module 2 (Amy) | +| 11:15 am - 11:30 am | Module 3 (Zane) | +| 11:30 am - 12:00 pm | Module 4 (Zane) | | 12:00 pm - 01:30 pm | Lunch (2nd floor lobby) | -| 01:30 pm - 02:15 pm | Module 5 Amy | +| 01:30 pm - 02:15 pm | Module 5 (Amy) | | 02:15 pm - 02:45 pm | Exercise 1| -| 02:45 pm - 03:00 pm | Start Module 6 Amy | +| 02:45 pm - 03:00 pm | Start Module 6 (Amy) | | 03:00 pm - 03:30 pm | Coffee break | -| 03:30 pm - 04:00 pm | Finish Module 6 Amy or Zane | -| 04:00 pm - 05:00 pm | Module 7, exercise 2 in remaining time Zane | +| 03:30 pm - 04:00 pm | Finish Module 6 (Amy or Zane) | +| 04:00 pm - 05:00 pm | Module 7, exercise 2 in remaining time (Zane) | | 05:00 pm - 07:00 pm | **Networking night** and poster session, Randal Rollins P01 | : {.striped .hover tbl-colwidths="[25,75]"} @@ -54,7 +54,7 @@ All times are in Eastern Daylight Time (EDT; UTC-4) | 03:00 pm - 03:30 pm | Coffee break | | 03:30 pm - 04:00 pm | Exercise 5 | | 04:00 pm - 04:30 pm | Review exercise 5 | -| 04:30 pm - 05:00 pm | Markdown module +| 04:30 pm - 05:00 pm | Module 11 | : {.striped .hover tbl-colwidths="[25,75]"} @@ -62,8 +62,8 @@ All times are in Eastern Daylight Time (EDT; UTC-4) | Time | Section | |:--------------------|:--------| -| 08:30 am - 10:00 am | content | +| 08:30 am - 10:00 am | tbd; Modules 12 (Amy) and 13 (Zane) | | 10:00 am - 10:15 am | Coffee break | -| 10:30 am - 12:00 pm | content | +| 10:30 am - 12:00 pm | tbd; Module 14, practice, questions, review | : {.striped .hover tbl-colwidths="[25,75]"}