diff --git a/_freeze/modules/Module05-DataImportExport/execute-results/html.json b/_freeze/modules/Module05-DataImportExport/execute-results/html.json index 657a309..f4f4c54 100644 --- a/_freeze/modules/Module05-DataImportExport/execute-results/html.json +++ b/_freeze/modules/Module05-DataImportExport/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "282655390a5073a7ad7bd9077ca1991f", + "hash": "ed2fc3dc7e59b325b935e6b65caa8728", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 5: Data Import and Export\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n## Learning Objectives\n\nAfter module 5, you should be able to...\n\n- Use Base R functions to load data\n- Install and attach external R Packages to extend R's functionality\n- Load any type of data into R\n- Find loaded data in the Environment pane of RStudio\n- Reading and writing R .Rds and .Rda/.RData files\n\n\n## Import (read) Data\n\n- Importing or 'Reading in' data are the first step of any real project / data analysis\n- R can read almost any file format, especially with external, non-Base R, packages\n- We are going to focus on simple delimited files first. \n - comma separated (e.g. '.csv')\n - tab delimited (e.g. '.txt')\n\nA delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. Within each row, individual fields are separated by column **delimiters** (IBM.com definition)\n\n## Mini exercise\n\n1. Download Module 5 data from the website and save the data to your data subdirectory -- specifically `SISMID_IntroToR_RProject/data`\n\n1. Open the '.csv' and '.txt' data files in a text editor application and familiarize yourself with the data (i.e., Notepad for Windows and TextEdit for Mac)\n\n1. Open the '.xlsx' data file in excel and familiarize yourself with the data\n\t\t-\t\tif you use a Mac **do not** open in Numbers, it can corrupt the file\n\t\t-\t\tif you do not have excel, you can upload it to Google Sheets\n\n1. 
Determine the delimiter of the two '.txt' files\n\n## Mini exercise\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/txt_files.png){width=100%}\n:::\n:::\n\n\n\n\n## Import delimited data\n\nWithin the Base R 'util' package we can find a handful of useful functions including `read.csv()` and `read.delim()` to importing data.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read.csv\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nData Input\n\nDescription:\n\n Reads a file in table format and creates a data frame from it,\n with cases corresponding to lines and variables to fields in the\n file.\n\nUsage:\n\n read.table(file, header = FALSE, sep = \"\", quote = \"\\\"'\",\n dec = \".\", numerals = c(\"allow.loss\", \"warn.loss\", \"no.loss\"),\n row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,\n na.strings = \"NA\", colClasses = NA, nrows = -1,\n skip = 0, check.names = TRUE, fill = !blank.lines.skip,\n strip.white = FALSE, blank.lines.skip = TRUE,\n comment.char = \"#\",\n allowEscapes = FALSE, flush = FALSE,\n stringsAsFactors = FALSE,\n fileEncoding = \"\", encoding = \"unknown\", text, skipNul = FALSE)\n \n read.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.csv2(file, header = TRUE, sep = \";\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim2(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \nArguments:\n\n file: the name of the file which the data are to be read from.\n Each row of the table appears as one line of the file. If it\n does not contain an _absolute_ path, the file name is\n _relative_ to the current working directory, 'getwd()'.\n Tilde-expansion is performed where supported. This can be a\n compressed file (see 'file').\n\n Alternatively, 'file' can be a readable text-mode connection\n (which will be opened for reading if necessary, and if so\n 'close'd (and hence destroyed) at the end of the function\n call). (If 'stdin()' is used, the prompts for lines may be\n somewhat confusing. Terminate input with a blank line or an\n EOF signal, 'Ctrl-D' on Unix and 'Ctrl-Z' on Windows. Any\n pushback on 'stdin()' will be cleared before return.)\n\n 'file' can also be a complete URL. (For the supported URL\n schemes, see the 'URLs' section of the help for 'url'.)\n\n header: a logical value indicating whether the file contains the\n names of the variables as its first line. If missing, the\n value is determined from the file format: 'header' is set to\n 'TRUE' if and only if the first row contains one fewer field\n than the number of columns.\n\n sep: the field separator character. Values on each line of the\n file are separated by this character. If 'sep = \"\"' (the\n default for 'read.table') the separator is 'white space',\n that is one or more spaces, tabs, newlines or carriage\n returns.\n\n quote: the set of quoting characters. To disable quoting altogether,\n use 'quote = \"\"'. See 'scan' for the behaviour on quotes\n embedded in quotes. 
Quoting is only considered for columns\n read as character, which is all of them unless 'colClasses'\n is specified.\n\n dec: the character used in the file for decimal points.\n\nnumerals: string indicating how to convert numbers whose conversion to\n double precision would lose accuracy, see 'type.convert'.\n Can be abbreviated. (Applies also to complex-number inputs.)\n\nrow.names: a vector of row names. This can be a vector giving the\n actual row names, or a single number giving the column of the\n table which contains the row names, or character string\n giving the name of the table column containing the row names.\n\n If there is a header and the first row contains one fewer\n field than the number of columns, the first column in the\n input is used for the row names. Otherwise if 'row.names' is\n missing, the rows are numbered.\n\n Using 'row.names = NULL' forces row numbering. Missing or\n 'NULL' 'row.names' generate row names that are considered to\n be 'automatic' (and not preserved by 'as.matrix').\n\ncol.names: a vector of optional names for the variables. The default\n is to use '\"V\"' followed by the column number.\n\n as.is: controls conversion of character variables (insofar as they\n are not converted to logical, numeric or complex) to factors,\n if not otherwise specified by 'colClasses'. Its value is\n either a vector of logicals (values are recycled if\n necessary), or a vector of numeric or character indices which\n specify which columns should not be converted to factors.\n\n Note: to suppress all conversions including those of numeric\n columns, set 'colClasses = \"character\"'.\n\n Note that 'as.is' is specified per column (not per variable)\n and so includes the column of row names (if any) and any\n columns to be skipped.\n\ntryLogical: a 'logical' determining if columns consisting entirely of\n '\"F\"', '\"T\"', '\"FALSE\"', and '\"TRUE\"' should be converted to\n 'logical'; passed to 'type.convert', true by default.\n\nna.strings: a character vector of strings which are to be interpreted\n as 'NA' values. Blank fields are also considered to be\n missing values in logical, integer, numeric and complex\n fields. Note that the test happens _after_ white space is\n stripped from the input, so 'na.strings' values may need\n their own white space stripped in advance.\n\ncolClasses: character. A vector of classes to be assumed for the\n columns. If unnamed, recycled as necessary. If named, names\n are matched with unspecified values being taken to be 'NA'.\n\n Possible values are 'NA' (the default, when 'type.convert' is\n used), '\"NULL\"' (when the column is skipped), one of the\n atomic vector classes (logical, integer, numeric, complex,\n character, raw), or '\"factor\"', '\"Date\"' or '\"POSIXct\"'.\n Otherwise there needs to be an 'as' method (from package\n 'methods') for conversion from '\"character\"' to the specified\n formal class.\n\n Note that 'colClasses' is specified per column (not per\n variable) and so includes the column of row names (if any).\n\n nrows: integer: the maximum number of rows to read in. Negative and\n other invalid values are ignored.\n\n skip: integer: the number of lines of the data file to skip before\n beginning to read data.\n\ncheck.names: logical. If 'TRUE' then the names of the variables in the\n data frame are checked to ensure that they are syntactically\n valid variable names. If necessary they are adjusted (by\n 'make.names') so that they are, and also to ensure that there\n are no duplicates.\n\n fill: logical. 
If 'TRUE' then in case the rows have unequal length,\n blank fields are implicitly added. See 'Details'.\n\nstrip.white: logical. Used only when 'sep' has been specified, and\n allows the stripping of leading and trailing white space from\n unquoted 'character' fields ('numeric' fields are always\n stripped). See 'scan' for further details (including the\n exact meaning of 'white space'), remembering that the columns\n may include the row names.\n\nblank.lines.skip: logical: if 'TRUE' blank lines in the input are\n ignored.\n\ncomment.char: character: a character vector of length one containing a\n single character or an empty string. Use '\"\"' to turn off\n the interpretation of comments altogether.\n\nallowEscapes: logical. Should C-style escapes such as '\\n' be\n processed or read verbatim (the default)? Note that if not\n within quotes these could be interpreted as a delimiter (but\n not as a comment character). For more details see 'scan'.\n\n flush: logical: if 'TRUE', 'scan' will flush to the end of the line\n after reading the last of the fields requested. This allows\n putting comments after the last field.\n\nstringsAsFactors: logical: should character vectors be converted to\n factors? Note that this is overridden by 'as.is' and\n 'colClasses', both of which allow finer control.\n\nfileEncoding: character string: if non-empty declares the encoding used\n on a file (not a connection) so the character data can be\n re-encoded. See the 'Encoding' section of the help for\n 'file', the 'R Data Import/Export' manual and 'Note'.\n\nencoding: encoding to be assumed for input strings. It is used to mark\n character strings as known to be in Latin-1 or UTF-8 (see\n 'Encoding'): it is not used to re-encode the input, but\n allows R to handle encoded strings in their native encoding\n (if one of those two). See 'Value' and 'Note'.\n\n text: character string: if 'file' is not supplied and this is, then\n data are read from the value of 'text' via a text connection.\n Notice that a literal string can be used to include (small)\n data sets within R code.\n\n skipNul: logical: should nuls be skipped?\n\n ...: Further arguments to be passed to 'read.table'.\n\nDetails:\n\n This function is the principal means of reading tabular data into\n R.\n\n Unless 'colClasses' is specified, all columns are read as\n character columns and then converted using 'type.convert' to\n logical, integer, numeric, complex or (depending on 'as.is')\n factor as appropriate. Quotes are (by default) interpreted in all\n fields, so a column of values like '\"42\"' will result in an\n integer column.\n\n A field or line is 'blank' if it contains nothing (except\n whitespace if no separator is specified) before a comment\n character or the end of the field or line.\n\n If 'row.names' is not specified and the header line has one less\n entry than the number of columns, the first column is taken to be\n the row names. This allows data frames to be read in from the\n format in which they are printed. If 'row.names' is specified and\n does not refer to the first column, that column is discarded from\n such files.\n\n The number of data columns is determined by looking at the first\n five lines of input (or the whole input if it has less than five\n lines), or from the length of 'col.names' if it is specified and\n is longer. 
This could conceivably be wrong if 'fill' or\n 'blank.lines.skip' are true, so specify 'col.names' if necessary\n (as in the 'Examples').\n\n 'read.csv' and 'read.csv2' are identical to 'read.table' except\n for the defaults. They are intended for reading 'comma separated\n value' files ('.csv') or ('read.csv2') the variant used in\n countries that use a comma as decimal point and a semicolon as\n field separator. Similarly, 'read.delim' and 'read.delim2' are\n for reading delimited files, defaulting to the TAB character for\n the delimiter. Notice that 'header = TRUE' and 'fill = TRUE' in\n these variants, and that the comment character is disabled.\n\n The rest of the line after a comment character is skipped; quotes\n are not processed in comments. Complete comment lines are allowed\n provided 'blank.lines.skip = TRUE'; however, comment lines prior\n to the header must have the comment character in the first\n non-blank column.\n\n Quoted fields with embedded newlines are supported except after a\n comment character. Embedded nuls are unsupported: skipping them\n (with 'skipNul = TRUE') may work.\n\nValue:\n\n A data frame ('data.frame') containing a representation of the\n data in the file.\n\n Empty input is an error unless 'col.names' is specified, when a\n 0-row data frame is returned: similarly giving just a header line\n if 'header = TRUE' results in a 0-row data frame. Note that in\n either case the columns will be logical unless 'colClasses' was\n supplied.\n\n Character strings in the result (including factor levels) will\n have a declared encoding if 'encoding' is '\"latin1\"' or '\"UTF-8\"'.\n\nCSV files:\n\n See the help on 'write.csv' for the various conventions for '.csv'\n files. The commonest form of CSV file with row names needs to be\n read with 'read.csv(..., row.names = 1)' to use the names in the\n first column of the file as row names.\n\nMemory usage:\n\n These functions can use a surprising amount of memory when reading\n large files. There is extensive discussion in the 'R Data\n Import/Export' manual, supplementing the notes here.\n\n Less memory will be used if 'colClasses' is specified as one of\n the six atomic vector classes. This can be particularly so when\n reading a column that takes many distinct numeric values, as\n storing each distinct value as a character string can take up to\n 14 times as much memory as storing it as an integer.\n\n Using 'nrows', even as a mild over-estimate, will help memory\n usage.\n\n Using 'comment.char = \"\"' will be appreciably faster than the\n 'read.table' default.\n\n 'read.table' is not the right tool for reading large matrices,\n especially those with many columns: it is designed to read _data\n frames_ which may have columns of very different classes. Use\n 'scan' instead for matrices.\n\nNote:\n\n The columns referred to in 'as.is' and 'colClasses' include the\n column of row names (if any).\n\n There are two approaches for reading input that is not in the\n local encoding. If the input is known to be UTF-8 or Latin1, use\n the 'encoding' argument to declare that. If the input is in some\n other encoding, then it may be translated on input. The\n 'fileEncoding' argument achieves this by setting up a connection\n to do the re-encoding into the current locale. Note that on\n Windows or other systems not running in a UTF-8 locale, this may\n not be possible.\n\nReferences:\n\n Chambers, J. M. (1992) _Data for models._ Chapter 3 of\n _Statistical Models in S_ eds J. M. Chambers and T. J. 
Hastie,\n Wadsworth & Brooks/Cole.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'scan', 'type.convert', 'read.fwf' for reading _f_ixed _w_idth\n _f_ormatted input; 'write.table'; 'data.frame'.\n\n 'count.fields' can be useful to determine problems with reading\n files which result in reports of incorrect record lengths (see the\n 'Examples' below).\n\n for the IANA definition\n of CSV files (which requires comma as separator and CRLF line\n endings).\n\nExamples:\n\n ## using count.fields to handle unknown maximum number of fields\n ## when fill = TRUE\n test1 <- c(1:5, \"6,7\", \"8,9,10\")\n tf <- tempfile()\n writeLines(test1, tf)\n \n read.csv(tf, fill = TRUE) # 1 column\n ncol <- max(count.fields(tf, sep = \",\"))\n read.csv(tf, fill = TRUE, header = FALSE,\n col.names = paste0(\"V\", seq_len(ncol)))\n unlink(tf)\n \n ## \"Inline\" data set, using text=\n ## Notice that leading and trailing empty lines are auto-trimmed\n \n read.table(header = TRUE, text = \"\n a b\n 1 2\n 3 4\n \")\n```\n\n\n:::\n:::\n\n\n\n## Import .csv files\n\nFunction signature reminder\n```\nread.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n\n\nNote #1, I assigned the data frame to an object called `df`. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it as an object. \n\nNote #2, If the data is imported correct, you can expect to see the `df` object ready to be used.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/df_in_env.png){width=100%}\n:::\n:::\n\n\n\n## Import .txt files\n\n`read.csv()` is a special case of `read.delim()` -- a general function to read a delimited file into a data frame \n\nReminder function signature\n```\nread.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n\t\t- `file` is the path to your file, in quotes \n\t\t- `delim` is what separates the fields within a record. The default for csv is comma\n\nWe can import the '.txt' files given that we know that 'serodata1.txt' uses a tab delimiter and 'serodata2.txt' uses a semicolon delimiter.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.delim(file = \"data/serodata.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata.txt\", sep = \";\")\n```\n:::\n\n\n\nThe dataset is now successfully read into your R workspace, **many times actually.** Notice, that each time we imported the data we assigned the data to the `df` object, meaning we replaced it each time we reassigned the `df` object. \n\n\n## What if we have a .xlsx file - what do we do?\n\n1. Ask Google / ChatGPT\n2. Find and vet function and package you want\n3. Install package\n4. Attach package\n5. Use function\n\n\n## 1. Internet Search\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ChatGPT.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/GoogleSearch.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/StackOverflow.png){width=100%}\n:::\n:::\n\n\n\n## 2. Find and vet function and package you want\n\nI am getting consistent message to use the the `read_excel()` function found in the `readxl` package. This package was developed by Hadley Wickham, who we know is reputable. 
Also, you can check that data was read in correctly, b/c this is a straightforward task. \n\n## 3. Install Package\n\nTo use the bundle or \"package\" of code (and or possibly data) from a package, you need to install and also attach the package.\n\nTo install a package you can \n\n1. go to Tools ---\\> Install Packages in the RStudio header\n\nOR\n\n2. use the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"package_name\")\n```\n:::\n\n\n\n\nTherefore,\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"readxl\")\n```\n:::\n\n\n\n## 4. Attach Package\n\nReminder - To attach (i.e., be able to use the package) you can use the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(package_name)\n```\n:::\n\n\n\nTherefore, \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(readxl)\n```\n:::\n\n\n\n## 5. Use Function\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read_excel\n```\n:::\n\nRead xls and xlsx files\n\nDescription:\n\n Read xls and xlsx files\n\n 'read_excel()' calls 'excel_format()' to determine if 'path' is\n xls or xlsx, based on the file extension and the file itself, in\n that order. Use 'read_xls()' and 'read_xlsx()' directly if you\n know better and want to prevent such guessing.\n\nUsage:\n\n read_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xls(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xlsx(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \nArguments:\n\n path: Path to the xls/xlsx file.\n\n sheet: Sheet to read. Either a string (the name of a sheet), or an\n integer (the position of the sheet). Ignored if the sheet is\n specified via 'range'. If neither argument specifies the\n sheet, defaults to the first sheet.\n\n range: A cell range to read from, as described in\n cell-specification. Includes typical Excel ranges like\n \"B3:D87\", possibly including the sheet name like\n \"Budget!B2:G14\", and more. Interpreted strictly, even if the\n range forces the inclusion of leading or trailing empty rows\n or columns. Takes precedence over 'skip', 'n_max' and\n 'sheet'.\n\ncol_names: 'TRUE' to use the first row as column names, 'FALSE' to get\n default names, or a character vector giving a name for each\n column. If user provides 'col_types' as a vector, 'col_names'\n can have one entry per column, i.e. have the same length as\n 'col_types', or one entry per unskipped column.\n\ncol_types: Either 'NULL' to guess all from the spreadsheet or a\n character vector containing one entry per column from these\n options: \"skip\", \"guess\", \"logical\", \"numeric\", \"date\",\n \"text\" or \"list\". If exactly one 'col_type' is specified, it\n will be recycled. The content of a cell in a skipped column\n is never read and that column will not appear in the data\n frame output. 
A list cell loads a column as a list of length\n 1 vectors, which are typed using the type guessing logic from\n 'col_types = NULL', but on a cell-by-cell basis.\n\n na: Character vector of strings to interpret as missing values.\n By default, readxl treats blank cells as missing data.\n\n trim_ws: Should leading and trailing whitespace be trimmed?\n\n skip: Minimum number of rows to skip before reading anything, be it\n column names or data. Leading empty rows are automatically\n skipped, so this is a lower bound. Ignored if 'range' is\n given.\n\n n_max: Maximum number of data rows to read. Trailing empty rows are\n automatically skipped, so this is an upper bound on the\n number of rows in the returned tibble. Ignored if 'range' is\n given.\n\nguess_max: Maximum number of data rows to use for guessing column\n types.\n\nprogress: Display a progress spinner? By default, the spinner appears\n only in an interactive session, outside the context of\n knitting a document, and when the call is likely to run for\n several seconds or more. See 'readxl_progress()' for more\n details.\n\n.name_repair: Handling of column names. Passed along to\n 'tibble::as_tibble()'. readxl's default is `.name_repair =\n \"unique\", which ensures column names are not empty and are\n unique.\n\nValue:\n\n A tibble\n\nSee Also:\n\n cell-specification for more details on targetting cells with the\n 'range' argument\n\nExamples:\n\n datasets <- readxl_example(\"datasets.xlsx\")\n read_excel(datasets)\n \n # Specify sheet either by position or by name\n read_excel(datasets, 2)\n read_excel(datasets, \"mtcars\")\n \n # Skip rows and use default column names\n read_excel(datasets, skip = 148, col_names = FALSE)\n \n # Recycle a single column type\n read_excel(datasets, col_types = \"text\")\n \n # Specify some col_types and guess others\n read_excel(datasets, col_types = c(\"text\", \"guess\", \"numeric\", \"guess\", \"guess\"))\n \n # Accomodate a column with disparate types via col_type = \"list\"\n df <- read_excel(readxl_example(\"clippy.xlsx\"), col_types = c(\"text\", \"list\"))\n df\n df$value\n sapply(df$value, class)\n \n # Limit the number of data rows read\n read_excel(datasets, n_max = 3)\n \n # Read from an Excel range using A1 or R1C1 notation\n read_excel(datasets, range = \"C1:E7\")\n read_excel(datasets, range = \"R1C2:R2C5\")\n \n # Specify the sheet as part of the range\n read_excel(datasets, range = \"mtcars!B1:D5\")\n \n # Read only specific rows or columns\n read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)\n read_excel(datasets, range = cell_cols(\"B:D\"))\n \n # Get a preview of column names\n names(read_excel(readxl_example(\"datasets.xlsx\"), n_max = 0))\n \n # exploit full .name_repair flexibility from tibble\n \n # \"universal\" names are unique and syntactic\n read_excel(\n readxl_example(\"deaths.xlsx\"),\n range = \"arts!A5:F15\",\n .name_repair = \"universal\"\n )\n \n # specify name repair as a built-in function\n read_excel(readxl_example(\"clippy.xlsx\"), .name_repair = toupper)\n \n # specify name repair as a custom function\n my_custom_name_repair <- function(nms) tolower(gsub(\"[.]\", \"_\", nms))\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n .name_repair = my_custom_name_repair\n )\n \n # specify name repair as an anonymous function\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n sheet = \"chickwts\",\n .name_repair = ~ substr(.x, start = 1, stop = 3)\n )\n\n\n\n## 5. 
Use Function\n\nReminder of function signature\n```\nread_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n:::\n\n\n\n\n## What would happen if we made these mistakes (*)\n\n1. What do you think would happen if I had imported the data without assigning it to an object \n\n\n::: {.cell}\n\n```{.r .cell-code}\nread_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n:::\n\n\n\n2. What do you think would happen if I forgot to specify the `sheet` argument?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndd <- read_excel(path = \"data/serodata.xlsx\")\n```\n:::\n\n\n\n\n## Installing and attaching packages - Common confusion\n\n
\n\nYou only need to install a package once (unless you update R or want to update the package), but you will need to attach a package in each new R session in which you want to use it. \n\n
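For example, with the `readxl` package used in this module, the pattern looks like this (a minimal sketch):\n\n```\n# once per machine / R installation (or after updating R)\ninstall.packages(\"readxl\")\n\n# at the start of every R session or script that uses the package\nrequire(readxl)\n```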
\n\nThe exception to this rule are the \"base\" set of packages (i.e., **Base R**) that are installed automatically when you install R and that automatically attached whenever you open R or RStudio.\n\n\n## Common Error\n\nBe prepared to see this error\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nError: could not find function \"some_function_name\"\n```\n:::\n\n\n\nThis usually means that either \n\n- you called the function by the wrong name \n- you have not installed a package that contains the function\n- you have installed a package but you forgot to attach it (i.e., `require(package_name)`) -- **most likely**\n\n\n## Export (write) Data \n\n- Exporting or 'Writing out' data allows you to save modified files for future use or sharing\n- R can write almost any file format, especially with external, non-Base R, packages\n- We are going to focus again on writing delimited files\n\n\n## Export delimited data\n\nWithin the Base R 'util' package we can find a handful of useful functions including `write.csv()` and `write.table()` to exporting data.\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nData Output\n\nDescription:\n\n 'write.table' prints its required argument 'x' (after converting\n it to a data frame if it is not one nor a matrix) to a file or\n connection.\n\nUsage:\n\n write.table(x, file = \"\", append = FALSE, quote = TRUE, sep = \" \",\n eol = \"\\n\", na = \"NA\", dec = \".\", row.names = TRUE,\n col.names = TRUE, qmethod = c(\"escape\", \"double\"),\n fileEncoding = \"\")\n \n write.csv(...)\n write.csv2(...)\n \nArguments:\n\n x: the object to be written, preferably a matrix or data frame.\n If not, it is attempted to coerce 'x' to a data frame.\n\n file: either a character string naming a file or a connection open\n for writing. '\"\"' indicates output to the console.\n\n append: logical. Only relevant if 'file' is a character string. If\n 'TRUE', the output is appended to the file. If 'FALSE', any\n existing file of the name is destroyed.\n\n quote: a logical value ('TRUE' or 'FALSE') or a numeric vector. If\n 'TRUE', any character or factor columns will be surrounded by\n double quotes. If a numeric vector, its elements are taken\n as the indices of columns to quote. In both cases, row and\n column names are quoted if they are written. If 'FALSE',\n nothing is quoted.\n\n sep: the field separator string. Values within each row of 'x'\n are separated by this string.\n\n eol: the character(s) to print at the end of each line (row). For\n example, 'eol = \"\\r\\n\"' will produce Windows' line endings on\n a Unix-alike OS, and 'eol = \"\\r\"' will produce files as\n expected by Excel:mac 2004.\n\n na: the string to use for missing values in the data.\n\n dec: the string to use for decimal points in numeric or complex\n columns: must be a single character.\n\nrow.names: either a logical value indicating whether the row names of\n 'x' are to be written along with 'x', or a character vector\n of row names to be written.\n\ncol.names: either a logical value indicating whether the column names\n of 'x' are to be written along with 'x', or a character\n vector of column names to be written. See the section on\n 'CSV files' for the meaning of 'col.names = NA'.\n\n qmethod: a character string specifying how to deal with embedded\n double quote characters when quoting strings. 
Must be one of\n '\"escape\"' (default for 'write.table'), in which case the\n quote character is escaped in C style by a backslash, or\n '\"double\"' (default for 'write.csv' and 'write.csv2'), in\n which case it is doubled. You can specify just the initial\n letter.\n\nfileEncoding: character string: if non-empty declares the encoding to\n be used on a file (not a connection) so the character data\n can be re-encoded as they are written. See 'file'.\n\n ...: arguments to 'write.table': 'append', 'col.names', 'sep',\n 'dec' and 'qmethod' cannot be altered.\n\nDetails:\n\n If the table has no columns the rownames will be written only if\n 'row.names = TRUE', and _vice versa_.\n\n Real and complex numbers are written to the maximal possible\n precision.\n\n If a data frame has matrix-like columns these will be converted to\n multiple columns in the result (_via_ 'as.matrix') and so a\n character 'col.names' or a numeric 'quote' should refer to the\n columns in the result, not the input. Such matrix-like columns\n are unquoted by default.\n\n Any columns in a data frame which are lists or have a class (e.g.,\n dates) will be converted by the appropriate 'as.character' method:\n such columns are unquoted by default. On the other hand, any\n class information for a matrix is discarded and non-atomic (e.g.,\n list) matrices are coerced to character.\n\n Only columns which have been converted to character will be quoted\n if specified by 'quote'.\n\n The 'dec' argument only applies to columns that are not subject to\n conversion to character because they have a class or are part of a\n matrix-like column (or matrix), in particular to columns protected\n by 'I()'. Use 'options(\"OutDec\")' to control such conversions.\n\n In almost all cases the conversion of numeric quantities is\n governed by the option '\"scipen\"' (see 'options'), but with the\n internal equivalent of 'digits = 15'. For finer control, use\n 'format' to make a character matrix/data frame, and call\n 'write.table' on that.\n\n These functions check for a user interrupt every 1000 lines of\n output.\n\n If 'file' is a non-open connection, an attempt is made to open it\n and then close it after use.\n\n To write a Unix-style file on Windows, use a binary connection\n e.g. 'file = file(\"filename\", \"wb\")'.\n\nCSV files:\n\n By default there is no column name for a column of row names. If\n 'col.names = NA' and 'row.names = TRUE' a blank column name is\n added, which is the convention used for CSV files to be read by\n spreadsheets. Note that such CSV files can be read in R by\n\n read.csv(file = \"\", row.names = 1)\n \n 'write.csv' and 'write.csv2' provide convenience wrappers for\n writing CSV files. They set 'sep' and 'dec' (see below), 'qmethod\n = \"double\"', and 'col.names' to 'NA' if 'row.names = TRUE' (the\n default) and to 'TRUE' otherwise.\n\n 'write.csv' uses '\".\"' for the decimal point and a comma for the\n separator.\n\n 'write.csv2' uses a comma for the decimal point and a semicolon\n for the separator, the Excel convention for CSV files in some\n Western European locales.\n\n These wrappers are deliberately inflexible: they are designed to\n ensure that the correct conventions are used to write a valid\n file. Attempts to change 'append', 'col.names', 'sep', 'dec' or\n 'qmethod' are ignored, with a warning.\n\n CSV files do not record an encoding, and this causes problems if\n they are not ASCII for many other applications. 
Windows Excel\n 2007/10 will open files (e.g., by the file association mechanism)\n correctly if they are ASCII or UTF-16 (use 'fileEncoding =\n \"UTF-16LE\"') or perhaps in the current Windows codepage (e.g.,\n '\"CP1252\"'), but the 'Text Import Wizard' (from the 'Data' tab)\n allows far more choice of encodings. Excel:mac 2004/8 can\n _import_ only 'Macintosh' (which seems to mean Mac Roman),\n 'Windows' (perhaps Latin-1) and 'PC-8' files. OpenOffice 3.x asks\n for the character set when opening the file.\n\n There is an IETF RFC4180\n () for CSV files, which\n mandates comma as the separator and CRLF line endings.\n 'write.csv' writes compliant files on Windows: use 'eol = \"\\r\\n\"'\n on other platforms.\n\nNote:\n\n 'write.table' can be slow for data frames with large numbers\n (hundreds or more) of columns: this is inevitable as each column\n could be of a different class and so must be handled separately.\n If they are all of the same class, consider using a matrix\n instead.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'read.table', 'write'.\n\n 'write.matrix' in package 'MASS'.\n\nExamples:\n\n x <- data.frame(a = I(\"a \\\" quote\"), b = pi)\n tf <- tempfile(fileext = \".csv\")\n \n ## To write a CSV file for input to Excel one might use\n write.table(x, file = tf, sep = \",\", col.names = NA,\n qmethod = \"double\")\n file.show(tf)\n ## and to read this file back into R one needs\n read.table(tf, header = TRUE, sep = \",\", row.names = 1)\n ## NB: you do need to specify a separator if qmethod = \"double\".\n \n ### Alternatively\n write.csv(x, file = tf)\n read.csv(tf, row.names = 1)\n ## or without row names\n write.csv(x, file = tf, row.names = FALSE)\n read.csv(tf)\n \n ## Not run:\n \n ## To write a file in Mac Roman for simple use in Mac Excel 2004/8\n write.csv(x, file = \"foo.csv\", fileEncoding = \"macroman\")\n ## or for Windows Excel 2007/10\n write.csv(x, file = \"foo.csv\", fileEncoding = \"UTF-16LE\")\n ## End(Not run)\n```\n\n\n:::\n:::\n\n\n\n## Export delimited data\n\nLet's practice exporting the data as three files with three different delimiters (comma, tab, semicolon)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrite.csv(df, file=\"data/serodata_new.csv\", row.names = FALSE) #comma delimited\nwrite.table(df, file=\"data/serodata1_new.txt\", sep=\"\\t\", row.names = FALSE) #tab delimited\nwrite.table(df, file=\"data/serodata2_new.txt\", sep=\";\", row.names = FALSE) #semicolon delimited\n```\n:::\n\n\n\nNote, I wrote the data to new file names. Even though we didn't change the data at all in this module, it is good practice to keep raw data raw, and not to write over it.\n\n## R .rds and .rda/RData files\n\nThere are two file extensions worth discussing.\n\nR has two native data formats—'Rdata' (sometimes shortened to 'Rda') and 'Rds'. These formats are used when R objects are saved for later use. 'Rdata' is used to save multiple R objects, while 'Rds' is used to save a single R object. 'Rds' is fast to write/read and is very small.\n\n## .rds binary file\n\nSaving datasets in `.rds` format can save time if you have to read it back in later.\n\n`write_rds()` and `read_rds()` from `readr` package can be used to write/read a single R object to/from file.\n\n```\nrequire(readr)\nwrite_rds(object1, file = \"filename.rds\")\nobject1 <- read_rds(file = \"filename.rds\")\n```\n\n\n## .rda/RData files \n\nThe Base R functions `save()` and `load()` can be used to save and load multiple R objects. 
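\n\nFor instance, a minimal sketch (the file name is hypothetical; `df` is the data frame from earlier in this module):\n\n```\n# save one or more objects, separated by commas\nsave(df, file = \"data/serodata.RData\")\n\n# later: restores the saved object(s), here df, into the environment under their original names\nload(\"data/serodata.RData\")\n```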
\n\n`save()` writes an external representation of R objects to the specified file, and can by loaded back into the environment using `load()`. A nice feature about using `save` and `load` is that the R object(s) is directly imported into the environment and you don't have to specify the name. The files can be saved as `.RData` or `.Rda` files.\n\nFunction signature\n```\nsave(object1, object2, file = \"filename.RData\")\nload(\"filename.RData\")\n```\n\nNote, that you separate the objects you want to save with commas.\n\n\n\n## Summary\n\n- Importing or 'Reading in' data are the first step of any real project / data analysis\n- The Base R 'util' package has useful functions including `read.csv()` and `read.delim()` to importing/reading data or `write.csv()` and `write.table()` for exporting/writing data\n- When importing data (exception is object from .RData), you must assign it to an object, otherwise it cannot be used\n- If data are imported correctly, they can be found in the Environment pane of RStudio\n- You only need to install a package once (unless you update R or the package), but you will need to attach a package each time you want to use it. \n- To complete a task you don't know how to do (e.g., reading in an excel data file) use the following steps: 1. Asl Google / ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Attach package, 5. Use function\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", - "supporting": [], + "markdown": "---\ntitle: \"Module 5: Data Import and Export\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 5, you should be able to...\n\n- Use Base R functions to load data\n- Install and attach external R Packages to extend R's functionality\n- Load any type of data into R\n- Find loaded data in the Environment pane of RStudio\n- Reading and writing R .Rds and .Rda/.RData files\n\n\n## Import (read) Data\n\n- Importing or 'Reading in' data are the first step of any real project / data analysis\n- R can read almost any file format, especially with external, non-Base R, packages\n- We are going to focus on simple delimited files first. \n - comma separated (e.g. '.csv')\n - tab delimited (e.g. '.txt')\n\nA delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. Within each row, individual fields are separated by column **delimiters** (IBM.com definition)\n\n## Mini exercise\n\n1. Download 5 data from the website and save the data to your data subdirectory -- specifically `SISMID_IntroToR_RProject/data`\n\n1. Open the 'serodata.csv' and 'serodata1.txt' and 'serodata2.txt' data files in a text editor application and familiarize yourself with the data (i.e., Notepad for Windows and TextEdit for Mac)\n\n1. Determine the delimiter of the two '.txt' files\n\n1. 
Open the 'serodata.xlsx' data file in excel and familiarize yourself with the data\n\t\t-\t\tif you use a Mac **do not** open in Numbers, it can corrupt the file\n\t\t-\t\tif you do not have excel, you can upload it to Google Sheets\n\n\n## Mini exercise\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/txt_files.png){width=100%}\n:::\n:::\n\n\n\n## Import delimited data\n\nWithin the Base R 'util' package we can find a handful of useful functions including `read.csv()` and `read.delim()` to importing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read.csv\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nData Input\n\nDescription:\n\n Reads a file in table format and creates a data frame from it,\n with cases corresponding to lines and variables to fields in the\n file.\n\nUsage:\n\n read.table(file, header = FALSE, sep = \"\", quote = \"\\\"'\",\n dec = \".\", numerals = c(\"allow.loss\", \"warn.loss\", \"no.loss\"),\n row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,\n na.strings = \"NA\", colClasses = NA, nrows = -1,\n skip = 0, check.names = TRUE, fill = !blank.lines.skip,\n strip.white = FALSE, blank.lines.skip = TRUE,\n comment.char = \"#\",\n allowEscapes = FALSE, flush = FALSE,\n stringsAsFactors = FALSE,\n fileEncoding = \"\", encoding = \"unknown\", text, skipNul = FALSE)\n \n read.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.csv2(file, header = TRUE, sep = \";\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim2(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \nArguments:\n\n file: the name of the file which the data are to be read from.\n Each row of the table appears as one line of the file. If it\n does not contain an _absolute_ path, the file name is\n _relative_ to the current working directory, 'getwd()'.\n Tilde-expansion is performed where supported. This can be a\n compressed file (see 'file').\n\n Alternatively, 'file' can be a readable text-mode connection\n (which will be opened for reading if necessary, and if so\n 'close'd (and hence destroyed) at the end of the function\n call). (If 'stdin()' is used, the prompts for lines may be\n somewhat confusing. Terminate input with a blank line or an\n EOF signal, 'Ctrl-D' on Unix and 'Ctrl-Z' on Windows. Any\n pushback on 'stdin()' will be cleared before return.)\n\n 'file' can also be a complete URL. (For the supported URL\n schemes, see the 'URLs' section of the help for 'url'.)\n\n header: a logical value indicating whether the file contains the\n names of the variables as its first line. If missing, the\n value is determined from the file format: 'header' is set to\n 'TRUE' if and only if the first row contains one fewer field\n than the number of columns.\n\n sep: the field separator character. Values on each line of the\n file are separated by this character. If 'sep = \"\"' (the\n default for 'read.table') the separator is 'white space',\n that is one or more spaces, tabs, newlines or carriage\n returns.\n\n quote: the set of quoting characters. To disable quoting altogether,\n use 'quote = \"\"'. 
See 'scan' for the behaviour on quotes\n embedded in quotes. Quoting is only considered for columns\n read as character, which is all of them unless 'colClasses'\n is specified.\n\n dec: the character used in the file for decimal points.\n\nnumerals: string indicating how to convert numbers whose conversion to\n double precision would lose accuracy, see 'type.convert'.\n Can be abbreviated. (Applies also to complex-number inputs.)\n\nrow.names: a vector of row names. This can be a vector giving the\n actual row names, or a single number giving the column of the\n table which contains the row names, or character string\n giving the name of the table column containing the row names.\n\n If there is a header and the first row contains one fewer\n field than the number of columns, the first column in the\n input is used for the row names. Otherwise if 'row.names' is\n missing, the rows are numbered.\n\n Using 'row.names = NULL' forces row numbering. Missing or\n 'NULL' 'row.names' generate row names that are considered to\n be 'automatic' (and not preserved by 'as.matrix').\n\ncol.names: a vector of optional names for the variables. The default\n is to use '\"V\"' followed by the column number.\n\n as.is: controls conversion of character variables (insofar as they\n are not converted to logical, numeric or complex) to factors,\n if not otherwise specified by 'colClasses'. Its value is\n either a vector of logicals (values are recycled if\n necessary), or a vector of numeric or character indices which\n specify which columns should not be converted to factors.\n\n Note: to suppress all conversions including those of numeric\n columns, set 'colClasses = \"character\"'.\n\n Note that 'as.is' is specified per column (not per variable)\n and so includes the column of row names (if any) and any\n columns to be skipped.\n\ntryLogical: a 'logical' determining if columns consisting entirely of\n '\"F\"', '\"T\"', '\"FALSE\"', and '\"TRUE\"' should be converted to\n 'logical'; passed to 'type.convert', true by default.\n\nna.strings: a character vector of strings which are to be interpreted\n as 'NA' values. Blank fields are also considered to be\n missing values in logical, integer, numeric and complex\n fields. Note that the test happens _after_ white space is\n stripped from the input (if enabled), so 'na.strings' values\n may need their own white space stripped in advance.\n\ncolClasses: character. A vector of classes to be assumed for the\n columns. If unnamed, recycled as necessary. If named, names\n are matched with unspecified values being taken to be 'NA'.\n\n Possible values are 'NA' (the default, when 'type.convert' is\n used), '\"NULL\"' (when the column is skipped), one of the\n atomic vector classes (logical, integer, numeric, complex,\n character, raw), or '\"factor\"', '\"Date\"' or '\"POSIXct\"'.\n Otherwise there needs to be an 'as' method (from package\n 'methods') for conversion from '\"character\"' to the specified\n formal class.\n\n Note that 'colClasses' is specified per column (not per\n variable) and so includes the column of row names (if any).\n\n nrows: integer: the maximum number of rows to read in. Negative and\n other invalid values are ignored.\n\n skip: integer: the number of lines of the data file to skip before\n beginning to read data.\n\ncheck.names: logical. If 'TRUE' then the names of the variables in the\n data frame are checked to ensure that they are syntactically\n valid variable names. 
If necessary they are adjusted (by\n 'make.names') so that they are, and also to ensure that there\n are no duplicates.\n\n fill: logical. If 'TRUE' then in case the rows have unequal length,\n blank fields are implicitly added. See 'Details'.\n\nstrip.white: logical. Used only when 'sep' has been specified, and\n allows the stripping of leading and trailing white space from\n unquoted 'character' fields ('numeric' fields are always\n stripped). See 'scan' for further details (including the\n exact meaning of 'white space'), remembering that the columns\n may include the row names.\n\nblank.lines.skip: logical: if 'TRUE' blank lines in the input are\n ignored.\n\ncomment.char: character: a character vector of length one containing a\n single character or an empty string. Use '\"\"' to turn off\n the interpretation of comments altogether.\n\nallowEscapes: logical. Should C-style escapes such as '\\n' be\n processed or read verbatim (the default)? Note that if not\n within quotes these could be interpreted as a delimiter (but\n not as a comment character). For more details see 'scan'.\n\n flush: logical: if 'TRUE', 'scan' will flush to the end of the line\n after reading the last of the fields requested. This allows\n putting comments after the last field.\n\nstringsAsFactors: logical: should character vectors be converted to\n factors? Note that this is overridden by 'as.is' and\n 'colClasses', both of which allow finer control.\n\nfileEncoding: character string: if non-empty declares the encoding used\n on a file when given as a character string (not on an\n existing connection) so the character data can be re-encoded.\n See the 'Encoding' section of the help for 'file', the 'R\n Data Import/Export' manual and 'Note'.\n\nencoding: encoding to be assumed for input strings. It is used to mark\n character strings as known to be in Latin-1 or UTF-8 (see\n 'Encoding'): it is not used to re-encode the input, but\n allows R to handle encoded strings in their native encoding\n (if one of those two). See 'Value' and 'Note'.\n\n text: character string: if 'file' is not supplied and this is, then\n data are read from the value of 'text' via a text connection.\n Notice that a literal string can be used to include (small)\n data sets within R code.\n\n skipNul: logical: should NULs be skipped?\n\n ...: Further arguments to be passed to 'read.table'.\n\nDetails:\n\n This function is the principal means of reading tabular data into\n R.\n\n Unless 'colClasses' is specified, all columns are read as\n character columns and then converted using 'type.convert' to\n logical, integer, numeric, complex or (depending on 'as.is')\n factor as appropriate. Quotes are (by default) interpreted in all\n fields, so a column of values like '\"42\"' will result in an\n integer column.\n\n A field or line is 'blank' if it contains nothing (except\n whitespace if no separator is specified) before a comment\n character or the end of the field or line.\n\n If 'row.names' is not specified and the header line has one less\n entry than the number of columns, the first column is taken to be\n the row names. This allows data frames to be read in from the\n format in which they are printed. If 'row.names' is specified and\n does not refer to the first column, that column is discarded from\n such files.\n\n The number of data columns is determined by looking at the first\n five lines of input (or the whole input if it has less than five\n lines), or from the length of 'col.names' if it is specified and\n is longer. 
This could conceivably be wrong if 'fill' or\n 'blank.lines.skip' are true, so specify 'col.names' if necessary\n (as in the 'Examples').\n\n 'read.csv' and 'read.csv2' are identical to 'read.table' except\n for the defaults. They are intended for reading 'comma separated\n value' files ('.csv') or ('read.csv2') the variant used in\n countries that use a comma as decimal point and a semicolon as\n field separator. Similarly, 'read.delim' and 'read.delim2' are\n for reading delimited files, defaulting to the TAB character for\n the delimiter. Notice that 'header = TRUE' and 'fill = TRUE' in\n these variants, and that the comment character is disabled.\n\n The rest of the line after a comment character is skipped; quotes\n are not processed in comments. Complete comment lines are allowed\n provided 'blank.lines.skip = TRUE'; however, comment lines prior\n to the header must have the comment character in the first\n non-blank column.\n\n Quoted fields with embedded newlines are supported except after a\n comment character. Embedded NULs are unsupported: skipping them\n (with 'skipNul = TRUE') may work.\n\nValue:\n\n A data frame ('data.frame') containing a representation of the\n data in the file.\n\n Empty input is an error unless 'col.names' is specified, when a\n 0-row data frame is returned: similarly giving just a header line\n if 'header = TRUE' results in a 0-row data frame. Note that in\n either case the columns will be logical unless 'colClasses' was\n supplied.\n\n Character strings in the result (including factor levels) will\n have a declared encoding if 'encoding' is '\"latin1\"' or '\"UTF-8\"'.\n\nCSV files:\n\n See the help on 'write.csv' for the various conventions for '.csv'\n files. The commonest form of CSV file with row names needs to be\n read with 'read.csv(..., row.names = 1)' to use the names in the\n first column of the file as row names.\n\nMemory usage:\n\n These functions can use a surprising amount of memory when reading\n large files. There is extensive discussion in the 'R Data\n Import/Export' manual, supplementing the notes here.\n\n Less memory will be used if 'colClasses' is specified as one of\n the six atomic vector classes. This can be particularly so when\n reading a column that takes many distinct numeric values, as\n storing each distinct value as a character string can take up to\n 14 times as much memory as storing it as an integer.\n\n Using 'nrows', even as a mild over-estimate, will help memory\n usage.\n\n Using 'comment.char = \"\"' will be appreciably faster than the\n 'read.table' default.\n\n 'read.table' is not the right tool for reading large matrices,\n especially those with many columns: it is designed to read _data\n frames_ which may have columns of very different classes. Use\n 'scan' instead for matrices.\n\nNote:\n\n The columns referred to in 'as.is' and 'colClasses' include the\n column of row names (if any).\n\n There are two approaches for reading input that is not in the\n local encoding. If the input is known to be UTF-8 or Latin1, use\n the 'encoding' argument to declare that. If the input is in some\n other encoding, then it may be translated on input. The\n 'fileEncoding' argument achieves this by setting up a connection\n to do the re-encoding into the current locale. Note that on\n Windows or other systems not running in a UTF-8 locale, this may\n not be possible.\n\nReferences:\n\n Chambers, J. M. (1992) _Data for models._ Chapter 3 of\n _Statistical Models in S_ eds J. M. Chambers and T. J. 
Hastie,\n Wadsworth & Brooks/Cole.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'scan', 'type.convert', 'read.fwf' for reading _f_ixed _w_idth\n _f_ormatted input; 'write.table'; 'data.frame'.\n\n 'count.fields' can be useful to determine problems with reading\n files which result in reports of incorrect record lengths (see the\n 'Examples' below).\n\n for the IANA definition\n of CSV files (which requires comma as separator and CRLF line\n endings).\n\nExamples:\n\n ## using count.fields to handle unknown maximum number of fields\n ## when fill = TRUE\n test1 <- c(1:5, \"6,7\", \"8,9,10\")\n tf <- tempfile()\n writeLines(test1, tf)\n \n read.csv(tf, fill = TRUE) # 1 column\n ncol <- max(count.fields(tf, sep = \",\"))\n read.csv(tf, fill = TRUE, header = FALSE,\n col.names = paste0(\"V\", seq_len(ncol)))\n unlink(tf)\n \n ## \"Inline\" data set, using text=\n ## Notice that leading and trailing empty lines are auto-trimmed\n \n read.table(header = TRUE, text = \"\n a b\n 1 2\n 3 4\n \")\n```\n\n\n:::\n:::\n\n\n## Import .csv files\n\nFunction signature reminder\n```\nread.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n\nNote #1, I assigned the data frame to an object called `df`. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it as an object. \n\nNote #2, If the data is imported correct, you can expect to see the `df` object ready to be used.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/df_in_env.png){width=100%}\n:::\n:::\n\n\n## Import .txt files\n\n`read.csv()` is a special case of `read.delim()` -- a general function to read a delimited file into a data frame \n\nReminder function signature\n```\nread.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n\t\t- `file` is the path to your file, in quotes \n\t\t- `delim` is what separates the fields within a record. The default for csv is comma\n\nWe can import the '.txt' files given that we know that 'serodata1.txt' uses a tab delimiter and 'serodata2.txt' uses a semicolon delimiter.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.delim(file = \"data/serodata.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata.txt\", sep = \";\")\n```\n:::\n\n\nThe dataset is now successfully read into your R workspace, **many times actually.** Notice, that each time we imported the data we assigned the data to the `df` object, meaning we replaced it each time we reassigned the `df` object. \n\n\n## What if we have a .xlsx file - what do we do?\n\n1. Ask Google / ChatGPT\n2. Find and vet function and package you want\n3. Install package\n4. Attach package\n5. Use function\n\n\n## 1. Internet Search\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ChatGPT.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/GoogleSearch.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/StackOverflow.png){width=100%}\n:::\n:::\n\n\n## 2. Find and vet function and package you want\n\nI am getting consistent message to use the the `read_excel()` function found in the `readxl` package. This package was developed by Hadley Wickham, who we know is reputable. Also, you can check that data was read in correctly, b/c this is a straightforward task. 
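\n\nA quick sanity check after any import might look like this (a minimal sketch using Base R functions on the `df` object created earlier):\n\n```\ndim(df)   # number of rows and columns\nhead(df)  # first few rows\nstr(df)   # column names and types\n```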
\n\n## 3. Install Package\n\nTo use the bundle or \"package\" of code (and or possibly data) from a package, you need to install and also attach the package.\n\nTo install a package you can \n\n1. go to Tools ---\\> Install Packages in the RStudio header\n\nOR\n\n2. use the following code:\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"package_name\")\n```\n:::\n\n\n\nTherefore,\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"readxl\")\n```\n:::\n\n\n## 4. Attach Package\n\nReminder - To attach (i.e., be able to use the package) you can use the following code:\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(package_name)\n```\n:::\n\n\nTherefore, \n\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(readxl)\n```\n:::\n\n\n## 5. Use Function\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read_excel\n```\n:::\n\nRead xls and xlsx files\n\nDescription:\n\n Read xls and xlsx files\n\n 'read_excel()' calls 'excel_format()' to determine if 'path' is\n xls or xlsx, based on the file extension and the file itself, in\n that order. Use 'read_xls()' and 'read_xlsx()' directly if you\n know better and want to prevent such guessing.\n\nUsage:\n\n read_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xls(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xlsx(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \nArguments:\n\n path: Path to the xls/xlsx file.\n\n sheet: Sheet to read. Either a string (the name of a sheet), or an\n integer (the position of the sheet). Ignored if the sheet is\n specified via 'range'. If neither argument specifies the\n sheet, defaults to the first sheet.\n\n range: A cell range to read from, as described in\n cell-specification. Includes typical Excel ranges like\n \"B3:D87\", possibly including the sheet name like\n \"Budget!B2:G14\", and more. Interpreted strictly, even if the\n range forces the inclusion of leading or trailing empty rows\n or columns. Takes precedence over 'skip', 'n_max' and\n 'sheet'.\n\ncol_names: 'TRUE' to use the first row as column names, 'FALSE' to get\n default names, or a character vector giving a name for each\n column. If user provides 'col_types' as a vector, 'col_names'\n can have one entry per column, i.e. have the same length as\n 'col_types', or one entry per unskipped column.\n\ncol_types: Either 'NULL' to guess all from the spreadsheet or a\n character vector containing one entry per column from these\n options: \"skip\", \"guess\", \"logical\", \"numeric\", \"date\",\n \"text\" or \"list\". If exactly one 'col_type' is specified, it\n will be recycled. The content of a cell in a skipped column\n is never read and that column will not appear in the data\n frame output. 
A list cell loads a column as a list of length\n 1 vectors, which are typed using the type guessing logic from\n 'col_types = NULL', but on a cell-by-cell basis.\n\n na: Character vector of strings to interpret as missing values.\n By default, readxl treats blank cells as missing data.\n\n trim_ws: Should leading and trailing whitespace be trimmed?\n\n skip: Minimum number of rows to skip before reading anything, be it\n column names or data. Leading empty rows are automatically\n skipped, so this is a lower bound. Ignored if 'range' is\n given.\n\n n_max: Maximum number of data rows to read. Trailing empty rows are\n automatically skipped, so this is an upper bound on the\n number of rows in the returned tibble. Ignored if 'range' is\n given.\n\nguess_max: Maximum number of data rows to use for guessing column\n types.\n\nprogress: Display a progress spinner? By default, the spinner appears\n only in an interactive session, outside the context of\n knitting a document, and when the call is likely to run for\n several seconds or more. See 'readxl_progress()' for more\n details.\n\n.name_repair: Handling of column names. Passed along to\n 'tibble::as_tibble()'. readxl's default is `.name_repair =\n \"unique\", which ensures column names are not empty and are\n unique.\n\nValue:\n\n A tibble\n\nSee Also:\n\n cell-specification for more details on targetting cells with the\n 'range' argument\n\nExamples:\n\n datasets <- readxl_example(\"datasets.xlsx\")\n read_excel(datasets)\n \n # Specify sheet either by position or by name\n read_excel(datasets, 2)\n read_excel(datasets, \"mtcars\")\n \n # Skip rows and use default column names\n read_excel(datasets, skip = 148, col_names = FALSE)\n \n # Recycle a single column type\n read_excel(datasets, col_types = \"text\")\n \n # Specify some col_types and guess others\n read_excel(datasets, col_types = c(\"text\", \"guess\", \"numeric\", \"guess\", \"guess\"))\n \n # Accomodate a column with disparate types via col_type = \"list\"\n df <- read_excel(readxl_example(\"clippy.xlsx\"), col_types = c(\"text\", \"list\"))\n df\n df$value\n sapply(df$value, class)\n \n # Limit the number of data rows read\n read_excel(datasets, n_max = 3)\n \n # Read from an Excel range using A1 or R1C1 notation\n read_excel(datasets, range = \"C1:E7\")\n read_excel(datasets, range = \"R1C2:R2C5\")\n \n # Specify the sheet as part of the range\n read_excel(datasets, range = \"mtcars!B1:D5\")\n \n # Read only specific rows or columns\n read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)\n read_excel(datasets, range = cell_cols(\"B:D\"))\n \n # Get a preview of column names\n names(read_excel(readxl_example(\"datasets.xlsx\"), n_max = 0))\n \n # exploit full .name_repair flexibility from tibble\n \n # \"universal\" names are unique and syntactic\n read_excel(\n readxl_example(\"deaths.xlsx\"),\n range = \"arts!A5:F15\",\n .name_repair = \"universal\"\n )\n \n # specify name repair as a built-in function\n read_excel(readxl_example(\"clippy.xlsx\"), .name_repair = toupper)\n \n # specify name repair as a custom function\n my_custom_name_repair <- function(nms) tolower(gsub(\"[.]\", \"_\", nms))\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n .name_repair = my_custom_name_repair\n )\n \n # specify name repair as an anonymous function\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n sheet = \"chickwts\",\n .name_repair = ~ substr(.x, start = 1, stop = 3)\n )\n\n\n## 5. 
Use Function\n\nReminder of function signature\n```\nread_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n)\n```\n\nLet's practice:\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n:::\n\n\n\n## What would happen if we made these mistakes? (*)\n\n1. What do you think would happen if I had imported the data without assigning it to an object? \n\n::: {.cell}\n\n```{.r .cell-code}\nread_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n:::\n\n\n2. What do you think would happen if I forgot to specify the `sheet` argument?\n\n::: {.cell}\n\n```{.r .cell-code}\ndd <- read_excel(path = \"data/serodata.xlsx\")\n```\n:::\n\n\n\n## Installing and attaching packages - Common confusion\n\n
\n\nYou only need to install a package once (unless you update R or want to update the package), but you will need to attach a package each time you want to use it. \n\n
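For example, a minimal sketch of a typical workflow (reusing the `readxl` example from this module):\n\n```\n# once per computer / R installation (or after updating R):\n# install.packages(\"readxl\")\n\n# at the top of every script or session that uses the package:\nrequire(readxl)\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n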
\n\nThe exception to this rule are the \"base\" set of packages (i.e., **Base R**) that are installed automatically when you install R and that automatically attached whenever you open R or RStudio.\n\n\n## Common Error\n\nBe prepared to see this error\n\n\n::: {.cell}\n\n```{.r .cell-code}\nError: could not find function \"some_function_name\"\n```\n:::\n\n\nThis usually means that either \n\n- you called the function by the wrong name \n- you have not installed a package that contains the function\n- you have installed a package but you forgot to attach it (i.e., `require(package_name)`) -- **most likely**\n\n\n## Export (write) Data \n\n- Exporting or 'Writing out' data allows you to save modified files for future use or sharing\n- R can write almost any file format, especially with external, non-Base R, packages\n- We are going to focus again on writing delimited files\n\n\n## Export delimited data\n\nWithin the Base R 'util' package we can find a handful of useful functions including `write.csv()` and `write.table()` to exporting data.\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nData Output\n\nDescription:\n\n 'write.table' prints its required argument 'x' (after converting\n it to a data frame if it is not one nor a matrix) to a file or\n connection.\n\nUsage:\n\n write.table(x, file = \"\", append = FALSE, quote = TRUE, sep = \" \",\n eol = \"\\n\", na = \"NA\", dec = \".\", row.names = TRUE,\n col.names = TRUE, qmethod = c(\"escape\", \"double\"),\n fileEncoding = \"\")\n \n write.csv(...)\n write.csv2(...)\n \nArguments:\n\n x: the object to be written, preferably a matrix or data frame.\n If not, it is attempted to coerce 'x' to a data frame.\n\n file: either a character string naming a file or a connection open\n for writing. '\"\"' indicates output to the console.\n\n append: logical. Only relevant if 'file' is a character string. If\n 'TRUE', the output is appended to the file. If 'FALSE', any\n existing file of the name is destroyed.\n\n quote: a logical value ('TRUE' or 'FALSE') or a numeric vector. If\n 'TRUE', any character or factor columns will be surrounded by\n double quotes. If a numeric vector, its elements are taken\n as the indices of columns to quote. In both cases, row and\n column names are quoted if they are written. If 'FALSE',\n nothing is quoted.\n\n sep: the field separator string. Values within each row of 'x'\n are separated by this string.\n\n eol: the character(s) to print at the end of each line (row). For\n example, 'eol = \"\\r\\n\"' will produce Windows' line endings on\n a Unix-alike OS, and 'eol = \"\\r\"' will produce files as\n expected by Excel:mac 2004.\n\n na: the string to use for missing values in the data.\n\n dec: the string to use for decimal points in numeric or complex\n columns: must be a single character.\n\nrow.names: either a logical value indicating whether the row names of\n 'x' are to be written along with 'x', or a character vector\n of row names to be written.\n\ncol.names: either a logical value indicating whether the column names\n of 'x' are to be written along with 'x', or a character\n vector of column names to be written. See the section on\n 'CSV files' for the meaning of 'col.names = NA'.\n\n qmethod: a character string specifying how to deal with embedded\n double quote characters when quoting strings. 
Must be one of\n '\"escape\"' (default for 'write.table'), in which case the\n quote character is escaped in C style by a backslash, or\n '\"double\"' (default for 'write.csv' and 'write.csv2'), in\n which case it is doubled. You can specify just the initial\n letter.\n\nfileEncoding: character string: if non-empty declares the encoding to\n be used on a file (not a connection) so the character data\n can be re-encoded as they are written. See 'file'.\n\n ...: arguments to 'write.table': 'append', 'col.names', 'sep',\n 'dec' and 'qmethod' cannot be altered.\n\nDetails:\n\n If the table has no columns the rownames will be written only if\n 'row.names = TRUE', and _vice versa_.\n\n Real and complex numbers are written to the maximal possible\n precision.\n\n If a data frame has matrix-like columns these will be converted to\n multiple columns in the result (_via_ 'as.matrix') and so a\n character 'col.names' or a numeric 'quote' should refer to the\n columns in the result, not the input. Such matrix-like columns\n are unquoted by default.\n\n Any columns in a data frame which are lists or have a class (e.g.,\n dates) will be converted by the appropriate 'as.character' method:\n such columns are unquoted by default. On the other hand, any\n class information for a matrix is discarded and non-atomic (e.g.,\n list) matrices are coerced to character.\n\n Only columns which have been converted to character will be quoted\n if specified by 'quote'.\n\n The 'dec' argument only applies to columns that are not subject to\n conversion to character because they have a class or are part of a\n matrix-like column (or matrix), in particular to columns protected\n by 'I()'. Use 'options(\"OutDec\")' to control such conversions.\n\n In almost all cases the conversion of numeric quantities is\n governed by the option '\"scipen\"' (see 'options'), but with the\n internal equivalent of 'digits = 15'. For finer control, use\n 'format' to make a character matrix/data frame, and call\n 'write.table' on that.\n\n These functions check for a user interrupt every 1000 lines of\n output.\n\n If 'file' is a non-open connection, an attempt is made to open it\n and then close it after use.\n\n To write a Unix-style file on Windows, use a binary connection\n e.g. 'file = file(\"filename\", \"wb\")'.\n\nCSV files:\n\n By default there is no column name for a column of row names. If\n 'col.names = NA' and 'row.names = TRUE' a blank column name is\n added, which is the convention used for CSV files to be read by\n spreadsheets. Note that such CSV files can be read in R by\n\n read.csv(file = \"\", row.names = 1)\n \n 'write.csv' and 'write.csv2' provide convenience wrappers for\n writing CSV files. They set 'sep' and 'dec' (see below), 'qmethod\n = \"double\"', and 'col.names' to 'NA' if 'row.names = TRUE' (the\n default) and to 'TRUE' otherwise.\n\n 'write.csv' uses '\".\"' for the decimal point and a comma for the\n separator.\n\n 'write.csv2' uses a comma for the decimal point and a semicolon\n for the separator, the Excel convention for CSV files in some\n Western European locales.\n\n These wrappers are deliberately inflexible: they are designed to\n ensure that the correct conventions are used to write a valid\n file. Attempts to change 'append', 'col.names', 'sep', 'dec' or\n 'qmethod' are ignored, with a warning.\n\n CSV files do not record an encoding, and this causes problems if\n they are not ASCII for many other applications. 
Windows Excel\n 2007/10 will open files (e.g., by the file association mechanism)\n correctly if they are ASCII or UTF-16 (use 'fileEncoding =\n \"UTF-16LE\"') or perhaps in the current Windows codepage (e.g.,\n '\"CP1252\"'), but the 'Text Import Wizard' (from the 'Data' tab)\n allows far more choice of encodings. Excel:mac 2004/8 can\n _import_ only 'Macintosh' (which seems to mean Mac Roman),\n 'Windows' (perhaps Latin-1) and 'PC-8' files. OpenOffice 3.x asks\n for the character set when opening the file.\n\n There is an IETF RFC4180\n () for CSV files, which\n mandates comma as the separator and CRLF line endings.\n 'write.csv' writes compliant files on Windows: use 'eol = \"\\r\\n\"'\n on other platforms.\n\nNote:\n\n 'write.table' can be slow for data frames with large numbers\n (hundreds or more) of columns: this is inevitable as each column\n could be of a different class and so must be handled separately.\n If they are all of the same class, consider using a matrix\n instead.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'read.table', 'write'.\n\n 'write.matrix' in package 'MASS'.\n\nExamples:\n\n x <- data.frame(a = I(\"a \\\" quote\"), b = pi)\n tf <- tempfile(fileext = \".csv\")\n \n ## To write a CSV file for input to Excel one might use\n write.table(x, file = tf, sep = \",\", col.names = NA,\n qmethod = \"double\")\n file.show(tf)\n ## and to read this file back into R one needs\n read.table(tf, header = TRUE, sep = \",\", row.names = 1)\n ## NB: you do need to specify a separator if qmethod = \"double\".\n \n ### Alternatively\n write.csv(x, file = tf)\n read.csv(tf, row.names = 1)\n ## or without row names\n write.csv(x, file = tf, row.names = FALSE)\n read.csv(tf)\n \n ## Not run:\n \n ## To write a file in Mac Roman for simple use in Mac Excel 2004/8\n write.csv(x, file = \"foo.csv\", fileEncoding = \"macroman\")\n ## or for Windows Excel 2007/10\n write.csv(x, file = \"foo.csv\", fileEncoding = \"UTF-16LE\")\n ## End(Not run)\n```\n\n\n:::\n:::\n\n\n## Export delimited data\n\nLet's practice exporting the data as three files with three different delimiters (comma, tab, semicolon)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrite.csv(df, file=\"data/serodata_new.csv\", row.names = FALSE) #comma delimited\nwrite.table(df, file=\"data/serodata1_new.txt\", sep=\"\\t\", row.names = FALSE) #tab delimited\nwrite.table(df, file=\"data/serodata2_new.txt\", sep=\";\", row.names = FALSE) #semicolon delimited\n```\n:::\n\n\nNote, I wrote the data to new file names. Even though we didn't change the data at all in this module, it is good practice to keep raw data raw, and not to write over it.\n\n## R .rds and .rda/RData files\n\nThere are two file extensions worth discussing.\n\nR has two native data formats—'Rdata' (sometimes shortened to 'Rda') and 'Rds'. These formats are used when R objects are saved for later use. 'Rdata' is used to save multiple R objects, while 'Rds' is used to save a single R object. 'Rds' is fast to write/read and is very small.\n\n## .rds binary file\n\nSaving datasets in `.rds` format can save time if you have to read it back in later.\n\n`write_rds()` and `read_rds()` from `readr` package can be used to write/read a single R object to/from file.\n\n```\nrequire(readr)\nwrite_rds(object1, file = \"filename.rds\")\nobject1 <- read_rds(file = \"filename.rds\")\n```\n\n\n## .rda/RData files \n\nThe Base R functions `save()` and `load()` can be used to save and load multiple R objects. 
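\n\nFor instance, a minimal sketch (assuming `df` is the serodata data frame read in earlier; `df_backup` and the file name are made up here purely for illustration):\n\n```\ndf_backup <- df # a second object, only to show saving more than one\nsave(df, df_backup, file = \"data/serodata_objects.RData\") # hypothetical file name\n\n# later, or in a fresh R session:\nload(\"data/serodata_objects.RData\") # df and df_backup reappear under their original names\n```\n\nThe general function signature is shown below.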
\n\n`save()` writes an external representation of R objects to the specified file, and can by loaded back into the environment using `load()`. A nice feature about using `save` and `load` is that the R object(s) is directly imported into the environment and you don't have to specify the name. The files can be saved as `.RData` or `.Rda` files.\n\nFunction signature\n```\nsave(object1, object2, file = \"filename.RData\")\nload(\"filename.RData\")\n```\n\nNote, that you separate the objects you want to save with commas.\n\n\n\n## Summary\n\n- Importing or 'Reading in' data are the first step of any real project / data analysis\n- The Base R 'util' package has useful functions including `read.csv()` and `read.delim()` to importing/reading data or `write.csv()` and `write.table()` for exporting/writing data\n- When importing data (exception is object from .RData), you must assign it to an object, otherwise it cannot be used\n- If data are imported correctly, they can be found in the Environment pane of RStudio\n- You only need to install a package once (unless you update R or the package), but you will need to attach a package each time you want to use it. \n- To complete a task you don't know how to do (e.g., reading in an excel data file) use the following steps: 1. Asl Google / ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Attach package, 5. Use function\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "supporting": [ + "Module05-DataImportExport_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json b/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json index 35921de..12258fc 100644 --- a/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json +++ b/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "389041878ab63fd7fa1f2c8c5e5c78df", + "hash": "a3288c5122c31e58f8ecab5ed04395c2", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 8: Data Merging and Reshaping\"\nformat:\n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n## Learning Objectives\n\nAfter module 8, you should be able to...\n\n- Merge/join data together\n- Reshape data from wide to long\n- Reshape data from long to wide\n\n## Joining types\n\nPay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:\n\n- 1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)\n- 1:m merge (one-to-many merge) – More complex (things often go wrong)\n - The \"one\" suggests that one dataset has the merging variable (e.g., id) each represented once and the \"many” implies that one dataset has the merging variable represented multiple times\n- m:m merge (many-to-many merge) – Danger zone (can be unpredictable)\n \n\n## one-to-one merge\n\n- This means that each row of data represents a unique unit of analysis that exists in another dataset (e.g,. 
id variable)\n- Will likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)\n- The merging variable (e.g., id) each represented a single time\n- You should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong.\n\n## `merge()` function\n\nWe will use the `merge()` function to conduct one-to-one merge\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?merge\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nMerge Two Data Frames\n\nDescription:\n\n Merge two data frames by common columns or row names, or do other\n versions of database _join_ operations.\n\nUsage:\n\n merge(x, y, ...)\n \n ## Default S3 method:\n merge(x, y, ...)\n \n ## S3 method for class 'data.frame'\n merge(x, y, by = intersect(names(x), names(y)),\n by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,\n sort = TRUE, suffixes = c(\".x\",\".y\"), no.dups = TRUE,\n incomparables = NULL, ...)\n \nArguments:\n\n x, y: data frames, or objects to be coerced to one.\n\nby, by.x, by.y: specifications of the columns used for merging. See\n 'Details'.\n\n all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =\n L', where 'L' is either 'TRUE' or 'FALSE'.\n\n all.x: logical; if 'TRUE', then extra rows will be added to the\n output, one for each row in 'x' that has no matching row in\n 'y'. These rows will have 'NA's in those columns that are\n usually filled with values from 'y'. The default is 'FALSE',\n so that only rows with data from both 'x' and 'y' are\n included in the output.\n\n all.y: logical; analogous to 'all.x'.\n\n sort: logical. Should the result be sorted on the 'by' columns?\n\nsuffixes: a character vector of length 2 specifying the suffixes to be\n used for making unique the names of columns in the result\n which are not used for merging (appearing in 'by' etc).\n\n no.dups: logical indicating that 'suffixes' are appended in more cases\n to avoid duplicated column names in the result. This was\n implicitly false before R version 3.5.0.\n\nincomparables: values which cannot be matched. See 'match'. This is\n intended to be used for merging on one column, so these are\n incomparable values of that column.\n\n ...: arguments to be passed to or from methods.\n\nDetails:\n\n 'merge' is a generic function whose principal method is for data\n frames: the default method coerces its arguments to data frames\n and calls the '\"data.frame\"' method.\n\n By default the data frames are merged on the columns with names\n they both have, but separate specifications of the columns can be\n given by 'by.x' and 'by.y'. The rows in the two data frames that\n match on the specified columns are extracted, and joined together.\n If there is more than one match, all possible matches contribute\n one row each. For the precise meaning of 'match', see 'match'.\n\n Columns to merge on can be specified by name, number or by a\n logical vector: the name '\"row.names\"' or the number '0' specifies\n the row names. 
If specified by name it must correspond uniquely\n to a named column in the input.\n\n If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero\n vector or 'NULL'), the result, 'r', is the _Cartesian product_ of\n 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +\n ncol(y))'.\n\n If 'all.x' is true, all the non matching cases of 'x' are appended\n to the result as well, with 'NA' filled in the corresponding\n columns of 'y'; analogously for 'all.y'.\n\n If the columns in the data frames not used in merging have any\n common names, these have 'suffixes' ('\".x\"' and '\".y\"' by default)\n appended to try to make the names of the result unique. If this\n is not possible, an error is thrown.\n\n If a 'by.x' column name matches one of 'y', and if 'no.dups' is\n true (as by default), the y version gets suffixed as well,\n avoiding duplicate column names in the result.\n\n The complexity of the algorithm used is proportional to the length\n of the answer.\n\n In SQL database terminology, the default value of 'all = FALSE'\n gives a _natural join_, a special case of an _inner join_.\n Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =\n TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)\n outer join_. DBMSes do not match 'NULL' records, equivalent to\n 'incomparables = NA' in R.\n\nValue:\n\n A data frame. The rows are by default lexicographically sorted on\n the common columns, but for 'sort = FALSE' are in an unspecified\n order. The columns are the common columns followed by the\n remaining columns in 'x' and then those in 'y'. If the matching\n involved row names, an extra character column called 'Row.names'\n is added at the left, and in all cases the result has 'automatic'\n row names.\n\nNote:\n\n This is intended to work with data frames with vector-like\n columns: some aspects work with data frames containing matrices,\n but not all.\n\n Currently long vectors are not accepted for inputs, which are thus\n restricted to less than 2^31 rows. 
That restriction also applies\n to the result for 32-bit platforms.\n\nSee Also:\n\n 'data.frame', 'by', 'cbind'.\n\n 'dendrogram' for a class which has a 'merge' method.\n\nExamples:\n\n authors <- data.frame(\n ## I(*) : use character columns of names to get sensible sort order\n surname = I(c(\"Tukey\", \"Venables\", \"Tierney\", \"Ripley\", \"McNeil\")),\n nationality = c(\"US\", \"Australia\", \"US\", \"UK\", \"Australia\"),\n deceased = c(\"yes\", rep(\"no\", 4)))\n authorN <- within(authors, { name <- surname; rm(surname) })\n books <- data.frame(\n name = I(c(\"Tukey\", \"Venables\", \"Tierney\",\n \"Ripley\", \"Ripley\", \"McNeil\", \"R Core\")),\n title = c(\"Exploratory Data Analysis\",\n \"Modern Applied Statistics ...\",\n \"LISP-STAT\",\n \"Spatial Statistics\", \"Stochastic Simulation\",\n \"Interactive Data Analysis\",\n \"An Introduction to R\"),\n other.author = c(NA, \"Ripley\", NA, NA, NA, NA,\n \"Venables & Smith\"))\n \n (m0 <- merge(authorN, books))\n (m1 <- merge(authors, books, by.x = \"surname\", by.y = \"name\"))\n m2 <- merge(books, authors, by.x = \"name\", by.y = \"surname\")\n stopifnot(exprs = {\n identical(m0, m2[, names(m0)])\n as.character(m1[, 1]) == as.character(m2[, 1])\n all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])\n identical(dim(merge(m1, m2, by = NULL)),\n c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))\n })\n \n ## \"R core\" is missing from authors and appears only here :\n merge(authors, books, by.x = \"surname\", by.y = \"name\", all = TRUE)\n \n \n ## example of using 'incomparables'\n x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)\n y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)\n merge(x, y, by = c(\"k1\",\"k2\")) # NA's match\n merge(x, y, by = \"k1\") # NA's match, so 6 rows\n merge(x, y, by = \"k2\", incomparables = NA) # 2 rows\n\n\n\n \n## Lets import the new data we want to merge and take a look\n\nThe new data `serodata_new.csv` represents a follow-up serological survey four years later. At this follow-up individuals were retested for IgG antibody concentrations and their ages were collected.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_new <- read.csv(\"data/serodata_new.csv\")\nstr(df_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t636 obs. of 3 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.261 2.981 0.282 136.638 0.381 ...\n $ age : int 6 8 8 8 5 8 8 NA 8 6 ...\n```\n\n\n:::\n\n```{.r .cell-code}\nsummary(df_new)\n```\n\n::: {.cell-output-display}\n\n\n| |observation_id |IgG_concentration | age |\n|:--|:--------------|:-----------------|:-------------|\n| |Min. :5006 |Min. : 0.0051 |Min. : 5.00 |\n| |1st Qu.:6328 |1st Qu.: 0.2751 |1st Qu.: 7.00 |\n| |Median :7494 |Median : 1.5477 |Median :10.00 |\n| |Mean :7490 |Mean : 82.7684 |Mean :10.63 |\n| |3rd Qu.:8736 |3rd Qu.:129.6389 |3rd Qu.:14.00 |\n| |Max. :9982 |Max. :950.6590 |Max. :19.00 |\n| |NA |NA |NA's :9 |\n:::\n:::\n\n\n\n\n## Merge the new data with the original data\n\nLets load the old data as well and look for a variable, or variables, to merge by.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(\"data/serodata.csv\")\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n:::\n\n\n\nWe notice that `observation_id` seems to be the obvious variable by which to merge. 
However, we also realize that `IgG_concentration` and `age` are the exact same names. If we merge now we see that R has forced the `IgG_concentration` and `age` to have a `.x` or `.y` to make sure that these variables are different.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(merge(df, df_new, all.x=T, all.y=T, by=c('observation_id')))\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration.x| age.x|gender |slum | IgG_concentration.y| age.y|\n|--------------:|-------------------:|-----:|:------|:--------|-------------------:|-----:|\n| 5006| 164.2979452| 7|Male |Non slum | 155.5811325| 11|\n| 5024| 0.3000000| 5|Female |Non slum | 0.2918605| 9|\n| 5026| 0.3000000| 10|Female |Non slum | 0.2542945| 14|\n| 5030| 0.0555556| 7|Female |Non slum | 0.0533262| 11|\n| 5035| 26.2112514| 11|Female |Non slum | 22.0159300| 15|\n| 5054| 0.3000000| 3|Male |Non slum | 0.2709671| 7|\n:::\n:::\n\n\n\n## Merge the new data with the original data\n\nWhat do we do?\n\nThe first option is to rename the `IgG_concentration` and `age` variables before the merge, so that it is clear which is time point 1 and time point 2. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration_time1 <- df$IgG_concentration\ndf$age_time1 <- df$age\ndf$IgG_concentration <- df$age <- NULL #remove the original variables\n\ndf_new$IgG_concentration_time2 <- df_new$IgG_concentration\ndf_new$age_time2 <- df_new$age\ndf_new$IgG_concentration <- df_new$age <- NULL #remove the original variables\n```\n:::\n\n\n\nNow, lets merge.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_wide <- merge(df, df_new, all.x=T, all.y=T, by=c('observation_id'))\nstr(df_all_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t651 obs. of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time1 : int 7 5 10 7 11 3 3 12 14 6 ...\n $ IgG_concentration_time2: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time2 : int 11 9 14 11 15 7 7 16 18 10 ...\n```\n\n\n:::\n:::\n\n\n\n## Merge the new data with the original data\n\nThe second option is to add a time variable to the two data sets and then merge by `observation_id`, `time`, `age`, and `IgG_concentration`. 
Note, I need to read in the data again b/c I removed the `IgG_concentration` and `age` variables.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(\"data/serodata.csv\")\ndf_new <- read.csv(\"data/serodata_new.csv\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$time <- 1 #you can put in one number and it will repeat it\ndf_new$time <- 2\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | time|\n|--------------:|-----------------:|---:|:------|:--------|----:|\n| 5772| 0.3176895| 2|Female |Non slum | 1|\n| 8095| 3.4368231| 4|Female |Non slum | 1|\n| 9784| 0.3000000| 4|Male |Non slum | 1|\n| 9338| 143.2363014| 4|Male |Non slum | 1|\n| 6369| 0.4476534| 1|Male |Non slum | 1|\n| 6885| 0.0252708| 4|Male |Non slum | 1|\n:::\n\n```{.r .cell-code}\nhead(df_new)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age| time|\n|--------------:|-----------------:|---:|----:|\n| 5772| 0.2612388| 6| 2|\n| 8095| 2.9809049| 8| 2|\n| 9784| 0.2819489| 8| 2|\n| 9338| 136.6382260| 8| 2|\n| 6369| 0.3810119| 5| 2|\n| 6885| 0.0245951| 8| 2|\n:::\n:::\n\n\n\nNow, lets merge. Note, \"By default the data frames are merged on the columns with names they both have\" therefore if I don't specify the by argument it will merge on all matching variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_long <- merge(df, df_new, all.x=T, all.y=T)\nhead(df_all_long)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age| time|gender |slum |\n|--------------:|-----------------:|---:|----:|:------|:--------|\n| 5006| 155.5811325| 11| 2|NA |NA |\n| 5006| 164.2979452| 7| 1|Male |Non slum |\n| 5024| 0.2918605| 9| 2|NA |NA |\n| 5024| 0.3000000| 5| 1|Female |Non slum |\n| 5026| 0.2542945| 14| 2|NA |NA |\n| 5026| 0.3000000| 10| 1|Female |Non slum |\n:::\n:::\n\n\n\nNote, there are 1287 rows, which is the sum of the number of rows of `df` (651 rows) and `df_new` (636 rows)\n\nNotice that there are some missing values though, because `df_new` doesn't have\nthe `gender` or `slum` variables. If we assume that those are constant and\ndon't change between the two study points, we can fill in the data points\nbefore merging for an easy solution. One easy way to make a new dataframe from\n`df_new` with extra columns is to use the `transform()` function, which lets\nus make multiple column changes to a data frame at one time. We just\nneed to make sure to match the correct `observation_id` values together, using\nthe `match()` function.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_new_filled <- transform(\n df_new,\n gender = df[match(df_new$observation_id, df$observation_id), \"gender\"],\n slum = df[match(df_new$observation_id, df$observation_id), \"slum\"]\n)\n```\n:::\n\n\n\nNow we can redo the merge.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_long <- merge(df, df_new_filled, all.x=T, all.y=T)\nhead(df_all_long)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | time|\n|--------------:|-----------------:|---:|:------|:--------|----:|\n| 5006| 155.5811325| 11|Male |Non slum | 2|\n| 5006| 164.2979452| 7|Male |Non slum | 1|\n| 5024| 0.2918605| 9|Female |Non slum | 2|\n| 5024| 0.3000000| 5|Female |Non slum | 1|\n| 5026| 0.2542945| 14|Female |Non slum | 2|\n| 5026| 0.3000000| 10|Female |Non slum | 1|\n:::\n:::\n\n\n\nLooks good now! 
Another solution would be to edit the data file, or use\na function that can actually fill in missing values for the same individual,\nlike `zoo::na.locf()`.\n\n## What is wide/long data?\n\nAbove, we actually created a wide and long version of the data.\n\nWide: has many columns\n\n- multiple columns per individual, values spread across multiple columns \n- easier for humans to read\n \nLong: has many rows\n\n- column names become data\n- multiple rows per observation, a single column contains the values\n- easier for R to make plots & do analysis\n\n## `reshape()` function \n\nThe `reshape()` function allows you to toggle between wide and long data\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?reshape\n```\n:::\n\nReshape Grouped Data\n\nDescription:\n\n This function reshapes a data frame between 'wide' format (with\n repeated measurements in separate columns of the same row) and\n 'long' format (with the repeated measurements in separate rows).\n\nUsage:\n\n reshape(data, varying = NULL, v.names = NULL, timevar = \"time\",\n idvar = \"id\", ids = 1:NROW(data),\n times = seq_along(varying[[1]]),\n drop = NULL, direction, new.row.names = NULL,\n sep = \".\",\n split = if (sep == \"\") {\n list(regexp = \"[A-Za-z][0-9]\", include = TRUE)\n } else {\n list(regexp = sep, include = FALSE, fixed = TRUE)}\n )\n \n ### Typical usage for converting from long to wide format:\n \n # reshape(data, direction = \"wide\",\n # idvar = \"___\", timevar = \"___\", # mandatory\n # v.names = c(___), # time-varying variables\n # varying = list(___)) # auto-generated if missing\n \n ### Typical usage for converting from wide to long format:\n \n ### If names of wide-format variables are in a 'nice' format\n \n # reshape(data, direction = \"long\",\n # varying = c(___), # vector \n # sep) # to help guess 'v.names' and 'times'\n \n ### To specify long-format variable names explicitly\n \n # reshape(data, direction = \"long\",\n # varying = ___, # list / matrix / vector (use with care)\n # v.names = ___, # vector of variable names in long format\n # timevar, times, # name / values of constructed time variable\n # idvar, ids) # name / values of constructed id variable\n \nArguments:\n\n data: a data frame\n\n varying: names of sets of variables in the wide format that correspond\n to single variables in long format ('time-varying'). This is\n canonically a list of vectors of variable names, but it can\n optionally be a matrix of names, or a single vector of names.\n In each case, when 'direction = \"long\"', the names can be\n replaced by indices which are interpreted as referring to\n 'names(data)'. See 'Details' for more details and options.\n\n v.names: names of variables in the long format that correspond to\n multiple variables in the wide format. See 'Details'.\n\n timevar: the variable in long format that differentiates multiple\n records from the same group or individual. If more than one\n record matches, the first will be taken (with a warning).\n\n idvar: Names of one or more variables in long format that identify\n multiple records from the same group/individual. These\n variables may also be present in wide format.\n\n ids: the values to use for a newly created 'idvar' variable in\n long format.\n\n times: the values to use for a newly created 'timevar' variable in\n long format. 
See 'Details'.\n\n drop: a vector of names of variables to drop before reshaping.\n\ndirection: character string, partially matched to either '\"wide\"' to\n reshape to wide format, or '\"long\"' to reshape to long\n format.\n\nnew.row.names: character or 'NULL': a non-null value will be used for\n the row names of the result.\n\n sep: A character vector of length 1, indicating a separating\n character in the variable names in the wide format. This is\n used for guessing 'v.names' and 'times' arguments based on\n the names in 'varying'. If 'sep == \"\"', the split is just\n before the first numeral that follows an alphabetic\n character. This is also used to create variable names when\n reshaping to wide format.\n\n split: A list with three components, 'regexp', 'include', and\n (optionally) 'fixed'. This allows an extended interface to\n variable name splitting. See 'Details'.\n\nDetails:\n\n Although 'reshape()' can be used in a variety of contexts, the\n motivating application is data from longitudinal studies, and the\n arguments of this function are named and described in those terms.\n A longitudinal study is characterized by repeated measurements of\n the same variable(s), e.g., height and weight, on each unit being\n studied (e.g., individual persons) at different time points (which\n are assumed to be the same for all units). These variables are\n called time-varying variables. The study may include other\n variables that are measured only once for each unit and do not\n vary with time (e.g., gender and race); these are called\n time-constant variables.\n\n A 'wide' format representation of a longitudinal dataset will have\n one record (row) for each unit, typically with some time-constant\n variables that occupy single columns, and some time-varying\n variables that occupy multiple columns (one column for each time\n point). A 'long' format representation of the same dataset will\n have multiple records (rows) for each individual, with the\n time-constant variables being constant across these records and\n the time-varying variables varying across the records. The 'long'\n format dataset will have two additional variables: a 'time'\n variable identifying which time point each record comes from, and\n an 'id' variable showing which records refer to the same unit.\n\n The type of conversion (long to wide or wide to long) is\n determined by the 'direction' argument, which is mandatory unless\n the 'data' argument is the result of a previous call to 'reshape'.\n In that case, the operation can be reversed simply using\n 'reshape(data)' (the other arguments are stored as attributes on\n the data frame).\n\n Conversion from long to wide format with 'direction = \"wide\"' is\n the simpler operation, and is mainly useful in the context of\n multivariate analysis where data is often expected as a\n wide-format matrix. In this case, the time variable 'timevar' and\n id variable 'idvar' must be specified. All other variables are\n assumed to be time-varying, unless the time-varying variables are\n explicitly specified via the 'v.names' argument. A warning is\n issued if time-constant variables are not actually constant.\n\n Each time-varying variable is expanded into multiple variables in\n the wide format. The names of these expanded variables are\n generated automatically, unless they are specified as the\n 'varying' argument in the form of a list (or matrix) with one\n component (or row) for each time-varying variable. 
If 'varying' is\n a vector of names, it is implicitly converted into a matrix, with\n one row for each time-varying variable. Use this option with care\n if there are multiple time-varying variables, as the ordering (by\n column, the default in the 'matrix' constructor) may be\n unintuitive, whereas the explicit list or matrix form is\n unambiguous.\n\n Conversion from wide to long with 'direction = \"long\"' is the more\n common operation as most (univariate) statistical modeling\n functions expect data in the long format. In the simpler case\n where there is only one time-varying variable, the corresponding\n columns in the wide format input can be specified as the 'varying'\n argument, which can be either a vector of column names or the\n corresponding column indices. The name of the corresponding\n variable in the long format output combining these columns can be\n optionally specified as the 'v.names' argument, and the name of\n the time variables as the 'timevar' argument. The values to use as\n the time values corresponding to the different columns in the wide\n format can be specified as the 'times' argument. If 'v.names' is\n unspecified, the function will attempt to guess 'v.names' and\n 'times' from 'varying' (an explicitly specified 'times' argument\n is unused in that case). The default expects variable names like\n 'x.1', 'x.2', where 'sep = \".\"' specifies to split at the dot and\n drop it from the name. To have alphabetic followed by numeric\n times use 'sep = \"\"'.\n\n Multiple time-varying variables can be specified in two ways,\n either with 'varying' as an atomic vector as above, or as a list\n (or a matrix). The first form is useful (and mandatory) if the\n automatic variable name splitting as described above is used; this\n requires the names of all time-varying variables to be suitably\n formatted in the same manner, and 'v.names' to be unspecified. If\n 'varying' is a list (with one component for each time-varying\n variable) or a matrix (one row for each time-varying variable),\n variable name splitting is not attempted, and 'v.names' and\n 'times' will generally need to be specified, although they will\n default to, respectively, the first variable name in each set, and\n sequential times.\n\n Also, guessing is not attempted if 'v.names' is given explicitly,\n even if 'varying' is an atomic vector. In that case, the number of\n time-varying variables is taken to be the length of 'v.names', and\n 'varying' is implicitly converted into a matrix, with one row for\n each time-varying variable. As in the case of long to wide\n conversion, the matrix is filled up by column, so careful\n attention needs to be paid to the order of variable names (or\n indices) in 'varying', which is taken to be like 'x.1', 'y.1',\n 'x.2', 'y.2' (i.e., variables corresponding to the same time point\n need to be grouped together).\n\n The 'split' argument should not usually be necessary. The\n 'split$regexp' component is passed to either 'strsplit' or\n 'regexpr', where the latter is used if 'split$include' is 'TRUE',\n in which case the splitting occurs after the first character of\n the matched string. 
In the 'strsplit' case, the separator is not\n included in the result, and it is possible to specify fixed-string\n matching using 'split$fixed'.\n\nValue:\n\n The reshaped data frame with added attributes to simplify\n reshaping back to the original form.\n\nSee Also:\n\n 'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.\n 'xtabs' and 'as.data.frame.table' for creating contingency tables\n and converting them back to data frames.\n\nExamples:\n\n summary(Indometh) # data in long format\n \n ## long to wide (direction = \"wide\") requires idvar and timevar at a minimum\n reshape(Indometh, direction = \"wide\", idvar = \"Subject\", timevar = \"time\")\n \n ## can also explicitly specify name of combined variable\n wide <- reshape(Indometh, direction = \"wide\", idvar = \"Subject\",\n timevar = \"time\", v.names = \"conc\", sep= \"_\")\n wide\n \n ## reverse transformation\n reshape(wide, direction = \"long\")\n reshape(wide, idvar = \"Subject\", varying = list(2:12),\n v.names = \"conc\", direction = \"long\")\n \n ## times need not be numeric\n df <- data.frame(id = rep(1:4, rep(2,4)),\n visit = I(rep(c(\"Before\",\"After\"), 4)),\n x = rnorm(4), y = runif(4))\n df\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n ## warns that y is really varying\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\", v.names = \"x\")\n \n \n ## unbalanced 'long' data leads to NA fill in 'wide' form\n df2 <- df[1:7, ]\n df2\n reshape(df2, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n \n ## Alternative regular expressions for guessing names\n df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),\n dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))\n reshape(df3, direction = \"long\", varying = 3:5, sep = \"\")\n \n \n ## an example that isn't longitudinal data\n state.x77 <- as.data.frame(state.x77)\n long <- reshape(state.x77, idvar = \"state\", ids = row.names(state.x77),\n times = names(state.x77), timevar = \"Characteristic\",\n varying = list(names(state.x77)), direction = \"long\")\n \n reshape(long, direction = \"wide\")\n \n reshape(long, direction = \"wide\", new.row.names = unique(long$state))\n \n ## multiple id variables\n df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),\n time = rep(c(1,1,2,2), 3), score = rnorm(12))\n wide <- reshape(df3, idvar = c(\"school\", \"class\"), direction = \"wide\")\n wide\n ## transform back\n reshape(wide)\n\n\n\n\n## wide to long data\n\nReminder: \"typical usage for converting from long to wide format\"\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n### If names of wide-format variables are in a 'nice' format\n\nreshape(data, direction = \"long\",\n varying = c(___), # vector \n sep) # to help guess 'v.names' and 'times'\n\n### To specify long-format variable names explicitly\n\nreshape(data, direction = \"long\",\n varying = ___, # list / matrix / vector (use with care)\n v.names = ___, # vector of variable names in long format\n timevar, times, # name / values of constructed time variable\n idvar, ids) # name / values of constructed id variable\n```\n:::\n\n\n\nWe can try to apply that to our data.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_wide_to_long <-\n reshape(\n # First argument is the wide-format data frame to be reshaped\n df_all_wide,\n # We are inputting wide data and expect long format as output\n direction = \"long\",\n # \"varying\" argument is a list of vectors. 
Each vector in the list is a\n # group of time-varying (or grouping-factor-varying) variables which\n # should become one variable after reformat. We want two variables after\n # reformating, so we need two vectors in a list.\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n ),\n # \"v.names\" is a vector of names for the new long-format variables, it\n # should have the same length as the list for varying and the names will\n # be assigned in order.\n v.names = c(\"IgG_concentration\", \"age\"),\n # Name of the variable for the time index that will be created\n timevar = \"time\",\n # Values of the time variable that should be created. Note that if you\n # have any missing observations over time, they NEED to be in the dataset\n # as NAs or your times will get messed up.\n times = 1:2,\n # 'idvar' is a variable that marks which records belong to each\n # observational unit, for us that is the ID marking individuals.\n idvar = \"observation_id\"\n )\n```\n:::\n\n\n\nNotice that this has exactly twice as many rows as our wide data format, and\ndoesn't appear to have any systematic missingness, so it seems correct.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(df_wide_to_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_wide_to_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1302\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_all_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651\n```\n\n\n:::\n:::\n\n\n\n## long to wide data\n\nReminder: \"typical usage for converting from long to wide format\"\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreshape(data, direction = \"wide\",\n idvar = \"___\", timevar = \"___\", # mandatory\n v.names = c(___), # time-varying variables\n varying = list(___)) # auto-generated if missing\n```\n:::\n\n\n\nWe can try to apply that to our data. Note that the arguments are the same\nas in the wide to long case, but we don't need to specify the `times` argument\nbecause they are in the data already. The `varying` argument is optional also,\nand R will auto-generate names for the wide variables if it is left empty.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_long_to_wide <-\n reshape(\n df_all_long,\n direction = \"wide\",\n idvar = \"observation_id\",\n timevar = \"time\",\n v.names = c(\"IgG_concentration\", \"age\"),\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n )\n )\n```\n:::\n\n\n\nWe can do the same checks to make sure we pivoted correctly.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(df_long_to_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t651 obs. 
of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time1 : int 11 9 14 11 15 7 7 16 18 10 ...\n $ IgG_concentration_time2: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time2 : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeWide\")=List of 5\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ timevar: chr \"time\"\n ..$ idvar : chr \"observation_id\"\n ..$ times : num [1:2] 2 1\n ..$ varying: chr [1:2, 1:2] \"IgG_concentration_time1\" \"age_time1\" \"IgG_concentration_time2\" \"age_time2\"\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_long_to_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_all_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1287\n```\n\n\n:::\n:::\n\n\n\nNote that this time we don't have exactly twice as many records because of some\nquirks in how `reshape()` works. When we go from wide to long, R will create\nnew records with NA values at the second time point for the individuals who\nwere not in the second study -- it won't do that when we go from long to\nwide data. This is why it can be important to make sure all of your\nmissing data are **explicit** rather than **implicit**.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# For the original long dataset, we can see that not all individuals have 2\n# time points\nall(table(df_all_long$observation_id) == 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\n# But for the reshaped version they do all have 2 time points\nall(table(df_wide_to_long$observation_id) == 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\n\n## `reshape` metadata\n\nWhenever you use `reshape()` to change the data format, it leaves behind some\nmetadata on our new data frame, as an `attr`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(df_wide_to_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n```\n\n\n:::\n:::\n\n\n\nThis stores information so we can `reshape()` back to the other format and\nwe don't have to specify arguments again.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_back_to_wide <- reshape(df_wide_to_long)\n```\n:::\n\n\n\n## Let's get real\n\nUse the `pivot_wider()` and `pivot_longer()` from the tidyr package!\n\n\n\n## Summary\n\n- the `merge()` function can be used to marge datasets. 
\n- pay close attention to the number of rows in your data set before and after a merge\n- wide data has many columns and has many columns per observation\n- long data has many rows and can have multiple rows per observation\n- the `reshape()` function allows you to toggle between wide and long data. although we highly recommend using `pivot_wider()` and `pivot_longer()` from the tidyr package instead \n\t\t\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", - "supporting": [], + "markdown": "---\ntitle: \"Module 8: Data Merging and Reshaping\"\nformat:\n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 8, you should be able to...\n\n- Merge/join data together\n- Reshape data from wide to long\n- Reshape data from long to wide\n\n## Joining types\n\nPay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:\n\n- 1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)\n- 1:m merge (one-to-many merge) – More complex (things often go wrong)\n - The \"one\" suggests that one dataset has the merging variable (e.g., id) each represented once and the \"many” implies that one dataset has the merging variable represented multiple times\n- m:m merge (many-to-many merge) – Danger zone (can be unpredictable)\n \n\n## one-to-one merge\n\n- This means that each row of data represents a unique unit of analysis that exists in another dataset (e.g,. id variable)\n- Will likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)\n- The merging variable (e.g., id) each represented a single time\n- You should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong.\n\n## `merge()` function\n\nWe will use the `merge()` function to conduct one-to-one merge\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?merge\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nMerge Two Data Frames\n\nDescription:\n\n Merge two data frames by common columns or row names, or do other\n versions of database _join_ operations.\n\nUsage:\n\n merge(x, y, ...)\n \n ## Default S3 method:\n merge(x, y, ...)\n \n ## S3 method for class 'data.frame'\n merge(x, y, by = intersect(names(x), names(y)),\n by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,\n sort = TRUE, suffixes = c(\".x\",\".y\"), no.dups = TRUE,\n incomparables = NULL, ...)\n \nArguments:\n\n x, y: data frames, or objects to be coerced to one.\n\nby, by.x, by.y: specifications of the columns used for merging. See\n 'Details'.\n\n all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =\n L', where 'L' is either 'TRUE' or 'FALSE'.\n\n all.x: logical; if 'TRUE', then extra rows will be added to the\n output, one for each row in 'x' that has no matching row in\n 'y'. These rows will have 'NA's in those columns that are\n usually filled with values from 'y'. The default is 'FALSE',\n so that only rows with data from both 'x' and 'y' are\n included in the output.\n\n all.y: logical; analogous to 'all.x'.\n\n sort: logical. 
Should the result be sorted on the 'by' columns?\n\nsuffixes: a character vector of length 2 specifying the suffixes to be\n used for making unique the names of columns in the result\n which are not used for merging (appearing in 'by' etc).\n\n no.dups: logical indicating that 'suffixes' are appended in more cases\n to avoid duplicated column names in the result. This was\n implicitly false before R version 3.5.0.\n\nincomparables: values which cannot be matched. See 'match'. This is\n intended to be used for merging on one column, so these are\n incomparable values of that column.\n\n ...: arguments to be passed to or from methods.\n\nDetails:\n\n 'merge' is a generic function whose principal method is for data\n frames: the default method coerces its arguments to data frames\n and calls the '\"data.frame\"' method.\n\n By default the data frames are merged on the columns with names\n they both have, but separate specifications of the columns can be\n given by 'by.x' and 'by.y'. The rows in the two data frames that\n match on the specified columns are extracted, and joined together.\n If there is more than one match, all possible matches contribute\n one row each. For the precise meaning of 'match', see 'match'.\n\n Columns to merge on can be specified by name, number or by a\n logical vector: the name '\"row.names\"' or the number '0' specifies\n the row names. If specified by name it must correspond uniquely\n to a named column in the input.\n\n If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero\n vector or 'NULL'), the result, 'r', is the _Cartesian product_ of\n 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +\n ncol(y))'.\n\n If 'all.x' is true, all the non matching cases of 'x' are appended\n to the result as well, with 'NA' filled in the corresponding\n columns of 'y'; analogously for 'all.y'.\n\n If the columns in the data frames not used in merging have any\n common names, these have 'suffixes' ('\".x\"' and '\".y\"' by default)\n appended to try to make the names of the result unique. If this\n is not possible, an error is thrown.\n\n If a 'by.x' column name matches one of 'y', and if 'no.dups' is\n true (as by default), the y version gets suffixed as well,\n avoiding duplicate column names in the result.\n\n The complexity of the algorithm used is proportional to the length\n of the answer.\n\n In SQL database terminology, the default value of 'all = FALSE'\n gives a _natural join_, a special case of an _inner join_.\n Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =\n TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)\n outer join_. DBMSes do not match 'NULL' records, equivalent to\n 'incomparables = NA' in R.\n\nValue:\n\n A data frame. The rows are by default lexicographically sorted on\n the common columns, but for 'sort = FALSE' are in an unspecified\n order. The columns are the common columns followed by the\n remaining columns in 'x' and then those in 'y'. If the matching\n involved row names, an extra character column called 'Row.names'\n is added at the left, and in all cases the result has 'automatic'\n row names.\n\nNote:\n\n This is intended to work with data frames with vector-like\n columns: some aspects work with data frames containing matrices,\n but not all.\n\n Currently long vectors are not accepted for inputs, which are thus\n restricted to less than 2^31 rows. 
That restriction also applies\n to the result for 32-bit platforms.\n\nSee Also:\n\n 'data.frame', 'by', 'cbind'.\n\n 'dendrogram' for a class which has a 'merge' method.\n\nExamples:\n\n authors <- data.frame(\n ## I(*) : use character columns of names to get sensible sort order\n surname = I(c(\"Tukey\", \"Venables\", \"Tierney\", \"Ripley\", \"McNeil\")),\n nationality = c(\"US\", \"Australia\", \"US\", \"UK\", \"Australia\"),\n deceased = c(\"yes\", rep(\"no\", 4)))\n authorN <- within(authors, { name <- surname; rm(surname) })\n books <- data.frame(\n name = I(c(\"Tukey\", \"Venables\", \"Tierney\",\n \"Ripley\", \"Ripley\", \"McNeil\", \"R Core\")),\n title = c(\"Exploratory Data Analysis\",\n \"Modern Applied Statistics ...\",\n \"LISP-STAT\",\n \"Spatial Statistics\", \"Stochastic Simulation\",\n \"Interactive Data Analysis\",\n \"An Introduction to R\"),\n other.author = c(NA, \"Ripley\", NA, NA, NA, NA,\n \"Venables & Smith\"))\n \n (m0 <- merge(authorN, books))\n (m1 <- merge(authors, books, by.x = \"surname\", by.y = \"name\"))\n m2 <- merge(books, authors, by.x = \"name\", by.y = \"surname\")\n stopifnot(exprs = {\n identical(m0, m2[, names(m0)])\n as.character(m1[, 1]) == as.character(m2[, 1])\n all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])\n identical(dim(merge(m1, m2, by = NULL)),\n c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))\n })\n \n ## \"R core\" is missing from authors and appears only here :\n merge(authors, books, by.x = \"surname\", by.y = \"name\", all = TRUE)\n \n \n ## example of using 'incomparables'\n x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)\n y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)\n merge(x, y, by = c(\"k1\",\"k2\")) # NA's match\n merge(x, y, by = \"k1\") # NA's match, so 6 rows\n merge(x, y, by = \"k2\", incomparables = NA) # 2 rows\n\n\n## Join Types\n\n- Full join: includes all unique observations in objects df.x and df.y\n - `merged.df <- merge(df.x, df.y, all.x=T, all.y=T, by=merge_variable)`\n - the argument `all = TRUE` is the same as `all.x = TRUE, all.y = TRUE`\n - the number of rows in `merged.df` is >= max(nrow(df.x), nrow(df.y))\n- Inner join: includes observations that are in both df.x and df.y\n - `merged.df <- merge(df.x, df.y, all.x=F, all.y=F, by=merge_variable)`\n - the number of rows in `merged.df` is <= min(nrow(df.x), nrow(df.y))\n- Left join: joining on the first object (df.x) so it includes observations that are in df.x\n - `merged.df <- merge(df.x, df.y, all.x=T, all.y=F, by=merge_variable)`\n - the number of rows in `merged.df` is nrow(df.x)\n- Right join: joining on the second object (df.y) so it includes observations that are in df.y\n - `merged.df <- merge(df.x, df.y, all.x=F, all.y=T, by=merge_variable)`\n - the number of rows in `merged.df` is nrow(df.y)\n \n## Let's import the new data we want to merge and take a look\n\nThe new data `serodata_new.csv` represents a follow-up serological survey four years later. At this follow-up, individuals were retested for IgG antibody concentrations and their ages were collected.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_new <- read.csv(\"data/serodata_new.csv\")\nstr(df_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t636 obs. 
of 3 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.261 2.981 0.282 136.638 0.381 ...\n $ age : int 6 8 8 8 5 8 8 NA 8 6 ...\n```\n\n\n:::\n\n```{.r .cell-code}\nsummary(df_new)\n```\n\n::: {.cell-output-display}\n\n\n| |observation_id |IgG_concentration | age |\n|:--|:--------------|:-----------------|:-------------|\n| |Min. :5006 |Min. : 0.0051 |Min. : 5.00 |\n| |1st Qu.:6328 |1st Qu.: 0.2751 |1st Qu.: 7.00 |\n| |Median :7494 |Median : 1.5477 |Median :10.00 |\n| |Mean :7490 |Mean : 82.7684 |Mean :10.63 |\n| |3rd Qu.:8736 |3rd Qu.:129.6389 |3rd Qu.:14.00 |\n| |Max. :9982 |Max. :950.6590 |Max. :19.00 |\n| |NA |NA |NA's :9 |\n:::\n:::\n\n\n\n## Merge the new data with the original data\n\nLets load the old data as well and look for a variable, or variables, to merge by.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(\"data/serodata.csv\")\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n:::\n\n\nWe notice that `observation_id` seems to be the obvious variable by which to merge. However, we also realize that `IgG_concentration` and `age` are the exact same names. If we merge now we see that R has forced the `IgG_concentration` and `age` to have a `.x` or `.y` to make sure that these variables are different.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(merge(df, df_new, all.x=T, all.y=T, by=c('observation_id')))\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration.x| age.x|gender |slum | IgG_concentration.y| age.y|\n|--------------:|-------------------:|-----:|:------|:--------|-------------------:|-----:|\n| 5006| 164.2979452| 7|Male |Non slum | 155.5811325| 11|\n| 5024| 0.3000000| 5|Female |Non slum | 0.2918605| 9|\n| 5026| 0.3000000| 10|Female |Non slum | 0.2542945| 14|\n| 5030| 0.0555556| 7|Female |Non slum | 0.0533262| 11|\n| 5035| 26.2112514| 11|Female |Non slum | 22.0159300| 15|\n| 5054| 0.3000000| 3|Male |Non slum | 0.2709671| 7|\n:::\n:::\n\n\n## Merge the new data with the original data\n\nWhat do we do?\n\nThe first option is to rename the `IgG_concentration` and `age` variables before the merge, so that it is clear which is time point 1 and time point 2. \n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration_time1 <- df$IgG_concentration\ndf$age_time1 <- df$age\ndf$IgG_concentration <- df$age <- NULL #remove the original variables\n\ndf_new$IgG_concentration_time2 <- df_new$IgG_concentration\ndf_new$age_time2 <- df_new$age\ndf_new$IgG_concentration <- df_new$age <- NULL #remove the original variables\n```\n:::\n\n\nNow, lets merge.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_wide <- merge(df, df_new, all.x=T, all.y=T, by=c('observation_id'))\nstr(df_all_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t651 obs. 
of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time1 : int 7 5 10 7 11 3 3 12 14 6 ...\n $ IgG_concentration_time2: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time2 : int 11 9 14 11 15 7 7 16 18 10 ...\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(df_all_wide)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id|gender |slum | IgG_concentration_time1| age_time1| IgG_concentration_time2| age_time2|\n|--------------:|:------|:--------|-----------------------:|---------:|-----------------------:|---------:|\n| 5006|Male |Non slum | 164.2979452| 7| 155.5811325| 11|\n| 5024|Female |Non slum | 0.3000000| 5| 0.2918605| 9|\n| 5026|Female |Non slum | 0.3000000| 10| 0.2542945| 14|\n| 5030|Female |Non slum | 0.0555556| 7| 0.0533262| 11|\n| 5035|Female |Non slum | 26.2112514| 11| 22.0159300| 15|\n| 5054|Male |Non slum | 0.3000000| 3| 0.2709671| 7|\n:::\n:::\n\n\n## Merge the new data with the original data\n\nThe second option is to add a time variable to the two data sets and then merge by `observation_id`, `time`, `age`, and `IgG_concentration`. Note, I need to read in the data again b/c I removed the `IgG_concentration` and `age` variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(\"data/serodata.csv\")\ndf_new <- read.csv(\"data/serodata_new.csv\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$time <- 1 #you can put in one number and it will repeat it\ndf_new$time <- 2\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | time|\n|--------------:|-----------------:|---:|:------|:--------|----:|\n| 5772| 0.3176895| 2|Female |Non slum | 1|\n| 8095| 3.4368231| 4|Female |Non slum | 1|\n| 9784| 0.3000000| 4|Male |Non slum | 1|\n| 9338| 143.2363014| 4|Male |Non slum | 1|\n| 6369| 0.4476534| 1|Male |Non slum | 1|\n| 6885| 0.0252708| 4|Male |Non slum | 1|\n:::\n\n```{.r .cell-code}\nhead(df_new)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age| time|\n|--------------:|-----------------:|---:|----:|\n| 5772| 0.2612388| 6| 2|\n| 8095| 2.9809049| 8| 2|\n| 9784| 0.2819489| 8| 2|\n| 9338| 136.6382260| 8| 2|\n| 6369| 0.3810119| 5| 2|\n| 6885| 0.0245951| 8| 2|\n:::\n:::\n\n\nNow, lets merge. Note, \"By default the data frames are merged on the columns with names they both have\" therefore if I don't specify the by argument it will merge on all matching variables.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_long <- merge(df, df_new, all.x=T, all.y=T)\nstr(df_all_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1287 obs. 
of 6 variables:\n $ observation_id : int 5006 5006 5024 5024 5026 5026 5030 5030 5035 5035 ...\n $ IgG_concentration: num 155.581 164.298 0.292 0.3 0.254 ...\n $ age : int 11 7 9 5 14 10 11 7 15 11 ...\n $ time : num 2 1 2 1 2 1 2 1 2 1 ...\n $ gender : chr NA \"Male\" NA \"Female\" ...\n $ slum : chr NA \"Non slum\" NA \"Non slum\" ...\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(df_all_long)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age| time|gender |slum |\n|--------------:|-----------------:|---:|----:|:------|:--------|\n| 5006| 155.5811325| 11| 2|NA |NA |\n| 5006| 164.2979452| 7| 1|Male |Non slum |\n| 5024| 0.2918605| 9| 2|NA |NA |\n| 5024| 0.3000000| 5| 1|Female |Non slum |\n| 5026| 0.2542945| 14| 2|NA |NA |\n| 5026| 0.3000000| 10| 1|Female |Non slum |\n:::\n:::\n\n\nNote, there are 1287 rows, which is the sum of the number of rows of `df` (651 rows) and `df_new` (636 rows)\n\nNotice that there are some missing values though, because `df_new` doesn't have\nthe `gender` or `slum` variables. If we assume that those are constant and\ndon't change between the two study points, we can fill in the data points\nbefore merging for an easy solution. One easy way to make a new dataframe from\n`df_new` with extra columns is to use the `transform()` function, which lets\nus make multiple column changes to a data frame at one time. We just\nneed to make sure to match the correct `observation_id` values together, using\nthe `match()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_new_filled <- transform(\n df_new,\n gender = df[match(df_new$observation_id, df$observation_id), \"gender\"],\n slum = df[match(df_new$observation_id, df$observation_id), \"slum\"]\n)\n```\n:::\n\n\nNow we can redo the merge.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_long <- merge(df, df_new_filled, all.x=T, all.y=T)\nhead(df_all_long)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | time|\n|--------------:|-----------------:|---:|:------|:--------|----:|\n| 5006| 155.5811325| 11|Male |Non slum | 2|\n| 5006| 164.2979452| 7|Male |Non slum | 1|\n| 5024| 0.2918605| 9|Female |Non slum | 2|\n| 5024| 0.3000000| 5|Female |Non slum | 1|\n| 5026| 0.2542945| 14|Female |Non slum | 2|\n| 5026| 0.3000000| 10|Female |Non slum | 1|\n:::\n:::\n\n\nLooks good now! 
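\n\nAs a quick optional sanity check (a suggested snippet, assuming the objects created above are still in the workspace and not part of the executed outputs shown here), we can confirm that the re-merged data still has 1287 rows (651 + 636) and that `gender` and `slum` are no longer missing for the follow-up records.\n\n::: {.cell}\n\n```{.r .cell-code}\n# expect 1287 rows: 651 records from time 1 plus 636 from time 2\nnrow(df_all_long)\n\n# expect zero missing values now that the time-constant variables were filled in\nsum(is.na(df_all_long$gender))\nsum(is.na(df_all_long$slum))\n```\n:::\n\n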
Another solution would be to edit the data file, or use\na function that can actually fill in missing values for the same individual,\nlike `zoo::na.locf()`.\n\n## What is wide/long data?\n\nAbove, we actually created a wide and long version of the data.\n\nWide: has many columns\n\n- multiple columns per individual, values spread across multiple columns \n- easier for humans to read\n \nLong: has many rows\n\n- column names become data\n- multiple rows per observation, a single column contains the values\n- easier for R to make plots & do analysis\n\n## `reshape()` function \n\nThe `reshape()` function allows you to toggle between wide and long data\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?reshape\n```\n:::\n\nReshape Grouped Data\n\nDescription:\n\n This function reshapes a data frame between 'wide' format (with\n repeated measurements in separate columns of the same row) and\n 'long' format (with the repeated measurements in separate rows).\n\nUsage:\n\n reshape(data, varying = NULL, v.names = NULL, timevar = \"time\",\n idvar = \"id\", ids = 1:NROW(data),\n times = seq_along(varying[[1]]),\n drop = NULL, direction, new.row.names = NULL,\n sep = \".\",\n split = if (sep == \"\") {\n list(regexp = \"[A-Za-z][0-9]\", include = TRUE)\n } else {\n list(regexp = sep, include = FALSE, fixed = TRUE)}\n )\n \n ### Typical usage for converting from long to wide format:\n \n # reshape(data, direction = \"wide\",\n # idvar = \"___\", timevar = \"___\", # mandatory\n # v.names = c(___), # time-varying variables\n # varying = list(___)) # auto-generated if missing\n \n ### Typical usage for converting from wide to long format:\n \n ### If names of wide-format variables are in a 'nice' format\n \n # reshape(data, direction = \"long\",\n # varying = c(___), # vector \n # sep) # to help guess 'v.names' and 'times'\n \n ### To specify long-format variable names explicitly\n \n # reshape(data, direction = \"long\",\n # varying = ___, # list / matrix / vector (use with care)\n # v.names = ___, # vector of variable names in long format\n # timevar, times, # name / values of constructed time variable\n # idvar, ids) # name / values of constructed id variable\n \nArguments:\n\n data: a data frame\n\n varying: names of sets of variables in the wide format that correspond\n to single variables in long format ('time-varying'). This is\n canonically a list of vectors of variable names, but it can\n optionally be a matrix of names, or a single vector of names.\n In each case, when 'direction = \"long\"', the names can be\n replaced by indices which are interpreted as referring to\n 'names(data)'. See 'Details' for more details and options.\n\n v.names: names of variables in the long format that correspond to\n multiple variables in the wide format. See 'Details'.\n\n timevar: the variable in long format that differentiates multiple\n records from the same group or individual. If more than one\n record matches, the first will be taken (with a warning).\n\n idvar: Names of one or more variables in long format that identify\n multiple records from the same group/individual. These\n variables may also be present in wide format.\n\n ids: the values to use for a newly created 'idvar' variable in\n long format.\n\n times: the values to use for a newly created 'timevar' variable in\n long format. 
See 'Details'.\n\n drop: a vector of names of variables to drop before reshaping.\n\ndirection: character string, partially matched to either '\"wide\"' to\n reshape to wide format, or '\"long\"' to reshape to long\n format.\n\nnew.row.names: character or 'NULL': a non-null value will be used for\n the row names of the result.\n\n sep: A character vector of length 1, indicating a separating\n character in the variable names in the wide format. This is\n used for guessing 'v.names' and 'times' arguments based on\n the names in 'varying'. If 'sep == \"\"', the split is just\n before the first numeral that follows an alphabetic\n character. This is also used to create variable names when\n reshaping to wide format.\n\n split: A list with three components, 'regexp', 'include', and\n (optionally) 'fixed'. This allows an extended interface to\n variable name splitting. See 'Details'.\n\nDetails:\n\n Although 'reshape()' can be used in a variety of contexts, the\n motivating application is data from longitudinal studies, and the\n arguments of this function are named and described in those terms.\n A longitudinal study is characterized by repeated measurements of\n the same variable(s), e.g., height and weight, on each unit being\n studied (e.g., individual persons) at different time points (which\n are assumed to be the same for all units). These variables are\n called time-varying variables. The study may include other\n variables that are measured only once for each unit and do not\n vary with time (e.g., gender and race); these are called\n time-constant variables.\n\n A 'wide' format representation of a longitudinal dataset will have\n one record (row) for each unit, typically with some time-constant\n variables that occupy single columns, and some time-varying\n variables that occupy multiple columns (one column for each time\n point). A 'long' format representation of the same dataset will\n have multiple records (rows) for each individual, with the\n time-constant variables being constant across these records and\n the time-varying variables varying across the records. The 'long'\n format dataset will have two additional variables: a 'time'\n variable identifying which time point each record comes from, and\n an 'id' variable showing which records refer to the same unit.\n\n The type of conversion (long to wide or wide to long) is\n determined by the 'direction' argument, which is mandatory unless\n the 'data' argument is the result of a previous call to 'reshape'.\n In that case, the operation can be reversed simply using\n 'reshape(data)' (the other arguments are stored as attributes on\n the data frame).\n\n Conversion from long to wide format with 'direction = \"wide\"' is\n the simpler operation, and is mainly useful in the context of\n multivariate analysis where data is often expected as a\n wide-format matrix. In this case, the time variable 'timevar' and\n id variable 'idvar' must be specified. All other variables are\n assumed to be time-varying, unless the time-varying variables are\n explicitly specified via the 'v.names' argument. A warning is\n issued if time-constant variables are not actually constant.\n\n Each time-varying variable is expanded into multiple variables in\n the wide format. The names of these expanded variables are\n generated automatically, unless they are specified as the\n 'varying' argument in the form of a list (or matrix) with one\n component (or row) for each time-varying variable. 
If 'varying' is\n a vector of names, it is implicitly converted into a matrix, with\n one row for each time-varying variable. Use this option with care\n if there are multiple time-varying variables, as the ordering (by\n column, the default in the 'matrix' constructor) may be\n unintuitive, whereas the explicit list or matrix form is\n unambiguous.\n\n Conversion from wide to long with 'direction = \"long\"' is the more\n common operation as most (univariate) statistical modeling\n functions expect data in the long format. In the simpler case\n where there is only one time-varying variable, the corresponding\n columns in the wide format input can be specified as the 'varying'\n argument, which can be either a vector of column names or the\n corresponding column indices. The name of the corresponding\n variable in the long format output combining these columns can be\n optionally specified as the 'v.names' argument, and the name of\n the time variables as the 'timevar' argument. The values to use as\n the time values corresponding to the different columns in the wide\n format can be specified as the 'times' argument. If 'v.names' is\n unspecified, the function will attempt to guess 'v.names' and\n 'times' from 'varying' (an explicitly specified 'times' argument\n is unused in that case). The default expects variable names like\n 'x.1', 'x.2', where 'sep = \".\"' specifies to split at the dot and\n drop it from the name. To have alphabetic followed by numeric\n times use 'sep = \"\"'.\n\n Multiple time-varying variables can be specified in two ways,\n either with 'varying' as an atomic vector as above, or as a list\n (or a matrix). The first form is useful (and mandatory) if the\n automatic variable name splitting as described above is used; this\n requires the names of all time-varying variables to be suitably\n formatted in the same manner, and 'v.names' to be unspecified. If\n 'varying' is a list (with one component for each time-varying\n variable) or a matrix (one row for each time-varying variable),\n variable name splitting is not attempted, and 'v.names' and\n 'times' will generally need to be specified, although they will\n default to, respectively, the first variable name in each set, and\n sequential times.\n\n Also, guessing is not attempted if 'v.names' is given explicitly,\n even if 'varying' is an atomic vector. In that case, the number of\n time-varying variables is taken to be the length of 'v.names', and\n 'varying' is implicitly converted into a matrix, with one row for\n each time-varying variable. As in the case of long to wide\n conversion, the matrix is filled up by column, so careful\n attention needs to be paid to the order of variable names (or\n indices) in 'varying', which is taken to be like 'x.1', 'y.1',\n 'x.2', 'y.2' (i.e., variables corresponding to the same time point\n need to be grouped together).\n\n The 'split' argument should not usually be necessary. The\n 'split$regexp' component is passed to either 'strsplit' or\n 'regexpr', where the latter is used if 'split$include' is 'TRUE',\n in which case the splitting occurs after the first character of\n the matched string. 
In the 'strsplit' case, the separator is not\n included in the result, and it is possible to specify fixed-string\n matching using 'split$fixed'.\n\nValue:\n\n The reshaped data frame with added attributes to simplify\n reshaping back to the original form.\n\nSee Also:\n\n 'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.\n 'xtabs' and 'as.data.frame.table' for creating contingency tables\n and converting them back to data frames.\n\nExamples:\n\n summary(Indometh) # data in long format\n \n ## long to wide (direction = \"wide\") requires idvar and timevar at a minimum\n reshape(Indometh, direction = \"wide\", idvar = \"Subject\", timevar = \"time\")\n \n ## can also explicitly specify name of combined variable\n wide <- reshape(Indometh, direction = \"wide\", idvar = \"Subject\",\n timevar = \"time\", v.names = \"conc\", sep= \"_\")\n wide\n \n ## reverse transformation\n reshape(wide, direction = \"long\")\n reshape(wide, idvar = \"Subject\", varying = list(2:12),\n v.names = \"conc\", direction = \"long\")\n \n ## times need not be numeric\n df <- data.frame(id = rep(1:4, rep(2,4)),\n visit = I(rep(c(\"Before\",\"After\"), 4)),\n x = rnorm(4), y = runif(4))\n df\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n ## warns that y is really varying\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\", v.names = \"x\")\n \n \n ## unbalanced 'long' data leads to NA fill in 'wide' form\n df2 <- df[1:7, ]\n df2\n reshape(df2, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n \n ## Alternative regular expressions for guessing names\n df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),\n dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))\n reshape(df3, direction = \"long\", varying = 3:5, sep = \"\")\n \n \n ## an example that isn't longitudinal data\n state.x77 <- as.data.frame(state.x77)\n long <- reshape(state.x77, idvar = \"state\", ids = row.names(state.x77),\n times = names(state.x77), timevar = \"Characteristic\",\n varying = list(names(state.x77)), direction = \"long\")\n \n reshape(long, direction = \"wide\")\n \n reshape(long, direction = \"wide\", new.row.names = unique(long$state))\n \n ## multiple id variables\n df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),\n time = rep(c(1,1,2,2), 3), score = rnorm(12))\n wide <- reshape(df3, idvar = c(\"school\", \"class\"), direction = \"wide\")\n wide\n ## transform back\n reshape(wide)\n\n\n\n## wide to long data\n\nReminder: \"typical usage for converting from wide to long format\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\n### If names of wide-format variables are in a 'nice' format\n\nreshape(data, direction = \"long\",\n varying = c(___), # vector \n sep) # to help guess 'v.names' and 'times'\n\n### To specify long-format variable names explicitly\n\nreshape(data, direction = \"long\",\n varying = ___, # list / matrix / vector (use with care)\n v.names = ___, # vector of variable names in long format\n timevar, times, # name / values of constructed time variable\n idvar, ids) # name / values of constructed id variable\n```\n:::\n\n\nWe can try to apply that to our data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_wide_to_long <-\n reshape(\n # First argument is the wide-format data frame to be reshaped\n df_all_wide,\n # We are inputting wide data and expect long format as output\n direction = \"long\",\n # \"varying\" argument is a list of vectors. 
Each vector in the list is a\n # group of time-varying (or grouping-factor-varying) variables which\n # should become one variable after reformat. We want two variables after\n # reformating, so we need two vectors in a list.\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n ),\n # \"v.names\" is a vector of names for the new long-format variables, it\n # should have the same length as the list for varying and the names will\n # be assigned in order.\n v.names = c(\"IgG_concentration\", \"age\"),\n # Name of the variable for the time index that will be created\n timevar = \"time\",\n # Values of the time variable that should be created. Note that if you\n # have any missing observations over time, they NEED to be in the dataset\n # as NAs or your times will get messed up.\n times = 1:2,\n # 'idvar' is a variable that marks which records belong to each\n # observational unit, for us that is the ID marking individuals.\n idvar = \"observation_id\"\n )\n```\n:::\n\n\nNotice that this has exactly twice as many rows as our wide data format, and\ndoesn't appear to have any systematic missingness, so it seems correct.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(df_wide_to_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_wide_to_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1302\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_all_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651\n```\n\n\n:::\n:::\n\n\n## long to wide data\n\nReminder: \"typical usage for converting from long to wide format\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreshape(data, direction = \"wide\",\n idvar = \"___\", timevar = \"___\", # mandatory\n v.names = c(___), # time-varying variables\n varying = list(___)) # auto-generated if missing\n```\n:::\n\n\nWe can try to apply that to our data. Note that the arguments are the same\nas in the wide to long case, but we don't need to specify the `times` argument\nbecause they are in the data already. The `varying` argument is optional also,\nand R will auto-generate names for the wide variables if it is left empty.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_long_to_wide <-\n reshape(\n df_all_long,\n direction = \"wide\",\n idvar = \"observation_id\",\n timevar = \"time\",\n v.names = c(\"IgG_concentration\", \"age\"),\n varying = list(\n c(\"IgG_concentration_time1\", \"IgG_concentration_time2\"),\n c(\"age_time1\", \"age_time2\")\n )\n )\n```\n:::\n\n\nWe can do the same checks to make sure we pivoted correctly.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(df_long_to_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t651 obs. 
of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time1 : int 11 9 14 11 15 7 7 16 18 10 ...\n $ IgG_concentration_time2: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time2 : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeWide\")=List of 5\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ timevar: chr \"time\"\n ..$ idvar : chr \"observation_id\"\n ..$ times : num [1:2] 2 1\n ..$ varying: chr [1:2, 1:2] \"IgG_concentration_time1\" \"age_time1\" \"IgG_concentration_time2\" \"age_time2\"\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_long_to_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_all_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1287\n```\n\n\n:::\n:::\n\n\nNote that this time we don't have exactly twice as many records because of some\nquirks in how `reshape()` works. When we go from wide to long, R will create\nnew records with NA values at the second time point for the individuals who\nwere not in the second study -- it won't do that when we go from long to\nwide data. This is why it can be important to make sure all of your\nmissing data are **explicit** rather than **implicit**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# For the original long dataset, we can see that not all individuals have 2\n# time points\nall(table(df_all_long$observation_id) == 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\n# But for the reshaped version they do all have 2 time points\nall(table(df_wide_to_long$observation_id) == 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\n## `reshape` metadata\n\nWhenever you use `reshape()` to change the data format, it leaves behind some\nmetadata on our new data frame, as an `attr`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstr(df_wide_to_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t1302 obs. of 6 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ time : int 1 1 1 1 1 1 1 1 1 1 ...\n $ IgG_concentration: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age : int 7 5 10 7 11 3 3 12 14 6 ...\n - attr(*, \"reshapeLong\")=List of 4\n ..$ varying:List of 2\n .. ..$ : chr [1:2] \"IgG_concentration_time1\" \"IgG_concentration_time2\"\n .. ..$ : chr [1:2] \"age_time1\" \"age_time2\"\n ..$ v.names: chr [1:2] \"IgG_concentration\" \"age\"\n ..$ idvar : chr \"observation_id\"\n ..$ timevar: chr \"time\"\n```\n\n\n:::\n:::\n\n\nThis stores information so we can `reshape()` back to the other format and\nwe don't have to specify arguments again.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_back_to_wide <- reshape(df_wide_to_long)\n```\n:::\n\n\n## Let's get real\n\nWe recommend checking out the `pivot_wider()` and `pivot_longer()` from the tidyr package!\n\n\n\n## Summary\n\n- the `merge()` function can be used to merge datasets. 
\n- pay close attention to the number of rows in your data set before and after a merge\n- wide data has many columns per observation\n- long data has many rows per observation\n- the `reshape()`function allows you to toggle between wide and long data. although we highly recommend playing around with the `pivot_wider()` and `pivot_longer()` from the tidyr package instead \n\t\t\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "supporting": [ + "Module08-DataMergeReshape_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/modules/Module10-DataVisualization/execute-results/html.json b/_freeze/modules/Module10-DataVisualization/execute-results/html.json index a2a9b00..94ebfba 100644 --- a/_freeze/modules/Module10-DataVisualization/execute-results/html.json +++ b/_freeze/modules/Module10-DataVisualization/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "d9183bfceea5026fb81db2ef5b4efdfa", + "hash": "0ded86997ba6cc19572f0805d4e82715", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 10: Data Visualization\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n\n## Learning Objectives\n\nAfter module 10, you should be able to:\n\n- Create Base R plots\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n\n## Prep data\n\nCreate `age_group` three level factor variable\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\n\n\nCreate `seropos` binary variable representing seropositivity if antibody concentrations are >10 IU/mL.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)\n```\n:::\n\n\n\n\n## Base R data visualizattion functions\n\nThe Base R 'graphics' package has a ton of graphics options. 
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhelp(package = \"graphics\")\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\t\tInformation on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.4.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team \nContact: R-help mailing list \nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.4.1\nNeedsCompilation: yes\nEnhances: vcd\nBuilt: R 4.4.1; x86_64-w64-mingw32; 2024-06-14 08:20:40\n UTC; windows\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline\n```\n\n\n:::\n:::\n\n\n\n\n\n\n## Base R Plotting\n\nTo make a plot you often need to specify the following features:\n\n1. Parameters\n2. Plot attributes\n3. 
The legend\n\n## 1. Parameters\n\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via `par()` before you call the plot creates ‘global’ settings for your plot.\n\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\n-\tThe `mfrow` specifies that we have one row and two columns of plots — that is, two plots side by side. \n-\tThe `mar` attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third, the top of the plot(4), and the fourth to the left (1).\n\n```\npar(mfrow = c(1,2), mar = c(5,5,4,1))\n```\n\n\n## 1. Parameters\n\n\n\n\n::: {.cell figwidth='100%'}\n::: {.cell-output-display}\n![](images/par.png)\n:::\n:::\n\n\n\n\n\n## Lots of parameters options\n\nHowever, there are many more parameter options that can be specified in the 'global' settings or specific to a certain plot option. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?par\n```\n:::\n\nSet or Query Graphical Parameters\n\nDescription:\n\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\n\nUsage:\n\n par(..., no.readonly = FALSE)\n \n (...., = )\n \nArguments:\n\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\n\nno.readonly: logical; if 'TRUE' and there are no other arguments, only\n parameters are returned which can be set by a subsequent\n 'par()' call _on the same device_.\n\nDetails:\n\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n * '\"ask\"',\n\n * '\"fig\"', '\"fin\"',\n\n * '\"lheight\"',\n\n * '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n * '\"new\"',\n\n * '\"oma\"', '\"omd\"', '\"omi\"',\n\n * '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n * '\"usr\"',\n\n * '\"xlog\"', '\"ylog\"',\n\n * '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. 
However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\n\nValue:\n\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\n\nGraphical Parameters:\n\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. 
A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) 
This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. 
Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. 
(The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\n\nColor Specification:\n\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. 
Hexadecimal colors can be in the long hexadecimal form\n (e.g., '\"#rrggbb\"' or '\"#rrggbbaa\"') or the short form (e.g,\n '\"#rgb\"' or '\"#rgba\"'). The short form is expanded to the long\n form by replicating digits (not by adding zeroes), e.g., '\"#rgb\"'\n becomes '\"#rrggbb\"'. Colors can also be specified by giving an\n index into a small table of colors, the 'palette': indices wrap\n round so with the default palette of size 8, '10' is the same as\n '2'. This provides compatibility with S. Index '0' corresponds\n to the background color. Note that the palette (apart from '0'\n which is per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\n\nLine Type Specification:\n\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\n\nNote:\n\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'pdf', 'postscript' and setting up device regions by\n 'layout' and 'split.screen'.\n\nExamples:\n\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) 
calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) {\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))\n\n\n\n\n## Common parameter options\n\nEight useful parameter arguments help improve the readability of the plot:\n\n- `xlab`: specifies the x-axis label of the plot\n- `ylab`: specifies the y-axis label\n- `main`: titles your graph\n- `pch`: specifies the plotting symbol of your graph\n- `lty`: specifies the line type of your graph\n- `lwd`: specifies line thickness\n- `cex`: specifies the size of points and text\n- `col`: specifies the colors for your graph.\n\nWe will explore the use of these arguments below.\n\n## Common parameter options\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/atrributes.png){width=200%}\n:::\n:::\n\n\n\n\n\n## 2. Plot Attributes\n\nPlot attributes are those that map your data to the plot. This means this is where you specify which variables in the data frame you want to plot. \n\nWe will only look at four types of plots today:\n\n- `hist()` displays a histogram of one variable\n- `plot()` displays an x-y plot of two variables\n- `boxplot()` displays a boxplot \n- `barplot()` displays a barplot\n\n\n## `hist()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?hist\n```\n:::\n\nHistograms\n\nDescription:\n\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\n\nUsage:\n\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n\n x: a vector of values for which the histogram is desired.\n\n breaks: one of:\n\n * a vector giving the breakpoints between histogram cells,\n\n * a function to compute the vector of breakpoints,\n\n * a single number giving the number of cells for the\n histogram,\n\n * a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n * a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\n freq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\n\nprobability: an _alias_ for '!freq', for S compatibility.\n\ninclude.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'\n value will be included in the first (or last, for 'right =\n FALSE') bar. This will be ignored (with a warning) unless\n 'breaks' is a vector.\n\n right: logical; if 'TRUE', the histogram cells are right-closed\n (left open) intervals.\n\n fuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\n\n density: the density of shading lines, in lines per inch. The default\n value of 'NULL' means that no shading lines are drawn.\n Non-positive values of 'density' also inhibit the drawing of\n shading lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise).\n\n col: a colour to be used to fill the bars.\n\n border: the color of the border around the bars. The default is to\n use the standard foreground color.\n\nmain, xlab, ylab: main title and axis labels: these arguments to\n 'title()' get \"smart\" defaults here, e.g., the default 'ylab'\n is '\"Frequency\"' iff 'freq' is true.\n\nxlim, ylim: the range of x and y values with sensible defaults. Note\n that 'xlim' is _not_ used to define the histogram (breaks),\n but only for plotting (when 'plot = TRUE').\n\n axes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\n plot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\n\n labels: logical or character string. 
Additionally draw labels on top\n of bars, if not 'FALSE'; see 'plot.histogram'.\n\n nclass: numeric (integer). For S(-PLUS) compatibility only, 'nclass'\n is equivalent to 'breaks' for a scalar or character argument.\n\nwarn.unused: logical. If 'plot = FALSE' and 'warn.unused = TRUE', a\n warning will be issued when graphical parameters are passed\n to 'hist.default()'.\n\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\n\nDetails:\n\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equispaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equispaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\n\nValue:\n\n an object of class '\"histogram\"' which is a list with components:\n\n breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).\n These are the nominal breaks, not with the boundary fuzz.\n\n counts: n integers; for each cell, the number of 'x[]' inside.\n\n density: values f^(x[i]), as estimated density values. If\n 'all(diff(breaks) == 1)', they are the relative frequencies\n 'counts/n' and in general satisfy sum[i; f^(x[i])\n (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.\n\n mids: the n cell midpoints.\n\n xname: a character string with the actual 'x' argument name.\n\nequidist: logical, indicating if the distances between 'breaks' are all\n the same.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\n\nSee Also:\n\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\n\nExamples:\n\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... :\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)\n\n\n\n\n## `hist()` example\n\nReminder function signature\n```\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n```\n\nLet's practice\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhist(df$age)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png){width=960}\n:::\n\n```{.r .cell-code}\nhist(\n\tdf$age, \n\tfreq=FALSE, \n\tmain=\"Histogram\", \n\txlab=\"Age (years)\"\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png){width=960}\n:::\n:::\n\n\n\n\n\n## `plot()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?plot\n```\n:::\n\nGeneric X-Y Plotting\n\nDescription:\n\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\n\nUsage:\n\n plot(x, y, ...)\n \nArguments:\n\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n * '\"p\"' for *p*oints,\n\n * '\"l\"' for *l*ines,\n\n * '\"b\"' for *b*oth,\n\n * '\"c\"' for the lines part alone of '\"b\"',\n\n * '\"o\"' for both '*o*verplotted',\n\n * '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n * '\"s\"' for stair *s*teps,\n\n * '\"S\"' for other *s*teps, see 'Details' below,\n\n * '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\n\nDetails:\n\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\n\nNote:\n\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\n\nSee Also:\n\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\n\nExamples:\n\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")\n\n\n\n\n\n## `plot()` example\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(df$age, df$IgG_concentration)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png){width=960}\n:::\n\n```{.r .cell-code}\nplot(\n\tdf$age, \n\tdf$IgG_concentration, \n\ttype=\"p\", \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age (years)\", \n\tylab=\"IgG Concentration (IU/mL)\", \n\tpch=16, \n\tcex=0.9,\n\tcol=\"lightblue\")\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png){width=960}\n:::\n:::\n\n\n\n\n## Adding more stuff to the same plot\n\n* We can use the functions `points()` or `lines()` to add additional points\nor additional lines to an existing plot.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(\n\tdf$age[df$slum == \"Non slum\"],\n\tdf$IgG_concentration[df$slum == \"Non slum\"],\n\ttype = \"p\",\n\tmain = \"IgG Concentration vs Age\",\n\txlab = \"Age (years)\",\n\tylab = \"IgG Concentration (IU/mL)\",\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"lightblue\",\n\txlim = range(df$age, na.rm = TRUE),\n\tylim = range(df$IgG_concentration, na.rm = TRUE)\n)\npoints(\n\tdf$age[df$slum == \"Mixed\"],\n\tdf$IgG_concentration[df$slum == \"Mixed\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"blue\"\n)\npoints(\n\tdf$age[df$slum == \"Slum\"],\n\tdf$IgG_concentration[df$slum == \"Slum\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"darkblue\"\n)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png){width=960}\n:::\n:::\n\n\n\n\n* The `lines()` function works similarly for connected lines.\n* Note that the `points()` or `lines()` functions must be called with a `plot()`-style function\n* We will show how we could draw a `legend()` in a future section.\n\n\n## `boxplot()` Help File\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?boxplot\n```\n:::\n\nBox Plots\n\nDescription:\n\n Produce box-and-whisker plot(s) of the given (grouped) values.\n\nUsage:\n\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\n\n formula: a formula, such as 'y ~ grp', where 'y' is a numeric vector\n of data values to be split into groups according to the\n grouping variable 'grp' (usually a factor). 
Note that '~ g1\n + g2' is equivalent to 'g1:g2'.\n\n data: a data.frame (or list) from which the variables in 'formula'\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used for plotting.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. The default is to ignore missing values in\n either the response or the group.\n\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty\n default. Can be suppressed by 'ann=FALSE'.\n\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\n\ndrop, sep, lex.order: passed to 'split.default', see there.\n\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\n\n range: this determines how far the plot whiskers extend out from the\n box. If 'range' is positive, the whiskers extend to the most\n extreme data point which is no more than 'range' times the\n interquartile range from the box. A value of zero causes the\n whiskers to extend to the data extremes.\n\n width: a vector giving the relative widths of the boxes making up\n the plot.\n\nvarwidth: if 'varwidth' is 'TRUE', the boxes are drawn with widths\n proportional to the square-roots of the number of\n observations in the groups.\n\n notch: if 'notch' is 'TRUE', a notch is drawn in each side of the\n boxes. If the notches of two plots do not overlap this is\n 'strong evidence' that the two medians differ (Chambers et\n al., 1983, p. 62). See 'boxplot.stats' for the calculations\n used.\n\n outline: if 'outline' is not true, the outliers are not drawn (as\n points whereas S+ uses lines).\n\n names: group labels which will be printed under each boxplot. Can\n be a character vector or an expression (see plotmath).\n\n boxwex: a scale factor to be applied to all boxes. When there are\n only a few groups, the appearance of the plot can be improved\n by making the boxes narrower.\n\nstaplewex: staple line width expansion, proportional to box width.\n\n outwex: outlier line width expansion, proportional to box width.\n\n plot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\n\n border: an optional vector of colors for the outlines of the\n boxplots. The values in 'border' are recycled if the length\n of 'border' is less than the number of plots.\n\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\n pars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\n\nhorizontal: logical indicating if the boxplots should be horizontal;\n default 'FALSE' means vertical boxes.\n\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\n\nDetails:\n\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\n\nValue:\n\n List with the following components:\n\n stats: a matrix, each column contains the extreme of the lower\n whisker, the lower hinge, the median, the upper hinge and the\n extreme of the upper whisker for one group/plot. If all the\n inputs have the same class attribute, so will this component.\n\n n: a vector with the number of (non-'NA') observations in each\n group.\n\n conf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\n\n group: a vector of the same length as 'out' whose elements indicate\n to which group the outlier belongs.\n\n names: a vector of names for the groups.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). _Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\n\nSee Also:\n\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\n\nExamples:\n\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)\n\n\n\n\n\n## `boxplot()` example\n\nReminder function signature\n```\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n```\n\nLet's practice\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxplot(IgG_concentration~age_group, data=df)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png){width=960}\n:::\n\n```{.r .cell-code}\nboxplot(\n\tlog(df$IgG_concentration)~df$age_group, \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age Group (years)\", \n\tylab=\"log IgG Concentration (mIU/mL)\", \n\tnames=c(\"1-5\",\"6-10\", \"11-15\"), \n\tvarwidth=T\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png){width=960}\n:::\n:::\n\n\n\n\n\n## `barplot()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?barplot\n```\n:::\n\nBar Plots\n\nDescription:\n\n Creates a bar plot with vertical or horizontal bars.\n\nUsage:\n\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\n\n height: either a vector or matrix of values describing the bars which\n make up the plot. If 'height' is a vector, the plot consists\n of a sequence of rectangular bars with heights given by the\n values in the vector. If 'height' is a matrix and 'beside'\n is 'FALSE' then each bar of the plot corresponds to a column\n of 'height', with the values in the column giving the heights\n of stacked sub-bars making up the bar. If 'height' is a\n matrix and 'beside' is 'TRUE', then the values in each column\n are juxtaposed rather than stacked.\n\n width: optional vector of bar widths. Re-cycled to length the number\n of bars drawn. 
Specifying a single value will have no\n visible effect unless 'xlim' is specified.\n\n space: the amount of space (as a fraction of the average bar width)\n left before each bar. May be given as a single number or one\n number per bar. If 'height' is a matrix and 'beside' is\n 'TRUE', 'space' may be specified by two numbers, where the\n first is the space between bars in the same group, and the\n second the space between the groups. If not given\n explicitly, it defaults to 'c(0,1)' if 'height' is a matrix\n and 'beside' is 'TRUE', and to 0.2 otherwise.\n\nnames.arg: a vector of names to be plotted below each bar or group of\n bars. If this argument is omitted, then the names are taken\n from the 'names' attribute of 'height' if this is a vector,\n or the column names if it is a matrix.\n\nlegend.text: a vector of text used to construct a legend for the plot,\n or a logical indicating whether a legend should be included.\n This is only useful when 'height' is a matrix. In that case\n given legend labels should correspond to the rows of\n 'height'; if 'legend.text' is true, the row names of 'height'\n will be used as labels if they are non-null.\n\n beside: a logical value. If 'FALSE', the columns of 'height' are\n portrayed as stacked bars, and if 'TRUE' the columns are\n portrayed as juxtaposed bars.\n\n horiz: a logical value. If 'FALSE', the bars are drawn vertically\n with the first bar to the left. If 'TRUE', the bars are\n drawn horizontally with the first at the bottom.\n\n density: a vector giving the density of shading lines, in lines per\n inch, for the bars or bar components. The default value of\n 'NULL' means that no shading lines are drawn. Non-positive\n values of 'density' also inhibit the drawing of shading\n lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise), for the bars or bar components.\n\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\n\n border: the color to be used for the border of the bars. Use 'border\n = NA' to omit borders. If there are shading lines, 'border =\n TRUE' means use the same colour for the border as for the\n shading lines.\n\nmain, sub: main title and subtitle for the plot.\n\n xlab: a label for the x axis.\n\n ylab: a label for the y axis.\n\n xlim: limits for the x axis.\n\n ylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\n axes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\n\naxisnames: logical. If 'TRUE', and if there are 'names.arg' (see\n above), the other axis is drawn (with 'lty = 0') and labeled.\n\ncex.axis: expansion factor for numeric axis labels (see 'par('cex')').\n\ncex.names: expansion factor for axis names (bar labels).\n\n inside: logical. If 'TRUE', the lines which divide adjacent\n (non-stacked!) bars will be drawn. Only applies when 'space\n = 0' (which it partly is when 'beside = TRUE').\n\n plot: logical. If 'FALSE', nothing is plotted.\n\naxis.lty: the graphics parameter 'lty' (see 'par('lty')') applied to\n the axis and tick marks of the categorical (default\n horizontal) axis. 
Note that by default the axis is\n suppressed.\n\n offset: a vector indicating how much the bars should be shifted\n relative to the x axis.\n\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\n\nargs.legend: list of additional arguments to pass to 'legend()'; names\n of the list are used as argument names. Only used if\n 'legend.text' is supplied.\n\n formula: a formula where the 'y' variables are numeric data to plot\n against the categorical 'x' variables. The formula can have\n one of three forms:\n\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\n data: a data frame (or list) from which the variables in formula\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\n\nValue:\n\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\n\nAuthor(s):\n\n R Core, with a contribution by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\n\nExamples:\n\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))\n\n\n\n\n\n## `barplot()` example\n\nThe function takes a lot of arguments to control the way our data is plotted. 
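\n\nFor instance, a minimal sketch (assuming the same `df`, with `seropos` and `age_group` columns, used in the examples below): `beside = TRUE` switches from stacked to side-by-side bars, and `legend.text = TRUE` builds a legend from the row names of the table.\n\n```\n# Hypothetical sketch (not one of the module exercises):\n# side-by-side bars with a legend taken from the table's row names\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq, beside = TRUE, legend.text = TRUE)\n```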
\n\nReminder function signature\n```\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n```\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png){width=960}\n:::\n\n```{.r .cell-code}\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png){width=960}\n:::\n:::\n\n\n\n\n## 3. Legend!\n\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?legend\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. 
The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. 
If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, 
lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)\n```\n\n\n:::\n:::\n\n\n\n\n\n\n## Add legend to the plot\n\nReminder function signature\n```\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n```\n\nLet's practice\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age 
Group\")\nlegend(x=2.5, y=0.5,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n\n## Add legend to the plot\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png){width=960}\n:::\n:::\n\n\n\n\n\n## `barplot()` example\n\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). Also, the age groups need more meaningful names.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n## `barplot()` example\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png){width=960}\n:::\n:::\n\n\n\n\n\n\n## `barplot()` example\n\nNow, let look at seropositivity by two individual level characteristics in the same plot. \n\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n\n## `barplot()` example\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png){width=960}\n:::\n:::\n\n\n\n\n## Saving plots to file\n\nIf you want to include your graphic in a paper or anything else, you need to\nsave it as an image. One limitation of base R graphics is that the process for\nsaving plots is a bit annoying.\n\n1. Open a graphics device connection with a graphics function -- examples\ninclude `pdf()`, `png()`, and `tiff()` for the most useful.\n1. Run the code that creates your plot.\n1. 
Use `dev.off()` to close the graphics device connection.\n\nLet's do an example.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Open the graphics device\npng(\n\t\"my-barplot.png\",\n\twidth = 800,\n\theight = 450,\n\tunits = \"px\"\n)\n# Set the plot layout -- this is an alternative to par(mfrow = ...)\nlayout(matrix(c(1, 2), ncol = 2))\n# Make the plot\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n# Close the graphics device\ndev.off()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\npng \n 2 \n```\n\n\n:::\n\n```{.r .cell-code}\n# Reset the layout\nlayout(1)\n```\n:::\n\n\n\n\nNote: after you do an interactive graphics session, it is often helpful to\nrestart R or run the function `graphics.off()` before opening the graphics\ndevice connection.\n\n## Base R plots vs the Tidyverse ggplot2 package\n\nIt is good to know both because they each have their strengths.\n\n## Summary\n\n- the Base R 'graphics' package has a ton of graphics options that allow for ultimate flexibility\n- Base R plots typically include setting plot options (`par()`), mapping data to the plot (e.g., `plot()`, `barplot()`, `points()`, `lines()`), and creating a legend (`legend()`). \n- the functions `points()` or `lines()` add additional points or additional lines to an existing plot, but must be called after a `plot()`-style function has drawn the plot\n- in Base R plotting the legend is not automatically generated, so be careful when creating it\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Base Plotting in R\" by Medium](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22)\n- [\"Base R margins: a cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)\n",\n+    "markdown": "---\ntitle: \"Module 10: Data Visualization\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 10, you should be able to:\n\n- Create Base R plots\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n## Prep data\n\nCreate the `age_group` three-level factor variable\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\nCreate the `seropos` binary variable representing seropositivity, defined as an antibody concentration >10 IU/mL.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)\n```\n:::\n\n\n## Base R data visualization functions\n\nThe Base R 'graphics' package has a ton of graphics options. 
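As a quick aside, another way to see what is available is to list the functions the package exports (a small optional sketch; `ls()` works here because the 'graphics' package is attached by default in a normal R session):\n\n```\n# list every function exported by the 'graphics' package\nls(\"package:graphics\")\n```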
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhelp(package = \"graphics\")\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\t\tInformation on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.4.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team \nContact: R-help mailing list \nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.4.1\nNeedsCompilation: yes\nEnhances: vcd\nBuilt: R 4.4.1; x86_64-apple-darwin20; 2024-06-15 17:31:38\n UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline\n```\n\n\n:::\n:::\n\n\n\n\n## Base R Plotting\n\nTo make a plot you often need to specify the following features:\n\n1. Parameters\n2. Plot attributes\n3. The legend\n\n## 1. 
Parameters\n\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via `par()` before you call the plot creates ‘global’ settings for your plot.\n\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\n-\tThe `mfrow` attribute specifies that we have one row and two columns of plots — that is, two plots side by side. \n-\tThe `mar` attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third indicating the margin above the plot (4), and the fourth indicating the margin to the right of the plot (1).\n\n```\npar(mfrow = c(1,2), mar = c(5,5,4,1))\n```\n\n\n## 1. Parameters\n\n\n::: {.cell figwidth='100%'}\n::: {.cell-output-display}\n![](images/par.png)\n:::\n:::\n\n\n\n## Lots of parameter options\n\nHowever, there are many more parameter options that can be specified in the 'global' settings or passed to a specific plot function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n?par\n```\n:::\n\nSet or Query Graphical Parameters\n\nDescription:\n\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\n\nUsage:\n\n par(..., no.readonly = FALSE)\n \n (...., = )\n \nArguments:\n\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\n\nno.readonly: logical; if 'TRUE' and there are no other arguments, only\n parameters are returned which can be set by a subsequent\n 'par()' call _on the same device_.\n\nDetails:\n\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. 
However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\n\nValue:\n\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\n\nGraphical Parameters:\n\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. 
A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) 
This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. 
Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. 
(The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\n\nColor Specification:\n\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. 
Hexadecimal colors can be in the long hexadecimal form\n (e.g., '\"#rrggbb\"' or '\"#rrggbbaa\"') or the short form (e.g,\n '\"#rgb\"' or '\"#rgba\"'). The short form is expanded to the long\n form by replicating digits (not by adding zeroes), e.g., '\"#rgb\"'\n becomes '\"#rrggbb\"'. Colors can also be specified by giving an\n index into a small table of colors, the 'palette': indices wrap\n round so with the default palette of size 8, '10' is the same as\n '2'. This provides compatibility with S. Index '0' corresponds\n to the background color. Note that the palette (apart from '0'\n which is per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\n\nLine Type Specification:\n\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\n\nNote:\n\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'pdf', 'postscript' and setting up device regions by\n 'layout' and 'split.screen'.\n\nExamples:\n\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) 
calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) {\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))\n\n\n## Common parameter options\n\nEight useful parameter arguments help improve the readability of the plot:\n\n- `xlab`: specifies the x-axis label of the plot\n- `ylab`: specifies the y-axis label\n- `main`: titles your graph\n- `pch`: specifies the symbology of your graph\n- `lty`: specifies the line type of your graph\n- `lwd`: specifies line thickness\n- `cex`: specifies the size of plotting text and symbols\n- `col`: specifies the colors for your graph\n\nWe will explore the use of these arguments below.\n\n## Common parameter options\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/atrributes.png){width=200%}\n:::\n:::\n\n\n\n## 2. Plot Attributes\n\nPlot attributes are those that map your data to the plot. This means this is where you specify which variables in the data frame you want to plot. \n\nWe will only look at four types of plots today:\n\n- `hist()` displays a histogram of one variable\n- `plot()` displays an x-y plot of two variables\n- `boxplot()` displays a boxplot\n- `barplot()` displays a barplot\n\n\n## `hist()` Help File\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?hist\n```\n:::\n\nHistograms\n\nDescription:\n\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\n\nUsage:\n\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n\n x: a vector of values for which the histogram is desired.\n\n breaks: one of:\n\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\n freq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\n\nprobability: an _alias_ for '!freq', for S compatibility.\n\ninclude.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'\n value will be included in the first (or last, for 'right =\n FALSE') bar. This will be ignored (with a warning) unless\n 'breaks' is a vector.\n\n right: logical; if 'TRUE', the histogram cells are right-closed\n (left open) intervals.\n\n fuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\n\n density: the density of shading lines, in lines per inch. The default\n value of 'NULL' means that no shading lines are drawn.\n Non-positive values of 'density' also inhibit the drawing of\n shading lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise).\n\n col: a colour to be used to fill the bars.\n\n border: the color of the border around the bars. The default is to\n use the standard foreground color.\n\nmain, xlab, ylab: main title and axis labels: these arguments to\n 'title()' get \"smart\" defaults here, e.g., the default 'ylab'\n is '\"Frequency\"' iff 'freq' is true.\n\nxlim, ylim: the range of x and y values with sensible defaults. Note\n that 'xlim' is _not_ used to define the histogram (breaks),\n but only for plotting (when 'plot = TRUE').\n\n axes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\n plot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\n\n labels: logical or character string. 
Additionally draw labels on top\n of bars, if not 'FALSE'; see 'plot.histogram'.\n\n nclass: numeric (integer). For S(-PLUS) compatibility only, 'nclass'\n is equivalent to 'breaks' for a scalar or character argument.\n\nwarn.unused: logical. If 'plot = FALSE' and 'warn.unused = TRUE', a\n warning will be issued when graphical parameters are passed\n to 'hist.default()'.\n\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\n\nDetails:\n\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equispaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equispaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\n\nValue:\n\n an object of class '\"histogram\"' which is a list with components:\n\n breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).\n These are the nominal breaks, not with the boundary fuzz.\n\n counts: n integers; for each cell, the number of 'x[]' inside.\n\n density: values f^(x[i]), as estimated density values. If\n 'all(diff(breaks) == 1)', they are the relative frequencies\n 'counts/n' and in general satisfy sum[i; f^(x[i])\n (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.\n\n mids: the n cell midpoints.\n\n xname: a character string with the actual 'x' argument name.\n\nequidist: logical, indicating if the distances between 'breaks' are all\n the same.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\n\nSee Also:\n\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\n\nExamples:\n\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... :\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)\n\n\n## `hist()` example\n\nReminder function signature\n```\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n```\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\nhist(df$age)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png){width=960}\n:::\n\n```{.r .cell-code}\nhist(\n\tdf$age, \n\tfreq=FALSE, \n\tmain=\"Histogram\", \n\txlab=\"Age (years)\"\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png){width=960}\n:::\n:::\n\n\n\n## `plot()` Help File\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?plot\n```\n:::\n\nGeneric X-Y Plotting\n\nDescription:\n\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\n\nUsage:\n\n plot(x, y, ...)\n \nArguments:\n\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\n\nDetails:\n\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\n\nNote:\n\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\n\nSee Also:\n\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\n\nExamples:\n\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")\n\n\n\n## `plot()` example\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(df$age, df$IgG_concentration)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png){width=960}\n:::\n\n```{.r .cell-code}\nplot(\n\tdf$age, \n\tdf$IgG_concentration, \n\ttype=\"p\", \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age (years)\", \n\tylab=\"IgG Concentration (IU/mL)\", \n\tpch=16, \n\tcex=0.9,\n\tcol=\"lightblue\")\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png){width=960}\n:::\n:::\n\n\n## Adding more stuff to the same plot\n\n* We can use the functions `points()` or `lines()` to add additional points\nor additional lines to an existing plot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(\n\tdf$age[df$slum == \"Non slum\"],\n\tdf$IgG_concentration[df$slum == \"Non slum\"],\n\ttype = \"p\",\n\tmain = \"IgG Concentration vs Age\",\n\txlab = \"Age (years)\",\n\tylab = \"IgG Concentration (IU/mL)\",\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"lightblue\",\n\txlim = range(df$age, na.rm = TRUE),\n\tylim = range(df$IgG_concentration, na.rm = TRUE)\n)\npoints(\n\tdf$age[df$slum == \"Mixed\"],\n\tdf$IgG_concentration[df$slum == \"Mixed\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"blue\"\n)\npoints(\n\tdf$age[df$slum == \"Slum\"],\n\tdf$IgG_concentration[df$slum == \"Slum\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"darkblue\"\n)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png){width=960}\n:::\n:::\n\n\n* The `lines()` function works similarly for connected lines.\n* Note that the `points()` or `lines()` functions must be called with a `plot()`-style function\n* We will show how we could draw a `legend()` in a future section.\n\n\n## `boxplot()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?boxplot\n```\n:::\n\nBox Plots\n\nDescription:\n\n Produce box-and-whisker plot(s) of the given (grouped) values.\n\nUsage:\n\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\n\n formula: a formula, such as 'y ~ grp', where 'y' is a numeric vector\n of data values to be split into groups according to the\n grouping variable 'grp' (usually a factor). 
Note that '~ g1\n + g2' is equivalent to 'g1:g2'.\n\n data: a data.frame (or list) from which the variables in 'formula'\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used for plotting.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. The default is to ignore missing values in\n either the response or the group.\n\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty\n default. Can be suppressed by 'ann=FALSE'.\n\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\n\ndrop, sep, lex.order: passed to 'split.default', see there.\n\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\n\n range: this determines how far the plot whiskers extend out from the\n box. If 'range' is positive, the whiskers extend to the most\n extreme data point which is no more than 'range' times the\n interquartile range from the box. A value of zero causes the\n whiskers to extend to the data extremes.\n\n width: a vector giving the relative widths of the boxes making up\n the plot.\n\nvarwidth: if 'varwidth' is 'TRUE', the boxes are drawn with widths\n proportional to the square-roots of the number of\n observations in the groups.\n\n notch: if 'notch' is 'TRUE', a notch is drawn in each side of the\n boxes. If the notches of two plots do not overlap this is\n 'strong evidence' that the two medians differ (Chambers et\n al., 1983, p. 62). See 'boxplot.stats' for the calculations\n used.\n\n outline: if 'outline' is not true, the outliers are not drawn (as\n points whereas S+ uses lines).\n\n names: group labels which will be printed under each boxplot. Can\n be a character vector or an expression (see plotmath).\n\n boxwex: a scale factor to be applied to all boxes. When there are\n only a few groups, the appearance of the plot can be improved\n by making the boxes narrower.\n\nstaplewex: staple line width expansion, proportional to box width.\n\n outwex: outlier line width expansion, proportional to box width.\n\n plot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\n\n border: an optional vector of colors for the outlines of the\n boxplots. The values in 'border' are recycled if the length\n of 'border' is less than the number of plots.\n\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\n pars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\n\nhorizontal: logical indicating if the boxplots should be horizontal;\n default 'FALSE' means vertical boxes.\n\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\n\nDetails:\n\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\n\nValue:\n\n List with the following components:\n\n stats: a matrix, each column contains the extreme of the lower\n whisker, the lower hinge, the median, the upper hinge and the\n extreme of the upper whisker for one group/plot. If all the\n inputs have the same class attribute, so will this component.\n\n n: a vector with the number of (non-'NA') observations in each\n group.\n\n conf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\n\n group: a vector of the same length as 'out' whose elements indicate\n to which group the outlier belongs.\n\n names: a vector of names for the groups.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). _Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\n\nSee Also:\n\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\n\nExamples:\n\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)\n\n\n\n## `boxplot()` example\n\nReminder function signature\n```\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n```\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\nboxplot(IgG_concentration~age_group, data=df)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png){width=960}\n:::\n\n```{.r .cell-code}\nboxplot(\n\tlog(df$IgG_concentration)~df$age_group, \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age Group (years)\", \n\tylab=\"log IgG Concentration (mIU/mL)\", \n\tnames=c(\"1-5\",\"6-10\", \"11-15\"), \n\tvarwidth=T\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png){width=960}\n:::\n:::\n\n\n\n## `barplot()` Help File\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?barplot\n```\n:::\n\nBar Plots\n\nDescription:\n\n Creates a bar plot with vertical or horizontal bars.\n\nUsage:\n\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\n\n height: either a vector or matrix of values describing the bars which\n make up the plot. If 'height' is a vector, the plot consists\n of a sequence of rectangular bars with heights given by the\n values in the vector. If 'height' is a matrix and 'beside'\n is 'FALSE' then each bar of the plot corresponds to a column\n of 'height', with the values in the column giving the heights\n of stacked sub-bars making up the bar. If 'height' is a\n matrix and 'beside' is 'TRUE', then the values in each column\n are juxtaposed rather than stacked.\n\n width: optional vector of bar widths. Re-cycled to length the number\n of bars drawn. 
Specifying a single value will have no\n visible effect unless 'xlim' is specified.\n\n space: the amount of space (as a fraction of the average bar width)\n left before each bar. May be given as a single number or one\n number per bar. If 'height' is a matrix and 'beside' is\n 'TRUE', 'space' may be specified by two numbers, where the\n first is the space between bars in the same group, and the\n second the space between the groups. If not given\n explicitly, it defaults to 'c(0,1)' if 'height' is a matrix\n and 'beside' is 'TRUE', and to 0.2 otherwise.\n\nnames.arg: a vector of names to be plotted below each bar or group of\n bars. If this argument is omitted, then the names are taken\n from the 'names' attribute of 'height' if this is a vector,\n or the column names if it is a matrix.\n\nlegend.text: a vector of text used to construct a legend for the plot,\n or a logical indicating whether a legend should be included.\n This is only useful when 'height' is a matrix. In that case\n given legend labels should correspond to the rows of\n 'height'; if 'legend.text' is true, the row names of 'height'\n will be used as labels if they are non-null.\n\n beside: a logical value. If 'FALSE', the columns of 'height' are\n portrayed as stacked bars, and if 'TRUE' the columns are\n portrayed as juxtaposed bars.\n\n horiz: a logical value. If 'FALSE', the bars are drawn vertically\n with the first bar to the left. If 'TRUE', the bars are\n drawn horizontally with the first at the bottom.\n\n density: a vector giving the density of shading lines, in lines per\n inch, for the bars or bar components. The default value of\n 'NULL' means that no shading lines are drawn. Non-positive\n values of 'density' also inhibit the drawing of shading\n lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise), for the bars or bar components.\n\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\n\n border: the color to be used for the border of the bars. Use 'border\n = NA' to omit borders. If there are shading lines, 'border =\n TRUE' means use the same colour for the border as for the\n shading lines.\n\nmain, sub: main title and subtitle for the plot.\n\n xlab: a label for the x axis.\n\n ylab: a label for the y axis.\n\n xlim: limits for the x axis.\n\n ylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\n axes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\n\naxisnames: logical. If 'TRUE', and if there are 'names.arg' (see\n above), the other axis is drawn (with 'lty = 0') and labeled.\n\ncex.axis: expansion factor for numeric axis labels (see 'par('cex')').\n\ncex.names: expansion factor for axis names (bar labels).\n\n inside: logical. If 'TRUE', the lines which divide adjacent\n (non-stacked!) bars will be drawn. Only applies when 'space\n = 0' (which it partly is when 'beside = TRUE').\n\n plot: logical. If 'FALSE', nothing is plotted.\n\naxis.lty: the graphics parameter 'lty' (see 'par('lty')') applied to\n the axis and tick marks of the categorical (default\n horizontal) axis. 
Note that by default the axis is\n suppressed.\n\n offset: a vector indicating how much the bars should be shifted\n relative to the x axis.\n\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\n\nargs.legend: list of additional arguments to pass to 'legend()'; names\n of the list are used as argument names. Only used if\n 'legend.text' is supplied.\n\n formula: a formula where the 'y' variables are numeric data to plot\n against the categorical 'x' variables. The formula can have\n one of three forms:\n\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\n data: a data frame (or list) from which the variables in formula\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\n\nValue:\n\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\n\nAuthor(s):\n\n R Core, with a contribution by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\n\nExamples:\n\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))\n\n\n\n## `barplot()` example\n\nThe function takes a lot of arguments to control the way our data is plotted. 
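\n\nAs a minimal sketch (using made-up counts rather than the module data), the `height` argument alone already changes the plot a lot: a matrix is drawn as stacked bars by default and as side-by-side bars with `beside = TRUE`.\n\n```\n# Purely illustrative counts, not the module data\nheights <- matrix(c(10, 5, 8, 7, 6, 9), nrow = 2,\n dimnames = list(c(\"neg\", \"pos\"), c(\"1-5\", \"6-10\", \"11-15\")))\nbarplot(heights) # stacked bars (default)\nbarplot(heights, beside = TRUE) # side-by-side bars\n```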
\n\nReminder function signature\n```\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png){width=960}\n:::\n\n```{.r .cell-code}\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png){width=960}\n:::\n:::\n\n\n## 3. Legend!\n\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?legend\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. 
The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. 
If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, 
lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)\n```\n\n\n:::\n:::\n\n\n\n\n## Add legend to the plot\n\nReminder function signature\n```\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n```\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age 
Group\")\nlegend(x=2.5, y=0.5,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n## Add legend to the plot\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png){width=960}\n:::\n:::\n\n\n\n## `barplot()` example\n\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). Also, the age groups need more meaningful names.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n## `barplot()` example\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png){width=960}\n:::\n:::\n\n\n\n\n## `barplot()` example\n\nNow, let look at seropositivity by two individual level characteristics in the same plot. \n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n## `barplot()` example\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png){width=960}\n:::\n:::\n\n\n## Saving plots to file\n\nIf you want to include your graphic in a paper or anything else, you need to\nsave it as an image. One limitation of base R graphics is that the process for\nsaving plots is a bit annoying.\n\n1. Open a graphics device connection with a graphics function -- examples\ninclude `pdf()`, `png()`, and `tiff()` for the most useful.\n1. Run the code that creates your plot.\n1. 
Use `dev.off()` to close the graphics device connection.\n\nLet's do an example.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Open the graphics device\npng(\n\t\"my-barplot.png\",\n\twidth = 800,\n\theight = 450,\n\tunits = \"px\"\n)\n# Set the plot layout -- this is an alternative to par(mfrow = ...)\nlayout(matrix(c(1, 2), ncol = 2))\n# Make the plot\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n# Close the graphics device\ndev.off()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nquartz_off_screen \n 2 \n```\n\n\n:::\n\n```{.r .cell-code}\n# Reset the layout\nlayout(1)\n```\n:::\n\n\nNote: after you do an interactive graphics session, it is often helpful to\nrestart R or run the function `graphics.off()` before opening the graphics\nconnection device.\n\n## Base R plots vs the Tidyverse ggplot2 package\n\nIt is good to know both b/c they each have their strengths\n\n## Summary\n\n- the Base R 'graphics' package has a ton of graphics options that allow for ultimate flexibility\n- Base R plots typically include setting plot options (`par()`), mapping data to the plot (e.g., `plot()`, `barplot()`, `points()`, `lines()`), and creating a legend (`legend()`). \n- the functions `points()` or `lines()` add additional points or additional lines to an existing plot, but must be called with a `plot()`-style function\n- in Base R plotting the legend is not automatically generated, so be careful when creating it\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Base Plotting in R\" by Medium](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22)\n-\t\t[\"Base R margins: a cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)\n", "supporting": [ "Module10-DataVisualization_files" ], diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png index 554b146..baf3c4b 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png index 7771ab5..5535ebf 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png index b7ef384..24d0d37 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png differ diff --git 
a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png index faacd3e..4e5c9c8 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png index ee446a7..bf214c3 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png index 66ec6a4..ca0e2f6 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png index b02c913..ccb4316 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png index 8784b6a..a7e02e6 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png index 554d371..57c867f 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png index a83659f..edfae88 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png index 8bfbbe3..232d44e 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png index a7fecb4..c6eb02c 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png and 
b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png differ diff --git a/_freeze/modules/Module12-Iteration/execute-results/html.json b/_freeze/modules/Module12-Iteration/execute-results/html.json new file mode 100644 index 0000000..c817a1d --- /dev/null +++ b/_freeze/modules/Module12-Iteration/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "3dc5f13d9b279cbe4fc38ed3b2fc6560", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Module 12: Iteration in R\"\nformat:\n revealjs:\n toc: false\n---\n\n\n\n\n\n## Learning goals\n\n1. Replace repetitive code with a `for` loop\n1. Use vectorization to replace unnecessary loops\n\n## What is iteration?\n\n* Whenever you repeat something, that's iteration.\n* In `R`, this means running the same code multiple times in a row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"penguins\", package = \"palmerpenguins\")\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n## Parts of a loop\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,9\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\nThe **header** declares how many times we will repeat the same code. The header\ncontains a **control variable** that changes in each repetition and a\n**sequence** of values for the control variable to take.\n\n## Parts of a loop\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2-8\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\nThe **body** of the loop contains code that will be repeated a number of times\nbased on the header instructions. 
In `R`, the body has to be surrounded by\ncurly braces.\n\n## Header parts\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n* `for`: keyword that declares we are doing a for loop.\n* `(...)`: parentheses after `for` declare the control variable and sequence.\n* `this_island`: the control variable.\n* `in`: keyword that separates the control variable and sequence.\n* `levels(penguins$island)`: the sequence.\n* `{}`: curly braces will contain the body code.\n\n## Header parts\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n* Since `levels(penguins$island)` evaluates to\n`c(\"Biscoe\", \"Dream\", \"Torgersen\")`, our loop will repeat 3 times.\n\n| Iteration | `this_island` |\n|-----------|---------------|\n| 1 | \"Biscoe\" |\n| 2 | \"Dream\" |\n| 3 | \"Torgersen\" |\n\n* Everything inside of `{...}` will be repeated three times.\n\n## Loop iteration 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Biscoe\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Biscoe\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\n```\n\n\n:::\n:::\n\n\n## Loop iteration 2\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Dream\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Dream\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Dream Island was 18.34 mm.\n```\n\n\n:::\n:::\n\n\n## Loop iteration 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Torgersen\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n## The loop structure automates this process for us so we don't have to copy and paste our code!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n## Side note: the pipe operator `|>` {.scrollable}\n\n* This operator allows us to chain commands together so the output of the\nprevious statement is passed into the next statement.\n* E.g. the code\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n```\n:::\n\n\nwill be transformed by R into\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tround(\n\t\tmean(\n\t\t\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"],\n\t\t\tna.rm = TRUE\n\t\t),\n\t\tdigits = 2\n\t)\n```\n:::\n\n\nbefore it gets run. 
So using the pipe is a way to avoid deeply nested functions.\n\nNote that another alternative could be like this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_data <- penguins$bill_depth_mm[penguins$island == \"Torgersen\"]\nisland_mean_raw <- mean(island_data, na.rm = TRUE)\nisland_mean <- round(island_mean_raw, digits = 2)\n```\n:::\n\n\nSo using `|>` can also help us to avoid a lot of assignments.\n\n* **Whichever style you prefer is fine!** Some people like the pipe, some\npeople like nesting, and some people like intermediate assignments. All three\nare perfectly fine as long as your code is neat and commented.\n* If you go on to the `tidyverse` class, you will use a lot of piping -- it\nis a very popular coding style in R these days thanks to the inventors of\nthe `tidyverse` packages.\n* Also note that you need R version 4.1.0 or better to use `|>`. If you are\non an older version of R, it will not be available.\n\n**Now, back to loops!**\n\n## Remember: write DRY code!\n\n* DRY = \"Don't Repeat Yourself\"\n* Instead of copying and pasting, write loops and functions.\n* Easier to debug and change in the future!\n\n. . .\n\n* Of course, we all copy and paste code sometimes. If you are running on a\ntight deadline or can't get a loop or function to work, you might need to.\n**DRY code is good, but working code is best!**\n\n## {#tweet-slide data-menu-title=\"Hadley tweet\" .center}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](../images/hadley-tweet.PNG)\n:::\n:::\n\n\n## You try it!\n\nWrite a loop that goes from 1 to 10, squares each of the numbers, and prints\nthe squared number.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:10) {\n\tcat(i ^ 2, \"\\n\")\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1 \n4 \n9 \n16 \n25 \n36 \n49 \n64 \n81 \n100 \n```\n\n\n:::\n:::\n\n\n## Wait, did we need to do that? {.incremental}\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n## Wait, did we need to do that? {.scrollable}\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# No loop needed!\n(1:10)^2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n\n\n:::\n:::\n\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Get the first 10 odd numbers, a common CS 101 loop problem on exams\n(1:20)[which((1:20 %% 2) == 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 3 5 7 9 11 13 15 17 19\n```\n\n\n:::\n:::\n\n\n. . .\n\n* So you should really try vectorization first, then use loops only when\nyou can't use vectorization.\n\n## Loop walkthrough\n\n* Let's walk through a complex but useful example where we can't use\nvectorization.\n* Load the cleaned measles dataset, and subset it so you only have MCV1 records.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\")) |>\n\tsubset(vaccine_antigen == \"MCV1\")\nstr(meas)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t7972 obs. 
of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ...\n```\n\n\n:::\n:::\n\n\n## Loop walkthrough\n\n* First, make an empty `list`. This is where we'll store our results. Make it\nthe same length as the number of countries in the dataset.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- vector(mode = \"list\", length = length(unique(meas$country)))\n```\n:::\n\n\n* This is called *preallocation* and it can make your loops much faster.\n\n## Loop walkthrough\n\n* Loop through every country in the dataset, and get the median, first and third\nquartiles, and range for each country. Store those summary statistics in a data frame.\n* What should the header look like?\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncountries <- unique(meas$country)\nfor (i in 1:length(countries)) {...}\n```\n:::\n\n\n. . .\n\n* Note that we use the **index** as the control variable. When you need to\ndo complex operations inside a loop, this is easier than the **for-each**\nconstruction we used earlier.\n\n## Loop walkthrough {.scrollable}\n\n* Now write out the body of the code. First we need to subset the data, to get\nonly the data for the current country.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n}\n```\n:::\n\n\n. . .\n\n* Next we need to get the summary of the cases for that country.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n}\n```\n:::\n\n\n. . .\n\n* Next we save the summary statistics into a data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n}\n```\n:::\n\n\n. . 
.\n\n* And finally, we save the data frame as the next element in our storage list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n\t\n\t# Save the results to our container\n\tres[[i]] <- country_summary\n}\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n:::\n\n\n. . .\n\n* Let's take a look at the results.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(res)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n country min Q1 median Q3 max\n1 Afghanistan 353 1154 2205 5166 31107\n\n[[2]]\n country min Q1 median Q3 max\n1 Angola 29 700 3271 14474 30067\n\n[[3]]\n country min Q1 median Q3 max\n1 Albania 0 1 12 29 136034\n\n[[4]]\n country min Q1 median Q3 max\n1 Andorra 0 0 1 2 5\n\n[[5]]\n country min Q1 median Q3 max\n1 United Arab Emirates 22 89.75 320 1128 2913\n\n[[6]]\n country min Q1 median Q3 max\n1 Argentina 0 0 17 4591.5 42093\n```\n\n\n:::\n:::\n\n\n* How do we deal with this to get it into a nice form?\n\n. . .\n\n* We can use a *vectorization* trick: the function `do.call()` seems like\nancient computer science magic. And it is. But it will actually help us a\nlot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres_df <- do.call(rbind, res)\nhead(res_df)\n```\n\n::: {.cell-output-display}\n\n\n|country | min| Q1| median| Q3| max|\n|:--------------------|---:|-------:|------:|-------:|------:|\n|Afghanistan | 353| 1154.00| 2205| 5166.0| 31107|\n|Angola | 29| 700.00| 3271| 14474.0| 30067|\n|Albania | 0| 1.00| 12| 29.0| 136034|\n|Andorra | 0| 0.00| 1| 2.0| 5|\n|United Arab Emirates | 22| 89.75| 320| 1128.0| 2913|\n|Argentina | 0| 0.00| 17| 4591.5| 42093|\n:::\n:::\n\n\n* It combined our data frames together! Let's take a look at the `rbind` and\n`do.call()` help pages to see what happened.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?rbind\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nCombine R Objects by Rows or Columns\n\nDescription:\n\n Take a sequence of vector, matrix or data-frame arguments and\n combine by _c_olumns or _r_ows, respectively. 
These are generic\n functions with methods for other R classes.\n\nUsage:\n\n cbind(..., deparse.level = 1)\n rbind(..., deparse.level = 1)\n ## S3 method for class 'data.frame'\n rbind(..., deparse.level = 1, make.row.names = TRUE,\n stringsAsFactors = FALSE, factor.exclude = TRUE)\n \nArguments:\n\n ...: (generalized) vectors or matrices. These can be given as\n named arguments. Other R objects may be coerced as\n appropriate, or S4 methods may be used: see sections\n 'Details' and 'Value'. (For the '\"data.frame\"' method of\n 'cbind' these can be further arguments to 'data.frame' such\n as 'stringsAsFactors'.)\n\ndeparse.level: integer controlling the construction of labels in the\n case of non-matrix-like arguments (for the default method):\n 'deparse.level = 0' constructs no labels;\n the default 'deparse.level = 1' typically and 'deparse.level\n = 2' always construct labels from the argument names, see the\n 'Value' section below.\n\nmake.row.names: (only for data frame method:) logical indicating if\n unique and valid 'row.names' should be constructed from the\n arguments.\n\nstringsAsFactors: logical, passed to 'as.data.frame'; only has an\n effect when the '...' arguments contain a (non-'data.frame')\n 'character'.\n\nfactor.exclude: if the data frames contain factors, the default 'TRUE'\n ensures that 'NA' levels of factors are kept, see PR#17562\n and the 'Data frame methods'. In R versions up to 3.6.x,\n 'factor.exclude = NA' has been implicitly hardcoded (R <=\n 3.6.0) or the default (R = 3.6.x, x >= 1).\n\nDetails:\n\n The functions 'cbind' and 'rbind' are S3 generic, with methods for\n data frames. The data frame method will be used if at least one\n argument is a data frame and the rest are vectors or matrices.\n There can be other methods; in particular, there is one for time\n series objects. See the section on 'Dispatch' for how the method\n to be used is selected. If some of the arguments are of an S4\n class, i.e., 'isS4(.)' is true, S4 methods are sought also, and\n the hidden 'cbind' / 'rbind' functions from package 'methods'\n maybe called, which in turn build on 'cbind2' or 'rbind2',\n respectively. In that case, 'deparse.level' is obeyed, similarly\n to the default method.\n\n In the default method, all the vectors/matrices must be atomic\n (see 'vector') or lists. Expressions are not allowed. Language\n objects (such as formulae and calls) and pairlists will be coerced\n to lists: other objects (such as names and external pointers) will\n be included as elements in a list result. Any classes the inputs\n might have are discarded (in particular, factors are replaced by\n their internal codes).\n\n If there are several matrix arguments, they must all have the same\n number of columns (or rows) and this will be the number of columns\n (or rows) of the result. If all the arguments are vectors, the\n number of columns (rows) in the result is equal to the length of\n the longest vector. Values in shorter arguments are recycled to\n achieve this length (with a 'warning' if they are recycled only\n _fractionally_).\n\n When the arguments consist of a mix of matrices and vectors the\n number of columns (rows) of the result is determined by the number\n of columns (rows) of the matrix arguments. Any vectors have their\n values recycled or subsetted to achieve this length.\n\n For 'cbind' ('rbind'), vectors of zero length (including 'NULL')\n are ignored unless the result would have zero rows (columns), for\n S compatibility. 
(Zero-extent matrices do not occur in S3 and are\n not ignored in R.)\n\n Matrices are restricted to less than 2^31 rows and columns even on\n 64-bit systems. So input vectors have the same length\n restriction: as from R 3.2.0 input matrices with more elements\n (but meeting the row and column restrictions) are allowed.\n\nValue:\n\n For the default method, a matrix combining the '...' arguments\n column-wise or row-wise. (Exception: if there are no inputs or\n all the inputs are 'NULL', the value is 'NULL'.)\n\n The type of a matrix result determined from the highest type of\n any of the inputs in the hierarchy raw < logical < integer <\n double < complex < character < list .\n\n For 'cbind' ('rbind') the column (row) names are taken from the\n 'colnames' ('rownames') of the arguments if these are matrix-like.\n Otherwise from the names of the arguments or where those are not\n supplied and 'deparse.level > 0', by deparsing the expressions\n given, for 'deparse.level = 1' only if that gives a sensible name\n (a 'symbol', see 'is.symbol').\n\n For 'cbind' row names are taken from the first argument with\n appropriate names: rownames for a matrix, or names for a vector of\n length the number of rows of the result.\n\n For 'rbind' column names are taken from the first argument with\n appropriate names: colnames for a matrix, or names for a vector of\n length the number of columns of the result.\n\nData frame methods:\n\n The 'cbind' data frame method is just a wrapper for\n 'data.frame(..., check.names = FALSE)'. This means that it will\n split matrix columns in data frame arguments, and convert\n character columns to factors unless 'stringsAsFactors = FALSE' is\n specified.\n\n The 'rbind' data frame method first drops all zero-column and\n zero-row arguments. (If that leaves none, it returns the first\n argument with columns otherwise a zero-column zero-row data\n frame.) It then takes the classes of the columns from the first\n data frame, and matches columns by name (rather than by position).\n Factors have their levels expanded as necessary (in the order of\n the levels of the level sets of the factors encountered) and the\n result is an ordered factor if and only if all the components were\n ordered factors. Old-style categories (integer vectors with\n levels) are promoted to factors.\n\n Note that for result column 'j', 'factor(., exclude = X(j))' is\n applied, where\n\n X(j) := if(isTRUE(factor.exclude)) {\n if(!NA.lev[j]) NA # else NULL\n } else factor.exclude\n \n where 'NA.lev[j]' is true iff any contributing data frame has had\n a 'factor' in column 'j' with an explicit 'NA' level.\n\nDispatch:\n\n The method dispatching is _not_ done via 'UseMethod()', but by\n C-internal dispatching. Therefore there is no need for, e.g.,\n 'rbind.default'.\n\n The dispatch algorithm is described in the source file\n ('.../src/main/bind.c') as\n\n 1. For each argument we get the list of possible class\n memberships from the class attribute.\n\n 2. We inspect each class in turn to see if there is an\n applicable method.\n\n 3. If we find a method, we use it. Otherwise, if there was an\n S4 object among the arguments, we try S4 dispatch; otherwise,\n we use the default code.\n\n If you want to combine other objects with data frames, it may be\n necessary to coerce them to data frames first. 
(Note that this\n algorithm can result in calling the data frame method if all the\n arguments are either data frames or vectors, and this will result\n in the coercion of character vectors to factors.)\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'c' to combine vectors (and lists) as vectors, 'data.frame' to\n combine vectors and matrices as a data frame.\n\nExamples:\n\n m <- cbind(1, 1:7) # the '1' (= shorter vector) is recycled\n m\n m <- cbind(m, 8:14)[, c(1, 3, 2)] # insert a column\n m\n cbind(1:7, diag(3)) # vector is subset -> warning\n \n cbind(0, rbind(1, 1:3))\n cbind(I = 0, X = rbind(a = 1, b = 1:3)) # use some names\n xx <- data.frame(I = rep(0,2))\n cbind(xx, X = rbind(a = 1, b = 1:3)) # named differently\n \n cbind(0, matrix(1, nrow = 0, ncol = 4)) #> Warning (making sense)\n dim(cbind(0, matrix(1, nrow = 2, ncol = 0))) #-> 2 x 1\n \n ## deparse.level\n dd <- 10\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 0) # middle 2 rownames\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 1) # 3 rownames (default)\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 2) # 4 rownames\n \n ## cheap row names:\n b0 <- gl(3,4, labels=letters[1:3])\n bf <- setNames(b0, paste0(\"o\", seq_along(b0)))\n df <- data.frame(a = 1, B = b0, f = gl(4,3))\n df. <- data.frame(a = 1, B = bf, f = gl(4,3))\n new <- data.frame(a = 8, B =\"B\", f = \"1\")\n (df1 <- rbind(df , new))\n (df.1 <- rbind(df., new))\n stopifnot(identical(df1, rbind(df, new, make.row.names=FALSE)),\n identical(df1, rbind(df., new, make.row.names=FALSE)))\n```\n\n\n:::\n:::\n\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?do.call\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nExecute a Function Call\n\nDescription:\n\n 'do.call' constructs and executes a function call from a name or a\n function and a list of arguments to be passed to it.\n\nUsage:\n\n do.call(what, args, quote = FALSE, envir = parent.frame())\n \nArguments:\n\n what: either a function or a non-empty character string naming the\n function to be called.\n\n args: a _list_ of arguments to the function call. The 'names'\n attribute of 'args' gives the argument names.\n\n quote: a logical value indicating whether to quote the arguments.\n\n envir: an environment within which to evaluate the call. This will\n be most useful if 'what' is a character string and the\n arguments are symbols or quoted expressions.\n\nDetails:\n\n If 'quote' is 'FALSE', the default, then the arguments are\n evaluated (in the calling environment, not in 'envir'). If\n 'quote' is 'TRUE' then each argument is quoted (see 'quote') so\n that the effect of argument evaluation is to remove the quotes -\n leaving the original arguments unevaluated when the call is\n constructed.\n\n The behavior of some functions, such as 'substitute', will not be\n the same for functions evaluated using 'do.call' as if they were\n evaluated from the interpreter. The precise semantics are\n currently undefined and subject to change.\n\nValue:\n\n The result of the (evaluated) function call.\n\nWarning:\n\n This should not be used to attempt to evade restrictions on the\n use of '.Internal' and other non-API calls.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'call' which creates an unevaluated call.\n\nExamples:\n\n do.call(\"complex\", list(imaginary = 1:3))\n \n ## if we already have a list (e.g., a data frame)\n ## we need c() to add further arguments\n tmp <- expand.grid(letters[1:2], 1:3, c(\"+\", \"-\"))\n do.call(\"paste\", c(tmp, sep = \"\"))\n \n do.call(paste, list(as.name(\"A\"), as.name(\"B\")), quote = TRUE)\n \n ## examples of where objects will be found.\n A <- 2\n f <- function(x) print(x^2)\n env <- new.env()\n assign(\"A\", 10, envir = env)\n assign(\"f\", f, envir = env)\n f <- function(x) print(x)\n f(A) # 2\n do.call(\"f\", list(A)) # 2\n do.call(\"f\", list(A), envir = env) # 4\n do.call( f, list(A), envir = env) # 2\n do.call(\"f\", list(quote(A)), envir = env) # 100\n do.call( f, list(quote(A)), envir = env) # 10\n do.call(\"f\", list(as.name(\"A\")), envir = env) # 100\n \n eval(call(\"f\", A)) # 2\n eval(call(\"f\", quote(A))) # 2\n eval(call(\"f\", A), envir = env) # 4\n eval(call(\"f\", quote(A)), envir = env) # 100\n```\n\n\n:::\n:::\n\n\n. . .\n\n* OK, so basically what happened is that\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndo.call(rbind, list)\n```\n:::\n\n\n* Gets transformed into\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])\n```\n:::\n\n\n* That's vectorization magic!\n\n## You try it! (if we have time) {.smaller}\n\n* Use the code you wrote before the get the incidence per 1000 people on the\nentire measles data set (add a column for incidence to the full data).\n* Use the code `plot(NULL, NULL, ...)` to make a blank plot. You will need to\nset the `xlim` and `ylim` arguments to sensible values, and specify the axis\ntitles as \"Year\" and \"Incidence per 1000 people\".\n* Using a `for` loop and the `lines()` function, make a plot that shows all of\nthe incidence curves over time, overlapping on the plot.\n* HINT: use `col = adjustcolor(black, alpha.f = 0.25)` to make the curves\npartially transparent, so you can see the overlap.\n* BONUS PROBLEM: using the function `cumsum()`, make a plot of the cumulative\ncases (not standardized) over time for all of the countries. 
(Dealing with\nthe NA's here is tricky!!)\n\n## Main problem solution\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000\ncountries <- unique(meas$country)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(0, 50),\n\txlab = \"Year\",\n\tylab = \"Incidence per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cases_per_thousand,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n```\n:::\n\n\n## Main problem solution\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module12-Iteration_files/figure-revealjs/unnamed-chunk-32-1.png){width=960}\n:::\n:::\n\n\n## Bonus problem solution\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# First calculate the cumulative cases, treating NA as zeroes\ncumulative_cases <- ave(\n\tx = ifelse(is.na(meas$Cases), 0, meas$Cases),\n\tmeas$country,\n\tFUN = cumsum\n)\n\n# Now put the NAs back where they should be\nmeas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(1, 6.2e6),\n\txlab = \"Year\",\n\tylab = paste0(\"Cumulative cases since\", min(meas$time))\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cumulative_cases,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n\ntext(\n\tx = 2020,\n\ty = 6e6,\n\tlabels = \"China →\"\n)\n```\n:::\n\n\n## Bonus problem solution\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module12-Iteration_files/figure-revealjs/unnamed-chunk-34-1.png){width=960}\n:::\n:::\n\n\n## More practice on your own {.smaller}\n\n* Merge the `countries-regions.csv` data with the `measles_final.Rds` data.\nReshape the measles data so that `MCV1` and `MCV2` vaccine coverage are two\nseparate columns. Then use a loop to fit a poisson regression model for each\ncontinent where `Cases` is the outcome, and `MCV1 coverage` and `MCV2 coverage`\nare the predictors. Discuss your findings, and try adding an interation term.\n* Assess the impact of `age_months` as a confounder in the Diphtheria serology\ndata. First, write code to transform `age_months` into age ranges for each\nyear. Then, using a loop, calculate the crude odds ratio for the effect of\nvaccination on infection for each of the age ranges. How does the odds ratio\nchange as age increases? 
Can you formalize this analysis by fitting a logistic\nregression model with `age_months` and vaccination as predictors?\n\n\n", + "supporting": [ + "Module12-Iteration_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module12-Iteration/figure-revealjs/unnamed-chunk-32-1.png b/_freeze/modules/Module12-Iteration/figure-revealjs/unnamed-chunk-32-1.png new file mode 100644 index 0000000..4e5792a Binary files /dev/null and b/_freeze/modules/Module12-Iteration/figure-revealjs/unnamed-chunk-32-1.png differ diff --git a/_freeze/modules/Module12-Iteration/figure-revealjs/unnamed-chunk-34-1.png b/_freeze/modules/Module12-Iteration/figure-revealjs/unnamed-chunk-34-1.png new file mode 100644 index 0000000..1a7e5a5 Binary files /dev/null and b/_freeze/modules/Module12-Iteration/figure-revealjs/unnamed-chunk-34-1.png differ diff --git a/_freeze/modules/Module13-Functions/execute-results/html.json b/_freeze/modules/Module13-Functions/execute-results/html.json new file mode 100644 index 0000000..0bdcaec --- /dev/null +++ b/_freeze/modules/Module13-Functions/execute-results/html.json @@ -0,0 +1,19 @@ +{ + "hash": "24fe9f90add9d98a5efd4fc90b80f020", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Module 13: Functions\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 13, you should be able to:\n\n- Create your own function\n\n## Writing your own functions\n\nSo far, we have seen many functions (e.g., `c()`, `class()`, `mean()`, `tranform()`, `aggregate()` and many more\n\n**why create your own function?**\n\n1. to cut down on repetitive coding\n2. to organize code into manageable chunks\n3. to avoid running code unintentionally\n4. to use names that make sense to you\n\n## Writing your own functions\n\nHere we will write a function that multiplies some number (x) by 2:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntimes_2 <- function(x) x*2\n```\n:::\n\nWhen you run the line of code above, you make it ready to use (no output yet!)\nLet's test it!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntimes_2(x=10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 20\n```\n\n\n:::\n:::\n\n\n## Writing your own functions: { }\n\nAdding the curly brackets - `{ }` - allows you to use functions spanning multiple lines:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntimes_3 <- function(x) {\n x*3\n}\ntimes_3(x=10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 30\n```\n\n\n:::\n:::\n\n\n## Writing your own functions: `return`\n\nIf we want something specific for the function's output, we use `return()`. Note, if you want to return more than one object, you need to put it into a list using the `list()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntimes_4 <- function(x) {\n output <- x * 4\n return(list(output, x))\n}\ntimes_4(x = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 40\n\n[[2]]\n[1] 10\n```\n\n\n:::\n:::\n\n\n\n## Function Syntax\n\nThis is a brief introduction. The syntax is:\n\n```\nfunctionName = function(inputs) {\n< function body >\nreturn(list(value1, value2))\n}\n```\n\nNote to create the function for use you need to \n\n1. Code/type the function\n2. 
Execute/run the lines of code\n\nOnly then will the function be available in the Environment pane and ready to use.\n\n## Writing your own functions: multiple arguments\n\nFunctions can take multiple arguments / inputs. Here the function has two arguments `x` and `y`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntimes_2_plus_y <- function(x, y) {\n out <- x * 2 + y\n return(out)\n}\ntimes_2_plus_y(x = 10, y = 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23\n```\n\n\n:::\n:::\n\n\n## Writing your own functions: arugment defaults\n\nFunctions can have default arguments. This lets us use the function without specifying the arguments\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntimes_2_plus_y <- function(x = 10, y = 3) {\n out <- x * 2 + y\n return(out)\n}\ntimes_2_plus_y()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 23\n```\n\n\n:::\n:::\n\n\nWe got an answer b/c we put defaults into the function arguments.\n\n## Writing a simple function\n\nLet's write a function, `sqdif`, that:\n\n1. takes two numbers `x` and `y` with default values of 2 and 3.\n2. takes the difference\n3. squares this difference\n4. then returns the final value\n\n```\nfunctionName = function(inputs) {\n< function body >\nreturn(list(value1, value2))\n}\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsqdif <- function(x=2,y=3){\n output <- (x-y)^2\n return(output)\n}\n\nsqdif()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1\n```\n\n\n:::\n\n```{.r .cell-code}\nsqdif(x=10,y=5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 25\n```\n\n\n:::\n\n```{.r .cell-code}\nsqdif(10,5)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 25\n```\n\n\n:::\n:::\n\n\n## Writing your own functions: characters\n\nFunctions can have any kind of data type input. For example, here is a function with characters:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nloud <- function(word) {\n output <- rep(toupper(word), 5)\n return(output)\n}\nloud(word = \"hooray!\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"HOORAY!\" \"HOORAY!\" \"HOORAY!\" \"HOORAY!\" \"HOORAY!\"\n```\n\n\n:::\n:::\n\n\n\n## Using functions with `aggregate()`\n\nYou can apply functions easily to groups with `aggregate()`. As a reminder, we learned `aggregate()` yesterday in Module 9. We will take a quick look at the data.\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum age_group seropos\n1 5772 0.31768953 2 Female Non slum young 0\n2 8095 3.43682311 4 Female Non slum young 0\n3 9784 0.30000000 4 Male Non slum young 0\n4 9338 143.23630140 4 Male Non slum young 1\n5 6369 0.44765343 1 Male Non slum young 0\n6 6885 0.02527076 4 Male Non slum young 0\n```\n\n\n:::\n:::\n\n\nThen, we used the following code to estimate the standard deviation of `IgG_concentration` for each unique combination of `age_group` and `slum` variables. \n\n\n::: {.cell}\n\n```{.r .cell-code}\naggregate(\n\tIgG_concentration ~ age_group + slum,\n\tdata = df,\n\tFUN = sd # standard deviation\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n age_group slum IgG_concentration\n1 young Mixed 174.89797\n2 middle Mixed 162.08188\n3 old Mixed 150.07063\n4 young Non slum 114.68422\n5 middle Non slum 177.62113\n6 old Non slum 141.22330\n7 young Slum 61.85705\n8 middle Slum 202.42018\n9 old Slum 74.75217\n```\n\n\n:::\n:::\n\n\n\n## Using functions with `aggregate()`\n\nBut, lets say we want to do something different. 
Rather than taking the standard deviation using a function that already exists (`sd()`), let's take the natural log of `IgG_concentration` and then get the mean. To do this, we can create our own function and then plug it into the `FUN` argument.\n\nStep 1 - code/type our own function\n\n::: {.cell}\n\n```{.r .cell-code}\nlog_mean_function <- function(x){\n\toutput <- mean(log(x))\n\treturn(output)\n}\n```\n:::\n\n\n
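Aside: `mean(log(x))` is the log of the *geometric* mean of `x`, which is often a more natural summary for right-skewed values like antibody concentrations. A quick illustration with made-up numbers (not from the dataset):\n\n```\nmean(log(c(2, 8)))  # 1.386294 = log(4); 4 is the geometric mean of 2 and 8\nlog(mean(c(2, 8)))  # 1.609438 = log(5), the log of the arithmetic mean -- not the same\n```\n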
\n\nStep 2 - execute our function (i.e., run the lines of code); you will now be able to see it in your Environment pane.\n\n\n::: {.cell layout-align=\"left\"}\n::: {.cell-output-display}\n![](images/log_mean_function.png){fig-align='left' width=100%}\n:::\n:::\n\n\n
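As a quick sanity check (optional), you can call the new function directly on a small made-up vector before handing it to `aggregate()`:\n\n```\nlog_mean_function(c(1, 10, 100))  # 2.302585, i.e. log(10)\n```\n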
\n\nStep 3 - use the function by plugging it in the `aggregate()` function in order to complete our task\n\n::: {.cell}\n\n```{.r .cell-code}\naggregate(\n\tIgG_concentration ~ age_group + slum,\n\tdata = df,\n\tFUN = log_mean_function\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n age_group slum IgG_concentration\n1 young Mixed 0.50082888\n2 middle Mixed 2.85916401\n3 old Mixed 3.13971163\n4 young Non slum 0.14060433\n5 middle Non slum 2.30717077\n6 old Non slum 3.77021233\n7 young Slum -0.04611508\n8 middle Slum 2.46490973\n9 old Slum 3.52357989\n```\n\n\n:::\n:::\n\n\n\n## Example from Module 12\n\nIn the last Module 12, we used loops to loop through every country in the dataset, and get the median, first and third quartiles, and range for each country and stored those summary statistics in a data frame.\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n\t\n\t# Save the results to our container\n\tres[[i]] <- country_summary\n}\n```\n:::\n\n\n## Function instead of Loop\n\nHere we are going to set up a function that takes our data frame and outputs the median, first and third quartiles, and range of measles cases for a specified country.\n\nStep 1 - code/type our own function. We specify two arguments, the first argument is our data frame and the second is one country's iso3 code. 
Notice, I included common documentation for \n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nget_country_stats <- function(df, iso3_code){\n\t\n\tcountry_data <- subset(df, iso3c == iso3_code)\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\tcountry_name <- unique(country_data$country)\n\t\n\tcountry_summary <- data.frame(\n\t\tcountry = country_name,\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n\t\n\treturn(country_summary)\n}\n```\n:::\n\n\nStep 2 - execute our function (i.e., run the lines of code), and you would not be able to see it in you Environment pane.\n\n\n::: {.cell layout-align=\"left\"}\n::: {.cell-output-display}\n![](images/get_country_stats_function.png){fig-align='left' width=100%}\n:::\n:::\n\n\nStep 3 - use the function by pulling out stats for India and Pakistan\n\n::: {.cell}\n\n```{.r .cell-code}\nget_country_stats(df=meas, iso3_code=\"IND\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n country min Q1 median Q3 max\n1 India 3305 30813 47072 74828.5 252940\n```\n\n\n:::\n\n```{.r .cell-code}\nget_country_stats(df=meas, iso3_code=\"PAK\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n country min Q1 median Q3 max\n1 Pakistan 386 2065 3903 13860.5 55543\n```\n\n\n:::\n:::\n\n\n\n## Summary\n\n- Simple functions take the form:\n```\nfunctionName = function(arguments) {\n\t< function body >\n\treturn(list(outputs))\n}\n```\n- We can specify arguments defaults when you create the function\n\n\n## Mini Exercise\n\nCreate your own function that saves a line plot of a time series of measles cases for a specified country.\n\nStep 1. Determine your arguments, which are the same as the last example\n\nStep 2. Begin your function by subsetting the data to include only the country specified in the arguments (i.e, `country_data`), this is the same as the first line of code in the last example.\n\nStep 3. Return to Module 10 to remember how to use the `plot()` function. Hint you will need to specify the argument `type=\"l\" to make it a line plot. \n\nStep 4. Return to your function and add code to create a new plot using the `country_data` object. Note you will need to use the `png()` function before the `plot()` function and end it with `dev.off()` in order to save the file.\n\nStep 5. 
Use the function to generate a plot for India and Pakistan\n\n# Mini Exercise Answer\n\n\n::: {.cell}\n\n```{.r .cell-code}\nget_time_series_plot <- function(df, iso3_code){\n\t\n\tcountry_data <- subset(df, iso3c == iso3_code)\n\t\n\tpng(filename=paste0(\"output/time_series_\", iso3_code, \".png\"))\n\tplot(country_data$time, country_data$Cases, type=\"l\", xlab=\"year\", ylab=\"Measles Cases\")\n\tdev.off()\n\t\n}\n\nget_time_series_plot(df=meas, iso3_code=\"IND\")\nget_time_series_plot(df=meas, iso3_code=\"PAK\")\n```\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/site_libs/revealjs/dist/theme/quarto.css b/_freeze/site_libs/revealjs/dist/theme/quarto.css index 58f87c5..f3bfd0c 100644 --- a/_freeze/site_libs/revealjs/dist/theme/quarto.css +++ b/_freeze/site_libs/revealjs/dist/theme/quarto.css @@ -1,8 +1,5 @@ -@import"./fonts/source-sans-pro/source-sans-pro.css";:root{--r-background-color: #fff;--r-main-font: Source Sans Pro, Helvetica, sans-serif;--r-main-font-size: 40px;--r-main-color: #222;--r-block-margin: 12px;--r-heading-margin: 0 0 12px 0;--r-heading-font: Source Sans Pro, Helvetica, sans-serif;--r-heading-color: #222;--r-heading-line-height: 1.2;--r-heading-letter-spacing: normal;--r-heading-text-transform: none;--r-heading-text-shadow: none;--r-heading-font-weight: 600;--r-heading1-text-shadow: none;--r-heading1-size: 2.5em;--r-heading2-size: 1.6em;--r-heading3-size: 1.3em;--r-heading4-size: 1em;--r-code-font: SFMono-Regular, Menlo, Monaco, Consolas, Liberation Mono, Courier New, monospace;--r-link-color: #2a76dd;--r-link-color-dark: #1a53a1;--r-link-color-hover: #5692e4;--r-selection-background-color: #98bdef;--r-selection-color: #fff}.reveal-viewport{background:#fff;background-color:var(--r-background-color)}.reveal{font-family:var(--r-main-font);font-size:var(--r-main-font-size);font-weight:normal;color:var(--r-main-color)}.reveal ::selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal ::-moz-selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal .slides section,.reveal .slides section>section{line-height:1.3;font-weight:inherit}.reveal h1,.reveal h2,.reveal h3,.reveal h4,.reveal h5,.reveal h6{margin:var(--r-heading-margin);color:var(--r-heading-color);font-family:var(--r-heading-font);font-weight:var(--r-heading-font-weight);line-height:var(--r-heading-line-height);letter-spacing:var(--r-heading-letter-spacing);text-transform:var(--r-heading-text-transform);text-shadow:var(--r-heading-text-shadow);word-wrap:break-word}.reveal h1{font-size:var(--r-heading1-size)}.reveal h2{font-size:var(--r-heading2-size)}.reveal h3{font-size:var(--r-heading3-size)}.reveal h4{font-size:var(--r-heading4-size)}.reveal h1{text-shadow:var(--r-heading1-text-shadow)}.reveal p{margin:var(--r-block-margin) 0;line-height:1.3}.reveal h1:last-child,.reveal h2:last-child,.reveal h3:last-child,.reveal h4:last-child,.reveal h5:last-child,.reveal h6:last-child{margin-bottom:0}.reveal img,.reveal video,.reveal iframe{max-width:95%;max-height:95%}.reveal strong,.reveal b{font-weight:bold}.reveal em{font-style:italic}.reveal ol,.reveal dl,.reveal ul{display:inline-block;text-align:left;margin:0 0 0 1em}.reveal ol{list-style-type:decimal}.reveal ul{list-style-type:disc}.reveal ul 
ul{list-style-type:square}.reveal ul ul ul{list-style-type:circle}.reveal ul ul,.reveal ul ol,.reveal ol ol,.reveal ol ul{display:block;margin-left:40px}.reveal dt{font-weight:bold}.reveal dd{margin-left:40px}.reveal blockquote{display:block;position:relative;width:70%;margin:var(--r-block-margin) auto;padding:5px;font-style:italic;background:rgba(255,255,255,.05);box-shadow:0px 0px 2px rgba(0,0,0,.2)}.reveal blockquote p:first-child,.reveal blockquote p:last-child{display:inline-block}.reveal q{font-style:italic}.reveal pre{display:block;position:relative;width:90%;margin:var(--r-block-margin) auto;text-align:left;font-size:.55em;font-family:var(--r-code-font);line-height:1.2em;word-wrap:break-word;box-shadow:0px 5px 15px rgba(0,0,0,.15)}.reveal code{font-family:var(--r-code-font);text-transform:none;tab-size:2}.reveal pre code{display:block;padding:5px;overflow:auto;max-height:400px;word-wrap:normal}.reveal .code-wrapper{white-space:normal}.reveal .code-wrapper code{white-space:pre}.reveal table{margin:auto;border-collapse:collapse;border-spacing:0}.reveal table th{font-weight:bold}.reveal table th,.reveal table td{text-align:left;padding:.2em .5em .2em .5em;border-bottom:1px solid}.reveal table th[align=center],.reveal table td[align=center]{text-align:center}.reveal table th[align=right],.reveal table td[align=right]{text-align:right}.reveal table tbody tr:last-child th,.reveal table tbody tr:last-child td{border-bottom:none}.reveal sup{vertical-align:super;font-size:smaller}.reveal sub{vertical-align:sub;font-size:smaller}.reveal small{display:inline-block;font-size:.6em;line-height:1.2em;vertical-align:top}.reveal small *{vertical-align:top}.reveal img{margin:var(--r-block-margin) 0}.reveal a{color:var(--r-link-color);text-decoration:none;transition:color .15s ease}.reveal a:hover{color:var(--r-link-color-hover);text-shadow:none;border:none}.reveal .roll span:after{color:#fff;background:var(--r-link-color-dark)}.reveal .r-frame{border:4px solid var(--r-main-color);box-shadow:0 0 10px rgba(0,0,0,.15)}.reveal a .r-frame{transition:all .15s linear}.reveal a:hover .r-frame{border-color:var(--r-link-color);box-shadow:0 0 20px rgba(0,0,0,.55)}.reveal .controls{color:var(--r-link-color)}.reveal .progress{background:rgba(0,0,0,.2);color:var(--r-link-color)}@media print{.backgrounds{background-color:var(--r-background-color)}}.top-right{position:absolute;top:1em;right:1em}.visually-hidden{border:0;clip:rect(0 0 0 0);height:auto;margin:0;overflow:hidden;padding:0;position:absolute;width:1px;white-space:nowrap}.hidden{display:none !important}.zindex-bottom{z-index:-1 !important}figure.figure{display:block}.quarto-layout-panel{margin-bottom:1em}.quarto-layout-panel>figure{width:100%}.quarto-layout-panel>figure>figcaption,.quarto-layout-panel>.panel-caption{margin-top:10pt}.quarto-layout-panel>.table-caption{margin-top:0px}.table-caption p{margin-bottom:.5em}.quarto-layout-row{display:flex;flex-direction:row;align-items:flex-start}.quarto-layout-valign-top{align-items:flex-start}.quarto-layout-valign-bottom{align-items:flex-end}.quarto-layout-valign-center{align-items:center}.quarto-layout-cell{position:relative;margin-right:20px}.quarto-layout-cell:last-child{margin-right:0}.quarto-layout-cell figure,.quarto-layout-cell>p{margin:.2em}.quarto-layout-cell img{max-width:100%}.quarto-layout-cell .html-widget{width:100% !important}.quarto-layout-cell div figure p{margin:0}.quarto-layout-cell figure{display:block;margin-inline-start:0;margin-inline-end:0}.quarto-layout-cell 
table{display:inline-table}.quarto-layout-cell-subref figcaption,figure .quarto-layout-row figure figcaption{text-align:center;font-style:italic}.quarto-figure{position:relative;margin-bottom:1em}.quarto-figure>figure{width:100%;margin-bottom:0}.quarto-figure-left>figure>p,.quarto-figure-left>figure>div{text-align:left}.quarto-figure-center>figure>p,.quarto-figure-center>figure>div{text-align:center}.quarto-figure-right>figure>p,.quarto-figure-right>figure>div{text-align:right}.quarto-figure>figure>div.cell-annotation,.quarto-figure>figure>div code{text-align:left}figure>p:empty{display:none}figure>p:first-child{margin-top:0;margin-bottom:0}figure>figcaption.quarto-float-caption-bottom{margin-bottom:.5em}figure>figcaption.quarto-float-caption-top{margin-top:.5em}div[id^=tbl-]{position:relative}.quarto-figure>.anchorjs-link{position:absolute;top:.6em;right:.5em}div[id^=tbl-]>.anchorjs-link{position:absolute;top:.7em;right:.3em}.quarto-figure:hover>.anchorjs-link,div[id^=tbl-]:hover>.anchorjs-link,h2:hover>.anchorjs-link,h3:hover>.anchorjs-link,h4:hover>.anchorjs-link,h5:hover>.anchorjs-link,h6:hover>.anchorjs-link,.reveal-anchorjs-link>.anchorjs-link{opacity:1}#title-block-header{margin-block-end:1rem;position:relative;margin-top:-1px}#title-block-header .abstract{margin-block-start:1rem}#title-block-header .abstract .abstract-title{font-weight:600}#title-block-header a{text-decoration:none}#title-block-header .author,#title-block-header .date,#title-block-header .doi{margin-block-end:.2rem}#title-block-header .quarto-title-block>div{display:flex}#title-block-header .quarto-title-block>div>h1{flex-grow:1}#title-block-header .quarto-title-block>div>button{flex-shrink:0;height:2.25rem;margin-top:0}tr.header>th>p:last-of-type{margin-bottom:0px}table,table.table{margin-top:.5rem;margin-bottom:.5rem}caption,.table-caption{padding-top:.5rem;padding-bottom:.5rem;text-align:center}figure.quarto-float-tbl figcaption.quarto-float-caption-top{margin-top:.5rem;margin-bottom:.25rem;text-align:center}figure.quarto-float-tbl figcaption.quarto-float-caption-bottom{padding-top:.25rem;margin-bottom:.5rem;text-align:center}.utterances{max-width:none;margin-left:-8px}iframe{margin-bottom:1em}details{margin-bottom:1em}details[show]{margin-bottom:0}details>summary{color:#6f6f6f}details>summary>p:only-child{display:inline}pre.sourceCode,code.sourceCode{position:relative}dd code:not(.sourceCode),p code:not(.sourceCode){white-space:pre-wrap}code{white-space:pre}@media print{code{white-space:pre-wrap}}pre>code{display:block}pre>code.sourceCode{white-space:pre}pre>code.sourceCode>span>a:first-child::before{text-decoration:none}pre.code-overflow-wrap>code.sourceCode{white-space:pre-wrap}pre.code-overflow-scroll>code.sourceCode{white-space:pre}code a:any-link{color:inherit;text-decoration:none}code a:hover{color:inherit;text-decoration:underline}ul.task-list{padding-left:1em}[data-tippy-root]{display:inline-block}.tippy-content .footnote-back{display:none}.footnote-back{margin-left:.2em}.tippy-content{overflow-x:auto}.quarto-embedded-source-code{display:none}.quarto-unresolved-ref{font-weight:600}.quarto-cover-image{max-width:35%;float:right;margin-left:30px}.cell-output-display .widget-subarea{margin-bottom:1em}.cell-output-display:not(.no-overflow-x),.knitsql-table:not(.no-overflow-x){overflow-x:auto}.panel-input{margin-bottom:1em}.panel-input>div,.panel-input>div>div{display:inline-block;vertical-align:top;padding-right:12px}.panel-input>p:last-child{margin-bottom:0}.layout-sidebar{margin-bottom:1em}.layout-sidebar 
.tab-content{border:none}.tab-content>.page-columns.active{display:grid}div.sourceCode>iframe{width:100%;height:300px;margin-bottom:-0.5em}a{text-underline-offset:3px}div.ansi-escaped-output{font-family:monospace;display:block}/*! +@import"./fonts/source-sans-pro/source-sans-pro.css";:root{--r-background-color: #fff;--r-main-font: Source Sans Pro, Helvetica, sans-serif;--r-main-font-size: 40px;--r-main-color: #222;--r-block-margin: 12px;--r-heading-margin: 0 0 12px 0;--r-heading-font: Source Sans Pro, Helvetica, sans-serif;--r-heading-color: #222;--r-heading-line-height: 1.2;--r-heading-letter-spacing: normal;--r-heading-text-transform: none;--r-heading-text-shadow: none;--r-heading-font-weight: 600;--r-heading1-text-shadow: none;--r-heading1-size: 2.5em;--r-heading2-size: 1.6em;--r-heading3-size: 1.3em;--r-heading4-size: 1em;--r-code-font: SFMono-Regular, Menlo, Monaco, Consolas, Liberation Mono, Courier New, monospace;--r-link-color: #2a76dd;--r-link-color-dark: #1a53a1;--r-link-color-hover: #5692e4;--r-selection-background-color: #98bdef;--r-selection-color: #fff}.reveal-viewport{background:#fff;background-color:var(--r-background-color)}.reveal{font-family:var(--r-main-font);font-size:var(--r-main-font-size);font-weight:normal;color:var(--r-main-color)}.reveal ::selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal ::-moz-selection{color:var(--r-selection-color);background:var(--r-selection-background-color);text-shadow:none}.reveal .slides section,.reveal .slides section>section{line-height:1.3;font-weight:inherit}.reveal h1,.reveal h2,.reveal h3,.reveal h4,.reveal h5,.reveal h6{margin:var(--r-heading-margin);color:var(--r-heading-color);font-family:var(--r-heading-font);font-weight:var(--r-heading-font-weight);line-height:var(--r-heading-line-height);letter-spacing:var(--r-heading-letter-spacing);text-transform:var(--r-heading-text-transform);text-shadow:var(--r-heading-text-shadow);word-wrap:break-word}.reveal h1{font-size:var(--r-heading1-size)}.reveal h2{font-size:var(--r-heading2-size)}.reveal h3{font-size:var(--r-heading3-size)}.reveal h4{font-size:var(--r-heading4-size)}.reveal h1{text-shadow:var(--r-heading1-text-shadow)}.reveal p{margin:var(--r-block-margin) 0;line-height:1.3}.reveal h1:last-child,.reveal h2:last-child,.reveal h3:last-child,.reveal h4:last-child,.reveal h5:last-child,.reveal h6:last-child{margin-bottom:0}.reveal img,.reveal video,.reveal iframe{max-width:95%;max-height:95%}.reveal strong,.reveal b{font-weight:bold}.reveal em{font-style:italic}.reveal ol,.reveal dl,.reveal ul{display:inline-block;text-align:left;margin:0 0 0 1em}.reveal ol{list-style-type:decimal}.reveal ul{list-style-type:disc}.reveal ul ul{list-style-type:square}.reveal ul ul ul{list-style-type:circle}.reveal ul ul,.reveal ul ol,.reveal ol ol,.reveal ol ul{display:block;margin-left:40px}.reveal dt{font-weight:bold}.reveal dd{margin-left:40px}.reveal blockquote{display:block;position:relative;width:70%;margin:var(--r-block-margin) auto;padding:5px;font-style:italic;background:rgba(255,255,255,.05);box-shadow:0px 0px 2px rgba(0,0,0,.2)}.reveal blockquote p:first-child,.reveal blockquote p:last-child{display:inline-block}.reveal q{font-style:italic}.reveal pre{display:block;position:relative;width:90%;margin:var(--r-block-margin) auto;text-align:left;font-size:.55em;font-family:var(--r-code-font);line-height:1.2em;word-wrap:break-word;box-shadow:0px 5px 15px rgba(0,0,0,.15)}.reveal 
code{font-family:var(--r-code-font);text-transform:none;tab-size:2}.reveal pre code{display:block;padding:5px;overflow:auto;max-height:400px;word-wrap:normal}.reveal .code-wrapper{white-space:normal}.reveal .code-wrapper code{white-space:pre}.reveal table{margin:auto;border-collapse:collapse;border-spacing:0}.reveal table th{font-weight:bold}.reveal table th,.reveal table td{text-align:left;padding:.2em .5em .2em .5em;border-bottom:1px solid}.reveal table th[align=center],.reveal table td[align=center]{text-align:center}.reveal table th[align=right],.reveal table td[align=right]{text-align:right}.reveal table tbody tr:last-child th,.reveal table tbody tr:last-child td{border-bottom:none}.reveal sup{vertical-align:super;font-size:smaller}.reveal sub{vertical-align:sub;font-size:smaller}.reveal small{display:inline-block;font-size:.6em;line-height:1.2em;vertical-align:top}.reveal small *{vertical-align:top}.reveal img{margin:var(--r-block-margin) 0}.reveal a{color:var(--r-link-color);text-decoration:none;transition:color .15s ease}.reveal a:hover{color:var(--r-link-color-hover);text-shadow:none;border:none}.reveal .roll span:after{color:#fff;background:var(--r-link-color-dark)}.reveal .r-frame{border:4px solid var(--r-main-color);box-shadow:0 0 10px rgba(0,0,0,.15)}.reveal a .r-frame{transition:all .15s linear}.reveal a:hover .r-frame{border-color:var(--r-link-color);box-shadow:0 0 20px rgba(0,0,0,.55)}.reveal .controls{color:var(--r-link-color)}.reveal .progress{background:rgba(0,0,0,.2);color:var(--r-link-color)}@media print{.backgrounds{background-color:var(--r-background-color)}}.top-right{position:absolute;top:1em;right:1em}.visually-hidden{border:0;clip:rect(0 0 0 0);height:auto;margin:0;overflow:hidden;padding:0;position:absolute;width:1px;white-space:nowrap}.hidden{display:none !important}.zindex-bottom{z-index:-1 !important}figure.figure{display:block}.quarto-layout-panel{margin-bottom:1em}.quarto-layout-panel>figure{width:100%}.quarto-layout-panel>figure>figcaption,.quarto-layout-panel>.panel-caption{margin-top:10pt}.quarto-layout-panel>.table-caption{margin-top:0px}.table-caption p{margin-bottom:.5em}.quarto-layout-row{display:flex;flex-direction:row;align-items:flex-start}.quarto-layout-valign-top{align-items:flex-start}.quarto-layout-valign-bottom{align-items:flex-end}.quarto-layout-valign-center{align-items:center}.quarto-layout-cell{position:relative;margin-right:20px}.quarto-layout-cell:last-child{margin-right:0}.quarto-layout-cell figure,.quarto-layout-cell>p{margin:.2em}.quarto-layout-cell img{max-width:100%}.quarto-layout-cell .html-widget{width:100% !important}.quarto-layout-cell div figure p{margin:0}.quarto-layout-cell figure{display:block;margin-inline-start:0;margin-inline-end:0}.quarto-layout-cell table{display:inline-table}.quarto-layout-cell-subref figcaption,figure .quarto-layout-row figure figcaption{text-align:center;font-style:italic}.quarto-figure{position:relative;margin-bottom:1em}.quarto-figure>figure{width:100%;margin-bottom:0}.quarto-figure-left>figure>p,.quarto-figure-left>figure>div{text-align:left}.quarto-figure-center>figure>p,.quarto-figure-center>figure>div{text-align:center}.quarto-figure-right>figure>p,.quarto-figure-right>figure>div{text-align:right}.quarto-figure>figure>div.cell-annotation,.quarto-figure>figure>div 
code{text-align:left}figure>p:empty{display:none}figure>p:first-child{margin-top:0;margin-bottom:0}figure>figcaption{margin-top:.5em}figure.quarto-float-lst>figcaption{margin-bottom:.5em}div[id^=tbl-]{position:relative}.quarto-figure>.anchorjs-link{position:absolute;top:.6em;right:.5em}div[id^=tbl-]>.anchorjs-link{position:absolute;top:.7em;right:.3em}.quarto-figure:hover>.anchorjs-link,div[id^=tbl-]:hover>.anchorjs-link,h2:hover>.anchorjs-link,h3:hover>.anchorjs-link,h4:hover>.anchorjs-link,h5:hover>.anchorjs-link,h6:hover>.anchorjs-link,.reveal-anchorjs-link>.anchorjs-link{opacity:1}#title-block-header{margin-block-end:1rem;position:relative;margin-top:-1px}#title-block-header .abstract{margin-block-start:1rem}#title-block-header .abstract .abstract-title{font-weight:600}#title-block-header a{text-decoration:none}#title-block-header .author,#title-block-header .date,#title-block-header .doi{margin-block-end:.2rem}#title-block-header .quarto-title-block>div{display:flex}#title-block-header .quarto-title-block>div>h1{flex-grow:1}#title-block-header .quarto-title-block>div>button{flex-shrink:0;height:2.25rem;margin-top:0}tr.header>th>p:last-of-type{margin-bottom:0px}table,.table{caption-side:top;margin-bottom:1.5rem}figure.quarto-float-tbl figcaption,caption,.table-caption{padding-top:.5rem;padding-bottom:.5rem;text-align:center}.utterances{max-width:none;margin-left:-8px}iframe{margin-bottom:1em}details{margin-bottom:1em}details[show]{margin-bottom:0}details>summary{color:#6f6f6f}details>summary>p:only-child{display:inline}pre.sourceCode,code.sourceCode{position:relative}p code:not(.sourceCode){white-space:pre-wrap}code{white-space:pre}@media print{code{white-space:pre-wrap}}pre>code{display:block}pre>code.sourceCode{white-space:pre}pre>code.sourceCode>span>a:first-child::before{text-decoration:none}pre.code-overflow-wrap>code.sourceCode{white-space:pre-wrap}pre.code-overflow-scroll>code.sourceCode{white-space:pre}code a:any-link{color:inherit;text-decoration:none}code a:hover{color:inherit;text-decoration:underline}ul.task-list{padding-left:1em}[data-tippy-root]{display:inline-block}.tippy-content .footnote-back{display:none}.tippy-content{overflow-x:scroll}.quarto-embedded-source-code{display:none}.quarto-unresolved-ref{font-weight:600}.quarto-cover-image{max-width:35%;float:right;margin-left:30px}.cell-output-display .widget-subarea{margin-bottom:1em}.cell-output-display:not(.no-overflow-x),.knitsql-table:not(.no-overflow-x){overflow-x:auto}.panel-input{margin-bottom:1em}.panel-input>div,.panel-input>div>div{display:inline-block;vertical-align:top;padding-right:12px}.panel-input>p:last-child{margin-bottom:0}.layout-sidebar{margin-bottom:1em}.layout-sidebar .tab-content{border:none}.tab-content>.page-columns.active{display:grid}div.sourceCode>iframe{width:100%;height:300px;margin-bottom:-0.5em}a{text-underline-offset:3px}div.ansi-escaped-output{font-family:monospace;display:block}/*! 
* * ansi colors from IPython notebook's * -* we also add `bright-[color]-` synonyms for the `-[color]-intense` classes since -* that seems to be what ansi_up emits -* -*/.ansi-black-fg{color:#3e424d}.ansi-black-bg{background-color:#3e424d}.ansi-black-intense-black,.ansi-bright-black-fg{color:#282c36}.ansi-black-intense-black,.ansi-bright-black-bg{background-color:#282c36}.ansi-red-fg{color:#e75c58}.ansi-red-bg{background-color:#e75c58}.ansi-red-intense-red,.ansi-bright-red-fg{color:#b22b31}.ansi-red-intense-red,.ansi-bright-red-bg{background-color:#b22b31}.ansi-green-fg{color:#00a250}.ansi-green-bg{background-color:#00a250}.ansi-green-intense-green,.ansi-bright-green-fg{color:#007427}.ansi-green-intense-green,.ansi-bright-green-bg{background-color:#007427}.ansi-yellow-fg{color:#ddb62b}.ansi-yellow-bg{background-color:#ddb62b}.ansi-yellow-intense-yellow,.ansi-bright-yellow-fg{color:#b27d12}.ansi-yellow-intense-yellow,.ansi-bright-yellow-bg{background-color:#b27d12}.ansi-blue-fg{color:#208ffb}.ansi-blue-bg{background-color:#208ffb}.ansi-blue-intense-blue,.ansi-bright-blue-fg{color:#0065ca}.ansi-blue-intense-blue,.ansi-bright-blue-bg{background-color:#0065ca}.ansi-magenta-fg{color:#d160c4}.ansi-magenta-bg{background-color:#d160c4}.ansi-magenta-intense-magenta,.ansi-bright-magenta-fg{color:#a03196}.ansi-magenta-intense-magenta,.ansi-bright-magenta-bg{background-color:#a03196}.ansi-cyan-fg{color:#60c6c8}.ansi-cyan-bg{background-color:#60c6c8}.ansi-cyan-intense-cyan,.ansi-bright-cyan-fg{color:#258f8f}.ansi-cyan-intense-cyan,.ansi-bright-cyan-bg{background-color:#258f8f}.ansi-white-fg{color:#c5c1b4}.ansi-white-bg{background-color:#c5c1b4}.ansi-white-intense-white,.ansi-bright-white-fg{color:#a1a6b2}.ansi-white-intense-white,.ansi-bright-white-bg{background-color:#a1a6b2}.ansi-default-inverse-fg{color:#fff}.ansi-default-inverse-bg{background-color:#000}.ansi-bold{font-weight:bold}.ansi-underline{text-decoration:underline}:root{--quarto-body-bg: #fff;--quarto-body-color: #222;--quarto-text-muted: #6f6f6f;--quarto-border-color: #bbbbbb;--quarto-border-width: 1px;--quarto-border-radius: 4px}table.gt_table{color:var(--quarto-body-color);font-size:1em;width:100%;background-color:rgba(0,0,0,0);border-top-width:inherit;border-bottom-width:inherit;border-color:var(--quarto-border-color)}table.gt_table th.gt_column_spanner_outer{color:var(--quarto-body-color);background-color:rgba(0,0,0,0);border-top-width:inherit;border-bottom-width:inherit;border-color:var(--quarto-border-color)}table.gt_table th.gt_col_heading{color:var(--quarto-body-color);font-weight:bold;background-color:rgba(0,0,0,0)}table.gt_table thead.gt_col_headings{border-bottom:1px solid currentColor;border-top-width:inherit;border-top-color:var(--quarto-border-color)}table.gt_table thead.gt_col_headings:not(:first-child){border-top-width:1px;border-top-color:var(--quarto-border-color)}table.gt_table td.gt_row{border-bottom-width:1px;border-bottom-color:var(--quarto-border-color);border-top-width:0px}table.gt_table tbody.gt_table_body{border-top-width:1px;border-bottom-width:1px;border-bottom-color:var(--quarto-border-color);border-top-color:currentColor}div.columns{display:initial;gap:initial}div.column{display:inline-block;overflow-x:initial;vertical-align:top;width:50%}.code-annotation-tip-content{word-wrap:break-word}.code-annotation-container-hidden{display:none !important}dl.code-annotation-container-grid{display:grid;grid-template-columns:min-content auto}dl.code-annotation-container-grid 
diff --git a/_quarto.yml b/_quarto.yml
index a058234..1cbeed8 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -52,7 +52,11 @@ website:
           target: _blank
       - section: "Day 3"
         contents:
-          - href: modules/Module13-Iteration.qmd
+          - href: modules/Module11-Rmarkdown.qmd
+            target: _blank
+          - href: modules/Module12-Iteration.qmd
+            target: _blank
+          - href: modules/Module13-Functions.qmd
             target: _blank
   repo-url: https://github.com/UGA-IDD/SISMID-2024
   reader-mode: true
diff --git a/docs/index.html b/docs/index.html
index 906407f..fc2f682 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -2,14 +2,14 @@
-Welcome – SISMID Module NUMBER Materials (2025)
+SISMID Module NUMBER Materials (2025) - Welcome