diff --git a/_freeze/modules/Module06-DataSubset/execute-results/html.json b/_freeze/modules/Module06-DataSubset/execute-results/html.json index 10a4e80..a2e3d95 100644 --- a/_freeze/modules/Module06-DataSubset/execute-results/html.json +++ b/_freeze/modules/Module06-DataSubset/execute-results/html.json @@ -1,9 +1,11 @@ { - "hash": "e4b2e759b96a7dee0bfa94359eee224d", + "hash": "0bbf67d7985cff5d5614734a94dd46bb", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 6: Get to Know Your Data and Subsetting\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n#execute: \n# echo: true\n---\n\n\n\n## Learning Objectives\n\nAfter module 6, you should be able to...\n\n- Use basic functions to get to know you data\n- Use three indexing approaches\n- Rely on indexing to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Describe what logical operators are and how to use them\n- Use on the `subset()` function to subset data\n\n\n## Getting to know our data\n\nThe `dim()`, `nrow()`, and `ncol()` functions are good options to check the dimensions of your data before moving forward. \n\nLet's first read in the data from the previous module.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(df) # rows, columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651 5\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df) # number of rows\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651\n```\n\n\n:::\n\n```{.r .cell-code}\nncol(df) # number of columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 5\n```\n\n\n:::\n:::\n\n\n\n## Quick summary of data\n\nThe `colnames()`, `str()` and `summary()`functions from Base R are great functions to assess the data type and some summary statistics. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n\n```{.r .cell-code}\nstr(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t651 obs. of 5 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.318 3.437 0.3 143.236 0.448 ...\n $ age : int 2 4 4 4 1 4 4 NA 4 2 ...\n $ gender : chr \"Female\" \"Female\" \"Male\" \"Male\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n```\n\n\n:::\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender \n Min. :5006 Min. : 0.0054 Min. : 1.000 Length:651 \n 1st Qu.:6306 1st Qu.: 0.3000 1st Qu.: 3.000 Class :character \n Median :7495 Median : 1.6658 Median : 6.000 Mode :character \n Mean :7492 Mean : 87.3683 Mean : 6.606 \n 3rd Qu.:8749 3rd Qu.:141.4405 3rd Qu.:10.000 \n Max. :9982 Max. :916.4179 Max. :15.000 \n NA's :10 NA's :9 \n slum \n Length:651 \n Class :character \n Mode :character \n \n \n \n \n```\n\n\n:::\n:::\n\n\n\nNote, if you have a very large dataset with 15+ variables, `summary()` is not so efficient. \n\n## Description of data\n\nThis is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. 
We will use this dataset for modules throughout the Workshop.\n\n## View the data as a whole dataframe\n\nThe `View()` function, one of the few Base R functions with a capital letter, and can be used to open a new tab in the Console and view the data as you would in excel.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nView(df)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ViewTab.png){width=100%}\n:::\n:::\n\n\n\n## View the data as a whole dataframe\n\nYou can also open a new tab of the data by clicking on the data icon beside the object in the Environment pane\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/View.png){width=90%}\n:::\n:::\n\n\n\nYou can also hold down `Cmd` or `CTRL` and click on the name of a data frame in your code.\n\n## Indexing\n\nR contains several operators which allow access to individual elements or subsets through indexing. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: `[`, `[[` and `$`. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[i] #if x is a vector\nx[i, j] #if x is a matrix/data frame\nx[[i]] #if x is a list\nx$a #if x is a data frame or list\nx$\"a\" #if x is a data frame or list\n```\n:::\n\n\n\n## Vectors and multi-dimensional objects\n\nTo index a vector, `vector[i]` select the ith element. To index a multi-dimensional objects such as a matrix, `matrix[i, j]` selects the element in row i and column j, where as in a three dimensional `array[k, i, j]` selects the element in matrix k, row i, and column j. 
\n\nLet's practice by first creating the same objects as we did in Module 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\n\nHere is a reminder of what these objects look like.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 3 4 5\n```\n\n\n:::\n\n```{.r .cell-code}\nmatrix.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\nFinally, let's use indexing to pull out elements of the objects. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1[2] #pulling the second element\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n\n```{.r .cell-code}\nmatrix.object[1,2] #pulling the element in row 1 column 2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n\n\n## List objects\n\nFor lists, one generally uses `list[[p]]` to select any single element p.\n\nLet's practice by creating the same list as we did in Module 1.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\nNow we use indexing to pull out the 3rd element in the list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[[3]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\nWhat happens if we use a single square bracket?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\nThe `[[` operator 
is called the \"extract\" operator and gives us the element\nfrom the list. The `[` operator is called the \"subset\" operator and gives\nus a subset of the list, that is still a list.\n\n## $ for indexing for data frame\n\n`$` allows only a literal character string or a symbol as the index. For a data frame it extracts a variable.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration\n```\n:::\n\n\n\nNote, if you have spaces in your variable name, you will need to use back ticks \\` after the `$`. This is a good reason to not create variables / column names with spaces.\n\n## $ for indexing with lists\n\n`$` allows only a literal character string or a symbol as the index. For a list it extracts a named element.\n\nList elements can be named\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named <- list(\n emory = number.object,\n uga = vector.object2,\n gsu = matrix.object\n)\nlist.object.named\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$emory\n[1] 3\n\n$uga\n[1] \"blue\" \"red\" \"yellow\"\n\n$gsu\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\nIf list elements are named, than you can reference data from list using `$` or using double square brackets, `[[`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named$uga \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlist.object.named[[\"uga\"]] \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n\n\n:::\n:::\n\n\n\n\n## Using indexing to rename columns\n\nAs mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n\n```{.r .cell-code}\ncolnames(df)[2:3] <- c(\"IgG_concentration_IU/mL\", 
\"age_year\") # reassigns\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age_year\" \"gender\" \n[5] \"slum\" \n```\n\n\n:::\n:::\n\n\n\n
\n\nFor the sake of the module, I am going to reassign them back to the original variable names\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[2:3] <- c(\"IgG_concentration\", \"age\") #reset\n```\n:::\n\n\n\n## Using indexing to subset by columns\n\nWe can also subset data frames and matrices (2-dimensional objects) using the bracket `[ row , column ]`. We can subset by columns and pull the `x` column using the index of the column or the column name. Leaving either row or column dimension blank means to select all of them.\n\nFor example, here I am pulling the 3rd column, which has the variable name `age`, for all of rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[ , \"age\"] #same as df[ , 3]\n```\n:::\n\n\nWe can select multiple columns using multiple column names, again this is selecting these variables for all of the rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, c(\"age\", \"gender\")] #same as df[ , c(3,4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n age gender\n1 2 Female\n2 4 Female\n3 4 Male\n4 4 Male\n5 1 Male\n6 4 Male\n7 4 Female\n8 NA Female\n9 4 Male\n10 2 Male\n11 3 Male\n12 15 Female\n13 8 Male\n14 12 Male\n15 15 Male\n16 9 Male\n17 8 Male\n18 7 Female\n19 11 Female\n20 10 Male\n21 8 Male\n22 11 Female\n23 2 Male\n24 2 Female\n25 3 Female\n26 5 Male\n27 1 Male\n28 3 Female\n29 5 Female\n30 5 Female\n31 3 Male\n32 1 Male\n33 4 Female\n34 3 Male\n35 2 Female\n36 11 Female\n37 7 Male\n38 8 Male\n39 6 Male\n40 6 Male\n41 11 Female\n42 10 Male\n43 6 Female\n44 12 Male\n45 11 Male\n46 10 Male\n47 11 Male\n48 13 Female\n49 3 Female\n50 4 Female\n51 3 Male\n52 1 Male\n53 2 Female\n54 2 Female\n55 4 Male\n56 2 Male\n57 2 Male\n58 3 Female\n59 3 Female\n60 4 Male\n61 1 Female\n62 13 Female\n63 13 Female\n64 6 Male\n65 13 Male\n66 5 Female\n67 13 Female\n68 14 Male\n69 13 Male\n70 8 Female\n71 7 Male\n72 6 Female\n73 13 Male\n74 3 Male\n75 4 Male\n76 2 Male\n77 NA Male\n78 5 Female\n79 3 Male\n80 3 Male\n81 14 Male\n82 11 Female\n83 7 
Female\n84 7 Male\n85 11 Female\n86 9 Female\n87 14 Male\n88 13 Female\n89 1 Male\n90 1 Male\n91 4 Male\n92 1 Female\n93 2 Male\n94 3 Female\n95 2 Male\n96 1 Male\n97 2 Male\n98 2 Female\n99 4 Female\n100 5 Female\n101 5 Male\n102 6 Female\n103 14 Female\n104 14 Male\n105 10 Male\n106 6 Female\n107 6 Male\n108 8 Male\n109 6 Female\n110 12 Female\n111 12 Male\n112 14 Female\n113 15 Male\n114 12 Female\n115 4 Female\n116 4 Male\n117 3 Female\n118 NA Male\n119 2 Female\n120 3 Male\n121 NA Female\n122 3 Female\n123 3 Male\n124 2 Female\n125 4 Female\n126 10 Female\n127 7 Female\n128 11 Female\n129 6 Female\n130 11 Male\n131 9 Male\n132 6 Male\n133 13 Female\n134 10 Female\n135 6 Female\n136 11 Female\n137 7 Male\n138 6 Female\n139 4 Female\n140 4 Female\n141 4 Male\n142 4 Female\n143 4 Male\n144 4 Male\n145 3 Male\n146 4 Female\n147 3 Male\n148 3 Male\n149 13 Female\n150 7 Female\n151 10 Male\n152 6 Male\n153 10 Female\n154 12 Female\n155 10 Male\n156 10 Male\n157 13 Male\n158 13 Female\n159 5 Female\n160 3 Female\n161 4 Male\n162 1 Male\n163 3 Female\n164 4 Male\n165 4 Male\n166 1 Male\n167 5 Female\n168 6 Female\n169 14 Female\n170 6 Male\n171 13 Female\n172 9 Male\n173 11 Male\n174 10 Male\n175 5 Female\n176 14 Male\n177 7 Male\n178 10 Male\n179 6 Male\n180 5 Male\n181 3 Female\n182 4 Male\n183 2 Female\n184 3 Male\n185 3 Female\n186 2 Female\n187 3 Male\n188 5 Female\n189 2 Male\n190 3 Female\n191 14 Female\n192 9 Female\n193 14 Female\n194 9 Female\n195 8 Female\n196 7 Male\n197 13 Male\n198 8 Female\n199 6 Male\n200 12 Female\n201 14 Female\n202 15 Female\n203 2 Female\n204 4 Female\n205 3 Male\n206 3 Female\n207 3 Male\n208 4 Female\n209 3 Male\n210 14 Female\n211 8 Male\n212 7 Male\n213 14 Female\n214 13 Female\n215 13 Female\n216 7 Male\n217 8 Female\n218 10 Female\n219 9 Male\n220 9 Female\n221 3 Female\n222 4 Male\n223 4 Female\n224 4 Male\n225 2 Female\n226 1 Female\n227 3 Female\n228 2 Male\n229 3 Male\n230 5 Male\n231 2 Female\n232 2 Male\n233 9 Male\n234 
13 Male\n235 10 Female\n236 6 Male\n237 13 Female\n238 11 Male\n239 10 Male\n240 8 Female\n241 9 Female\n242 10 Male\n243 14 Male\n244 1 Female\n245 2 Male\n246 3 Female\n247 2 Male\n248 3 Female\n249 2 Female\n250 3 Female\n251 5 Female\n252 10 Female\n253 7 Male\n254 13 Female\n255 15 Male\n256 11 Female\n257 10 Female\n258 3 Female\n259 2 Male\n260 3 Male\n261 3 Female\n262 3 Female\n263 4 Male\n264 3 Male\n265 2 Male\n266 4 Male\n267 2 Female\n268 8 Male\n269 11 Male\n270 6 Male\n271 14 Female\n272 14 Male\n273 5 Female\n274 5 Male\n275 10 Female\n276 13 Male\n277 6 Male\n278 5 Male\n279 12 Male\n280 2 Male\n281 3 Female\n282 1 Female\n283 1 Male\n284 1 Female\n285 2 Female\n286 5 Female\n287 5 Male\n288 4 Female\n289 2 Male\n290 NA Female\n291 6 Female\n292 8 Male\n293 15 Male\n294 11 Male\n295 14 Male\n296 6 Male\n297 10 Female\n298 12 Male\n299 14 Male\n300 10 Male\n301 1 Female\n302 3 Male\n303 2 Male\n304 3 Female\n305 4 Male\n306 3 Male\n307 4 Female\n308 4 Male\n309 1 Female\n310 7 Male\n311 11 Female\n312 7 Female\n313 5 Female\n314 10 Male\n315 9 Female\n316 13 Male\n317 11 Female\n318 13 Male\n319 9 Female\n320 15 Female\n321 7 Female\n322 4 Male\n323 1 Male\n324 1 Male\n325 2 Female\n326 2 Female\n327 3 Male\n328 2 Male\n329 3 Male\n330 4 Female\n331 7 Female\n332 11 Female\n333 10 Female\n334 5 Male\n335 8 Male\n336 15 Male\n337 14 Male\n338 2 Male\n339 2 Female\n340 2 Male\n341 5 Male\n342 4 Female\n343 3 Male\n344 5 Female\n345 4 Female\n346 2 Female\n347 1 Female\n348 7 Male\n349 8 Female\n350 NA Male\n351 9 Male\n352 8 Female\n353 5 Male\n354 14 Male\n355 14 Male\n356 7 Female\n357 13 Female\n358 2 Male\n359 1 Female\n360 1 Male\n361 4 Female\n362 3 Male\n363 4 Female\n364 3 Male\n365 1 Male\n366 5 Female\n367 4 Female\n368 4 Female\n369 4 Male\n370 11 Male\n371 15 Female\n372 12 Female\n373 11 Female\n374 8 Female\n375 13 Male\n376 10 Female\n377 10 Female\n378 15 Male\n379 8 Female\n380 14 Male\n381 4 Male\n382 1 Male\n383 5 Female\n384 2 
Male\n385 2 Female\n386 4 Male\n387 4 Male\n388 2 Female\n389 3 Male\n390 11 Male\n391 10 Female\n392 6 Male\n393 12 Female\n394 10 Female\n395 8 Male\n396 8 Male\n397 13 Male\n398 10 Male\n399 13 Female\n400 10 Male\n401 2 Male\n402 4 Female\n403 3 Female\n404 2 Female\n405 1 Female\n406 3 Male\n407 3 Female\n408 4 Male\n409 5 Female\n410 5 Female\n411 1 Female\n412 11 Male\n413 6 Male\n414 14 Female\n415 8 Male\n416 8 Female\n417 9 Female\n418 7 Male\n419 6 Male\n420 12 Female\n421 8 Male\n422 11 Female\n423 14 Male\n424 3 Female\n425 1 Female\n426 5 Female\n427 2 Female\n428 3 Female\n429 4 Female\n430 2 Male\n431 3 Female\n432 4 Male\n433 1 Female\n434 7 Female\n435 10 Male\n436 11 Male\n437 7 Female\n438 10 Female\n439 14 Female\n440 7 Female\n441 11 Male\n442 12 Male\n443 10 Female\n444 6 Male\n445 13 Male\n446 8 Female\n447 2 Male\n448 3 Female\n449 1 Female\n450 2 Female\n451 NA Male\n452 NA Female\n453 4 Male\n454 4 Male\n455 1 Male\n456 2 Female\n457 2 Male\n458 12 Male\n459 12 Female\n460 8 Female\n461 14 Female\n462 13 Female\n463 6 Male\n464 11 Female\n465 11 Male\n466 10 Female\n467 12 Male\n468 14 Female\n469 11 Female\n470 1 Male\n471 2 Female\n472 3 Male\n473 3 Female\n474 5 Female\n475 3 Male\n476 1 Male\n477 4 Female\n478 4 Female\n479 4 Male\n480 2 Female\n481 5 Female\n482 7 Male\n483 8 Male\n484 10 Male\n485 6 Female\n486 7 Male\n487 10 Female\n488 6 Male\n489 6 Female\n490 15 Female\n491 5 Male\n492 3 Male\n493 5 Male\n494 3 Female\n495 5 Male\n496 5 Male\n497 1 Female\n498 1 Male\n499 7 Female\n500 14 Female\n501 9 Male\n502 10 Female\n503 10 Female\n504 11 Male\n505 11 Female\n506 12 Female\n507 11 Female\n508 12 Male\n509 12 Male\n510 10 Female\n511 1 Male\n512 2 Female\n513 4 Male\n514 2 Male\n515 3 Male\n516 3 Female\n517 2 Male\n518 4 Male\n519 3 Male\n520 1 Female\n521 4 Male\n522 12 Female\n523 6 Male\n524 7 Female\n525 7 Male\n526 13 Female\n527 8 Female\n528 7 Male\n529 8 Female\n530 8 Female\n531 11 Female\n532 14 Female\n533 3 
Male\n534 2 Female\n535 2 Male\n536 3 Male\n537 2 Male\n538 2 Female\n539 3 Female\n540 2 Male\n541 5 Male\n542 10 Female\n543 14 Male\n544 9 Male\n545 6 Male\n546 7 Male\n547 14 Female\n548 7 Female\n549 7 Male\n550 9 Male\n551 14 Male\n552 10 Female\n553 13 Female\n554 5 Male\n555 4 Female\n556 4 Female\n557 5 Female\n558 4 Female\n559 4 Male\n560 4 Male\n561 3 Female\n562 1 Female\n563 4 Male\n564 1 Male\n565 1 Female\n566 7 Male\n567 13 Female\n568 10 Female\n569 14 Male\n570 12 Female\n571 14 Male\n572 8 Male\n573 7 Male\n574 11 Female\n575 8 Male\n576 12 Male\n577 9 Female\n578 5 Female\n579 4 Male\n580 3 Female\n581 2 Male\n582 2 Male\n583 3 Male\n584 4 Female\n585 4 Male\n586 4 Female\n587 5 Male\n588 3 Female\n589 6 Female\n590 3 Male\n591 11 Female\n592 11 Male\n593 7 Male\n594 8 Male\n595 6 Female\n596 10 Female\n597 8 Female\n598 8 Male\n599 9 Female\n600 8 Male\n601 13 Male\n602 11 Male\n603 8 Female\n604 2 Female\n605 4 Male\n606 2 Male\n607 2 Female\n608 4 Male\n609 2 Male\n610 4 Female\n611 2 Female\n612 4 Female\n613 1 Female\n614 4 Female\n615 12 Female\n616 7 Female\n617 11 Male\n618 6 Male\n619 8 Male\n620 14 Male\n621 11 Male\n622 7 Female\n623 14 Female\n624 6 Male\n625 13 Female\n626 13 Female\n627 3 Male\n628 1 Male\n629 3 Male\n630 1 Female\n631 1 Female\n632 2 Male\n633 4 Male\n634 4 Male\n635 2 Female\n636 4 Female\n637 5 Male\n638 3 Female\n639 3 Male\n640 6 Female\n641 11 Female\n642 9 Female\n643 7 Female\n644 8 Male\n645 NA Female\n646 8 Female\n647 14 Female\n648 10 Male\n649 10 Male\n650 11 Female\n651 13 Female\n```\n\n\n:::\n:::\n\n\nWe can remove select columns using indexing as well, OR by simply changing the column to `NULL`\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, -5] #remove column 5, \"slum\" variable\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$slum <- NULL # this is the same as above\n```\n:::\n\n\nWe can also grab the `age` column using the `$` operator, again this is selecting the variable for all of the 
rows.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age\n```\n:::\n\n\n\n\n## Using indexing to subset by rows\n\nWe can use indexing to also subset by rows. For example, here we pull the 100th observation/row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,] \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n100 8122 0.1818182 5 Female Non slum\n```\n\n\n:::\n:::\n\n\nAnd, here we pull the `age` of the 100th observation/row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,\"age\"] \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 5\n```\n\n\n:::\n:::\n\n\n \n\n## Logical operators\n\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE\n\noperator | operator option |description\n-----|-----|-----:\n`<`|%l%|less than\n`<=`|%le%|less than or equal to\n`>`|%g%|greater than\n`>=`|%ge%|greater than or equal to\n`==`||equal to\n`!=`||not equal to\n`x&y`||x and y\n`x|y`||x or y\n`%in%`||match\n`%!in%`||do not match\n\n\n## Logical operators examples\n\nLet's practice. 
First, here is a reminder of what the number.object contains.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n\nNow, we will use logical operators to evaluate the object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object<4\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nnumber.object>=3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nnumber.object!=5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nnumber.object %in% c(6,7,2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n\nWe can use any of these logical operators to subset our data.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Overall mean\nmean(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 87.36826\n```\n\n\n:::\n\n```{.r .cell-code}\n# Mean for all children who are not age 3\nmean(df$IgG_concentration[df$age != 3], na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 90.32824\n```\n\n\n:::\n\n```{.r .cell-code}\n# Mean for all children who are between 0 and 3 or between 7 and 10 years old\nmean(df$IgG_concentration[df$age %in% c(0:3, 7:10)], na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 74.0914\n```\n\n\n:::\n:::\n\n\n\n## Using indexing and logical operators to rename columns\n\n1. 
We can assign the column names from data frame `df` to an object `cn`, then we can modify `cn` directly using indexing and logical operators, finally we reassign the column names, `cn`, back to the data frame `df`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncn <- colnames(df)\ncn\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n\n```{.r .cell-code}\ncn==\"IgG_concentration\"\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE TRUE FALSE FALSE FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_mIU\" #rename cn to \"IgG_concentration_mIU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration_mIU\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n:::\n\n\n\n
\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[colnames(df)==\"IgG_concentration_mIU\"] <- \"IgG_concentration\" #reset\n```\n:::\n\n\n\n\n## Using indexing and logical operators to subset data\n\n\nIn this example, we subset by rows and pull only observations with an age of less than or equal to 10 and then save the subset data to `df_lte10`. Note that the logical expression `df$age<=10` is before the comma because I want to subset by rows (the first dimension).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10 <- df[df$age<=10, ]\n```\n:::\n\n\nLet's check that my subsets worked using the `summary()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.0 3.0 4.0 4.8 7.0 10.0 9 \n```\n\n\n:::\n:::\n\n\n\n
\n\nIn the next example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte5_gt10 <- df[df$age<=5 | df$age>10, ]\n```\n:::\n\n\nLet's check that my subsets worked using the `summary()` function. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte5_gt10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.00 2.50 4.00 6.08 11.00 15.00 9 \n```\n\n\n:::\n:::\n\n\n\n\n## Missing values \n\nMissing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data and how you can identify them is the first step to data cleaning.\n\nTypes of \"missing\" values:\n\n- `NA` - **N**ot **A**vailable, general missing data\n- `NaN` - stands for \"**N**ot **a** **N**umber\", happens when you do 0/0.\n- `Inf` and `-Inf` - Infinity, happens when you divide a positive number (or negative number) by 0.\n- blank space - sometimes when data is read in, there is a blank space left\n- an empty string (e.g., `\"\"`) \n- `NULL`- undefined value that represents something that does not exist\n\n## Logical operators to help identify missing data\n\noperator |description\n-----|-----:\n`is.na`|is NaN or NA\n`is.nan`|is NaN\n`!is.na`|is not NaN or NA\n`!is.nan`|is not NaN\n`is.infinite`|is infinite\n`any`|are any TRUE\n`all`|all are TRUE\n`which`|which are TRUE\n\n## More logical operators examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- c(0,NA, -1)/0\ntest\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NaN NA -Inf\n```\n\n\n:::\n\n```{.r .cell-code}\nis.na(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE TRUE FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.nan(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.infinite(test)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n[1] FALSE FALSE TRUE\n```\n\n\n:::\n:::\n\n\n\n## More logical operators examples\n\n`any(is.na(x))` means do we have any `NA`'s in the object `x`?\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nany(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nany(is.na(df$slum)) # are there any NAs- NO/FALSE\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n\n`which(is.na(x))` means which of the elements in object `x` are `NA`'s?\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwhich(is.na(df$IgG_concentration)) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 13 55 57 72 182 406 414 478 488 595\n```\n\n\n:::\n\n```{.r .cell-code}\nwhich(is.na(df$slum)) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ninteger(0)\n```\n\n\n:::\n:::\n\n\n\n## `subset()` function\n\nThe Base R `subset()` function is a slightly easier way to select variables and observations.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?subset\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nSubsetting Vectors, Matrices and Data Frames\n\nDescription:\n\n Return subsets of vectors, matrices or data frames which meet\n conditions.\n\nUsage:\n\n subset(x, ...)\n \n ## Default S3 method:\n subset(x, subset, ...)\n \n ## S3 method for class 'matrix'\n subset(x, subset, select, drop = FALSE, ...)\n \n ## S3 method for class 'data.frame'\n subset(x, subset, select, drop = FALSE, ...)\n \nArguments:\n\n x: object to be subsetted.\n\n subset: logical expression indicating elements or rows to keep:\n missing values are taken as false.\n\n select: expression, indicating columns to select from a data frame.\n\n drop: passed on to '[' indexing operator.\n\n ...: further arguments to be passed to or from other methods.\n\nDetails:\n\n This is a generic function, with 
methods supplied for matrices,\n data frames and vectors (including lists). Packages and users can\n add further methods.\n\n For ordinary vectors, the result is simply 'x[subset &\n !is.na(subset)]'.\n\n For data frames, the 'subset' argument works on the rows. Note\n that 'subset' will be evaluated in the data frame, so columns can\n be referred to (by name) as variables in the expression (see the\n examples).\n\n The 'select' argument exists only for the methods for data frames\n and matrices. It works by first replacing column names in the\n selection expression with the corresponding column numbers in the\n data frame and then using the resulting integer vector to index\n the columns. This allows the use of the standard indexing\n conventions so that for example ranges of columns can be specified\n easily, or single columns can be dropped (see the examples).\n\n The 'drop' argument is passed on to the indexing method for\n matrices and data frames: note that the default for matrices is\n different from that for indexing.\n\n Factors may have empty levels after subsetting; unused levels are\n not automatically removed. 
See 'droplevels' for a way to drop all\n unused levels from a data frame.\n\nValue:\n\n An object similar to 'x' contain just the selected elements (for a\n vector), rows and columns (for a matrix or data frame), and so on.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n functions like '[', and in particular the non-standard evaluation\n of argument 'subset' can have unanticipated consequences.\n\nAuthor(s):\n\n Peter Dalgaard and Brian Ripley\n\nSee Also:\n\n '[', 'transform' 'droplevels'\n\nExamples:\n\n subset(airquality, Temp > 80, select = c(Ozone, Temp))\n subset(airquality, Day == 1, select = -Temp)\n subset(airquality, select = Ozone:Wind)\n \n with(airquality, subset(Ozone, Temp > 80))\n \n ## sometimes requiring a logical 'subset' argument is a nuisance\n nm <- rownames(state.x77)\n start_with_M <- nm %in% grep(\"^M\", nm, value = TRUE)\n subset(state.x77, start_with_M, Illiteracy:Murder)\n # but in recent versions of R this can simply be\n subset(state.x77, grepl(\"^M\", nm), Illiteracy:Murder)\n\n\n\n## Subsetting using the `subset()` function\n\nHere are a few examples using the `subset()` function\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))\ndf_lt5_f <- subset(df, df$age<=5 & gender==\"Female\", select=c(IgG_concentration, slum))\n```\n:::\n\n\n\n## `subset()` function vs logical operators\n\n`subset()` automatically drops rows where the condition evaluates to `NA`, whereas indexing with a logical vector keeps those rows as `NA`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age) #created with indexing\n```\n\n::: {.cell-output-display}\n\n\n| Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.| NA's|\n|----:|-------:|------:|----:|-------:|----:|----:|\n| 1| 3| 4| 4.8| 7| 10| 9|\n:::\n\n```{.r .cell-code}\nsummary(df_lte10_v2$age) #created with the subset function\n```\n\n::: {.cell-output-display}\n\n\n| Min.| 1st Qu.| 
Median| Mean| 3rd Qu.| Max.|\n|----:|-------:|------:|----:|-------:|----:|\n| 1| 3| 4| 4.8| 7| 10|\n:::\n:::\n\n\n\nWe can also see this by looking at the number or rows in each dataset.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnrow(df_lte10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 504\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_lte10_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 495\n```\n\n\n:::\n:::\n\n\n\n\n\n## Summary\n\n- `colnames()`, `str()` and `summary()`functions from Base R are functions to assess the data type and some summary statistics\n- There are three basic indexing syntax: `[`, `[[` and `$`\n- Indexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Logical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing\n- There are 7 “types” of missing values, the most common being “NA”\n- Logical operators meant to determine missing values are very helpful for data cleaning\n- The Base R `subset()` function is a slightly easier way to select variables and observations.\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n- [\"Indexing\" CRAN Project](https://cran.r-project.org/doc/manuals/R-lang.html#Indexing)\n- [\"Logical operators\" CRAN Project](https://cran.r-project.org/web/packages/extraoperators/vignettes/logicals-vignette.html)\n\n", - "supporting": [], + "markdown": "---\ntitle: \"Module 6: Get to Know Your Data and Subsetting\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n#execute: \n# echo: true\n---\n\n\n\n\n## Learning Objectives\n\nAfter module 6, you should be able to...\n\n- Use basic functions to get to know 
your data\n- Use three indexing approaches\n- Rely on indexing to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Describe what logical operators are and how to use them\n- Use the `subset()` function to subset data\n\n\n## Getting to know our data\n\nThe `dim()`, `nrow()`, and `ncol()` functions are good options to check the dimensions of your data before moving forward. \n\nLet's first read in the data from the previous module.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(df) # rows, columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651 5\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df) # number of rows\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 651\n```\n\n\n:::\n\n```{.r .cell-code}\nncol(df) # number of columns\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 5\n```\n\n\n:::\n:::\n\n\n\n\n## Quick summary of data\n\nThe `colnames()`, `str()` and `summary()` functions from Base R are great functions to assess the data type and some summary statistics. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n\n```{.r .cell-code}\nstr(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t651 obs. of 5 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.318 3.437 0.3 143.236 0.448 ...\n $ age : int 2 4 4 4 1 4 4 NA 4 2 ...\n $ gender : chr \"Female\" \"Female\" \"Male\" \"Male\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n```\n\n\n:::\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender \n Min. 
:5006 Min. : 0.0054 Min. : 1.000 Length:651 \n 1st Qu.:6306 1st Qu.: 0.3000 1st Qu.: 3.000 Class :character \n Median :7495 Median : 1.6658 Median : 6.000 Mode :character \n Mean :7492 Mean : 87.3683 Mean : 6.606 \n 3rd Qu.:8749 3rd Qu.:141.4405 3rd Qu.:10.000 \n Max. :9982 Max. :916.4179 Max. :15.000 \n NA's :10 NA's :9 \n slum \n Length:651 \n Class :character \n Mode :character \n \n \n \n \n```\n\n\n:::\n:::\n\n\n\n\nNote, if you have a very large dataset with 15+ variables, the `summary()` output becomes long and hard to read. \n\n## Description of data\n\nThis is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. We will use this dataset for modules throughout the Workshop.\n\n## View the data as a whole dataframe\n\nThe `View()` function, one of the few Base R functions with a capital letter, can be used to open a new tab in RStudio and view the data as you would in Excel.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nView(df)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ViewTab.png){width=100%}\n:::\n:::\n\n\n\n\n## View the data as a whole dataframe\n\nYou can also open a new tab of the data by clicking on the data icon beside the object in the Environment pane.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/View.png){width=90%}\n:::\n:::\n\n\n\n\nYou can also hold down `Cmd` or `CTRL` and click on the name of a data frame in your code.\n\n## Indexing\n\nR contains several operators which allow access to individual elements or subsets through indexing. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: `[`, `[[` and `$`. 
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[i] #if x is a vector\nx[i, j] #if x is a matrix/data frame\nx[[i]] #if x is a list\nx$a #if x is a data frame or list\nx$\"a\" #if x is a data frame or list\n```\n:::\n\n\n\n\n## Vectors and multi-dimensional objects\n\nTo index a vector, `vector[i]` selects the ith element. To index a multi-dimensional object such as a matrix, `matrix[i, j]` selects the element in row i and column j, whereas in a three-dimensional array, `array[i, j, k]` selects the element in row i and column j of matrix k. \n\nLet's practice by first creating the same objects as we did in Module 1.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\n\n\nHere is a reminder of what these objects look like.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2 3 4 5\n```\n\n\n:::\n\n```{.r .cell-code}\nmatrix.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\nFinally, let's use indexing to pull out elements of the objects. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1[2] #pulling the second element\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n\n```{.r .cell-code}\nmatrix.object[1,2] #pulling the element in row 1 column 2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n\n\n\n## List objects\n\nFor lists, one generally uses `list[[p]]` to select any single element p.\n\nLet's practice by creating the same list as we did in Module 1.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\nNow we use indexing to pull out the 3rd element in the list.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[[3]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\nWhat happens if we use a single square bracket?\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\nThe `[[` operator is called the \"extract\" operator and gives us the element\nfrom the list. The `[` operator is called the \"subset\" operator and gives\nus a subset of the list, that is still a list.\n\n## $ for indexing for data frame\n\n`$` allows only a literal character string or a symbol as the index. For a data frame it extracts a variable.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration\n```\n:::\n\n\n\n\nNote, if you have spaces in your variable name, you will need to use back ticks \\` after the `$`. This is a good reason to not create variables / column names with spaces.\n\n## $ for indexing with lists\n\n`$` allows only a literal character string or a symbol as the index. 
For a list it extracts a named element.\n\nList elements can be named.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named <- list(\n emory = number.object,\n uga = vector.object2,\n gsu = matrix.object\n)\nlist.object.named\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n$emory\n[1] 3\n\n$uga\n[1] \"blue\" \"red\" \"yellow\"\n\n$gsu\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\nIf list elements are named, then you can reference data from the list using `$` or using double square brackets, `[[`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named$uga \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlist.object.named[[\"uga\"]] \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n\n\n:::\n:::\n\n\n\n\n\n## Using indexing to rename columns\n\nAs mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n\n```{.r .cell-code}\ncolnames(df)[2:3] <- c(\"IgG_concentration_IU/mL\", \"age_year\") # reassigns\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age_year\" \"gender\" \n[5] \"slum\" \n```\n\n\n:::\n:::\n\n\n\n\n
\n\nFor the sake of the module, I am going to reassign them back to the original variable names\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[2:3] <- c(\"IgG_concentration\", \"age\") #reset\n```\n:::\n\n\n\n\n## Using indexing to subset by columns\n\nWe can also subset data frames and matrices (2-dimensional objects) using the bracket `[ row , column ]`. We can subset by columns and pull the `x` column using the index of the column or the column name. Leaving either row or column dimension blank means to select all of them.\n\nFor example, here I am pulling the 3rd column, which has the variable name `age`, for all of rows.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[ , \"age\"] #same as df[ , 3]\n```\n:::\n\n\n\nWe can select multiple columns using multiple column names, again this is selecting these variables for all of the rows.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, c(\"age\", \"gender\")] #same as df[ , c(3,4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n age gender\n1 2 Female\n2 4 Female\n3 4 Male\n4 4 Male\n5 1 Male\n6 4 Male\n7 4 Female\n8 NA Female\n9 4 Male\n10 2 Male\n11 3 Male\n12 15 Female\n13 8 Male\n14 12 Male\n15 15 Male\n16 9 Male\n17 8 Male\n18 7 Female\n19 11 Female\n20 10 Male\n21 8 Male\n22 11 Female\n23 2 Male\n24 2 Female\n25 3 Female\n26 5 Male\n27 1 Male\n28 3 Female\n29 5 Female\n30 5 Female\n31 3 Male\n32 1 Male\n33 4 Female\n34 3 Male\n35 2 Female\n36 11 Female\n37 7 Male\n38 8 Male\n39 6 Male\n40 6 Male\n41 11 Female\n42 10 Male\n43 6 Female\n44 12 Male\n45 11 Male\n46 10 Male\n47 11 Male\n48 13 Female\n49 3 Female\n50 4 Female\n51 3 Male\n52 1 Male\n53 2 Female\n54 2 Female\n55 4 Male\n56 2 Male\n57 2 Male\n58 3 Female\n59 3 Female\n60 4 Male\n61 1 Female\n62 13 Female\n63 13 Female\n64 6 Male\n65 13 Male\n66 5 Female\n67 13 Female\n68 14 Male\n69 13 Male\n70 8 Female\n71 7 Male\n72 6 Female\n73 13 Male\n74 3 Male\n75 4 Male\n76 2 Male\n77 NA Male\n78 5 Female\n79 3 Male\n80 3 Male\n81 14 Male\n82 11 
Female\n83 7 Female\n84 7 Male\n85 11 Female\n86 9 Female\n87 14 Male\n88 13 Female\n89 1 Male\n90 1 Male\n91 4 Male\n92 1 Female\n93 2 Male\n94 3 Female\n95 2 Male\n96 1 Male\n97 2 Male\n98 2 Female\n99 4 Female\n100 5 Female\n101 5 Male\n102 6 Female\n103 14 Female\n104 14 Male\n105 10 Male\n106 6 Female\n107 6 Male\n108 8 Male\n109 6 Female\n110 12 Female\n111 12 Male\n112 14 Female\n113 15 Male\n114 12 Female\n115 4 Female\n116 4 Male\n117 3 Female\n118 NA Male\n119 2 Female\n120 3 Male\n121 NA Female\n122 3 Female\n123 3 Male\n124 2 Female\n125 4 Female\n126 10 Female\n127 7 Female\n128 11 Female\n129 6 Female\n130 11 Male\n131 9 Male\n132 6 Male\n133 13 Female\n134 10 Female\n135 6 Female\n136 11 Female\n137 7 Male\n138 6 Female\n139 4 Female\n140 4 Female\n141 4 Male\n142 4 Female\n143 4 Male\n144 4 Male\n145 3 Male\n146 4 Female\n147 3 Male\n148 3 Male\n149 13 Female\n150 7 Female\n151 10 Male\n152 6 Male\n153 10 Female\n154 12 Female\n155 10 Male\n156 10 Male\n157 13 Male\n158 13 Female\n159 5 Female\n160 3 Female\n161 4 Male\n162 1 Male\n163 3 Female\n164 4 Male\n165 4 Male\n166 1 Male\n167 5 Female\n168 6 Female\n169 14 Female\n170 6 Male\n171 13 Female\n172 9 Male\n173 11 Male\n174 10 Male\n175 5 Female\n176 14 Male\n177 7 Male\n178 10 Male\n179 6 Male\n180 5 Male\n181 3 Female\n182 4 Male\n183 2 Female\n184 3 Male\n185 3 Female\n186 2 Female\n187 3 Male\n188 5 Female\n189 2 Male\n190 3 Female\n191 14 Female\n192 9 Female\n193 14 Female\n194 9 Female\n195 8 Female\n196 7 Male\n197 13 Male\n198 8 Female\n199 6 Male\n200 12 Female\n201 14 Female\n202 15 Female\n203 2 Female\n204 4 Female\n205 3 Male\n206 3 Female\n207 3 Male\n208 4 Female\n209 3 Male\n210 14 Female\n211 8 Male\n212 7 Male\n213 14 Female\n214 13 Female\n215 13 Female\n216 7 Male\n217 8 Female\n218 10 Female\n219 9 Male\n220 9 Female\n221 3 Female\n222 4 Male\n223 4 Female\n224 4 Male\n225 2 Female\n226 1 Female\n227 3 Female\n228 2 Male\n229 3 Male\n230 5 Male\n231 2 Female\n232 2 
Male\n233 9 Male\n234 13 Male\n235 10 Female\n236 6 Male\n237 13 Female\n238 11 Male\n239 10 Male\n240 8 Female\n241 9 Female\n242 10 Male\n243 14 Male\n244 1 Female\n245 2 Male\n246 3 Female\n247 2 Male\n248 3 Female\n249 2 Female\n250 3 Female\n251 5 Female\n252 10 Female\n253 7 Male\n254 13 Female\n255 15 Male\n256 11 Female\n257 10 Female\n258 3 Female\n259 2 Male\n260 3 Male\n261 3 Female\n262 3 Female\n263 4 Male\n264 3 Male\n265 2 Male\n266 4 Male\n267 2 Female\n268 8 Male\n269 11 Male\n270 6 Male\n271 14 Female\n272 14 Male\n273 5 Female\n274 5 Male\n275 10 Female\n276 13 Male\n277 6 Male\n278 5 Male\n279 12 Male\n280 2 Male\n281 3 Female\n282 1 Female\n283 1 Male\n284 1 Female\n285 2 Female\n286 5 Female\n287 5 Male\n288 4 Female\n289 2 Male\n290 NA Female\n291 6 Female\n292 8 Male\n293 15 Male\n294 11 Male\n295 14 Male\n296 6 Male\n297 10 Female\n298 12 Male\n299 14 Male\n300 10 Male\n301 1 Female\n302 3 Male\n303 2 Male\n304 3 Female\n305 4 Male\n306 3 Male\n307 4 Female\n308 4 Male\n309 1 Female\n310 7 Male\n311 11 Female\n312 7 Female\n313 5 Female\n314 10 Male\n315 9 Female\n316 13 Male\n317 11 Female\n318 13 Male\n319 9 Female\n320 15 Female\n321 7 Female\n322 4 Male\n323 1 Male\n324 1 Male\n325 2 Female\n326 2 Female\n327 3 Male\n328 2 Male\n329 3 Male\n330 4 Female\n331 7 Female\n332 11 Female\n333 10 Female\n334 5 Male\n335 8 Male\n336 15 Male\n337 14 Male\n338 2 Male\n339 2 Female\n340 2 Male\n341 5 Male\n342 4 Female\n343 3 Male\n344 5 Female\n345 4 Female\n346 2 Female\n347 1 Female\n348 7 Male\n349 8 Female\n350 NA Male\n351 9 Male\n352 8 Female\n353 5 Male\n354 14 Male\n355 14 Male\n356 7 Female\n357 13 Female\n358 2 Male\n359 1 Female\n360 1 Male\n361 4 Female\n362 3 Male\n363 4 Female\n364 3 Male\n365 1 Male\n366 5 Female\n367 4 Female\n368 4 Female\n369 4 Male\n370 11 Male\n371 15 Female\n372 12 Female\n373 11 Female\n374 8 Female\n375 13 Male\n376 10 Female\n377 10 Female\n378 15 Male\n379 8 Female\n380 14 Male\n381 4 Male\n382 1 
Male\n383 5 Female\n384 2 Male\n385 2 Female\n386 4 Male\n387 4 Male\n388 2 Female\n389 3 Male\n390 11 Male\n391 10 Female\n392 6 Male\n393 12 Female\n394 10 Female\n395 8 Male\n396 8 Male\n397 13 Male\n398 10 Male\n399 13 Female\n400 10 Male\n401 2 Male\n402 4 Female\n403 3 Female\n404 2 Female\n405 1 Female\n406 3 Male\n407 3 Female\n408 4 Male\n409 5 Female\n410 5 Female\n411 1 Female\n412 11 Male\n413 6 Male\n414 14 Female\n415 8 Male\n416 8 Female\n417 9 Female\n418 7 Male\n419 6 Male\n420 12 Female\n421 8 Male\n422 11 Female\n423 14 Male\n424 3 Female\n425 1 Female\n426 5 Female\n427 2 Female\n428 3 Female\n429 4 Female\n430 2 Male\n431 3 Female\n432 4 Male\n433 1 Female\n434 7 Female\n435 10 Male\n436 11 Male\n437 7 Female\n438 10 Female\n439 14 Female\n440 7 Female\n441 11 Male\n442 12 Male\n443 10 Female\n444 6 Male\n445 13 Male\n446 8 Female\n447 2 Male\n448 3 Female\n449 1 Female\n450 2 Female\n451 NA Male\n452 NA Female\n453 4 Male\n454 4 Male\n455 1 Male\n456 2 Female\n457 2 Male\n458 12 Male\n459 12 Female\n460 8 Female\n461 14 Female\n462 13 Female\n463 6 Male\n464 11 Female\n465 11 Male\n466 10 Female\n467 12 Male\n468 14 Female\n469 11 Female\n470 1 Male\n471 2 Female\n472 3 Male\n473 3 Female\n474 5 Female\n475 3 Male\n476 1 Male\n477 4 Female\n478 4 Female\n479 4 Male\n480 2 Female\n481 5 Female\n482 7 Male\n483 8 Male\n484 10 Male\n485 6 Female\n486 7 Male\n487 10 Female\n488 6 Male\n489 6 Female\n490 15 Female\n491 5 Male\n492 3 Male\n493 5 Male\n494 3 Female\n495 5 Male\n496 5 Male\n497 1 Female\n498 1 Male\n499 7 Female\n500 14 Female\n501 9 Male\n502 10 Female\n503 10 Female\n504 11 Male\n505 11 Female\n506 12 Female\n507 11 Female\n508 12 Male\n509 12 Male\n510 10 Female\n511 1 Male\n512 2 Female\n513 4 Male\n514 2 Male\n515 3 Male\n516 3 Female\n517 2 Male\n518 4 Male\n519 3 Male\n520 1 Female\n521 4 Male\n522 12 Female\n523 6 Male\n524 7 Female\n525 7 Male\n526 13 Female\n527 8 Female\n528 7 Male\n529 8 Female\n530 8 Female\n531 11 
Female\n532 14 Female\n533 3 Male\n534 2 Female\n535 2 Male\n536 3 Male\n537 2 Male\n538 2 Female\n539 3 Female\n540 2 Male\n541 5 Male\n542 10 Female\n543 14 Male\n544 9 Male\n545 6 Male\n546 7 Male\n547 14 Female\n548 7 Female\n549 7 Male\n550 9 Male\n551 14 Male\n552 10 Female\n553 13 Female\n554 5 Male\n555 4 Female\n556 4 Female\n557 5 Female\n558 4 Female\n559 4 Male\n560 4 Male\n561 3 Female\n562 1 Female\n563 4 Male\n564 1 Male\n565 1 Female\n566 7 Male\n567 13 Female\n568 10 Female\n569 14 Male\n570 12 Female\n571 14 Male\n572 8 Male\n573 7 Male\n574 11 Female\n575 8 Male\n576 12 Male\n577 9 Female\n578 5 Female\n579 4 Male\n580 3 Female\n581 2 Male\n582 2 Male\n583 3 Male\n584 4 Female\n585 4 Male\n586 4 Female\n587 5 Male\n588 3 Female\n589 6 Female\n590 3 Male\n591 11 Female\n592 11 Male\n593 7 Male\n594 8 Male\n595 6 Female\n596 10 Female\n597 8 Female\n598 8 Male\n599 9 Female\n600 8 Male\n601 13 Male\n602 11 Male\n603 8 Female\n604 2 Female\n605 4 Male\n606 2 Male\n607 2 Female\n608 4 Male\n609 2 Male\n610 4 Female\n611 2 Female\n612 4 Female\n613 1 Female\n614 4 Female\n615 12 Female\n616 7 Female\n617 11 Male\n618 6 Male\n619 8 Male\n620 14 Male\n621 11 Male\n622 7 Female\n623 14 Female\n624 6 Male\n625 13 Female\n626 13 Female\n627 3 Male\n628 1 Male\n629 3 Male\n630 1 Female\n631 1 Female\n632 2 Male\n633 4 Male\n634 4 Male\n635 2 Female\n636 4 Female\n637 5 Male\n638 3 Female\n639 3 Male\n640 6 Female\n641 11 Female\n642 9 Female\n643 7 Female\n644 8 Male\n645 NA Female\n646 8 Female\n647 14 Female\n648 10 Male\n649 10 Male\n650 11 Female\n651 13 Female\n```\n\n\n:::\n:::\n\n\n\nWe can remove select columns using indexing as well, OR by simply changing the column to `NULL`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, -5] #remove column 5, \"slum\" variable\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$slum <- NULL # this is the same as above\n```\n:::\n\n\n\nWe can also grab the `age` column using the `$` operator, again this is selecting 
the variable for all of the rows.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age\n```\n:::\n\n\n\n\n\n## Using indexing to subset by rows\n\nWe can use indexing to also subset by rows. For example, here we pull the 100th observation/row.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,] \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n100 8122 0.1818182 5 Female Non slum\n```\n\n\n:::\n:::\n\n\n\nAnd, here we pull the `age` of the 100th observation/row.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,\"age\"] \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 5\n```\n\n\n:::\n:::\n\n\n\n \n\n## Logical operators\n\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE\n\noperator | operator option |description\n-----|-----|-----:\n`<`|%l%|less than\n`<=`|%le%|less than or equal to\n`>`|%g%|greater than\n`>=`|%ge%|greater than or equal to\n`==`||equal to\n`!=`||not equal to\n`x&y`||x and y\n`x|y`||x or y\n`%in%`||match\n`%!in%`||do not match\n\n\n## Logical operators examples\n\nLet's practice. 
First, here is a reminder of what the number.object contains.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3\n```\n\n\n:::\n:::\n\n\n\n\nNow, we will use logical operators to evaluate the object.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object<4\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nnumber.object>=3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nnumber.object!=5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nnumber.object %in% c(6,7,2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n\n\nWe can use any of these logical operators to subset our data.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Overall mean\nmean(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 87.36826\n```\n\n\n:::\n\n```{.r .cell-code}\n# Mean for all children who are not age 3\nmean(df$IgG_concentration[df$age != 3], na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 90.32824\n```\n\n\n:::\n\n```{.r .cell-code}\n# Mean for all children who are between 0 and 3 or between 7 and 10 years old\nmean(df$IgG_concentration[df$age %in% c(0:3, 7:10)], na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 74.0914\n```\n\n\n:::\n:::\n\n\n\n\n## Using indexing and logical operators to rename columns\n\n1. 
We can assign the column names from data frame `df` to an object `cn`, modify `cn` directly using indexing and logical operators, and finally reassign the column names, `cn`, back to the data frame `df`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncn <- colnames(df)\ncn\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n\n\n:::\n\n```{.r .cell-code}\ncn==\"IgG_concentration\"\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE TRUE FALSE FALSE FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\ncn[cn==\"IgG_concentration\"] <- \"IgG_concentration_IU/mL\" #rename cn to \"IgG_concentration_IU/mL\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age\" \"gender\" \n[5] \"slum\" \n```\n\n\n:::\n:::\n\n\n\n\n
\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[colnames(df)==\"IgG_concentration_IU/mL\"] <- \"IgG_concentration\" #reset\n```\n:::\n\n\n\n\n\n## Using indexing and logical operators to subset data\n\n\nIn this example, we subset by rows, pull only observations with an age of less than or equal to 10, and save the subset data to `df_lte10`. Note that the logical expression `df$age<=10` comes before the comma because I want to subset by rows (the first dimension).\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10 <- df[df$age<=10, ]\n```\n:::\n\n\n\nLet's check that my subsets worked using the `summary()` function. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.0 3.0 4.0 4.8 7.0 10.0 9 \n```\n\n\n:::\n:::\n\n\n\n\n
\n\nIn the next example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte5_gt10 <- df[df$age<=5 | df$age>10, ]\n```\n:::\n\n\n\nLet's check that my subsets worked using the `summary()` function. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte5_gt10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.00 2.50 4.00 6.08 11.00 15.00 9 \n```\n\n\n:::\n:::\n\n\n\n\n\n## Missing values \n\nMissing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data and how you can identify them is the first step to data cleaning.\n\nTypes of \"missing\" values:\n\n- `NA` - **N**ot **A**vailable, general missing data\n- `NaN` - stands for \"**N**ot **a** **N**umber\", happens when you do 0/0.\n- `Inf` and `-Inf` - Infinity, happens when you divide a positive number (or negative number) by 0.\n- blank space - sometimes when data is read in, there is a blank space left\n- an empty string (e.g., `\"\"`) \n- `NULL` - undefined value that represents something that does not exist\n\n## Logical operators to help identify missing data\n\noperator |description\n-----|-----:\n`is.na`|is `NaN` or `NA`\n`is.nan`|is `NaN`\n`!is.na`|is not `NaN` or `NA`\n`!is.nan`|is not `NaN`\n`is.infinite`|is infinite\n`any`|are any TRUE\n`all`|all are TRUE\n`which`|which are TRUE\n\n## More logical operators examples\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- c(0,NA, -1)/0\ntest\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NaN NA -Inf\n```\n\n\n:::\n\n```{.r .cell-code}\nis.na(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE TRUE FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.nan(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.infinite(test)\n```\n\n::: {.cell-output 
.cell-output-stdout}\n\n```\n[1] FALSE FALSE TRUE\n```\n\n\n:::\n:::\n\n\n\n\n## More logical operators examples\n\n`any(is.na(x))` means do we have any `NA`'s in the object `x`?\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nany(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nany(is.na(df$slum)) # are there any NAs- NO/FALSE\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n:::\n\n\n\n\n`which(is.na(x))` means which of the elements in object `x` are `NA`'s?\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwhich(is.na(df$IgG_concentration)) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 13 55 57 72 182 406 414 478 488 595\n```\n\n\n:::\n\n```{.r .cell-code}\nwhich(is.na(df$slum)) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\ninteger(0)\n```\n\n\n:::\n:::\n\n\n\n\n## `subset()` function\n\nThe Base R `subset()` function is a slightly easier way to select variables and observations.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?subset\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nSubsetting Vectors, Matrices and Data Frames\n\nDescription:\n\n Return subsets of vectors, matrices or data frames which meet\n conditions.\n\nUsage:\n\n subset(x, ...)\n \n ## Default S3 method:\n subset(x, subset, ...)\n \n ## S3 method for class 'matrix'\n subset(x, subset, select, drop = FALSE, ...)\n \n ## S3 method for class 'data.frame'\n subset(x, subset, select, drop = FALSE, ...)\n \nArguments:\n\n x: object to be subsetted.\n\n subset: logical expression indicating elements or rows to keep:\n missing values are taken as false.\n\n select: expression, indicating columns to select from a data frame.\n\n drop: passed on to '[' indexing operator.\n\n ...: further arguments to be passed to or from other methods.\n\nDetails:\n\n This is a generic 
function, with methods supplied for matrices,\n data frames and vectors (including lists). Packages and users can\n add further methods.\n\n For ordinary vectors, the result is simply 'x[subset &\n !is.na(subset)]'.\n\n For data frames, the 'subset' argument works on the rows. Note\n that 'subset' will be evaluated in the data frame, so columns can\n be referred to (by name) as variables in the expression (see the\n examples).\n\n The 'select' argument exists only for the methods for data frames\n and matrices. It works by first replacing column names in the\n selection expression with the corresponding column numbers in the\n data frame and then using the resulting integer vector to index\n the columns. This allows the use of the standard indexing\n conventions so that for example ranges of columns can be specified\n easily, or single columns can be dropped (see the examples).\n\n The 'drop' argument is passed on to the indexing method for\n matrices and data frames: note that the default for matrices is\n different from that for indexing.\n\n Factors may have empty levels after subsetting; unused levels are\n not automatically removed. 
See 'droplevels' for a way to drop all\n unused levels from a data frame.\n\nValue:\n\n An object similar to 'x' contain just the selected elements (for a\n vector), rows and columns (for a matrix or data frame), and so on.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n functions like '[', and in particular the non-standard evaluation\n of argument 'subset' can have unanticipated consequences.\n\nAuthor(s):\n\n Peter Dalgaard and Brian Ripley\n\nSee Also:\n\n '[', 'transform' 'droplevels'\n\nExamples:\n\n subset(airquality, Temp > 80, select = c(Ozone, Temp))\n subset(airquality, Day == 1, select = -Temp)\n subset(airquality, select = Ozone:Wind)\n \n with(airquality, subset(Ozone, Temp > 80))\n \n ## sometimes requiring a logical 'subset' argument is a nuisance\n nm <- rownames(state.x77)\n start_with_M <- nm %in% grep(\"^M\", nm, value = TRUE)\n subset(state.x77, start_with_M, Illiteracy:Murder)\n # but in recent versions of R this can simply be\n subset(state.x77, grepl(\"^M\", nm), Illiteracy:Murder)\n\n\n\n\n## Subsetting using the `subset()` function\n\nHere are a few examples using the `subset()` function.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))\ndf_lt5_f <- subset(df, df$age<=5 & gender==\"Female\", select=c(IgG_concentration, slum))\n```\n:::\n\n\n\n\n## `subset()` function vs logical operators\n\n`subset()` automatically drops rows where the condition evaluates to `NA`, whereas indexing with a logical vector keeps those rows as `NA` rows.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age) #created with indexing\n```\n\n::: {.cell-output-display}\n\n\n| Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.| NA's|\n|----:|-------:|------:|----:|-------:|----:|----:|\n| 1| 3| 4| 4.8| 7| 10| 9|\n:::\n\n```{.r .cell-code}\nsummary(df_lte10_v2$age) #created with the subset function\n```\n\n::: {.cell-output-display}\n\n\n| Min.| 
1st Qu.| Median| Mean| 3rd Qu.| Max.|\n|----:|-------:|------:|----:|-------:|----:|\n| 1| 3| 4| 4.8| 7| 10|\n:::\n:::\n\n\n\n\nWe can also see this by looking at the number of rows in each dataset.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnrow(df_lte10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 504\n```\n\n\n:::\n\n```{.r .cell-code}\nnrow(df_lte10_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 495\n```\n\n\n:::\n:::\n\n\n\n\n\n\n## Summary\n\n- The `colnames()`, `str()` and `summary()` functions from Base R are used to assess the data type and some summary statistics\n- There are three basic indexing operators: `[`, `[[` and `$`\n- Indexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Logical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing\n- There are 7 “types” of missing values, the most common being “NA”\n- Logical operators meant to determine missing values are very helpful for data cleaning\n- The Base R `subset()` function is a slightly easier way to select variables and observations.\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n- [\"Indexing\" CRAN Project](https://cran.r-project.org/doc/manuals/R-lang.html#Indexing)\n- [\"Logical operators\" CRAN Project](https://cran.r-project.org/web/packages/extraoperators/vignettes/logicals-vignette.html)\n\n", + "supporting": [ + "Module06-DataSubset_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ], diff --git a/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json b/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json index b98cb35..1d964f3 100644 --- 
a/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json +++ b/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "659422f556ed54450a8839eee24c84dd", + "hash": "219f056618b943630a88b5d8b9278252", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 7: Variable Creation, Classes, and Summaries\"\nformat:\n revealjs:\n smaller: true\n scrollable: true\n toc: false\n---\n\n\n\n## Learning Objectives\n\nAfter module 7, you should be able to...\n\n- Create new variables\n- Characterize variable classes\n- Manipulate the classes of variables\n- Conduct 1 variable data summaries\n\n## Import data for this module\nLet's first read in the data from the previous module and look at it briefly with a new function `head()`. `head()` allows us to look at the first `n` observations.\n\n\n\n\n::: {.cell layout-align=\"left\"}\n::: {.cell-output-display}\n![](images/head_args.png){fig-align='left' width=100%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n\n## Adding new columns with `$` operator\n\nYou can add a new column, called `log_IgG` to `df`, using the `$` operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum log_IgG\n1 5772 0.3176895 2 Female Non slum -1.146681\n2 8095 3.4368231 4 Female Non slum 1.234548\n3 9784 0.3000000 4 Male Non slum -1.203973\n```\n\n\n:::\n:::\n\n\n\nNote, my use of the underscore in the variable name rather than a space. 
This is good coding practice and makes calling variables much less prone to error.\n\n## Adding new columns with `transform()`\n\nWe can also add a new column using the `transform()` function:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?transform\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. 
The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)\n```\n\n\n:::\n:::\n\n\n\n## Adding new columns with `transform()`\n\nFor example, adding a binary column for seropositivity called `seropos`:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |\n:::\n:::\n\n\n\n\n## Creating conditional variables\n\nOne frequently used tool is creating variables with conditions. 
A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which \"returns a value depending on whether the element of test is `TRUE` or `FALSE`.\"\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?ifelse\n```\n:::\n\nConditional Element Selection\n\nDescription:\n\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\n\nUsage:\n\n ifelse(test, yes, no)\n \nArguments:\n\n test: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\n\nDetails:\n\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\n\nValue:\n\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\n\nWarning:\n\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'if'.\n\nExamples:\n\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1\n\n\n\n\n## `ifelse` example\n\nReminder: the three arguments of the `ifelse()` function are `ifelse(test, yes, no)`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |young |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |young |\n:::\n:::\n\n\n\n## `ifelse` example\nLet's delve into what is actually happening, with a focus on the `NA` values in the `age` variable.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE TRUE 
TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\n## Nesting two `ifelse` statements example\n\n`ifelse(test1, yes_to_test1, ifelse(test2, no_to_test2_yes_to_test2, no_to_test1_no_to_test2))`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, 
\"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n```\n:::\n\n\n\nLet's use the `table()` function to check if it worked.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n```\n\n::: {.cell-output-display}\n\n\n|age/ | middle| old| young| NA|\n|:----|------:|---:|-----:|--:|\n|1 | 0| 0| 44| 0|\n|2 | 0| 0| 72| 0|\n|3 | 0| 0| 79| 0|\n|4 | 0| 0| 80| 0|\n|5 | 0| 0| 41| 0|\n|6 | 38| 0| 0| 0|\n|7 | 38| 0| 0| 0|\n|8 | 39| 0| 0| 0|\n|9 | 20| 0| 0| 0|\n|10 | 44| 0| 0| 0|\n|11 | 0| 41| 0| 0|\n|12 | 0| 23| 0| 0|\n|13 | 0| 35| 0| 0|\n|14 | 0| 37| 0| 0|\n|15 | 0| 11| 0| 0|\n|NA | 0| 0| 0| 9|\n:::\n:::\n\n\n\nNote, it puts the variable levels in alphabetical order, we will show how to change this later.\n\n# Data Classes\n\n## Overview - Data Classes\n\n1. One dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\n\n2. Two dimensional types (e.g., matrix, data frame, tibble)\n\n3. Special data classes (e.g., lists, dates). \n\n## \t`class()` function\n\nThe `class()` function allows you to evaluate the class of an object.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n## One dimensional data types\n\n* Character: strings or individual characters, quoted\n* Numeric: any real number(s)\n - Double: contains fractional values (i.e., double precision) - default numeric\n - Integer: any integer(s)/whole numbers\n* Logical: variables composed of TRUE or FALSE\n* Factor: categorical/qualitative variables\n\n## Character and numeric\n\nThis can also be a bit tricky. 
\n\nIf only one character in the whole vector, the class is assumed to be character\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(1, 2, \"tree\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\nHere because integers are in quotations, it is read as a character class by R.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(\"1\", \"4\", \"7\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\nNote, instead of creating a new vector object (e.g., `x <- c(\"1\", \"4\", \"7\")`) and then feeding the vector object `x` into the first argument of the `class()` function (e.g., `class(x)`), we combined the two steps and directly fed a vector object into the class function.\n\n## Numeric Subclasses\n\nThere are two major numeric subclasses\n\n1. `Double` is a special subset of `numeric` that contains fractional values. `Double` stands for [double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)\n2. `Integer` is a special subset of `numeric` that contains only whole numbers. \n\n`typeof()` identifies the vector type (double, integer, logical, or character), whereas `class()` identifies the root class. 
The difference between the two will become clearer when we look at two-dimensional classes below.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"double\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n:::\n\n\n\n\n## Logical\n\nReminder: `logical` is a type that has only three possible elements: `TRUE`, `FALSE`, and `NA`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"logical\"\n```\n\n\n:::\n:::\n\n\n\nNote that when creating a `logical` object, `TRUE` and `FALSE` are NOT in quotes. Putting R special values (e.g., `NA` or `FALSE`) in quotation marks turns them into character values. \n\n\n## Other useful functions for evaluating/setting classes\n\nThere are two useful functions associated with practically all R classes: \n\n- `is.CLASS_NAME(x)` to **logically check** whether or not `x` is of a certain class. For example, `is.integer` or `is.character` or `is.numeric`\n- `as.CLASS_NAME(x)` to **coerce between classes**, converting `x` from its current class into another class. For example, `as.integer` or `as.character` or `as.numeric`. 
This is particularly useful if, for example, an integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).\n\n## Examples `is.CLASS_NAME(x)`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis.numeric(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\n## Examples `as.CLASS_NAME(x)`\n\nIn some cases, coercing is seamless\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.character(c(1, 4, 7))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"1\" \"4\" \"7\"\n```\n\n\n:::\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 7\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\nIn some cases, coercion is not possible; if executed, it returns `NA`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7a\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: NAs introduced by coercion\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 NA\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE NA\n```\n\n\n:::\n:::\n\n\n\n\n## Factors\n\nA `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables. Use the `factor()` function to create factors from character values. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$age_group)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"factor\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"middle\" \"old\" \"young\" \n```\n\n\n:::\n:::\n\n\n\nNote 1: levels are, by default, set to **alphanumerical** order, and the first level is always the \"reference\" group. However, we often prefer a different reference group.\n\nNote 2: we can also make ordered factors using `factor(... ordered=TRUE)`, but we won't talk more about that.\n\n## Reference Groups \n\n**Why do we care about reference groups?** \n\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being 5 years old or younger is associated with higher IgG antibody concentrations. \n\nBy default, `middle` is the reference group, so we would only generate beta coefficients comparing `middle` to `young` AND `middle` to `old`. But we want `young` to be the reference group, so that we generate beta coefficients comparing `young` to `middle` AND `young` to `old`.\n\n## Changing factor reference \n\nChanging the reference group of a factor variable.\n\n- If the object is already a factor, then use the `relevel()` function and the `ref` argument to specify the reference.\n- If the object is a character, then use the `factor()` function and the `levels` argument to specify the order of the values, the first being the reference.\n\n\nLet's look at the `relevel()` help file\n\n\nReorder Levels of Factor\n\nDescription:\n\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. 
This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\n\nUsage:\n\n relevel(x, ref, ...)\n \nArguments:\n\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\n\nDetails:\n\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\n\nValue:\n\n A factor of the same length as 'x'.\n\nSee Also:\n\n 'factor', 'contr.treatment', 'levels', 'reorder'.\n\nExamples:\n\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\n\n\n
\n\nLet's look at the `factor()` help file\n\n\nFactors\n\nDescription:\n\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\n\nUsage:\n\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n\n x: a vector of data, usually taking a small number of distinct\n values.\n\n levels: an optional vector of the unique values (as character\n strings) that 'x' might have taken. The default is the\n unique set of values taken by 'as.character(x)', sorted into\n increasing order _of 'x'_. Note that this set can be\n specified as smaller than 'sort(unique(x))'.\n\n labels: _either_ an optional character vector of labels for the\n levels (in the same order as 'levels' after removing those in\n 'exclude'), _or_ a character string of length 1. Duplicated\n values in 'labels' can be used to map different values of 'x'\n to the same factor level.\n\n exclude: a vector of values to be excluded when forming the set of\n levels. This may be factor with the same level set as 'x' or\n should be a 'character'.\n\n ordered: logical flag to determine if the levels should be regarded as\n ordered (in the order given).\n\n nmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\n\n ifany: only add an 'NA' level if it is used, i.e. 
if\n 'any(is.na(x))'.\n\n object: an R object.\n\nDetails:\n\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and model-fitting functions may treat the two classes\n quite differently, see 'options(\"contrasts\")'.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). 
Under those circumstances\n missing values are currently printed as '', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\n\nValue:\n\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. 
It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject()'.\n\nWarning:\n\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\n\nComparison operators and group generic methods:\n\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. 
(The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\n\nNote:\n\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\n\nReferences:\n\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\n\nExamples:\n\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. 
<- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 \n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = \n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))\n\n\n\n\n## Changing factor reference examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\nOR\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\nArranging, tabulating, and plotting the data will reflect the new order\n\n\n## 
Two-dimensional data classes\n\nTwo-dimensional classes are those we would often use to store data read from a file \n\n* a matrix (`matrix` class)\n* a data frame (`data.frame` or `tibble` classes)\n\n\n## Matrices\n\nMatrices, like data frames, are composed of rows and columns. Unlike a `data.frame`, however, an entire matrix is composed of a single R class. **For example: all entries are `numeric`, or all entries are `character`**\n\n`as.matrix()` creates a matrix from a data frame (where all values are the same class). As a reminder, here is the signature of the `matrix()` function, which shows how to build a matrix:\n\n```\nmatrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)\n```\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol = 2) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 2|\n| 3| 4|\n| 5| 6|\n:::\n:::\n\n\n\nNote, the first matrix is filled with the numbers 1-6 by columns first and then rows because the default value of the `byrow` argument is `FALSE`. 
In the second matrix, we changed the argument `byrow` to `TRUE`, and now the numbers 1-6 are filled by rows first and then columns.\n\n## Data frame \n\nYou can transform an existing matrix into a data frame using `as.data.frame()` \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.data.frame(matrix(1:6, ncol = 2) ) \n```\n\n::: {.cell-output-display}\n\n\n| V1| V2|\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n:::\n\n\n\nYou can create a new data frame out of vectors (and potentially lists, but\nthis is an advanced feature and unusual) by using the `data.frame()` function.\nRecall that all of the vectors that make up a data frame must be the same\nlength.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlotr <- \n data.frame(\n name = c(\"Frodo\", \"Sam\", \"Aragorn\", \"Legolas\", \"Gimli\"),\n race = c(\"Hobbit\", \"Hobbit\", \"Human\", \"Elf\", \"Dwarf\"),\n age = c(53, 38, 87, 2931, 139)\n )\n```\n:::\n\n\n\n## Numeric variable data summary\n\nData summarization on numeric vectors/variables:\n\n-\t`mean()`: takes the mean of x\n-\t`sd()`: takes the standard deviation of x\n-\t`median()`: takes the median of x\n-\t`quantile()`: displays sample quantiles of x. The default is the minimum, the quartiles (25%, 50%, 75%), and the maximum\n-\t`range()`: displays the range. 
Same as `c(min(), max())`\n-\t`sum()`: sum of x\n-\t`max()`: maximum value in x\n-\t`min()`: minimum value in x\n- `colSums()`: get the column sums of a data frame\n- `rowSums()`: get the row sums of a data frame\n- `colMeans()`: get the column means of a data frame\n- `rowMeans()`: get the row means of a data frame\n\nNote, all of these functions have an `na.rm` **argument for missing data**.\n\n## Numeric variable data summary\n\nLet's look at the help file for `range()` to make note of the `na.rm` argument\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?range\n```\n:::\n\nRange of Values\n\nDescription:\n\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\n\nUsage:\n\n range(..., na.rm = FALSE)\n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n ## same for classes 'Date' and 'POSIXct'\n \n .rangeNum(..., na.rm, finite, isNumeric)\n \nArguments:\n\n ...: any 'numeric' or character objects.\n\n na.rm: logical, indicating if 'NA''s should be omitted.\n\n finite: logical, indicating if all non-finite elements should be\n omitted.\n\nisNumeric: a 'function' returning 'TRUE' or 'FALSE' when called on\n 'c(..., recursive = TRUE)', 'is.numeric()' for the default\n 'range()' method.\n\nDetails:\n\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\n\nS4 methods:\n\n This is part of the S4 'Summary' group generic. 
Methods for it\n must use the signature 'x, ..., na.rm'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\n\nExamples:\n\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)\n\n\n\n## Numeric variable data summary examples\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output-display}\n\n\n| |observation_id |IgG_concentration | age | gender | slum | log_IgG | seropos | age_group |age_group_factor |\n|:--|:--------------|:-----------------|:--------------|:----------------|:----------------|:---------------|:-------------|:----------------|:----------------|\n| |Min. :5006 |Min. : 0.0054 |Min. : 1.000 |Length:651 |Length:651 |Min. :-5.2231 |Mode :logical |Length:651 |young :316 |\n| |1st Qu.:6306 |1st Qu.: 0.3000 |1st Qu.: 3.000 |Class :character |Class :character |1st Qu.:-1.2040 |FALSE:360 |Class :character |middle:179 |\n| |Median :7495 |Median : 1.6658 |Median : 6.000 |Mode :character |Mode :character |Median : 0.5103 |TRUE :281 |Mode :character |old :147 |\n| |Mean :7492 |Mean : 87.3683 |Mean : 6.606 |NA |NA |Mean : 1.6074 |NA's :10 |NA |NA's : 9 |\n| |3rd Qu.:8749 |3rd Qu.:141.4405 |3rd Qu.:10.000 |NA |NA |3rd Qu.: 4.9519 |NA |NA |NA |\n| |Max. :9982 |Max. :916.4179 |Max. :15.000 |NA |NA |Max. 
: 6.8205 |NA |NA |NA |\n| |NA |NA's :10 |NA's :9 |NA |NA |NA's :10 |NA |NA |NA |\n:::\n\n```{.r .cell-code}\nrange(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NA NA\n```\n\n\n:::\n\n```{.r .cell-code}\nrange(df$age, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 15\n```\n\n\n:::\n\n```{.r .cell-code}\nmedian(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.665753\n```\n\n\n:::\n:::\n\n\n\n\n## Character variable data summaries\n\nData summarization on character or factor vectors/variables using `table()`\n\n\t\t\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?table\n```\n:::\n\nCross Tabulation and Table Creation\n\nDescription:\n\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\n\nUsage:\n\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\n\n exclude: levels to remove for all factors in '...'. If it does not\n contain 'NA' and 'useNA' is not specified, it implies 'useNA\n = \"ifany\"'. See 'Details' for its interpretation for\n non-factor arguments.\n\n useNA: whether to include 'NA' values in the table. See 'Details'.\n Can be abbreviated.\n\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\n\ndeparse.level: controls how the default 'dnn' is constructed. 
See\n 'Details'.\n\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\n\nrow.names: a character vector giving the row names for the data frame.\n\nresponseName: the name to be used for the column of table entries,\n usually counts.\n\nstringsAsFactors: logical: should the classifying factors be returned\n as factors (the default) or character vectors?\n\nsep, base: passed to 'provideDimnames'.\n\nDetails:\n\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. 
Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\n\nValue:\n\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\n\nExamples:\n\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), dnn = NULL) # dimnames() have no names\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? 
--- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))\n\n\n\n\n## Character variable data summary examples\n\nNumber of observations in each category\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|------:|----:|\n| 325| 326|\n:::\n\n```{.r .cell-code}\ntable(df$gender, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male| NA|\n|------:|----:|--:|\n| 325| 326| 0|\n:::\n\n```{.r .cell-code}\ntable(df$age_group, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young| NA|\n|------:|---:|-----:|--:|\n| 179| 147| 316| 9|\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)/nrow(df) #if no NA values\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|--------:|--------:|\n| 0.499232| 0.500768|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n:::\n\n\n\n\n## Summary\n\n- You can create new columns/variable to a 
data frame by using `$` or the `transform()` function\n- One useful function for creating new variables based on existing variables is the `ifelse()` function, which returns a value depending on whether the element of `test` is `TRUE` or `FALSE`\n- The `class()` function allows you to evaluate the class of an object.\n- There are two types of numeric class objects: integer and double\n- Logical class objects only have `TRUE` or `FALSE` (without quotes)\n- `is.CLASS_NAME(x)` can be used to test the class of an object x\n- `as.CLASS_NAME(x)` can be used to change the class of an object x\n- Factors are a special character class that has levels \n- There are many fairly intuitive data summary functions you can perform on a vector (e.g., `mean()`, `sd()`, `range()`) or on rows or columns of a data frame (e.g., `colSums()`, `colMeans()`, `rowSums()`)\n- The `table()` function builds frequency tables of the counts at each combination of categorical levels\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "markdown": "---\ntitle: \"Module 7: Variable Creation, Classes, and Summaries\"\nformat:\n revealjs:\n smaller: true\n scrollable: true\n toc: false\n---\n\n\n\n\n## Learning Objectives\n\nAfter module 7, you should be able to...\n\n- Create new variables\n- Characterize variable classes\n- Manipulate the classes of variables\n- Conduct one-variable data summaries\n\n## Import data for this module\nLet's first read in the data from the previous module and look at it briefly with a new function `head()`. 
`head()` allows us to look at the first `n` observations.\n\n\n\n\n\n::: {.cell layout-align=\"left\"}\n::: {.cell-output-display}\n![](images/head_args.png){fig-align='left' width=100%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n\n\n## Adding new columns with `$` operator\n\nYou can add a new column, called `log_IgG`, to `df` using the `$` operator:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum log_IgG\n1 5772 0.3176895 2 Female Non slum -1.146681\n2 8095 3.4368231 4 Female Non slum 1.234548\n3 9784 0.3000000 4 Male Non slum -1.203973\n```\n\n\n:::\n:::\n\n\n\n\nNote the use of an underscore in the variable name rather than a space. This is good coding practice and makes calling variables much less prone to error.\n\n## Adding new columns with `transform()`\n\nWe can also add a new column using the `transform()` function:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?transform\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 
'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)\n```\n\n\n:::\n:::\n\n\n\n\n## Adding new columns with `transform()`\n\nFor example, adding a binary column for seropositivity called `seropos`:\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |\n| 9338| 
143.2363014| 4|Male |Non slum | 4.9644957|TRUE |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |\n:::\n:::\n\n\n\n\n\n## Creating conditional variables\n\nOne frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which returns a value depending on whether the element of `test` is `TRUE`, `FALSE`, or `NA`.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?ifelse\n```\n:::\n\nConditional Element Selection\n\nDescription:\n\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\n\nUsage:\n\n ifelse(test, yes, no)\n \nArguments:\n\n test: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\n\nDetails:\n\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\n\nValue:\n\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\n\nWarning:\n\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'if'.\n\nExamples:\n\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1\n\n\n\n\n\n## `ifelse` example\n\nReminder of the first three arguments in the `ifelse()` function are `ifelse(test, yes, no)`.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n```\n\n::: {.cell-output-display}\n\n\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |young |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |young |\n:::\n:::\n\n\n\n\n## `ifelse` example\nLet's delve into what is actually happening, with a focus on the NA values in `age` variable.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
TRUE TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE 
TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\n\n## Nesting two `ifelse` statements example\n\n`ifelse(test1, yes_to_test1, ifelse(test2, no_to_test2_yes_to_test2, no_to_test1_no_to_test2))`.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age 
<= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\n```\n:::\n\n\n\n\nLet's use the `table()` function to check if it worked.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n```\n\n::: {.cell-output-display}\n\n\n|age/ | middle| old| young| NA|\n|:----|------:|---:|-----:|--:|\n|1 | 0| 0| 44| 0|\n|2 | 0| 0| 72| 0|\n|3 | 0| 0| 79| 0|\n|4 | 0| 0| 80| 0|\n|5 | 0| 0| 41| 0|\n|6 | 38| 0| 0| 0|\n|7 | 38| 0| 0| 0|\n|8 | 39| 0| 0| 0|\n|9 | 20| 0| 0| 0|\n|10 | 44| 0| 0| 0|\n|11 | 0| 41| 0| 0|\n|12 | 0| 23| 0| 0|\n|13 | 0| 35| 0| 0|\n|14 | 0| 37| 0| 0|\n|15 | 0| 11| 0| 0|\n|NA | 0| 0| 0| 9|\n:::\n:::\n\n\n\n\nNote that `table()` puts the variable levels in alphabetical order; we will show how to change this later.\n\n# Data Classes\n\n## Overview - Data Classes\n\n1. One dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\n\n2. Two dimensional types (e.g., matrix, data frame, tibble)\n\n3. Special data classes (e.g., lists, dates). \n\n## `class()` function\n\nThe `class()` function allows you to evaluate the class of an object.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\n\n## One dimensional data types\n\n* Character: strings or individual characters, quoted\n* Numeric: any real number(s)\n - Double: contains fractional values (i.e., double precision) - default numeric\n - Integer: any integer(s)/whole numbers\n* Logical: variables composed of TRUE or FALSE\n* Factor: categorical/qualitative variables\n\n## Character and numeric\n\nThis can also be a bit tricky. 
\n\nIf even one element of the vector is a character, the whole vector is assumed to be of class character\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(1, 2, \"tree\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\nHere, because the integers are in quotation marks, R reads them as the character class.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(\"1\", \"4\", \"7\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n:::\n\n\n\n\nNote, instead of creating a new vector object (e.g., `x <- c(\"1\", \"4\", \"7\")`) and then feeding the vector object `x` into the first argument of the `class()` function (e.g., `class(x)`), we combined the two steps and directly fed a vector object into the `class()` function.\n\n## Numeric Subclasses\n\nThere are two major numeric subclasses\n\n1. `Double` is a special subset of `numeric` that contains fractional values. `Double` stands for [double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)\n2. `Integer` is a special subset of `numeric` that contains only whole numbers. \n\n`typeof()` identifies the vector type (double, integer, logical, or character), whereas `class()` identifies the root class. 
The difference between the two will be clearer when we look at two-dimensional classes below.\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"numeric\"\n```\n\n\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"double\"\n```\n\n\n:::\n\n```{.r .cell-code}\ntypeof(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"integer\"\n```\n\n\n:::\n:::\n\n\n\n\n\n## Logical\n\nReminder: `logical` is a type with only three possible values: `TRUE`, `FALSE`, and `NA`\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"logical\"\n```\n\n\n:::\n:::\n\n\n\n\nNote that when creating a `logical` object, `TRUE` and `FALSE` are NOT in quotes. Putting R special values (e.g., `NA` or `FALSE`) in quotation marks turns them into character values. \n\n\n## Other useful functions for evaluating/setting classes\n\nThere are two useful functions associated with practically all R classes: \n\n- `is.CLASS_NAME(x)` to **logically check** whether or not `x` is of a certain class. For example, `is.integer` or `is.character` or `is.numeric`\n- `as.CLASS_NAME(x)` to **coerce between classes**, converting `x` from its current class into another class. For example, `as.integer` or `as.character` or `as.numeric`. 
This is particularly useful if, for example, an integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).\n\n## Examples `is.CLASS_NAME(x)`\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis.numeric(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] FALSE\n```\n\n\n:::\n\n```{.r .cell-code}\nis.character(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\n\n\n## Examples `as.CLASS_NAME(x)`\n\nIn some cases, coercing is seamless\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.character(c(1, 4, 7))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"1\" \"4\" \"7\"\n```\n\n\n:::\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 7\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE FALSE\n```\n\n\n:::\n:::\n\n\n\n\nIn some cases, coercion is not possible; the elements that cannot be converted are returned as `NA`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7a\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: NAs introduced by coercion\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 4 NA\n```\n\n\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE FALSE NA\n```\n\n\n:::\n:::\n\n\n\n\n\n## Factors\n\nA `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables. Use the `factor()` function to create factors from character values. 
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$age_group)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"character\"\n```\n\n\n:::\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"factor\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"middle\" \"old\" \"young\" \n```\n\n\n:::\n:::\n\n\n\n\nNote 1: levels are, by default, set to **alphanumerical** order, and the first level is always the \"reference\" group. However, we often prefer a different reference group.\n\nNote 2: we can also make ordered factors using `factor(..., ordered=TRUE)`, but we won't talk more about that.\n\n## Reference Groups \n\n**Why do we care about reference groups?** \n\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being 5 years old or younger is associated with higher IgG antibody concentrations.\n\nBy default, `middle` is the reference group, so we would only generate beta coefficients comparing `middle` to `young` and `middle` to `old`. Instead, we want `young` to be the reference group, so that we generate beta coefficients comparing `young` to `middle` and `young` to `old`.\n\n## Changing factor reference \n\nChanging the reference group of a factor variable.\n\n- If the object is already a factor, then use the `relevel()` function and the `ref` argument to specify the reference.\n- If the object is a character, then use the `factor()` function and the `levels` argument to specify the order of the values, the first being the reference.\n\n\nLet's look at the `relevel()` help file\n\n\n\nReorder Levels of Factor\n\nDescription:\n\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. 
This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\n\nUsage:\n\n relevel(x, ref, ...)\n \nArguments:\n\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\n\nDetails:\n\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\n\nValue:\n\n A factor of the same length as 'x'.\n\nSee Also:\n\n 'factor', 'contr.treatment', 'levels', 'reorder'.\n\nExamples:\n\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\n\n\n\n
\n\nLet's look at the `factor()` help file\n\n\n\nFactors\n\nDescription:\n\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\n\nUsage:\n\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n\n x: a vector of data, usually taking a small number of distinct\n values.\n\n levels: an optional vector of the unique values (as character\n strings) that 'x' might have taken. The default is the\n unique set of values taken by 'as.character(x)', sorted into\n increasing order _of 'x'_. Note that this set can be\n specified as smaller than 'sort(unique(x))'.\n\n labels: _either_ an optional character vector of labels for the\n levels (in the same order as 'levels' after removing those in\n 'exclude'), _or_ a character string of length 1. Duplicated\n values in 'labels' can be used to map different values of 'x'\n to the same factor level.\n\n exclude: a vector of values to be excluded when forming the set of\n levels. This may be factor with the same level set as 'x' or\n should be a 'character'.\n\n ordered: logical flag to determine if the levels should be regarded as\n ordered (in the order given).\n\n nmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\n\n ifany: only add an 'NA' level if it is used, i.e. 
if\n 'any(is.na(x))'.\n\n object: an R object.\n\nDetails:\n\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and model-fitting functions may treat the two classes\n quite differently, see 'options(\"contrasts\")'.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). 
Under those circumstances\n missing values are currently printed as '', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\n\nValue:\n\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. 
It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject()'.\n\nWarning:\n\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\n\nComparison operators and group generic methods:\n\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. 
(The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\n\nNote:\n\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\n\nReferences:\n\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\n\nExamples:\n\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. 
<- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 \n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = \n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))\n\n\n\n\n\n## Changing factor reference examples\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\n\nOR\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"young\" \"middle\" \"old\" \n```\n\n\n:::\n:::\n\n\n\n\nArranging, tabulating, and plotting the data will reflect the new 
order.\n\n\n## Two-dimensional data classes\n\nTwo-dimensional classes are those we would often use to store data read from a file \n\n* a matrix (`matrix` class)\n* a data frame (`data.frame` or `tibble` classes)\n\n\n## Matrices\n\nMatrices, like data frames, are composed of rows and columns. Unlike a `data.frame`, however, an entire matrix is composed of a single R class. **For example: all entries are `numeric`, or all entries are `character`**\n\n`as.matrix()` creates a matrix from a data frame (where all values are the same class). As a reminder, here is the `matrix()` function signature to show how to build a matrix:\n\n```\nmatrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)\n```\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol = 2) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n```\n\n::: {.cell-output-display}\n\n\n| | |\n|--:|--:|\n| 1| 2|\n| 3| 4|\n| 5| 6|\n:::\n:::\n\n\n\n\nNote that the first matrix is filled with the numbers 1-6 by column first and then by row, because the default value of the `byrow` argument is `FALSE`. 
In the second matrix, we changed the `byrow` argument to `TRUE`, so the numbers 1-6 are filled by row first and then by column.\n\n## Data frame \n\nYou can transform an existing matrix into a data frame using `as.data.frame()` \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.data.frame(matrix(1:6, ncol = 2) ) \n```\n\n::: {.cell-output-display}\n\n\n| V1| V2|\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n:::\n\n\n\n\nYou can create a new data frame out of vectors (and potentially lists, but\nthis is an advanced feature and unusual) by using the `data.frame()` function.\nRecall that all of the vectors that make up a data frame must be the same\nlength.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlotr <- \n data.frame(\n name = c(\"Frodo\", \"Sam\", \"Aragorn\", \"Legolas\", \"Gimli\"),\n race = c(\"Hobbit\", \"Hobbit\", \"Human\", \"Elf\", \"Dwarf\"),\n age = c(53, 38, 87, 2931, 139)\n )\n```\n:::\n\n\n\n\n## Numeric variable data summary\n\nData summarization on numeric vectors/variables:\n\n-\t`mean()`: takes the mean of x\n-\t`sd()`: takes the standard deviation of x\n-\t`median()`: takes the median of x\n-\t`quantile()`: displays sample quantiles of x. The default is the min, the quartiles (25%, 50%, 75%), and the max\n-\t`range()`: displays the range. 
Same as `c(min(), max())`\n-\t`sum()`: sum of x\n-\t`max()`: maximum value in x\n-\t`min()`: minimum value in x\n- `colSums()`: get the columns sums of a data frame\n- `rowSums()`: get the row sums of a data frame\n- `colMeans()`: get the columns means of a data frame\n- `rowMeans()`: get the row means of a data frame\n\nNote, all of these functions have an `na.rm` **argument for missing data**.\n\n## Numeric variable data summary\n\nLet's look at a help file for `range()` to make note of the `na.rm` argument\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?range\n```\n:::\n\nRange of Values\n\nDescription:\n\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\n\nUsage:\n\n range(..., na.rm = FALSE)\n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n ## same for classes 'Date' and 'POSIXct'\n \n .rangeNum(..., na.rm, finite, isNumeric)\n \nArguments:\n\n ...: any 'numeric' or character objects.\n\n na.rm: logical, indicating if 'NA''s should be omitted.\n\n finite: logical, indicating if all non-finite elements should be\n omitted.\n\nisNumeric: a 'function' returning 'TRUE' or 'FALSE' when called on\n 'c(..., recursive = TRUE)', 'is.numeric()' for the default\n 'range()' method.\n\nDetails:\n\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\n\nS4 methods:\n\n This is part of the S4 'Summary' group generic. 
Methods for it\n must use the signature 'x, ..., na.rm'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\n\nExamples:\n\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)\n\n\n\n\n## Numeric variable data summary examples\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output-display}\n\n\n| |observation_id |IgG_concentration | age | gender | slum | log_IgG | seropos | age_group |age_group_factor |\n|:--|:--------------|:-----------------|:--------------|:----------------|:----------------|:---------------|:-------------|:----------------|:----------------|\n| |Min. :5006 |Min. : 0.0054 |Min. : 1.000 |Length:651 |Length:651 |Min. :-5.2231 |Mode :logical |Length:651 |young :316 |\n| |1st Qu.:6306 |1st Qu.: 0.3000 |1st Qu.: 3.000 |Class :character |Class :character |1st Qu.:-1.2040 |FALSE:360 |Class :character |middle:179 |\n| |Median :7495 |Median : 1.6658 |Median : 6.000 |Mode :character |Mode :character |Median : 0.5103 |TRUE :281 |Mode :character |old :147 |\n| |Mean :7492 |Mean : 87.3683 |Mean : 6.606 |NA |NA |Mean : 1.6074 |NA's :10 |NA |NA's : 9 |\n| |3rd Qu.:8749 |3rd Qu.:141.4405 |3rd Qu.:10.000 |NA |NA |3rd Qu.: 4.9519 |NA |NA |NA |\n| |Max. :9982 |Max. :916.4179 |Max. :15.000 |NA |NA |Max. 
: 6.8205 |NA |NA |NA |\n| |NA |NA's :10 |NA's :9 |NA |NA |NA's :10 |NA |NA |NA |\n:::\n\n```{.r .cell-code}\nrange(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NA NA\n```\n\n\n:::\n\n```{.r .cell-code}\nrange(df$age, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1 15\n```\n\n\n:::\n\n```{.r .cell-code}\nmedian(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1.665753\n```\n\n\n:::\n:::\n\n\n\n\n\n## Character variable data summaries\n\nData summarization on character or factor vectors/variables using `table()`\n\n\t\t\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?table\n```\n:::\n\nCross Tabulation and Table Creation\n\nDescription:\n\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\n\nUsage:\n\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\n\n exclude: levels to remove for all factors in '...'. If it does not\n contain 'NA' and 'useNA' is not specified, it implies 'useNA\n = \"ifany\"'. See 'Details' for its interpretation for\n non-factor arguments.\n\n useNA: whether to include 'NA' values in the table. See 'Details'.\n Can be abbreviated.\n\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\n\ndeparse.level: controls how the default 'dnn' is constructed. 
See\n 'Details'.\n\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\n\nrow.names: a character vector giving the row names for the data frame.\n\nresponseName: the name to be used for the column of table entries,\n usually counts.\n\nstringsAsFactors: logical: should the classifying factors be returned\n as factors (the default) or character vectors?\n\nsep, base: passed to 'provideDimnames'.\n\nDetails:\n\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. 
Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\n\nValue:\n\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\n\nExamples:\n\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), dnn = NULL) # dimnames() have no names\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? 
--- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))\n\n\n\n\n\n## Character variable data summary examples\n\nNumber of observations in each category\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|------:|----:|\n| 325| 326|\n:::\n\n```{.r .cell-code}\ntable(df$gender, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male| NA|\n|------:|----:|--:|\n| 325| 326| 0|\n:::\n\n```{.r .cell-code}\ntable(df$age_group, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young| NA|\n|------:|---:|-----:|--:|\n| 179| 147| 316| 9|\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)/nrow(df) #if no NA values\n```\n\n::: {.cell-output-display}\n\n\n| Female| Male|\n|--------:|--------:|\n| 0.499232| 0.500768|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n```\n\n::: {.cell-output-display}\n\n\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n:::\n\n\n\n\n\n## Summary\n\n- You can create new columns/variable 
to a data frame by using `$` or the `transform()` function\n- One useful function for creating new variables based on existing variables is the `ifelse()` function, which returns a value depending on whether the corresponding element of `test` is `TRUE` or `FALSE`\n- The `class()` function allows you to evaluate the class of an object.\n- There are two types of numeric class objects: integer and double\n- Logical class objects only have `TRUE` or `FALSE` or `NA` (without quotes)\n- `is.CLASS_NAME(x)` can be used to test the class of an object x\n- `as.CLASS_NAME(x)` can be used to change the class of an object x\n- Factors are a special character class that has levels \n- There are many fairly intuitive data summary functions you can perform on a vector (e.g., `mean()`, `sd()`, `range()`) or on rows or columns of a data frame (e.g., `colSums()`, `colMeans()`, `rowSums()`)\n- The `table()` function builds frequency tables of the counts at each combination of categorical levels\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", "supporting": [ "Module07-VarCreationClassesSummaries_files" ], diff --git a/_freeze/modules/Module10-DataVisualization/execute-results/html.json b/_freeze/modules/Module10-DataVisualization/execute-results/html.json index da5033b..a2a9b00 100644 --- a/_freeze/modules/Module10-DataVisualization/execute-results/html.json +++ b/_freeze/modules/Module10-DataVisualization/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "5b8e85381b52e1759a2cb5aa0c191c40", + "hash": "d9183bfceea5026fb81db2ef5b4efdfa", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Module 10: Data Visualization\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n## Learning Objectives\n\nAfter module 10, you should be able to:\n\n- Create Base R 
plots\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n## Prep data\n\nCreate the `age_group` three-level factor variable\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\n\nCreate the `seropos` binary variable, which represents seropositivity when the antibody concentration is at least 10 IU/mL.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)\n```\n:::\n\n\n\n## Base R data visualization functions\n\nThe Base R 'graphics' package has a ton of graphics options. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhelp(package = \"graphics\")\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\t\tInformation on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.3.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team \nContact: R-help mailing list \nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.3.1\nNeedsCompilation: yes\nBuilt: R 4.3.1; aarch64-apple-darwin20; 2023-06-16\n 21:53:01 UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot 
Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline\n```\n\n\n:::\n:::\n\n\n\n\n\n## Base R Plotting\n\nTo make a plot you often need to specify the following features:\n\n1. Parameters\n2. Plot attributes\n3. The legend\n\n## 1. Parameters\n\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via `par()` before you call the plot creates ‘global’ settings for your plot.\n\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\n-\tThe `mfrow` specifies that we have one row and two columns of plots — that is, two plots side by side. 
\n-\tThe `mar` attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third indicating the margin above the plot (4), and the fourth indicating the margin to the right of the plot (1).\n\n```\npar(mfrow = c(1,2), mar = c(5,5,4,1))\n```\n\n\n## 1. Parameters\n\n\n\n::: {.cell figwidth='100%'}\n::: {.cell-output-display}\n![](images/par.png)\n:::\n:::\n\n\n\n\n## Lots of parameter options\n\nHowever, there are many more parameter options that can be specified in the 'global' settings or specific to a certain plot option. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?par\n```\n:::\n\nSet or Query Graphical Parameters\n\nDescription:\n\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\n\nUsage:\n\n par(..., no.readonly = FALSE)\n \n (...., = )\n \nArguments:\n\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\n\nno.readonly: logical; if 'TRUE' and there are no other arguments, only\n parameters are returned which can be set by a subsequent\n 'par()' call _on the same device_.\n\nDetails:\n\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. 
('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\n\nValue:\n\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. 
Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\n\nGraphical Parameters:\n\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. 
For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. 
Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. 
Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 
'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. 
See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. 
Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. 
Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. (The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. 
It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. 
See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\n\nColor Specification:\n\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. 
In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\n\nLine Type Specification:\n\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\n\nNote:\n\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\n\nExamples:\n\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))\n\n\n\n## Common parameter options\n\nEight useful parameter arguments help improve the readability of the plot:\n\n- `xlab`: specifies the x-axis label of the plot\n- `ylab`: specifies the y-axis label\n- `main`: titles your graph\n- `pch`: specifies the plotting symbol used for points\n- `lty`: specifies the line type of your graph\n- `lwd`: specifies line thickness\n- `cex`: specifies the size of plotting symbols and text\n- `col`: specifies the colors for your graph\n\nWe will explore use of these arguments below.\n\n## Common parameter options\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/atrributes.png){width=200%}\n:::\n:::\n\n\n\n\n## 2. Plot Attributes\n\nPlot attributes are those that map your data to the plot. This means this is where you specify which variables in the data frame you want to plot. \n\nWe will only look at four types of plots today:\n\n- `hist()` displays a histogram of one variable\n- `plot()` displays an x-y plot of two variables\n- `boxplot()` displays a boxplot \n- `barplot()` displays a barplot\n\n\n## `hist()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?hist\n```\n:::\n\nHistograms\n\nDescription:\n\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\n\nUsage:\n\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n\n x: a vector of values for which the histogram is desired.\n\n breaks: one of:\n\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\n freq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\n\nprobability: an _alias_ for '!freq', for S compatibility.\n\ninclude.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'\n value will be included in the first (or last, for 'right =\n FALSE') bar. 
This will be ignored (with a warning) unless\n 'breaks' is a vector.\n\n right: logical; if 'TRUE', the histogram cells are right-closed\n (left open) intervals.\n\n fuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\n\n density: the density of shading lines, in lines per inch. The default\n value of 'NULL' means that no shading lines are drawn.\n Non-positive values of 'density' also inhibit the drawing of\n shading lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise).\n\n col: a colour to be used to fill the bars.\n\n border: the color of the border around the bars. The default is to\n use the standard foreground color.\n\nmain, xlab, ylab: main title and axis labels: these arguments to\n 'title()' get \"smart\" defaults here, e.g., the default 'ylab'\n is '\"Frequency\"' iff 'freq' is true.\n\nxlim, ylim: the range of x and y values with sensible defaults. Note\n that 'xlim' is _not_ used to define the histogram (breaks),\n but only for plotting (when 'plot = TRUE').\n\n axes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\n plot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\n\n labels: logical or character string. Additionally draw labels on top\n of bars, if not 'FALSE'; see 'plot.histogram'.\n\n nclass: numeric (integer). For S(-PLUS) compatibility only, 'nclass'\n is equivalent to 'breaks' for a scalar or character argument.\n\nwarn.unused: logical. 
If 'plot = FALSE' and 'warn.unused = TRUE', a\n warning will be issued when graphical parameters are passed\n to 'hist.default()'.\n\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\n\nDetails:\n\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. 
Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\n\nValue:\n\n an object of class '\"histogram\"' which is a list with components:\n\n breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).\n These are the nominal breaks, not with the boundary fuzz.\n\n counts: n integers; for each cell, the number of 'x[]' inside.\n\n density: values f^(x[i]), as estimated density values. If\n 'all(diff(breaks) == 1)', they are the relative frequencies\n 'counts/n' and in general satisfy sum[i; f^(x[i])\n (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.\n\n mids: the n cell midpoints.\n\n xname: a character string with the actual 'x' argument name.\n\nequidist: logical, indicating if the distances between 'breaks' are all\n the same.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\n\nSee Also:\n\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\n\nExamples:\n\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... 
:\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)\n\n\n\n## `hist()` example\n\nReminder function signature\n```\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhist(df$age)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png){width=960}\n:::\n\n```{.r .cell-code}\nhist(\n\tdf$age, \n\tfreq=FALSE, \n\tmain=\"Histogram\", \n\txlab=\"Age (years)\"\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png){width=960}\n:::\n:::\n\n\n\n\n## `plot()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?plot\n```\n:::\n\nGeneric X-Y Plotting\n\nDescription:\n\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\n\nUsage:\n\n plot(x, y, ...)\n \nArguments:\n\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: Arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\n\nDetails:\n\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\n\nNote:\n\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\n\nSee Also:\n\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\n\nExamples:\n\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")\n\n\n\n\n## `plot()` example\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(df$age, df$IgG_concentration)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png){width=960}\n:::\n\n```{.r .cell-code}\nplot(\n\tdf$age, \n\tdf$IgG_concentration, \n\ttype=\"p\", \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age (years)\", \n\tylab=\"IgG Concentration (IU/mL)\", \n\tpch=16, \n\tcex=0.9,\n\tcol=\"lightblue\")\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png){width=960}\n:::\n:::\n\n\n\n## Adding more stuff to the same plot\n\n* We can use the functions `points()` or `lines()` to add additional points\nor additional lines to an existing plot.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(\n\tdf$age[df$slum == \"Non slum\"],\n\tdf$IgG_concentration[df$slum == \"Non slum\"],\n\ttype = \"p\",\n\tmain = \"IgG Concentration vs Age\",\n\txlab = \"Age (years)\",\n\tylab = \"IgG Concentration (IU/mL)\",\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"lightblue\",\n\txlim = range(df$age, na.rm = TRUE),\n\tylim = range(df$IgG_concentration, na.rm = TRUE)\n)\npoints(\n\tdf$age[df$slum == \"Mixed\"],\n\tdf$IgG_concentration[df$slum == \"Mixed\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = 
\"blue\"\n)\npoints(\n\tdf$age[df$slum == \"Slum\"],\n\tdf$IgG_concentration[df$slum == \"Slum\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"darkblue\"\n)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png){width=960}\n:::\n:::\n\n\n\n* The `lines()` function works similarly for connected lines.\n* Note that the `points()` or `lines()` functions must be called with a `plot()`-style function\n* We will show how we could draw a `legend()` in a future section.\n\n\n## `boxplot()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?boxplot\n```\n:::\n\nBox Plots\n\nDescription:\n\n Produce box-and-whisker plot(s) of the given (grouped) values.\n\nUsage:\n\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\n\n formula: a formula, such as 'y ~ grp', where 'y' is a numeric vector\n of data values to be split into groups according to the\n grouping variable 'grp' (usually a factor). Note that '~ g1\n + g2' is equivalent to 'g1:g2'.\n\n data: a data.frame (or list) from which the variables in 'formula'\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used for plotting.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. 
The default is to ignore missing values in\n either the response or the group.\n\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty\n default. Can be suppressed by 'ann=FALSE'.\n\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\n\ndrop, sep, lex.order: passed to 'split.default', see there.\n\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\n\n range: this determines how far the plot whiskers extend out from the\n box. If 'range' is positive, the whiskers extend to the most\n extreme data point which is no more than 'range' times the\n interquartile range from the box. A value of zero causes the\n whiskers to extend to the data extremes.\n\n width: a vector giving the relative widths of the boxes making up\n the plot.\n\nvarwidth: if 'varwidth' is 'TRUE', the boxes are drawn with widths\n proportional to the square-roots of the number of\n observations in the groups.\n\n notch: if 'notch' is 'TRUE', a notch is drawn in each side of the\n boxes. If the notches of two plots do not overlap this is\n 'strong evidence' that the two medians differ (Chambers _et\n al_, 1983, p. 62). 
See 'boxplot.stats' for the calculations\n used.\n\n outline: if 'outline' is not true, the outliers are not drawn (as\n points whereas S+ uses lines).\n\n names: group labels which will be printed under each boxplot. Can\n be a character vector or an expression (see plotmath).\n\n boxwex: a scale factor to be applied to all boxes. When there are\n only a few groups, the appearance of the plot can be improved\n by making the boxes narrower.\n\nstaplewex: staple line width expansion, proportional to box width.\n\n outwex: outlier line width expansion, proportional to box width.\n\n plot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\n\n border: an optional vector of colors for the outlines of the\n boxplots. The values in 'border' are recycled if the length\n of 'border' is less than the number of plots.\n\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\n pars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\n\nhorizontal: logical indicating if the boxplots should be horizontal;\n default 'FALSE' means vertical boxes.\n\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\n\nDetails:\n\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\n\nValue:\n\n List with the following components:\n\n stats: a matrix, each column contains the extreme of the lower\n whisker, the lower hinge, the median, the upper hinge and the\n extreme of the upper whisker for one group/plot. If all the\n inputs have the same class attribute, so will this component.\n\n n: a vector with the number of (non-'NA') observations in each\n group.\n\n conf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\n\n group: a vector of the same length as 'out' whose elements indicate\n to which group the outlier belongs.\n\n names: a vector of names for the groups.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). 
_Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\n\nSee Also:\n\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\n\nExamples:\n\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)\n\n\n\n\n## `boxplot()` example\n\nReminder function signature\n```\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxplot(IgG_concentration~age_group, data=df)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png){width=960}\n:::\n\n```{.r .cell-code}\nboxplot(\n\tlog(df$IgG_concentration)~df$age_group, \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age Group (years)\", \n\tylab=\"log IgG Concentration (mIU/mL)\", \n\tnames=c(\"1-5\",\"6-10\", \"11-15\"), \n\tvarwidth=T\n\t)\n```\n\n::: 
{.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png){width=960}\n:::\n:::\n\n\n\n\n## `barplot()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?barplot\n```\n:::\n\nBar Plots\n\nDescription:\n\n Creates a bar plot with vertical or horizontal bars.\n\nUsage:\n\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\n\n height: either a vector or matrix of values describing the bars which\n make up the plot. If 'height' is a vector, the plot consists\n of a sequence of rectangular bars with heights given by the\n values in the vector. If 'height' is a matrix and 'beside'\n is 'FALSE' then each bar of the plot corresponds to a column\n of 'height', with the values in the column giving the heights\n of stacked sub-bars making up the bar. If 'height' is a\n matrix and 'beside' is 'TRUE', then the values in each column\n are juxtaposed rather than stacked.\n\n width: optional vector of bar widths. Re-cycled to length the number\n of bars drawn. Specifying a single value will have no\n visible effect unless 'xlim' is specified.\n\n space: the amount of space (as a fraction of the average bar width)\n left before each bar. May be given as a single number or one\n number per bar. 
If 'height' is a matrix and 'beside' is\n 'TRUE', 'space' may be specified by two numbers, where the\n first is the space between bars in the same group, and the\n second the space between the groups. If not given\n explicitly, it defaults to 'c(0,1)' if 'height' is a matrix\n and 'beside' is 'TRUE', and to 0.2 otherwise.\n\nnames.arg: a vector of names to be plotted below each bar or group of\n bars. If this argument is omitted, then the names are taken\n from the 'names' attribute of 'height' if this is a vector,\n or the column names if it is a matrix.\n\nlegend.text: a vector of text used to construct a legend for the plot,\n or a logical indicating whether a legend should be included.\n This is only useful when 'height' is a matrix. In that case\n given legend labels should correspond to the rows of\n 'height'; if 'legend.text' is true, the row names of 'height'\n will be used as labels if they are non-null.\n\n beside: a logical value. If 'FALSE', the columns of 'height' are\n portrayed as stacked bars, and if 'TRUE' the columns are\n portrayed as juxtaposed bars.\n\n horiz: a logical value. If 'FALSE', the bars are drawn vertically\n with the first bar to the left. If 'TRUE', the bars are\n drawn horizontally with the first at the bottom.\n\n density: a vector giving the density of shading lines, in lines per\n inch, for the bars or bar components. The default value of\n 'NULL' means that no shading lines are drawn. Non-positive\n values of 'density' also inhibit the drawing of shading\n lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise), for the bars or bar components.\n\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\n\n border: the color to be used for the border of the bars. Use 'border\n = NA' to omit borders. 
If there are shading lines, 'border =\n TRUE' means use the same colour for the border as for the\n shading lines.\n\nmain,sub: main title and subtitle for the plot.\n\n xlab: a label for the x axis.\n\n ylab: a label for the y axis.\n\n xlim: limits for the x axis.\n\n ylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\n axes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\n\naxisnames: logical. If 'TRUE', and if there are 'names.arg' (see\n above), the other axis is drawn (with 'lty = 0') and labeled.\n\ncex.axis: expansion factor for numeric axis labels (see 'par('cex')').\n\ncex.names: expansion factor for axis names (bar labels).\n\n inside: logical. If 'TRUE', the lines which divide adjacent\n (non-stacked!) bars will be drawn. Only applies when 'space\n = 0' (which it partly is when 'beside = TRUE').\n\n plot: logical. If 'FALSE', nothing is plotted.\n\naxis.lty: the graphics parameter 'lty' (see 'par('lty')') applied to\n the axis and tick marks of the categorical (default\n horizontal) axis. Note that by default the axis is\n suppressed.\n\n offset: a vector indicating how much the bars should be shifted\n relative to the x axis.\n\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\n\nargs.legend: list of additional arguments to pass to 'legend()'; names\n of the list are used as argument names. Only used if\n 'legend.text' is supplied.\n\n formula: a formula where the 'y' variables are numeric data to plot\n against the categorical 'x' variables. 
The formula can have\n one of three forms:\n\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\n data: a data frame (or list) from which the variables in formula\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\n\nValue:\n\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\n\nAuthor(s):\n\n R Core, with a contribution by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\n\nExamples:\n\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 
2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))\n\n\n\n\n## `barplot()` example\n\nThe function takes a lot of arguments to control the way our data is plotted.
\n\nReminder function signature\n```\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n```\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png){width=960}\n:::\n\n```{.r .cell-code}\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png){width=960}\n:::\n:::\n\n\n\n## 3. Legend!\n\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?legend\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. 
Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. 
The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. 
Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. 
If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n 
text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 
1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 
4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)\n```\n\n\n:::\n:::\n\n\n\n\n\n## Add legend to the plot\n\nReminder function signature\n```\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age Group\")\nlegend(x=2.5, y=0.5,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n## Add legend to the plot\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png){width=960}\n:::\n:::\n\n\n\n\n## `barplot()` example\n\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). 
Also, the age groups need more meaningful names.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n## `barplot()` example\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png){width=960}\n:::\n:::\n\n\n\n\n\n## `barplot()` example\n\nNow, let's look at seropositivity by two individual-level characteristics in the same plot. \n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n## `barplot()` example\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png){width=960}\n:::\n:::\n\n\n\n## Base R plots vs the Tidyverse ggplot2 package\n\nIt is good to know both because each has its strengths.\n\n## Summary\n\n- the Base R 'graphics' package has a ton of graphics options that allow for ultimate flexibility\n- Base R plots typically include setting plot options (`par()`), mapping data to the plot (e.g., `plot()`,
`barplot()`, `points()`, `lines()`), and creating a legend (`legend()`). \n- the functions `points()` or `lines()` add additional points or additional lines to an existing plot, so they must be called after a `plot()`-style function has created the plot\n- in Base R plotting the legend is not automatically generated, so be careful when creating it\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Base Plotting in R\" by Medium](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22)\n-\t\t[\"Base R margins: a cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)\n", + "markdown": "---\ntitle: \"Module 10: Data Visualization\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n\n\n## Learning Objectives\n\nAfter module 10, you should be able to:\n\n- Create Base R plots\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n\n## Prep data\n\nCreate a three-level factor variable, `age_group`.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\n\n\nCreate a binary variable, `seropos`, that marks an individual as seropositive if the antibody concentration is 10 IU/mL or higher.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)\n```\n:::\n\n\n\n\n## Base R data visualization functions\n\nThe Base R 'graphics' package has a ton of graphics options.
\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhelp(package = \"graphics\")\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\t\tInformation on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.4.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team \nContact: R-help mailing list \nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.4.1\nNeedsCompilation: yes\nEnhances: vcd\nBuilt: R 4.4.1; x86_64-w64-mingw32; 2024-06-14 08:20:40\n UTC; windows\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs 
Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline\n```\n\n\n:::\n:::\n\n\n\n\n\n\n## Base R Plotting\n\nTo make a plot you often need to specify the following features:\n\n1. Parameters\n2. Plot attributes\n3. The legend\n\n## 1. Parameters\n\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via `par()` before you call the plot creates ‘global’ settings for your plot.\n\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\n-\tThe `mfrow` specifies that we have one row and two columns of plots — that is, two plots side by side. 
\n-\tThe `mar` attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third, the top of the plot (4), and the fourth to the right (1).\n\n```\npar(mfrow = c(1,2), mar = c(5,5,4,1))\n```\n\n\n## 1. Parameters\n\n\n\n\n::: {.cell figwidth='100%'}\n::: {.cell-output-display}\n![](images/par.png)\n:::\n:::\n\n\n\n\n\n## Lots of parameter options\n\nHowever, there are many more parameter options that can be specified in the 'global' settings or specific to a certain plot option. \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?par\n```\n:::\n\nSet or Query Graphical Parameters\n\nDescription:\n\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\n\nUsage:\n\n par(..., no.readonly = FALSE)\n \n (...., = )\n \nArguments:\n\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\n\nno.readonly: logical; if 'TRUE' and there are no other arguments, only\n parameters are returned which can be set by a subsequent\n 'par()' call _on the same device_.\n\nDetails:\n\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set.
('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n * '\"ask\"',\n\n * '\"fig\"', '\"fin\"',\n\n * '\"lheight\"',\n\n * '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n * '\"new\"',\n\n * '\"oma\"', '\"omd\"', '\"omi\"',\n\n * '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n * '\"usr\"',\n\n * '\"xlog\"', '\"ylog\"',\n\n * '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\n\nValue:\n\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. 
Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\n\nGraphical Parameters:\n\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. 
For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. 
Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. 
Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 
'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. 
See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. 
Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. 
Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. (The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. 
It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. 
See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\n\nColor Specification:\n\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Hexadecimal colors can be in the long hexadecimal form\n (e.g., '\"#rrggbb\"' or '\"#rrggbbaa\"') or the short form (e.g,\n '\"#rgb\"' or '\"#rgba\"'). The short form is expanded to the long\n form by replicating digits (not by adding zeroes), e.g., '\"#rgb\"'\n becomes '\"#rrggbb\"'. Colors can also be specified by giving an\n index into a small table of colors, the 'palette': indices wrap\n round so with the default palette of size 8, '10' is the same as\n '2'. This provides compatibility with S. Index '0' corresponds\n to the background color. 
Note that the palette (apart from '0'\n which is per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\n\nLine Type Specification:\n\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\n\nNote:\n\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. 
(1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'pdf', 'postscript' and setting up device regions by\n 'layout' and 'split.screen'.\n\nExamples:\n\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))\n\n\n\n\n## Common parameter options\n\nEight useful parameter arguments help improve the readability of the plot:\n\n- `xlab`: specifies the x-axis label of the plot\n- `ylab`: specifies the y-axis label\n- `main`: titles your graph\n- `pch`: specifies the plotting symbol used for points\n- `lty`: specifies the line type of your graph\n- `lwd`: specifies line thickness\n- `cex`: specifies the size (magnification) of text and symbols\n- `col`: specifies the colors for your graph\n\nWe will explore the use of these arguments below.\n\n## Common parameter options\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/atrributes.png){width=200%}\n:::\n:::\n\n\n\n\n\n## 2. Plot Attributes\n\nPlot attributes are those that map your data to the plot. This means this is where you specify which variables in the data frame you want to plot. \n\nWe will only look at four types of plots today:\n\n- `hist()` displays a histogram of one variable\n- `plot()` displays an x-y plot of two variables\n- `boxplot()` displays a boxplot \n- `barplot()` displays a barplot\n\n\n## `hist()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?hist\n```\n:::\n\nHistograms\n\nDescription:\n\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\n\nUsage:\n\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n\n x: a vector of values for which the histogram is desired.\n\n breaks: one of:\n\n * a vector giving the breakpoints between histogram cells,\n\n * a function to compute the vector of breakpoints,\n\n * a single number giving the number of cells for the\n histogram,\n\n * a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n * a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\n freq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\n\nprobability: an _alias_ for '!freq', for S compatibility.\n\ninclude.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'\n value will be included in the first (or last, for 'right =\n FALSE') bar. 
This will be ignored (with a warning) unless\n 'breaks' is a vector.\n\n right: logical; if 'TRUE', the histogram cells are right-closed\n (left open) intervals.\n\n fuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\n\n density: the density of shading lines, in lines per inch. The default\n value of 'NULL' means that no shading lines are drawn.\n Non-positive values of 'density' also inhibit the drawing of\n shading lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise).\n\n col: a colour to be used to fill the bars.\n\n border: the color of the border around the bars. The default is to\n use the standard foreground color.\n\nmain, xlab, ylab: main title and axis labels: these arguments to\n 'title()' get \"smart\" defaults here, e.g., the default 'ylab'\n is '\"Frequency\"' iff 'freq' is true.\n\nxlim, ylim: the range of x and y values with sensible defaults. Note\n that 'xlim' is _not_ used to define the histogram (breaks),\n but only for plotting (when 'plot = TRUE').\n\n axes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\n plot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\n\n labels: logical or character string. Additionally draw labels on top\n of bars, if not 'FALSE'; see 'plot.histogram'.\n\n nclass: numeric (integer). For S(-PLUS) compatibility only, 'nclass'\n is equivalent to 'breaks' for a scalar or character argument.\n\nwarn.unused: logical. 
If 'plot = FALSE' and 'warn.unused = TRUE', a\n warning will be issued when graphical parameters are passed\n to 'hist.default()'.\n\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\n\nDetails:\n\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equispaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equispaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. 
Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\n\nValue:\n\n an object of class '\"histogram\"' which is a list with components:\n\n breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).\n These are the nominal breaks, not with the boundary fuzz.\n\n counts: n integers; for each cell, the number of 'x[]' inside.\n\n density: values f^(x[i]), as estimated density values. If\n 'all(diff(breaks) == 1)', they are the relative frequencies\n 'counts/n' and in general satisfy sum[i; f^(x[i])\n (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.\n\n mids: the n cell midpoints.\n\n xname: a character string with the actual 'x' argument name.\n\nequidist: logical, indicating if the distances between 'breaks' are all\n the same.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\n\nSee Also:\n\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\n\nExamples:\n\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... 
:\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)\n\n\n\n\n## `hist()` example\n\nReminder of the function signature:\n\n```\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n```\n\nLet's practice\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhist(df$age)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png){width=960}\n:::\n\n```{.r .cell-code}\nhist(\n\tdf$age, \n\tfreq=FALSE, \n\tmain=\"Histogram\", \n\txlab=\"Age (years)\"\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png){width=960}\n:::\n:::\n\n\n\n\n\n## `plot()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?plot\n```\n:::\n\nGeneric X-Y Plotting\n\nDescription:\n\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\n\nUsage:\n\n plot(x, y, ...)\n \nArguments:\n\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n * '\"p\"' for *p*oints,\n\n * '\"l\"' for *l*ines,\n\n * '\"b\"' for *b*oth,\n\n * '\"c\"' for the lines part alone of '\"b\"',\n\n * '\"o\"' for both '*o*verplotted',\n\n * '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n * '\"s\"' for stair *s*teps,\n\n * '\"S\"' for other *s*teps, see 'Details' below,\n\n * '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\n\nDetails:\n\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\n\nNote:\n\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\n\nSee Also:\n\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\n\nExamples:\n\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")\n\n\n\n\n\n## `plot()` example\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(df$age, df$IgG_concentration)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png){width=960}\n:::\n\n```{.r .cell-code}\nplot(\n\tdf$age, \n\tdf$IgG_concentration, \n\ttype=\"p\", \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age (years)\", \n\tylab=\"IgG Concentration (IU/mL)\", \n\tpch=16, \n\tcex=0.9,\n\tcol=\"lightblue\")\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png){width=960}\n:::\n:::\n\n\n\n\n## Adding more stuff to the same plot\n\n* We can use the functions `points()` or `lines()` to add additional points\nor additional lines to an existing plot.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(\n\tdf$age[df$slum == \"Non slum\"],\n\tdf$IgG_concentration[df$slum == \"Non slum\"],\n\ttype = \"p\",\n\tmain = \"IgG Concentration vs Age\",\n\txlab = \"Age (years)\",\n\tylab = \"IgG Concentration (IU/mL)\",\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"lightblue\",\n\txlim = range(df$age, na.rm = TRUE),\n\tylim = range(df$IgG_concentration, na.rm = TRUE)\n)\npoints(\n\tdf$age[df$slum == \"Mixed\"],\n\tdf$IgG_concentration[df$slum == \"Mixed\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = 
\"blue\"\n)\npoints(\n\tdf$age[df$slum == \"Slum\"],\n\tdf$IgG_concentration[df$slum == \"Slum\"],\n\tpch = 16,\n\tcex = 0.9,\n\tcol = \"darkblue\"\n)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png){width=960}\n:::\n:::\n\n\n\n\n* The `lines()` function works similarly for connected lines.\n* Note that the `points()` or `lines()` functions must be called with a `plot()`-style function\n* We will show how we could draw a `legend()` in a future section.\n\n\n## `boxplot()` Help File\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?boxplot\n```\n:::\n\nBox Plots\n\nDescription:\n\n Produce box-and-whisker plot(s) of the given (grouped) values.\n\nUsage:\n\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\n\n formula: a formula, such as 'y ~ grp', where 'y' is a numeric vector\n of data values to be split into groups according to the\n grouping variable 'grp' (usually a factor). Note that '~ g1\n + g2' is equivalent to 'g1:g2'.\n\n data: a data.frame (or list) from which the variables in 'formula'\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used for plotting.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. 
The default is to ignore missing values in\n either the response or the group.\n\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty\n default. Can be suppressed by 'ann=FALSE'.\n\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\n\ndrop, sep, lex.order: passed to 'split.default', see there.\n\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\n\n range: this determines how far the plot whiskers extend out from the\n box. If 'range' is positive, the whiskers extend to the most\n extreme data point which is no more than 'range' times the\n interquartile range from the box. A value of zero causes the\n whiskers to extend to the data extremes.\n\n width: a vector giving the relative widths of the boxes making up\n the plot.\n\nvarwidth: if 'varwidth' is 'TRUE', the boxes are drawn with widths\n proportional to the square-roots of the number of\n observations in the groups.\n\n notch: if 'notch' is 'TRUE', a notch is drawn in each side of the\n boxes. If the notches of two plots do not overlap this is\n 'strong evidence' that the two medians differ (Chambers et\n al., 1983, p. 62). 
See 'boxplot.stats' for the calculations\n used.\n\n outline: if 'outline' is not true, the outliers are not drawn (as\n points whereas S+ uses lines).\n\n names: group labels which will be printed under each boxplot. Can\n be a character vector or an expression (see plotmath).\n\n boxwex: a scale factor to be applied to all boxes. When there are\n only a few groups, the appearance of the plot can be improved\n by making the boxes narrower.\n\nstaplewex: staple line width expansion, proportional to box width.\n\n outwex: outlier line width expansion, proportional to box width.\n\n plot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\n\n border: an optional vector of colors for the outlines of the\n boxplots. The values in 'border' are recycled if the length\n of 'border' is less than the number of plots.\n\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\n pars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\n\nhorizontal: logical indicating if the boxplots should be horizontal;\n default 'FALSE' means vertical boxes.\n\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\n\nDetails:\n\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\n\nValue:\n\n List with the following components:\n\n stats: a matrix, each column contains the extreme of the lower\n whisker, the lower hinge, the median, the upper hinge and the\n extreme of the upper whisker for one group/plot. If all the\n inputs have the same class attribute, so will this component.\n\n n: a vector with the number of (non-'NA') observations in each\n group.\n\n conf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\n\n group: a vector of the same length as 'out' whose elements indicate\n to which group the outlier belongs.\n\n names: a vector of names for the groups.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). 
_Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\n\nSee Also:\n\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\n\nExamples:\n\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)\n\n\n\n\n\n## `boxplot()` example\n\nReminder function signature\n```\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n```\n\nLet's practice\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxplot(IgG_concentration~age_group, data=df)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png){width=960}\n:::\n\n```{.r .cell-code}\nboxplot(\n\tlog(df$IgG_concentration)~df$age_group, \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age Group (years)\", \n\tylab=\"log IgG Concentration (mIU/mL)\", \n\tnames=c(\"1-5\",\"6-10\", \"11-15\"), \n\tvarwidth=T\n\t)\n```\n\n::: 
{.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png){width=960}\n:::\n:::\n\n\n\n\n\n## `barplot()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?barplot\n```\n:::\n\nBar Plots\n\nDescription:\n\n Creates a bar plot with vertical or horizontal bars.\n\nUsage:\n\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\n\n height: either a vector or matrix of values describing the bars which\n make up the plot. If 'height' is a vector, the plot consists\n of a sequence of rectangular bars with heights given by the\n values in the vector. If 'height' is a matrix and 'beside'\n is 'FALSE' then each bar of the plot corresponds to a column\n of 'height', with the values in the column giving the heights\n of stacked sub-bars making up the bar. If 'height' is a\n matrix and 'beside' is 'TRUE', then the values in each column\n are juxtaposed rather than stacked.\n\n width: optional vector of bar widths. Re-cycled to length the number\n of bars drawn. Specifying a single value will have no\n visible effect unless 'xlim' is specified.\n\n space: the amount of space (as a fraction of the average bar width)\n left before each bar. May be given as a single number or one\n number per bar. 
If 'height' is a matrix and 'beside' is\n 'TRUE', 'space' may be specified by two numbers, where the\n first is the space between bars in the same group, and the\n second the space between the groups. If not given\n explicitly, it defaults to 'c(0,1)' if 'height' is a matrix\n and 'beside' is 'TRUE', and to 0.2 otherwise.\n\nnames.arg: a vector of names to be plotted below each bar or group of\n bars. If this argument is omitted, then the names are taken\n from the 'names' attribute of 'height' if this is a vector,\n or the column names if it is a matrix.\n\nlegend.text: a vector of text used to construct a legend for the plot,\n or a logical indicating whether a legend should be included.\n This is only useful when 'height' is a matrix. In that case\n given legend labels should correspond to the rows of\n 'height'; if 'legend.text' is true, the row names of 'height'\n will be used as labels if they are non-null.\n\n beside: a logical value. If 'FALSE', the columns of 'height' are\n portrayed as stacked bars, and if 'TRUE' the columns are\n portrayed as juxtaposed bars.\n\n horiz: a logical value. If 'FALSE', the bars are drawn vertically\n with the first bar to the left. If 'TRUE', the bars are\n drawn horizontally with the first at the bottom.\n\n density: a vector giving the density of shading lines, in lines per\n inch, for the bars or bar components. The default value of\n 'NULL' means that no shading lines are drawn. Non-positive\n values of 'density' also inhibit the drawing of shading\n lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise), for the bars or bar components.\n\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\n\n border: the color to be used for the border of the bars. Use 'border\n = NA' to omit borders. 
If there are shading lines, 'border =\n TRUE' means use the same colour for the border as for the\n shading lines.\n\nmain, sub: main title and subtitle for the plot.\n\n xlab: a label for the x axis.\n\n ylab: a label for the y axis.\n\n xlim: limits for the x axis.\n\n ylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\n axes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\n\naxisnames: logical. If 'TRUE', and if there are 'names.arg' (see\n above), the other axis is drawn (with 'lty = 0') and labeled.\n\ncex.axis: expansion factor for numeric axis labels (see 'par('cex')').\n\ncex.names: expansion factor for axis names (bar labels).\n\n inside: logical. If 'TRUE', the lines which divide adjacent\n (non-stacked!) bars will be drawn. Only applies when 'space\n = 0' (which it partly is when 'beside = TRUE').\n\n plot: logical. If 'FALSE', nothing is plotted.\n\naxis.lty: the graphics parameter 'lty' (see 'par('lty')') applied to\n the axis and tick marks of the categorical (default\n horizontal) axis. Note that by default the axis is\n suppressed.\n\n offset: a vector indicating how much the bars should be shifted\n relative to the x axis.\n\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\n\nargs.legend: list of additional arguments to pass to 'legend()'; names\n of the list are used as argument names. Only used if\n 'legend.text' is supplied.\n\n formula: a formula where the 'y' variables are numeric data to plot\n against the categorical 'x' variables. 
The formula can have\n one of three forms:\n\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\n data: a data frame (or list) from which the variables in formula\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\n\nValue:\n\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\n\nAuthor(s):\n\n R Core, with a contribution by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\n\nExamples:\n\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 
2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))\n\n\n\n\n\n## `barplot()` example\n\nThe function takes a lot of arguments to control the way our data is plotted. 
\n\nReminder function signature\n```\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n```\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png){width=960}\n:::\n\n```{.r .cell-code}\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png){width=960}\n:::\n:::\n\n\n\n\n## 3. Legend!\n\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?legend\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. 
Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. 
The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. 
Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. 
If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n 
text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 
1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 
4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)\n```\n\n\n:::\n:::\n\n\n\n\n\n\n## Add legend to the plot\n\nReminder function signature\n```\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n```\n\nLet's practice\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age Group\")\nlegend(x=2.5, y=0.5,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n\n## Add legend to the plot\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png){width=960}\n:::\n:::\n\n\n\n\n\n## `barplot()` example\n\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). 
Also, the age groups need more meaningful names.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n## `barplot()` example\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png){width=960}\n:::\n:::\n\n\n\n\n\n\n## `barplot()` example\n\nNow, let look at seropositivity by two individual level characteristics in the same plot. \n\n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n\n\n## `barplot()` example\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png){width=960}\n:::\n:::\n\n\n\n\n## Saving plots to file\n\nIf you want to include your graphic in a paper or anything else, you need to\nsave it as an image. One limitation of base R graphics is that the process for\nsaving plots is a bit annoying.\n\n1. 
Open a graphics device connection with a graphics function -- examples\ninclude `pdf()`, `png()`, and `tiff()` for the most useful.\n1. Run the code that creates your plot.\n1. Use `dev.off()` to close the graphics device connection.\n\nLet's do an example.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Open the graphics device\npng(\n\t\"my-barplot.png\",\n\twidth = 800,\n\theight = 450,\n\tunits = \"px\"\n)\n# Set the plot layout -- this is an alternative to par(mfrow = ...)\nlayout(matrix(c(1, 2), ncol = 2))\n# Make the plot\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n# Close the graphics device\ndev.off()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\npng \n 2 \n```\n\n\n:::\n\n```{.r .cell-code}\n# Reset the layout\nlayout(1)\n```\n:::\n\n\n\n\nNote: after you do an interactive graphics session, it is often helpful to\nrestart R or run the function `graphics.off()` before opening the graphics\nconnection device.\n\n## Base R plots vs the Tidyverse ggplot2 package\n\nIt is good to know both b/c they each have their strengths\n\n## Summary\n\n- the Base R 'graphics' package has a ton of graphics options that allow for ultimate flexibility\n- Base R plots typically include setting plot options (`par()`), mapping data to the plot (e.g., `plot()`, `barplot()`, `points()`, `lines()`), and creating a legend (`legend()`). 
\n- the functions `points()` or `lines()` add additional points or additional lines to an existing plot, so they must be called after a `plot()`-style function has created that plot\n- in Base R plotting the legend is not automatically generated, so be careful when creating it\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Base Plotting in R\" by Medium](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22)\n- [\"Base R margins: a cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)\n", "supporting": [ "Module10-DataVisualization_files" ], diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png index baf3c4b..554b146 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png index 5535ebf..7771ab5 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-12-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png index 24d0d37..b7ef384 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png
b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png index 4e5c9c8..faacd3e 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png index bf214c3..ee446a7 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-16-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png index ca0e2f6..66ec6a4 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png index ccb4316..b02c913 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-19-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png index a7e02e6..8784b6a 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-1.png differ diff --git 
a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png index 57c867f..554d371 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-22-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png index edfae88..a83659f 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png index 232d44e..8bfbbe3 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png index c6eb02c..a7fecb4 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-31-1.png differ diff --git a/docs/modules/Module06-DataSubset.html b/docs/modules/Module06-DataSubset.html index d60ee95..57c418e 100644 --- a/docs/modules/Module06-DataSubset.html +++ b/docs/modules/Module06-DataSubset.html @@ -1478,18 +1478,19 @@

Using indexing and logical operators to rename columns

[1] FALSE  TRUE FALSE FALSE FALSE
-
cn[cn=="IgG_concentration"] <-"IgG_concentration_mIU" #rename cn to "IgG_concentration_mIU" when cn is "IgG_concentration"
+
cn[cn=="IgG_concentration"] <-"IgG_concentration_IU/mL" #rename cn to "IgG_concentration_IU/mL" when cn is "IgG_concentration"
 colnames(df) <- cn
 colnames(df)
-
[1] "observation_id"        "IgG_concentration_mIU" "age"                  
-[4] "gender"                "slum"                 
+
[1] "observation_id"          "IgG_concentration_IU/mL"
+[3] "age"                     "gender"                 
+[5] "slum"                   


Note, I am resetting the column name back to the original name for the sake of the rest of the module.

-
colnames(df)[colnames(df)=="IgG_concentration_mIU"] <- "IgG_concentration" #reset
+
colnames(df)[colnames(df)=="IgG_concentration_IU/mL"] <- "IgG_concentration" #reset
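
The rename-by-logical-index pattern above can be tried on its own. This is a minimal sketch that assumes a small made-up data frame (`toy` and its values are illustrative, not the module's `df`):

```r
# Toy data frame standing in for the module's `df` (values are made up)
toy <- data.frame(observation_id = 1:3,
                  IgG_concentration = c(0.3, 1.7, 143.2),
                  age = c(2, 4, NA))

cn <- colnames(toy)
cn == "IgG_concentration"   # logical vector, TRUE only at the matching name

# Replace only the element(s) where the comparison is TRUE
cn[cn == "IgG_concentration"] <- "IgG_concentration_IU/mL"
colnames(toy) <- cn
colnames(toy)

# A name containing "/" is not syntactic, so `$` access needs backticks
toy$`IgG_concentration_IU/mL`
```

The backticks in the last line are one practical reason to reset the column to a plain name afterwards.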
diff --git a/docs/modules/Module07-VarCreationClassesSummaries.html b/docs/modules/Module07-VarCreationClassesSummaries.html index 5357a4a..eea2a31 100644 --- a/docs/modules/Module07-VarCreationClassesSummaries.html +++ b/docs/modules/Module07-VarCreationClassesSummaries.html @@ -616,7 +616,7 @@

Adding new columns with transform()

Creating conditional variables

-

One frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R ifelse() function, which “returns a value depending on whether the element of test is TRUE or FALSE.”

+

One frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R ifelse() function, which “returns a value depending on whether the element of test is TRUE or FALSE or NA.”

?ifelse
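
To see what the NA case means in practice, here is a short sketch with a made-up `age` vector (the vector and the labels "young" and "old" are assumptions for illustration only):

```r
# Made-up ages, including one missing value
age <- c(2, 7, NA, 12)

# Where the test `age <= 5` is NA, ifelse() returns NA,
# not the "yes" value and not the "no" value
age_group <- ifelse(age <= 5, "young", "old")
age_group
```

The third element stays missing, so a missing age never gets silently assigned to either group.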
@@ -1584,7 +1584,7 @@

Numeric variable data summary

Numeric variable data summary

-

Let’s look at a help file for mean() to make note of the na.rm argument

+

Let’s look at a help file for range() to make note of the na.rm argument

?range
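
A quick sketch of what the na.rm argument changes, using a made-up vector `x` (an assumption for illustration):

```r
# Made-up numeric vector with one missing value
x <- c(4, NA, 1, 15)

r_default <- range(x)               # missing value propagates: both ends are NA
r_narm    <- range(x, na.rm = TRUE) # NA dropped first: minimum 1, maximum 15

r_default
r_narm
```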
@@ -2067,7 +2067,7 @@

Summary

  • One useful function for creating new variables based on existing variables is the ifelse() function, which returns a value depending on whether the element of test is TRUE or FALSE
  • The class() function allows you to evaluate the class of an object.
  • There are two types of numeric class objects: integer and double
  • -
  • Logical class objects only have TRUE or False (without quotes)
  • +
  • Logical class objects can only take the values TRUE, FALSE, or NA (without quotes)
  • is.CLASS_NAME(x) can be used to test the class of an object x
  • as.CLASS_NAME(x) can be used to change the class of an object x
  • Factors are a special character class that has levels
  • diff --git a/docs/modules/Module10-DataVisualization.html b/docs/modules/Module10-DataVisualization.html index 68dd8c9..2430d8a 100644 --- a/docs/modules/Module10-DataVisualization.html +++ b/docs/modules/Module10-DataVisualization.html @@ -461,7 +461,7 @@

    Base R data visualization functions

    Description: Package: graphics -Version: 4.3.1 +Version: 4.4.1 Priority: base Title: The R Graphics Package Author: R Core Team and contributors worldwide @@ -469,10 +469,11 @@

    Base R data visualization functions

    Contact: R-help mailing list <r-help@r-project.org> Description: R functions for base graphics. Imports: grDevices -License: Part of R 4.3.1 +License: Part of R 4.4.1 NeedsCompilation: yes -Built: R 4.3.1; aarch64-apple-darwin20; 2023-06-16 - 21:53:01 UTC; unix +Enhances: vcd +Built: R 4.4.1; x86_64-w64-mingw32; 2024-06-14 08:20:40 + UTC; windows Index: @@ -618,25 +619,25 @@

    Lots of parameters options

    Several parameters can only be set by a call to 'par()': - • '"ask"', + * '"ask"', - • '"fig"', '"fin"', + * '"fig"', '"fin"', - • '"lheight"', + * '"lheight"', - • '"mai"', '"mar"', '"mex"', '"mfcol"', '"mfrow"', '"mfg"', + * '"mai"', '"mar"', '"mex"', '"mfcol"', '"mfrow"', '"mfg"', - • '"new"', + * '"new"', - • '"oma"', '"omd"', '"omi"', + * '"oma"', '"omd"', '"omi"', - • '"pin"', '"plt"', '"ps"', '"pty"', + * '"pin"', '"plt"', '"ps"', '"pty"', - • '"usr"', + * '"usr"', - • '"xlog"', '"ylog"', + * '"xlog"', '"ylog"', - • '"ylbias"' + * '"ylbias"' The remaining parameters can also be set as arguments (often via '...') to high-level plot functions such as 'plot.default', @@ -1121,12 +1122,16 @@

    Lots of parameters options

    directly in terms of their RGB components with a string of the form '"#RRGGBB"' where each of the pairs 'RR', 'GG', 'BB' consist of two hexadecimal digits giving a value in the range '00' to - 'FF'. Colors can also be specified by giving an index into a - small table of colors, the 'palette': indices wrap round so with - the default palette of size 8, '10' is the same as '2'. This - provides compatibility with S. Index '0' corresponds to the - background color. Note that the palette (apart from '0' which is - per-device) is a per-session setting. + 'FF'. Hexadecimal colors can be in the long hexadecimal form + (e.g., '"#rrggbb"' or '"#rrggbbaa"') or the short form (e.g, + '"#rgb"' or '"#rgba"'). The short form is expanded to the long + form by replicating digits (not by adding zeroes), e.g., '"#rgb"' + becomes '"#rrggbb"'. Colors can also be specified by giving an + index into a small table of colors, the 'palette': indices wrap + round so with the default palette of size 8, '10' is the same as + '2'. This provides compatibility with S. Index '0' corresponds + to the background color. Note that the palette (apart from '0' + which is per-device) is a per-session setting. Negative integer colours are errors. @@ -1173,8 +1178,8 @@

    Lots of parameters options

    See Also:

     'plot.default' for some high-level plotting parameters; 'colors';
      'clip'; 'options' for other setup parameters; graphic devices
    - 'x11', 'postscript' and setting up device regions by 'layout' and
    - 'split.screen'.
    + 'x11', 'pdf', 'postscript' and setting up device regions by + 'layout' and 'split.screen'.

    Examples:

     op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot
                pty = "s")       # square plotting region,
    @@ -1297,17 +1302,17 @@ 

    hist() Help File

    Arguments:

       x: a vector of values for which the histogram is desired.

    breaks: one of:

    -
            • a vector giving the breakpoints between histogram cells,
    +
            * a vector giving the breakpoints between histogram cells,
     
    -        • a function to compute the vector of breakpoints,
    +        * a function to compute the vector of breakpoints,
     
    -        • a single number giving the number of cells for the
    +        * a single number giving the number of cells for the
               histogram,
     
    -        • a character string naming an algorithm to compute the
    +        * a character string naming an algorithm to compute the
               number of cells (see 'Details'),
     
    -        • a function to compute the number of cells.
    +        * a function to compute the number of cells.
     
           In the last three cases the number is a suggestion only; as
           the breakpoints will be set to 'pretty' values, the number is
    @@ -1351,13 +1356,13 @@ 

    hist() Help File

    TRUE').

    Details:

     The definition of _histogram_ differs by source (with
    - country-specific biases).  R's default with equi-spaced breaks
    + country-specific biases).  R's default with equispaced breaks
      (also the default) is to plot the counts in the cells defined by
      'breaks'.  Thus the height of a rectangle is proportional to the
      number of points falling into the cell, as is the area _provided_
      the breaks are equally-spaced.
     
    - The default with non-equi-spaced breaks is to give a plot of area
    + The default with non-equispaced breaks is to give a plot of area
      one, in which the _area_ of the rectangles is the fraction of the
      data points falling in the cells.
     
    @@ -1513,30 +1518,30 @@ 

    plot() Help File

    y: the y coordinates of points in the plot, _optional_ if 'x' is an appropriate structure. - ...: Arguments to be passed to methods, such as graphical + ...: arguments to be passed to methods, such as graphical parameters (see 'par'). Many methods will accept the following arguments: 'type' what type of plot should be drawn. Possible types are - • '"p"' for *p*oints, + * '"p"' for *p*oints, - • '"l"' for *l*ines, + * '"l"' for *l*ines, - • '"b"' for *b*oth, + * '"b"' for *b*oth, - • '"c"' for the lines part alone of '"b"', + * '"c"' for the lines part alone of '"b"', - • '"o"' for both '*o*verplotted', + * '"o"' for both '*o*verplotted', - • '"h"' for '*h*istogram' like (or 'high-density') + * '"h"' for '*h*istogram' like (or 'high-density') vertical lines, - • '"s"' for stair *s*teps, + * '"s"' for stair *s*teps, - • '"S"' for other *s*teps, see 'Details' below, + * '"S"' for other *s*teps, see 'Details' below, - • '"n"' for no plotting. + * '"n"' for no plotting. All other 'type's give a warning or an error; using, e.g., 'type = "punkte"' being equivalent to 'type = "p"' @@ -1709,7 +1714,7 @@

    boxplot() Help File

    range: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.

    width: a vector giving the relative widths of the boxes making up the plot.

    varwidth: if ‘varwidth’ is ‘TRUE’, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.

    -

    notch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See ‘boxplot.stats’ for the calculations used.

    +

    notch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al., 1983, p. 62). See ‘boxplot.stats’ for the calculations used.

    outline: if ‘outline’ is not true, the outliers are not drawn (as points whereas S+ uses lines).

    names: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).

    boxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.

    @@ -1904,7 +1909,7 @@

    barplot() Help File

    gamma-corrected grey palette if 'height' is a matrix; see 'grey.colors'.

    border: the color to be used for the border of the bars. Use ‘border = NA’ to omit borders. If there are shading lines, ‘border = TRUE’ means use the same colour for the border as for the shading lines.

    -

    main,sub: main title and subtitle for the plot.

    +

    main, sub: main title and subtitle for the plot.

    xlab: a label for the x axis.
     
     ylab: a label for the y axis.
    @@ -2029,7 +2034,6 @@ 

    barplot() Help File

    # Border color barplot(VADeaths, border = "dark blue") - # Log scales (not much sense here) barplot(tN, col = heat.colors(12), log = "y") barplot(tN, col = gray.colors(20), log = "xy") @@ -2512,6 +2516,46 @@

    barplot() example

    barplot() example

    +
    +

    Saving plots to file

    +

    If you want to include your graphic in a paper or anything else, you need to save it as an image. One limitation of base R graphics is that saving a plot takes several explicit steps.

    +
    +
    1. Open a graphics device connection with a graphics device function. The most useful are pdf(), png(), and tiff().
    +
    2. Run the code that creates your plot.
    +
    3. Use dev.off() to close the graphics device connection.
    +
    +

    Let’s do an example.

    +
    +
    # Open the graphics device
    +png(
    +    "my-barplot.png",
    +    width = 800,
    +    height = 450,
    +    units = "px"
    +)
    +# Set the plot layout -- this is an alternative to par(mfrow = ...)
    +layout(matrix(c(1, 2), ncol = 2))
    +# Make the plot
    +barplot(prop.column.percentages, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Age Group")
    +axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    +legend("topright",
    +             fill=c("darkblue","red"), 
    +             legend = c("seronegative", "seropositive"))
    +
    +barplot(prop.column.percentages2, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Residence")
    +axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    +legend("topright", fill=c("darkblue","red"),  legend = c("seronegative", "seropositive"))
    +# Close the graphics device
    +dev.off()
    +
    +
    png 
    +  2 
    +
    +
    # Reset the layout
    +layout(1)
    +
    +

    Note: after you do an interactive graphics session, it is often helpful to restart R or run the function graphics.off() before opening the graphics connection device.
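
The three steps can also be sketched in a stripped-down form. This is an illustrative example only: it uses pdf() instead of png(), writes to a temporary file, and plots made-up proportions rather than the module's tables.

```r
# Write to a temporary file so the sketch does not clutter the project folder
out_file <- tempfile(fileext = ".pdf")

# 1. Open the graphics device (for pdf(), width and height are in inches)
pdf(out_file, width = 8, height = 4.5)

# 2. Run the code that creates the plot (values here are made up)
barplot(c(seronegative = 0.4, seropositive = 0.6),
        col = c("darkblue", "red"), ylim = c(0, 1))

# 3. Close the device; the file is only complete after this call
dev.off()

file.exists(out_file)
```

If dev.off() is skipped, later plots keep going to the file instead of the screen, which is the situation graphics.off() cleans up.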

    +

    Base R plots vs the Tidyverse ggplot2 package

    It is good to know both because each has its own strengths.

    diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png index baf3c4b..554b146 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png index 5535ebf..7771ab5 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png index 24d0d37..b7ef384 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png index 4e5c9c8..faacd3e 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png index bf214c3..ee446a7 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png and 
b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-16-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png index ca0e2f6..66ec6a4 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png index ccb4316..b02c913 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-19-2.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png index a7e02e6..8784b6a 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png index 57c867f..554d371 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-22-2.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png index edfae88..a83659f 100644 Binary files 
a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-26-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png index 232d44e..8bfbbe3 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-28-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png index c6eb02c..a7fecb4 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-31-1.png differ diff --git a/docs/references.html b/docs/references.html index 6e76978..001c2aa 100644 --- a/docs/references.html +++ b/docs/references.html @@ -317,6 +317,7 @@

    Page Items

    @@ -365,6 +366,19 @@

    Data and Exercise downloads

  • Course GitHub where all materials can be found (to download the entire course as a zip file click the green “Code” button): https://github.com/UGA-IDD/SISMID-2024.
  • +
    +

    Useful (+ Free) Resources

    +
      +
    • R for Data Science: http://r4ds.had.co.nz/
      +(great general information)
    • +
    • Fundamentals of Data Visualization: https://clauswilke.com/dataviz/
    • +
    • R for Epidemiology: https://www.r4epi.com/
    • +
    • The Epidemiologist R Handbook: https://epirhandbook.com/en/
    • +
    • R basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html (great general information)
    • +
    • Open Case Studies: https://www.opencasestudies.org/
      +(resource for specific public health cases with statistical implementation and interpretation)
    • +
    +

    Need help?

      @@ -1036,19 +1050,31 @@

      Other references

      - And the rendered HTML file is here: [click to download](./modules/Module11-Rmarkdown-Demo.html){target="_blank"} * Course GitHub where all materials can be found (to download the entire course as a zip file click the green "Code" button): [https://github.com/UGA-IDD/SISMID-2024](https://github.com/UGA-IDD/SISMID-2024){target="_blank"}. -# Need help? +# Useful (+ Free) Resources -- Various "Cheat Sheets": [https://github.com/rstudio/cheatsheets/](https://github.com/rstudio/cheatsheets/) -- R reference card: [http://cran.r-project.org/doc/contrib/Short-refcard.pdf](http://cran.r-project.org/doc/contrib/Short-refcard.pdf) -- R jargon: [https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf) -- R vs Stata: [https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf) -- R terminology: [https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf](https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf) - - -# Other references {.unnumbered} - -::: {#refs} -:::
      +- R for Data Science: http://r4ds.had.co.nz/ +(great general information) +- Fundamentals of Data Visualization: https://clauswilke.com/dataviz/ +- R for Epidemiology: https://www.r4epi.com/ +- The Epidemiologist R Handbook: https://epirhandbook.com/en/ +- R basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html +(great general information) +- Open Case Studies: https://www.opencasestudies.org/ +(resource for specific public health cases with statistical implementation and interpretation) + +# Need help? + +- Various "Cheat Sheets": [https://github.com/rstudio/cheatsheets/](https://github.com/rstudio/cheatsheets/) +- R reference card: [http://cran.r-project.org/doc/contrib/Short-refcard.pdf](http://cran.r-project.org/doc/contrib/Short-refcard.pdf) +- R jargon: [https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf) +- R vs Stata: [https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf) +- R terminology: [https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf](https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf) + + +# Other references {.unnumbered} + +::: {#refs} +:::
      diff --git a/docs/schedule.html b/docs/schedule.html index 2607d73..b1f3068 100644 --- a/docs/schedule.html +++ b/docs/schedule.html @@ -440,15 +440,15 @@

      Day 02 – Tuesday

      08:30 am - 09:00 am -exercise review and questions / catchup +exercise review and questions / catchup (Zane) 09:00 am - 09:15 am -Module 8 +Module 8 (Amy) 09:15 am - 10:00 am -Exercise 3 work time +Data analysis walkthrough (Zane and Amy) 10:00 am - 10:30 am @@ -456,15 +456,15 @@

      Day 02 – Tuesday

      10:30 am - 10:45 am -Exercise review +Exercise 3 work time 10:45 am - 11:15 am -Module 9 +Exercise review (Zane) 11:15 am - 12:00 pm -Data analysis walkthrough +Module 9 (Amy) 12:00 pm - 01:30 pm @@ -476,11 +476,11 @@

      Day 02 – Tuesday

      02:00 pm - 02:30 pm -Exercise 4 review +Exercise 4 review (Zane) 02:30 pm - 03:00 pm -Module 10 +Module 10 (Amy) 03:00 pm - 03:30 pm @@ -492,11 +492,11 @@

      Day 02 – Tuesday

      04:00 pm - 04:30 pm -Review exercise 5 +Review exercise 5 (Zane) 04:30 pm - 05:00 pm -Module 11 +Module 11 (Zane) @@ -1193,21 +1193,21 @@

      Day 03 – Wednesday

      | Time | Section | |:--------------------|:--------| -| 08:30 am - 09:00 am | exercise review and questions / catchup | -| 09:00 am - 09:15 am | Module 8 | -| 09:15 am - 10:00 am | Exercise 3 work time | +| 08:30 am - 09:00 am | exercise review and questions / catchup (Zane) | +| 09:00 am - 09:15 am | Module 8 (Amy) | +| 09:15 am - 10:00 am | Data analysis walkthrough (Zane and Amy) | | 10:00 am - 10:30 am | Coffee break | -| 10:30 am - 10:45 am | Exercise review | -| 10:45 am - 11:15 am | Module 9 | -| 11:15 am - 12:00 pm | Data analysis walkthrough | +| 10:30 am - 10:45 am | Exercise 3 work time | +| 10:45 am - 11:15 am | Exercise review (Zane) | +| 11:15 am - 12:00 pm | Module 9 (Amy) | | 12:00 pm - 01:30 pm | Lunch (2nd floor lobby); **Lunch and Learn!** | | 01:30 pm - 02:00 pm | Exercise 4 | -| 02:00 pm - 02:30 pm | Exercise 4 review | -| 02:30 pm - 03:00 pm | Module 10 | +| 02:00 pm - 02:30 pm | Exercise 4 review (Zane) | +| 02:30 pm - 03:00 pm | Module 10 (Amy) | | 03:00 pm - 03:30 pm | Coffee break | | 03:30 pm - 04:00 pm | Exercise 5 | -| 04:00 pm - 04:30 pm | Review exercise 5 | -| 04:30 pm - 05:00 pm | Module 11 | +| 04:00 pm - 04:30 pm | Review exercise 5 (Zane) | +| 04:30 pm - 05:00 pm | Module 11 (Zane) | : {.striped .hover tbl-colwidths="[25,75]"} diff --git a/docs/search.json b/docs/search.json index 6274756..ffb718d 100644 --- a/docs/search.json +++ b/docs/search.json @@ -334,7 +334,7 @@ "href": "modules/Module07-VarCreationClassesSummaries.html#creating-conditional-variables", "title": "Module 7: Variable Creation, Classes, and Summaries", "section": "Creating conditional variables", - "text": "Creating conditional variables\nOne frequently used tool is creating variables with conditions. 
A general function for creating new variables based on existing variables is the Base R ifelse() function, which “returns a value depending on whether the element of test is TRUE or FALSE.”\n\n?ifelse\n\nConditional Element Selection\nDescription:\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\nUsage:\n ifelse(test, yes, no)\n \nArguments:\ntest: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\nDetails:\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\nValue:\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\nWarning:\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'if'.\nExamples:\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1", + "text": "Creating conditional variables\nOne frequently used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R ifelse() function, which “returns a value depending on whether the element of test is TRUE or FALSE or NA.\n\n?ifelse\n\nConditional Element Selection\nDescription:\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\nUsage:\n ifelse(test, yes, no)\n \nArguments:\ntest: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\nDetails:\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\nValue:\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\nWarning:\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'if'.\nExamples:\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1", "crumbs": [ "Day 1", "Module 7: Variable Creation, Classes, and Summaries" @@ -565,7 +565,7 @@ "href": "modules/Module07-VarCreationClassesSummaries.html#numeric-variable-data-summary-1", "title": "Module 7: Variable Creation, Classes, and Summaries", "section": "Numeric variable data summary", - "text": "Numeric variable data summary\nLet’s look at a help file for mean() to make note of the na.rm argument\n\n?range\n\nRange of Values\nDescription:\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\nUsage:\n range(..., na.rm = FALSE)\n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n ## same for classes 'Date' and 'POSIXct'\n \n .rangeNum(..., na.rm, finite, isNumeric)\n \nArguments:\n ...: any 'numeric' or character objects.\nna.rm: logical, indicating if ‘NA’’s should be omitted.\nfinite: logical, indicating if all non-finite elements should be omitted.\nisNumeric: a ‘function’ returning ‘TRUE’ or ‘FALSE’ when called on ‘c(…, recursive = TRUE)’, ‘is.numeric()’ for the default ‘range()’ method.\nDetails:\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' 
should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\nS4 methods:\n This is part of the S4 'Summary' group generic. Methods for it\n must use the signature 'x, ..., na.rm'.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\nExamples:\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)", + "text": "Numeric variable data summary\nLet’s look at a help file for range() to make note of the na.rm argument\n\n?range\n\nRange of Values\nDescription:\n 'range' returns a vector containing the minimum and maximum of all\n the given arguments.\nUsage:\n range(..., na.rm = FALSE)\n ## Default S3 method:\n range(..., na.rm = FALSE, finite = FALSE)\n ## same for classes 'Date' and 'POSIXct'\n \n .rangeNum(..., na.rm, finite, isNumeric)\n \nArguments:\n ...: any 'numeric' or character objects.\nna.rm: logical, indicating if ‘NA’’s should be omitted.\nfinite: logical, indicating if all non-finite elements should be omitted.\nisNumeric: a ‘function’ returning ‘TRUE’ or ‘FALSE’ when called on ‘c(…, recursive = TRUE)’, ‘is.numeric()’ for the default ‘range()’ method.\nDetails:\n 'range' is a generic function: methods can be defined for it\n directly or via the 'Summary' group generic. For this to work\n properly, the arguments '...' 
should be unnamed, and dispatch is\n on the first argument.\n\n If 'na.rm' is 'FALSE', 'NA' and 'NaN' values in any of the\n arguments will cause 'NA' values to be returned, otherwise 'NA'\n values are ignored.\n\n If 'finite' is 'TRUE', the minimum and maximum of all finite\n values is computed, i.e., 'finite = TRUE' _includes_ 'na.rm =\n TRUE'.\n\n A special situation occurs when there is no (after omission of\n 'NA's) nonempty argument left, see 'min'.\nS4 methods:\n This is part of the S4 'Summary' group generic. Methods for it\n must use the signature 'x, ..., na.rm'.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\nSee Also:\n 'min', 'max'.\n\n The 'extendrange()' utility in package 'grDevices'.\nExamples:\n (r.x <- range(stats::rnorm(100)))\n diff(r.x) # the SAMPLE range\n \n x <- c(NA, 1:3, -1:1/0); x\n range(x)\n range(x, na.rm = TRUE)\n range(x, finite = TRUE)", "crumbs": [ "Day 1", "Module 7: Variable Creation, Classes, and Summaries" @@ -609,7 +609,7 @@ "href": "modules/Module07-VarCreationClassesSummaries.html#summary", "title": "Module 7: Variable Creation, Classes, and Summaries", "section": "Summary", - "text": "Summary\n\nYou can create new columns/variable to a data frame by using $ or the transform() function\nOne useful function for creating new variables based on existing variables is the ifelse() function, which returns a value depending on whether the element of test is TRUE or FALSE\nThe class() function allows you to evaluate the class of an object.\nThere are two types of numeric class objects: integer and double\nLogical class objects only have TRUE or False (without quotes)\nis.CLASS_NAME(x) can be used to test the class of an object x\nas.CLASS_NAME(x) can be used to change the class of an object x\nFactors are a special character class that has levels\nThere are many fairly intuitive data summary functions you can perform on a vector (i.e., mean(), sd(), range()) or 
on rows or columns of a data frame (i.e., colSums(), colMeans(), rowSums())\nThe table() function builds frequency tables of the counts at each combination of categorical levels", + "text": "Summary\n\nYou can create new columns/variable to a data frame by using $ or the transform() function\nOne useful function for creating new variables based on existing variables is the ifelse() function, which returns a value depending on whether the element of test is TRUE or FALSE\nThe class() function allows you to evaluate the class of an object.\nThere are two types of numeric class objects: integer and double\nLogical class objects only have TRUE or FALSE or NA (without quotes)\nis.CLASS_NAME(x) can be used to test the class of an object x\nas.CLASS_NAME(x) can be used to change the class of an object x\nFactors are a special character class that has levels\nThere are many fairly intuitive data summary functions you can perform on a vector (i.e., mean(), sd(), range()) or on rows or columns of a data frame (i.e., colSums(), colMeans(), rowSums())\nThe table() function builds frequency tables of the counts at each combination of categorical levels", "crumbs": [ "Day 1", "Module 7: Variable Creation, Classes, and Summaries" @@ -1819,7 +1819,7 @@ "href": "references.html", "title": "Course Resources", "section": "", - "text": "Data and Exercise downloads\n\nDownload all datasets here: click to download.\nDownload all exercises and solution files here: click to download\nDownload all slide decks here: click to download\nGet the example R Markdown document for Module 11 here: click to download\n\nAnd the sample bibligraphy “bib” file is here: click to download\nAnd the rendered HTML file is here: click to download\n\nCourse GitHub where all materials can be found (to download the entire course as a zip file click the green “Code” button): https://github.com/UGA-IDD/SISMID-2024.\n\n\n\nNeed help?\n\nVarious “Cheat Sheets”: https://github.com/rstudio/cheatsheets/\nR reference 
card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf\n\nR jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n\nOther references\n\n\nBatra, Neale, Alex Spina, Paula Blomquist, Finlay Campbell, Henry Laurenson-Schafer, Florence Isaac, Natalie Fischer, et al. 2021. epiR Handbook. Edited by Neale Batra. https://epirhandbook.com/; Applied Epi Incorporated.\n\n\nCarchedi, Nick, and Sean Kross. 2024. “Learn r, in r.” Swirl. https://swirlstats.com/.\n\n\nKeyes, David. 2024. R for the Rest of Us: A Statistics-Free Introduction. San Francisco, CA: No Starch Press.\n\n\nMatloff, Norman. 2011. The Art of R Programming. San Francisco, CA: No Starch Press.\n\n\nR Core team. 2024. An Introduction to R. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. 
Sebastopol, CA: https://r4ds.hadley.nz/; O’Reilly Media.\n\n\n\n\n\n\n\n\nReuseCC BY-NC 4.0", + "text": "Data and Exercise downloads\n\nDownload all datasets here: click to download.\nDownload all exercises and solution files here: click to download\nDownload all slide decks here: click to download\nGet the example R Markdown document for Module 11 here: click to download\n\nAnd the sample bibligraphy “bib” file is here: click to download\nAnd the rendered HTML file is here: click to download\n\nCourse GitHub where all materials can be found (to download the entire course as a zip file click the green “Code” button): https://github.com/UGA-IDD/SISMID-2024.\n\n\n\nUseful (+ Free) Resources\n\nR for Data Science: http://r4ds.had.co.nz/\n(great general information)\nFundamentals of Data Visualization: https://clauswilke.com/dataviz/\nR for Epidemiology: https://www.r4epi.com/\nThe Epidemiologist R Handbook: https://epirhandbook.com/en/\nR basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html (great general information)\nOpen Case Studies: https://www.opencasestudies.org/\n(resource for specific public health cases with statistical implementation and interpretation)\n\n\n\nNeed help?\n\nVarious “Cheat Sheets”: https://github.com/rstudio/cheatsheets/\nR reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf\n\nR jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf\nR terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n\nOther references\n\n\nBatra, Neale, Alex Spina, Paula Blomquist, Finlay Campbell, Henry Laurenson-Schafer, Florence Isaac, Natalie Fischer, et al. 2021. epiR Handbook. Edited by Neale Batra. https://epirhandbook.com/; Applied Epi Incorporated.\n\n\nCarchedi, Nick, and Sean Kross. 2024. “Learn r, in r.” Swirl. https://swirlstats.com/.\n\n\nKeyes, David. 2024. 
R for the Rest of Us: A Statistics-Free Introduction. San Francisco, CA: No Starch Press.\n\n\nMatloff, Norman. 2011. The Art of R Programming. San Francisco, CA: No Starch Press.\n\n\nR Core team. 2024. An Introduction to R. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.\n\n\nWickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. Sebastopol, CA: https://r4ds.hadley.nz/; O’Reilly Media.\n\n\n\n\n\n\n\n\nReuseCC BY-NC 4.0", "crumbs": [ "Course Resources" ] @@ -1978,7 +1978,7 @@ "href": "schedule.html#day-02-tuesday", "title": "Course Schedule", "section": "Day 02 – Tuesday", - "text": "Day 02 – Tuesday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 09:00 am\nexercise review and questions / catchup\n\n\n09:00 am - 09:15 am\nModule 8\n\n\n09:15 am - 10:00 am\nExercise 3 work time\n\n\n10:00 am - 10:30 am\nCoffee break\n\n\n10:30 am - 10:45 am\nExercise review\n\n\n10:45 am - 11:15 am\nModule 9\n\n\n11:15 am - 12:00 pm\nData analysis walkthrough\n\n\n12:00 pm - 01:30 pm\nLunch (2nd floor lobby); Lunch and Learn!\n\n\n01:30 pm - 02:00 pm\nExercise 4\n\n\n02:00 pm - 02:30 pm\nExercise 4 review\n\n\n02:30 pm - 03:00 pm\nModule 10\n\n\n03:00 pm - 03:30 pm\nCoffee break\n\n\n03:30 pm - 04:00 pm\nExercise 5\n\n\n04:00 pm - 04:30 pm\nReview exercise 5\n\n\n04:30 pm - 05:00 pm\nModule 11", + "text": "Day 02 – Tuesday\n\n\n\n\n\n\n\nTime\nSection\n\n\n\n\n08:30 am - 09:00 am\nexercise review and questions / catchup (Zane)\n\n\n09:00 am - 09:15 am\nModule 8 (Amy)\n\n\n09:15 am - 10:00 am\nData analysis walkthrough (Zane and Amy)\n\n\n10:00 am - 10:30 am\nCoffee break\n\n\n10:30 am - 10:45 am\nExercise 3 work time\n\n\n10:45 am - 11:15 am\nExercise review (Zane)\n\n\n11:15 am - 12:00 pm\nModule 9 (Amy)\n\n\n12:00 pm - 01:30 pm\nLunch (2nd floor lobby); Lunch and Learn!\n\n\n01:30 pm - 02:00 pm\nExercise 4\n\n\n02:00 pm - 02:30 pm\nExercise 4 review (Zane)\n\n\n02:30 pm - 03:00 pm\nModule 10 (Amy)\n\n\n03:00 pm - 03:30 
pm\nCoffee break\n\n\n03:30 pm - 04:00 pm\nExercise 5\n\n\n04:00 pm - 04:30 pm\nReview exercise 5 (Zane)\n\n\n04:30 pm - 05:00 pm\nModule 11 (Zane)", "crumbs": [ "Course Schedule" ] @@ -2724,7 +2724,7 @@ "href": "modules/Module10-DataVisualization.html#base-r-data-visualizattion-functions", "title": "Module 10: Data Visualization", "section": "Base R data visualizattion functions", - "text": "Base R data visualizattion functions\nThe Base R ‘graphics’ package has a ton of graphics options.\n\nhelp(package = \"graphics\")\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n\n\n Information on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.3.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team <do-use-Contact-address@r-project.org>\nContact: R-help mailing list <r-help@r-project.org>\nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.3.1\nNeedsCompilation: yes\nBuilt: R 4.3.1; aarch64-apple-darwin20; 2023-06-16\n 21:53:01 UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt 
Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline", + "text": "Base R data visualizattion functions\nThe Base R ‘graphics’ package has a ton of graphics options.\n\nhelp(package = \"graphics\")\n\n\n\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame 
rmarkdown\n\n\n Information on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.4.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team <do-use-Contact-address@r-project.org>\nContact: R-help mailing list <r-help@r-project.org>\nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.4.1\nNeedsCompilation: yes\nEnhances: vcd\nBuilt: R 4.4.1; x86_64-w64-mingw32; 2024-06-14 08:20:40\n UTC; windows\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design 
Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline", "crumbs": [ "Day 2", "Module 10: Data Visualization" @@ -2768,7 +2768,7 @@ "href": "modules/Module10-DataVisualization.html#lots-of-parameters-options", "title": "Module 10: Data Visualization", "section": "Lots of parameters options", - "text": "Lots of parameters options\nHowever, there are many more parameter options that can be specified in the ‘global’ settings or specific to a certain plot option.\n\n?par\n\nSet or Query Graphical Parameters\nDescription:\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\nUsage:\n par(..., no.readonly = FALSE)\n \n <highlevel plot> (...., <tag> = <value>)\n \nArguments:\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character 
vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\nno.readonly: logical; if ‘TRUE’ and there are no other arguments, only parameters are returned which can be set by a subsequent ‘par()’ call on the same device.\nDetails:\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. 
However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\nValue:\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\nGraphical Parameters:\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. 
(Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. 
by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. 
Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. 
If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. 
Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. 
The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. 
If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). 
Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. 
(The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). 
For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\nColor Specification:\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. 
Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\nLine Type Specification:\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\nNote:\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\nExamples:\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))", + "text": "Lots of parameters options\nHowever, there are many more parameter options that can be specified in the ‘global’ settings or specific to a certain plot option.\n\n?par\n\nSet or Query Graphical Parameters\nDescription:\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\nUsage:\n par(..., no.readonly = FALSE)\n \n <highlevel plot> (...., <tag> = <value>)\n \nArguments:\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\nno.readonly: logical; if ‘TRUE’ and there are no other arguments, only parameters are returned which can be set by a subsequent ‘par()’ call on the same device.\nDetails:\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. 
(What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n * '\"ask\"',\n\n * '\"fig\"', '\"fin\"',\n\n * '\"lheight\"',\n\n * '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n * '\"new\"',\n\n * '\"oma\"', '\"omd\"', '\"omi\"',\n\n * '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n * '\"usr\"',\n\n * '\"xlog\"', '\"ylog\"',\n\n * '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. 
(The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\nValue:\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\nGraphical Parameters:\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. 
As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. 
These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. 
The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. 
The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. 
Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. 
Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. 
Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. (The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. 
It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. 
See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\nColor Specification:\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Hexadecimal colors can be in the long hexadecimal form\n (e.g., '\"#rrggbb\"' or '\"#rrggbbaa\"') or the short form (e.g,\n '\"#rgb\"' or '\"#rgba\"'). The short form is expanded to the long\n form by replicating digits (not by adding zeroes), e.g., '\"#rgb\"'\n becomes '\"#rrggbb\"'. Colors can also be specified by giving an\n index into a small table of colors, the 'palette': indices wrap\n round so with the default palette of size 8, '10' is the same as\n '2'. This provides compatibility with S. Index '0' corresponds\n to the background color. 
Note that the palette (apart from '0'\n which is per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\nLine Type Specification:\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\nNote:\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'pdf', 'postscript' and setting up device regions by\n 'layout' and 'split.screen'.\nExamples:\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))", "crumbs": [ "Day 2", "Module 10: Data Visualization" @@ -2812,7 +2812,7 @@ "href": "modules/Module10-DataVisualization.html#hist-help-file", "title": "Module 10: Data Visualization", "section": "hist() Help File", - "text": "hist() Help File\n\n?hist\n\nHistograms\nDescription:\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\nUsage:\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n x: a vector of values for which the histogram is desired.\nbreaks: one of:\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\nfreq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\nprobability: an alias for ‘!freq’, for S compatibility.\ninclude.lowest: logical; if ‘TRUE’, an ‘x[i]’ equal to the ‘breaks’ value will be included in the first (or last, for ‘right = FALSE’) bar. 
This will be ignored (with a warning) unless ‘breaks’ is a vector.\nright: logical; if ‘TRUE’, the histogram cells are right-closed (left open) intervals.\nfuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\ndensity: the density of shading lines, in lines per inch. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise).\n col: a colour to be used to fill the bars.\nborder: the color of the border around the bars. The default is to use the standard foreground color.\nmain, xlab, ylab: main title and axis labels: these arguments to ‘title()’ get “smart” defaults here, e.g., the default ‘ylab’ is ‘“Frequency”’ iff ‘freq’ is true.\nxlim, ylim: the range of x and y values with sensible defaults. Note that ‘xlim’ is not used to define the histogram (breaks), but only for plotting (when ‘plot = TRUE’).\naxes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\nplot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\nlabels: logical or character string. Additionally draw labels on top of bars, if not ‘FALSE’; see ‘plot.histogram’.\nnclass: numeric (integer). For S(-PLUS) compatibility only, ‘nclass’ is equivalent to ‘breaks’ for a scalar or character argument.\nwarn.unused: logical. 
If ‘plot = FALSE’ and ‘warn.unused = TRUE’, a warning will be issued when graphical parameters are passed to ‘hist.default()’.\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\nDetails:\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\nValue:\n an object of class '\"histogram\"' which is a list with components:\nbreaks: the n+1 cell boundaries (= ‘breaks’ if that was a vector). 
These are the nominal breaks, not with the boundary fuzz.\ncounts: n integers; for each cell, the number of ‘x[]’ inside.\ndensity: values f^(x[i]), as estimated density values. If ‘all(diff(breaks) == 1)’, they are the relative frequencies ‘counts/n’ and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = ‘breaks[i]’.\nmids: the n cell midpoints.\nxname: a character string with the actual ‘x’ argument name.\nequidist: logical, indicating if the distances between ‘breaks’ are all the same.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\nSee Also:\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\nExamples:\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines 
how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... :\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)", + "text": "hist() Help File\n\n?hist\n\nHistograms\nDescription:\n The generic function 'hist' computes a histogram of the given data\n values. If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\nUsage:\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n x: a vector of values for which the histogram is desired.\nbreaks: one of:\n * a vector giving the breakpoints between histogram cells,\n\n * a function to compute the vector of breakpoints,\n\n * a single number giving the number of cells for the\n histogram,\n\n * a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n * a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). 
If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\nfreq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\nprobability: an alias for ‘!freq’, for S compatibility.\ninclude.lowest: logical; if ‘TRUE’, an ‘x[i]’ equal to the ‘breaks’ value will be included in the first (or last, for ‘right = FALSE’) bar. This will be ignored (with a warning) unless ‘breaks’ is a vector.\nright: logical; if ‘TRUE’, the histogram cells are right-closed (left open) intervals.\nfuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\ndensity: the density of shading lines, in lines per inch. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise).\n col: a colour to be used to fill the bars.\nborder: the color of the border around the bars. The default is to use the standard foreground color.\nmain, xlab, ylab: main title and axis labels: these arguments to ‘title()’ get “smart” defaults here, e.g., the default ‘ylab’ is ‘“Frequency”’ iff ‘freq’ is true.\nxlim, ylim: the range of x and y values with sensible defaults. Note that ‘xlim’ is not used to define the histogram (breaks), but only for plotting (when ‘plot = TRUE’).\naxes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\nplot: logical. 
If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\nlabels: logical or character string. Additionally draw labels on top of bars, if not ‘FALSE’; see ‘plot.histogram’.\nnclass: numeric (integer). For S(-PLUS) compatibility only, ‘nclass’ is equivalent to ‘breaks’ for a scalar or character argument.\nwarn.unused: logical. If ‘plot = FALSE’ and ‘warn.unused = TRUE’, a warning will be issued when graphical parameters are passed to ‘hist.default()’.\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\nDetails:\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equispaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equispaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. 
This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\nValue:\n an object of class '\"histogram\"' which is a list with components:\nbreaks: the n+1 cell boundaries (= ‘breaks’ if that was a vector). These are the nominal breaks, not with the boundary fuzz.\ncounts: n integers; for each cell, the number of ‘x[]’ inside.\ndensity: values f^(x[i]), as estimated density values. If ‘all(diff(breaks) == 1)’, they are the relative frequencies ‘counts/n’ and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = ‘breaks[i]’.\nmids: the n cell midpoints.\nxname: a character string with the actual ‘x’ argument name.\nequidist: logical, indicating if the distances between ‘breaks’ are all the same.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\nSee Also:\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\nExamples:\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... 
:\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)", "crumbs": [ "Day 2", "Module 10: Data Visualization" @@ -2834,7 +2834,7 @@ "href": "modules/Module10-DataVisualization.html#plot-help-file", "title": "Module 10: Data Visualization", "section": "plot() Help File", - "text": "plot() Help File\n\n?plot\n\nGeneric X-Y Plotting\nDescription:\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\nUsage:\n plot(x, y, ...)\n \nArguments:\n x: the coordinates of points in the plot. Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: Arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. 
Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\nDetails:\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\nNote:\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\nSee Also:\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\nExamples:\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")", + "text": "plot() Help File\n\n?plot\n\nGeneric X-Y Plotting\nDescription:\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. 
Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\nUsage:\n plot(x, y, ...)\n \nArguments:\n x: the coordinates of points in the plot. Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n * '\"p\"' for *p*oints,\n\n * '\"l\"' for *l*ines,\n\n * '\"b\"' for *b*oth,\n\n * '\"c\"' for the lines part alone of '\"b\"',\n\n * '\"o\"' for both '*o*verplotted',\n\n * '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n * '\"s\"' for stair *s*teps,\n\n * '\"S\"' for other *s*teps, see 'Details' below,\n\n * '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\nDetails:\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\nNote:\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. 
It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\nSee Also:\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\nExamples:\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")", "crumbs": [ "Day 2", "Module 10: Data Visualization" @@ -2867,7 +2867,7 @@ "href": "modules/Module10-DataVisualization.html#boxplot-help-file", "title": "Module 10: Data Visualization", "section": "boxplot() Help File", - "text": "boxplot() Help File\n\n?boxplot\n\nBox Plots\nDescription:\n Produce box-and-whisker plot(s) of the given (grouped) values.\nUsage:\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\nformula: a formula, such as ‘y ~ grp’, where ‘y’ is a numeric vector of data values to be split 
into groups according to the grouping variable ‘grp’ (usually a factor). Note that ‘~ g1 + g2’ is equivalent to ‘g1:g2’.\ndata: a data.frame (or list) from which the variables in 'formula'\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used for plotting.\nna.action: a function which indicates what should happen when the data contain ’NA’s. The default is to ignore missing values in either the response or the group.\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ‘ann=FALSE’.\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\ndrop, sep, lex.order: passed to ‘split.default’, see there.\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\nrange: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. 
A value of zero causes the whiskers to extend to the data extremes.\nwidth: a vector giving the relative widths of the boxes making up the plot.\nvarwidth: if ‘varwidth’ is ‘TRUE’, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.\nnotch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See ‘boxplot.stats’ for the calculations used.\noutline: if ‘outline’ is not true, the outliers are not drawn (as points whereas S+ uses lines).\nnames: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).\nboxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.\nstaplewex: staple line width expansion, proportional to box width.\noutwex: outlier line width expansion, proportional to box width.\nplot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\nborder: an optional vector of colors for the outlines of the boxplots. The values in ‘border’ are recycled if the length of ‘border’ is less than the number of plots.\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\npars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\nhorizontal: logical indicating if the boxplots should be horizontal; default ‘FALSE’ means vertical boxes.\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\nDetails:\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\nValue:\n List with the following components:\nstats: a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.\n n: a vector with the number of (non-'NA') observations in each\n group.\n\nconf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\ngroup: a vector of the same length as ‘out’ whose elements indicate to which group the outlier belongs.\nnames: a vector of names for the groups.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). _Graphical Methods for Data Analysis_. 
Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\nSee Also:\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\nExamples:\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)", + "text": "boxplot() Help File\n\n?boxplot\n\nBox Plots\nDescription:\n Produce box-and-whisker plot(s) of the given (grouped) values.\nUsage:\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\nformula: a formula, such as ‘y ~ grp’, where ‘y’ is a numeric vector of data values to be split into groups according to the grouping variable ‘grp’ 
(usually a factor). Note that ‘~ g1 + g2’ is equivalent to ‘g1:g2’.\ndata: a data.frame (or list) from which the variables in 'formula'\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used for plotting.\nna.action: a function which indicates what should happen when the data contain ’NA’s. The default is to ignore missing values in either the response or the group.\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ‘ann=FALSE’.\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\ndrop, sep, lex.order: passed to ‘split.default’, see there.\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\nrange: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. 
A value of zero causes the whiskers to extend to the data extremes.\nwidth: a vector giving the relative widths of the boxes making up the plot.\nvarwidth: if ‘varwidth’ is ‘TRUE’, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.\nnotch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al., 1983, p. 62). See ‘boxplot.stats’ for the calculations used.\noutline: if ‘outline’ is not true, the outliers are not drawn (as points whereas S+ uses lines).\nnames: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).\nboxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.\nstaplewex: staple line width expansion, proportional to box width.\noutwex: outlier line width expansion, proportional to box width.\nplot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\nborder: an optional vector of colors for the outlines of the boxplots. The values in ‘border’ are recycled if the length of ‘border’ is less than the number of plots.\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\npars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\nhorizontal: logical indicating if the boxplots should be horizontal; default ‘FALSE’ means vertical boxes.\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\nDetails:\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\nValue:\n List with the following components:\nstats: a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.\n n: a vector with the number of (non-'NA') observations in each\n group.\n\nconf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\ngroup: a vector of the same length as ‘out’ whose elements indicate to which group the outlier belongs.\nnames: a vector of names for the groups.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). _Graphical Methods for Data Analysis_. 
Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\nSee Also:\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\nExamples:\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)", "crumbs": [ "Day 2", "Module 10: Data Visualization" @@ -2889,7 +2889,7 @@ "href": "modules/Module10-DataVisualization.html#barplot-help-file", "title": "Module 10: Data Visualization", "section": "barplot() Help File", - "text": "barplot() Help File\n\n?barplot\n\nBar Plots\nDescription:\n Creates a bar plot with vertical or horizontal bars.\nUsage:\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 
method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\nheight: either a vector or matrix of values describing the bars which make up the plot. If ‘height’ is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If ‘height’ is a matrix and ‘beside’ is ‘FALSE’ then each bar of the plot corresponds to a column of ‘height’, with the values in the column giving the heights of stacked sub-bars making up the bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, then the values in each column are juxtaposed rather than stacked.\nwidth: optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless ‘xlim’ is specified.\nspace: the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, ‘space’ may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to ‘c(0,1)’ if ‘height’ is a matrix and ‘beside’ is ‘TRUE’, and to 0.2 otherwise.\nnames.arg: a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the ‘names’ attribute of ‘height’ if this is a vector, or the column names if it is a matrix.\nlegend.text: a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when ‘height’ is a matrix. In that case given legend labels should correspond to the rows of ‘height’; if ‘legend.text’ is true, the row names of ‘height’ will be used as labels if they are non-null.\nbeside: a logical value. 
If ‘FALSE’, the columns of ‘height’ are portrayed as stacked bars, and if ‘TRUE’ the columns are portrayed as juxtaposed bars.\nhoriz: a logical value. If ‘FALSE’, the bars are drawn vertically with the first bar to the left. If ‘TRUE’, the bars are drawn horizontally with the first at the bottom.\ndensity: a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\nborder: the color to be used for the border of the bars. Use ‘border = NA’ to omit borders. If there are shading lines, ‘border = TRUE’ means use the same colour for the border as for the shading lines.\nmain,sub: main title and subtitle for the plot.\nxlab: a label for the x axis.\n\nylab: a label for the y axis.\n\nxlim: limits for the x axis.\n\nylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\naxes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\naxisnames: logical. If ‘TRUE’, and if there are ‘names.arg’ (see above), the other axis is drawn (with ‘lty = 0’) and labeled.\ncex.axis: expansion factor for numeric axis labels (see ‘par(’cex’)’).\ncex.names: expansion factor for axis names (bar labels).\ninside: logical. If ‘TRUE’, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when ‘space = 0’ (which it partly is when ‘beside = TRUE’).\nplot: logical. 
If 'FALSE', nothing is plotted.\naxis.lty: the graphics parameter ‘lty’ (see ‘par(’lty’)’) applied to the axis and tick marks of the categorical (default horizontal) axis. Note that by default the axis is suppressed.\noffset: a vector indicating how much the bars should be shifted relative to the x axis.\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\nargs.legend: list of additional arguments to pass to ‘legend()’; names of the list are used as argument names. Only used if ‘legend.text’ is supplied.\nformula: a formula where the ‘y’ variables are numeric data to plot against the categorical ‘x’ variables. The formula can have one of three forms:\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\ndata: a data frame (or list) from which the variables in formula\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used.\nna.action: a function which indicates what should happen when the data contain ‘NA’ values. The default is to ignore missing values in the given variables.\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\nValue:\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\nAuthor(s):\n R Core, with a contribution by Arni Magnusson.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\nSee Also:\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\nExamples:\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 
4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))", + "text": "barplot() Help File\n\n?barplot\n\nBar Plots\nDescription:\n Creates a bar plot with vertical or horizontal bars.\nUsage:\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\nheight: either a vector or matrix of values describing the bars which make up the plot. 
If ‘height’ is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If ‘height’ is a matrix and ‘beside’ is ‘FALSE’ then each bar of the plot corresponds to a column of ‘height’, with the values in the column giving the heights of stacked sub-bars making up the bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, then the values in each column are juxtaposed rather than stacked.\nwidth: optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless ‘xlim’ is specified.\nspace: the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, ‘space’ may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to ‘c(0,1)’ if ‘height’ is a matrix and ‘beside’ is ‘TRUE’, and to 0.2 otherwise.\nnames.arg: a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the ‘names’ attribute of ‘height’ if this is a vector, or the column names if it is a matrix.\nlegend.text: a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when ‘height’ is a matrix. In that case given legend labels should correspond to the rows of ‘height’; if ‘legend.text’ is true, the row names of ‘height’ will be used as labels if they are non-null.\nbeside: a logical value. If ‘FALSE’, the columns of ‘height’ are portrayed as stacked bars, and if ‘TRUE’ the columns are portrayed as juxtaposed bars.\nhoriz: a logical value. If ‘FALSE’, the bars are drawn vertically with the first bar to the left. 
If ‘TRUE’, the bars are drawn horizontally with the first at the bottom.\ndensity: a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.\nangle: the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\nborder: the color to be used for the border of the bars. Use ‘border = NA’ to omit borders. If there are shading lines, ‘border = TRUE’ means use the same colour for the border as for the shading lines.\nmain, sub: main title and subtitle for the plot.\nxlab: a label for the x axis.\n\nylab: a label for the y axis.\n\nxlim: limits for the x axis.\n\nylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\naxes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\naxisnames: logical. If ‘TRUE’, and if there are ‘names.arg’ (see above), the other axis is drawn (with ‘lty = 0’) and labeled.\ncex.axis: expansion factor for numeric axis labels (see ‘par(’cex’)’).\ncex.names: expansion factor for axis names (bar labels).\ninside: logical. If ‘TRUE’, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when ‘space = 0’ (which it partly is when ‘beside = TRUE’).\nplot: logical. If 'FALSE', nothing is plotted.\naxis.lty: the graphics parameter ‘lty’ (see ‘par(’lty’)’) applied to the axis and tick marks of the categorical (default horizontal) axis. 
Note that by default the axis is suppressed.\noffset: a vector indicating how much the bars should be shifted relative to the x axis.\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\nargs.legend: list of additional arguments to pass to ‘legend()’; names of the list are used as argument names. Only used if ‘legend.text’ is supplied.\nformula: a formula where the ‘y’ variables are numeric data to plot against the categorical ‘x’ variables. The formula can have one of three forms:\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\ndata: a data frame (or list) from which the variables in formula\n should be taken.\nsubset: an optional vector specifying a subset of observations to be used.\nna.action: a function which indicates what should happen when the data contain ‘NA’ values. The default is to ignore missing values in the given variables.\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\nValue:\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\nAuthor(s):\n R Core, with a contribution by Arni Magnusson.\nReferences:\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\nSee Also:\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\nExamples:\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 
2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))", "crumbs": [ "Day 2", "Module 10: Data Visualization" @@ -2983,6 +2983,17 @@ "Module 10: Data Visualization" ] }, + { + "objectID": "modules/Module10-DataVisualization.html#saving-plots-to-file", + "href": "modules/Module10-DataVisualization.html#saving-plots-to-file", + "title": "Module 10: Data Visualization", + "section": "Saving plots to file", + "text": "Saving plots to file\nIf you want to include your graphic in a paper or anything else, you need to save it as an image. 
One limitation of base R graphics is that the process for saving plots is a bit annoying.\n\nOpen a graphics device connection with a graphics function – examples include pdf(), png(), and tiff() for the most useful.\nRun the code that creates your plot.\nUse dev.off() to close the graphics device connection.\n\nLet’s do an example.\n\n# Open the graphics device\npng(\n \"my-barplot.png\",\n width = 800,\n height = 450,\n units = \"px\"\n)\n# Set the plot layout -- this is an alternative to par(mfrow = ...)\nlayout(matrix(c(1, 2), ncol = 2))\n# Make the plot\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n fill=c(\"darkblue\",\"red\"), \n legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n# Close the graphics device\ndev.off()\n\npng \n 2 \n\n# Reset the layout\nlayout(1)\n\nNote: after you do an interactive graphics session, it is often helpful to restart R or run the function graphics.off() before opening the graphics connection device.", + "crumbs": [ + "Day 2", + "Module 10: Data Visualization" + ] + }, { "objectID": "modules/Module10-DataVisualization.html#base-r-plots-vs-the-tidyverse-ggplot2-package", "href": "modules/Module10-DataVisualization.html#base-r-plots-vs-the-tidyverse-ggplot2-package", @@ -3197,7 +3208,7 @@ "href": "modules/Module06-DataSubset.html#using-indexing-and-logical-operators-to-rename-columns", "title": "Module 6: Get to Know Your Data and Subsetting", "section": "Using indexing and logical operators to rename columns", - "text": "Using indexing and logical operators to rename columns\n\nWe can assign the column names from data frame df to an object cn, 
then we can modify cn directly using indexing and logical operators, finally we reassign the column names, cn, back to the data frame df:\n\n\ncn <- colnames(df)\ncn\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\ncn==\"IgG_concentration\"\n\n[1] FALSE TRUE FALSE FALSE FALSE\n\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_mIU\" #rename cn to \"IgG_concentration_mIU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration_mIU\" \"age\" \n[4] \"gender\" \"slum\" \n\n\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\ncolnames(df)[colnames(df)==\"IgG_concentration_mIU\"] <- \"IgG_concentration\" #reset", + "text": "Using indexing and logical operators to rename columns\n\nWe can assign the column names from data frame df to an object cn, then we can modify cn directly using indexing and logical operators, finally we reassign the column names, cn, back to the data frame df:\n\n\ncn <- colnames(df)\ncn\n\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n\ncn==\"IgG_concentration\"\n\n[1] FALSE TRUE FALSE FALSE FALSE\n\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_IU/mL\" #rename cn to \"IgG_concentration_IU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n\n[1] \"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age\" \"gender\" \n[5] \"slum\" \n\n\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\ncolnames(df)[colnames(df)==\"IgG_concentration_IU/mL\"] <- \"IgG_concentration\" #reset", "crumbs": [ "Day 1", "Module 6: Get to Know Your Data and Subsetting" diff --git a/modules/Module06-DataSubset.qmd b/modules/Module06-DataSubset.qmd index a2bc2af..dd0f181 100644 --- a/modules/Module06-DataSubset.qmd +++ b/modules/Module06-DataSubset.qmd @@ -282,7 +282,7 @@ colnames(df) Note, I am resetting the 
column name back to the original name for the sake of the rest of the module.
 
 ```{r echo=TRUE}
-colnames(df)[colnames(df)=="IgG_concentration_IU"] <- "IgG_concentration" #reset
+colnames(df)[colnames(df)=="IgG_concentration_IU/mL"] <- "IgG_concentration" #reset
 ```
 
 
diff --git a/modules/Module10-DataVisualization.qmd b/modules/Module10-DataVisualization.qmd
index c58b660..587e84b 100644
--- a/modules/Module10-DataVisualization.qmd
+++ b/modules/Module10-DataVisualization.qmd
@@ -436,6 +436,49 @@ axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
 legend("topright", fill=c("darkblue","red"), legend = c("seronegative", "seropositive"))
 ```
 
+## Saving plots to file
+
+If you want to include your graphic in a paper or anything else, you need to
+save it as an image. One limitation of base R graphics is that the process for
+saving plots is a bit annoying.
+
+1. Open a graphics device connection with a graphics function -- examples
+include `pdf()`, `png()`, and `tiff()` for the most useful.
+1. Run the code that creates your plot.
+1. Use `dev.off()` to close the graphics device connection.
+
+Let's do an example.
+
+```{r}
+# Open the graphics device
+png(
+ "my-barplot.png",
+ width = 800,
+ height = 450,
+ units = "px"
+)
+# Set the plot layout -- this is an alternative to par(mfrow = ...)
+layout(matrix(c(1, 2), ncol = 2))
+# Make the plot
+barplot(prop.column.percentages, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Age Group")
+axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
+legend("topright",
+ fill=c("darkblue","red"),
+ legend = c("seronegative", "seropositive"))
+
+barplot(prop.column.percentages2, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Residence")
+axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
+legend("topright", fill=c("darkblue","red"), legend = c("seronegative", "seropositive"))
+# Close the graphics device
+dev.off()
+# Reset the layout
+layout(1)
+```
+
+Note: after you do an interactive graphics session, it is often helpful to
+restart R or run the function `graphics.off()` before opening the graphics
+connection device.
+
 ## Base R plots vs the Tidyverse ggplot2 package
 
 It is good to know both b/c they each have their strengths
diff --git a/modules/my-barplot.png b/modules/my-barplot.png
new file mode 100644
index 0000000..2d940fc
Binary files /dev/null and b/modules/my-barplot.png differ
diff --git a/my-barplot.png b/my-barplot.png
new file mode 100644
index 0000000..c1775b6
Binary files /dev/null and b/my-barplot.png differ
diff --git a/references.qmd b/references.qmd
index 5da13e4..e8a9367 100644
--- a/references.qmd
+++ b/references.qmd
@@ -14,6 +14,18 @@ nocite: "@*"
 - And the rendered HTML file is here: [click to download](./modules/Module11-Rmarkdown-Demo.html){target="_blank"}
 * Course GitHub where all materials can be found (to download the entire course as a zip file click the green "Code" button): [https://github.com/UGA-IDD/SISMID-2024](https://github.com/UGA-IDD/SISMID-2024){target="_blank"}.
 
+# Useful (+ Free) Resources
+
+- R for Data Science: http://r4ds.had.co.nz/
+(great general information)
+- Fundamentals of Data Visualization: https://clauswilke.com/dataviz/
+- R for Epidemiology: https://www.r4epi.com/
+- The Epidemiologist R Handbook: https://epirhandbook.com/en/
+- R basics by Rafael A. Irizarry: https://rafalab.github.io/dsbook/r-basics.html
+(great general information)
+- Open Case Studies: https://www.opencasestudies.org/
+(resource for specific public health cases with statistical implementation and interpretation)
+
 # Need help?
 
 - Various "Cheat Sheets": [https://github.com/rstudio/cheatsheets/](https://github.com/rstudio/cheatsheets/)
diff --git a/schedule.qmd b/schedule.qmd
index d289e7f..630e20e 100644
--- a/schedule.qmd
+++ b/schedule.qmd
@@ -40,21 +40,21 @@ All times are in Eastern Daylight Time (EDT; UTC-4)
 
 | Time | Section |
 |:--------------------|:--------|
-| 08:30 am - 09:00 am | exercise review and questions / catchup |
-| 09:00 am - 09:15 am | Module 8 |
-| 09:15 am - 10:00 am | Exercise 3 work time |
+| 08:30 am - 09:00 am | exercise review and questions / catchup (Zane) |
+| 09:00 am - 09:15 am | Module 8 (Amy) |
+| 09:15 am - 10:00 am | Data analysis walkthrough (Zane and Amy) |
 | 10:00 am - 10:30 am | Coffee break |
-| 10:30 am - 10:45 am | Exercise review |
-| 10:45 am - 11:15 am | Module 9 |
-| 11:15 am - 12:00 pm | Data analysis walkthrough |
+| 10:30 am - 10:45 am | Exercise 3 work time |
+| 10:45 am - 11:15 am | Exercise review (Zane) |
+| 11:15 am - 12:00 pm | Module 9 (Amy) |
 | 12:00 pm - 01:30 pm | Lunch (2nd floor lobby); **Lunch and Learn!** |
 | 01:30 pm - 02:00 pm | Exercise 4 |
-| 02:00 pm - 02:30 pm | Exercise 4 review |
-| 02:30 pm - 03:00 pm | Module 10 |
+| 02:00 pm - 02:30 pm | Exercise 4 review (Zane) |
+| 02:30 pm - 03:00 pm | Module 10 (Amy) |
 | 03:00 pm - 03:30 pm | Coffee break |
 | 03:30 pm - 04:00 pm | Exercise 5 |
-| 04:00 pm - 04:30 pm | Review exercise 5 |
-| 04:30 pm - 05:00 pm | Module 11 |
+| 04:00 pm - 04:30 pm | Review exercise 5 (Zane) |
+| 04:30 pm - 05:00 pm | Module 11 (Zane) |
 
 : {.striped .hover tbl-colwidths="[25,75]"}