update modules 06-09

UGA-IDD · Jul 14, 2024 · fda1c44 · fda1c44
1 parent 2bf2804
commit fda1c44

Showing 13 changed files with 3,543 additions and 67 deletions.
diff --git a/_freeze/modules/Module06-DataSubset/execute-results/html.json b/_freeze/modules/Module06-DataSubset/execute-results/html.json
diff --git a/_freeze/site_libs/revealjs/dist/theme/quarto.css b/_freeze/site_libs/revealjs/dist/theme/quarto.css
diff --git a/docs/modules/Module06-DataSubset.html b/docs/modules/Module06-DataSubset.html
diff --git a/docs/modules/images/View.png b/docs/modules/images/View.png
diff --git a/docs/modules/images/ViewTab.png b/docs/modules/images/ViewTab.png
diff --git a/docs/search.json b/docs/search.json
diff --git a/docs/site_libs/quarto-html/quarto-html.min.css b/docs/site_libs/quarto-html/quarto-html.min.css
diff --git a/docs/site_libs/quarto-html/quarto-syntax-highlighting.css b/docs/site_libs/quarto-html/quarto-syntax-highlighting.css
diff --git a/docs/site_libs/revealjs/dist/theme/quarto.css b/docs/site_libs/revealjs/dist/theme/quarto.css
diff --git a/modules/Module06-DataSubset.qmd b/modules/Module06-DataSubset.qmd
@@ -4,22 +4,23 @@ format:
   revealjs:
     scrollable: true
     smaller: true
+    toc: false
 ---
 
 ## Learning Objectives
 
 After module 6, you should be able to...
 
 -   Use basic functions to get to know your data
--   Use two approaches to indexing
+-   Use three indexing approaches
 -   Rely on indexing to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)
 -   Describe what logical operators are and how to use them
 -   Use the `subset()` function to subset data
 
 
 ## Getting to know our data
 
-The `dim()` , `nrow()`, and `ncol()` functions are good options to check the dimensions of your data before moving forward. 
+The `dim()`, `nrow()`, and `ncol()` functions are good options to check the dimensions of your data before moving forward. 
 
 Let's first read in the data from the previous module.
 
@@ -47,13 +48,17 @@ Note, if you have a very large dataset with 15+ variables, `summary()` is not so
 
 ## Description of data
 
-This is data based on a simulated pathogen X IgG serological survey.  The rows represent individuals. Variables include IgG concentrations, age in years, gender, and residence based on slum characterization.  We will use this dataset for lectures throughout the Workshop.
+This is data based on a simulated pathogen X IgG antibody serological survey.  The rows represent individuals. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization.  We will use this dataset for lectures throughout the Workshop.
 
 ## View the data as a whole dataframe
 
-The `View()` function, one of the few Base R functions with a capital letter can be used to open a new tab in the Console and view the data as you would in excel, for example.  
+The `View()` function, one of the few Base R functions with a capital letter, can be used to open a new tab in the Console and view the data as you would in Excel.
 
-```{r, out.width = "50%", echo = FALSE}
+```{r echo=TRUE, eval=FALSE}
+View(df)
+```
+
+```{r, out.width = "100%", echo = FALSE}
 knitr::include_graphics("images/ViewTab.png")
 ```
 
@@ -67,48 +72,53 @@ knitr::include_graphics("images/View.png")
 
 ## Indexing
 
-R contains several constructs which allow access to individual elements or subsets through indexing operations. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).
+R contains several constructs which allow access to individual elements or subsets through indexing operations. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: `[ ]`, `[[ ]]`, and `$`.
 
-```
-x[i]
-x[i, j]
-x[[i]]
-x$a
-x$"a"
+```{r echo=TRUE, eval=FALSE}
+x[i] #if x is a vector
+x[i, j] #if x is a matrix/data frame
+x[[i]] #if x is a list
+x$a #if x is a data frame or list
+x$"a" #if x is a data frame or list
 ```
 
 ## Vectors and multi-dimensional objects
 
 To index a vector, `vector[i]` selects the ith element. To index a multi-dimensional object such as a matrix, `matrix[i, j]` selects the element in row i and column j, whereas in a three-dimensional array, `array[i, j, k]` selects the element in row i and column j of matrix k. 
 
-```{r echo=F}
+Let's practice by first creating the same objects as we did in Module 1.
+```{r echo=T}
 number.object <- 3
 character.object <- "blue"
 vector.object1 <- c(2,3,4,5)
 vector.object2 <- c("blue", "red", "yellow")
 matrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)
 ```
 
+Here is a reminder of what these objects look like.
 ```{r echo=T}
 vector.object1
 matrix.object
 ```
 
+Finally, let's use indexing to pull out elements of the objects.  
 ```{r echo=T}
-vector.object1[2]
-matrix.object[1,2]
+vector.object1[2] #pulling the second element
+matrix.object[1,2] #pulling the element in row 1 column 2
 ```
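The chunk above covers vectors and matrices but not the three-dimensional case. R indexes a three-dimensional array as `array[i, j, k]` (row, column, matrix). A minimal sketch, assuming a hypothetical `array.object` that is not part of the module:

```r
# Hypothetical 2 x 3 x 4 array: 2 rows, 3 columns, 4 stacked matrices
array.object <- array(1:24, dim = c(2, 3, 4))

# Row 1, column 2 of the third matrix
array.element <- array.object[1, 2, 3]
array.element

# Leaving an index empty keeps that whole dimension:
# here, the entire third matrix (2 rows x 3 columns)
third.matrix <- array.object[, , 3]
dim(third.matrix)
```

Because R fills arrays column by column, the values in each matrix run down the rows first, which is worth keeping in mind when checking an index by eye.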
 
 
 ## List objects
 
-For lists, one generally uses `list[[p]] to select any single element p.
+For lists, one generally uses `list[[p]]` to select any single element p.
 
-```{r}
+Let's practice by creating the same list as we did in Module 1.
+```{r echo=TRUE}
 list.object <- list(number.object, vector.object2, matrix.object)
 list.object
 ```
 
+Now we use indexing to pull out the 3rd element in the list.
 ```{r echo=T}
 list.object[[3]]
 ```
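A common point of confusion is the difference between `[ ]` and `[[ ]]` on a list. The sketch below rebuilds a small list with assumed values (named `example.list` here so it does not overwrite anything in the module) and compares the two:

```r
# Self-contained stand-in for list.object (values assumed for illustration)
example.list <- list(
  3,
  c("blue", "red", "yellow"),
  matrix(c(2, 3, 4, 5), nrow = 2, ncol = 2, byrow = TRUE)
)

single <- example.list[3]    # still a list, of length 1, holding the matrix
double <- example.list[[3]]  # the matrix itself

is.list(single)
is.matrix(double)

# Only the [[ ]] result can be indexed as a matrix directly
double[1, 2]
```

In short, `[ ]` keeps the container and `[[ ]]` reaches inside it, which is why `[[ ]]` is the usual choice for pulling a single element.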
@@ -121,11 +131,12 @@ list.object[[3]]
 df$IgG_concentration
 ```
 
+Note, if you have spaces in your variable name, you will need to wrap the name in backticks after the `$`.  This is a good reason to not create variables / column names with spaces.
+
 ## $ for indexing with lists
 
 List elements can be named
-
-```{r makeListv}
+```{r echo=TRUE}
 list.object.named <- list(
   emory = number.object,
   uga = vector.object2,
@@ -134,11 +145,10 @@ list.object.named <- list(
 list.object.named
 ```
 
-You can reference data from list using `$` (if elements are named) or using double square brackets, `[[ ]]`
-
-```{r}
-list.object.named[["uga"]] 
+If list elements are named, then you can reference data from the list using `$` or using double square brackets, `[[ ]]`
+```{r echo=TRUE}
 list.object.named$uga 
+list.object.named[["uga"]] 
 ```
 
 
@@ -153,34 +163,38 @@ colnames(df)
 colnames(df)[1:2] <- c("IgG_concentration", "age") #reset
 ```
 
-##  Using indexing to subset data
+##  Using indexing to subset by columns
 
-We can also subset a data frames and matrices (2-dimensional objects) using the bracket `[, ]`. 
-
-We can subset by columns and pull the `x` column using the index of the column or the column name ("`age`") 
+We can also subset data frames and matrices (2-dimensional objects) using brackets, `[ row , column ]`.  To subset by columns, we can pull a column using either its index or its name. 
 
+For example, here I am pulling the 3rd column, which has the variable name `age`:
 ```{r echo=T}
-df[, "age"] #same as df[, 2]
+df[ , "age"] #same as df[ , 3]
 ```
 We can select multiple columns using multiple column names:
 ```{r echo=T}
-df[, c("age", "gender")]
+df[, c("age", "gender")] #same as df[ , c(3,4)]
 ```
-We can remove select columns using column names as well: (xxzane - why - c("slum") not working)
+We can remove select columns using indexing as well, OR by simply changing the column to `NULL`
 ```{r echo=T}
-df[, -3] #remove column 3, "slum" variable
-#Note df$slum <- NULL would also work
+df[, -5] #remove column 5, "slum" variable
+```
+```{r echo=TRUE, eval=FALSE}
+df$slum <- NULL # removes "slum" from df itself, whereas df[, -5] only returns a copy without it
 ```
 We can also grab the `age` column using the `$` operator. 
 ```{r echo=T}
 df$age
 ```
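Negative indexing works with column positions but not with column names, so `df[, -c("slum")]` fails with an error. One way around this is to build a logical vector from the names. The sketch below uses a made-up `toy` data frame (column names borrowed from the module, values invented) rather than the real `df`:

```r
# Toy stand-in for df; the values are invented for illustration
toy <- data.frame(
  age    = c(2, 8, 15),
  gender = c("Female", "Male", "Female"),
  slum   = c("Non slum", "Slum", "Mixed")
)

# Keep every column whose name is NOT "slum"
toy_no_slum <- toy[, !(colnames(toy) %in% "slum")]
colnames(toy_no_slum)

# Pulling a single column with [ , ] returns a vector by default;
# drop = FALSE keeps the result as a one-column data frame
age_vec <- toy[, "age"]
age_df  <- toy[, "age", drop = FALSE]
```

Selecting by name this way keeps working if the column order changes, which a hard-coded position such as `-5` does not.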
 
-Or we can subset by rows and pull the 100th observation/row.
+
+##  Using indexing to subset by rows
+
+We can use indexing to also subset by rows. For example, here we pull the 100th observation/row.
 ```{r echo=T}
 df[100,] 
 ```
- or maybe the age of the 100th observation/row.
+And, here we pull the `age` of the 100th observation/row.
 ```{r echo=T}
 df[100,"age"] 
 ```
@@ -206,10 +220,12 @@ operator | operator option |description
 
 ## Logical operators examples
 
+Let's practice.  First, here is a reminder of what the number.object contains.
 ```{r echo=TRUE}
 number.object
 ```
 
+Now, we will use logical operators to evaluate the object.
 ```{r echo=TRUE}
 number.object<4
 number.object>=3
@@ -220,23 +236,33 @@ number.object %in% c(6,7,2)
 
 ## Using indexing and logical operators to rename columns
 
-We can assign the column names, change the ones we want, and then re-assign the column names:
+We can assign the column names from data frame `df` to an object `cn`, modify `cn` directly using indexing and logical operators, and finally reassign the column names, `cn`, back to the data frame `df`:
 
-```{r}
+```{r echo=TRUE}
 cn <- colnames(df)
-cn[cn=="IgG_concentration"] <-"IgG_concentration_mIU" #rename cn to "IgG_concentration" when cn is "IgG_concentration_mIU"
+cn
+cn[cn=="IgG_concentration"] <-"IgG_concentration_mIU" #rename cn to "IgG_concentration_mIU" when cn is "IgG_concentration"
 colnames(df) <- cn
+```
+
+Note, I am resetting the column name back to the original name for the sake of the rest of the module.
+```{r echo=TRUE}
 colnames(df)[colnames(df)=="IgG_concentration_mIU"] <- "IgG_concentration" #reset
 ```
 
+
 ##  Using indexing and logical operators to subset data
 
-Subset by rows and pull only observations with an age of  less than or equal to 10.
+
+In this example, we subset by rows, pull only observations with an age of less than or equal to 10, and save the subset data to `df_lte10`. Note that the logical expression `df$age<=10` is before the comma because I want to subset by rows (the first dimension).
 ```{r echo=T}
-df_lte10 <- df[df$age<=10,]
-df_lte5_gt10 <- df[df$age<=5 | df$age>10,]
+df_lte10 <- df[df$age<=10, ]
+```
+In this example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.
+```{r echo=TRUE}
+df_lte5_gt10 <- df[df$age<=5 | df$age>10, ]
 ```
-Note that the logical operators `df$age<=10` and `df$age<=5 | df$age>10` are before the comma because I want to subset by rows. I saved the subset data to `df_lt10` and `df_lte5_gt10`. Lets check that my subsets worked using the `summary()` function. 
+Let's check that my subsets worked using the `summary()` function. 
 ```{r echo=T}
 summary(df_lte10$age)
 summary(df_lte5_gt10$age)
@@ -245,6 +271,8 @@ summary(df_lte5_gt10$age)
 
 ## Missing values 
 
+Missing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data and how you can identify them is the first step in data cleaning.
+
 Types of "missing" values:
 
 -   `NA` - general missing data
@@ -254,7 +282,7 @@ Types of "missing" values:
     number (or negative number) by 0.
 -   blank space - sometimes when data is read in, there is a blank space left
 
-## More Logical Operators
+## Logical operators to help identify missing data
 
 operator | operator option |description
 -----|-----|-----:
@@ -264,10 +292,11 @@ operator | operator option |description
 `!is.nan`||is not NaN
 `is.infinite`||is infinite
 `any`||are any TRUE
+`which`||which are TRUE
 
 ## More logical operators examples
 
-```{r}
+```{r echo=TRUE}
 test <- c(0,NA, -1)/0
 test
 is.na(test)
@@ -280,16 +309,24 @@ is.infinite(test)
 `any(is.na(x))` means do we have any `NA`'s in the object `x`?
 
 ```{r  echo=TRUE}
-A <- c(1, 2, 4, NA)
-B <- c(1, 2, 3, 4)
-any(is.na(A)) # are there any NAs - YES/TRUE
-any(is.na(B)) # are there any NAs- NO/FALSE
+any(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE
+any(is.na(df$slum)) # are there any NAs- NO/FALSE
 ```
 
+`which(is.na(x))` means which of the elements in object `x` are `NA`'s?
+
+```{r  echo=TRUE}
+which(is.na(df$IgG_concentration)) 
+which(is.na(df$slum)) 
+```
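Missing values also matter when a logical expression is used to subset rows: any `NA` in the logical index comes back as a row filled with `NA`. Wrapping the condition in `which()` keeps only the positions that are `TRUE`. A minimal sketch on a made-up data frame (`toy` is hypothetical and not the module's `df`):

```r
# Toy data frame with one missing age (values invented for illustration)
toy <- data.frame(age = c(4, NA, 12, 9), id = 1:4)

# The comparison gives TRUE, NA, FALSE, TRUE;
# the NA position is returned as an all-NA row
with_na_row <- toy[toy$age <= 10, ]
nrow(with_na_row)

# which() returns only the TRUE positions, so the NA is dropped
without_na_row <- toy[which(toy$age <= 10), ]
nrow(without_na_row)
```

Which behaviour is wanted depends on the analysis, but it is worth checking `nrow()` after a subset when the column is known to contain missing values.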
 
 ## `subset()` function
 
-The Base R `subset()` function is a slighly easier way to select variables and observations.
+The Base R `subset()` function is a slightly easier way to select variables and observations.
+
+```{r echo=TRUE, eval=FALSE}
+?subset
+```
 
 ```{r, echo = FALSE, results = "asis"}
 library(printr)
@@ -325,13 +362,19 @@ nrow(df_lte10_v2)
 
 ## Summary
 
--   
+- The `colnames()`, `str()`, and `summary()` functions from Base R are great for assessing data types and some summary statistics
+- There are three basic indexing operators: `[ ]`, `[[ ]]`, and `$`
+- Indexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)
+- Logical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing
+- There are 5 “types” of missing values, the most common being “NA”
+- Logical operators meant to determine missing values are very helpful for data cleaning
+- The Base R `subset()` function is a slightly easier way to select variables and observations.
 
 ## Acknowledgements
 
 These are the materials we looked through, modified, or extracted to complete this module's lecture.
 
 -   ["Introduction to R for Public Health Researchers" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)
--   [CRAN Project](https://cran.r-project.org/doc/manuals/R-lang.html#Indexing)
--   [CRAN Project](https://cran.r-project.org/web/packages/extraoperators/vignettes/logicals-vignette.html)
+-   ["Indexing" CRAN Project](https://cran.r-project.org/doc/manuals/R-lang.html#Indexing)
+-   ["Logical operators" CRAN Project](https://cran.r-project.org/web/packages/extraoperators/vignettes/logicals-vignette.html)
 
diff --git a/modules/Module07-VarCreationClassesSummaries.qmd b/modules/Module07-VarCreationClassesSummaries.qmd
@@ -31,12 +31,14 @@ library(printr)
 
 ## Adding new columns
 
-You can add a new column, called `newcol` to `df`, using the `$` operator:
+You can add a new column, called `log_IgG` to `df`, using the `$` operator:
 ```{r echo=TRUE}
 df$log_IgG <- log(df$IgG_concentration)
 head(df,3)
 ```
 
+Note my use of the underscore in the variable name rather than a space.  This is good coding practice and makes calling variables much less prone to error.
+
 ## Creating conditional variables
 
 One frequently-used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which "returns a value depending on whether the element of test is `TRUE` or `FALSE`."
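As a quick illustration of how `ifelse(test, yes, no)` is evaluated element by element, here is a minimal sketch on a made-up `age` vector (the values and the labels are assumptions, not taken from the dataset):

```r
# Made-up ages, including one missing value
age <- c(3, 7, NA, 12)

# Each element of the test picks "young" (TRUE) or "not young" (FALSE);
# where the test is NA, the result is NA
age_flag <- ifelse(age <= 5, "young", "not young")
age_flag
```

Because a missing test value stays missing in the result, `ifelse()` does not silently assign missing observations to either category.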

diff --git a/modules/Module09-DataAnalysis.qmd b/modules/Module09-DataAnalysis.qmd
@@ -33,7 +33,7 @@ df$age_group <- ifelse(df$age <= 5, "young",
 df$age_group <- factor(df$age_group, levels=c("young", "middle", "old"))
 ```
 
-Create `seropos` binary variable representing seropositivity if antibody concentrations are >10 mIUmL.
+Create a `seropos` binary variable representing seropositivity if antibody concentrations are >=10 IU/mL.
 ```{r echo=TRUE}
 df$seropos <- ifelse(df$IgG_concentration<10, 0, 
 										ifelse(df$IgG_concentration>=10, 1, NA))
@@ -118,7 +118,7 @@ IgG_old <- df$IgG_concentration[df$age_group=="old"]
 t.test(IgG_young, IgG_old)
 ```
 
-The mean IgG concenration of young and old is 45.05 and 129.35 mIU/mL, respectively. We reject null hypothesis that the difference in the mean IgG concentration of young and old is 0 mIU/mL.
+The mean IgG concentration of young and old is 45.05 and 129.35 IU/mL, respectively. We reject the null hypothesis that the difference in the mean IgG concentration of young and old is 0 IU/mL.
 
 ## Linear regression fit in R
 

diff --git a/modules/data/serodata.xlsx b/modules/data/serodata.xlsx