change levels to unique in vector attributes lesson #342

Open
lisamr opened this issue Feb 25, 2021 · 9 comments

Comments

@lisamr

lisamr commented Feb 25, 2021

In Explore and Plot by Vector Layer Attributes, the lesson is about seeing unique values and uses levels(lines_HARV$TYPE), which produces NULL because the column is not defined as a factor. I would suggest unique(lines_HARV$TYPE) instead.
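
A minimal base-R sketch of the behavior described above. The vector `type_chr` is a hypothetical stand-in for lines_HARV$TYPE, filled with the four values quoted later in this thread; it is not read from the lesson data.

```r
# Hypothetical stand-in for lines_HARV$TYPE as a plain character vector
type_chr <- c("woods road", "footpath", "footpath", "stone wall", "boardwalk")

# levels() only has something to report for factors,
# so a character vector yields NULL
levels(type_chr)

# unique() works on character vectors and keeps the order of first appearance
unique(type_chr)
```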

@jsta
Member

jsta commented Feb 25, 2021

I wonder if this is due to stringsAsFactors being FALSE by default in R >= 4.0? #328
I think more than that one command would need to be changed because the surrounding text is all about factors and now lines_HARV$TYPE is no longer a factor :(

@djhunter

Yes, I believe that @jsta is correct that this behavior is due to the change in the default value of stringsAsFactors in R version 4.0. I was going to submit a quick pull request, but then I realized that there are some pedagogical choices that need to be made.

Just changing levels() to unique() will fix the NULL output issue, but the larger problem is that there are several places in Episode 7 where lines_HARV$TYPE is referred to as a factor, which leads to a brief discussion of factors. This problem also comes up in Episodes 8 and 10. It seems to me that there are at least two ways to fix this:

  1. Change levels() to unique() in Episodes 7, 8, and 10, and update the exposition in Episodes 7 and 10 to remove any discussion of factors.
  2. Convert the strings to factors, and leave the exposition (mostly) the same.

I'd be happy to take care of this, but I need some advice about which of these options to choose. My inclination would be to go with Option (1), as it will simplify the lesson a little, and there doesn't seem to be any reason to convert the strings to factors for the purposes of visualizing the data. However, if there was a specific pedagogical reason to include a review of factors in this lesson, then Option (2) would be preferable.
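
For comparison, the two options might look roughly like this on a toy data frame. The object `roads` and its contents are made up for illustration and only mimic the TYPE column; this is a sketch, not the lesson's code.

```r
# Toy data frame standing in for the attribute table (hypothetical values)
roads <- data.frame(
  id = 1:4,
  TYPE = c("woods road", "footpath", "stone wall", "boardwalk"),
  stringsAsFactors = FALSE
)

# Option 1: leave TYPE as character and list the values with unique()
option1_values <- unique(roads$TYPE)

# Option 2: convert the strings to a factor, so levels() works as the
# existing exposition expects
roads$TYPE <- as.factor(roads$TYPE)
option2_values <- levels(roads$TYPE)

option1_values
option2_values
```

Option 1 reports the values in order of appearance, while Option 2 reports them in sorted order.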

@jsta
Member

jsta commented Jun 13, 2021

I like option 1 as well. I don't think we have any ggplot code that relies on factors; that would be my only hesitation.

@djhunter

There is code that relies on the ordering of the factors. It still works if lines_HARV$TYPE is a character variable, because (I believe that) ggplot converts character variables to factors when they are used in aes(). So changing levels() to unique() might be slightly confusing in places like the following:

First we will check how many unique values the TYPE field has:

unique(lines_HARV$TYPE)

[1] "woods road" "footpath"   "stone wall" "boardwalk" 

Then we can create a palette of four colors, one for each feature in our vector object.

road_colors <- c("blue", "green", "navy", "purple")

We can tell ggplot to use these colors when we plot the data.

ggplot() +
  geom_sf(data = lines_HARV, aes(color = TYPE)) + 
  scale_color_manual(values = road_colors) +
  labs(color = 'Road Type') +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads & Trails") + 
  coord_sf()

The alert reader will notice that woods road is not colored blue, as might be expected, because the road_colors get assigned to the path types in factor (i.e., alphabetical) order, not in the order given by unique(). The same problem happens later when customizing line widths.
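
To make the mismatch concrete, here is a small base-R sketch that pairs the colors with the path types both ways. It does not call ggplot; the names `expected` and `actual` are hypothetical, and the sorted pairing is assumed to mirror the alphabetical assignment described above.

```r
# Values as returned by unique(lines_HARV$TYPE) in the output above
type_values <- c("woods road", "footpath", "stone wall", "boardwalk")
road_colors <- c("blue", "green", "navy", "purple")

# Pairing a reader might expect: colors follow the order unique() printed
expected <- setNames(road_colors, type_values)

# Pairing by alphabetical order of the types
actual <- setNames(road_colors, sort(type_values))

expected[["woods road"]]
actual[["woods road"]]
```

One way to make the pairing explicit is to pass a named vector such as `expected` to scale_color_manual(values = ...), since named values are matched by name rather than by position.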

So now I'm starting to lean toward Option 2. It is natural to want to customize the order of things in plots, and you can't do that without grappling with factors.

We can recover the pre-version 4.0 behavior by adding stringsAsFactors = TRUE to all of the st_read commands. This is probably the simplest fix, as it doesn't involve changing as much of the exposition, and it will eliminate the confusion of some learners using pre-4.0 versions.
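
A rough illustration of what the argument changes, using read.csv on inline text as a stand-in for the st_read calls (st_read needs the sf package and the shapefile, so it is not used here). The CSV content and the object names are made up.

```r
# Inline CSV text standing in for a data file (hypothetical content)
csv_text <- "id,TYPE\n1,woods road\n2,footpath\n3,stone wall\n4,boardwalk"

# Default behavior: in R >= 4.0 the TYPE column is read as character
roads_default <- read.csv(text = csv_text)

# Explicitly requesting factors recovers the pre-4.0 behavior
roads_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(roads_default$TYPE)
class(roads_factor$TYPE)
levels(roads_factor$TYPE)
```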

@lisamr
Author

lisamr commented Jun 15, 2021

Thanks all for picking up this issue. It seems like unique() would be a quick and dirty fix, but would lead to issues later on. It would also be a good thing for learners to know about using stringsAsFactors = TRUE, since factors accidentally being treated as characters comes up in my own personal code all the time. I like @djhunter's explanation and solution.

@jsta
Member

jsta commented Jun 22, 2021

After consideration, PR #353 seems like the "nuclear option" to me. It requires so much more typing on the learners' part. What about using unique to list line types and aes(color = factor(column_name, levels = road_colors)) in the plotting commands?

Then still discuss factors but move it to a better spot somewhere just before factor-plotting.
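
A sketch of how that idea could work in base R, assuming the goal is for the colors to follow the order unique() reports. Note that the levels argument of factor() takes the category values (the road types), not the colors, so this sketch passes unique(type_chr) there. All object names are hypothetical.

```r
# Hypothetical stand-in for lines_HARV$TYPE as a character vector
type_chr <- c("woods road", "footpath", "footpath", "stone wall", "boardwalk")
road_colors <- c("blue", "green", "navy", "purple")

# Build the factor with levels in order of first appearance
# instead of the default alphabetical order
type_fct <- factor(type_chr, levels = unique(type_chr))

# The i-th color now lines up with the i-th value printed by unique()
color_by_type <- setNames(road_colors, levels(type_fct))

levels(type_fct)
color_by_type
```

In a plotting call the same construction could be written inline, for example aes(color = factor(TYPE, levels = unique(TYPE))), at the cost of a longer command for learners to type.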

@drakeasberry
Contributor

What if we use options(stringsAsFactors = TRUE) to replicate the pre-4.0 default? This would allow users running R 4.0 to experience the lesson the same way as users running pre-4.0 versions. Then we would not need to add stringsAsFactors = TRUE to each individual read command, which would reduce the amount of typing for the learner.

@djhunter

djhunter commented Jun 25, 2021

According to this post, the stringsAsFactors global option will eventually be phased out, so setting it via the options command could lead to errors later when the phaseout happens.

There are only three read commands in which learners would have to type stringsAsFactors = TRUE: when reading HARV_roads.shp, HARV_PlotLocations.csv and hf001-06-daily-m.csv. All of the other changes in pull request #353 are just repetitions of these, which presumably learners won't have to repeat if they maintain their environments between episodes.

@jonjab
Contributor

jonjab commented Aug 23, 2022

We taught this lesson last month. stringsAsFactors = TRUE was not a big deal.

What was a bigger deal was running out of memory in our RStudio hub environment.
