# Release Memory {#releasememory}
Sometimes, even after applying all the best techniques to save memory, you still need more space. For those cases R gives you two tools, `rm()` and `gc()`, and they are mostly useful in different contexts. I mostly recommend using `gc()` inside a loop and `rm()` inside an ETL script or a big function.
## Use rm()
The `rm()` function removes objects from the environment, which lets R reclaim their memory for further operations. It's excellent for cases where you have a lot of intermediate results saved inside an R script that is supposed to run as an ETL job, like this:
```{r}
value <- length(mtcars$mpg)
sorted_mpg <- sort(mtcars$mpg)
median_values <- sorted_mpg[value/2] + sorted_mpg[(value/2) + 1]
median_mpg <- median_values / 2
rm(list = c("value", "sorted_mpg", "median_values"))
```
In scenarios like this, where a script stores intermediate results for further processing, it's okay to use `rm()`. However, when you wrap the code inside a function, only the final result survives and everything else is discarded when the function returns, so you don't need `rm()` at all. The exception is when your function is huge. I once wrote an ETL script that composed multiple functions into a bigger one, which was supposed to run as an ETL job. The data created there was huge and it exhausted my RAM every time. So I had to modify it to delete each object as soon as I was done with it.
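As a small illustration of the first point, the sketch below wraps the earlier median calculation in a function. The name `median_of_even` is hypothetical, and the function assumes an even number of values, which is the case for `mtcars$mpg`.

```{r}
# Hypothetical helper: the same median calculation, wrapped in a function
median_of_even <- function(x) {
  n <- length(x)        # local variable, discarded when the function returns
  sorted_x <- sort(x)   # local variable, discarded when the function returns
  (sorted_x[n / 2] + sorted_x[n / 2 + 1]) / 2
}

median_mpg <- median_of_even(mtcars$mpg)

# The intermediate objects never reached the calling environment,
# so there is nothing to clean up with rm()
exists("sorted_x", inherits = FALSE)
```

Only `median_mpg` is left behind, so no `rm()` call is needed here.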
```{r}
# ## This is bad style: it keeps all three datasets in memory at once, so it
# ## exhausts your RAM quickly and always needs more RAM than necessary.
#
# bad_func <- function(){
#
#   a <- load_huge_datasets("some file")
#   b <- load_huge_datasets("some other file or db")
#   c <- load_huge_datasets("some other file or db")
#
#   results_a <- long_computations(a)
#   results_b <- other_long_computations(b)
#   results_c <- other_long_computations(c)
#
#   final_results <- results_a + results_b + results_c
#
#   return(
#     final_results
#   )
# }
#
#
# ## This style only uses the RAM that's necessary at a time and then deletes
# ## each dataset to make room for the next one.
#
# good_func <- function(){
#
#   a <- load_huge_datasets("some file")
#   results_a <- long_computations(a)
#   rm(a)
#
#   b <- load_huge_datasets("some other file or db")
#   results_b <- other_long_computations(b)
#   rm(b)
#
#   c <- load_huge_datasets("some other file or db")
#   results_c <- other_long_computations(c)
#   rm(c)
#
#   final_results <- results_a + results_b + results_c
#
#   return(
#     final_results
#   )
#
# }
```
Read both functions above. If you are ever faced with a situation where RAM is crucial, make a habit of writing code like `good_func`, using `rm()` inside the function to make room for more objects.
I would not advise anyone to use `rm()` unless they are writing an ETL script on data large enough that they need to free space after every step. Even in those scenarios you can often get away with not storing intermediate results in variables at all. Use `rm()` only in those rare cases where you must. In most other cases R's garbage collection will have you covered.
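The `good_func` pattern above can be tried out with a runnable sketch. The helpers `load_dataset` and `summarise_dataset` are hypothetical stand-ins for the real loaders and computations, and the data here is tiny so the chunk runs quickly.

```{r}
# Hypothetical stand-ins for the loaders and long computations above
load_dataset <- function(n) data.frame(value = seq_len(n))
summarise_dataset <- function(df) sum(df$value)

process_in_stages <- function(sizes) {
  total <- 0
  for (n in sizes) {
    dataset <- load_dataset(n)
    total <- total + summarise_dataset(dataset)
    rm(dataset)  # drop the input before the next one is loaded
  }
  total
}

process_in_stages(c(10, 100, 1000))
```

Without the `rm()` call the previous dataset would still be alive while the next one is being loaded, so the peak memory use would be roughly two datasets instead of one.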
## Use gc()
Garbage collection is the process by which a program frees memory it no longer needs, either returning it to the OS or keeping it for its own reuse. R has significantly improved its memory management in recent years. Most of the blog posts you read about R being slow and memory hungry were written around 2014-2017, and much of what they say is no longer true.
### R version 3.5
In R version 3.5 `ALTREP` (alternative representations of R objects) was introduced, which saves a lot of memory in special cases such as the integer sequences used in loops. Packages like `vroom` have achieved significant speed-ups because they no longer have to load all the data into memory and can load only as much as they need. If you want to read more about it, visit this page: <https://svn.r-project.org/R/branches/ALTREP/ALTREP.html>.
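A minimal sketch of the effect, assuming R 3.5 or later. The `lobstr` package is an assumption here (it is not used elsewhere in this chapter), so the size check only runs if it is installed.

```{r}
# Compact sequence: with ALTREP, R does not need to allocate every element
big_seq <- 1:1e7

# Arithmetic on a fresh sequence returns an ordinary, fully allocated vector
materialised <- seq_len(1e7) + 0L

# Optional size comparison, only if the lobstr package is available
if (requireNamespace("lobstr", quietly = TRUE)) {
  print(lobstr::obj_size(big_seq))
  print(lobstr::obj_size(materialised))
}
```

Both objects hold the same values, but the compact sequence should report a far smaller size than the ordinary vector.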
### R version 4.0
In R version 4.0 reference counting was introduced. Previously R tracked how many names pointed to an object with the `NAMED` mechanism, which in practice had only 3 values, 0, 1 and 2, where 2 meant many. With reference counting R can both increase and decrease these counts, so it keeps better track of how many names point to the same object in memory. You can read more about it in this blog post: <https://msmith.de/2020/05/14/exploring-r-reference-counting.html#:~:text=The%20CRAN%20NEWS%20accompanying%20R,further%20optimizations%20in%20the%20future.>.
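The practical consequence of this bookkeeping is copy-on-modify: R only copies an object when it is modified while more than one name refers to it. The sketch below shows that behaviour. `tracemem()` is only available when R was built with memory profiling, so that part is guarded and should be treated as optional.

```{r}
x <- c(1, 2, 3)

# tracemem() reports copies, but needs memory profiling support
tracing <- capabilities("profmem")
if (tracing) invisible(tracemem(x))

y <- x        # no copy yet: x and y refer to the same vector
y[1] <- 10    # two names share the vector, so R copies it before modifying y

if (tracing) untracemem(x)

x   # unchanged
y   # modified copy
```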
So the gist of the matter is that R has been improving performance and memory management for a very long time.
Now, coming back to the main point. R has a function named `gc()` that runs the garbage collector and releases memory that is no longer needed. R also runs it automatically, but calling it yourself is useful in mostly 2 situations.
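Before looking at those situations, here is a minimal sketch of how `rm()` and `gc()` work together. `gc()` returns a matrix whose second column is the memory currently in use, in Mb. The helper `used_mb` is a hypothetical name.

```{r}
# Hypothetical helper: run the garbage collector and return memory in use (Mb)
used_mb <- function() sum(gc()[, 2])

before <- used_mb()
big_vector <- numeric(5e6)   # about 40 Mb of doubles
during <- used_mb()
rm(big_vector)               # remove the name ...
after <- used_mb()           # ... and let gc() reclaim the memory

c(before = before, during = during, after = after)
```

The exact numbers depend on your session, but `during` should be noticeably higher than both `before` and `after`.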
### Inside a heavy loop
Loops are slow in R, and when you are doing heavy manipulation inside a loop R might hold on to more memory than it needs, which slows down further processing. This is where I use `gc()` inside a loop.
```{r}
gc_after_a_number <- 1e2
for(i in 1:1e4){
  ### heavy calculations
  if( i %% gc_after_a_number == 0){
    gc()
  }
}
```
In the code above I am running a long loop, but after every 100 iterations I free up memory by calling `gc()`. How many iterations to wait between `gc()` calls depends entirely on what you are doing inside the loop. It could be anywhere between 10 and 10 thousand depending on the situation. It surely helps and I would recommend it to everyone, but only for very heavy calculations.
### Anything that takes more than 30 seconds
If you know a function takes more than 30 seconds to execute, I would suggest running `gc()` after it. It helps keep R's memory in check. It's useful even if you are running a shiny app or an API. If anything takes more than 30 seconds it's worth running `gc()`, because then adding a few milliseconds will surely not be your main problem.
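One way to apply this rule without scattering `gc()` calls everywhere is a small wrapper. This is only a sketch: `run_then_gc` and its `threshold_secs` argument are hypothetical names, not part of any package.

```{r}
# Hypothetical helper: evaluate an expression and run gc() only if it was slow
run_then_gc <- function(expr, threshold_secs = 30) {
  started <- Sys.time()
  result <- expr   # forces the lazily evaluated argument
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  if (elapsed > threshold_secs) {
    gc()
  }
  result
}

# Fast call: returns the value and skips gc()
run_then_gc(sum(sqrt(seq_len(1e6))))
```

Because R evaluates arguments lazily, the expression is only computed inside the wrapper, so the timing covers the actual work.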
## Cache / Store calculations
It's good practice in general not to repeat the same calculations over and over again. Let's check that with a simple function.
It's a very famous problem: produce `foo` when a number is divisible by 3, `bar` when it's divisible by 5, and `foobar` when it's divisible by both 3 and 5, that is, by 15.
```{r}
foo <- function(
  num
){
  x <- data.frame(y = 1:num)
  # character vector, because the results mix labels and numbers
  z <- character(num)
  for(i in 1:num){
    if(
      x$y[[i]] %% 3 == 0 &&
      x$y[[i]] %% 5 == 0
    ){
      z[i] <- "foobar"  # store the result for this iteration only
    } else if( x$y[[i]] %% 3 == 0){
      z[i] <- "foo"
    } else if ( x$y[[i]] %% 5 == 0){
      z[i] <- "bar"
    } else {
      z[i] <- i
    }
  }
  return(z)
}
```
Now look at the code above. Is it readable? What if I want to change the number of iterations? It is used in more than one place, but that's okay because I have wrapped it in a function argument. The key point is that I am also calculating `x$y[[i]]` multiple times, and if I have to make changes to it I will have to change it in `4` places. Let's store that calculation once, which makes the code more readable and makes future changes easier.
```{r}
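bar <- function(
  num
){
  x <- data.frame(y = 1:num)
  # character vector, because the results mix labels and numbers
  z <- character(num)
  for(i in 1:num){
    # look the value up once and reuse it in every condition
    curr_value <- x$y[[i]]
    if(
      curr_value %% 3 == 0 &&
      curr_value %% 5 == 0
    ){
      z[i] <- "foobar"  # store the result for this iteration only
    } else if( curr_value %% 3 == 0){
      z[i] <- "foo"
    } else if ( curr_value %% 5 == 0){
      z[i] <- "bar"
    } else {
      z[i] <- i
    }
  }
  return(z)
}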
```
Let's check which of the two implementations is faster.
```{r}
microbenchmark::microbenchmark(
foo(1e4),
bar(1e4)
)
```
We save almost half the time just by storing an intermediate calculation. You will be amazed at how often you can do this. And it makes your code:
- more performant
- more readable
- more adaptable
- more memory efficient
Don't force the computer to calculate something it has already calculated. This simple idea has given rise to big concepts like Memcached and Redis. R too has the ability to cache its results. There is a package named `memoise` that stores the results of previous calculations in RAM. It's a classic case of the RAM vs CPU trade-off: you use some RAM so that your CPU doesn't have to do the same work over and over again.
Let's try to calculate factorials to demonstrate how we can use it.
```{r}
fact <- function(num){
  if( num == 0){
    return(1)
  } else {
    return(
      num * fact(num - 1)
    )
  }
}
```
Because this is a recursive function, which you should avoid writing at all costs, every call works its way down to `fact(0)`. Calling `fact(6)` goes something like this:

1. fact(6) = 6 \* fact(5)
1. fact(5) = 5 \* fact(4)
1. fact(4) = 4 \* fact(3)
1. fact(3) = 3 \* fact(2)
1. fact(2) = 2 \* fact(1)
1. fact(1) = 1 \* fact(0)

If you later call `fact(7)`, it calculates all of this again because nothing was cached. That makes it a valid candidate for memoise.
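To make the repeated work visible, the sketch below counts how often the function body runs. `counted_fact` and `call_count` are hypothetical names used only for this illustration.

```{r}
call_count <- 0

# Same recursion as fact(), but it counts every invocation
counted_fact <- function(num) {
  call_count <<- call_count + 1
  if (num == 0) 1 else num * counted_fact(num - 1)
}

counted_fact(5)
counted_fact(6)

# 13 calls in total: 6 for counted_fact(5) plus 7 for counted_fact(6),
# even though the second call only needed one new multiplication
call_count
```

The memoised version below avoids this repeated work.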
```{r}
mem_fact <- function(num){
  if( num == 0){
    return(1)
  } else {
    return(
      num * mem_fact(num - 1)
    )
  }
}
mem_fact <- memoise::memoise(mem_fact)
```
This is how you write a simple memoised function. Let's compare the results.
```{r}
microbenchmark::microbenchmark(
fact(100),
mem_fact(100)
)
```
In this case we haven't gained much performance over the base code, but if we call this function many times it will just fetch the results from RAM and skip the computation, in which case it will be worth the effort. We will go through examples in the next chapters where memoise is faster too.
`memoise` is not the only package that does this sort of caching. There is a package named `cachem`, which is used in shiny, and there are other products like Redis that do the same thing at scale. It's an important technique to know for saving time at the cost of a little RAM.
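To make the idea behind these caching tools concrete, here is a minimal base R sketch of a cache. It is not how `memoise` or `cachem` are implemented. The names `with_cache`, `slow_square` and `cached_square` are hypothetical, and `Sys.sleep()` stands in for an expensive computation.

```{r}
# Hypothetical helper: wrap a one-argument function with a simple in-RAM cache
with_cache <- function(f) {
  cache <- new.env()
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache)             # later calls: read the stored result
  }
}

slow_square <- function(x) {
  Sys.sleep(0.2)  # stands in for an expensive computation
  x^2
}
cached_square <- with_cache(slow_square)

first_call  <- system.time(cached_square(4))[["elapsed"]]
second_call <- system.time(cached_square(4))[["elapsed"]]
c(first = first_call, second = second_call)
```

The second call skips the computation entirely, which is the RAM vs CPU trade-off in its simplest form.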
## Conclusion
In this chapter we learned how to free up memory in R. This should help you manage your R environment better. Let's recap what we have covered.
1. Don't use `rm()` unless necessary.
2. `rm()` should only be used in an ETL script or a big function.
3. Remove objects as soon as you are done with them.
4. Most of the time you can avoid `rm()`.
5. Use `gc()` inside heavy loops after every few iterations.
6. Use `gc()` after every heavy operation that takes more than about 30 seconds.
7. Don't repeat calculations.