Extract R code from R Markdown HTML file #1811

stevecondylios · 2020-02-11T07:34:25Z

There appears to be no fast and easy way to extract the R code from HTML files generated via R Markdown.

Example

Max and Davis's applied-ml workshop is a good example.

We can easily get the R code for 'Part_1.html', since we have access to the original .Rmd file, and can hence call

knitr::purl("Part_1.Rmd")
readLines("Part_1.R") %>% paste0(collapse="\n\n") %>% cat
# Displays R code...

But we cannot so easily get the R code for parts 2 through 5, as the originating .Rmd is not available.

Possible solution

html_to_r() extracts the R code from R Markdown generated HTML files.

I provide an implementation in a PR.
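
For readers who want a feel for the idea without opening the PR, here is a minimal sketch using xml2. It is not the implementation in the PR. The helper name `extract_r_chunks()` is hypothetical, and the sketch assumes that the knitted HTML marks R source chunks as `<pre>` elements whose class list contains the token `r`, and leaves output blocks without that class.

```r
library(xml2)

# Hypothetical helper (not the PR's html_to_r()): return the text of every
# <pre> block whose class attribute contains the token "r".
extract_r_chunks <- function(html) {
  doc <- read_html(html)
  xpath <- "//pre[contains(concat(' ', normalize-space(@class), ' '), ' r ')]"
  xml_text(xml_find_all(doc, xpath))
}

# Small inline document standing in for a knitted HTML file (assumed markup)
demo <- '<html><body>
<pre class="r"><code>x &lt;- 1:3
x %&gt;% sum()</code></pre>
<pre><code>## [1] 6</code></pre>
</body></html>'

chunks <- extract_r_chunks(demo)
cat(chunks, sep = "\n\n")
```

Because the text is read through the HTML parser, character entities such as `&lt;` and `&gt;` come back as `<` and `>` without any extra step.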

Using in the applied-ml example

We can now easily retrieve the R code from the .html files, like so

# from inside applied-ml
dir() %>% grep("Part_{1}.*html", ., value = T) %>% sapply(., html_to_r) -> a
dir() %>% grep("Part_{1}.*html", ., value = T) %>% mapply(html_to_r, inc_out=F, .) -> b

# Randomly inspect the second file with / without output to ensure it worked as expected
a[[2]] %>% cat # with output
b[[2]] %>% cat # without output

This can be merged if relevant or disregarded if not relevant.


atusy · 2020-02-11T08:28:02Z

IMO, using pandoc makes the code simple and applicable to more formats (e.g., gfm).
What do you think?

# purloc = purl + pandoc
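purloc = function(x, output = file.path(".", xfun::with_ext(x, "R")), ...) {
  # tempfile() expects the leading dot; xfun::file_ext() returns e.g. "html"
  input = tempfile(fileext = paste0(".", xfun::file_ext(x)))
  file.copy(x, input)

  # convert the copy to CommonMark; the .md file is written next to it
  knitr::pandoc(input, 'commonmark', ext = 'md')

  intermediate_md = xfun::with_ext(input, 'md')
  # turn pandoc's "``` r" fences back into knitr chunk headers
  md = readr::read_lines(intermediate_md)
  md = stringr::str_replace_all(md, "^``` r", '```{r}')
  readr::write_lines(md, intermediate_md)
  knitr::purl(intermediate_md, output = output, ...)
}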
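
The fence-rewriting step is the part that decides which blocks knitr::purl() will keep, so it may help to look at it in isolation. The sketch below uses base R only. The sample lines are an assumption about what the CommonMark output looks like for one R chunk followed by its printed output, not the result of a real conversion.

```r
# Assumed sample of converted Markdown: an R chunk, then an untagged output block
md <- c("``` r", "x <- 1", "```", "", "```", "## [1] 1", "```")

# Same pattern as in purloc(): only opening fences tagged "r" are rewritten,
# so untagged blocks (such as printed output) stay plain and are skipped by purl
rewritten <- sub("^``` r", "```{r}", md)
print(rewritten)
```

Note that the pattern has no end anchor, so a fence tagged with any language starting with "r" (for example "ruby") would also be rewritten; anchoring it as "^``` r$" would be stricter.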

stevecondylios · 2020-02-13T10:10:16Z

@atusy that is a great simplification and improvement on the DIY solution in the original.

The purloc naming is also intuitive and makes sense.

Some questions

Do you think the inc_out option is useful? (A quick example of the difference is below.)

# from inside applied-ml root directory
dir() %>% grep("Part_{1}.*html", ., value = T) %>% sapply(., html_to_r) -> a
dir() %>% grep("Part_{1}.*html", ., value = T) %>% mapply(html_to_r, inc_out=F, .) -> b

a[[2]] %>% cat # with output
b[[2]] %>% cat # without output

For me, it's useful, but maybe not for everyone?

Also, do you agree that replacing character entities is useful? I think it is essential; otherwise pipes and some comparison operators display correctly in the HTML but come out as invalid R code.

  replace_character_entities <- function(char_entity){
    xml2::xml_text(xml2::read_html(paste0("<x>", char_entity, "</x>")))
  }

# E.g. 
replace_character_entities("&gt;")
# [1] ">"

Which makes a pipe appear as %>% rather than %&gt;%

I applied this conversion to some test examples, but I cannot be certain it will work under all circumstances (one exception that comes to mind is R code that contains a literal &gt;, perhaps in a comment). I think this is rare enough not to cause too much concern, though.
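
The edge case can be probed with a small example. The sketch assumes the HTML writer escaped `&` as `&amp;` when the page was generated (pandoc does this), so a literal entity in the R source arrives doubly escaped. The sample markup is made up for illustration.

```r
library(xml2)

# Assumed markup for the R source line:  x %>% f()  # a literal &gt;
# The pipe is stored as %&gt;% and the literal entity as &amp;gt;
html <- "<pre><code>x %&gt;% f()  # a literal &amp;gt;</code></pre>"

# A single decoding pass through the HTML parser
decoded <- xml_text(xml_find_first(read_html(html), "//code"))
print(decoded)
```

Under that assumption one decoding pass restores both the pipe and the literal entity. The problem would only appear if already-decoded text were decoded a second time.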

atusy · 2020-02-14T03:59:33Z

About inc_out, I think it is less important, because results are expected to be reproducible.
Also, a problem arises when the source Rmd contains plain code blocks.
They are not output, but html_to_r treats them as output.
If inc_out is really needed, I think the output should be commented out in the resulting R code.
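
A minimal sketch of that suggestion, in base R. The helper name `comment_output()` and the `#> ` prefix are assumptions chosen for illustration; neither comes from the PR.

```r
# Hypothetical helper: prefix every output line with "#> " so that the
# combined text is still valid R when sourced.
comment_output <- function(output, prefix = "#> ") {
  lines <- unlist(strsplit(output, "\n", fixed = TRUE))
  paste0(prefix, lines, collapse = "\n")
}

code   <- "sum(1:3)"
output <- "[1] 6\nsecond line"   # made-up output text without a comment prefix

# Code stays as is; output is appended as comments
script <- paste(code, comment_output(output), sep = "\n")
cat(script, "\n")
```

With output written as comments, the extracted file can be run directly whether or not output was included.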

About special characters, we do not need to handle them ourselves, because pandoc decodes them:

echo "<pre>%&gt;</pre>" | pandoc --from html --to gfm
# ```
# %>
# ```

stevecondylios linked a pull request Feb 11, 2020 that will close this issue

Add html_to_r() #1812

Open

cderv linked a pull request Jan 29, 2021 that will close this issue

Add html_to_r() #1812

Open

cderv added the feature Feature requests label Jan 29, 2021
