chuvanan/xml2df

How hard is it to work with medium-sized XML files in R?

Prerequisites

  • R 4.0.0 (required)

  • Command line setup:

git clone https://github.com/chuvanan/xml2df.git
cd xml2df
R
renv::restore()

Notes

  • Main packages for working with XML in R: xml2 and XML. xml2 provides R bindings to libxml2 (more precisely, to a subset of libxml2). XML was developed by the legendary Duncan Temple Lang (R Core Team) and is also based on libxml2. In general, xml2 is easier to use for simple tasks thanks to its user-friendly API (more on that later), while XML should be preferred for heavier and more complicated tasks.

  • How does an XML parser (for example, libxml2) work?

  • Rules for building an XPath query:

    • Place two forward slashes (//) in front of the first node to match it at any depth in the document
    • Separate subsequent nodes with a single forward slash (/)
    • Add a period before the query to make the search local to the current node (particularly useful when looping over nodesets)
    • When things go wrong, it's likely due to namespaces
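
The XPath rules above can be sketched with xml2 as follows. The catalog document, its element names, and the http://example.com/ns namespace are made up for illustration; only the xml2 functions (read_xml, xml_find_all, xml_find_first, xml_text, xml_ns) are real.

```r
library(xml2)

# Hypothetical document, invented for this sketch
doc <- read_xml('<catalog>
  <book id="b1"><title>First</title><author><name>Ann</name></author></book>
  <book id="b2"><title>Second</title><author><name>Binh</name></author></book>
</catalog>')

# Rule 1: "//" matches the node at any depth
titles <- xml_text(xml_find_all(doc, "//title"))

# Rule 2: "/" steps down one level at a time
author_names <- xml_text(xml_find_all(doc, "//book/author/name"))

# Rule 3: a leading "." keeps the search inside each node of a nodeset
books <- xml_find_all(doc, "//book")
local_titles <- xml_text(xml_find_first(books, ".//title"))

# Rule 4: a default namespace makes an unprefixed query match nothing
ns_doc <- read_xml('<root xmlns="http://example.com/ns"><item>1</item></root>')
n_without_ns <- length(xml_find_all(ns_doc, "//item"))

# xml_ns() exposes the default namespace under the prefix "d1"
ns_items <- xml_text(xml_find_all(ns_doc, "//d1:item", xml_ns(ns_doc)))
```

If a query unexpectedly returns an empty nodeset, printing xml_ns(doc) is a quick way to check whether a default namespace is involved.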

Strategies for extracting data from HTML/XML documents

  • Always look at the data. By "look at the data" I mean reading the XML file line by line to get a quick overview of its structure: nesting depth, elements, attributes, and namespaces (if any). Command-line tools such as head, tail, and less are very handy for this task.

    My favorite utility is less. I can view and navigate a big XML file very quickly with a few keystrokes (G: go to the end of the file, g: go to the start of the file, space: page down, b: page up).

    Don't try to open a big file in an editor or a browser.

  • Work with a small subset of the data first. This is good practice in general, and it applies especially well here. There are two main reasons:

    • To save time. Reading and processing large XML files can be time-consuming, especially when repeated several times before reaching the right solution (it happens more often than you think).
    • To facilitate solution design. Usually, an XML document repeats the same layout (branches, elements) across its nodesets. You don't need the whole document (hundreds of thousands of elements); a couple of elements from each nodeset are enough to develop the code.
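
A minimal sketch of both strategies in R, assuming a flat file of repeated record elements. The file contents, the element names (records, record, name, value), and the helper record_to_df are hypothetical and exist only to keep the example self-contained.

```r
library(xml2)

# Hypothetical "large" file, generated in a temp location for this sketch
path <- tempfile(fileext = ".xml")
records <- sprintf(
  '  <record id="%d"><name>item%d</name><value>%d</value></record>',
  1:1000, 1:1000, (1:1000) * 2L
)
writeLines(c("<records>", records, "</records>"), path)

# Look at the data: read only the first few lines (similar to `head`)
peek <- readLines(path, n = 5)

# Build a small excerpt with the same structure as the full file
sample_doc <- read_xml(paste(c("<records>", records[1:3], "</records>"),
                             collapse = "\n"))

# Hypothetical helper: turn a nodeset of <record> elements into a data frame
record_to_df <- function(nodes) {
  data.frame(
    id    = xml_attr(nodes, "id"),
    name  = xml_text(xml_find_first(nodes, "./name")),
    value = as.integer(xml_text(xml_find_first(nodes, "./value"))),
    stringsAsFactors = FALSE
  )
}

# Develop and check the extraction on the excerpt first ...
small <- record_to_df(xml_find_all(sample_doc, "//record"))

# ... then run the same function on the full document
full <- record_to_df(xml_find_all(read_xml(path), "//record"))
```

Because the helper only depends on the structure of a single record, code that works on three records should carry over to the full file unchanged, provided the layout really is uniform.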

Sample projects

  • dvhc-vietnam: Convert Vietnam's administrative divisions data from XML to CSV/JSON

  • sdmx: Vietnam's National Summary Data

  • vietnam-wdi: Parse the World Bank's World Development Indicators for Vietnam

  • weather-forecast: Parse weather forecast XML service

Resources

About

TIL: R's XML parser and toolkit
