-
R 4.0.0 (required)
-
Command line setup:
git clone https://github.com/chuvanan/xml2df.git
cd xml2df
R
renv::restore()
-
Main packages for working with XML in R: xml2 and XML. xml2 is R's binding to libxml2 (actually, to a subset of libxml2). XML has been developed by the legendary Duncan Temple Lang (R Core Team) and is also based on libxml2. In general, xml2 is easier to use for simple tasks thanks to its user-friendly API (more on that later), while XML should be preferred for heavier and more complicated tasks.
-
How does an XML parser (for example, libxml2) work?
-
Rules for building an XPath query:
- Place two forward slashes (//) in front of the top node to match it at any depth in the document
- Separate subsequent nodes with a single forward slash (/)
- Add a period before the query (./ or .//) to make the search local to the current node (particularly useful when looping over nodesets)
- When things go wrong, the cause is most likely a namespace
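The rules above can be sketched with xml2 as follows. The XML snippets, their element names, and the namespace URI are invented for this illustration, so treat it as a minimal sketch rather than a complete recipe.

```r
library(xml2)

# Hypothetical document, invented for this example
doc <- read_xml('
<catalog>
  <book id="b1"><title>First</title><author>An</author></book>
  <book id="b2"><title>Second</title><author>Binh</author></book>
</catalog>')

# "//" in front of the top node: match <book> at any depth
books <- xml_find_all(doc, "//book")

# "/" between subsequent nodes: step down one level at a time
titles <- xml_text(xml_find_all(doc, "//book/title"))

# A leading "." keeps the search local to the node being processed
second <- books[[2]]
local_title  <- xml_text(xml_find_first(second, "./title"))

# Without the period the query starts from the document root again,
# so it returns the first <title> of the whole document
global_title <- xml_text(xml_find_first(second, "//title"))

# Namespaces: a default namespace makes an unprefixed query match nothing
ns_doc <- read_xml(
  '<root xmlns="http://example.com/ns"><item>1</item></root>'
)
no_hits <- xml_find_all(ns_doc, "//item")

# xml_ns() assigns the prefix d1 to the first default namespace
ns   <- xml_ns(ns_doc)
hits <- xml_find_all(ns_doc, "//d1:item", ns)
```

Comparing local_title with global_title shows why the period matters inside a loop, and comparing no_hits with hits shows the typical namespace symptom: a query that looks correct but returns an empty nodeset.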
-
Always look at the data. By "look at the data" I mean reading the XML file line by line to get a quick overview of its structure, nesting depth, elements, attributes, and namespaces (if any). For this task, command line tools such as head, tail, and less come in very handy. My favorite utility is less. I can view and navigate a big XML file very quickly with a few keystrokes (G: go to the end of the file, g: go to the start of the file, space: page down, b: page up). Don't try to open a big file in an editor or browser.
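When a terminal is not at hand, a similar first look can be taken from within R. The sketch below assumes a file of your own; here a temporary file with made-up records is written first so the example is self-contained. readLines() with the n argument reads only the first lines instead of loading the whole file.

```r
# Hypothetical file, written to a temporary location for this example
path <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<records>",
  sprintf('  <record id="%d"><value>%d</value></record>', 1:1000, 1:1000),
  "</records>"
), path)

# Rough equivalent of `head -n 5`: read only the first five lines
top <- readLines(path, n = 5)
cat(top, sep = "\n")

# Rough equivalent of `tail -n 2`. Note that this reads the whole file
# into memory, so the command line tools remain the better choice
# for very large inputs
bottom <- tail(readLines(path), 2)
cat(bottom, sep = "\n")
```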
-
Work with a small subset of the data first. This is good practice in general and applies especially well here. There are two main reasons:
- To save time. Reading and processing large XML files can be time-consuming, especially when repeated several times before reaching the right solution (it happens more often than you think).
- To facilitate solution design. Usually, an XML document repeats the same layout (branches, elements) across its nodesets. You don't need the whole document (hundreds of thousands of elements); a couple of elements from each nodeset are enough for programming.
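One way to put this into practice with xml2 is to pull out a handful of nodes, develop the parsing function on them, and only then run it over the full nodeset. The document below, the record element, and the parse_records() helper are hypothetical names made up for this sketch.

```r
library(xml2)

# Hypothetical document standing in for a large file
xml_lines <- c(
  "<records>",
  sprintf("<record><name>item%d</name><value>%d</value></record>",
          1:500, 1:500),
  "</records>"
)
doc <- read_xml(paste(xml_lines, collapse = ""))

records <- xml_find_all(doc, "//record")

# Hypothetical helper: turn a nodeset of <record> elements into a data frame.
# The leading "." keeps each lookup local to its own <record>.
parse_records <- function(nodes) {
  data.frame(
    name  = xml_text(xml_find_first(nodes, "./name")),
    value = as.integer(xml_text(xml_find_first(nodes, "./value"))),
    stringsAsFactors = FALSE
  )
}

# Develop and debug on a few nodes first ...
small <- records[1:3]
small_df <- parse_records(small)

# ... then apply the same function to the whole nodeset
full_df <- parse_records(records)
```

Because the helper only sees a nodeset, it does not care whether it receives three nodes or all of them, which keeps each iteration fast while the parsing logic is still changing.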
-
dvhc-vietnam: Convert Vietnam's administrative divisions data from XML to CSV/JSON
-
sdmx: Vietnam's National Summary Data
-
vietnam-wdi: Parse the World Bank's Vietnam World Development Indicators
-
weather-forecast: Parse weather forecast XML service
-
XML in a Nutshell by Elliotte Rusty Harold; W. Scott Means
-
Introduction to Data Technologies by Paul Murrell
-
XML and Web Technologies for Data Sciences with R by Deb Nolan and Duncan Temple Lang