GitHub - chuvanan/xml2df: TIL: R's XML parser and toolkit

How hard is it to work with a medium-sized XML file in R?

git clone https://github.com/chuvanan/xml2df.git
cd xml2df
R
renv::restore()

The main packages for working with XML in R are xml2 and XML. xml2 provides R bindings to libxml2 (more precisely, to a subset of libxml2). XML was developed by the legendary Duncan Temple Lang (R Core Team) and is also based on libxml2. In general, xml2 is easier to use for simple tasks thanks to its user-friendly API (more on that later), while XML should be preferred for heavier and more complicated tasks.
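To make the comparison concrete, here is a minimal sketch of the same small task done with both packages. The document, element names, and variable names are made up for illustration; they are not taken from the scripts in this repository.

```r
library(xml2)

# A tiny, made-up document used only for illustration.
txt <- "<provinces>
  <province code='01'><name>Ha Noi</name></province>
  <province code='79'><name>Ho Chi Minh</name></province>
</provinces>"

# xml2: parse, select nodes with XPath, then pull out attributes and text.
doc <- read_xml(txt)
nodes <- xml_find_all(doc, "//province")
df_xml2 <- data.frame(
  code = xml_attr(nodes, "code"),
  name = xml_text(xml_find_first(nodes, "./name")),
  stringsAsFactors = FALSE
)

# XML: the same task. Functions are called with XML:: so that the two
# packages do not mask each other's names.
doc2 <- XML::xmlParse(txt, asText = TRUE)
df_XML <- data.frame(
  code = XML::xpathSApply(doc2, "//province", XML::xmlGetAttr, "code"),
  name = XML::xpathSApply(doc2, "//province/name", XML::xmlValue),
  stringsAsFactors = FALSE
)

print(df_xml2)
print(df_XML)
```

Both versions should produce a two-row data frame with the columns code and name.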
How does an XML parser (for example, libxml2) work?
Rules for building an XPath query:
- Place two forward slashes (//) in front of the top node to jump across nodes
- Separate subsequent nodes with a single forward slash (/)
- Add a period before the search query to make the search local (that's particularly useful when looping over nodesets)
- When things go wrong, it's likely due to namespaces
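The rules above can be sketched with xml2 as follows. The catalog document and the namespace URI are invented for the example. The d1 prefix is the name xml2 gives to a default namespace.

```r
library(xml2)

# Made-up document for illustration.
doc <- read_xml("<catalog>
  <book id='b1'><title>A</title><author>X</author></book>
  <book id='b2'><title>B</title><author>Y</author></book>
</catalog>")

# '//' jumps to matching nodes anywhere; '/' steps down to a direct child.
titles <- xml_text(xml_find_all(doc, "//book/title"))

# Looping over a nodeset: a leading '.' keeps the search inside the
# current node, while a bare '//' searches the whole document again.
books <- xml_find_all(doc, "//book")
n_local  <- length(xml_find_all(books[[1]], ".//author"))
n_global <- length(xml_find_all(books[[1]], "//author"))

# Namespaces: with a default namespace, an un-prefixed query finds nothing.
ns_doc <- read_xml(
  "<catalog xmlns='http://example.com/ns'><book><title>A</title></book></catalog>"
)
n_unprefixed <- length(xml_find_all(ns_doc, "//title"))

# Option 1: use the prefix that xml2 assigns to the default namespace.
ns_title <- xml_text(xml_find_all(ns_doc, "//d1:title", xml_ns(ns_doc)))

# Option 2: strip the default namespace (this modifies ns_doc in place).
xml_ns_strip(ns_doc)
stripped_title <- xml_text(xml_find_all(ns_doc, "//title"))
```

Comparing n_local with n_global shows why the period matters inside a loop, and n_unprefixed shows the typical symptom of a namespace problem: a query that looks right but returns an empty nodeset.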

Always look at the data. By "look at the data" I mean reading the XML file line by line to get a quick overview of its structure, nesting depth, elements, attributes, and namespaces (if any). For this task, command-line tools such as head, tail, and less are very handy.

My favorite utility is less. I can view and navigate a big XML file very quickly with a few keystrokes (G: go to the end of the file, g: go to the start of the file, space: page down, b: page up).

Don't try to open a big file in an editor or a browser.
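If you would rather stay inside an R session, a rough equivalent of head is to read only the first few lines of the file. This is a sketch, not a replacement for less: peek_xml is a hypothetical helper name, and the demo file is generated on the fly.

```r
# Hypothetical helper: return the first n lines of a file without
# reading the rest of it.
peek_xml <- function(path, n = 20) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = n, warn = FALSE)
}

# Demo on a throwaway file standing in for a large XML document.
tmp <- tempfile(fileext = ".xml")
writeLines(
  c("<?xml version='1.0'?>", "<root>",
    paste0("  <item>", 1:1000, "</item>"),
    "</root>"),
  tmp
)

first_lines <- peek_xml(tmp, n = 5)
cat(first_lines, sep = "\n")
```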
Work with a small subset of the data first. This is good practice in general, and it applies well to this situation. There are two main reasons:
- To save time. Reading and processing large XML files can be time-consuming, especially when repeated several times before reaching the right solution (it happens more often than you think).
- To facilitate solution design. Usually, an XML document has the same layout (branches, elements) at different levels (nodesets). You don't need the whole document (hundreds of thousands of elements); just a couple of elements from each nodeset is enough for programming.
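One possible way to cut such a subset with xml2 is sketched below. The records document, n_keep, and the temporary output path are assumptions made for the example; the subset files in this repository may have been produced differently.

```r
library(xml2)

# Made-up document standing in for a large file.
big <- read_xml(paste0(
  "<records>",
  paste0("<record id='", 1:1000, "'><value>", 1:1000, "</value></record>",
         collapse = ""),
  "</records>"
))

records <- xml_find_all(big, "/records/record")

# Copy the first few records under a fresh root with the same name.
n_keep <- 3
small <- xml_new_root(xml_name(big))
for (i in seq_len(min(n_keep, length(records)))) {
  xml_add_child(small, records[[i]])
}

# Save the subset so later experiments can start from the small file.
subset_path <- tempfile(fileext = ".xml")
write_xml(small, subset_path)

kept_ids <- xml_attr(xml_find_all(read_xml(subset_path), "//record"), "id")
```

The code that extracts values can then be developed against the small file and run on the full document only once it works.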

dvhc-vietnam: Convert Vietnam's administrative divisions data from XML to CSV/JSON
sdmx: Vietnam's National Summary Data
vietnam-wdi: Parse WorldBank's Vietnam World Development Indicators.
weather-forecast: Parse weather forecast XML service

Libxml Tutorial
XML - Namespaces
Quick Guide to XML in R
XML Files
XML Tutorial
xml default namespace rage
Memory Management in the XML Package
XML in a Nutshell by Elliotte Rusty Harold; W. Scott Means
Introduction to Data Technologies by Paul Murrell
XML and Web Technologies for Data Sciences with R by Deb Nolan and Duncan Temple Lang
