Skip to content

Latest commit

 

History

History
132 lines (98 loc) · 5.53 KB

README.md

File metadata and controls

132 lines (98 loc) · 5.53 KB

XML-Flattener

Java 8 / 11 Build on Windows Java 8 / 11 Build on Linux

The XML Flattener converts hierarchical XML documents into table shaped rectangular data sets consisting of rows and columns.

TL;DR - Running the Hello World Example

Assuming you have a version of Java 8+ installed (please see Adopt OpenJDK), you can download the binary and flatten out the Hello World Input XML to a CSV file with:

Linux


git clone https://github.com/DevWorxCo/xml-flattener.git

cd xml-flattener

wget -P target https://www.devworx.co.uk/assets/jars/xml-flattener-exec.jar

java -jar target/xml-flattener-exec.jar examples/Hello-World/hello-world.yml

Windows

git clone https://github.com/DevWorxCo/xml-flattener.git

cd xml-flattener

curl -o target/xml-flattener-exec.jar --create-dirs https://www.devworx.co.uk/assets/jars/xml-flattener-exec.jar

java -jar target/xml-flattener-exec.jar examples/Hello-World/hello-world.yml

The above commands will produce the examples/Hello-World/output/continents-flattened.csv file.

Alt text

Introduction

The XML Flattener is a tool that can assist with the following common data use-cases:

  • Data extraction over directories containing large numbers of XML files to a form which can then be further analysed by tools such as R Studio, Python Pandas or Excel.
  • Transforming deeply nested XML structures to a shape that is consumable by libraries optimised for rectangular data sets (e.g. machine learning libraries or even just end-user spreadsheets).
  • Data loading use-cases where the XML data needs to be flattened such that it can be consumed by relational databases and queried via SQL.

If you wish to build this project locally (rather than using the pre-compiled binary) please consult the README-Building.md guide.

Hello World Example

The "Hello-World" example offers an overview of the functionality that this tool provides.

Consider the Input XML which lists a number of features per continent. It has three levels of nesting and the continent tag has a variable number of attributes.

<?xml version='1.0' encoding='utf-8'?>
<root>
    <title>Hello World</title>
    <source>https://en.wikipedia.org/wiki/Continent</source>
    <description>Simple Example of Features</description>
    <continents>
        <continent area="30370000" population="1287920000" most-populous-city="Lagos, Nigeria">
            <name>Africa</name>
        </continent>
        <continent area="14000000" population="4490" most-populous-city="McMurdo Station" countries="0">
            <name>Antarctica</name>
        </continent>
        <continent area="44579000" population="4545133000" most-populous-city="Shanghai, China">
            <name>Asia</name>
        </continent>
        <continent area="10180000" population="742648000" most-populous-city="Moscow, Russia" demonym="European">
            <name>Europe</name>
        </continent>
        <continent area="24709000" population="587615000" most-populous-city="Mexico City, Mexico">
            <name>North America</name>
        </continent>
        <continent area="8600000" population="41261000" most-populous-city="Sydney, Australia">
            <name>Australia</name>
        </continent>
        <continent area="17840000" population="428240000" most-populous-city="São Paulo, Brazil">
            <name>South America</name>
        </continent>
    </continents>
</root>

In order to perform analysis on this XML it needs to be flattened to the following CSV file:

Alt text

This can be accomplished with the following YAML flattening specification

name: Hello World Example
inputPath: xml
outputTables:
  - name: continents-flattened
    outputFile: output/continents-flattened.csv
    definition:
      - columnName: Title
        sourceType: xpath
        sourceDef: root/title
      - columnName: Source
        sourceType: xpath
        sourceDef: root/source
      - columnName: Description
        sourceType: xpath
        sourceDef: root/description
      - columnName: Continents
        sourceType: xpath
        sourceDef: root/continents/continent
        explode: true # Create a row for each element
        repeatingList:
          - columnName: Continent-Attrb-
            sourceType: dynAttribute
            sourceDef: "."
            attributeFilter: ".*"
          - columnName: Continent-Name
            sourceType: xpath
            sourceDef: name/text()

Summary of Examples