Unification of XML to dict list tree translation #113

Open
1 of 4 tasks
sem-geologist opened this issue May 3, 2023 · 10 comments

Comments

@sem-geologist
Contributor

sem-geologist commented May 3, 2023

I have spotted quite a lot of duplicated effort in parsing and translating hierarchical XML metadata into Pythonic dict and list structures. I will keep this issue updated. I think the XML translator used in the Bruker API is the most developed one, so I am expanding it to take into account the most bizarre XML cases (and XML can be a really unreadable but valid mess).

Progress:

@sem-geologist
Contributor Author

sem-geologist commented May 4, 2023

For those who are not familiar with "what is so hard about translating XML to Python structures", this is a primer.
An XML tree node (which has opening and closing tags) can contain data in 3 forms. Let's begin the example with an empty node called NodeX: <NodeX></NodeX>. Such a node can contain other nodes (called children):

<NodeX>
  <subNode1></subNode1>
  <subNode2></subNode2>
</NodeX>

These child nodes can have the same name or differing names. Children with differing names translate to a Python dict; children sharing a name translate to a list. The above XML could be translated to:

{"NodeX":
  {"subNode1": None,
   "subNode2": None}
}

If both children had the same name, subNode:

{"NodeX":
  {"subNode": [None, None]}
}
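
As a rough illustration of the rule above (differing names become dict keys, a repeated name collapses into a list), here is a minimal sketch using only the standard library. The function name children_to_dict is made up for this example, and attributes and mixed text are deliberately ignored at this stage.

```python
import xml.etree.ElementTree as ET


def children_to_dict(element):
    """Translate an element's children; leaves become their text (or None)."""
    children = list(element)
    if not children:
        return element.text  # None for an empty node such as <a></a>
    result = {}
    for child in children:
        value = children_to_dict(child)
        if child.tag not in result:
            result[child.tag] = value  # first occurrence: plain dict entry
        elif isinstance(result[child.tag], list):
            result[child.tag].append(value)  # third and later occurrences
        else:
            result[child.tag] = [result[child.tag], value]  # second occurrence
    return result


different = ET.fromstring("<NodeX><subNode1></subNode1><subNode2></subNode2></NodeX>")
same = ET.fromstring("<NodeX><subNode></subNode><subNode></subNode></NodeX>")
print({different.tag: children_to_dict(different)})
print({same.tag: children_to_dict(same)})
```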

Looks easy! No?
Then a node can contain some content between its tags:

<NodeX>
  <subNode>Alpha</subNode>
  <subNode>Beta</subNode>
</NodeX>

which would translate to:

{"NodeX":
  {"subNode": ["Alpha", "Beta"]}
}

But now pay attention: it is absolutely valid to add content alongside the children, like this:

<NodeX>This is
  <subNode>Alpha</subNode> and this is
  <subNode>Beta</subNode> subnodes
</NodeX>

The text "This is" is accessible as .text of the ElementTree element NodeX; the other parts are in .tail of the child elements.
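
To make the .text/.tail split concrete, here is a small standard-library sketch. It collects the pieces both as a list and as one concatenated string; stripping the whitespace and joining with a single space are choices made only for this illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<NodeX>This is
  <subNode>Alpha</subNode> and this is
  <subNode>Beta</subNode> subnodes
</NodeX>"""

root = ET.fromstring(xml_text)

# Text before the first child lives in root.text; text following each child
# lives in that child's .tail attribute.
pieces = [root.text] + [child.tail for child in root]

# Drop missing or whitespace-only pieces and strip indentation and newlines.
parts = [piece.strip() for piece in pieces if piece and piece.strip()]
joined = " ".join(parts)

print(parts)
print(joined)
```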

The question is where (and how) to put "This is and this is subnodes". It should sit one level below the "NodeX" key in the dict, but how should we name the key? Content? Text? text? values? Should the pieces be returned as a list, or concatenated into a single string?
The naming question gets harder with XML like this:

<NodeX>This is
  <Content>Alpha</Content> and this is
  <Content>Beta</Content> subnodes
</NodeX>

It would be extremely naive to think that we could pick some sensible key name and that it would never collide with a tag name in some OEM XML format. A kind of Murphy's law applies: if such a collision is possible, however unlikely, some OEM will implement it for some bizarre reason.
Thus such a key should start with a character that cannot appear at the beginning of a tag name, so that there is no collision between pythonised children keys and pythonised content keys. One such character is, for example, the hash sign (#).
Python dict keys can begin with a hash sign, so such text can be stored under #text, #value, or another hash-prefixed key, depending on the XML context:

{"NodeX": {"Content": ["Alpha", "Beta"], "#text": "This is and this is subnodes"}}
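
A quick way to convince yourself that a hash-prefixed key cannot collide with a translated tag name is to ask the XML parser itself. The helper is_valid_tag below is a hypothetical name used only in this sketch; it relies on the standard-library parser rejecting names that are not valid XML element names.

```python
import xml.etree.ElementTree as ET


def is_valid_tag(name):
    """Return True if `name` is accepted by the parser as an element name."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


# "Content" can appear as a tag, so a plain "Content" key could collide.
print(is_valid_tag("Content"))
# "#text" can never be a tag, yet it is a perfectly fine dict key.
print(is_valid_tag("#text"))
safe = {"Content": ["Alpha", "Beta"], "#text": "This is and this is subnodes"}
```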

But that is not the only complication, because we have looked at only 2 of the 3 ways of storing data in XML.

@sem-geologist
Contributor Author

sem-geologist commented May 4, 2023

Then there is the 3rd way: attributes. I know some proprietary scientific formats which stuff hierarchical data into a flat structure (the XML has a root node and only a single level with a long list of children), where the hierarchy is described only by attributes. (Resolving the hierarchy in one's head, without software, is beyond human comprehension in such cases.) That would look something like this:

<root>
  <Instance class="Detector" parent="root" />
  <Instance class="Window" parent="Detector" thickness="10" Type="Thin" />
  <Instance class="IonGun" parent="root" />
  ....
</root> 

...while it looks messy and is hardly comprehensible for a human, such a flat structure maps easily to a Python dictionary. Basically it would look like this:

{"root":
  { "Instance": [
      {"class": "Detector", "parent": "root"},
      {"class": "Window", "parent": "Detector", "thickness": 10, "Type": "Thin"},
      {"class": "IonGun", "parent": "root"},
    ]
  }
}

Making it hierarchical from that point would need some effort.
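
For a sense of what that effort looks like, here is a minimal sketch that rebuilds the hierarchy from the flat list. It assumes (and this is only an assumption for the illustration) that every "class" value is unique and that every "parent" refers to the root or to an existing class; the function name build_hierarchy is hypothetical.

```python
def build_hierarchy(instances, root_name="root"):
    """Nest flat {'class': ..., 'parent': ...} records under their parents."""
    # First pass: one dict per instance, holding its remaining attributes.
    nodes = {root_name: {}}
    for inst in instances:
        nodes[inst["class"]] = {
            key: val for key, val in inst.items() if key not in ("class", "parent")
        }
    # Second pass: hang every node under its parent (shared references,
    # so the order of the records does not matter).
    for inst in instances:
        nodes[inst["parent"]][inst["class"]] = nodes[inst["class"]]
    return {root_name: nodes[root_name]}


flat = [
    {"class": "Detector", "parent": "root"},
    {"class": "Window", "parent": "Detector", "thickness": 10, "Type": "Thin"},
    {"class": "IonGun", "parent": "root"},
]
tree = build_hierarchy(flat)
print(tree)
```

Note that this naive version would itself suffer from a name collision if an instance had an attribute named like one of its child classes.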
However, that straightforward mapping can be thrown out of the window the moment children and content appear under such tags.
Attributes are then often used as additional metadata needed to interpret the content correctly:

<dataNode>
  <size units="bytes">2048</size>
  <createdWith version="1.2.4">SoftwareX</createdWith>
  <comment>From garden</comment>
  <randomMPStuff title="bunch of taxes" class="taxes">
    <title proposed="committee on the new taxation">Thingy</title>
    <title proposed="an old man">An old people standing in the water</title>
    <title proposed="some street advisor">Holiday snaps</title>
  </randomMPStuff>
</dataNode>

The first problem shows up directly under "dataNode":
the size tag has the attribute units and the content 2048. Clearly both need to become children of the Python dictionary {"size": .... }, but how should we name the content? text? value? This was discussed one post above: it is the same problem as when content and children are both present, and it is best to keep this keyword flexible so it fits the sense of the file format. IMHO it is best to use the same keyword for both collision cases (attribute-content and children-content), so that code consuming the translated dictionary has less work matching strings.

Then, if we look at the <randomMPStuff> tag, we see another very plausible situation: a child tag name and an attribute name collide. Please bear in mind the most important thing: XML used by desktop software for object (de-)serialization and for saving/loading to/from storage (what we know as file formats) is very susceptible to attribute and tag name collisions. Such an XML structure is most often not human-designed at all, but imposed by the framework used to create the desktop application.

The simplest way to work around this collision problem is to prepend the attribute name with some character which is not valid in an XML tag name but is valid in a Python dictionary key, e.g. "@".
When should @ be prepended to the attribute name in the translation? Whenever any child tag is present? Or should we check for collisions and prepend @ only when a colliding child tag name is detected? I would argue against the latter because: 1) checking for name collisions adds iterations to the processing loop and will significantly increase translation time; checking every attribute name against every child tag name has a much bigger penalty than just checking whether the tag has any children. 2) Some files from the same vendor may have no collisions; whether a collision appears can depend on toggling some specific method or on a sequence of actions in the OEM software. The target software using such an XML-to-dict translation would then need to check for both prepended and non-prepended attribute names. A nicer translation would result in worse performance at later stages.

Considering the above, the string prepended to attribute names should be customisable by the developer. The default is @. That could cause problems when later wrapping the dictionary with some "objectify-ers" like Box or DictionaryTreeBrowser, as a Python attribute name cannot start with @. Another non-colliding prefix could be used then (e.g. for Bruker the "XmlClass" prefix is used instead of the default "@").
If the developer is certain that there are absolutely no attribute and tag name collisions, an empty string "" can be used, so that the dictionary keys keep the same non-prefixed names as in the original XML.
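
The prefixing rule argued for above (prefix attributes whenever the node has any children, without checking for actual collisions) can be sketched in a few lines. This is not the XmlToDict implementation, just a toy translator written for this comment; the names translate, attr_prefix and text_key are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET


def translate(element, attr_prefix="@", text_key="#value"):
    """Toy XML-to-dict translation showing attribute prefixing."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text or None  # plain leaf: just its content
    node = {}
    for child in children:
        value = translate(child, attr_prefix, text_key)
        if child.tag not in node:
            node[child.tag] = value
        elif isinstance(node[child.tag], list):
            node[child.tag].append(value)
        else:
            node[child.tag] = [node[child.tag], value]
    # Cheap rule: prefix attributes if there are any children at all.
    prefix = attr_prefix if children else ""
    for key, val in element.attrib.items():
        node[prefix + key] = val
    if text:
        node[text_key] = text  # content stored next to attributes/children
    return node


xml_text = """<randomMPStuff title="bunch of taxes" class="taxes">
  <title proposed="an old man">Thingy</title>
  <size units="bytes">2048</size>
</randomMPStuff>"""
result = translate(ET.fromstring(xml_text))
print(result)
```

Here the child tag title and the attribute title both survive: the attribute ends up under "@title", while size keeps its un-prefixed "units" key because it has no children.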

@ericpre
Member

ericpre commented May 6, 2023

@sem-geologist, does #111 fix this issue? At least for now?

@sem-geologist
Contributor Author

@ericpre I want to use this issue as a tracker for XML handling unification. #111 is the initial part of what this issue states. I think it will be proper to close this once most of the file readers have adopted it (I plan to review and adapt them one by one) and the dev documentation is updated with how to use it for new formats with XML (there are dev docs, no?).
I also want to finish stating my reasoning for why I chose this way of translation and not another; maybe someone has a different view and could suggest a slightly different approach. It would be good to reach consensus early.

@sem-geologist
Contributor Author

I have updated some of the posts above.

This is a continuation of the discussion and a demonstration of the current state (which was merged with #111).

The XML below was also added in #111 as a test file under tests/utils/ToastedBreakFastSDD.xml. While it is absolutely artificial and may look like an exaggeration of the XML format situation, I tried to include different possible scenarios seen in files from different vendors:

<?xml version="1.0" encoding="UTF-8"?>
<TestXML>
  <Header>
    <ShortDescription>Test XML</ShortDescription>
    <HTMLDescription><![CDATA[This utter <i>nonsense</i> <b>XML</b> was created to check far-fetched ideas about sub-worst case of OEM-generated XML scenarios.]]></HTMLDescription>
  </Header>
  <Main>
    <ClassInstance>
       <Detector>
          <ClassInstance>
            <Angle>15.345</Angle>
            <Type>SDD</Type>
            <Model>BreakFast™</Model>
            <PulseProcessor>FPGAv11</PulseProcessor>
            <BufferSize units="bytes">2048</BufferSize>
          </ClassInstance>
       </Detector>
       <Instrument ClassInstance="Analytical">
          <Type class="chasis">Toaster</Type>
          <SerialNumber>1234-5</SerialNumber>
          <Dim axes="width, depth, height">33.3,27.4,25.2</Dim>
          <IsCoated/>
          <IsToasted>not today</IsToasted>
          <IsToasting>affirmative</IsToasting>
       </Instrument>
       <Sample name="breakfast test" number="23">With one of these components
          <Components>
            <ComponentChildren>
              <Instance name="Eggs" calories="345.2" breaking-speed="5.2"></Instance>
              <Instance name="Bacon" calories="5000" breaking-speed="11"></Instance>
              <Instance name="Spam" calories="0.1" breaking-speed="24.6"></Instance>
            </ComponentChildren>
          </Components>
          <Project>BreakFast</Project> SDD risks to be Toasted.
       </Sample>
    </ClassInstance>
  </Main>
</TestXML>

The XML is intentionally over-verbose, with many name collisions.

The XML-to-dictionary translator also uses literal_eval from ast to evaluate values, and that evaluation can be expanded if there is a need. Because, as discussed above, some parts of the translator are customisable and the evaluation can be expanded, the translator is defined as a class and not just as a simple function.
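
For readers unfamiliar with ast.literal_eval, the general idea is sketched below. The real evaluation inside XmlToDict may differ in detail; eval_text is a hypothetical helper that simply falls back to the original string whenever the text is not a Python literal.

```python
from ast import literal_eval


def eval_text(text):
    """Return the Python literal encoded in `text`, or `text` itself."""
    try:
        return literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a literal (e.g. 'SDD', 'not today', '1234-5'): keep the string.
        return text


for raw in ("15.345", "2048", "33.3,27.4,25.2", "SDD", "1234-5", "not today"):
    print(repr(raw), "->", repr(eval_text(raw)))
```

Comma-separated numbers come back as a tuple, which is consistent with how the Dim value appears in the translated structures in this thread.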

Could it be just a function? Actually it could, and that would probably be simpler for simple cases.
However, consider XML like this:

<Main>
  <DetectorHeader />
  <EPMAHeader />
  <SEMHeader />
  <Data />
  <RedundantColors />
  <UnicornsOnTheScreen />
  <LittleMermaid />
  <SelectionOfGardenBakedClayDwarfStatues />
</Main>

Its first 4 nodes contain useful scientific (meta-)data, and its last 4 nodes contain visualization instructions that are useful only for the OEM software and absolutely useless for anything else.
So the idea of the class-based translator is that, after initializing the translator, we call its .dictionarize method only on the useful XML etree nodes. With this approach the irrelevant parts are read into memory only once by ElementTree, are ignored by the further methods, and, once the relevant parts have been extracted, the whole tree can be discarded and garbage-collected. It is important not to keep any references to any element of the etree, so that it can be deleted as a whole; evaluating the data into proper Python objects further guarantees "cutting ties" with the etree objects. Translating the whole of such an XML to Python dicts and lists would unnecessarily increase read time and memory footprint.
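
The pattern described above might look roughly like the sketch below. To keep it self-contained it uses a trivial dict comprehension where real code would call the translator's .dictionarize method; the child tag names (Type, HV, Count) and the function name extract_useful are invented for the illustration.

```python
import xml.etree.ElementTree as ET

USEFUL_TAGS = ("DetectorHeader", "EPMAHeader", "SEMHeader", "Data")


def extract_useful(xml_text, useful_tags=USEFUL_TAGS):
    """Convert only the useful top-level nodes, keeping no Element references."""
    root = ET.fromstring(xml_text)  # the whole tree is parsed exactly once
    extracted = {}
    for tag in useful_tags:
        node = root.find(tag)
        if node is not None:
            # Only plain Python strings are copied out of the tree.
            extracted[tag] = {child.tag: child.text for child in node}
    # `root` goes out of scope on return, so the tree can be freed as a whole.
    return extracted


xml_text = (
    "<Main>"
    "<DetectorHeader><Type>SDD</Type></DetectorHeader>"
    "<SEMHeader><HV>15</HV></SEMHeader>"
    "<UnicornsOnTheScreen><Count>7</Count></UnicornsOnTheScreen>"
    "</Main>"
)
metadata = extract_useful(xml_text)
print(metadata)
```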

@sem-geologist
Contributor Author

sem-geologist commented May 15, 2023

Using the new functionality in action:

import xml.etree.ElementTree as ET
from rsciio.utils.tools import XmlToDict

x2d_translator = XmlToDict()

Making the etree object from ToastedBreakFastSDD.xml and converting it to a Python structure would look like:

toasted_break_fast_sdd_et = ET.fromstring(there_not_shown_loaded_as_python_bytes_or_str_xml)
py_toasted = x2d_translator.dictionarize(toasted_break_fast_sdd_et)

py_toasted will contain this structure:

{
'TestXML': {
  'Header': {
     'ShortDescription': 'Test XML',
     'HTMLDescription': 'This utter <i>nonsense</i> <b>XML</b> was created to check far-fetched ideas about sub-worst case of OEM-generated XML scenarios.'},
  'Main': {
    'ClassInstance': {
      'Detector': {
        'ClassInstance': {
          'Angle': 15.345,
          'Type': 'SDD',
          'Model': 'BreakFast™',
          'PulseProcessor': 'FPGAv11',
          'BufferSize': {'units': 'bytes', '#value': 2048}
        }
      },
      'Instrument': {
        'Type': {'class': 'chasis', '#value': 'Toaster'},
        'SerialNumber': '1234-5',
        'Dim': {'axes': 'width, depth, height', '#value': (33.3, 27.4, 25.2)},
        'IsCoated': None,
        'IsToasted': 'not today',
        'IsToasting': 'affirmative',
        '@ClassInstance': 'Analytical'
      },
      'Sample': {
        'Components': {
          'ComponentChildren': {
            'Instance': [
              {'name': 'Eggs', 'calories': 345.2, 'breaking-speed': 5.2},
              {'name': 'Bacon', 'calories': 5000, 'breaking-speed': 11},
              {'name': 'Spam', 'calories': 0.1, 'breaking-speed': 24.6}
            ]
          }
        },
        'Project': 'BreakFast',
        '@name': 'breakfast test',
        '@number': 23,
        '#value': 'With one of these components'
        }
      }
    }
  }
}

It is verbose! Accessing the name of the first breakfast component needs this monstrosity:

py_toasted['TestXML']['Main']['ClassInstance']['Sample']['Components']['ComponentChildren']['Instance'][0]['name']

Using Box would hardly make any difference (maybe it is more pleasant for Java devs... and ultra-wide-screen friendly):

from box import Box  # third-party "python-box" package

boxy_obj = Box(py_toasted)
boxy_obj.TestXML.Main.ClassInstance.Sample.Components.ComponentChildren.Instance[0].name

Now it is time for a reminder: many XML structures used by OEMs are framework-designed, not human-designed. If we want this data to be usable in a human-readable form (Box or DataTreeViewer or another helper...), it would be good to shave off some of the artificial hierarchical scruff. That is indeed easy to do while initializing XmlToDict. The tags_to_flatten keyword accepts strings naming tags which should be removed from the hierarchical tree; the children of such a node then go directly under the parent of that node.
As an example, we can initialize the translator with more options:

better_x2d = XmlToDict(
    dub_text_str="#val",
    interchild_text_parsing='cat',
    tags_to_flatten=[
        "ClassInstance",
        "ComponentChildren",
        "Instance"
        ]
)
py_better_toasted = better_x2d.dictionarize(toasted_break_fast_sdd_et)

That will give a much flatter structure, without the redundant programming-framework-injected stuff:

{
'TestXML': {
  'Header': {
    'ShortDescription': 'Test XML',
    'HTMLDescription': 'This utter <i>nonsense</i> <b>XML</b> was created to check far-fetched ideas about sub-worst case of OEM-generated XML scenarios.'
  },
  'Main': {
    'Detector': {
      'Angle': 15.345,
      'Type': 'SDD',
      'Model': 'BreakFast™',
      'PulseProcessor': 'FPGAv11',
      'BufferSize': {'units': 'bytes', '#val': 2048}
    },
    'Instrument': {
      'Type': {'class': 'chasis', '#val': 'Toaster'},
      'SerialNumber': '1234-5',
      'Dim': {'axes': 'width, depth, height', '#val': (33.3, 27.4, 25.2)},
      'IsCoated': None,
      'IsToasted': 'not today',
      'IsToasting': 'affirmative',
      '@ClassInstance': 'Analytical'
    },
    'Sample': {
      'Components': {
        'name': ['Eggs', 'Bacon', 'Spam'],
        'calories': [345.2, 5000, 0.1],
        'breaking-speed': [5.2, 11, 24.6]
      },
      'Project': 'BreakFast',
      '@name': 'breakfast test',
      '@number': 23,
      '#interchild_text': 'With one of these componentsSDD risks to be Toasted.'
      }
    }
  }
}

I think everyone will agree that this is better than the previous one.
Accessing the name of the first breakfast component will look like:

py_better_toasted['TestXML']['Main']['Sample']['Components']['name'][0]

or in the boxified version:

py_boxy_better = Box(py_better_toasted)
py_boxy_better.TestXML.Main.Sample.Components.name[0]

Hope this starts to look reasonable.
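
For anyone curious how flattening can work in principle, here is a stripped-down sketch of the idea: the children of a flattened tag are hoisted to its parent. It is independent of XmlToDict (whose actual implementation and handling of attributes may differ); the names effective_children and to_dict are made up, and attributes and mixed text are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def effective_children(element, flatten):
    """Yield children, replacing every flattened child by its own children."""
    for child in element:
        if child.tag in flatten:
            yield from effective_children(child, flatten)
        else:
            yield child


def to_dict(element, flatten=()):
    """Translate children only; leaves become their text."""
    result = {}
    for child in effective_children(element, flatten):
        value = to_dict(child, flatten) or child.text
        if child.tag not in result:
            result[child.tag] = value
        elif isinstance(result[child.tag], list):
            result[child.tag].append(value)
        else:
            result[child.tag] = [result[child.tag], value]
    return result


root = ET.fromstring(
    "<Main><ClassInstance><Detector><ClassInstance>"
    "<Type>SDD</Type><Angle>15.345</Angle>"
    "</ClassInstance></Detector></ClassInstance></Main>"
)
print(to_dict(root))
print(to_dict(root, flatten=("ClassInstance",)))
```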

@CSSFrancis
Member

@sem-geologist this is very cool! Admittedly, I attempted to do something similar with #11 but I see now that I was rather naive in my approach. I'll try to do some testing with those XML files later this week but they should be simple enough that this should definitely cover them.

@pietsjoh
Contributor

pietsjoh commented Jun 12, 2023

I implemented something similar for the trivista file reader, but I would now switch to this version as it is more sophisticated.
I will also check whether I can use XmlToDict for the Jobin Yvon reader.

The trivista file reader doesn't convert the complete ET to a dictionary (data and signal axis information is extracted directly from the ET). For extracting metadata, a recursive approach similar to XmlToDict.dictionarize() is partly used. For these cases XmlToDict.dictionarize() works without any problem. However, in some cases I want to convert only the attribs of a node to a dict, without converting all the children. To achieve this, I could use something like this:

def _et_node_attrib2dict(node):
    return {node.tag: {k: XmlToDict.eval(v) for k, v in node.attrib.items()}}

Should I keep this in the file reader or add it to XmlToDict?
A different option would be to introduce a function parameter to XmlToDict.dictionarize() which limits the recursion depth.

@sem-geologist What is your opinion on this?

@sem-geologist
Contributor Author

sem-geologist commented Jun 13, 2023

@pietsjoh, to begin with, let me ask why you want only the attributes (maybe the problem lies somewhere else)? If it is due to "@" being added to the names, the prefix can be set to an empty string "" when initializing the translator class (provided there is a guarantee of no name clashes with child tag names). It is indeed quite simple to add a way to ignore all children altogether. It is probably wise, as you suggested, to add that kind of functionality as a keyword to .dictionarize(). As for limiting the recursion depth, I am not sure it is the right way. We want or don't want a node's children depending on their content, not on how deep they are. See below.

BTW, how many children does such a node contain? I am asking because I am currently preparing an extension to xml2dict to be able to ignore tags by name. I have a similar (not the same) issue. In Bruker formats, irrelevant nodes, relevant metadata, and data are sometimes intermingled at the same level, like this:

<ClassInstance type="Parent">
  <UsefulMetadata1>SDDtype3</UsefulMetadata1>
  <UsefulMetadata2>
    <ClassInstance type="HardwareHeader">
      <ShappingTime>0.2</ShappingTime>
      <PulserFreq>50000</PulserFreq>
      <Channels>4096</Channels>
      <Size>2</Size>
    </ClassInstance>
  </UsefulMetadata2>
  <NotRelevantData>
    <Count>254</Count>
    <C1>0,0,0</C1>
    <C2>1,1,1</C2>
    <!--up to 254 shades of gray-->
  </NotRelevantData>
  <UsefulMetadata3>4</UsefulMetadata3>
  <UsefulMetadata4>587</UsefulMetadata4>
  <UsefulMetadata5>3E-2</UsefulMetadata5>
  <InterestingMetadata>hh</InterestingMetadata>
  <Data>1,2,3,4,5,6,7,8,9,4,2,21,5</Data>
</ClassInstance>

In the above pseudo example, my Bruker code has hitherto read and converted UsefulMetadata1, 3, 4 and 5 one by one, with dictionarization used only on UsefulMetadata2. The point is that there are a few nodes which I do not want to dictionarize, for different reasons. <NotRelevantData> I want to just skip and forget that it exists. <Data> I want to postpone: it needs to be converted to a numpy array, but that requires the shape and data types, which are available only after reading the XML. There is also no sense in keeping the raw byte string in original metadata (it is not metadata at all, but data). So I am also going to add the possibility to ignore child nodes by tag name.
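
A rough sketch of what ignoring by tag name could look like is shown below. The ignore_tags parameter and the function name dictionarize_children are hypothetical (this is not an existing XmlToDict option), repeated tags and attributes are not handled, and a plain list of ints stands in for the numpy conversion.

```python
import xml.etree.ElementTree as ET


def dictionarize_children(element, ignore_tags=()):
    """Translate children to a dict, skipping any tag listed in ignore_tags."""
    result = {}
    for child in element:
        if child.tag in ignore_tags:
            continue  # skipped together with its whole subtree
        if len(child):  # the child has children of its own
            result[child.tag] = dictionarize_children(child, ignore_tags)
        else:
            result[child.tag] = child.text
    return result


root = ET.fromstring(
    "<ClassInstance>"
    "<UsefulMetadata1>SDDtype3</UsefulMetadata1>"
    "<NotRelevantData><Count>254</Count></NotRelevantData>"
    "<UsefulMetadata3>4</UsefulMetadata3>"
    "<Data>1,2,3</Data>"
    "</ClassInstance>"
)
metadata = dictionarize_children(root, ignore_tags={"NotRelevantData", "Data"})
# The postponed <Data> node is read separately, once its shape/dtype are known.
data = [int(value) for value in root.find("Data").text.split(",")]
print(metadata, data)
```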

@pietsjoh
Contributor

@sem-geologist, essentially my situation looks somewhat like this:

<Document Version="2" Label="Intensity" DataLabel="Counts" InfoSerialized="&lt;?xml version=&quot;1.0&quot; ...">
  <NotRelevantMetadata Count="0" />
  <Data>
    <Frame>1;2;3;4;5</Frame>
  </Data>
</Document>

Lots of useful metadata is saved in the attributes of the <Document> node, which I want to convert to a dictionary.
However, the only child node I care about is <Data>, which I read separately. So I would want to ignore all children.

> As for limiting the recursion depth, I am not sure it is the right way. We want or don't want a node's children depending on their content, not on how deep they are. See below.

Yes, thinking about it again, this won't be needed as a generalization.

> It is indeed quite simple to add a way to ignore all children altogether. It is probably wise, as you suggested, to add that kind of functionality as a keyword to .dictionarize().

That is exactly what I did in my original version. And I think that is probably the easiest solution for my case. Another option would be to split up the dictionarize() method:

class XmlToDict:
    def read_attributes(self, et_node, has_children=False):
        # Prefix attribute names only when the node has children,
        # so that they cannot collide with child tag names.
        prefix = self.dub_attr_pre_str if has_children else ""
        return {
            prefix + key: self.eval(val)
            for key, val in et_node.attrib.items()
        }

    def dictionarize(self, et_node):
        d_node = {et_node.tag: {} if et_node.attrib else None}
        children = list(et_node)
        if children:
            ...
        if et_node.attrib:
            d_node[et_node.tag].update(
                self.read_attributes(et_node, has_children=bool(children))
            )
        if et_node.text:
            ...

That would work in my case. However, in general it would probably be more useful to just ignore the children and make a dictionary out of .attrib and .text, which is easier to implement with an extra function parameter to dictionarize().
