too large memory footprint in READER node.attributes when millions of child nodes are present #1283

Open
robert-v-simon opened this issue May 8, 2015 · 2 comments
Labels
topic/memory Segfaults, memory leaks, valgrind testing, etc. topic/performance

@robert-v-simon

Environment:

  • Windows 7 64-bit
  • Ruby 2.0.0 p353 32-bit
  • Nokogiri 1.6.1 x86-mingw32

I have a 5 GB XML file with the following structure:

<BATCH>
    <BATCH_TYPE>ALL</BATCH_TYPE>
    <BATCH_UPDATE>RELOAD</BATCH_UPDATE>
    <BATCH_ID>0815</BATCH_ID>
    <BATCH_CHANGE TYPE="UPDATE_CONTENT_A">
        <CONTENT_A>
        ...
        </CONTENT_A>
    </BATCH_CHANGE>
    <BATCH_CHANGE TYPE="UPDATE_CONTENT_B">
        <CONTENT_B OBJECT_A="abcdefg" OBJECT_B="0123456" BEGIN="000000000" END="000000500">
        ...
        </CONTENT_B>
    </BATCH_CHANGE>
</BATCH>

total count of CONTENT_A: 1,261,642
total count of CONTENT_B: 10,707,587

I use the Reader to go through these XML files and analyse the data in CONTENT_A and CONTENT_B. For that analysis I also need the values stored in attributes on the nodes within CONTENT_A and CONTENT_B.

The following reader code blows up (memory grows beyond what Ruby can handle) as soon as it reaches the first <BATCH_CHANGE> node:

xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
    case node.node_type
        when 1
            @xmlTree.push(node.name)
            if node.attributes? 
                @nodeAttrib = node.attributes 
            else 
                @nodeAttrib = {} 
            end
...

The following reader code works fine for the entire document, but it does not keep the attribute names (keys), which are essential for my analysis:

xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
    case node.node_type
        when 1
            @xmlTree.push(node.name)
            if node.attributes? 
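                @nodeAttrib = {}  # start fresh so values from a previous node do not linger
                g = 0
                while g < node.attribute_count do
                    @nodeAttrib[g] = node.attribute_at(g)  # value only, the attribute name is not available here
                    g += 1  # advance to the next attribute; without this the loop never terminates
                end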
            else 
                @nodeAttrib = {} 
            end
            ...

It appears that node.attributes also looks at the subsequent (child) nodes, which makes the memory footprint grow beyond the limit Ruby can handle, while node.attribute_count and node.attribute_at() read only the current node's attribute data and therefore behave as expected.

As there is no node.attribute_key_at() available, I currently exclude the node that causes the trouble from the node.attributes lookup, which lets the reader get through the whole XML file:

xmlReader = Nokogiri::XML::Reader(fileXML)
xmlReader.each do |node|
    case node.node_type
        when 1
            @xmlTree.push(node.name)
            if node.attributes? && (node.name != "BATCH_CHANGE")
                @nodeAttrib = node.attributes 
            else 
                @nodeAttrib = {} 
            end
            ...

As the last code example works, there seems to be a problem with node.attributes when millions of child nodes that also carry attributes are present. Strangely, only node.attributes is affected, while node.attribute_count and node.attribute_at() work fine.

@robert-v-simon robert-v-simon changed the title from "potential memory leak in READER node.attributes" to "too large memory footprint in READER node.attributes when millions of child nodes are present" May 8, 2015
@ccutrer
Contributor

ccutrer commented Aug 3, 2015

This is because Reader#attr_nodes (and Reader#namespaces) use xmlTextReaderExpand (which reads all child nodes), and then just pull the properties (attributes) off the root node to return the array. It really should be using xmlTextReaderMoveToNextAttribute and constructing its own XML::Attr object.

@flavorjones flavorjones added the topic/memory Segfaults, memory leaks, valgrind testing, etc. label Feb 2, 2020
@flavorjones
Member

The action item here is to explore re-implementing Reader#attribute_hash to use xmlTextReaderMoveToNextAttribute to assemble the attribute hash (and also do this for Reader#namespaces).

This would be a good time to also deal with #3102

@flavorjones flavorjones added this to the v1.18.0 milestone Jul 2, 2024