
Implement a pull parser #10

Closed · hadley opened this issue Feb 20, 2015 · 7 comments

Comments

@hadley
Member

hadley commented Feb 20, 2015

For large XML files, or when you just want to extract a small amount of data without loading the entire file into memory. See e.g. http://lxml.de/parsing.html#incremental-event-parsing

If you need this, please 👍 this issue with a brief use case

@cboettig

👍 This would be awesome (if I've understood it correctly). I often work with large dumps of XML files, like this one: https://github.com/TreeBASE/supertreebase/tree/master/data/treebase

The standard use case is that I want to filter the collection down to just the files that have certain attributes, e.g. identifying the documents that include my species of interest as one of the taxa (say, ones matching the <otu label="species name"> value).

Loading all the documents into memory first is often prohibitively expensive, forcing the user to parse, search, and then remove each object one at a time, and at least with the XML package this can all be very slow.
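
For concreteness, here is a minimal sketch of that filter using the XML package's SAX interface, xmlEventParse(); the directory path and species name are placeholders:

```r
library(XML)  # xmlEventParse() is a SAX-style event parser

files  <- list.files("data/treebase", pattern = "\\.xml$", full.names = TRUE)
target <- "Homo sapiens"  # placeholder species of interest

has_species <- function(path) {
  found <- FALSE
  # startElement fires once per opening tag; no tree is ever built,
  # so memory use stays flat regardless of file size
  on_start <- function(name, attrs) {
    if (name == "otu" && isTRUE(attrs["label"] == target)) found <<- TRUE
  }
  xmlEventParse(path, handlers = list(startElement = on_start))
  found
}

matches <- files[vapply(files, has_species, logical(1))]
```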

@hadley hadley mentioned this issue May 4, 2016
@jimhester
Member

The libxml2 SAX interface is documented at http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html

The relevant Cython code from lxml is https://github.com/lxml/lxml/blob/572e10843774a5d6300125d89bdc423d53c92971/src/lxml/saxparser.pxi

Implementing this is clearly non-trivial, and it has entirely different semantics from what we are currently using (SAX callbacks vs. a tree/DOM).

This feature is a long way away and may be better implemented in a separate package entirely.
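
To make the semantic gap concrete, here is the same toy extraction in both styles (file name and element invented). The DOM version materializes the whole document before querying it; the SAX version only ever holds the parser's current position plus whatever state the callbacks accumulate:

```r
# Tree / DOM style (what xml2 does today): parse everything, then query
library(xml2)
doc    <- read_xml("big.xml")
labels <- xml_attr(xml_find_all(doc, "//otu"), "label")

# SAX style (what this issue asks for): the parser pushes events at you
library(XML)
labels <- character()
xmlEventParse("big.xml", handlers = list(
  startElement = function(name, attrs) {
    if (name == "otu") labels <<- c(labels, unname(attrs["label"]))
  }
))
```

Supporting the second style means inverting xml2's control flow, which is why it may fit better in a separate package.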

@hadley
Member Author

hadley commented Dec 22, 2016

Let's close for now.

@hadley hadley closed this as completed Dec 22, 2016
@randomgambit

hey mister @jimhester, any news on this? Ultra-large XML files are becoming the norm today... SAX looks like a necessity, even though I don't like saxophones (haha)

Thanks!

@jimhester
Member

This feature is a long way away and may be better implemented in a separate package entirely.

@randomgambit

dude, it's been more than a year!!! :) I'm kidding. Thanks anyway for the heads-up

@wkumler
Contributor

wkumler commented Sep 22, 2024

Upvoting this issue! I know it's been a long while, but a new instrument we've got in the lab produces 1-10 GB XML files that need to be parsed. While I can currently use read_xml without any major problems, trying to find specific nodes with xml_find_all or xml_find_first rapidly consumes all available memory and errors out. I'm currently switching to the XML library for this, but I've found the syntax is a lot messier there.
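
For anyone else landing here: the XML package's xmlEventParse() also takes a branches argument, which streams the file but hands you a fully built DOM subtree for just the elements you name, keeping memory bounded even on multi-GB inputs. A rough sketch, with the file, element, and attribute names invented:

```r
library(XML)

peaks <- numeric()

# Each <spectrum> arrives as a complete node; everything else in the
# multi-gigabyte file streams past without being retained in memory.
on_spectrum <- function(node) {
  peaks <<- c(peaks, as.numeric(xmlGetAttr(node, "basePeakMz")))
}

xmlEventParse("instrument_run.xml",
              branches = list(spectrum = on_spectrum),
              handlers = list())
```

Messier than xml2 syntax, as noted, but it gets the job done on files that can't be loaded whole.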
