Convert PyROOT TH1/TH2 to numpy arrays #392

FlorianBury · 2021-07-09T09:33:38Z

Hi all,

As uproot is completely decoupled from ROOT (and therefore PyROOT) I acknowledge I am probably slightly off-topic but I could use some guidance.

I am working with PyROOT TH1s and TH2s and at some point need to convert back and forth to numpy arrays. Until now I was using a plain Python loop with the usual GetBinContent, which is relatively slow but was fine so far. Unfortunately I have now reached the point where I have to read a few hundred TH2s that can have many bins, and my computation time has skyrocketed.

In the past I was using root_numpy.hist2array and was happy with it, but I could not use it in the current project because I also needed the bin errors, which were not supported until recently. If that is my only option then I would go for it, but as root_numpy is currently deprecated I am looking for something more stable.

Uproot looks very interesting and I would gladly use it, but from what I understand there is no way to use a PyROOT object. I tried to dig a bit into uproot-method but I could not find a workaround.

The other proposed alternatives do not seem implementable either: converting from TTrees does not fit my purpose, and the proposal to use RDataFrame as here completely puzzles me.

Could anyone provide some guidance as to which option seems most suitable, or whether I should stick with the deprecated root_numpy? It would be very much appreciated.

Best,
Florian

jpivarski · 2021-07-09T11:25:08Z

Maintainer

If you have histograms in a ROOT file, load one in Uproot and call .to_numpy(). (There are other methods like values, errors, and axis(0).edges for more control; see the documentation.)
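
For a histogram stored in a file, this could look like the sketch below. The function name read_hist_arrays and its path and name arguments are made up for the example; to_numpy and errors are the Uproot histogram methods mentioned above.

```python
def read_hist_arrays(path, name):
    """Return (values, errors, edges) for the histogram `name` in file `path`."""
    import uproot  # imported here so the sketch only needs Uproot when called

    with uproot.open(path) as file:
        hist = file[name]
        # to_numpy() gives (values, xedges) for a TH1 and
        # (values, xedges, yedges) for a TH2, so collect the edges in a list.
        values, *edges = hist.to_numpy()
        errors = hist.errors()  # bin errors, same shape as values
    return values, errors, edges
```

It would be called as read_hist_arrays("filename.root", "hist"), where both strings are placeholders.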

If your objects are in-memory PyROOT objects, then to use this method, you'd have to save them to a file. Uproot only recognizes serialized ROOT objects. We had talked about making a PyROOT-to-Uproot bridge by temporarily serializing them in in-memory files, but that seems circuitous, and we haven't tried it yet.

Another option is to write a C++ function with gInterpreter.Declare that calls GetBinContent in a compiled loop, filling an array. That would be a fast ROOT-only option.
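
A minimal sketch of that compiled-loop idea follows. The names fill_arrays_from_hist, flat_shape, and hist_to_numpy are assumptions made for this example, and it only handles TH1 and TH2. It relies on ROOT's global bin numbering, in which x varies fastest and each axis has two extra cells for underflow and overflow.

```python
# C++ helper compiled by ROOT's interpreter. The name fill_arrays_from_hist
# is an assumption made for this sketch, not an existing ROOT function.
CPP_HELPER = r"""
void fill_arrays_from_hist(const TH1& hist, double* contents, double* errors) {
    const int ncells = hist.GetNcells();  // every cell, under/overflow included
    for (int i = 0; i < ncells; ++i) {
        contents[i] = hist.GetBinContent(i);
        errors[i] = hist.GetBinError(i);
    }
}
"""

_declared = False  # the helper must be declared only once per session


def flat_shape(nbins_x, nbins_y=None):
    """Shape that the flat cell array reshapes to (x varies fastest)."""
    if nbins_y is None:
        return (nbins_x + 2,)
    return (nbins_y + 2, nbins_x + 2)


def hist_to_numpy(hist):
    """Return (contents, errors) of a TH1 or TH2 without under/overflow."""
    global _declared
    import numpy as np
    import ROOT

    if not _declared:
        ROOT.gInterpreter.Declare(CPP_HELPER)
        _declared = True

    ncells = hist.GetNcells()
    contents = np.empty(ncells, dtype=np.float64)
    errors = np.empty(ncells, dtype=np.float64)
    ROOT.fill_arrays_from_hist(hist, contents, errors)

    if hist.GetDimension() == 1:
        shape = flat_shape(hist.GetNbinsX())
        core = (slice(1, -1),)
    elif hist.GetDimension() == 2:
        shape = flat_shape(hist.GetNbinsX(), hist.GetNbinsY())
        core = (slice(1, -1), slice(1, -1))
    else:
        raise NotImplementedError("only TH1 and TH2 are handled in this sketch")

    # Reshaping gives [y, x] for a TH2; transpose so the result is indexed [x, y],
    # then drop the underflow and overflow cells on every axis.
    return contents.reshape(shape).T[core], errors.reshape(shape).T[core]
```

Only the loop over cells runs in C++; the reshaping and slicing stay in NumPy, which keeps the compiled part independent of the histogram's dimension.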

RDataFrame is only for non-binned data, like TTrees, so it didn't fit your case at all.

1 reply

FlorianBury Jul 9, 2021
Author

Forgot to mention that my objects are in-memory; saving them to ROOT files only to reload them with uproot would require too much shuffling around.

Being able to go back and forth between PyROOT and Uproot would make my life much easier but I understand it was probably not meant to be that way from the start, and it might not be such a popular feature anyway.

I could try the gInterpreter way though, thanks for the tip !

jpivarski · 2021-07-09T14:14:05Z

Maintainer

I've been nerd-sniped: I had to find out if the in-memory file method would work. This should probably be turned into an easier-to-use, high-level function in Uproot, but there could be sharp edges we haven't identified yet—it would be easier to find them by doing this manually than if it's wrapped up and hidden in a function.

Suppose that we have a PyROOT object like this:

import ROOT
import numpy as np
import uproot

h = ROOT.TH1F("h", "", 10, -5, 5)

We can serialize the object into an in-memory file (no disk involved), like this:

ROOT.gInterpreter.Declare('''
void copy_buffer_for_uproot(char* destination, TMessage& message) {
    memcpy(destination, message.Buffer(), message.Length());
}
''')

message = ROOT.TMessage(ROOT.kMESS_OBJECT)
message.WriteObject(h)

buffer = np.empty(message.Length(), np.uint8)
ROOT.copy_buffer_for_uproot(memoryview(buffer), message)

A ROOT TMessage is an in-memory buffer, as though it were a file, but it can be for a single object, not on disk, and not compressed, which will make it easier to read back in Uproot. The above implementation lets the TMessage manage its own buffer and copies it into a NumPy array, buffer, at the end. Alternatively, we could have provided the NumPy array as TMessage's own buffer with

message = ROOT.TMessage(ROOT.kMESS_OBJECT)
buffer = np.empty(1000, np.uint8)
message.SetBuffer(buffer, len(buffer), False)

message.WriteObject(h)

and then look at buffer[:message.Length()] immediately after message.WriteObject(h). That method would avoid copying the serialized buffer, but then we'd have to guess an appropriate size for the buffer; if 1000 weren't large enough, I don't know what would happen (error message? segfault?), so the technique of copying is certainly safer.

You might also be wondering why we had to define a copy_buffer_for_uproot C++ function to do that copy. It's because PyROOT interprets functions that return char*, such as TMessage::Buffer, as Python strings. We really need it to be a pointer (void* would have been a better type than char*), and the only way I could find to do that is to write it in C++.

Now that we have the raw bytes of a TH1F object in a NumPy buffer, we can read that back in Uproot with

class FakeFile(object):
    def class_named(self, classname, version=None):
        return uproot.class_named(classname, version=version)

fakefile = FakeFile()

chunk = uproot.source.chunk.Chunk.wrap(None, buffer)
cursor = uproot.source.cursor.Cursor(8)

h2 = uproot.deserialization.read_object_any(chunk, cursor, {}, fakefile, fakefile, None)

The h2 is an Uproot histogram with the same content as h, and it's a detached copy—the original h can be deleted, changed, etc. without affecting h2. The read_object_any function reads any type of standalone ROOT object (anything that could be written into a TDirectory), not just histograms, though if you tried this with a TTree, you'd only get the TTree metadata, not any of the event data, so it should be used on self-contained things like histograms. The TMessage itself has an 8-byte header, so the cursor steps past that to start on the actual data.
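
To make the 8-byte offset concrete, here is a small sketch that splits a raw buffer into header and payload. It assumes the header is two 4-byte big-endian integers, a slot reserved for the message length followed by the message type; that layout is an assumption here, and split_tmessage is a hypothetical helper. The buffer in the example is synthetic, not real ROOT output.

```python
import struct

TMESSAGE_HEADER_SIZE = 8  # the offset passed to Cursor in the snippet above


def split_tmessage(buffer):
    """Split raw TMessage bytes into (reserved, message_type, payload)."""
    raw = bytes(buffer)
    if len(raw) < TMESSAGE_HEADER_SIZE:
        raise ValueError("buffer is shorter than the TMessage header")
    # Assumed layout: two unsigned 32-bit integers in big-endian byte order.
    reserved, message_type = struct.unpack(">II", raw[:TMESSAGE_HEADER_SIZE])
    return reserved, message_type, raw[TMESSAGE_HEADER_SIZE:]


# Synthetic buffer for illustration only: a zeroed length slot, an arbitrary
# type code, and a stand-in for the serialized object.
fake = struct.pack(">II", 0, 7) + b"payload"
reserved, message_type, payload = split_tmessage(fake)
```

Starting the Cursor at 8 has the same effect as reading from payload here: deserialization begins at the first byte of the object itself.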

Why do we need a FakeFile? The deserialization function has to look up TStreamerInfo to know how to read a serialized class, for a specified version of that class. It first queries the file the object came from because ROOT files should ship with TStreamerInfos for all the classes they contain (that "should" is a complicated story). But in this case, the object does not come from a file, and the TMessage doesn't contain any TStreamerInfos, as messages are supposed to be small and lightweight. So we have to defer to any globally defined class models in uproot.classes, which either came from files that have already been opened in Uproot or were hard-coded and shipped with Uproot.

If the class model isn't there, the above will fail with a DeserializationError—that's another rough edge, and a properly wrapped up function should check to be sure we have a class model for the exact version that this ROOT will write into the TMessage—it matters which ROOT is imported in import ROOT. The above example worked because this ROOT (6.24/02) is writing TH1F version 3 into the TMessage:

>>> ROOT.TClass.GetClass("TH1F").GetClassVersion()
3

and Uproot recognizes version 3:

>>> uproot.class_named("TH1F").known_versions
{3: <class 'uproot.models.TH.Model_TH1F_v3'>}

That sort of thing could be automatically checked, though the error message would have a lot of explaining to do if ROOT writes version 2 and Uproot reads version 3!
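
Such a check might be sketched as follows, using only the two calls shown above (GetClassVersion and known_versions). The function names missing_class_versions and check_pyroot_classes are made up for the example, and it only looks at the classes it is given, not at their dependencies.

```python
def missing_class_versions(written, known):
    """Return {classname: version} for versions ROOT writes but Uproot lacks.

    written maps class names to the version ROOT would serialize;
    known maps class names to the collection of versions Uproot can read.
    """
    return {
        name: version
        for name, version in written.items()
        if version not in known.get(name, ())
    }


def check_pyroot_classes(classnames):
    """Compare this ROOT's class versions with Uproot's known models."""
    import ROOT
    import uproot

    written = {
        name: ROOT.TClass.GetClass(name).GetClassVersion() for name in classnames
    }
    # known_versions is a dict keyed by version number, so set() keeps the keys.
    known = {name: set(uproot.class_named(name).known_versions) for name in classnames}
    return missing_class_versions(written, known)
```

An empty result from check_pyroot_classes(["TH1F"]) would mean the TMessage route is expected to work for that class; a non-empty one names the version that needs a class model.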

3 replies

FlorianBury Jul 9, 2021
Author

I certainly did not expect you would go down the rabbit hole, thanks !

In the meantime I have implemented the C++ equivalent with the gInterpreter. Do you think I could apply your snippets above, or would it be better to wait for a safer implementation within uproot?
Note that I am currently running ROOT 6.12 (I can upgrade my version but being on an HPC it might not be instantaneous), hence

>>> ROOT.TClass.GetClass("TH1F").GetClassVersion()
2

Also, I am currently doing PyROOT -> numpy as well as numpy -> PyROOT, and I don't think the latter could use your IO method. Disclaimer: this is not nerd-sniping! There must be more fun things to implement in uproot than this; I am just wondering if there is a relatively easy way.

jpivarski Jul 9, 2021
Maintainer

We should probably have this PyROOT → Uproot functionality anyway. I know that we've talked about it somewhere—someone else asked for it—but I can't find that discussion anywhere to link it in here.

The method described above is more general than histograms—it should work for any subclass of TObject other than TTree (for the caveat above). However, the class models need to be available because a TMessage doesn't have TStreamerInfos, and your case (ROOT 6.12, whose TH1F version is 2) is an example that wouldn't work. Since it turned out to be pretty easy to find an example of a ROOT distribution with a version that wouldn't work, perhaps this should be addressed in the high-level function. Perhaps the same technique could be used to serialize the ROOT version's TStreamerInfos into TMessages and then deserialize them with Uproot before translating the actual object, but then we'd have to also search for all dependencies of a given class. (For instance, TH1F depends on TH1, TNamed, TObject, TString, TAttLine, TAttFill, TAttMarker, TAxis, TAttAxis, THashList, TList, TSeqCollection, and TCollection; we'd have to keep passing streamers from PyROOT to Uproot until all of these dependencies have been met—not impossible, just more code to write!)

If you have a gInterpreter-defined function that gets values out of a TH1/TH2 into a NumPy array using GetBinContent in the compiled loop, that will work for any version of ROOT, though only for histograms, naturally. Since that's your specific problem, then it sounds like you have a solution and you should go for it.

FlorianBury Jul 9, 2021
Author

Then I am happy to have triggered the thinking on your side !

Thanks again !

jpivarski · 2021-08-24T22:38:33Z

Maintainer

FYI: this is starting to be implemented in PR #420.

2 replies

jpivarski Aug 25, 2021
Maintainer

https://github.com/scikit-hep/uproot4/blob/ccdec6918620507155cbede024ca92ca251a4f82/tests/test_0420-pyroot-uproot-interoperability.py#L14-L28

jpivarski Aug 26, 2021
Maintainer

PR #420 is set for auto-merging, and that will likely happen in a half-hour. Now it's bidirectional: any PyROOT object can be converted into an Uproot Model using

uproot.from_pyroot(pyroot_object)

and serializable Uproot Models (which consist of only TObjString and histograms, at the moment) can be converted into PyROOT objects using

uproot_object.to_pyroot()

Serializability is key because both directions go through TMessages. All ROOT objects in PyROOT can be serialized into a TMessage, but only a few Uproot Models have been implemented for writing to files, which is what we reuse to make the TMessage.

In principle, all Uproot Models that come from files could be made automatically serializable by keeping a copy of the bytes they were deserialized from, but that would make all objects bigger and I don't know what I think of that.

This also ties into writing files because the above makes all PyROOT objects writable by Uproot:

with uproot.recreate("filename.root") as file:
    file["hist"] = pyroot_histogram
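
For the original question (bin contents and errors of an in-memory PyROOT histogram as NumPy arrays), the new function could be used roughly like this. The wrapper name pyroot_hist_to_numpy is an assumption for the example; values, errors, and axis(i).edges are the Uproot histogram methods mentioned earlier in the thread.

```python
def pyroot_hist_to_numpy(pyroot_hist):
    """Return (values, errors, edges) for an in-memory PyROOT TH1 or TH2."""
    import uproot  # imported here so the sketch only needs Uproot when called

    # Detached Uproot copy of the PyROOT object, made through a TMessage.
    model = uproot.from_pyroot(pyroot_hist)
    values = model.values()  # bin contents without under/overflow
    errors = model.errors()  # bin errors, same shape as values
    edges = [model.axis(i).edges() for i in range(values.ndim)]
    return values, errors, edges
```

Because the Uproot Model is a copy, the returned arrays do not change if the PyROOT histogram is filled again afterwards.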
