Add Option to Turn Off _fill_perfdata #168

michaelmckinsey1 · 2024-06-10T19:42:39Z

_fill_perfdata adds empty rows for profiles that do not contain data for a node, such that each node has a row entry for every profile. This can be obnoxious in some cases, as in the performance algorithms dataset, 2,531,150 rows in the performance data table with _fill_perfdata and 163,778 rows with _fill_perfdata off.

This is a small optimization for the reader, as _fill_perdata is only called once, most time is spent in _unify. However, when working with a large amount of files with missing rows, a Thicket with _fill_perfdata off will be easier and faster to work with.

michaelmckinsey1 · 2024-06-11T22:02:09Z

Need to add unit test.

ilumsden

I have a few comments/changes, but this mostly looks good.

The one thing I'd mention is that this will have a notable impact in performance in some cases. As you said in the initial PR comment, this functionality can substantially increase the number of rows in the performance dataframe. Given how slow pandas is and given that pandas uses a minimum of 2x the memory that a single DataFrame should take up, adding a substantial number of nodes will slow down all of Thicket.

Given that, my one remaining comment (which we'll need to discuss) is that it might be better to keep this off by default. Right now in this PR, users have to opt out of this feature, which could destroy Thicket's performance if used in the wrong situation. I'd argue that it's better for users to opt into this feature in situations where they know the performance won't be destroyed. Addressed in the discussion below. This is the default behavior of Thicket right now, so this PR shouldn't change that. However, IMO, this will cause issues in the future.

thicket/ensemble.py

thicket/thicket.py

ilumsden · 2024-06-11T22:13:16Z

thicket/thicket.py

+        unique_perf_indices = set(new_thicket.dataframe.index.droplevel("node"))
+        full_idx = all([idx in unique_perf_indices for idx in new_thicket.profile])
+        if not full_idx:
+            new_thicket = new_thicket.squash()


I'm not sure what this is trying to do. From what I can tell, this is trying to see if the profile list is a subset of the unique profile IDs in the performance dataframe index. If that's true, then the dataframe can store profile IDs that aren't in the Thicket.profile field.

This is detecting if fill_perfdata is off. To make it more clear I thought of a couple options:

I could make it a helper function with a docstring to make it more clear.

Have a boolean thicket attribute for fill_perfdata that can be checked here.

The purpose of this is that there is a case when the index is not full, and a filter is performed on the metadata that nodes disappear from the performance data, if the metadata for those profiles was filtered out. The code above will update the graph using squash if this happens.

So, if I'm understanding right, what you're trying to do is 1) check if any nodes were removed from the performance data as a side effect of filtering profiles from the metadata and 2) fix the graph with squash if any nodes were removed?

If what I wrote above is correct, then this code will not do what you're wanting.

This code (assuming the set(Series) thing works) only checks if Thicket.profiles is a subset of the profile IDs in the performance dataframe index. That tells you nothing about the nodes.

It tells you if there are profiles in the performance dataframe that do not exist in tk.profile. For example if I simulate removing a profile from the dataframe, it will be false which will trigger a squash:

The actual issue I noticed with this is that they will both always be in sync due to _sync_profile_components. So what I should check here instead is if len(tk.graph) != len(tk.dataframe.index.get_level_values("node").unique()) which is more explicit anyway. These sort of things will get caught when I add unit tests.

Your code is not checking if there are profiles in the dataframe that don't exist in tk.profile. It's checking if there are profiles in tk.profile that don't exist in the dataframe. It's a subtle difference, but it would allow e.g., the dataframe to contain extra profiles that don't exist in tk.profiles.

I actually just encountered this same type of thing in Befikir's query_stats PR (#157), and I realized there's a better way to do this check:

sorted(tk.profile) == sorted(unique_perf_indices.tolist())

This will check that both the dataframe and tk.profile agree in terms of contents and length. Node objects are comparable via their _hatchet_nid fields. As a result, the Node objects within different parts of the same Thicket object can be compared in this way. This comparison will not work across Thicket objects though because there's no guarantee that equivalent Node objects will have the same _hatchet_nid values.

It's checking if there are profiles in tk.profile that don't exist in the dataframe.

But that is what I was trying to accomplish with that code. That would indicate that there are missing profiles in the performance data, which is what I was saying. However, it doesn't address

it would allow e.g., the dataframe to contain extra profiles that don't exist in tk.profiles.

which your correct would be an issue, if not for _sync_profile_components being called earlier. Regardless, what I was actually trying to do is check nodes so I'm using a different check now.

For reference, if we are trying to update the state of the profiles in a Thicket we should be using _sync_profile_components. And if we are trying to validate properties about the state of a Thicket we should use validate_profile or validate_dataframe in utils.py

If you're 100% positive that you've ensured the graph and dataframe contain the same node objects, then this is fine.

ilumsden

Accidentally forget to set "Request changes" on my main review. Refer to the main review for comments.

michaelmckinsey1 · 2024-06-11T22:23:43Z

I have a few comments/changes, but this mostly looks good.

The one thing I'd mention is that this will have a notable impact in performance in some cases. As you said in the initial PR comment, this functionality can substantially increase the number of rows in the performance dataframe. Given how slow pandas is and given that pandas uses a minimum of 2x the memory that a single DataFrame should take up, adding a substantial number of nodes will slow down all of Thicket.

Given that, my one remaining comment (which we'll need to discuss) is that it might be better to keep this off by default. Right now in this PR, users have to opt out of this feature, which could destroy Thicket's performance if used in the wrong situation. I'd argue that it's better for users to opt into this feature in situations where they know the performance won't be destroyed.

It's on by default because the default behavior in Thicket has been to fill the index in the performance table. Actually, the default behavior originally was to not fill the index, then there was a design decision made at some point to fill it all the time. My goal with this PR is to at least give the user the option, based on their data.

ilumsden · 2024-06-11T22:27:13Z

Wait, really? In that case, what you have in this PR is fine, but I will second guess that decision. I suspect we haven't run into any major performance issues because we've just used effectively N copies of the same program. But, once we run into less homogeneous data (e.g., MPMD-style data like what my benchmark will create), we will likely start running into major performance issues with this decision.

ilumsden

Looks good, so long as we are positive that the Node objects are the same across thicket components.

michaelmckinsey1 · 2024-06-18T00:17:11Z

Looks good, so long as we are positive that the Node objects are the same across thicket components.

They should be unless there is a bug. utils.validate_nodes is a function that checks this particular property. If we wanted to be 100% sure, I think we would have to check properties we want to enforce about a Thicket before and after operations like filter_metadata, which could be expensive. I am open to this idea though, as it would help catch bugs.

Edit: I just realized we had a basically identical discussion in #173 (comment) so I am referencing it here. I agree that this is a broader issue than both of these PRs.

michaelmckinsey1 added 2 commits June 10, 2024 14:40

Add optional argument to turn off _fill_perfdata

e69ada1

Update docstring

89a8a15

michaelmckinsey1 requested review from ilumsden and pearce8 June 10, 2024 19:50

michaelmckinsey1 self-assigned this Jun 10, 2024

Add squash to filter_metadata if fill perfdata is off

3e87b2e

ilumsden reviewed Jun 11, 2024

View reviewed changes

ilumsden requested changes Jun 11, 2024

View reviewed changes

Fix check

af09ae9

michaelmckinsey1 mentioned this pull request Jun 11, 2024

Check for non-unique multi-index #172

Open

black

e66ba50

michaelmckinsey1 requested a review from ilumsden June 17, 2024 17:04

ilumsden approved these changes Jun 17, 2024

View reviewed changes

pearce8 approved these changes Jun 20, 2024

View reviewed changes

pearce8 merged commit 73de0ce into LLNL:develop Jun 20, 2024
4 checks passed

michaelmckinsey1 mentioned this pull request Jun 20, 2024

Add New Positional Argument to Unit Test #179

Merged

slabasan added this to the 2024.2.0 milestone Sep 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Option to Turn Off _fill_perfdata #168

Add Option to Turn Off _fill_perfdata #168

michaelmckinsey1 commented Jun 10, 2024 •

edited

Loading

michaelmckinsey1 commented Jun 11, 2024

ilumsden left a comment •

edited

Loading

ilumsden Jun 11, 2024

michaelmckinsey1 Jun 11, 2024

ilumsden Jun 11, 2024

ilumsden Jun 11, 2024

ilumsden Jun 11, 2024

michaelmckinsey1 Jun 11, 2024 •

edited

Loading

ilumsden Jun 11, 2024

michaelmckinsey1 Jun 12, 2024

michaelmckinsey1 Jun 12, 2024

ilumsden Jun 17, 2024

ilumsden left a comment

michaelmckinsey1 commented Jun 11, 2024

ilumsden commented Jun 11, 2024

ilumsden left a comment

michaelmckinsey1 commented Jun 18, 2024 •

edited

Loading

Add Option to Turn Off _fill_perfdata #168

Add Option to Turn Off _fill_perfdata #168

Conversation

michaelmckinsey1 commented Jun 10, 2024 • edited Loading

michaelmckinsey1 commented Jun 11, 2024

ilumsden left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelmckinsey1 Jun 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ilumsden left a comment

Choose a reason for hiding this comment

michaelmckinsey1 commented Jun 11, 2024

ilumsden commented Jun 11, 2024

ilumsden left a comment

Choose a reason for hiding this comment

michaelmckinsey1 commented Jun 18, 2024 • edited Loading

michaelmckinsey1 commented Jun 10, 2024 •

edited

Loading

ilumsden left a comment •

edited

Loading

michaelmckinsey1 Jun 11, 2024 •

edited

Loading

michaelmckinsey1 commented Jun 18, 2024 •

edited

Loading