
Optimize SentPackets::take_ranges #2242

Open
mxinden opened this issue Nov 21, 2024 · 2 comments · May be fixed by #2245

mxinden (Collaborator) commented Nov 21, 2024

When CPU profiling a test_fixtures::Simulator run transferring 10 MiB from a server to a client, I see the following flamegraph:

(flamegraph image)

The majority of CPU time is spent in SentPackets::take_ranges:

/// Take values from the specified ranges of packet numbers.
/// The values returned will be reversed, so that the most recent packet appears first.
/// This is because ACK frames arrive with ranges starting from the largest acknowledged
/// and we want to match that.
pub fn take_ranges<R>(&mut self, acked_ranges: R) -> Vec<SentPacket>
where
    R: IntoIterator<Item = RangeInclusive<PacketNumber>>,
    R::IntoIter: ExactSizeIterator,
{
    let mut result = Vec::new();
    // Remove all packets. We will add back the ones we don't take.
    let mut packets = std::mem::take(&mut self.packets);
    for range in acked_ranges {
        // For each acked range, split off the acknowledged part,
        // then split off the part that hasn't been acknowledged.
        // This order works better when processing ranges that
        // have already been processed, which is common.
        let mut acked = packets.split_off(range.start());
        let keep = acked.split_off(&(*range.end() + 1));
        self.packets.extend(keep);
        result.extend(acked.into_values().rev());
    }
    self.packets.extend(packets);
    result
}
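For context, BTreeMap::split_off(&k) moves every entry with a key at or above k into a new map, leaving the rest behind; the loop above relies on two such splits to carve out an inclusive range. A minimal standalone sketch (hypothetical packet numbers, plain BTreeMap instead of SentPackets):

```rust
use std::collections::BTreeMap;

// Split the inclusive range [start, end] out of `packets`, mimicking the two
// split_off calls in take_ranges. Returns (acked, keep).
fn split_range(
    packets: &mut BTreeMap<u64, &'static str>,
    start: u64,
    end: u64,
) -> (BTreeMap<u64, &'static str>, BTreeMap<u64, &'static str>) {
    let mut acked = packets.split_off(&start); // keys >= start
    let keep = acked.split_off(&(end + 1)); // keys > end
    (acked, keep)
}

fn main() {
    // Hypothetical packet numbers 0..=4 standing in for in-flight packets.
    let mut packets: BTreeMap<u64, &'static str> = (0..5).map(|n| (n, "pkt")).collect();
    let (acked, keep) = split_range(&mut packets, 1, 2);
    assert_eq!(packets.keys().copied().collect::<Vec<_>>(), vec![0]);
    assert_eq!(acked.keys().copied().collect::<Vec<_>>(), vec![1, 2]);
    assert_eq!(keep.keys().copied().collect::<Vec<_>>(), vec![3, 4]);
}
```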

More specifically, in the two calls to BTreeMap::extend:

let mut packets = std::mem::take(&mut self.packets);
for range in acked_ranges {
    // For each acked range, split off the acknowledged part,
    // then split off the part that hasn't been acknowledged.
    // This order works better when processing ranges that
    // have already been processed, which is common.
    let mut acked = packets.split_off(range.start());
    let keep = acked.split_off(&(*range.end() + 1));
    self.packets.extend(keep);
    result.extend(acked.into_values().rev());
}
self.packets.extend(packets);

Adding some logging, the following seems to be the case:

  • Assume we have two nodes, A and B, where A sends 10 MiB to B.
  • A has 100 packets in flight.
  • A receives an ACK from B, acknowledging the first 2 packets.
  • At the end of the first loop iteration:
    • packets.len() will be 0, since the ACK range starts at the lowest in-flight packet number, so split_off at range.start() moves every packet out.
    • acked.len() will be 2, containing the two newly acked SentPackets.
    • keep.len() will be 98, containing all remaining packets.
  • We then execute self.packets.extend(keep), i.e. in the above scenario, re-adding all 98 remaining packets to the now-empty self.packets.
  • In other words, we re-insert all remaining packets on each ACK.
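The walkthrough above can be reproduced with plain BTreeMaps (hypothetical numbers: 100 packets in flight, the first 2 acked):

```rust
use std::collections::BTreeMap;

fn main() {
    // 100 packets in flight, packet numbers 0..=99; the ACK covers 0..=1.
    let mut packets: BTreeMap<u64, ()> = (0..100).map(|n| (n, ())).collect();

    let mut acked = packets.split_off(&0); // range.start(): takes all 100
    let keep = acked.split_off(&2); // range.end() + 1: 98 unacked entries

    assert!(packets.is_empty()); // nothing left below the range...
    assert_eq!(acked.len(), 2);
    assert_eq!(keep.len(), 98);

    // ...so self.packets.extend(keep) re-inserts all 98 entries one by one,
    // which is the BTreeMap::extend cost visible in the flamegraph.
    let mut restored: BTreeMap<u64, ()> = BTreeMap::new();
    restored.extend(keep);
    assert_eq!(restored.len(), 98);
}
```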

Unfortunately, BTreeMap does not allow splitting out a full range in a single operation. The closest to that is BTreeMap::extract_if, which is currently Nightly-only.

That said, I think the following change, optimizing for the scenario above, would get us a long way already:

modified   neqo-transport/src/recovery/sent.rs
@@ -197,18 +197,17 @@ impl SentPackets {
     {
         let mut result = Vec::new();
         // Remove all packets. We will add them back as we don't need them.
-        let mut packets = std::mem::take(&mut self.packets);
         for range in acked_ranges {
-            // For each acked range, split off the acknowledged part,
-            // then split off the part that hasn't been acknowledged.
-            // This order works better when processing ranges that
-            // have already been processed, which is common.
-            let mut acked = packets.split_off(range.start());
-            let keep = acked.split_off(&(*range.end() + 1));
-            self.packets.extend(keep);
+            let mut packets = std::mem::take(&mut self.packets);
+
+            let mut keep = packets.split_off(&(*range.end() + 1));
+            let acked = packets.split_off(range.start());
+
+            keep.extend(packets);
+            self.packets = keep;
+
             result.extend(acked.into_values().rev());
         }
-        self.packets.extend(packets);
         result
     }
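A standalone sketch of the reordered splits, using the same hypothetical numbers as above (note that the inclusive range requires splitting at end + 1):

```rust
use std::collections::BTreeMap;

fn main() {
    // 100 packets in flight, packet numbers 0..=99; the ACK covers 0..=1.
    let mut packets: BTreeMap<u64, ()> = (0..100).map(|n| (n, ())).collect();

    // Reordered: carve off everything above the acked range first,
    // then the acked part itself.
    let mut keep = packets.split_off(&(1 + 1)); // keys > range end
    let acked = packets.split_off(&0); // the 2 acked packets

    // Only entries below the range remain in `packets`; in the common
    // "oldest packets acked" case there are none, so this extend is a
    // no-op instead of re-inserting 98 entries.
    assert!(packets.is_empty());
    keep.extend(packets);

    assert_eq!(acked.len(), 2);
    assert_eq!(keep.len(), 98);
}
```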

CPU profiling once more, this resolves the hot-spot on SentPackets::take_ranges:

(flamegraph image)

Let me know if I am missing something, e.g. whether the above scenario is not worth optimizing for.

See also past discussion: https://github.com/mozilla/neqo/pull/1886/files#r1591830138

larseggert (Collaborator):

I think this makes sense. We might want to add a bench for it first, to get a performance baseline for a number of different ACK patterns, before doing a fix PR?

martinthomson (Member):

Please include a pretty solid comment in the code when you do this. We'll need to remember why this is apparently backwards.
