Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refac(perf): refactor parallelize polynomialize() #302

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

PatStiles
Copy link
Contributor

Took a shot at making polynomialize() less serial as described in #293. After inlining and combining compute_lookup_outputs() and subtable_lookup_indices() I did not observe any improvements in performance.

@moodlezoup moodlezoup requested a review from sragss April 17, 2024 14:35
Copy link
Collaborator

@sragss sragss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do see a 20% improvement on the big machine. I've added some ideas in the review. Probably worth creating a little iai-callgrind benchmark.

One other suggestion would be to add more granular tracing. Trace instruction_flag_bitvectors construction, instruction_flag_polys construction, and even within the big loop. NUM_MEMORIES should only be around 80 so we should be able to sort what's going on in the trace.

Looks to me like 50% of e2e time is spend on DensePolynomial::from which isn't going to get much faster so we can consider that a ceiling.

},
)
.reduce(
|| (Vec::new(), Vec::new(), Vec::new()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec::with_capacity(preprocessing.num_memories)


let mut final_cts_i = vec![0usize; M];
let mut read_cts_i = vec![0usize; m];
let mut subtable_lookups = vec![F::zero(); m];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try unsafe_allocate_zero_vec::<F>::(m) here.

let mut read_cts_i = vec![0usize; m];
let mut subtable_lookups = vec![F::zero(); m];

for (j, op) in ops.iter().enumerate() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be faster if instead of the current pattern:

for memory in memories {
    for op in ops {
         if(op.memories.contains(memory)) { // probably linear in the size of memories (4 -> 12 with current params)
                  // do stuff
         }
    }
}

Doing (bad psuedo code to show idea):

// Precompute list of instruction indices for each memory
let memory_A_op_indices = [];
let memory_B_op_indices = [];
let memory_ops = [memory_A_op_indices, memory_B_op_indices,...];
...
for op in ops {
     for memory in ops.memories_used {
          memory_ops[memory].push(op);
     }
} 

// Now compute the memory counters directly in parallel
for memory in memories {
                    let mut final_cts_i = vec![0usize; M];
                    let mut read_cts_i = vec![0usize; m];
                    let mut subtable_lookups = vec![F::zero(); m];

                  for op in memory_ops[memory] {
                       // update relevant counters
                  }

                 // TODO: construct polynomials
}

I'm thinking this will have better cache performance (less stuff is looked up in the hot loop) and get rid of the memories_used.contains.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants