Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sigspec: decrease size with tagged union of bits and chunks #4490

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

widlarizer
Copy link
Collaborator

What are the reasons/motivation for this change?

small is fast

Explain how this is achieved.

This PR reduces sizeof(SigSpec) from 64 bytes to 40 bytes. SigSpec is always strictly either in packed mode (bits_ are empty) or unpacked (chunks_ are empty). A vector is three pointers. This PR puts bits_ and chunks_ into a union tagged with bool packed_. Union UB outside of rtlil.cc and rtlil.h is avoided since already the current implementation of SigSpec has bits_ and chunks_ as private members. This brings lower runtime memory usage and improve performance with improved cache locality

  • TODO: measure impact

If applicable, please suggest to reviewers how they can test the change.

Manually verify that all direct chunks_ accesses happen only in branches where packed_ is true, and bits_ where false. Run benchmarks, build with sanitizer, etc

void RTLIL::SigSpec::pack() const
{
RTLIL::SigSpec *that = (RTLIL::SigSpec*)this;

if (that->bits_.empty())
if (packed_ || that->bits_.empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be removing the bits_.empty() check? We need to guarantee accesses to chunks_ are valid once pack() returns, so we need to do the switch in full no matter if bits_ are empty.

check();
}

void RTLIL::SigSpec::unpack() const
{
RTLIL::SigSpec *that = (RTLIL::SigSpec*)this;

if (that->chunks_.empty())
if (!packed_ || that->chunks_.empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analogously here

std::vector<RTLIL::SigBit> bits_; // LSB at index 0
union {
std::vector<RTLIL::SigChunk> chunks_; // LSB at index 0
std::vector<RTLIL::SigBit> bits_; // LSB at index 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this with std::variant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, this is a good small but hot case to try it on

kernel/rtlil.cc Outdated
width_ = 0;
hash_ = 0;
append(SigBit(bit));
check();
}

void RTLIL::SigSpec::switch_to_packed() const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the naming alone it would appear this does the same thing as pack(), I suggest we name it switch_union_to_chunks or similar.

@widlarizer widlarizer marked this pull request as draft July 14, 2024 16:55
@widlarizer
Copy link
Collaborator Author

Turns out this gives a 5-10% performance penalty across the board, even with thin LTO. Profiling should show where this is coming from. Could be constructor/destructor or swap calls on pack/unpack. Time to read what STL methods compile down to I guess

@povik
Copy link
Member

povik commented Jul 17, 2024

We can add a 24 byte padding to the struct to check if the penalty is due to cache alignment effects.

@widlarizer
Copy link
Collaborator Author

I tried that earlier - adding alignas(64) to this made perf and memory consumption significantly worse actually

@povik
Copy link
Member

povik commented Jul 19, 2024

@widlarizer Disregarding any switch to a union, please see what kind of a penalty we get from switching to

void RTLIL::SigSpec::unpack() const 
{
	RTLIL::SigSpec *that = (RTLIL::SigSpec*)this;

	if (that->chunks_.empty())
		return;

	cover("kernel.rtlil.sigspec.convert.unpack");
	log_assert(that->bits_.empty());

	std::vector<RTLIL::SigChunk> old_chunks;
	old_chunks.swap(that->chunks_);

	that->bits_.reserve(that->width_);
	for (auto &c : old_chunks)
		for (int i = 0; i < c.width; i++)
			that->bits_.emplace_back(c, i);

	that->hash_ = 0;
}

(Note the swap in place of a clear on that->chunks_.)

That should inform what we do next with this PR.

@povik
Copy link
Member

povik commented Jul 22, 2024

@widlarizer Disregarding any switch to a union, please see what kind of a penalty we get from switching to

@widlarizer reports that the unpack() change alone didn't bring a noticable performance hit, so we can't explain the 5-10 % union penalty on the basis that we are losing the chunks_ capacity.

@widlarizer
Copy link
Collaborator Author

5.88% of the 22k cache misses happen in malloc_consolidate for baseline (1166238) but 9.20% of the 18k for the union sigspec with some fixes (6e3bf76). The sigspec cache misalignment theory lives on, I'd say.

@widlarizer widlarizer added the status-paused Status: Unfinished, not actively worked on, PR author intends to continue PR in the future label Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status-paused Status: Unfinished, not actively worked on, PR author intends to continue PR in the future
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants