sped up Tokenizer::dump() #5009

Merged
merged 9 commits into from
Aug 31, 2023

firewave
Collaborator

When scanning the cli folder with DISABLE_VALUEFLOW=1, Tokenizer::dump() consumes almost 25% of the total Ir count if an addon is specified. This is mainly caused by the use of std::ostream.

Encountered while profiling #4958.
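
To illustrate the kind of change this is about, here is a minimal sketch that builds the same text once through std::ostringstream and once by appending to a std::string. FakeToken and both function names are made up for this example and are not the actual cppcheck code; the output format is also only illustrative.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a token; not cppcheck's real Token class.
struct FakeToken {
    std::string str;
    int linenr;
};

// Stream-based variant: every insertion goes through the formatted
// output machinery of std::ostream.
std::string dumpWithStream(const FakeToken& tok) {
    std::ostringstream out;
    out << "<token str=\"" << tok.str << "\" linenr=\"" << tok.linenr << "\"/>";
    return out.str();
}

// String-based variant: append the pieces directly to a std::string.
std::string dumpWithString(const FakeToken& tok) {
    std::string outs;
    outs += "<token str=\"";
    outs += tok.str;
    outs += "\" linenr=\"";
    outs += std::to_string(tok.linenr);
    outs += "\"/>";
    return outs;
}

// Usage sketch: both variants are expected to produce identical text.
void demo() {
    const FakeToken tok{"x", 3};
    assert(dumpWithStream(tok) == dumpWithString(tok));
}
```

The string variant is more verbose, but it avoids the stream layer entirely.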

@firewave firewave force-pushed the tok-dump branch 6 times, most recently from c58378b to e4f46ac Compare April 28, 2023 17:51
@firewave firewave changed the title from "optimized Tokenizer::dump()" to "sped up Tokenizer::dump()" Apr 30, 2023
Collaborator Author

firewave commented May 4, 2023

It looks like the pointer address is just used as a unique ID for the objects in the dump. @danmar?

If that is the case (and since the string representation of a pointer differs between Windows and Linux), we could use something more lightweight, simple and portable instead, or simply use just one representation.

@firewave firewave force-pushed the tok-dump branch 4 times, most recently from fb4bbd7 to c3058d2 Compare August 23, 2023 14:13
firewave
Collaborator Author

Scanning with DISABLE_VALUEFLOW=1 and --addon=misra -I../lib -D__GNUC__ (the second pair of values is the percentage of the total Ir count consumed by Tokenizer::dump() alone):

cli/filelister.cpp
Clang 15 125,212,971 -> 109,163,772 / 20.64% -> 8.87%

cli/cmdlineparser.cpp
Clang 15 1,105,405,465 -> 949,986,812 / 20.28% -> 8.40%
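
For reference, the relative reduction in total Ir implied by these numbers can be worked out with a small helper. The function name is made up for illustration; only the four Ir counts come from the measurements above.

```cpp
#include <cassert>
#include <cstdint>

// Relative reduction in percent when a count drops from `before` to `after`.
double relativeReductionPercent(std::uint64_t before, std::uint64_t after) {
    assert(before > 0 && after <= before);
    return 100.0 * static_cast<double>(before - after) / static_cast<double>(before);
}
```

Calling it as relativeReductionPercent(125212971, 109163772) gives roughly 12.8% for cli/filelister.cpp, and relativeReductionPercent(1105405465, 949986812) gives roughly 14.1% for cli/cmdlineparser.cpp.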

@firewave firewave force-pushed the tok-dump branch 2 times, most recently from d95f72c to e184436 Compare August 23, 2023 14:28
Collaborator Author

firewave commented Aug 23, 2023

I know that the new code is horrible but that's what we have to pay for having overengineered, unusable garbage in the standard (also looking at you std::regex).

I am aware that we might also be able to use the {fmt} library, but we always try to avoid external dependencies. And we already have Boost as an optional one to work around other performance issues in the standard implementations...

@firewave firewave force-pushed the tok-dump branch 8 times, most recently from 66172f1 to 63da8d1 Compare August 25, 2023 21:14
@firewave firewave marked this pull request as ready for review August 25, 2023 21:41
Owner

danmar commented Aug 28, 2023

> It looks like the pointer address is just used as a unique ID for the objects in the dump.

Yes, it is a unique ID. In Python it can be any arbitrary string; it doesn't have to be numeric. But the premium addon expects it to be a hexadecimal value.

I am not against you changing it if you think there is a faster approach.

firewave
Collaborator Author

> Yes it is a unique ID. In python it can be any arbitrary string, doesn't have to be numeric. But the premium addon expects that it's a hexadecimal value.

As you can see, the value differs between platforms and might not even look like a hexadecimal value (0, 0230FB33; the latter, although it is an address, would actually be considered octal). Does premium actually work outside of GNU-like compilers?

So I would keep the current format for now and prepare another PR which generates consistent output across all platforms. You could then test that with premium as well. Also, if we run into issues, we have a separate commit to revert which only contains the changes to the identifier.

Owner

danmar commented Aug 28, 2023

> I know that the new code is horrible but that's what we have to pay for having overengineered, unusable garbage in the standard (also looking at you std::regex).

Personally I don't think it's horrible. The ostream never made me feel excited anyway.

Owner

danmar commented Aug 28, 2023

> Does premium actually work outside of GNU-like compilers?

Yes it does. Whether there is a "0x" or not does not matter. It always uses base 16 when converting the string, even if the string says 01234567.
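
The premium addon's code is not shown in this thread, so the following is only a sketch of the same idea using the C++ standard library: when the base is fixed to 16, std::stoull accepts the ID with or without a "0x" prefix and does not treat a leading zero as octal. parseId is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Parse a dump ID as a base-16 number, whatever its platform-specific spelling.
unsigned long long parseId(const std::string& id) {
    std::size_t pos = 0;
    const unsigned long long value = std::stoull(id, &pos, 16);
    assert(pos == id.size()); // the whole string must be a valid hex number
    return value;
}
```

With this, "0x230fb33" and "0230FB33" parse to the same value, and "01234567" is read as hexadecimal 1234567 rather than as an octal number.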

Owner

danmar commented Aug 28, 2023

As far as I can see there are many opportunities to "fold" newlines. If you do that I don't have any negative opinion about this. Looks OK to merge then.

Owner

danmar left a comment

ptr_to_string

lib/utils.h Outdated
// a-f / A-F
c = 55 + temp;
#if !defined(_WIN32) || defined(__MINGW32__)
c += 32; // add 32 for lowercase
Owner

there is no technical reason to use lower case as far as I know.

Collaborator Author

It is necessary to match the result of std::ostringstream. As mentioned before I will do the portable approach in another PR.

Owner

if we really want to lowercase, I think that either c = 'a' + temp; or c += 'a' - 'A'; would be more elegant.

Owner

I don't see why it's important to match std::ostringstream though. The only possible usage I can think of for this function is when generating the dump file or debug output.

Collaborator Author

People might depend on the format. Also I want that change separate so it can easily be reverted if necessary. I will also put it behind a define for a release or two.

Collaborator Author

> if we really want to lowercase, I think that either c = 'a' + temp; or c += 'a' - 'A'; would be more elegant.

I think the - 10 is better as it highlights that we are dealing with a "base 16"/hex value as input.
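
To make the constants in this thread concrete: in ASCII, 55 is 'A' - 10 and 32 is 'a' - 'A'. The sketch below spells the digit conversion out with character literals. hexDigit and toHex are hypothetical helpers written for this illustration; they are not the ptr_to_string from this PR and do not reproduce its platform-specific formatting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Convert a value in the range 0..15 to its hex digit.
char hexDigit(unsigned temp, bool lowercase) {
    assert(temp < 16);
    if (temp < 10)
        return static_cast<char>('0' + temp);
    // 'A' + temp - 10 is the same as 55 + temp in ASCII; the "- 10"
    // makes it visible that temp is a base-16 digit value of 10..15.
    char c = static_cast<char>('A' + temp - 10);
    if (lowercase)
        c = static_cast<char>(c + ('a' - 'A')); // adds 32 in ASCII
    return c;
}

// Format an integer (for example a pointer cast to std::uintptr_t) as hex
// without a prefix and without padding.
std::string toHex(std::uintptr_t value, bool lowercase) {
    if (value == 0)
        return "0";
    std::string result;
    while (value != 0) {
        result.push_back(hexDigit(static_cast<unsigned>(value & 0xF), lowercase));
        value >>= 4;
    }
    std::reverse(result.begin(), result.end()); // digits were produced lowest first
    return result;
}
```

Whether upper or lower case is used then becomes a single flag instead of a preprocessor branch, which is one possible shape for the portable follow-up mentioned above.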

line = tok->linenr();
if (!xml) {
ValueFlow::Value::ValueKind valueKind = values->front().valueKind;
const bool same = std::all_of(values->begin(), values->end(), [&](const ValueFlow::Value& value) {
return value.valueKind == valueKind;
});
out << " " << tok->str() << " ";
outs += " ";
Owner

nit: we can use ' ' here

Collaborator Author

Probably - if it were a single character... but I guess that is just a typo.

Collaborator Author

It seems the code using a char is much slower, similar to the stream insertion case I mentioned in another comment. I will investigate, but that is out of scope for this PR.

Owner

Yes, it should have been a single character; that was a typo 👍

firewave
Collaborator Author

Regarding single character char vs. string literal in stream insertion I encountered something weird - a char is slower: llvm/llvm-project#65040
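
For clarity, these are the two forms being compared. The function names are made up for this sketch; it only shows that both insertions produce the same output, so any difference is purely a matter of performance. It makes no claim about timings, which are covered in the linked issue.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Separator inserted as a single char.
std::string joinWithChar(const std::string& a, const std::string& b) {
    std::ostringstream out;
    out << a << ' ' << b;
    return out.str();
}

// Separator inserted as a one-character string literal.
std::string joinWithLiteral(const std::string& a, const std::string& b) {
    std::ostringstream out;
    out << a << " " << b;
    return out.str();
}

// Usage sketch: the resulting text is identical either way.
void demo() {
    assert(joinWithChar("tok", "str") == joinWithLiteral("tok", "str"));
}
```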

Owner

danmar commented Aug 28, 2023

> Regarding single character char vs. string literal in stream insertion I encountered something weird - a char is slower:

ok

@danmar danmar merged commit 0fadf9e into danmar:main Aug 31, 2023
72 checks passed
@firewave firewave deleted the tok-dump branch August 31, 2023 10:43