Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

attempt at optimizing allocs in replace.c #142

Conversation

lefessan
Copy link
Member

@lefessan lefessan commented Apr 23, 2024

There are two steps to remove allocations:

  • use a circular buffer for the list of tokens
  • remove re-allocation of tokens passing from one stream to another

@lefessan lefessan marked this pull request as draft April 23, 2024 08:33
@lefessan
Copy link
Member Author

This version passes the testsuite on my computer.

@GitMensch You may want to test it on your example to measure the number of allocations. It's supposed to allocate less than the original implementation in gnucobol 3.1, as there is no allocation of intermediate lists.

@GitMensch
Copy link
Collaborator

related to #75

@GitMensch
Copy link
Collaborator

GitMensch commented Apr 23, 2024

This looks quite promising. perf stat, first for 3.2, then for 3.2 + this change:

          2,369.99 msec task-clock:u                     #    0.937 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
           356,148      page-faults:u                    #  150.274 K/sec
     6,269,470,708      cycles:u                         #    2.645 GHz
    14,738,503,488      instructions:u                   #    2.35  insn per cycle
     3,446,098,461      branches:u                       #    1.454 G/sec
         8,069,400      branch-misses:u                  #    0.23% of all branches

       2.528378960 seconds time elapsed

       1.615366000 seconds user
       0.741772000 seconds sys
          1,255.32 msec task-clock:u                     #    0.959 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
            91,721      page-faults:u                    #   73.066 K/sec
     3,975,614,937      cycles:u                         #    3.167 GHz
     9,728,980,463      instructions:u                   #    2.45  insn per cycle
     2,260,897,196      branches:u                       #    1.801 G/sec
         4,786,343      branch-misses:u                  #    0.21% of all branches

       1.308611289 seconds time elapsed

       1.039849000 seconds user
       0.208398000 seconds sys

especially look at "instructions" and "page faults".

relation to 3.1.2:

               193      page-faults:u                    #   37.357 /sec
     4,975,843,684      cycles:u                         #    0.963 GHz
     6,891,334,795      instructions:u                   #    1.38  insn per cycle

Numbers from valgrind:

total heap usage allocs frees bytes allocated
GC 3.1.2 5,526 5,526 20,443,861
GC 3.2 21,677,571 21,677,571 893,390,820
GC 3.2 (after this PR) 7,803,907 7,803,903 244,140,683

So still much more than 3.1.2, but definitely a good direction.

Note that you likely have some cobc_malloc somewhere, as only after this patch valgrind outputs "heap in use at exit: 320 bytes in 4 blocks". (with 3.1.2 and 3.2 valgrind outputs "0 bytes").

For comparison to the linked PR, here's the flamegraph and the cost-overview:
flamegraph from this PR
cost-overview

If you don't have an idea how to improve more (if it helps I could drop the inlining, providing a cleaner flamegraph and cost analysis (but of course less similar to production builds of cobc), then I'd be fine with upstreaming this now (with a changelog entry, of course and maybe a minor note to NEWS that cobc's memory usage during parsing in comparison to 3.2 was decreased).

But possibly you see something from the list above leading to even more improvements?

@GitMensch
Copy link
Collaborator

Oops - leads to 4 compile-errors in NIST SM and additional

In file included from /usr/include/string.h:535,
                 from /tmp/cob88280_0.c:8:
In function 'memmove',
    inlined from 'OBIC3A_' at /tmp/cob88280_0.c:1677:5:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:36:10: warning: '__builtin_memmove' reading 20 bytes from a region of size 0 [-Wstringop-overread]
   36 |   return __builtin___memmove_chk (__dest, __src, __len,
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   37 |                                   __glibc_objsize0 (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~

@GitMensch
Copy link
Collaborator

Oh - if possible (for any cases like this - where NIST fails but our internal does not): please add "the issue" for the current failure to the testsuite, as this ensures higher coverage there ("someday" we don't want to run NIST that often, but our coverage is definitely not there yet).

@lefessan
Copy link
Member Author

I think the bug is fixed now, and I added some more optimizations that I would be interested to check their performance.

About the freeing to perform, is there a function that is calling all the freeing functions on exit ?

@GitMensch
Copy link
Collaborator

GitMensch commented Apr 23, 2024

About the freeing to perform, is there a function that is calling all the freeing functions on exit ?

Yes, on different places, as long as you use the right malloc - which is possibly cobc_plex_malloc in replace.c

I think the bug is fixed now

Seems so - could you get a grasp what that triggered and add it to the internal testsuite, please?

and I added some more optimizations that I would be interested to check their performance

Will check in 7 to 17 hours and report back ;-)
Maybe it is useful to check in replace.c if the outline comments in the header are correct and if there are any doxygen comments on functions are missing, as well as crafting the Changelog entry and possibly a minor NEWS one.

Note: the amount of one- or two-liner static function... looks special.
Please check if these are useful and mark them as always inlined.

@GitMensch
Copy link
Collaborator

GitMensch commented Apr 24, 2024

Again: much better (apart from the memory free part better than both 3.1.2 and 3.2


            654.86 msec task-clock:u                     #    0.742 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               180      page-faults:u                    #  274.869 /sec
     2,190,772,514      cycles:u                         #    3.345 GHz
     5,304,004,935      instructions:u                   #    2.42  insn per cycle
     1,309,395,192      branches:u                       #    2.000 G/sec
         4,460,189      branch-misses:u                  #    0.34% of all branches

       0.882801772 seconds time elapsed

       0.583984000 seconds user
       0.071812000 seconds sys

Numbers from valgrind:

total heap usage allocs frees bytes allocated
GC 3.1.2 5,526 5,526 20,443,861
GC 3.2 21,677,571 21,677,571 893,390,820
GC 3.2 (after this PR now) 4,465 4,463 20,390,950

with the flamegraph not showing the replace functions any more.

Therefore I'd say - issue fixed, only needs the "additional" parts (check use of memory-functions, reformat, in-file documentation check; handling TODOs, Changelog, ideally testsuite, possibly minor NEWS entry) then looks good to go.

For the format:

-       if (copy_repls->replace_list == NULL &&
-            copy_repls->current_list == NULL &&
-            replace_repls->replace_list == NULL &&
-            replace_repls->current_list == NULL) {
-		return cb_ppecho_direct (text, token);
-	}
+	if (copy_repls->replace_list == NULL
+	 && copy_repls->current_list == NULL
+	 && replace_repls->replace_list == NULL
+	 && replace_repls->current_list == NULL) {
+		return cb_ppecho_direct (text, token);
+	}

It likely is useful to also compare the actual REPLACE code. I'll add a REPLACE statement + COPY REPLACING then reinspect (later, possibly in an hour).

@lefessan lefessan force-pushed the z-2024-04-22-optimized-replacing branch from 9f44da1 to d210c6e Compare April 24, 2024 15:39
@lefessan lefessan marked this pull request as ready for review April 24, 2024 15:39
@lefessan
Copy link
Member Author

Great. I cleaned the PR with ChangeLog and NEWS sections. Allocations should also be correctly freed at the end.

I did not add a new entry in the testsuite, as the bug was in the data-structure token_queue, and, unless we modify this data structure, which is unlikely, it cannot be triggered anymore. The difference between the testsuite and the NIST tests came from a test in NIST that had a replacement longer than 8 tokens, which seems not to exist in our testsuite (and my token_queue implementation had a bug on that constant...).

Copy link
Collaborator

@GitMensch GitMensch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between the testsuite and the NIST tests came from a test in NIST that had a replacement longer than 8 tokens, which seems not to exist in our testsuite

Please add it (likely by adjusting one of the existing tests) - this way we ensure things like that are catched easier in the future.

I may do a last try for getting the numbers in this issue correct, both with and without REPLACE later this evening.

Comment on lines +60 to +64
comparison between a numeric DISPLAY variable to another or to a
literal comparison between numeric DISPLAY or BCD variable to zero
INSPECT CONVERTING (and "simple" INSPECT REPLACING), in general and
especially if both from and to are constants

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old indent is strange, but I think it matches the one from previous versions. If this is the case then either leave it as-is or, potentially better, replace in both cases by indented extra list starting with *

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at NEWS from 3.2 and earlier version, I think the former indent was wrong even for them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for fixing all of these with an * iteration

cobc/replace.c Outdated Show resolved Hide resolved
@lefessan lefessan force-pushed the z-2024-04-22-optimized-replacing branch from d210c6e to 7f60896 Compare April 24, 2024 18:11
@GitMensch
Copy link
Collaborator

make checkall also passed on the dev environment at work updated stats:

            661.16 msec task-clock:u                     #    0.735 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               178      page-faults:u                    #  269.225 /sec
     2,242,123,238      cycles:u                         #    3.391 GHz
     5,304,096,492      instructions:u                   #    2.37  insn per cycle
     1,309,409,904      branches:u                       #    1.980 G/sec
         4,453,625      branch-misses:u                  #    0.34% of all branches

       0.899029345 seconds time elapsed

       0.584002000 seconds user
       0.075081000 seconds sys

now with

HEAP SUMMARY:
   in use at exit: 0 bytes in 0 blocks
 total heap usage: 4,465 allocs, 4,465 frees, 20,390,950 bytes allocated

All heap blocks were freed -- no leaks are possible

The "simple" adjustment did work fine (one of each,. the second nearly globally), resulting in

          1,019.13 msec task-clock:u                     #    0.957 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
            50,342      page-faults:u                    #   49.397 K/sec
     3,341,557,957      cycles:u                         #    3.279 GHz
     7,982,333,596      instructions:u                   #    2.39  insn per cycle
     1,889,809,554      branches:u                       #    1.854 G/sec
         4,612,794      branch-misses:u                  #    0.24% of all branches

       1.064843192 seconds time elapsed

       0.899148000 seconds user
       0.115552000 seconds sys
       
 HEAP SUMMARY:
    in use at exit: 0 bytes in 0 blocks
  total heap usage: 4,285,236 allocs, 4,285,236 frees, 144,827,144 bytes allocated

All heap blocks were freed -- no leaks are possible

... but I'm not sure if my more complex test program for COPY REPLACING / REPLACE does work correctly.
I should inspect that tomorrow (but still would be fine if you want to check that in upstream today).

* Add two fast paths that completely avoid the replacement machinery
  (these two fast paths were formerly present in 3.1);
* Use a circular buffer instead of a list for the temporary queue of
  tokens in streams;
* Do not reallocate tokens passed from the copy-replacing stream to the
  replace stream
@lefessan lefessan force-pushed the z-2024-04-22-optimized-replacing branch from 7f60896 to 78ef11d Compare April 25, 2024 06:26
@lefessan
Copy link
Member Author

I added a test case for a long replacement (and a combinaison of COPY and REPLACE on the same tokens), that would trigger the same error as the NIST test.

I am not sure what you mean by your complex program: except for the two fast paths (that were already present in 3.1.2), this PR does not change the behavior of the COPY/REPLACE, so anything that worked with 3.1.2 and 3.2 should also work with this one.

@GitMensch
Copy link
Collaborator

so anything that worked with 3.1.2 and 3.2 should also work with this one

I think so - as noted this is good to be checked in upstream.

As also noted I do need to recheck the example I've created in any case.

@lefessan
Copy link
Member Author

I am thinking about writing a blog post about all this (how we fixed the initial bug and how we optimized it to match the perf of 3.1.2), do you happen to have a flamegraph of the final version also ? It could make a nice illustration of the evolution of performances during the work...

@GitMensch
Copy link
Collaborator

Here you go.

flamegraph
costs

@GitMensch
Copy link
Collaborator

The recordings for the flamegraph were taken by perf record call-graph dwarf,8096 --aio -z -- cobc B BIG.cob -o BIG.i, the analysis was done using Hotspot from KDAB.

@lefessan
Copy link
Member Author

Thanks ! Is big.cob public ?

@GitMensch
Copy link
Collaborator

GitMensch commented Apr 26, 2024

Thanks ! Is big.cob public ?

No chance. But I can share that the file is generated by Oracle Pro*COB from an (also generated) file handling SQL.
After all processing of copybooks and removing any comment or empty lines this source still has 747,203 LOC - and the whole part of preparsing the Pro*COB generated file (.cob -> .i - the part that we inspected here) now takes 661.31 msec task-clock; the complete processing (.cob with -fsyntax-only) takes 4,738.12 msec task-clock and processing + codegen (.cob-> .c) needs 7,693.71 msec task-clock.

@lefessan
Copy link
Member Author

Merged in SVN !

@lefessan lefessan closed this Apr 27, 2024
@GitMensch
Copy link
Collaborator

GitMensch commented Apr 27, 2024

Well done. If you write that blog entry please drop a note in the discussion boards under GnuCOBOL.

Note: while we've inspected the preprocessing here I've worked this week on the parsing speed and memory usage, especially for source files with lots of FILLERs as seen especially in generated programs and EXEC SQL preprocessed ones (speed inspected with perf/Hotspot, memory usage with Heaptrack) and hope to commit this soon as well.

... actually I hope to commit anything of the 95% stuff laying around, those tend to pile up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants