Split YR_MAX_ATOM_LENGTH for string and re, and optimize the extract algorithm of regular atom to speed up the scan. #2107
base: master
Conversation
…YR_MAX_ATOM_LENGTH default value
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Did you measure the performance gain after these changes? In experiments I've done in the past I've concluded that increasing the maximum atom length beyond 4 or maybe 6 bytes doesn't increase performance, but memory grows considerably. Can you share details about the tests you did?
The rules used in the test consist of strings and opcodes, most of which are strings, and there are about 10,000 rules. About 1.3 million samples were tested. Before the modification, the average scan time was 0.594. With YR_MAX_ATOM_LENGTH set to 8 and YR_MAX_ATOM_RE_LENGTH set to 8, the average scan time is 0.095.
Because YR_MAX_ATOM_LENGTH limits the atom length for both strings and regular expressions, increasing only YR_MAX_ATOM_LENGTH would allow a regex atom to contain multiple wildcards. Expanding such atoms can cancel out the gain in scan speed, while memory usage grows substantially. In our experiment we actually only increase the length of string atoms. For regex atoms, the length is increased only when the atom meets the requirement of containing at most one wildcard; otherwise nothing changes compared with the previous behavior.
I still have some questions. You said that you have tested with a corpus of 10,000 rules, but most of them are plain strings and are not really affected by this change. Therefore there must be some regexps in that corpus which are the really slow ones, and those may have improved with your change. I would like to take a look at those specific regexps that are slow with the current implementation and improve with your change, in order to understand better which type of regexps are benefiting from this change. Also, notice that this change has broken some test cases.
Maybe my previous explanation was not accurate. We increased the atom length for strings (YR_MAX_ATOM_LENGTH), so the scan speed of string rules improves. Because most of my test rules are string rules, my scanning performance gain is relatively large.
Splitting YR_MAX_ATOM_LENGTH into YR_MAX_ATOM_RE_LENGTH and YR_MAX_ATOM_LENGTH speeds up the scan. For strings, the atom length can be raised from 4 to a larger value, e.g. 8, and this change accelerates scanning. For regular expressions, simply increasing the atom length may cause the number of atoms to grow exponentially because of wildcards, and memory would explode, so we optimized the regex atom extraction algorithm. On the one hand, we increase the string YR_MAX_ATOM_LENGTH to speed up the scan; on the other hand, we ensure the selected regex atom contains at most one wildcard.
After atom extraction finishes on line 1164 of atoms.c, we count the number of wildcards in the regex atom. If the count is above 1, we further extract a sub-atom from it; the following explains the process.
With the above optimizations, scanning speed increased by a factor of about 6 in our tests.