Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precompiled regex simple diff #413

Merged
merged 5 commits into from
Aug 31, 2023

Conversation

cohml
Copy link

@cohml cohml commented Aug 9, 2023

Overview

This PR contains a first pass at adding simple diff detection for precompiled regex. This fixes the bug referenced in #336. Questions/comments for @seperman in bold.

What this PR adds

deepdiff.DeepDiff now returns a diff if two re.Pattern objects consist of different patterns, flags, or groups.

>>> from deepdiff import DeepDiff
>>> from pprint import pp
>>> import re
>>> pattern_1 = re.compile('foo')
>>> pattern_2 = re.compile('foo')
>>> diff_12 = DeepDiff(pattern_1, pattern_2)
>>> pp(diff_12)
{}
>>> pattern_3 = re.compile('bar')
>>> diff_13 = DeepDiff(pattern_1, pattern_3)
>>> pp(diff_13)
{'values_changed': {'root': {'new_value': re.compile('bar'),
                             'old_value': re.compile('foo')}}}
>>> pattern_4 = re.compile('foo', flags=re.M)
>>> diff_14 = DeepDiff(pattern_1, pattern_4)
>>> pp(diff_14)
{'values_changed': {'root': {'new_value': re.compile('foo', re.MULTILINE),
                             'old_value': re.compile('foo')}}}

What this PR doesn't add

The diff doesn't provide any granular info on the specific difference(s), e.g., "the patterns + groups are the same and the flags are different".

I would like to implement this greater granularity, but perhaps you'd prefer to merge what this PR already does and just add that other stuff as a separate PR? Let me know.

Coverage report

---------- coverage: platform darwin, python 3.9.7-final-0 -----------
Name                        Stmts   Miss  Cover   Missing
---------------------------------------------------------
deepdiff/anyset.py             47      0   100%
deepdiff/base.py               34      0   100%
deepdiff/commands.py          116      3    97%   115, 118-119
deepdiff/deephash.py          326      2    99%   352-353
deepdiff/delta.py             359      2    99%   176-177
deepdiff/diff.py              834      3    99%   472, 793, 1680
deepdiff/distance.py          141      0   100%
deepdiff/helper.py            323      6    98%   93-94, 656-657, 665-668
deepdiff/lfucache.py          154      6    96%   18-19, 74, 107, 142, 218
deepdiff/model.py             408      1    99%   638
deepdiff/operator.py           28      1    96%   27
deepdiff/path.py              102      7    93%   74, 92, 122-126
deepdiff/search.py            148      0   100%
deepdiff/serialization.py     243     11    95%   20, 31, 373, 486-488, 523, 583-586
---------------------------------------------------------
TOTAL                        3263     42    99%

Note 1: _diff_booleans

One thing to note is that the precompiled regex detection relies on DeepDiff._diff_booleans. So now _diff_booleans acts more like a _diff_booleans_and_precompiled_regex.

At first I had created a dedicated DeepDiff._diff_precompiled_regex method, but then realized that it was identical to _diff_booleans which already exists. So I just reused _diff_booleans to avoid bloating the code base.

If you're okay with coopting _diff_booleans like this, we might as well collapse further and just do

if isinstance(level.t1, (booleans, Pattern)):
    self._diff_booleans(level, local_tree=local_tree)

Otherwise, I can recreate my original _diff_precompiled_regex method, or entertain anything else you'd propose.

Note 2: Delta

JFYI, I did not touch the code in deepdiff/delta.py, because it seems to already work as is.

>>> from deepdiff import Delta
>>> diff_bool = DeepDiff(True, False)
>>> Delta(diff_bool)
<Delta: {'values_changed': {'root': {'new_value': False}}}>
>>> Delta(diff_13)
<Delta: {'values_changed': {'root': {'new_value': re.compile('bar')}}}>

If anything looks amiss there, let me know and I will update the PR.

@cohml
Copy link
Author

cohml commented Aug 9, 2023

The diff doesn't provide any granular info on the specific difference(s), e.g., "the patterns + groups are the same and the flags are different". [...] I would like to implement this greater granularity

On this point, I believe the simplest place to add more granular info would be here. Something like the following:

for change in tree['values_changed']:
    if isinstance(change.t1, re.Pattern):
        # compare patterns, flags, and groups between ``change.t1`` and ``change.t2``;
        # for each difference, add an associated entry

The final diff output could then look something like the following:

{'values_changed': {'root': {'new_regex_pattern': 'foo', 'old_regex_pattern': 'bar'}}}
{'values_changed': {'root': {'new_regex_flags': 'MULTILINE, 'old_regex_flags': None}}}
{'values_changed': {'root': {'new_regex_groups': 1, 'old_regex_groups': 0}}}

This has the considerable drawback though of potentially stuffing lots of regex-specific code into model.TextView._from_tree_value_changed, which seems intended to be general-purpose.

More appropriate might be to create a custom method, or perhaps even a simple class, for handling regex-specific considerations/APIs. The question then would be the optimal place to put it...

Curious for your thoughts on this @seperman, as I don't want to completely clobber your original design/vision here.

@seperman
Copy link
Owner

seperman commented Aug 14, 2023

Hi @cohml
Thank you for the very detailed PR description that makes it a breath to follow your thought process!
I don't think we need to add new regex specific attributes to the report. What about treating it like other Python objects and we report the changes like this:

{'values_changed': {
    'root.pattern': {'new_value': 'foo', 'old_value': 'bar'},
    'root.flags': {'new_value': 40, 'old_value': 32}
}}

That way it will also work better when serializing the output to formats that don't accept re such as json.

If you are cool with that, let's take care of it in the same PR.
Open to suggestions!

@cohml
Copy link
Author

cohml commented Aug 15, 2023

{'values_changed': {
    'root.pattern': {'new_value': 'foo', 'old_value': 'bar'},
    'root.flags': {'new_value': 40, 'old_value': 32}
}}

Oh awesome, that is perfect actually. In my own testing I had neglected to discover this as even a possibility. Much cleaner than my original proposal!

Poking around, it actually seems possible to get this kind of output while even simplifying the implementation beyond what I've already pushed.

I will push some updates shortly for review and ping you here.

@cohml cohml force-pushed the precompiled-regex-simple-diff branch 2 times, most recently from 975cd2d to 258ab0b Compare August 15, 2023 21:59
@cohml
Copy link
Author

cohml commented Aug 16, 2023

@seperman The following behavior is now implemented, as we discussed:

>>> from deepdiff import DeepDiff
>>> from pprint import pp
>>> import re
>>> pattern_1 = re.compile('foo')
>>> pattern_2 = re.compile('foo')
>>> diff_12 = DeepDiff(pattern_1, pattern_2)
>>> pp(diff_12)
{}
>>> pattern_3 = re.compile('bar')
>>> diff_13 = DeepDiff(pattern_1, pattern_3)
>>> pp(diff_13)
{'values_changed': {'root.pattern': {'new_value': 'bar', 'old_value': 'foo'}}}
>>> pattern_4 = re.compile('foo', flags=re.M)
>>> diff_14 = DeepDiff(pattern_1, pattern_4)
>>> pp(diff_14)
{'values_changed': {'root.flags': {'new_value': 40, 'old_value': 32}}}
>>> pattern_5 = re.compile('(foo)')
>>> diff_15 = DeepDiff(pattern_1, pattern_5)
>>> pp(diff_15)
{'values_changed': {'root.groups': {'new_value': 1, 'old_value': 0},
                    'root.pattern': {'new_value': '(foo)', 'old_value': 'foo'}}}

Coverage report:

---------- coverage: platform darwin, python 3.9.7-final-0 -----------
Name                        Stmts   Miss  Cover   Missing
---------------------------------------------------------
deepdiff/anyset.py             47      0   100%
deepdiff/base.py               34      0   100%
deepdiff/commands.py          116      3    97%   115, 118-119
deepdiff/deephash.py          326      2    99%   352-353
deepdiff/delta.py             359      2    99%   176-177
deepdiff/diff.py              837      3    99%   477, 798, 1682
deepdiff/distance.py          141      0   100%
deepdiff/helper.py            323      7    98%   93-94, 353, 656-657, 665-668
deepdiff/lfucache.py          154      6    96%   18-19, 74, 107, 142, 218
deepdiff/model.py             411      1    99%   641
deepdiff/operator.py           28      1    96%   27
deepdiff/path.py              102      7    93%   74, 92, 122-126
deepdiff/search.py            148      0   100%
deepdiff/serialization.py     243     11    95%   20, 31, 373, 486-488, 523, 583-586
---------------------------------------------------------
TOTAL                        3269     43    99%

In the process I also caught a bug in your if statements that caused booleans to undergo an additional unintended round of diffing.

Let me know if you'd call for other changes.

@cohml cohml force-pushed the precompiled-regex-simple-diff branch from 258ab0b to 32ec182 Compare August 16, 2023 14:03
@seperman
Copy link
Owner

Perfect! Thanks @cohml
This is great.

@seperman seperman merged commit 5ee7d6c into seperman:dev Aug 31, 2023
5 checks passed
@seperman seperman mentioned this pull request Sep 1, 2023
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants