Write image diff to disk even if test passed #234
I have been forced to set my image diff threshold pretty high:

customDiffConfig: {
  threshold: 0.3,
},
failureThreshold: 0.1,
failureThresholdType: "percent",

The thing I'm snapshotting is text rendered by puppeteer. The snapshots are created on Mac, but CI runs on Linux. Small changes in font rendering (especially font width) add up across the width of the image. I tried ssim and its various modes, but they required me to set the threshold even higher, 20-40%.

As a result of the high threshold, I'd like to have the option to manually audit CI builds by looking at the image snapshots as artifacts, even if the tests fell below the 10% threshold needed to fail. Maybe an option like dumpDiffToDiskEvenOnPass=true.
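For context, these are the options jest-image-snapshot's toMatchImageSnapshot matcher accepts. A minimal sketch of the surrounding test (the URL and test name here are made up) might look like:

```ts
import puppeteer from "puppeteer";
import { toMatchImageSnapshot } from "jest-image-snapshot";

expect.extend({ toMatchImageSnapshot });

it("matches the rendered page", async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("http://localhost:3000"); // hypothetical app URL
  const image = await page.screenshot();
  expect(image).toMatchImageSnapshot({
    customDiffConfig: { threshold: 0.3 },
    failureThreshold: 0.1,
    failureThresholdType: "percent",
  });
  await browser.close();
});
```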
You're testing apples and oranges. At the very least, if I were you I'd set up a docker container on your Mac for the snapshot generation. That's really your best bet if you want these snapshots to be meaningful. We'll review the request, though, to see if it's a worthwhile feature.
This describes my situation exactly, including my experience with ssim.
I wonder about an image diff algorithm like this: [original illustration/link not preserved]

Obviously that's a huge feature request, and I'll be honest that it's definitely not going to make the top of my todo list. But a screenshot comparison which is robust to minor font changes but rejects content changes would be super useful :D. You need something which can drift horizontally across the page, which the algorithm above can do. It would fail on text reflow though :( A container is definitely easier if you don't mind the complexity in the dev workflow.
I think there's some general confusion about what pixelmatch and SSIM do, and why you're not achieving the desired results. Both metrics are designed to tell how different one image is from a reference image. They fundamentally treat the reference image as a pure signal (think signal-to-noise ratio) and calculate degradation.

This degradation can happen on what seem like identical platforms (Linux Chrome vs. Linux Chrome). For instance, say one of the chips is a brand-new AMD Ryzen and the other is an old Intel Xeon. Because the vectorized instructions selected at runtime don't match, the two machines produce imperceptibly different output, yet the pixel values differ significantly because of how they were calculated. An alternative case, still on the same operating system, is when Chrome offloads rendering to the GPU.

In these cases, SSIM is a far superior metric to pixelmatch: the images are no longer apples to apples, because the filtering and transformation of the pixels produce different output. SSIM achieves excellent results here because it is a metric derived from the mean, variance, and covariance over pixel windows (say, 11x11 squares) around each pixel. As a result, SSIM can compare two identical images produced through different transformations in terms of the pixels' relationships to each other, restoring the apples-to-apples comparison it should be.

Now compare this to the case you're describing. You're trying to determine not whether the two images match, but whether the outputs are acceptable to the user. This sits somewhere between functional equivalence and a computer vision problem. In an ideal world, you'd use something like a naive Bayes classifier (think spam filtering) to do a fuzzy match analysis. But how do you do that at scale? It requires extensive training for the algorithm to know whether all of the information is communicated equivalently. For this particular case, you might benefit from an OCR-derived comparison to ensure all of the characters are extracted and the extractions are equal, but that's outside the scope of a pure image comparison function.
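To make that concrete: the standard SSIM index for two aligned windows is computed from exactly those statistics (means, variances, covariance) plus two stabilizing constants. A minimal sketch over flattened grayscale windows; the K1/K2/L defaults below are the usual ones from the SSIM paper, not anything specific to this library:

```ts
// SSIM for two aligned grayscale windows (e.g. flattened 11x11 tiles).
// ssim = ((2*muX*muY + C1) * (2*covXY + C2)) /
//        ((muX^2 + muY^2 + C1) * (varX + varY + C2))
function ssim(x: number[], y: number[], L = 255, K1 = 0.01, K2 = 0.03): number {
  const n = x.length;
  const muX = x.reduce((s, v) => s + v, 0) / n;
  const muY = y.reduce((s, v) => s + v, 0) / n;
  let varX = 0, varY = 0, covXY = 0;
  for (let i = 0; i < n; i++) {
    varX += (x[i] - muX) ** 2;
    varY += (y[i] - muY) ** 2;
    covXY += (x[i] - muX) * (y[i] - muY);
  }
  varX /= n; varY /= n; covXY /= n;
  const C1 = (K1 * L) ** 2; // stabilizers so near-zero denominators don't blow up
  const C2 = (K2 * L) ** 2;
  return ((2 * muX * muY + C1) * (2 * covXY + C2)) /
         ((muX * muX + muY * muY + C1) * (varX + varY + C2));
}
```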
If I understand correctly, you're looking to edge-detect subimages inside images and then compare them against what you expect to be subimages inside another image, is that right?
Exactly - to take the very hard problem of OCR and the semantic meaning of the screenshot, and turn it into an image processing problem. SSIM doesn't need neural-net object detection to identify and ignore compression artifacts, and I don't think OCR is required to ignore changes in font spacing. The reason SSIM works badly on minor font-spacing changes is that it assumes there's no drift, only local artifacts. It works quite well for the first few tiles of text, but past ~100px the minor change in font spacing has caused the two images to become completely uncorrelated.

If you draw a horizontal scanline, find the median RGB and define it as zero, and then count how many times the scanline crosses that zero, that count alone will be a very good signature for the content of the text. It would be hard to add, remove, or change a letter without changing that metric, but changing the spacing or weight of the font would not affect it at all. The problem with the "zero-crossing" approach is that it's hard to reconcile back into "% pixels different", which is why the weighted-dot-product approach is probably a more natural fit.
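A rough sketch of that zero-crossing signature (a hypothetical helper, not part of jest-image-snapshot or ssim.js), counting sign changes of a grayscale scanline around its median:

```ts
// Count how many times a grayscale scanline crosses its own median.
// Font spacing/weight shifts move the crossings but not their count;
// adding, removing, or changing a letter changes the count.
function zeroCrossings(scanline: number[]): number {
  const sorted = [...scanline].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  let crossings = 0;
  let prevSign = 0;
  for (const v of scanline) {
    const sign = Math.sign(v - median);
    if (sign !== 0 && prevSign !== 0 && sign !== prevSign) crossings++;
    if (sign !== 0) prevSign = sign;
  }
  return crossings;
}
```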
I'm open to suggestions, and I don't particularly care about the percent or pixel threshold. If it needs to be adjusted or changed for circumstances, it's not a big deal. If you want to make a specific suggestion for how to implement this, please check out weberSsim.ts in ssim.js 3.2. It has my new implementation, which can calculate any individual variance, covariance, or mean in constant time, and it can do that over any size of square pixel window. If it's doable and it works, I'll implement it and post it here.
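For anyone following along: the classic way to get constant-time window means like that is a summed-area table (integral image), and variance and covariance follow by keeping extra tables of x^2, y^2, and x*y. A generic sketch of the idea (illustrative only, not the actual weberSsim.ts code):

```ts
// Summed-area table: sat[y][x] = sum of all pixels in the rectangle [0..y) x [0..x).
// Build once in O(w*h); any rectangular window sum is then four lookups.
function buildSAT(pixels: number[][], w: number, h: number): number[][] {
  const sat = Array.from({ length: h + 1 }, () => new Array<number>(w + 1).fill(0));
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      sat[y + 1][x + 1] = pixels[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x];
    }
  }
  return sat;
}

// Mean over the window with top-left (x0, y0) and bottom-right (x1, y1), exclusive.
function windowMean(sat: number[][], x0: number, y0: number, x1: number, y1: number): number {
  const sum = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0];
  return sum / ((x1 - x0) * (y1 - y0));
}
```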
This issue is stale because it has been open 30 days with no activity. |