
Adding QR codes support in the ImageRedactorEngine #1036

Open · vpvpvpvp wants to merge 27 commits into main

Conversation


@vpvpvpvp vpvpvpvp commented Feb 17, 2023

Change Description

This PR adds to the Presidio Image Redactor the ability to analyze the content of QR codes in an image.

[Design diagram: qr-image-redactor-design]

Summary of Changes

  • Added an abstract QRRecognizer class for QR code recognizers
  • Added a concrete OpenCVQRRecongnizer, which uses OpenCV to recognize QR codes
  • Added QRImageAnalyzerEngine, which uses a QRRecognizer to detect and decode QR codes and an AnalyzerEngine to analyze the decoded text for PII entities
  • Modified ImagePiiVerifyEngine and ImageRedactorEngine to allow using QRImageAnalyzerEngine as an alternative to ImageAnalyzerEngine (a usage sketch follows below)
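
The following is a minimal usage sketch of how the new engine plugs into the existing redactor; the import path and constructor arguments of QRImageAnalyzerEngine are assumptions, not confirmed by this PR.

# Minimal sketch; QRImageAnalyzerEngine's import path and default constructor
# arguments are assumptions.
from PIL import Image

from presidio_image_redactor import ImageRedactorEngine
from presidio_image_redactor import QRImageAnalyzerEngine  # import path assumed

# Use the QR-based analyzer instead of the default OCR-based one.
qr_analyzer = QRImageAnalyzerEngine()
redactor = ImageRedactorEngine(image_analyzer_engine=qr_analyzer)

image = Image.open("ticket_with_qr.png")
redacted = redactor.redact(image, fill=(0, 0, 0))
redacted.save("ticket_with_qr_redacted.png")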

Issue reference

This PR fixes issue #1035

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@vpvpvpvp vpvpvpvp requested a review from a team as a code owner February 17, 2023 18:49
@vpvpvpvp
Author

@microsoft-github-policy-service agree

@SharonHart
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@SharonHart
Contributor

@vpvpvpvp
Seems like the unit tests are failing in the CI. What version of Tesseract OCR did you use to generate the baseline images for the tests?

@vpvpvpvp
Author

Tesseract OCR - 5.2.0
pytesseract - 0.3.10
OS - macOS Ventura 13.2

Indeed, I noticed that in different test environments the results of ImagePiiVerifyEngine may differ by a few pixels. For example, below, the first image is the result on macOS, the second on Ubuntu, and the third is their difference. Both the recognized text and the box coordinates are the same.
[Image: mac_ubuntu_diff]

@omri374
Contributor

omri374 commented Feb 20, 2023

Hi @vpvpvpvp, before going deeper into the code, what are your thoughts on having the QR code analyzer potentially work in parallel with OCR? Something like this:

stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    redact_image --> return_image

Then we could always extend it to more types of detectors in the future, similar to the text analyzer architecture, e.g.:

stateDiagram-v2
    read_image
    read_image --> extract_ocr_text
    read_image --> extract_qr_text
    read_image --> extract_faces
    read_image --> extract_license_plates
    extract_ocr_text --> presidio_analyzer
    extract_qr_text --> presidio_analyzer
    presidio_analyzer --> redact_image
    extract_faces --> redact_image
    extract_license_plates --> redact_image
    redact_image --> return_image

One way to achieve this is to have QRImageAnalyzerEngine extend ImageAnalyzerEngine, and then we could later create a composable ImageAnalyzerEngine which holds multiple image analyzers.
WDYT?
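
For illustration, here is a minimal sketch of what such a composable engine could look like; the class name CompositeImageAnalyzerEngine, the constructor, and the import paths are assumptions rather than anything in this PR.

# Hypothetical composable engine; names, signatures and import paths are
# illustrative only and may differ from the package.
from typing import List

from PIL import Image

from presidio_image_redactor import ImageAnalyzerEngine
from presidio_image_redactor.entities import ImageRecognizerResult


class CompositeImageAnalyzerEngine:
    """Run several image analyzers (OCR-based, QR-based, ...) over one image."""

    def __init__(self, analyzers: List[ImageAnalyzerEngine]):
        self.analyzers = analyzers

    def analyze(self, image: Image.Image, **text_analyzer_kwargs) -> List[ImageRecognizerResult]:
        results: List[ImageRecognizerResult] = []
        for analyzer in self.analyzers:
            # Each analyzer returns ImageRecognizerResult objects with bounding
            # boxes, which the redactor can consume regardless of detector type.
            results.extend(analyzer.analyze(image, **text_analyzer_kwargs))
        return results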

@vpvpvpvp
Author

Hi @omri374, that sounds great! In the current PR you can choose between QRImageAnalyzerEngine and ImageAnalyzerEngine, but it would be great to be able to run them, and other analyzers, in parallel. At first I wanted to extend ImageAnalyzerEngine a bit so that it would also accept a QRRecognizer as a parameter in addition to OCR. Something like this:

from typing import Optional

from presidio_analyzer import AnalyzerEngine
from presidio_image_redactor import OCR
from presidio_image_redactor.qr_recognizer import QRRecognizer  # module added in this PR


class ImageAnalyzerEngine:
    """ImageAnalyzerEngine class.

    :param analyzer_engine: The Presidio AnalyzerEngine instance
        to be used to detect PII in text
    :param ocr: the OCR object to be used to detect text in images.
    :param qr: the QRRecognizer object to detect and decode text in QR codes
    """

    def __init__(
        self,
        analyzer_engine: Optional[AnalyzerEngine] = None,
        ocr: Optional[OCR] = None,
        qr: Optional[QRRecognizer] = None,
    ):
        ...

And then, in the analyze function, extract the text and its coordinates first with self.ocr and then with self.qr (roughly as in the sketch below). But in the end I decided not to change ImageAnalyzerEngine this time, to keep the number of edits to the original source code small.
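
A rough sketch of that combined analyze() flow, for illustration only; the helper names _ocr_results and _qr_results are hypothetical and do not exist in this PR.

# Hypothetical sketch; _ocr_results and _qr_results are placeholder helpers.
def analyze(self, image, **text_analyzer_kwargs):
    results = []
    if self.ocr is not None:
        # OCR path: word-level text and bounding boxes, as in the current engine.
        results.extend(self._ocr_results(image, **text_analyzer_kwargs))
    if self.qr is not None:
        # QR path: decoded payloads and the QR codes' bounding boxes.
        results.extend(self._qr_results(image, **text_analyzer_kwargs))
    return results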

@SharonHart
Contributor

I would suggest using the original image as the baseline (not a screenshot of it or of the screen). If it's still failing, let's see how to add thresholding to the comparison.
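
For reference, a minimal sketch of what a thresholded comparison could look like in the tests; the helper name and the tolerance value are illustrative, not part of this PR.

# Hypothetical comparison helper; the tolerance is an arbitrary example value.
import numpy as np
from PIL import Image


def images_match(expected: Image.Image, actual: Image.Image, max_mean_diff: float = 1.0) -> bool:
    a = np.asarray(expected).astype(np.int16)
    b = np.asarray(actual).astype(np.int16)
    if a.shape != b.shape:
        return False
    # Tolerate small anti-aliasing / font-rendering differences between platforms.
    return float(np.abs(a - b).mean()) <= max_mean_diff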

@vpvpvpvp
Author


Updated the test images; the tests pass locally.

@SharonHart
Contributor

/azp run

@SharonHart
Contributor

@vpvpvpvp Seems to work now, but it is failing on another test that should be resolved in #1032; try to rebase once that is merged.

@omri374
Contributor

omri374 commented Feb 28, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Contributor

omri374 commented Mar 1, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Contributor

omri374 commented Mar 1, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@SharonHart
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@SharonHart
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@SharonHart
Contributor

SharonHart commented Mar 2, 2023

@vpvpvpvp you have a green build 🎊
Will try to review the code later today

from tests.integration.methods import get_resource_image


def test_given_qr_image_then_text_entities_are_recognized_correctly(
Contributor

very nice scenarios

Author

Should I add more complex/realistic scenarios?


    recognized.append(
        QRRecognizerResult(
            text=text, bbox=[x, y, w, h], polygon=[*p.flatten(), *p[0]]
Contributor

@SharonHart SharonHart Mar 6, 2023

A list is passed for bbox but a tuple is declared. Also, in some places bbox is a dictionary; I'm not sure which is better, but at some point I think we should use a common bbox class.
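
For illustration, a minimal sketch of such a common bbox class; the name BoundingBox and the Tesseract-style keys in to_dict are assumptions, not part of this PR.

# Hypothetical common bounding-box type; not part of this PR.
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    left: int
    top: int
    width: int
    height: int

    def to_dict(self) -> dict:
        # Same keys as the Tesseract OCR output used elsewhere in the package.
        return {
            "left": self.left,
            "top": self.top,
            "width": self.width,
            "height": self.height,
        }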

Comment on lines 101 to 108
for text, p in zip(decoded, points):
    (x, y, w, h) = cv2.boundingRect(p)

    recognized.append(
        QRRecognizerResult(
            text=text, bbox=[x, y, w, h], polygon=[*p.flatten(), *p[0]]
        )
    )
Contributor

@SharonHart SharonHart Mar 6, 2023

Prefer immutability.

Suggested change
for text, p in zip(decoded, points):
    (x, y, w, h) = cv2.boundingRect(p)
    recognized.append(
        QRRecognizerResult(
            text=text, bbox=[x, y, w, h], polygon=[*p.flatten(), *p[0]]
        )
    )

recognized = [
    QRRecognizerResult(
        text=text,
        bbox=cv2.boundingRect(point),
        polygon=[*point.flatten(), *point[0]],
    )
    for text, point in zip(decoded, points)
]

(If you find it too complex, for readability's sake, extract the polygon construction into a private helper, e.g. _get_polygon.)

Author

I will add these changes, thanks for the suggestion.

Contributor

@omri374 omri374 left a comment

Thanks again for the contribution! I left some points for discussion; hopefully we can simplify the design and decouple the QR code analysis from the downstream classes.

import numpy as np


class QRRecognizerResult:
Contributor

Author

I'm not so sure about that. QRRecognizerResult is needed to represent the results of QR code recognition (bboxes and raw text without PII analysis). In this sense, QRRecognizerResult is closer to the dictionary returned by the perform_ocr method of the TesseractOCR(OCR) class. At the same time, ImageRecognizerResult already includes the results of text analysis by the presidio_analyzer.
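
For readers following the thread, a rough sketch of what that container holds, based on how QRRecognizerResult is constructed elsewhere in this PR (text, bbox, polygon); the field types are assumptions.

# Hypothetical sketch based on QRRecognizerResult(text=..., bbox=..., polygon=...)
# usages in this PR; the actual class may differ.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QRRecognizerResult:
    """Raw QR decoding output, before any PII analysis."""

    text: str                          # decoded QR payload
    bbox: Tuple[int, int, int, int]    # (x, y, w, h) of the detected QR code
    polygon: List[float] = field(default_factory=list)  # flattened corner points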

Contributor

I see, thanks for the clarification

from presidio_image_redactor.qr_recognizer import OpenCVQRRecongnizer


class QRImageAnalyzerEngine:
Contributor

Can this class inherit from ImageAnalyzerEngine? Just a question, to see if we can simplify the design instead of extending it into a new set of independent classes.

Contributor

Was thinking exactly the same, see below.

Author

Yes, it can inherit from ImageAnalyzerEngine. My concern is that in that case QRImageAnalyzerEngine would also inherit the logic for working with OCR tools, which is not related to QR code recognition.

Contributor

Yes, that's my concern too. As the package is still in beta, we should (carefully) consider breaking backward compatibility. We'll do some thinking on this and get back to you. We can also have a quick design session together over video if you're interested.

Author

Yes, that sounds interesting. If you have time, we could do that.

Contributor

Sure. To avoid putting personal emails on GH, could you please email [email protected] and we'll continue the discussion over email?

def __init__(self, image_analyzer_engine: ImageAnalyzerEngine = None):
def __init__(
    self,
    image_analyzer_engine: Union[ImageAnalyzerEngine, QRImageAnalyzerEngine] = None,
Contributor

If QRImageAnalyzerEngine inherits from ImageAnalyzerEngine, then this class could be independent of the QR implementation

bboxes = self.image_analyzer_engine.analyze(
    image, ocr_kwargs, **text_analyzer_kwargs
)
if isinstance(self.image_analyzer_engine, QRImageAnalyzerEngine):
Contributor

@omri374 @vpvpvpvp Any ideas on making this more open-closed?
Maybe a single ImageAnalyzerEngine that we inherit from, with optional **ocr_kwargs?

Author

If QRImageAnalyzerEngine inherited directly from ImageAnalyzerEngine, we would only need to add ocr_kwargs to the analyze method of QRImageAnalyzerEngine. This is probably the easiest way.

In the long run, the most flexible implementation is probably one where ImageAnalyzerEngine orchestrates the different recognizers (OCR recognizer, QR recognizer, etc.), in the vein of what was suggested earlier in #1036 (comment).
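
A minimal sketch of that inheritance option, for illustration; the analyze signature mirrors the one used earlier in this thread and is an assumption, not the code in this PR.

# Hypothetical sketch only; the body is elided and the signature is assumed.
from typing import List, Optional

from presidio_image_redactor import ImageAnalyzerEngine


class QRImageAnalyzerEngine(ImageAnalyzerEngine):
    def analyze(
        self, image, ocr_kwargs: Optional[dict] = None, **text_analyzer_kwargs
    ) -> List:
        # ocr_kwargs is accepted only to stay signature-compatible with the base
        # class; QR decoding itself does not use it.
        ...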

SharonHart previously approved these changes Mar 6, 2023

Contributor

@SharonHart SharonHart left a comment

Looks great, left some minor comments
