Fix/various pdf fixes #577

MartijnR · 2023-07-25T18:43:11Z

Offering some improvements that were made to PDF generation in OpenClinica's fork. PDF (record) generation is something their users use extensively.

I have verified this PR works with

PDF generation API endpoint

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

1. Enketo used to instantiate a headless browser for each PDF API call and close this browser at the end of the call. The first commit of this PR changes that to a single browser instance that always remains open (and re-launches automatically if it crashes). API responses should therefore be faster.
1. There is a bug in PDF generation for forms that require authentication. The PDF logic is not able to open the form but does not respond with an error. Instead it produces an 'empty' PDF (actually a text file but with a pdf extension that is shown as an empty PDF by Preview). The 3rd commit in this PR fixes this by properly returning a 401 response.
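To illustrate the first point, here is a minimal sketch of a lazily launched, shared browser. It is not the code from this PR: the `SharedBrowser` class, its method names, and the injected `launch` function are all hypothetical, and the only Puppeteer calls assumed are `browser.newPage()` and `page.close()`.

```javascript
// Sketch of a single shared headless browser that is launched once and reused.
// `launch` is injected to keep the sketch dependency-free; in practice it
// might be `() => require('puppeteer').launch({ headless: true })`.
class SharedBrowser {
    constructor(launch) {
        this.launch = launch;
        this.browserPromise = null;
    }

    // Returns the one shared browser, launching it on first use only.
    getBrowser() {
        if (!this.browserPromise) {
            this.browserPromise = this.launch().catch((error) => {
                // Forget a failed launch so a later request can try again.
                this.browserPromise = null;
                throw error;
            });
        }
        return this.browserPromise;
    }

    // Opens a page, runs `task` with it, and always closes the page afterwards.
    async withPage(task) {
        const browser = await this.getBrowser();
        const page = await browser.newPage();
        try {
            return await task(page);
        } finally {
            await page.close();
        }
    }
}
```

Storing the launch promise rather than the browser itself means concurrent API calls that arrive during startup share one launch instead of each starting their own.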

MartijnR · 2023-07-25T18:43:42Z

app/lib/headless-browser.js

+                args,
+                userDataDir,
+            });
+            this.browser.on('disconnected', launchBrowser);


Relaunch if the browser is shut down for any reason.

MartijnR · 2023-07-25T18:50:55Z

app/lib/pdf.js

            .goto(urlObj.href, { waitUntil: 'networkidle0', timeout })
            .catch((e) => {
                e.status = /timeout/i.test(e.message) ? 408 : 400;
                throw e;
            });

+        // Either a 401 error is thrown or goto succeeds (or encounters a real loading error)
+        await Promise.race([detect401, goToPage]);
+


As you can see, it was a challenge to properly catch a 401 response... I would certainly be impressed if y'all can point to a better solution :).
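For readers unfamiliar with the pattern, the race in the diff could be wired up roughly as follows. This is a sketch under assumptions, not the PR's implementation: only the names `detect401` and `goToPage` and the `goto` call come from the diff, while the `gotoOr401` wrapper and the way `detect401` is built (from Puppeteer's `response` page event and `response.status()`) are guesses.

```javascript
// Sketch: reject as soon as any response has status 401, otherwise let
// navigation settle. `page` is expected to behave like a Puppeteer Page.
async function gotoOr401(page, url, timeout) {
    let onResponse;

    // Never resolves; it only rejects if a 401 response is observed.
    const detect401 = new Promise((_, reject) => {
        onResponse = (response) => {
            if (response.status() === 401) {
                const error = new Error('Authentication required');
                error.status = 401;
                reject(error);
            }
        };
        page.on('response', onResponse);
    });

    const goToPage = page
        .goto(url, { waitUntil: 'networkidle0', timeout })
        .catch((e) => {
            e.status = /timeout/i.test(e.message) ? 408 : 400;
            throw e;
        });

    try {
        // Either a 401 error is thrown or goto succeeds (or hits a real loading error).
        return await Promise.race([detect401, goToPage]);
    } finally {
        // Remove the listener so it does not outlive this navigation.
        page.off('response', onResponse);
    }
}
```

The listener is registered before `goto` starts, so a 401 on the very first request is not missed, and it is removed in `finally` so that repeated calls on a reused page do not accumulate handlers.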

MartijnR · 2023-07-25T18:53:21Z

app/lib/headless-browser.js

+
+/**
+ * This class approach makes it easy to open multiple browser instances with
+ * different arguments in case that is ever required.


FYI, in OpenClinica's fork they actually do this with custom 'headless' API endpoints that serve to import records, run validation on them, and add comments to questions with errors. Those endpoints use a different headless browser config to optimize performance for that purpose (without stylesheets etc).

eyelidlessness

After a first pass reviewing the functionality and implementation, I have a couple of pretty serious concerns about this introducing denial of service risks.

1. The failure/retry logic is unconditional. In pathological failure cases, this would result in an infinite loop/recursion, only briefly yielding for however long launch failure takes. At best, this scenario would result in severe service degradation.
2. All browsers, and particularly Chromium, are notorious for consuming huge amounts of resources (especially memory) over time for long running processes. This is true even across multiple tabs (i.e. "page" in Puppeteer's terms), as multiple same-origin requests are generally serviced by a shared process. This is somewhat mitigated by the PDF logic calling page.close, making it more an attack risk than self-DoS, but it's a real concern.
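As an illustration of the first concern, a relaunch that gives up after a fixed number of attempts and backs off between them might look like the sketch below. None of this is in the PR: `launchWithBackoff`, its option names, and the default values are hypothetical.

```javascript
// Default delay helper; can be replaced through the `sleep` option.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: retry a browser launch with a cap and exponential backoff instead
// of retrying unconditionally. `launch` is any function returning a promise.
async function launchWithBackoff(
    launch,
    { maxAttempts = 5, baseDelayMs = 500, sleep = defaultSleep } = {}
) {
    let lastError;
    for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
        try {
            return await launch();
        } catch (error) {
            lastError = error;
            // Wait longer after each failure, but not after the final one.
            if (attempt < maxAttempts - 1) {
                await sleep(baseDelayMs * 2 ** attempt);
            }
        }
    }
    const error = new Error(`Browser launch failed after ${maxAttempts} attempts`);
    error.cause = lastError;
    throw error;
}
```

With a cap in place, a persistent launch failure surfaces as an error the API can report rather than as a tight relaunch loop.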

I have a few smaller concerns about implementation details of the change, as well as the existing implementation it augments. But all of these concerns also prompted me to consider the functionality overall—how best to serve the general use case, as well as extensibility for e.g. OpenClinica. My gut instinct:

1. There is a general use case for PDF generation at a user level, but it may be better served as a client side feature without the machinery of a server orchestrated headless browser doing (roughly) the same thing a user's browser can do. There would be some work there, but I think it's worth considering.
2. To my understanding, OpenClinica has a more specific use case of batch/automated PDF generation. This may be better served by a headless browser, but I'm not sure it necessarily needs to be handled in the mainline Enketo project[s]. To the extent there's a need to extend the Enketo Express server to address this use case, the underlying express server library is already well suited for extensibility—i.e., OpenClinica or anyone else can add any route/request handlers they wish, to supplement or extend existing server functionality essentially arbitrarily.

I think it's best for now to hold off on this change, so we can put some more strategic thought into how we address both of these use cases in the longer term.

MartijnR · 2023-10-12T19:42:03Z

Thanks a lot for your review!

I think there is indeed a risk with the automatic browser respawning (and it doesn't actually respond to a real issue that was reported). We may just remove that in the OpenClinica fork as well, so thanks for that!

If you're interested in a PR without that feature, just let me know. I think the other changes will slightly reduce the vulnerability to DoS, as the server will no longer spawn a browser for each API request (as the current code does). It will also make it a lot easier to keep sending PRs for PDF bug fixes (and keep the fork up to date). There are currently a couple of PDF fixes to offer, including the one in this PR.

> There would be some work there, but I think it's worth considering.

I don't think bringing non-batch PDF creation to the client is feasible though, unless you're okay with requiring users to load the form and use the browser's print menu manually. However, if you see no need for users to generate a PDF of an empty form or record with a click of a button in Central/etc, that may be fine. In that case I imagine you may want to consider removing the PDF API endpoints. I'm curious if you have some clever idea that 'some work' to generate a PDF on the client refers to, as I may of course be missing something!

MartijnR added 3 commits July 25, 2023 14:15

changed: Optimizes pdf generation performance by limiting browser ope…

72a1230

…ning/closing and adding caching,

changed: refactored code

8ca51a5

fixed: if a PDF endpoint encounters an authentication error during PD…

181346d

…F generation it does not provide an error response, https://github.com/OpenClinica/enketo-express-oc/issues/688

MartijnR mentioned this pull request Jul 25, 2023

401 response ton /manifest request not returning error code for API call to Enketo OpenClinica/enketo-oc#34

Open

eyelidlessness reviewed Aug 4, 2023

eyelidlessness closed this Aug 4, 2023
