Skip to content
This repository has been archived by the owner on Nov 29, 2023. It is now read-only.

Fix/various pdf fixes #577

Closed
wants to merge 3 commits into from
Closed

Fix/various pdf fixes #577

wants to merge 3 commits into from

Conversation

MartijnR
Copy link
Member

@MartijnR MartijnR commented Jul 25, 2023

Offering some improvements that were made to PDF generation in OpenClinica's fork. PDF (record) generation is something their users use extensively.

I have verified this PR works with

  • pdf generation API endpoint

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

    1. Enketo used to instantiate a headless browser for each PDF API call and close this browser at the end of the call. The first commit of this PR changes that to a single browser instance that always remains open (and re-launches automatically if it crashes). API responses should therefore be faster.
    1. There is a bug in PDF generation for forms that require authentication. The PDF logic is not able to open the form but does not respond with an error. Instead it produces an 'empty' PDF (actually a text file but with a pdf extension that is shown as an empty PDF by Preview). The 3rd commit in this PR fixes this by properly returning a 401 response.

args,
userDataDir,
});
this.browser.on('disconnected', launchBrowser);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relaunch if the browser is shut down for any reason

.goto(urlObj.href, { waitUntil: 'networkidle0', timeout })
.catch((e) => {
e.status = /timeout/i.test(e.message) ? 408 : 400;
throw e;
});

// Either a 401 error is thrown or goto succeeds (or encounters a real loading error)
await Promise.race([detect401, goToPage]);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see, it was a challenge to properly catch a 401 response... I would certainly be impressed if y'all can point to a better solution :).


/**
* This class approach makes it easy to open multiple browser instances with
* different arguments in case that is ever required.
Copy link
Member Author

@MartijnR MartijnR Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, in OpenClinica's fork they actually do this with custom 'headless' API endpoints that serve to import records, run validation on them, and add comments to questions with errors. Those endpoints use a different headless browser config to optimize performance for that purpose (without stylesheets etc).

Copy link
Contributor

@eyelidlessness eyelidlessness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a first pass reviewing the functionality and implementation, I have a couple pretty serious concerns about this introducing denial of service risks.

  1. The failure/retry logic is unconditional. In pathological failure cases, this would result in an infinite loop/recursion, only briefly yielding for however long launch failure takes. At best, this scenario would result in severe service degradation.

  2. All browsers, and particularly Chromium, are notorious for consuming huge amounts of resources (especially memory) over time for long running processes. This is true even across multiple tabs (i.e. "page" in Puppeteer's terms), as multiple same-origin requests are generally serviced by a shared process. This is somewhat mitigated by the PDF logic calling page.close, making it more an attack risk than self-DoS, but it's a real concern.

I have a few smaller concerns about implementation details of the change, as well as the existing implementation it augments. But all of these concerns also prompted me to consider the functionality overall—how best to serve the general use case, as well as extensibility for e.g. OpenClinica. My gut instinct:

  • There is a general use case for PDF generation at a user level, but it may be better served as a client side feature without the machinery of a server orchestrated headless browser doing (roughly) the same thing a user's browser can do. There would be some work there, but I think it's worth considering.

  • To my understanding, OpenClinica has a more specific use case of batch/automated PDF generation. This may be better served by a headless browser, but I'm not sure it necessarily needs to be handled in the mainline Enketo project[s]. To the extent there's a need to extend the Enketo Express server to address this use case, the underlying express server library is already well suited for extensibility—i.e., OpenClinica or anyone else can add any route/request handlers they wish, to supplement or extend existing server functionality essentially arbitrarily.

I think it's best for now to hold off on this change, so we can put some more strategic thought into how we address both of these use cases in the longer term.

@MartijnR
Copy link
Member Author

MartijnR commented Oct 12, 2023

Thanks a lot for your review!

I think there is indeed a risk with the automatic browser respawning (and it doesn't actually respond to a real issue that was reported). We may just remove that in the OpenClinica fork as well, so thanks for that!

If you're interested in a PR without that feature, just let me know. I think the other changes will reduce the vulnerability to DOS slightly as it won't spawn a browser for each API request (compared to the current code). It will also make it a lot easier to keep sending PRs for PDF bug fixes (and keep the fork up to date). There are currently a couple of PDF fixes to offer including the one in this PR.

There would be some work there, but I think it's worth considering.

I don't think bringing non-batch PDF creation to the client is feasible though, unless you're okay with requiring users to load the form and use the browser's print menu manually. However, if you see no need for users to generate a PDF of an empty form or record with a click of a button in Central/etc, that may be fine. In that case I imagine you may want to consider removing the PDF API endpoints. I'm curious if you have some clever idea that 'some work' to generate a PDF on the client refers to, as I may of course be missing something!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants