Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Background sync for datasets from CommCare HQ #41

Merged
merged 28 commits into from
Mar 18, 2024
Merged

Background sync for datasets from CommCare HQ #41

merged 28 commits into from
Mar 18, 2024

Conversation

kaapstorm
Copy link
Contributor

Jira: SC-3212

This pull request includes

  • A series of small prefactor commits
  • A large commit that adds an API endpoint for accepting dataset change requests from CommCare HQ
  • A large commit that adds an OAuth 2.0 server for authenticating incoming requests
  • A migration for adding the models for the OAuth 2.0 server

Easiest to review 🐡 commit 🐙 by 🐠 commit 🐟

This code has been tested locally.

It will be deployed to Staging on Wed Feb 28.

QA is scheduled.

@kaapstorm kaapstorm added the enhancement New feature or request label Feb 27, 2024
@kaapstorm kaapstorm mentioned this pull request Feb 27, 2024
Copy link
Contributor

@mkangia mkangia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the refactor commits @kaapstorm.

I tried reviewing this PR but it was a bit difficult with this PR attending to quite a lot of changes with some big commits.
Plus the part about OAuth that came along with syncing data sources is new to me.

Are you able to break this down further for easier review? Or else if its better the way it is, I will take a more dedicated pass. Please let me know

hq_superset/utils.py Outdated Show resolved Hide resolved
hq_superset/models.py Show resolved Hide resolved
hq_superset/api.py Outdated Show resolved Hide resolved
hq_superset/const.py Outdated Show resolved Hide resolved
@kaapstorm
Copy link
Contributor Author

Tests are failing because they are not configured correctly. I'll fix that.

@mkangia you asked:

Are you able to break this down further for easier review?

I agree that Add OAuth 2.0 server is a monster. All the commits that went into it can be found here but I'm not sure that's any more helpful.

I'll try to unpack them. I'll timebox it and see how I get on. (Rearranging and squashing the commits from the draft PR took hours. Hopefully unpacking might be quicker.)

@mkangia
Copy link
Contributor

mkangia commented Feb 28, 2024

Thanks @kaapstorm

Good to timebox or else then If its easier, we could jump on a call where we can walk through the big commits

@kaapstorm kaapstorm changed the base branch from master to nh/bg_sync_pre February 28, 2024 11:56
@kaapstorm kaapstorm force-pushed the nh/bg_sync_pre branch 2 times, most recently from d8dc390 to f9ea93b Compare February 29, 2024 10:56
Accepts changes to datasets from CommCare HQ data forwarding
"""

MAX_REQUEST_LENGTH = 10 * 1024 * 1024 # reject JSON requests > 10MB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaapstorm How was this limit determined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A wild guess as to how big a "very big" request might be? I'm open to better ways to determine this value.

Base automatically changed from nh/bg_sync_pre to master February 29, 2024 11:00

@expose('/change/', methods=('POST',))
@handle_api_exception
def post_dataset_change(self) -> FlaskResponse:
Copy link
Contributor

@Charl1996 Charl1996 Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaapstorm Quickly remind me again, is this endpoint going to be hit for every row change in a data source? (it seems like it considering that the DataSetChange has a doc_id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint will be hit for every change for a doc_id, where a doc_id originally refers to a case or a form. Sometimes one doc_id can result in multiple rows, e.g. a data source definition that pulls out a question inside a repeat group.

@@ -193,3 +204,52 @@ def array_is_falsy(array_values):
return []

return array_values


def js_to_py_datetime(jsdt, preserve_tz=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring is amazing!

cast_functions = {
# 'BIGINT': int,
# 'TEXT': str,
'TIMESTAMP': partial(js_to_py_datetime, preserve_tz=False),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today I learned: partial functions. (nice!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

hq_superset/models.py Outdated Show resolved Hide resolved
@@ -32,3 +45,71 @@ def update_dataset(self):
rows = list(cast_data_for_table(self.data, table))
insert_stmt = table.insert().values(rows)
connection.execute(insert_stmt)


class HQClient(db.Model, OAuth2ClientMixin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to be nitpicking here...

Can we maybe call this something like HQOAuthClient? The reason is that the current naming made me think this was the request client (don't know if I'm just super pedantic).

Anyway, I'll leave it to your own discretion...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I rebased (before reading this comment) I renamed this class "OAuth2Client", in line with its mixin. I did the same with "Token".

"HQOAuthClient" may have been a better name, considering it's got a "domain" field, but I feel like "OAuth2Client" is close enough, and better than "CommCareHQOAuth2Client".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like "OAuth2Client" is close enough

Perfect!

hq_superset/services.py Outdated Show resolved Hide resolved
hq_superset/services.py Outdated Show resolved Hide resolved

### Setup env
The 'User configurable reports UI' feature flag must be enabled for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more of an HQ discussion, but should we not have a different FF for this workflow so users don't have to be concerned about the User Configurable Reports UI feature flag until they actually want to user the UCRs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can definitely pull CommCare Analytics into the current discussion around UCR feature flags.

@Charl1996
Copy link
Contributor

Absolutely great work, @kaapstorm !

Just a couple of comments, nothing big.

There is a lot happening in this commit.

1. Setting up the test database for datasets from HQ no longer needs a
   special function. The `tests.utils` module is renamed `tests.const`
   and the `UnitTestingOnly` exception is no longer needed.

2. The `oauth2_server` module is the heart of the implementation of an
   OAuth 2.0 server, based heavily on the sample implementation given
   at https://github.com/authlib/example-oauth2-server/

3. The OAuth 2.0 server is exposed in `api.py`, and uses models defined
   in `models.py`.

4. The OAuth 2.0 server uses Fernet symmetric encryption so that OAuth
   2.0 client secrets can be read for setting up syncing for multiple
   data sources. It uses a couple of utility functions defined in
   `utils.py`.
@mkangia
Copy link
Contributor

mkangia commented Mar 5, 2024

Do you have a different kind of token-based authentication in mind? What standard are you thinking of?

Resolved offline.
OAuth is the better choice keeping security in mind and the library is already available via superset/flask so no new dependency was added.

README.md Show resolved Hide resolved
@mkangia
Copy link
Contributor

mkangia commented Mar 5, 2024

New commits look good. Will wait for responses on other comments from the last review.

@mkangia
Copy link
Contributor

mkangia commented Mar 7, 2024

New commits look good 👍

@kaapstorm
Copy link
Contributor Author

Do you think there is anything still left to do @Charl1996 @mkangia ?

@mkangia
Copy link
Contributor

mkangia commented Mar 7, 2024

Hey @kaapstorm

I see some open comments from my earlier review. Waiting on those to do a full pass again. I hope these will be quick since they are mostly minor.

Copy link
Contributor

@Charl1996 Charl1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a brief review of the newer commits and all seems fine. I'm happy with this PR.

hq_superset/services.py Outdated Show resolved Hide resolved
@@ -39,6 +40,10 @@ def save_token(token, request):


def config_oauth2(app):
authlib_logger = logging.getLogger('authlib')
authlib_logger.addHandler(logging.StreamHandler(sys.stdout))
authlib_logger.setLevel(logging.DEBUG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗒️

@kaapstorm
Copy link
Contributor Author

49d575a fixed the problem with syncing from CommCare Staging. I don't like it, but the right way to do it didn't work.

"If at first (or second, or third, or fourth) you don't succeed, use a bigger hammer."

@kaapstorm
Copy link
Contributor Author

I think that's everything. If there is anything I've missed, please link to it.

# 2.0 secure transport check despite transport being over HTTPS. I
# hate to do this, but werkzeug.contrib.fixers.ProxyFix didn't fix
# it. So I've run out of better options. (Norman 2024-03-13)
os.environ['AUTHLIB_INSECURE_TRANSPORT'] = '1'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @kaapstorm

Good find on the proxy redirection.

I can make peace with this change to unblock QA/UAT.

It's a positive that this is on analytics server that we have control/access to and not superset.

I was trying to find where in nginx conf, this is getting set and compare that with what we have for HQ though HQ nginx template has a lot happening.

Couple of things I noticed that could be faulty

  1. we have two "server" blocks with the same server name in nginx conf. I am not sure if that is expected/correct
  2. I don't know what the following block does
if (!-f $request_filename) {
      proxy_pass http://superset;
      break;
    }

@sravfeyn assuming you have more insight when this was setup.
Can you see something faulty with the current nginx conf or point out what could redirecting urls to http?
I don't know what this block does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick look at the nginx config, and realised it will take me a bit longer to change and test it. I'll merge this PR, and I'll create a follow up ticket to see how we can configure nginx so that we can use werkzeug's "ProxyFix" class (or not have to use it at all) instead of using this environment variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay @kaapstorm, thanks for taking a look.

I am fine with this merge till this stays on staging and is not released to production environment until the secure fix is done.

@kaapstorm kaapstorm merged commit bd46e6e into master Mar 18, 2024
3 checks passed
@kaapstorm kaapstorm deleted the nh/bg_sync branch March 18, 2024 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants