Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: yaml validation files in gcs #977

Merged

Conversation

sundar-mudupalli-work
Copy link
Collaborator

This change fixes 972. When using data-validation configs run -c gs://mudu-plex-pri-myb/db6rows/public.iowa_27rows/0001.yaml was throwing an error.

This fix does not have a test cases. We don't have any GCS based test cases in unit tests. Do the unit tests have access to a GCS bucket ? I tested the fix manually and it works. It would be nice for this to be tested in the test suite

Sundar Mudupalli

@sundar-mudupalli-work sundar-mudupalli-work requested a review from a team as a code owner September 8, 2023 05:06
@sundar-mudupalli-work
Copy link
Collaborator Author

/gcbrun

@sundar-mudupalli-work
Copy link
Collaborator Author

sundar-mudupalli-work commented Sep 13, 2023

Neha,

Good question. I did a bit of research and understand the code better.

Summary: I think the fix I proposed is in the right place and happy to hear another point of view. Here is my reasoning:

  1. The StateManager object upon instantiation decides whether it is working with the native filesystem or GCS. This value is in the object attribute file_system which is either FileSystem.GCS or FileSystem.LOCAL. If StateManager object is instantiated without any parameters, the file_system attribute is FileSystem.LOCAL unless the environment variable PSO_DV_CONFIG_HOME points to a GCS directory.
  2. If we want to read from a file in GCS, we must (highly recommended?) instantiate the StateManager object with the GCS prefix - which is what this code does.
  3. We can instantiate a StateManager object with no parameters and then move this code to get_validation_config. The last line in get_validation_config is _read_file which depends on file_system attribute being set correctly. Because we instantiated StateManager with no parameters (and environment variable PSO_DV_CONFIG_HOME is not set) we will have to reset the file_system attribute in get_validation_config before calling _read_file. Not a great idea in my opinion.

I looked at list_validations_in_dir function. The test startswith("gs://") appears unnecessary, it could have used the self.file_system attribute. This function is called from __main__ and the StateManager is instantiated with the right parameter. This function is identical to the function list_validations right above it!!. These could have been collapsed into one function with config_dir as an optional argument with a default value of None.

It is a good question, hopefully you find it helpful.

Sundar Mudupalli

return mgr.get_validation_config(validation_name)
if validation_name.startswith("gs://"):
obj_depth = len(validation_name.split("/"))
gcs_prefix = "/".join(validation_name.split("/")[: obj_depth - 1])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be easier than getting the depth - 1: "/".join(file.split("/")[:-1])

@nehanene15
Copy link
Collaborator

@sundar-mudupalli-work That makes sense. Seems like we need to clean up the StateManager functions a bit.
Left a small comment on optimizing the gcs_prefix path, but besides that LGTM

@sundar-mudupalli-work sundar-mudupalli-work merged commit bf0fa0a into develop Sep 28, 2023
5 checks passed
@sundar-mudupalli-work sundar-mudupalli-work deleted the 972-yaml-validation-files-in-GCS-not-supported branch September 28, 2023 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

YAML validation directories supported in GCS, but not validation files in GCS
2 participants