Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add WRROC models, validators & unit tests #20

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

Karanjot786
Copy link
Member

@Karanjot786 Karanjot786 commented Aug 16, 2024

Hi @uniqueg and @SalihuDickson,

Summary

This PR introduces several key improvements and fixes based on recent feedback:

  • Refactor WRROC Models:

    • Created a base class WRROCDataBase to hold common fields and methods.
    • Refactored WRROCData, WRROCDataTES, and WRROCDataWES to inherit from WRROCDataBase to reduce code duplication and enhance maintainability.
  • Stricter Validation:

    • Replaced extra = "allow" with extra = "forbid" in all WRROC models (WRROCProcess, WRROCWorkflow, WRROCProvenance) to enforce stricter validation.
  • Enhanced Validation and Error Collection:

    • Modified validation functions to collect and report all errors found during validation, ensuring comprehensive feedback.
    • Simplified field validation by removing manual checks for extra fields and leveraging Pydantic's extra = "forbid" configuration.
  • Expanded Unit Tests:

    • Added test cases for edge scenarios such as empty object or result lists, missing optional fields, and incorrect data types.
    • Implemented tests for invalid URLs and other edge cases.
    • Fixed existing tests by allowing empty lists and adding URL validation in the WES conversion validation.

Summary by Sourcery

Add WRROC models with a new base class for shared fields, enhance validation with stricter rules and comprehensive error reporting, and expand unit tests to cover edge cases. Enable test execution in the CI workflow.

New Features:

  • Introduce WRROC models including WRROCData, WRROCDataTES, and WRROCDataWES, inheriting from a new base class WRROCDataBase to encapsulate common fields and methods.

Enhancements:

  • Implement stricter validation in WRROC models by setting extra fields to 'forbid', enhancing data integrity.
  • Enhance validation functions to collect and report all errors found during validation, providing comprehensive feedback.

CI:

  • Enable the execution of unit tests in the CI workflow by uncommenting the test run command.

Tests:

  • Expand unit tests to cover edge scenarios such as empty object or result lists, missing optional fields, and incorrect data types for WRROC models.
  • Add tests for invalid URLs and other edge cases in WRROC models.

Copy link

sourcery-ai bot commented Aug 16, 2024

Reviewer's Guide by Sourcery

This pull request introduces significant improvements to the WRROC models, validators, and unit tests in the CrateGen project. The changes focus on enhancing code structure, improving validation, and expanding test coverage. Key modifications include refactoring WRROC models to use inheritance, implementing stricter validation rules, enhancing error collection during validation, and adding comprehensive unit tests for various scenarios.

File-Level Changes

Files Changes
crategen/models.py Refactored WRROC models to use inheritance, introducing a base class WRROCDataBase to reduce code duplication and improve maintainability
crategen/models.py Implemented stricter validation by replacing 'extra = "allow"' with 'extra = "forbid"' in all WRROC models
crategen/validators.py Enhanced validation functions to collect and report all errors found during validation
tests/unit/test_wrroc_models.py Added comprehensive unit tests for WRROC models and validators, including edge cases and error scenarios
crategen/validators.py Implemented URL validation for WES conversion in the validate_wrroc_wes function
.github/workflows/ci.yml Updated CI workflow to run tests

Tips
  • Trigger a new Sourcery review by commenting @sourcery-ai review on the pull request.
  • Continue your discussion with Sourcery by replying directly to review comments.
  • You can change your review settings at any time by accessing your dashboard:
    • Enable or disable the Sourcery-generated pull request summary or reviewer's guide;
    • Change the review language;
  • You can always contact us if you have any questions or feedback.

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Karanjot786 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟡 Testing: 2 issues found
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

Comment on lines 104 to 235
"""
data = {
"id": "provenance-id",
"name": "Test Provenance",
"provenanceData": "Provenance information"
}
model = validate_wrroc(data)
self.assertIsInstance(model, WRROCProvenance)

def test_validate_wrroc_invalid(self):
"""
Test that validate_wrroc raises a ValueError for invalid WRROC data.
"""
data = {
"unknown_field": "unexpected"
}
with self.assertRaises(ValueError):
validate_wrroc(data)

def test_validate_wrroc_tes(self):
"""
Test that validate_wrroc_tes correctly validates a WRROC entity for TES conversion.
"""
data = {
"id": "process-id",
"name": "Test Process",
"object": [{"id": "https://raw.githubusercontent.com/elixir-cloud-aai/CrateGen/main/README.md", "name": "Input 1"}],
"result": [{"id": "https://github.com/elixir-cloud-aai/CrateGen/blob/main/LICENSE", "name": "Output 1"}]
}
model = validate_wrroc_tes(data)
self.assertEqual(model.id, "process-id")
self.assertEqual(model.name, "Test Process")

def test_validate_wrroc_tes_empty_object_list(self):
"""
Test that validate_wrroc_tes correctly validates a WRROC entity with an empty object list for TES conversion.
"""
data = {
"id": "process-id",
"name": "Test Process",
"object": [],
"result": [{"id": "https://github.com/elixir-cloud-aai/CrateGen/blob/main/LICENSE", "name": "Output 1"}]
}
model = validate_wrroc_tes(data)
self.assertEqual(model.object, [])

def test_validate_wrroc_tes_missing_fields(self):
"""
Test that validate_wrroc_tes raises a ValueError if required fields for TES conversion are missing.
"""
data = {
"id": "process-id",
"name": "Test Process"
}
with self.assertRaises(ValueError):
validate_wrroc_tes(data)

def test_validate_wrroc_wes(self):
"""
Test that validate_wrroc_wes correctly validates a WRROC entity for WES conversion.
"""
data = {
"id": "workflow-id",
"name": "Test Workflow",
"workflowType": "CWL",
"workflowVersion": "v1.0",
"result": [{"id": "https://github.com/elixir-cloud-aai/CrateGen/blob/main/LICENSE", "name": "Output 1"}]
}
model = validate_wrroc_wes(data)
self.assertEqual(model.workflowType, "CWL")
self.assertEqual(model.workflowVersion, "v1.0")

def test_validate_wrroc_wes_invalid_url(self):
"""
Test that validate_wrroc_wes raises a ValueError if a result URL is invalid.
"""
data = {
"id": "workflow-id",
"name": "Test Workflow",
"workflowType": "CWL",
"workflowVersion": "v1.0",
"result": [{"id": "invalid_url", "name": "Output 1"}]
}
with self.assertRaises(ValueError):
validate_wrroc_wes(data)


def test_validate_wrroc_wes_missing_fields(self):
"""
Test that validate_wrroc_wes raises a ValueError if required fields for WES conversion are missing.
"""
data = {
"id": "workflow-id",
"name": "Test Workflow"
}
with self.assertRaises(ValueError):
validate_wrroc_wes(data)

if __name__ == "__main__":
unittest.main()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add edge case tests for WRROC validators

Consider adding more edge case tests for the WRROC validators, such as testing with minimal valid data, data with all optional fields present, and data with unexpected additional fields. This will help ensure the validators are robust and handle a wide range of inputs correctly.

def test_validate_wrroc_process_minimal(self):
    """Test validate_wrroc with minimal valid WRROCProcess data."""
    data = {"id": "min-process-id"}
    model = validate_wrroc(data)
    self.assertIsInstance(model, WRROCProcess)

def test_validate_wrroc_workflow_all_fields(self):
    """Test validate_wrroc with all fields for WRROCWorkflow."""
    data = {
        "id": "full-workflow-id",
        "name": "Full Workflow",
        "workflowType": "CWL",
        "workflowVersion": "v1.0",
        "description": "A complete workflow",
        "inputs": [{"id": "input1", "type": "string"}],
        "outputs": [{"id": "output1", "type": "file"}]
    }
    model = validate_wrroc(data)
    self.assertIsInstance(model, WRROCWorkflow)
    self.assertEqual(model.description, "A complete workflow")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is a very good suggestion.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad you found the suggestion helpful! Adding these edge case tests will definitely strengthen the robustness of your validators. If you need any assistance implementing these tests or have any questions about specific scenarios to cover, feel free to ask. I'm here to help!

tests/unit/test_wrroc_models.py Outdated Show resolved Hide resolved
crategen/validators.py Outdated Show resolved Hide resolved
crategen/validators.py Outdated Show resolved Hide resolved
crategen/validators.py Outdated Show resolved Hide resolved
@Karanjot786 Karanjot786 requested a review from uniqueg August 16, 2024 10:28
@Karanjot786
Copy link
Member Author

Overview

This PR introduces data validation in both the TES and WES converters, ensuring that the input data adheres to the expected format and contains all necessary fields before the conversion process begins. The changes are aimed at improving the robustness and reliability of the conversion process by leveraging Pydantic models for validation.

Changes Implemented

  • TES Converter (tes_converter.py):

    • Added data validation using the TESData model before the conversion process.
    • Updated the conversion logic to handle only validated data, ensuring correctness and preventing errors.
  • WES Converter (wes_converter.py):

    • Implemented validation using the WESData model before conversion.
    • Adjusted the conversion logic to operate on validated data, improving the reliability of the conversion.

Validation Details

  • TES Data Validation:

    • The TESData model is used to validate the input TES data before conversion to WRROC.
    • Any data that does not adhere to the TES schema will raise a validation error, preventing the conversion process from proceeding with invalid data.
  • WES Data Validation:

    • The WESData model validates the WES data before it is converted to WRROC.
    • Validation ensures that all required fields are present and correctly formatted, with errors being raised for any discrepancies.

Additional Information

  • The validation steps ensure that only valid data is processed during the conversion, reducing the likelihood of errors and improving the overall robustness of the tool.
  • Deprecated fields such as task_logs in WES are now handled with warnings, and new fields like tes_logs_url are supported to maintain compatibility with the latest specifications.

Please review the changes and provide feedback.

crategen/converters/tes_converter.py Outdated Show resolved Hide resolved
crategen/converters/tes_converter.py Show resolved Hide resolved
crategen/converters/tes_converter.py Show resolved Hide resolved
crategen/converters/tes_converter.py Show resolved Hide resolved
@@ -1,52 +1,66 @@
from .abstract_converter import AbstractConverter
from .utils import convert_to_iso8601
from ..models import TESData, WRROCDataTES
from pydantic import ValidationError

class TESConverter(AbstractConverter):

def convert_to_wrroc(self, tes_data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add type information and Google-style docstrings here and elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve added type hints and Google-style docstrings to all methods in both TES and WES converters.

Copy link
Member

@uniqueg uniqueg Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember to add Google-style docstrings to all modules, classes, methods and functions in the future. And add type hints to all methods, functions and global vars. And ideally also to local vars, unless their types is really obvious.

You can resolve this conversation after you've read it.

crategen/validators.py Outdated Show resolved Hide resolved
crategen/validators.py Outdated Show resolved Hide resolved
crategen/validators.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where in the code are these validators called? Shouldn't they be used in the convert_from_wrroc() methods of the TES and WES converters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented the validators directly in the convert_from_wrroc methods of both TES and WES converters. This ensures that all data is properly validated before the conversion process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see validate_wrroc_wes() and validate_wrroc_tes() anywhere being called in the codebase. Can you please send me a permalink to the exact locations where they are called? Probably I'm just confused...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check the models in detail yet, but there are clearly some issues (see comment below in validators). Please address all other issues and double check your models against the TES and WES specs, the WRROC profiles and our conversion table first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve double-checked the models against the TES and WES specifications and the WRROC profiles. The models have been updated accordingly to ensure they are comprehensive and accurate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think it's a good idea to split them up into several modules. I will still need to check them in detail in another round, because I don't have time to look into this in detail now.

Anyway, you can resolve this conversation when you've read it.

@Karanjot786
Copy link
Member Author

This PR addresses several issues raised in the mentor's feedback, particularly focusing on the WRROCDataWES model and its validation logic. The following changes have been made:

1. Fixed Validation Errors in WRROCDataWES

  • Updated the test data in test_validate_wrroc_wes to ensure all required fields, such as status and object, are provided with valid values.
  • Adjusted the model to allow None values where appropriate, ensuring the model can handle cases where optional fields are not present.

2. Enhanced Test Coverage

  • Modified existing tests to ensure that they accurately reflect the expected data structures and validation logic.
  • Added new tests to cover edge cases, particularly where certain fields may be optional or could be None.

3. Updated Validators

  • Improved the validate_wrroc_wes function to handle the validated data more robustly by using .dict() for correct data conversion.
  • Ensured that validation errors are properly caught and raised with informative messages.

4. Code Consistency and Cleanup

  • Ensured consistency in the use of variable names and structure across the codebase, particularly in the converters and validators.
  • Removed unnecessary comments and redundant code to improve readability and maintainability.

How to Test

  1. Run the test suite to verify that all tests pass:
    poetry run pytest --cov=crategen

Notes

This PR primarily focuses on fixing the immediate validation issues. Future PRs will address additional feedback regarding handling multiple executors and improving error messages.

@Karanjot786 Karanjot786 requested a review from uniqueg August 22, 2024 16:02
@SalihuDickson
Copy link

@Karanjot786, @uniqueg, I just merged a PR to address a few things, namely;

  1. separated the data models into separate files because the one models file was getting crowded.
  2. modified tes model to better fit with the specifications from the GAGH tes schema and added a reference for each model.
  3. configured the ruff linter to properly format and lint the code.
  4. enabled ci checks on all branches

Co-authored-by: salihuDickson <[email protected]>
@SalihuDickson
Copy link

SalihuDickson commented Aug 24, 2024

@Karanjot786, @uniqueg, I just merged a PR, that among other minor fixes, configures lefthook to carry out pre-push checks, making sure code pushed to the repo meets a certain level of quality. Currently, linting and formatting are the only checks.

@Karanjot786
Copy link
Member Author

Thank you for the updates, @SalihuDickson

I appreciate the work separating the data models into individual files; this will help keep the codebase more organized as it grows. The modifications to the TES model to better align with the GA4GH specifications are also great, and having the references directly in the models will be useful. I'll make sure to integrate these changes into my workflow. Also, the configuration of the ruff linter and enabling CI checks on all branches are much appreciated—they’ll help maintain a high standard of code quality in the future.

Adding lefthook for pre-push checks is a fantastic step toward ensuring consistent code quality. I'll make sure my code adheres to the new checks before pushing any updates. This will certainly help streamline the review process and reduce back-and-forth on formatting and linting issues.

Thanks again for implementing these improvements! I'll continue to align my work with these updates.

Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking A LOT cleaner now - thanks a lot @Karanjot786. There are still some issues though, and with the cleaner code I found a few more ones. But I expect it will be only two more rounds of changes, one for the code and one for the models (which I haven't reviewed now but will try to do in the next few days).

Comment on lines 14 to 17
- uses: actions/checkout@v2

- name: Set up Python
uses: actions/setup-python@v2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very outdated, I think the last version is v4 - or even v5. Please check and change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing out the outdated versions in the CI workflow. I’ll update actions/checkout and actions/setup-python to their latest versions (v4 or v5) as suggested. I’ll ensure this change is reflected in the .github/workflows/ci.yml file.

Comment on lines 13 to 17
Args:
data (dict): The input TES data.

def convert_to_wrroc(self, tes_data):
Returns:
dict: The converted WRROC data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please annotate types in the signature only. Do not additionally annotate them in the docstring, because there is a chance that the info in the signature and that in the docstring will stray. Better to have a single source of truth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I will remove the type annotations from the docstrings in tes_converter.py and other relevant files. I understand the importance of maintaining a single source of truth with type hints in the function signatures.


# Convert to WRROC
# Convert to WRROC format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also remove this comment. The code is not hard to read at all, and even the error message tells us very expliclity what's going on here.

The same goes for the other comment above and the remaining comments in the WES converter.

No need for comments for just a few lines of straightforward code. Only comment if something really tricky comes, like some bitwise operations or a complicated algo etc. Or if you add a workaround for a bug or vulnerability (then you can add a link to the issue). So for these kind of simple tools, basically almost never.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the guidance regarding comments. I will remove the redundant comments from the TES and WES converter files and ensure that only necessary and meaningful comments are included. I agree that the code is straightforward and self-explanatory.

{"@id": output.url, "name": output.path} for output in data_tes.outputs
],
"startTime": data_tes.creation_time,
"endTime": end_time,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to validate your wrroc_data before you return it? Or do you do this somewhere else upstream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I’ll add a validation step for wrroc_data in the convert_to_wrroc method of tes_converter.py to ensure the data conforms to the expected schema before returning it.

"inputs": [{"url": obj.id, "path": obj.name} for obj in data_wrroc.object],
"outputs": [{"url": res.id, "path": res.name} for res in data_wrroc.result],
"creation_time": data_wrroc.startTime,
"logs": [{"end_time": data_wrroc.endTime}],
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: Make sure that not only the input to the conversion is validated, but also the output.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged! I will also add validation for the output data in the convert_from_wrroc method in tes_converter.py. This will ensure the data is correctly validated after conversion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is empty now, guess you can remove it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the initial review of the model splits. I will await your detailed review to see if any further changes are needed. Meanwhile, I'll ensure the current organization is as clean and logical as possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think it's a good idea to split them up into several modules. I will still need to check them in detail in another round, because I don't have time to look into this in detail now.

Anyway, you can resolve this conversation when you've read it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see validate_wrroc_wes() and validate_wrroc_tes() anywhere being called in the codebase. Can you please send me a permalink to the exact locations where they are called? Probably I'm just confused...

lefthook.yml Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure you want/need this file under version control? Is it perhaps sth that is just for you?

Copy link
Member Author

@Karanjot786 Karanjot786 Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will consider whether the lefthook.yml file is necessary for version control.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes @uniqueg, I actually added the lefthook.yml file. It adds the pre-push check to make sure the code meets certain quality standards before it can be pushed. It's under version control because I want it to be applied with every contributor.

Comment on lines 23 to 25
# Convert '@id' to 'id' for validation purposes
if '@id' in data:
data['id'] = data.pop('@id')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where in the models are you actually using @id aliases now? I didn't look at the models in depth, but I didn't see them (or any import of Fields or the use of alias). Rather, it looks like the models still just use id? Again, maybe I'm confused, so a pointer would help.

@Karanjot786
Copy link
Member Author

Hi @uniqueg,
Here are the locations where the validate_wrroc_tes() and validate_wrroc_wes() functions are used in the codebase:

  1. validate_wrroc_tes() is used in the test file to validate WRROC entities for TES conversion. You can find its usage here: validate_wrroc_tes() Usage.

  2. validate_wrroc_wes() is similarly used in the test file for validating WRROC entities for WES conversion. Its usage can be seen here: validate_wrroc_wes() Usage.

These functions are primarily used in the test suite to ensure that the data conforms to the expected structures for TES and WES conversions.

@Karanjot786
Copy link
Member Author

Hi @uniqueg ,

Thank you for pointing this out. I have reviewed the current state of our models and codebase. The code snippet related to converting @id to id for validation purposes has already been removed in the recent updates to the codebase.

Currently, the models do not explicitly use aliases such as @id. Instead, they utilize straightforward field names like id without any aliasing.

If there's a need to introduce field aliases using Pydantic’s Field with the alias parameter, I can update the models accordingly. However, since we are not using aliases in the current implementation, there isn't any related code or import of Field from Pydantic for aliasing.

Please let me know if you would like me to implement aliases or if any other changes are required!

@uniqueg
Copy link
Member

uniqueg commented Aug 30, 2024

Hi @uniqueg, Here are the locations where the validate_wrroc_tes() and validate_wrroc_wes() functions are used in the codebase:

  1. validate_wrroc_tes() is used in the test file to validate WRROC entities for TES conversion. You can find its usage here: validate_wrroc_tes() Usage.
  2. validate_wrroc_wes() is similarly used in the test file for validating WRROC entities for WES conversion. Its usage can be seen here: validate_wrroc_wes() Usage.

These functions are primarily used in the test suite to ensure that the data conforms to the expected structures for TES and WES conversions.

Thanks a lot! I hadn't checked the test files.

Generally, if code is only used in tests, it should not be in the main codebase - rather you want to keep it in the test folder. And consider creating a pytest fixture from it.

Here though, I think the validator code is useful and should be used in the main code base. You should never return unvalidated code. And you should always validate inputs. So consider where in the codebase you wand/need to add these validations.

@uniqueg
Copy link
Member

uniqueg commented Aug 30, 2024

Hi @uniqueg ,

Thank you for pointing this out. I have reviewed the current state of our models and codebase. The code snippet related to converting @id to id for validation purposes has already been removed in the recent updates to the codebase.

Currently, the models do not explicitly use aliases such as @id. Instead, they utilize straightforward field names like id without any aliasing.

If there's a need to introduce field aliases using Pydantic’s Field with the alias parameter, I can update the models accordingly. However, since we are not using aliases in the current implementation, there isn't any related code or import of Field from Pydantic for aliasing.

Please let me know if you would like me to implement aliases or if any other changes are required!

Well, you want the resulting WRROC objects to be valid - which (as far as I know) requires @id, not id. So somewhere you need to make sure that the correct property names are used, including @ if required - both when WRROC inputs are validated and when WRROC outputs are generated and validatedd.

@Karanjot786
Copy link
Member Author

Karanjot786 commented Aug 30, 2024

Summary:

This PR addresses several updates and improvements based on recent mentor feedback:

1. CI Workflow Updates:

  • Updated the CI workflow in .github/workflows/ci.yml to use the latest versions of actions/checkout and actions/setup-python (upgraded from v2 to the latest v4).

2. Code Cleanup and Refinements:

  • Removed redundant type annotations from the docstrings in crategen/converters/tes_converter.py and crategen/converters/wes_converter.py. Type hints are now only present in function signatures.
  • Removed unnecessary comments that described straightforward code in both TESConverter and WESConverter.
  • Added validation steps for wrroc_data before returning it in the convert_to_wrroc method of TESConverter and ensured the output data is validated after conversion in the convert_from_wrroc method.
  • Added Google-style docstrings and type hints to all modules, classes, methods, and functions within the TESConverter and WESConverter.

3. Consistency Improvements:

  • Updated the method signature in AbstractConverter to ensure consistency, changing convert_from_wrroc(self, wrroc_data) to convert_from_wrroc(self, data).

4. Validator Integration:

  • Integrated validate_wrroc_tes() and validate_wrroc_wes() validators into the appropriate locations within crategen/converters/tes_converter.py and crategen/converters/wes_converter.py to ensure data validation before and after conversions.

5. Validator Usage Clarification:

  • Provided clarification and permalinks to where validate_wrroc_wes() and validate_wrroc_tes() are being used in the codebase.

6. Alias Usage Update:

  • Confirmed that the code for converting @id to id for validation purposes has been removed. The models currently use straightforward field names without aliasing. Updated the WRROC models to include Field definitions with aliases (e.g., id to @id) where necessary to conform to the WRROC schema requirements.

7. Test Case Adjustments:

  • Updated the test cases in tests/unit/test_wrroc_models.py to use field aliases (e.g., @id instead of id) in the test data, ensuring compatibility with the updated model definitions that require these aliases.

@Karanjot786 Karanjot786 requested a review from uniqueg August 30, 2024 16:22
Co-authored-by: salihuDickson <[email protected]>
@Karanjot786
Copy link
Member Author

Hi @SalihuDickson,

I encountered a build failure during the recent CI run due to multiple errors in the WESConverter code, specifically in crategen/converters/wes_converter.py. Here are the errors that were flagged:

  • Attribute Errors:

    • Item "None" of "Log | None" has no attribute name, start_time, end_time (lines 32, 34, 35).
    • The result seems to be expecting objects with attributes location and name, but a string is being passed.
  • Missing Arguments in Log:

    • On line 68, multiple required arguments (cmd, stdout, stderr, exit_code, and system_logs) are missing when constructing the Log object.
  • Incompatible Types:

    • start_time and end_time are being passed as str | None but expected as datetime | None.
    • workflow_engine_parameters, workflow_engine, workflow_engine_version, and workflow_url are missing from the RunRequest.
    • The outputs attribute in WESData expects a dictionary (dict[str, str]), but a list of WESOutputs objects is being passed.
  • Incompatible State Type:

    • The state in WESData expects a State | None, but a str is being passed.
  • Other Issues:

    • Some fields are missing when constructing the WESData object, including task_logs_url.

Here's how I think we can fix these issues:

  1. Ensure that when Log or TaskLog is created, all required fields (cmd, stdout, stderr, exit_code, and system_logs) are provided.
  2. Convert start_time and end_time to datetime where applicable before assigning them.
  3. Ensure outputs is passed as dict[str, str] instead of a list of WESOutputs objects.
  4. Ensure that the correct State enum is passed instead of a raw string for state.
  5. Complete the RunRequest by providing the necessary fields such as workflow_engine_parameters, workflow_engine, workflow_engine_version, and workflow_url.
  6. Make sure task_logs_url is passed to WESData during creation.

Please let me know your thoughts on this or if you have any suggestions on how best to resolve these errors.

Thank you!

Best regards,
Karanjot

@SalihuDickson
Copy link

@Karanjot786, thank you for bringing this to my attention, I am working on a fix.

Copy link

@SalihuDickson SalihuDickson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just need some clarification on these points, you don't need to make any changes to the code just yet.

"status": data_wes.state,
"startTime": convert_to_iso8601(data_wes.run_log.start_time),
"endTime": convert_to_iso8601(data_wes.run_log.end_time),
"result": [

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputs in the wes schema and according to the Workflow run ro-crate schema, result can be an object so there is no reason for this to be a list.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the WES specification does not provide any specific keys for the output object, can you please provide a reference as to where you got the location and name keys for the output object.

Co-authored-by: salihuDickson <[email protected]>
@SalihuDickson
Copy link

Hey @Karanjot786, can you please walk me through the use of the crategen/validators.py module. Specifically how the work being done there is any different from the validators in the pydantic models.

Please don't make any changes yet, I simply want to get a better understanding of why we have the validators module because I feel like I might be missing something.

@Karanjot786
Copy link
Member Author

Hi @SalihuDickson,

Thank you for your feedback! I'd be happy to clarify the points you've mentioned.

  • WES Converter Output Object:

    • You pointed out that in the WES schema, the result should not be a list, and you're correct. The original implementation does handle result as a list, which isn't fully aligned with the Workflow run RO-Crate schema. I'll look into adjusting this so that the result conforms to the correct object structure.
    • As for the location and name keys, those were likely added based on earlier assumptions about the output structure. You're right that the WES specification doesn't explicitly define these keys. I will cross-check the schema more thoroughly and update accordingly once we finalize this.
  • Validators in crategen/validators.py:

    • The crategen/validators.py module is essentially providing a layer of custom validation logic on top of the Pydantic models. While the Pydantic models do validate the shape and types of the data, the validators module performs additional checks that are specific to the conversion logic and schema compatibility within our application.
    • For instance, the validators may ensure that certain fields conform to specific RO-Crate or TES/WES conversion requirements, or they may ensure compatibility between multiple models. These validations are often context-specific and may not fit neatly into the standard Pydantic model validation pipeline.
    • The key difference is that Pydantic's validation is generally focused on ensuring data integrity at the model level, whereas the custom validators help enforce business logic and additional rules necessary for proper conversion across schemas (TES, WES, RO-Crate). In some cases, these validators also provide error messages tailored to the conversion process, which might not be as easily customizable within Pydantic's validation system.

I hope this helps clear up the reason for the current setup. Please let me know if you'd like further details or if any adjustments should be made based on this explanation.

@uniqueg
Copy link
Member

uniqueg commented Sep 28, 2024

Hi @SalihuDickson,

Thank you for your feedback! I'd be happy to clarify the points you've mentioned.

  • WES Converter Output Object:

    • You pointed out that in the WES schema, the result should not be a list, and you're correct. The original implementation does handle result as a list, which isn't fully aligned with the Workflow run RO-Crate schema. I'll look into adjusting this so that the result conforms to the correct object structure.
    • As for the location and name keys, those were likely added based on earlier assumptions about the output structure. You're right that the WES specification doesn't explicitly define these keys. I will cross-check the schema more thoroughly and update accordingly once we finalize this.
  • Validators in crategen/validators.py:

    • The crategen/validators.py module is essentially providing a layer of custom validation logic on top of the Pydantic models. While the Pydantic models do validate the shape and types of the data, the validators module performs additional checks that are specific to the conversion logic and schema compatibility within our application.
    • For instance, the validators may ensure that certain fields conform to specific RO-Crate or TES/WES conversion requirements, or they may ensure compatibility between multiple models. These validations are often context-specific and may not fit neatly into the standard Pydantic model validation pipeline.
    • The key difference is that Pydantic's validation is generally focused on ensuring data integrity at the model level, whereas the custom validators help enforce business logic and additional rules necessary for proper conversion across schemas (TES, WES, RO-Crate). In some cases, these validators also provide error messages tailored to the conversion process, which might not be as easily customizable within Pydantic's validation system.

I hope this helps clear up the reason for the current setup. Please let me know if you'd like further details or if any adjustments should be made based on this explanation.

Thanks for the explanations, @Karanjot786!

Could you please give one (or better: two) concrete examples of things that we need to validate that we won't get for free from Pydantic's automatic validation? Also, for these examples (and more generally), is there any particular reason why we don't use Pydantic's support for writing custom validators?

Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Karanjot786 - there is a lot of good stuff in this PR, and it's getting there. However, it's just too big a PR to get it over the line, and I feel I am getting caught up in details when reviewing.

I strongly suggest we take this PR apart into several more focused ones, probably to be merged in this order:

  • style: support linting (make sure that mypy and ruff check pass)
  • ci: reformat, reset branches & upgrade actions (put just the latest .github/workflows/ci.yml without including the job to run unit tests; consider including lefthook.yml as well)
  • feat: add WES, TES & WRROC models (please include appropriate unit tests here, if applicable; once you do add your first tests, add a job to the CI to run them)
  • feat: integrate model validation in conversions (you can use this PR here; basically, the models should be refactored to use the new models for validation; please leave out any custom validators - we will either add these in the next PR or not at all - see separate discussion thread started by @SalihuDickson)

For changes to pyproject.toml, please use your best judgement where to put them (possibly to multiple PRs).

Some of these may not require any changes from the current state - others just a few. The models one and the last one - where everything comes together - will probably still need some work though, but will become much easier to manage when there are only a few files affected at any given time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine, but with the exception of the last couple of lines, the changes here are not related to this PR. And given that the PR is already huge and difficult to review, I would strongly suggest not to blow it up with anything that isn't directly related to its core focus.

My suggestion:

  • Open a new PR ci: reformat, reset branches & upgrade actions (or similar) from a branch created from the default branch that introduces all of the changes introduced here, except for adding the job for running the unit tests (however, do not include the commented stuff, just remove it for now). We can then merge this PR immediately and you can merge the changes into this branch.
  • In this branch, you then just add the job for running the unit tests.

In this way, we already merge some unrelated changes from this PR and make it a bit smaller.

I will probably suggest some more such points, because this PR is dragging along waaay too long, which often happens when there are too many different things being cobbled together.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll move the changes related to .github/workflows/ci.yml into a new PR, as suggested. I’ll only include the job for running the unit tests in this current PR. This should make the current PR smaller and easier to review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, all the changes introduced here are linting/style-related changes. They are all fine, and I do think they should be made, but please not in this PR.

Please create a separate PR for linting/style changes, where you can do reformatting changes across the entire codebase. I suggest you create that one ASAP, we quickly review and merge it into the default branch, and then you can merge the default branch into this branch to get rid of many of the suggested changes in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll create a separate PR for the linting and style-related changes in crategen/cli.py and other files, as advised. This will help keep the PR focused on core functionality.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/format change unrelated to this PR: See above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll move the style and formatting changes from crategen/converter_manager.py into the linting/style PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just style/format changes: See above

Exception: The wrroc_data to data renaming should stay in this PR (or again a separate PR, I will tell you later when I've looked through everything).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll retain the wrroc_data to data renaming in this PR, as you mentioned. I’ll move the other style/format changes into a separate PR for linting and formatting.

TESInput(url=AnyUrl(url=obj.id), path=obj.name) for obj in data_wrroc.object
]
tes_outputs = [
TESOutput(url=AnyUrl(url=data_wrroc.result.id), path=data_wrroc.result.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that we can only have one .result in data_wrroc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I’ll verify if data_wrroc.result can have multiple items. If needed, I’ll adjust the logic to handle multiple results or return an error if more than one result is provided.

Comment on lines +86 to +88
tes_executors = [
TESExecutor(image=data_wrroc.instrument or "", command=[])
] # Provide default empty list for command
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two points:

  • Where would we source the command from? Currently it seems that the command would always be an empty list, meaning that no reasonable TES task could ever be created here.
  • Similarly, I think if there is no command to run or no environment/image to run it in, I don't think it makes sense to continue. Rather, we should exit with a helpful error message that no TES task can be created from the specified WRROC entity, because no image or command are specified.

Copy link
Member Author

@Karanjot786 Karanjot786 Sep 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your points. I’ll update the logic

  1. Source the command from the input data (e.g., data_wrroc).
  2. Add an error message if no command or environment is provided, rather than proceeding with an invalid TES task.

Raises:
ValidationError: If TES data is invalid.
"""
# Validate TES data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all comments from the code in this entire module, I think the code is trivial enough to understand without comments (and the comments are distracting and sometimes misleading).

Please also remove all trivial comments from any other modules. I would say that this should be all or almost all comments. There are no complex algorithms in the project and no workarounds for bugs in other libraries etc so far (as far as I remember). If you feel that your code is too complex to understand just by reading it and the associated docstring, consider refactoring it to make it simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll remove the comments from crategen/converters/tes_converter.py and other modules as requested. I’ll focus on refactoring the code where necessary to ensure clarity without comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR - looks like this is related to linting, so please put in a dedicated PR (see comments above).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll move the lefthook.yml changes into the dedicated linting/style PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look unrelated to this PR for the most part. Please put in the appropriate PR (or multiple PRs).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll split the changes in pyproject.toml into relevant PRs based on the focus of each change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prepare a separate PR for the models. See general comment on PR for details.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll prepare a separate PR for the models as per your suggestion. This should help keep the PR focused and make it easier to manage.

@Karanjot786
Copy link
Member Author

Hi @SalihuDickson,
Thank you for your feedback! I'd be happy to clarify the points you've mentioned.

  • WES Converter Output Object:

    • You pointed out that in the WES schema, the result should not be a list, and you're correct. The original implementation does handle result as a list, which isn't fully aligned with the Workflow run RO-Crate schema. I'll look into adjusting this so that the result conforms to the correct object structure.
    • As for the location and name keys, those were likely added based on earlier assumptions about the output structure. You're right that the WES specification doesn't explicitly define these keys. I will cross-check the schema more thoroughly and update accordingly once we finalize this.
  • Validators in crategen/validators.py:

    • The crategen/validators.py module is essentially providing a layer of custom validation logic on top of the Pydantic models. While the Pydantic models do validate the shape and types of the data, the validators module performs additional checks that are specific to the conversion logic and schema compatibility within our application.
    • For instance, the validators may ensure that certain fields conform to specific RO-Crate or TES/WES conversion requirements, or they may ensure compatibility between multiple models. These validations are often context-specific and may not fit neatly into the standard Pydantic model validation pipeline.
    • The key difference is that Pydantic's validation is generally focused on ensuring data integrity at the model level, whereas the custom validators help enforce business logic and additional rules necessary for proper conversion across schemas (TES, WES, RO-Crate). In some cases, these validators also provide error messages tailored to the conversion process, which might not be as easily customizable within Pydantic's validation system.

I hope this helps clear up the reason for the current setup. Please let me know if you'd like further details or if any adjustments should be made based on this explanation.

Thanks for the explanations, @Karanjot786!

Could you please give one (or better: two) concrete examples of things that we need to validate that we won't get for free from Pydantic's automatic validation? Also, for these examples (and more generally), is there any particular reason why we don't use Pydantic's support for writing custom validators?

@uniqueg, Thank you for your feedback!

Two examples of validations we don’t get from Pydantic’s automatic validation include:

1. Content and URL Mutual Exclusion: In TESInput, if the content field is set, the URL field should be ignored, and vice versa. This logic requires validation beyond simple type
2. Path Validation: The path field in TESInput and TESOutput must always contain an absolute path. We are enforcing this with custom logic that ensures the value is validated beyond type constraints.

Regarding Pydantic's custom validators, we could explore leveraging them for these validations. If you agree, I can refactor the code to use Pydantic's custom validation mechanism.

@Karanjot786
Copy link
Member Author

Hi @Karanjot786 - there is a lot of good stuff in this PR, and it's getting there. However, it's just too big a PR to get it over the line, and I feel I am getting caught up in details when reviewing.

I strongly suggest we take this PR apart into several more focused ones, probably to be merged in this order:

* `style: support linting` (make sure that `mypy` and `ruff check` pass)

* `ci: reformat, reset branches & upgrade actions` (put just the latest `.github/workflows/ci.yml` without including the job to run unit tests; consider including `lefthook.yml` as well)

* `feat: add WES, TES & WRROC models` (please include appropriate unit tests here, if applicable; once you do add your first tests, add a job to the CI to run them)

* `feat: integrate model validation in conversions` (you can use this PR here; basically, the models should be refactored to use the new models for validation; please leave out any custom validators - we will either add these in the next PR or not at all - see separate discussion thread started by @SalihuDickson)

For changes to pyproject.toml, please use your best judgement where to put them (possibly to multiple PRs).

Some of these may not require any changes from the current state - others just a few. The models one and the last one - where everything comes together - will probably still need some work though, but will become much easier to manage when there are only a few files affected at any given time.

@uniqueg, Thanks for the suggestion. I understand the challenge of reviewing a large PR, and I agree that splitting the changes will make the process smoother.

I will break the current PR into smaller, focused ones in the following order:

1. style: Support for linting and ensuring mypy and ruff pass.
2. ci: Reformat, reset branches, and upgrade actions.
3. feat: Add WES, TES, and WRROC models, including relevant unit tests.
4. feat: Integrate model validation into conversions.

For changes to pyproject.toml, I’ll distribute them across relevant PRs as you suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants