fix(schema): recursive validation of arbitrarily deep nested structure #790
Conversation
…s://github.com/opentargets/gentropy into ds_3545-schema-validation-misses-nested-arrays
🤯 WOW
Currently there are two tests failing for SchemaValidationError:
Details on the error:
These are issues due to a bug in the VEP parser. For now, I'm skipping these tests.
@DSuveges If we transition to pyspark 3.5, this feature could be easily ported from https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertSchemaEqual.html
@DSuveges unfortunately there is another issue with the schema validation:

```python
import pyspark.sql.types as t
from pyspark.sql import SparkSession


def test_schema_is_correct(spark: SparkSession) -> None:
    """Test schema is correct."""
    from gentropy.common.schemas import compare_struct_schemas
    # fmt: off
    schema = t.StructType(
        [
            t.StructField(
                "arr",
                t.ArrayType(
                    t.StructType(
                        [
                            t.StructField("a", t.StringType()),
                            t.StructField("b", t.IntegerType())
                        ]
                    )
                )
            ),
            t.StructField("id", t.IntegerType())
        ]
    )
    schema2 = t.StructType(
        [
            t.StructField(
                "arr",
                t.ArrayType(
                    t.StructType(
                        [
                            t.StructField("b", t.IntegerType()),
                            t.StructField("a", t.StringType())
                        ]
                    )
                )
            ),
            t.StructField("id", t.IntegerType())
        ]
    )
    df1 = spark.createDataFrame([([("a", 1)], 1)], schema=schema)
    df2 = spark.createDataFrame([([(1, "a")], 1)], schema=schema2)
    diff = compare_struct_schemas(
        observed_schema=df1.schema,
        expected_schema=df2.schema,
    )
    assert len(diff) != 0
```

The above test does not fail, although the schemas are almost the same; only the nested struct fields are not ordered correctly. This is the issue the variant index schema is facing now as well.

```python
vep_output_json_path = "gs://ot_orchestration/releases/26.09/variants/annotated_variants"
variant_index_path = "gs://ot_orchestration/releases/26.09/variant_index"
gnomad_variant_annotations_path = "gs://genetics_etl_python_playground/static_assets/gnomad_variants"
hash_threshold = 300

from gentropy.dataset.variant_index import VariantIndex
from gentropy.datasource.ensembl.vep_parser import VariantEffectPredictorParser

variant_index = VariantEffectPredictorParser.extract_variant_index_from_vep(
    session.spark, vep_output_json_path, hash_threshold
)
annotations = VariantIndex.from_parquet(
    session=session,
    path=gnomad_variant_annotations_path,
    recursiveFileLookup=True,
)
variant_index.df.printSchema()
annotations.df.printSchema()
```

results in
Edit: by looking at the code and back at the schemas, there are even more mismatches.
I most likely misunderstand something, but when you say
Of course it doesn't fail, because you are testing for finding a difference, and the test indeed finds the difference (because the schemas are different), so it passes. What does the comparison return? `diff` contains:

Which indicates there's a difference in the schema. So the assertion
You are right. There should be
As discussed offline:
Try to get out of this with the minimum number of scars 😅
@d0choa we can just drop the entire branch, bump to pyspark 3.5, and hope for the best.
I am almost done with the fix in safe array union.
This is great! The schema validation has become much more accurate to the actual structure (no flattening). And at the same time the implementation is easier to understand. Great testing suite too!
```python
    """This exception is raised when a schema validation fails."""

    def __init__(
        self: SchemaValidationError, message: str, errors: defaultdict[str, list[str]]
```
Cool. So the default dict acts as a dictionary that stores errors found in the schemas but whose keys are not predetermined.
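For illustration, here is a minimal stand-alone sketch of that pattern: a `defaultdict(list)` accumulates errors under keys that are created on first access (the category names below are hypothetical examples, not the actual gentropy error categories):

```python
from collections import defaultdict

# Keys do not need to be declared up front; the first access
# to a missing key creates an empty list automatically.
errors: defaultdict[str, list[str]] = defaultdict(list)

errors["missing_mandatory_columns"].append("variantId")
errors["unexpected_columns"].append("legacyId")
errors["missing_mandatory_columns"].append("chromosome")

print(dict(errors))
# {'missing_mandatory_columns': ['variantId', 'chromosome'], 'unexpected_columns': ['legacyId']}
```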
```python
        self.message = message  # Explicitly set the message attribute
        self.errors = errors

    def __str__(self: SchemaValidationError) -> str:
```
This is the method printed when you raise the exception, is that right?
Yes, exactly.
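As a stand-alone illustration of the mechanism (a simplified sketch, not the actual gentropy class): when an exception propagates uncaught, Python renders it as the class name followed by `str(exception)`, so overriding `__str__` controls the message shown to the user.

```python
from collections import defaultdict


class SchemaValidationError(Exception):
    """Raised when a schema validation fails (simplified sketch)."""

    def __init__(self, message: str, errors: defaultdict[str, list[str]]) -> None:
        super().__init__(message)
        self.message = message
        self.errors = errors

    def __str__(self) -> str:
        # Render the accumulated errors grouped by category.
        details = "; ".join(f"{k}: {v}" for k, v in self.errors.items())
        return f"{self.message} ({details})"


errors: defaultdict[str, list[str]] = defaultdict(list)
errors["duplicated_columns"].append("arr.a")
print(str(SchemaValidationError("Schema validation failed", errors)))
# Schema validation failed (duplicated_columns: ['arr.a'])
```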
src/gentropy/common/schemas.py
Outdated
```python
            )

        # If element type is a struct, resolve nesting:
        elif observed_type == "struct":
```
To enforce both types are the same
```diff
-        elif observed_type == "struct":
+        elif observed_type == "struct" and expected_type == "struct":
```
I agree, explicit is better than implicit. (The equality of the two schemas was tested already.)
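The failure mode the suggestion guards against can be shown with a stripped-down sketch (plain strings stand in for Spark type names, and `compare_types` is a hypothetical helper, not gentropy code): guarding on both type names makes each branch's precondition explicit instead of relying on an earlier equality check.

```python
def compare_types(observed_type: str, expected_type: str) -> list[str]:
    """Toy dispatcher over type-name pairs; returns a list of findings."""
    issues: list[str] = []
    if observed_type != expected_type:
        issues.append(f"type mismatch: {observed_type} != {expected_type}")
    # Checking BOTH names documents the branch precondition explicitly,
    # so the code stays correct even if the mismatch check above moves.
    elif observed_type == "struct" and expected_type == "struct":
        issues.append("descend into struct fields")
    elif observed_type == "array" and expected_type == "array":
        issues.append("descend into array element type")
    return issues


print(compare_types("struct", "array"))  # ['type mismatch: struct != array']
print(compare_types("array", "array"))   # ['descend into array element type']
```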
src/gentropy/common/schemas.py
Outdated
```python
            )

        # If element type is an array, resolve nesting:
        elif observed_type == "array":
```
To enforce both types are the same
```diff
-        elif observed_type == "array":
+        elif observed_type == "array" and expected_type == "array":
```
```python
        # If element type is a struct, resolve nesting:
        elif observed_type == "struct":
            schema_issues = compare_struct_schemas(
                observed_schema.elementType,
```
My mypy is raising an issue here because both `observed_schema` and `expected_schema` are technically of type `DataType`, whereas the parameters have to be `StructType`. Are you having issues as well? I think mypy would be able to resolve it if the conditional above was made using the `elementType`, not the name of the type, i.e. `elif observed_schema.elementType == StructType()`. Same applies below when you call `compare_array_schemas`.
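The narrowing point can be illustrated without PySpark (the classes below are hypothetical stand-ins for `pyspark.sql.types`): mypy cannot infer from a string comparison like `typeName() == "struct"` that a `DataType` is a `StructType`, but an `isinstance` check acts as a type guard and narrows the variable inside the branch.

```python
from __future__ import annotations


class DataType:
    """Stand-in for pyspark.sql.types.DataType."""


class StructType(DataType):
    """Stand-in for pyspark.sql.types.StructType."""


def compare_struct_schemas(
    observed: StructType, expected: StructType
) -> dict[str, list[str]]:
    # Dummy body; only the signature matters for the typing argument.
    return {}


def resolve(observed: DataType, expected: DataType) -> dict[str, list[str]] | None:
    # isinstance() is a type guard: inside this branch mypy narrows
    # `observed` and `expected` to StructType, so the call type-checks
    # without a cast, unlike a check on the type's name string.
    if isinstance(observed, StructType) and isinstance(expected, StructType):
        return compare_struct_schemas(observed, expected)
    return None


print(resolve(StructType(), StructType()))  # {}
```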
```python
            {
                f"{parent_field_name}{field.name}"
                for field in observed_schema
                if list(observed_schema).count(field) > 1
```
Really nice
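The trick generalises beyond Spark schemas; a sketch with plain tuples standing in for `StructField`s (the `parent_field_name` prefix shown as a literal): a field counts as duplicated only when the identical (name, type) pair occurs more than once, and the set comprehension keeps each offender exactly once.

```python
# Each tuple stands in for a StructField: (name, type).
observed_schema = [("a", "string"), ("b", "integer"), ("a", "string")]
parent_field_name = "arr."

# count() compares whole tuples, so only exact duplicates qualify;
# the set deduplicates the repeated matches.
duplicated = {
    f"{parent_field_name}{name}"
    for (name, dtype) in observed_schema
    if observed_schema.count((name, dtype)) > 1
}
print(duplicated)  # {'arr.a'}
```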
src/gentropy/common/schemas.py
Outdated
```diff
-def flatten_schema(schema: t.StructType, prefix: str = "") -> list[Any]:
+def flatten_schema(schema: StructType, prefix: str = "") -> list[Any]:
```
Since we are now parsing each schema without flattening, I'd suggest removing this function
I can drop it if you think this function would not be useful in other contexts.
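For reference, the general shape of such a helper can be sketched recursively over a plain nested-dict schema representation (an illustrative stand-in, not the gentropy `flatten_schema` implementation):

```python
def flatten_schema(schema: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of all leaf fields in a nested dict schema."""
    paths: list[str] = []
    for name, dtype in schema.items():
        path = f"{prefix}{name}"
        if isinstance(dtype, dict):
            # Nested struct: recurse with the extended dotted prefix.
            paths.extend(flatten_schema(dtype, prefix=f"{path}."))
        else:
            # Leaf field: record its full path.
            paths.append(path)
    return paths


nested = {"id": "integer", "arr": {"a": "string", "b": "integer"}}
print(flatten_schema(nested))  # ['id', 'arr.a', 'arr.b']
```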
✨ Context
So far the validation of datasets was not fully bulletproof: deeply nested (array of array) fields were not validated and automatically passed validation. This was an issue because the VEP parser had a bug yielding the inSilicoPredictor column with a schema of array of array of struct instead of array of struct. The validation was passing, tests were passing, and the bug was discovered too late.
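The bug class described above can be reproduced with a compact recursive comparison over a toy schema representation (dicts mark structs, one-element lists mark arrays; this is an illustrative sketch, not the gentropy implementation): an observed `array(array(struct))` against an expected `array(struct)` is only caught if the comparison recurses into element types instead of stopping at the top-level container.

```python
def compare(observed, expected, path="root"):
    """Recursively compare toy schemas: dict = struct, 1-elem list = array."""
    issues = []
    if isinstance(observed, list) and isinstance(expected, list):
        # Descend into the array element type instead of treating any
        # two arrays as equal: this is what catches array-of-array bugs.
        issues += compare(observed[0], expected[0], f"{path}[]")
    elif isinstance(observed, dict) and isinstance(expected, dict):
        for name in sorted(observed.keys() | expected.keys()):
            if name not in expected:
                issues.append(f"unexpected field {path}.{name}")
            elif name not in observed:
                issues.append(f"missing field {path}.{name}")
            else:
                issues += compare(observed[name], expected[name], f"{path}.{name}")
    elif observed != expected:
        issues.append(f"type mismatch at {path}: {observed} != {expected}")
    return issues


expected = {"inSilicoPredictors": [{"method": "string", "score": "double"}]}
buggy = {"inSilicoPredictors": [[{"method": "string", "score": "double"}]]}
print(compare(buggy, expected))  # one mismatch, at root.inSilicoPredictors[]
```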
🛠 What does this PR implement