
fix(schema): recursive validation of arbitrarily deep nested structure #790

Merged

merged 15 commits into dev on Sep 30, 2024

Conversation

DSuveges
Contributor

@DSuveges DSuveges commented Sep 25, 2024

✨ Context

So far the validation of datasets was not fully bulletproof: deeply nested (array of array) fields were not validated and automatically passed validation. This was an issue because the VEP parser had a bug yielding an inSilicoPredictors column with the schema array of array of struct instead of array of struct. The validation was passing, tests were passing, and the bug was discovered too late.

🛠 What does this PR implement

  • New, recursive schema comparison methods for arrays and structs.
  • Updated schema validation method in the Dataset class, relying on the new comparison methods.
  • New SchemaValidationError class that is raised upon failing validation.
  • The old schema flattening methods are kept, because they might have other uses besides schema validation.
  • Tests for the new schema comparison functions.
  • Temporarily skipping VEP parser tests.
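The core idea of the recursive comparison can be sketched without PySpark. The issue-category keys below mirror this PR's error output, but the type model (dicts for structs, tuples for arrays) and function names are a simplification for illustration, not the actual gentropy implementation:

```python
from collections import defaultdict

# Simplified type model (not the real PySpark one): a struct is a dict of
# field name -> type, an array is a ("array", element_type) tuple, and a
# leaf is a plain string such as "string" or "integer".

def type_kind(dtype):
    """Classify a type as "struct", "array", or its leaf name."""
    if isinstance(dtype, dict):
        return "struct"
    if isinstance(dtype, tuple) and dtype[0] == "array":
        return "array"
    return dtype

def compare_types(observed, expected, path, issues):
    """Recurse into one (observed, expected) type pair, recording issues."""
    observed_kind, expected_kind = type_kind(observed), type_kind(expected)
    if observed_kind != expected_kind:
        issues["columns_with_non_matching_type"].append(
            f'For column "{path}" found {observed_kind} instead of {expected_kind}'
        )
    elif observed_kind == "struct":
        compare_structs(observed, expected, f"{path}.", issues)
    elif observed_kind == "array":
        # Recurse into the element type, extending the path with "[]"
        compare_types(observed[1], expected[1], f"{path}[]", issues)

def compare_structs(observed, expected, prefix="", issues=None):
    """Recursively compare two struct schemas; return collected issues."""
    issues = issues if issues is not None else defaultdict(list)
    for name in observed.keys() - expected.keys():
        issues["unexpected_columns"].append(prefix + name)
    for name in expected.keys() - observed.keys():
        issues["missing_mandatory_columns"].append(prefix + name)
    for name in observed.keys() & expected.keys():
        compare_types(observed[name], expected[name], prefix + name, issues)
    return issues
```

With the inSilicoPredictors bug modelled this way (array of array of struct observed, array of struct expected), the recursion reports a columns_with_non_matching_type issue at the nested path instead of silently passing.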

@github-actions github-actions bot added the bug, size-M and Dataset labels Sep 25, 2024
@d0choa
Collaborator

d0choa commented Sep 25, 2024

🤯 WOW

@DSuveges DSuveges linked an issue Sep 25, 2024 that may be closed by this pull request
@DSuveges
Contributor Author

Currently there are two tests failing with SchemaValidationError:

=========================== short test summary info ============================
FAILED tests/gentropy/datasource/ensembl/test_vep_variants.py::TestVEPParser::test_extract_variant_index_from_vep - gentropy.common.schemas.SchemaValidationError: Schema validation failed for VariantIndex
Errors:
  columns_with_non_matching_type: For column "inSilicoPredictors[][]" found array instead of struct
FAILED tests/gentropy/datasource/ensembl/test_vep_variants.py::TestVEPParser::test_conversion - gentropy.common.schemas.SchemaValidationError: Schema validation failed for VariantIndex
Errors:
  columns_with_non_matching_type: For column "inSilicoPredictors[][]" found array instead of struct
===== 2 failed, 414 passed, 1 skipped, 1603 warnings in 208.82s (0:03:28) ======

Details on the error:

E           gentropy.common.schemas.SchemaValidationError: Schema validation failed for VariantIndex
E           Errors:
E             columns_with_non_matching_type: For column "inSilicoPredictors[][]" found array instead of struct
src/gentropy/dataset/dataset.py:152: SchemaValidationError

These failures are due to a bug in the VEP parser. For now, I'm skipping these tests.

@DSuveges DSuveges marked this pull request as ready for review September 25, 2024 15:52
@project-defiant
Contributor

@DSuveges If we transition to pyspark 3.5, this feature could be easily ported from https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertSchemaEqual.html

@project-defiant
Contributor

project-defiant commented Sep 27, 2024

@DSuveges unfortunately there is another issue with the schema validation

import pyspark.sql.types as t
from pyspark.sql import SparkSession

from gentropy.common.schemas import compare_struct_schemas


def test_schema_is_correct(spark: SparkSession) -> None:
    """Test schema is correct."""

    # fmt: off
    schema = t.StructType(
        [
            t.StructField(
                "arr",
                t.ArrayType(
                    t.StructType(
                        [
                            t.StructField("a", t.StringType()),
                            t.StructField("b", t.IntegerType())
                        ]
                    )
                )
            ),
            t.StructField("id", t.IntegerType())
        ]
    )
    schema2 = t.StructType(
        [
            t.StructField(
                "arr",
                t.ArrayType(
                    t.StructType(
                        [
                            t.StructField("b", t.IntegerType()),
                            t.StructField("a", t.StringType())
                        ]
                    )
                )
            ),
            t.StructField("id", t.IntegerType())
        ]
    )
    df1 = spark.createDataFrame([([("a", 1,)], 1),], schema=schema)
    df2 = spark.createDataFrame([([(1,"a",)], 1),], schema=schema2)
    diff = compare_struct_schemas(
        observed_schema=df1.schema,
        expected_schema=df2.schema,
    )
    assert len(diff) != 0

The above test does not fail, even though the schemas are almost the same: only the nested struct fields are not ordered the same way. This is the issue the variant index schema is facing now as well.

vep_output_json_path = "gs://ot_orchestration/releases/26.09/variants/annotated_variants"
variant_index_path = "gs://ot_orchestration/releases/26.09/variant_index"
gnomad_variant_annotations_path = "gs://genetics_etl_python_playground/static_assets/gnomad_variants"
hash_threshold = 300

from gentropy.dataset.variant_index import VariantIndex
from gentropy.datasource.ensembl.vep_parser import VariantEffectPredictorParser

variant_index = VariantEffectPredictorParser.extract_variant_index_from_vep(
    session.spark, vep_output_json_path, hash_threshold
)


annotations = VariantIndex.from_parquet(
    session=session,
    path=gnomad_variant_annotations_path,
    recursiveFileLookup=True,
)
variant_index.df.printSchema()
annotations.df.printSchema()

results in

root
 |-- variantId: string (nullable = false)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- hgvsId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- variantFunctionalConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- consequenceScore: float (nullable = true)
 |    |    |-- aminoAcidChange: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = false)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distanceFromFootprint: long (nullable = true)
 |    |    |-- distanceFromTss: long (nullable = true)
 |    |    |-- appris: string (nullable = true)
 |    |    |-- maneSelect: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |    |    |-- transcriptIndex: integer (nullable = false)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- alleleFrequencies: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)

root
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- variantFunctionalConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- aminoAcidChange: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = true)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distanceFromFootprint: long (nullable = true)
 |    |    |-- distanceFromTss: long (nullable = true)
 |    |    |-- appris: string (nullable = true)
 |    |    |-- maneSelect: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |    |    |-- consequenceScore: float (nullable = true)
 |    |    |-- transcriptIndex: integer (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hgvsId: string (nullable = true)
 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)

The transcriptConsequences.consequenceScore and hgvsId fields are misaligned between the two schemas. The VEP variant index was generated with the code that patched inSilicoPredictors.

Edit: looking back at the code and the schemas, there are even more mismatches, e.g. transcriptConsequences.transcriptIndex.

@DSuveges
Contributor Author

I most likely misunderstand something, but when you say

The above test does not fail, although the schema is almost the same, but the nested struct fields are not ordered correctly. This is the issue the variant index schema is facing now as well.

Of course it doesn't fail: you are testing for finding a difference, and the test indeed finds one (because the schemas are different), so it passes.

# What the comparison returns:
diff

contains:

defaultdict(list, {'unexpected_columns': ['arr']})

Which indicates there's a difference in the schema, so the assertion assert len(diff) != 0 evaluates to True.

@project-defiant
Contributor

project-defiant commented Sep 27, 2024

schema = t.StructType(
    [
        t.StructField(
            "arr",
            t.ArrayType(
                t.StructType(
                    [
                        t.StructField("a", t.StringType()),
                        t.StructField("b", t.IntegerType())
                    ]
                )
            )
        ),
        t.StructField("id", t.IntegerType())
    ]
)
schema2 = t.StructType(
    [
        t.StructField(
            "arr",
            t.ArrayType(
                t.StructType(
                    [
                        t.StructField("b", t.IntegerType()),
                        t.StructField("a", t.StringType())
                    ]
                )
            )
        ),
        t.StructField("id", t.IntegerType())
    ]
)
df1 = spark.createDataFrame([([("a", 1,)], 1),], schema=schema)
df2 = spark.createDataFrame([([(1, "a",)], 1),], schema=schema2)
diff = compare_struct_schemas(
    observed_schema=df1.schema,
    expected_schema=df2.schema,
)
You are right, there should be arr columns in both; I have fixed the comment.

@DSuveges
Contributor Author

As discussed offline:

  • The schema validation doesn't take the order of struct fields into account.
  • For most applications this doesn't matter; however, you can't merge two arrays of structs if the order of the fields doesn't match.
  • The proposed solution is to extend the safe_array_sort function with extra logic that sorts fields alphabetically.
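The normalisation proposed above can be sketched in a PySpark-free way. As before, the type model (dicts for structs, single-element lists for arrays) and the function name are illustrative assumptions, not the actual gentropy code:

```python
def sort_fields(dtype):
    """Return a copy of a type with struct fields in alphabetical order.

    Structs are modelled as dicts (field name -> type), arrays as
    single-element lists holding the element type, leaves as strings.
    """
    if isinstance(dtype, dict):
        # Struct: rebuild with field names sorted, recursing into each type
        return {name: sort_fields(dtype[name]) for name in sorted(dtype)}
    if isinstance(dtype, list):
        # Array: recurse into the element type
        return [sort_fields(dtype[0])]
    return dtype  # leaf type, e.g. "string"
```

After this pass, two arrays of structs whose fields were merely ordered differently end up with identical field order at every nesting level, which is what makes the subsequent union possible.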

@d0choa
Collaborator

d0choa commented Sep 27, 2024

Try to get out of this with the minimum number of scars 😅

@DSuveges
Contributor Author

@d0choa we can just drop the entire branch, bump to pyspark 3.5 and hope for the best.

@project-defiant
Contributor

I am almost done with the fix in safe array union.

@project-defiant
Contributor

@d0choa @DSuveges fix attempt in #793 resulted in no failure in the VariantIndexStep, see :)

Contributor

@ireneisdoomed ireneisdoomed left a comment


This is great! The schema validation has become much more accurate to the actual structure (no flattening). And at the same time the implementation is easier to understand. Great testing suite too!

"""This exception is raised when a schema validation fails."""

def __init__(
self: SchemaValidationError, message: str, errors: defaultdict[str, list[str]]
Contributor


Cool. So the defaultdict acts as a dictionary that stores errors found in the schemas, but whose keys are not predetermined.
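Exactly that pattern, in a tiny standalone illustration (the category keys below are examples taken from this PR's error output, not an exhaustive list):

```python
from collections import defaultdict

# Missing keys spring into existence as empty lists, so new error
# categories can be recorded without declaring them up front.
errors: defaultdict[str, list[str]] = defaultdict(list)
errors["missing_mandatory_columns"].append("variantId")
errors["columns_with_non_matching_type"].append(
    'For column "inSilicoPredictors[][]" found array instead of struct'
)
```

A non-empty defaultdict is also truthy, so a validation routine can simply check whether any errors accumulated before raising.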

self.message = message # Explicitly set the message attribute
self.errors = errors

def __str__(self: SchemaValidationError) -> str:
Contributor


This is the method printed when you raise the exception, is that right?

Contributor Author


Yes, exactly.

)

# If element type is a struct, resolve nesting:
elif observed_type == "struct":
Contributor


To enforce both types are the same

Suggested change
elif observed_type == "struct":
elif observed_type == "struct" and expected_type == "struct":

Contributor Author


I agree, explicit is better than implicit. (The equality of the two schemas was tested already.)

)

# If element type is an array, resolve nesting:
elif observed_type == "array":
Contributor


To enforce both types are the same

Suggested change
elif observed_type == "array":
elif observed_type == "array" and expected_type == "array":

# If element type is a struct, resolve nesting:
elif observed_type == "struct":
schema_issues = compare_struct_schemas(
observed_schema.elementType,
Contributor


My mypy is raising an issue here, because both observed_schema and expected_schema are technically of type DataType, whereas the parameters have to be StructType. Are you having issues as well? I think mypy would be able to resolve it if the conditional above was made using the elementType, not the name of the type, i.e. elif observed_schema.elementType == StructType(). The same applies below when you call compare_array_schemas.

{
f"{parent_field_name}{field.name}"
for field in observed_schema
if list(observed_schema).count(field) > 1
Contributor


Really nice



def flatten_schema(schema: t.StructType, prefix: str = "") -> list[Any]:
def flatten_schema(schema: StructType, prefix: str = "") -> list[Any]:
Contributor


Since we are now parsing each schema without flattening, I'd suggest removing this function

Contributor Author


I can drop it if you think this function would not be useful in other contexts.

@DSuveges DSuveges merged commit 88f62d4 into dev Sep 30, 2024
5 checks passed
@DSuveges DSuveges deleted the ds_3545-schema-validation-misses-nested-arrays branch September 30, 2024 11:31

Successfully merging this pull request may close these issues.

Schema validation misses nested arrays