Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise the inner exception when parsing nested dataclasses #136

Open
mishamsk opened this issue Aug 3, 2023 · 3 comments
Open

Raise the inner exception when parsing nested dataclasses #136

mishamsk opened this issue Aug 3, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@mishamsk
Copy link
Contributor

mishamsk commented Aug 3, 2023

Is your feature request related to a problem? Please describe.
First, this has been discussed in #83 and dismissed. But I'd like to raise the question and propose a solution yet again.

For the context, which is the same as in #135 - I am working with deeply nested structures, millions of them. For some context - some of the serialized trees that I have are 300MB in msgpack! (that's >1GB in memory after deserialization). And they have ONE root.

So when something breaks during deserialization, the current basic functionality is unusable. Even if I could have inspected the 300MB worth of text, finding what caused an error is impossible. The outer context is not helpful, on the contrary it is detrimental. A typical example - I've changed some model and trying to code a migration from the old one.

They way I've fixed it is by using an unwrapped helper: code and doc

It is almost ideal, but there is one thing that can't be done outside of mashumaro - namely, I can't get the index of child if it is part of a sequence. Only the field name. So for model like this (omitting data class and base class for brevity):

class Child;
    attr: int

class Parent;
    children: tuple[Parent | Child, ...]

if I say pass a string that is not an integer in a leaf Child node 10 levels deep, my helper will tell me the path children.children.[8 times more].attr with mashumaro's inner exception. But I won't know the indexes along the way.

Describe the solution you'd like
At least for the InvalidFieldValue exception add path capturing, including the index of item in a sequence.

And then either:

  • make what my helper does the new behavior
  • make it a config option to let InvalidFieldValue behave in such a way for backward compatibility

Describe alternatives you've considered
See above

Additional context
I can try my hand at this if the concept seems right. Didn't want to do a PR before we discuss this

@Fatal1ty
Copy link
Owner

Fatal1ty commented Aug 5, 2023

So when something breaks during deserialization, the current basic functionality is unusable. Even if I could have inspected the 300MB worth of text, finding what caused an error is impossible. The outer context is not helpful, on the contrary it is detrimental. A typical example - I've changed some model and trying to code a migration from the old one.

This is a very good example of why this issue should be addressed. To be honest, I've come across this problem myself.

They way I've fixed it is by using an unwrapped helper: code and doc

I really like this helper. Good job!

It is almost ideal, but there is one thing that can't be done outside of mashumaro - namely, I can't get the index of child if it is part of a sequence.

I don't see any another solution other that changing the generated sequence unpacker code to something like this:

def sequence_unpacker(seq):
    result = []
    for idx, el in enumerate(seq):
        try:
            result.append(int(el))
        except Exception:
            raise SomeException(seq, pos=idx)

But it will cost us a speed reduction because in the current implementation we are getting [int(el) for el in seq]. We might introduce a new "debugability" config option for those who want sacrifice performance in favour of improved error handling.

And then either:

make what my helper does the new behavior
make it a config option to let InvalidFieldValue behave in such a way for backward compatibility

I think we should start from a config option and then see how it goes. If we manage to keep sequence debugability without relying on stack trace and avoid notable impact on performance, we can make this behavior default.

I can try my hand at this if the concept seems right. Didn't want to do a PR before we discuss this

The concept is definitely right and this is a very important issue. If you need any help, let me know.

@Fatal1ty Fatal1ty added the enhancement New feature or request label Aug 5, 2023
@mishamsk
Copy link
Contributor Author

mishamsk commented Aug 7, 2023

@Fatal1ty read the code, indeed adding an index seems like:

  • (a) a lot of refactoring across a number of types
  • (b) should be slower indeed

for (b), maybe this pattern would be less degrading:

def sequence_unpacker(seq):
    result = [] 
    try:
        for idx, el in enumerate(seq):
            result.append(int(el))
    except Exception:
        raise SomeException(seq, pos=idx)

but still a real impact.

I'll draft something on one of the upcoming weekends

@Fatal1ty
Copy link
Owner

Fatal1ty commented Nov 27, 2024

I've been enjoying using plugins for JSON Schema lately and I just thought why wouldn't I add serialization / deserialization plugins. On the rights of nonsense I sketched a POC:

class SequencePlugin(BasePlugin):
    def sequence_unpacker(self, spec: ValueSpec, item_unpacker: Expression):
        lines = CodeLines()
        self.meta.ensure_object_imported(SequenceItemInvalidError)
        name = f"_unpack_sequence_{random_hex()}"
        with lines.indent(f"def {name}(seq):"):
            lines.append("result = []")
            with lines.indent("try:"):
                with lines.indent("for idx, item in enumerate(seq):"):
                    lines.append(f"result.append({item_unpacker})")
                lines.append("return result")
            with lines.indent("except Exception:"):
                lines.append("raise SequenceItemInvalidError(seq, idx, item)")
        lines.append(f"setattr({spec.cls_attrs_name}, '{name}', {name})")
        exec(lines.as_text(), spec.builder.globals, spec.builder.__dict__)
        return f"{spec.cls_attrs_name}.{name}"

    def get_unpacker(self, spec: ValueSpec) -> Optional[Expression]:
        if ensure_generic_collection_subclass(spec, Sequence):
            args = get_args(spec.type)
            item_unpacker = spec.unpacker_registry.get(
                spec.copy(
                    type=args[0],
                    expression="item",
                    could_be_none=True,
                    field_ctx=spec.field_ctx.copy(metadata={}),
                )
            )
            name = self.sequence_unpacker(spec, item_unpacker)
            return f"{name}({spec.expression})"


decoder = BasicDecoder(Sequence[int], plugins=[SequencePlugin()])
x = decoder.decode([1, 'a', 3])
Traceback (most recent call last):
  File "<string>", line 5, in _unpack_sequence_f442d14b93e64c80ba9e31e83eea6a82
ValueError: invalid literal for int() with base 10: 'a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xxx/projects/mashumaro/test.py", line 125, in <module>
    x = decoder.decode([1, "a", 3])
  File "<string>", line 8, in _unpack_sequence_f442d14b93e64c80ba9e31e83eea6a82
SequenceItemInvalidError: Item as pos 1 in sequence [1, 'a', 3] has invalid value a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants