Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing a List of dataclasses #271

Open
JonasFrey96 opened this issue Jul 5, 2023 · 8 comments
Open

Parsing a List of dataclasses #271

JonasFrey96 opened this issue Jul 5, 2023 · 8 comments

Comments

@JonasFrey96
Copy link
Contributor

JonasFrey96 commented Jul 5, 2023

Hello,

it would be great if we could add support to pass a List of dataclasses:

@dataclass
class ExperimentParams(Serializable):
        @dataclass
        class LayerParams:
            name: str
            loss_scale: float = 1.0

        target_layers: List[any] = field(
            default_factory=lambda LayerParams=LayerParams: [
                LayerParams( name="loss_a", loss_scale=1.0),
                LayerParams(name="loss_b", loss_scale=2.0),
            ]
        )

Is there any way to elegantly parse such a configuration file:

python train.py --target_layers_0.loss_scale 10

This relates to:

                raise NotImplementedError(
                    f"Field {field.name} is of type {field_type}, which isn't "
                    f"supported yet. (container of a dataclass type)"
                )

If someone can point me toward what to change, I am happy to contribute.

@lebrice
Copy link
Owner

lebrice commented Jul 5, 2023

Hello again @JonasFrey96 !

There is limited support for something similar, but not exactly the same:

https://github.com/lebrice/SimpleParsing/blob/master/examples/merging/multiple_example.py

In particular, this example here might be of interest to you: https://github.com/lebrice/SimpleParsing/blob/master/examples/merging/multiple_lists_example.py

Let me know what you think!

@JonasFrey96 JonasFrey96 changed the title Parsing a List od dataclasses Parsing a List of dataclasses Jul 5, 2023
@JonasFrey96
Copy link
Contributor Author

In my specific use case, it would be great to allow for parsing a list of data classes.
This would help a lot. I tried to look into the code, but it is quite overwhelming and seems to be none trivial.

@lebrice
Copy link
Owner

lebrice commented Jul 9, 2023

@MiguelMonteiro
Copy link

Hi, I think this feature would be very useful for things like lists of data augmentations in deep learning code. Is it planned to be implemented?

@hugobb
Copy link

hugobb commented Jul 31, 2023

Hi, I'm also interested in this feature. From my understanding the idea of the feature, would be to enable this type of hierarchical dataclass definition with List.

@dataclass
class CNNStack:
    name: str = "stack"
    num_layers: int = 3
    kernel_sizes: Tuple[int, int, int] = (7, 5, 5)
    num_filters: List[int] = field(default_factory=[32, 64, 64].copy)


@dataclass
class Config:
    stack_list: List[CNNStack] = field(default_factory=[CNNStack(), CNNStack(), CNNStack()].copy)

parser = ArgumentParser(conflict_resolution=ConflictResolution.ALWAYS_MERGE)
parser.add_arguments(Config, dest="config", default=Config())
args = parser.parse_args()
print(args)

However this is not supported yet, this only works if we manually construct the parser which in my use cases is not very convenient.
I would also be interested in helping with this, but also not sure where to start.

@hugobb
Copy link

hugobb commented Jul 31, 2023

I think the challenging part and the main difference with the multiple_lists_example.py is that in the proposed feature the number of CNNStack would not be fixed but could vary depending on the number of arguments provided by user.

@lebrice
Copy link
Owner

lebrice commented Aug 1, 2023

Hey there @hugobb , @MiguelMonteiro , thanks for commenting.

I agree with you, this feature would be great to have.

@hugobb you're right: this would be challenging to implement, especially if we don't know how many dataclasses to parse in advance!

I guess a first step could be to make this work when we know the number of classes in advance, for instance something like cnn_blocks: tuple[CnnBlock, CnnBlock]
. (As far as I recall, this isnt already supported, but I might be wrong, it's been a while since I've worked on this)

Then, as a next step, we could probably figure out the number of values by inspecting the parsed values during postprocessing, or by adding an argument that can be used explicitly, like --cnn_blocks=2 or similar.

This feature will probably require a substantial refactoring of the current "reuse" feature, which is quite messy:

  • the conflict resolution mode might need to become local on a per-dataclass basis instead of global
  • the parsing functions would need to share the information of how many values are expected (IIRC this is already partly the case, but I'd like to make this more explicit)

I could potentially invest some time in order to try to get this to work, but I'd like to get a bit more info if possible before that.

Here are some design-y questions to start us off:

  • what differences would there be (if any) with the current reuse feature with respect to how the values for each dataclass are passed from the command-line?
    • How would you expect to specify a different value for one of the fields of, say, the last dataclass in the list?
  • how should the number of dataclasses to parse be determined? Should it be explicitly passed via an argument, or should it be inferred from the passed values? Will this always be possible? What should we do if/when it isn't?

Thanks again @hugobb @MiguelMonteiro @JonasFrey96 for posting. Lets keep this ball rolling.

@MiguelMonteiro
Copy link

Hi @lebrice, thanks for your answer.

what differences would there be (if any) with the current reuse feature with respect to how the values for each dataclass are passed from the command-line?

I am not sure how this feature works, sorry

How would you expect to specify a different value for one of the fields of, say, the last dataclass in the list?

Perhaps the most intuitive way is for the parser to automatically append ".1", ".2", etc... to the end of the name.
For example:
--cnn_stack.1.num_layers=2 --cnn_stack.2.num_layers=5

how should the number of dataclasses to parse be determined? Should it be explicitly passed via an argument, or should it be inferred from the passed values? Will this always be possible? What should we do if/when it isn't?

Ideally, it would be infered from the arguemnts passed from the command line.

What I am still unsure is if it would ever be possible to handle lists with different dataclasses in them. This would be very useful to configure data-augmentations for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants