Pretty printing dataset #3987

Open
wants to merge 25 commits into main

Conversation

@ElenaKhaustova ElenaKhaustova commented Jul 3, 2024

Description

Implementation of the one-line approach described in #3980.

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team


# not a dictionary
return str(obj)
return f"{type(self).__module__}.{type(self).__name__}({', '.join(str_keys)})"
Member:

I think the issue with this is that all of the pretty-printed subkeys will be on a single line, and no formatting/wrapping is applied there. By using PrettyPrinter directly, you can get something pretty similar, and also get the formatting.

Happy to share some more context/pointers on this tomorrow or next week, if you'd like.
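
For illustration, a minimal sketch of the kind of formatting pprint gives; the description dict, width, and key names below are assumptions for the example, not code from this PR:

    import pprint

    # Hypothetical description dict, similar in shape to what _describe() might return
    description = {
        "filepath": "s3://my-bucket-name/path/to/folder",
        "load_args": {"delimiter": ",", "encoding": "utf-8"},
        "save_args": {"index": False},
    }

    # PrettyPrinter wraps long or nested values across lines instead of one long line
    printer = pprint.PrettyPrinter(indent=1, width=60, sort_dicts=False)
    print(printer.pformat(description))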

@ElenaKhaustova (Contributor Author) replied Jul 4, 2024:

I played a bit with PrettyPrinter and different indentations, but in the end I preferred the one-line version. There shouldn't be too many keys, as they are a subset of the class constructor's input arguments passed via _describe(). I was planning to have this as a bare minimum, but let me add a second option with indentation so we can choose which one looks better.

Happy to discuss them tomorrow.

@ElenaKhaustova (Contributor Author):

The two options considered are shown in #3980.

@ElenaKhaustova marked this pull request as ready for review July 8, 2024 19:31
@ElenaKhaustova mentioned this pull request Jul 8, 2024
@deepyaman (Member) left a comment:

What does the pretty repr for a PartitionedDataset look like? E.g. from the docs:

# conf/base/catalog.yml

my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder
  dataset:  # full dataset config notation
    type: pandas.CSVDataset
    load_args:
      delimiter: ","
    save_args:
      index: false
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
  filepath_arg: filepath  # the argument of the dataset to pass the filepath to
  filename_suffix: ".csv"


# not a dictionary
return str(obj)
return f"{type(self).__module__}.{type(self).__name__}({', '.join(str_keys)})"
Member:

If it comes from kedro_datasets, I feel like it would be much cleaner to have the short name users can provide (i.e. pandas.CSVDataset instead of kedro_datasets.pandas.csv_dataset.CSVDataset).

Similarly, from Kedro core, MemoryDataset, etc. seems much easier to read than the full path.
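
As a rough sketch of that idea (a hypothetical helper, not part of this PR), the short name could be derived from the module path:

    def short_name(cls: type) -> str:
        """Hypothetical: 'kedro_datasets.pandas.csv_dataset.CSVDataset' -> 'pandas.CSVDataset'."""
        module = cls.__module__
        if module.startswith("kedro_datasets."):
            subpackage = module.split(".")[1]      # e.g. "pandas"
            return f"{subpackage}.{cls.__name__}"  # e.g. "pandas.CSVDataset"
        if module.startswith("kedro.io"):
            return cls.__name__                    # e.g. "MemoryDataset"
        return f"{module}.{cls.__name__}"          # fall back to the full path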

@ElenaKhaustova (Contributor Author):

I agree with your points but decided to keep the full path, so __repr__ is as unambiguous as possible and one can easily understand where to look for the implementation of the printed dataset.

Ideally, it would be good to have some "short=True" flag for the representation that can be set by the user. But by default, I prefer to keep the full module name.

Curious what others think.

Member:

Happy to hear what others think, but IMO we don't actually use the module name in the docs and examples. The standard way people write entries in the catalog is the "short" version, and I think that should be reflected.

@ElenaKhaustova (Contributor Author):

We could also consider adding __str__ with a short representation, so that __repr__ stays the actual representation and __str__ is adapted for printing.
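
A minimal sketch of that split, purely illustrative and not this PR's implementation:

    class CSVDataset:
        def __init__(self, filepath: str):
            self._filepath = filepath

        def __repr__(self) -> str:
            # Full, unambiguous path for debugging and logs
            return (
                f"{type(self).__module__}.{type(self).__name__}"
                f"(filepath={self._filepath})"
            )

        def __str__(self) -> str:
            # Short, catalog-style form for printing
            return f"{type(self).__name__}(filepath={self._filepath})"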

kedro/io/core.py (outdated)

if value is not None  # 3
def _pretty_repr(self, object_description: dict[str, Any]) -> str:
    str_keys = []
    for arg_name in sorted(object_description, key=lambda key: str(key)):
Member:

The key is already a string, right?

Member:

Suggested change:
-for arg_name in sorted(object_description, key=lambda key: str(key)):
+for key, value in sorted(object_description.items()):

feels a lot more Pythonic

Nit: obj is a perfectly Pythonic name for these cases, and easier than object_description IMO

kedro/io/core.py (outdated, resolved)

if object_description[arg_name] is not None:
    descr = pprint.pformat(
        object_description[arg_name],
        sort_dicts=False,
Member:

Suggested change:
-sort_dicts=False,

Although, why sort the top-level keys but not the subkeys?

@ElenaKhaustova (Contributor Author):

I wanted to keep them in the order provided in the config, so I'm removing the top-level key sorting for consistency.

@datajoely (Contributor):

Can you include an example of what it looks like?

@ElenaKhaustova (Contributor Author):

my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder
  dataset:  # full dataset config notation
    type: pandas.CSVDataset
    load_args:
      delimiter: ","
    save_args:
      index: false
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
  filepath_arg: filepath  # the argument of the dataset to pass the filepath to
  filename_suffix: ".csv"

Here is how _describe() is implemented, so credentials are omitted: https://github.com/kedro-org/kedro-plugins/blob/be99fecf6cf5ac8f6a0a717c56b06dbc148b26eb/kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py#L318

[Screenshot: resulting PartitionedDataset repr, 2024-07-09 12:23]

@ElenaKhaustova (Contributor Author) commented Jul 9, 2024:

> Can you include an example of what it looks like?

We used the one-line approach described here: #3980 (comment)

Here is the updated representation after removing sorting:

IPython
[Screenshot: 2024-07-09 12:30]

VS Code notebooks
[Screenshot: 2024-07-09 15:15]

Jupyter Notebook 7.0+
[Screenshot: 2024-07-09 15:18]

@astrojuanlu, @merelcht, @datajoely

Edit: updated representations

@@ -227,32 +229,23 @@ def save(self, data: _DI) -> None:
message = f"Failed while saving data to data set {str(self)}.\n{str(exc)}"
raise DatasetError(message) from exc

def __str__(self) -> str:
Member:

IIUC, we're removing the __str__ format right? Shouldn't we keep it around for backwards compatibility?

@ElenaKhaustova (Contributor Author) replied Jul 9, 2024:

If __str__ is not defined, Python falls back to __repr__, so it shouldn't be a problem.

[Screenshot: 2024-07-09 14:20]
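
A quick illustration of that default behaviour (plain Python, not PR code):

    class Example:
        def __repr__(self) -> str:
            return "Example(answer=42)"

    obj = Example()
    print(repr(obj))  # Example(answer=42)
    print(str(obj))   # Example(answer=42) -- str() falls back to __repr__
    print(f"{obj}")   # the same fallback applies to f-strings and format()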

Member:

Oh right. Still, we're changing the output of str(dataset), right? Not sure if someone depends on str(dataset); it looks like a small, preventable breaking change.

@deepyaman (Member):

> Can you include an example of what it looks like?
>
> We used the one-line approach described here: #3980 (comment)
>
> Here is the updated representation after removing sorting: [screenshot]

It would make more sense for PartitionedDataset to have a dataset key that is formatted like a dataset repr, for consistency with CachedDataset.

For CachedDataset, the wrapped dataset's repr should also not be in quotes, I think.

@ElenaKhaustova (Contributor Author):

> It would make more sense for PartitionedDataset to have a dataset key that is formatted like a dataset repr, for consistency with CachedDataset.

Happy to do that later on.

@ElenaKhaustova (Contributor Author):

> For CachedDataset, the wrapped dataset's repr should also not be in quotes, I think.

    def _describe(self) -> dict[str, Any]:
        return {
            "dataset": self._dataset._pretty_repr(self._dataset._describe()),
            "cache": self._cache._pretty_repr(self._cache._describe()),
        }

The idea was to keep the "dataset" and "cache" representations aligned with the dataset representation, Class(arg_1=val_1, ...), so the values for the "dataset" and "cache" keys are converted to the corresponding strings. That's why I cannot easily get rid of the quotes.
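
A sketch of the effect with assumed values: because the nested reprs are already plain strings, pprint.pformat quotes them like any other string value.

    import pprint

    # Assumed values, for illustration only
    description = {
        "dataset": "kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=data.csv)",
        "cache": "kedro.io.memory_dataset.MemoryDataset()",
    }

    # The nested reprs are strings, so pformat renders them with quotes,
    # which is where the extra quotes in the CachedDataset repr come from.
    print(pprint.pformat(description, sort_dicts=False))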
