Pretty printing dataset #3987

Open
wants to merge 25 commits into main

Conversation

@ElenaKhaustova ElenaKhaustova commented Jul 3, 2024

Description

Implementation of the one-line approach described in #3980.

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team


# not a dictionary
return str(obj)
return f"{type(self).__module__}.{type(self).__name__}({', '.join(str_keys)})"
Member:

I think the issue with this is that all of the pretty-printed subkeys will be on a single line, and no formatting/wrapping is applied there. By using PrettyPrinter directly, you can get something pretty similar, and also get the formatting.

Happy to share some more context/pointers on this tomorrow or next week, if you'd like.
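
For illustration, a minimal sketch of the kind of formatting pprint gives; the description dict, width, and key names below are assumptions for the example, not code from this PR:

    import pprint

    # Hypothetical description dict, similar in shape to what _describe() might return
    description = {
        "filepath": "s3://my-bucket-name/path/to/folder",
        "load_args": {"delimiter": ",", "encoding": "utf-8"},
        "save_args": {"index": False},
    }

    # PrettyPrinter wraps long or nested values across lines instead of one long line
    printer = pprint.PrettyPrinter(indent=1, width=60, sort_dicts=False)
    print(printer.pformat(description))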

@ElenaKhaustova (Contributor Author) replied Jul 4, 2024:

I played a bit with PrettyPrinter and different indentations, but in the end I preferred the one-line version. There shouldn't be too many keys, as they are a subset of the class constructor's input arguments passed via _describe(). I was planning to have this as a bare minimum, but let me add a second option with indentation so we can choose which one looks better.

Happy to discuss them tomorrow.

@ElenaKhaustova (Contributor Author):

The two options considered are shown in #3980.

@ElenaKhaustova marked this pull request as ready for review July 8, 2024 19:31
@ElenaKhaustova mentioned this pull request Jul 8, 2024
@deepyaman (Member) left a comment:

What does the pretty repr for a PartitionedDataset look like? E.g. from the docs:

# conf/base/catalog.yml

my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder
  dataset:  # full dataset config notation
    type: pandas.CSVDataset
    load_args:
      delimiter: ","
    save_args:
      index: false
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
  filepath_arg: filepath  # the argument of the dataset to pass the filepath to
  filename_suffix: ".csv"


# not a dictionary
return str(obj)
return f"{type(self).__module__}.{type(self).__name__}({', '.join(str_keys)})"
Member:

If it comes from kedro_datasets, I feel like it would be much cleaner to have the short name users can provide (i.e. pandas.CSVDataset instead of kedro_datasets.pandas.csv_dataset.CSVDataset).

Similarly, from Kedro core, MemoryDataset, etc. seems much easier to read than the full path.
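
As a rough sketch of that idea (a hypothetical helper, not part of this PR), the short name could be derived from the module path:

    def short_name(cls: type) -> str:
        """Hypothetical: 'kedro_datasets.pandas.csv_dataset.CSVDataset' -> 'pandas.CSVDataset'."""
        module = cls.__module__
        if module.startswith("kedro_datasets."):
            subpackage = module.split(".")[1]      # e.g. "pandas"
            return f"{subpackage}.{cls.__name__}"  # e.g. "pandas.CSVDataset"
        if module.startswith("kedro.io"):
            return cls.__name__                    # e.g. "MemoryDataset"
        return f"{module}.{cls.__name__}"          # fall back to the full path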

@ElenaKhaustova (Contributor Author):

I agree with your points but decided to keep the full path, so __repr__ is as unambiguous as possible and one can easily understand where to look for the implementation of the printed dataset.

Ideally, it would be good to have some "short=True" flag for the representation that can be set by the user. But by default, I prefer to keep the full module name.

Curious what others think.

Member:

Happy to hear what others think, but IMO we don't actually use the module name in the docs and examples. The standard way people write entries in the catalog is the "short" version, and I think that should be reflected.

@ElenaKhaustova (Contributor Author):

We could also consider adding __str__ with a short representation, so that __repr__ stays the actual representation and __str__ is adapted for printing.
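
A minimal sketch of that split, purely illustrative and not this PR's implementation:

    class CSVDataset:
        def __init__(self, filepath: str):
            self._filepath = filepath

        def __repr__(self) -> str:
            # Full, unambiguous path for debugging and logs
            return (
                f"{type(self).__module__}.{type(self).__name__}"
                f"(filepath={self._filepath})"
            )

        def __str__(self) -> str:
            # Short, catalog-style form for printing
            return f"{type(self).__name__}(filepath={self._filepath})"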

kedro/io/core.py (outdated)

if value is not None  # 3
def _pretty_repr(self, object_description: dict[str, Any]) -> str:
    str_keys = []
    for arg_name in sorted(object_description, key=lambda key: str(key)):
Member:

The key is already a string, right?

Member:

Suggested change:
-for arg_name in sorted(object_description, key=lambda key: str(key)):
+for key, value in sorted(object_description.items()):

feels a lot more Pythonic

Nit: obj is a perfectly Pythonic name for these cases, and easier than object_description IMO

kedro/io/core.py (outdated, resolved)

if object_description[arg_name] is not None:
    descr = pprint.pformat(
        object_description[arg_name],
        sort_dicts=False,
Member:

Suggested change:
-sort_dicts=False,

Although, why sort the top-level keys but not the subkeys?

@ElenaKhaustova (Contributor Author):

I wanted to keep them in the order provided in the config, so I'm removing the top-level key sorting for consistency.

@datajoely (Contributor):

Can you include an example of what it looks like?

@ElenaKhaustova (Contributor Author):

my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder
  dataset:  # full dataset config notation
    type: pandas.CSVDataset
    load_args:
      delimiter: ","
    save_args:
      index: false
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
  filepath_arg: filepath  # the argument of the dataset to pass the filepath to
  filename_suffix: ".csv"

Here is how _describe() is implemented, so credentials are omitted: https://github.com/kedro-org/kedro-plugins/blob/be99fecf6cf5ac8f6a0a717c56b06dbc148b26eb/kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py#L318

[Screenshot: resulting PartitionedDataset repr, 2024-07-09 12:23]

@ElenaKhaustova (Contributor Author) commented Jul 9, 2024:

> Can you include an example of what it looks like?

We used the one-line approach described here: #3980 (comment)

Here is the updated representation after removing sorting:

IPython
[Screenshot: 2024-07-09 12:30]

VS Code notebooks
[Screenshot: 2024-07-09 15:15]

Jupyter Notebook 7.0+
[Screenshot: 2024-07-09 15:18]

@astrojuanlu, @merelcht, @datajoely

Edit: updated representations

@@ -227,32 +229,23 @@ def save(self, data: _DI) -> None:
message = f"Failed while saving data to data set {str(self)}.\n{str(exc)}"
raise DatasetError(message) from exc

def __str__(self) -> str:
Member:

IIUC, we're removing the __str__ format right? Shouldn't we keep it around for backwards compatibility?

@ElenaKhaustova (Contributor Author) replied Jul 9, 2024:

If __str__ is not defined, Python falls back to __repr__, so it shouldn't be a problem.

[Screenshot: 2024-07-09 14:20]
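
A quick illustration of that default behaviour (plain Python, not PR code):

    class Example:
        def __repr__(self) -> str:
            return "Example(answer=42)"

    obj = Example()
    print(repr(obj))  # Example(answer=42)
    print(str(obj))   # Example(answer=42) -- str() falls back to __repr__
    print(f"{obj}")   # the same fallback applies to f-strings and format()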

Member:

Oh right. Still, we're changing the output of str(dataset), right? Not sure if someone depends on str(dataset); it looks like a small, preventable breaking change.

@deepyaman (Member):

> Can you include an example of what it looks like?
>
> We used the one-line approach described here: #3980 (comment)
>
> Here is the updated representation after removing sorting: [screenshot]

It would make more sense for PartitionedDataset to have a dataset key that is formatted like a dataset repr, for consistency with CachedDataset.

For CachedDataset, the wrapped dataset's repr should also not be in quotes, I think.

@ElenaKhaustova (Contributor Author):

> It would make more sense for PartitionedDataset to have a dataset key that is formatted like a dataset repr, for consistency with CachedDataset.

Happy to do that later on.

@ElenaKhaustova (Contributor Author):

> For CachedDataset, the wrapped dataset's repr should also not be in quotes, I think.

    def _describe(self) -> dict[str, Any]:
        return {
            "dataset": self._dataset._pretty_repr(self._dataset._describe()),
            "cache": self._cache._pretty_repr(self._cache._describe()),
        }

The idea was to keep the "dataset" and "cache" representations aligned with the dataset representation, Class(arg_1=val_1, ...), so the values for the "dataset" and "cache" keys are converted to the corresponding strings. That's why I cannot easily get rid of the quotes.
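
A sketch of the effect with assumed values: because the nested reprs are already plain strings, pprint.pformat quotes them like any other string value.

    import pprint

    # Assumed values, for illustration only
    description = {
        "dataset": "kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=data.csv)",
        "cache": "kedro.io.memory_dataset.MemoryDataset()",
    }

    # The nested reprs are strings, so pformat renders them with quotes,
    # which is where the extra quotes in the CachedDataset repr come from.
    print(pprint.pformat(description, sort_dicts=False))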
