Pretty printing dataset #3987
Signed-off-by: Elena Khaustova <[email protected]>
# not a dictionary
return str(obj)
return f"{type(self).__module__}.{type(self).__name__}({', '.join(str_keys)})"
I think the issue with this is that all of the pretty-printed subkeys will be on a single line, and no formatting/wrapping is applied there. By using PrettyPrinter directly, you can get something pretty similar, and also get the formatting.
Happy to share some more context/pointers on this tomorrow or next week, if you'd like.
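To make the trade-off concrete, here is a minimal sketch comparing a single-line join with `pprint.PrettyPrinter`. The `description` dictionary is a hypothetical stand-in for what a dataset's `_describe()` might return; it is not taken from this PR.

```python
import pprint

# Hypothetical description, standing in for the output of a dataset's _describe().
description = {
    "filepath": "data/01_raw/very/long/path/to/some/partitioned/file.csv",
    "load_args": {"delimiter": ",", "encoding": "utf-8", "header": 0},
    "save_args": {"index": False, "float_format": "%.3f"},
}

# One-line approach: every key=value pair is joined on a single line.
one_line = ", ".join(f"{key}={value!r}" for key, value in description.items())

# PrettyPrinter approach: the output is wrapped once it exceeds `width`.
printer = pprint.PrettyPrinter(indent=2, width=60, sort_dicts=False)
wrapped = printer.pformat(description)

print(one_line)
print(wrapped)
```

The one-line string grows with the number and size of the arguments, whereas the `PrettyPrinter` output breaks across lines when it would exceed the chosen width.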
I played a bit with PrettyPrinter and different indentations, but in the end I preferred the one-line output. There shouldn't be too many keys, as they are a subset of the class constructor's input arguments passed via _describe(). I was planning to have this as a bare minimum, but let me add a second option with indentation so we can choose which one looks better.
Happy to discuss them tomorrow.
Two considered options: #3980
What does the pretty repr for a PartitionedDataset look like? E.g. from the docs:
# conf/base/catalog.yml
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder
  dataset:  # full dataset config notation
    type: pandas.CSVDataset
    load_args:
      delimiter: ","
    save_args:
      index: false
  credentials: my_credentials
  load_args:
    load_arg1: value1
    load_arg2: value2
  filepath_arg: filepath  # the argument of the dataset to pass the filepath to
  filename_suffix: ".csv"
# not a dictionary
return str(obj)
return f"{type(self).__module__}.{type(self).__name__}({', '.join(str_keys)})"
If it comes from kedro_datasets, I feel like it would be much cleaner to have the short name users can provide (i.e. pandas.CSVDataset instead of kedro_datasets.pandas.csv_dataset.CSVDataset).
Similarly, from Kedro core, MemoryDataset, etc. seems much easier to read than the full path.
I agree with your points but decided to keep the full path, so that __repr__ is as unambiguous as possible and one can easily see where to look for the implementation of the printed dataset.
Ideally, it would be good to have some "short=True" flag for the representation that can be set by the user. But by default, I prefer to keep the full module name.
Curious what others think.
Happy to hear what others think, but IMO we actually don't use the module name in the docs and examples. The standard way people write entries in the catalog is the "short" version, and I think that should be reflected.
We can also consider adding __str__ with a short representation, so that __repr__ is the actual representation but __str__ is adapted for printing.
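A rough sketch of that split, for illustration only: `ExampleDataset` and its `_format_args` helper are hypothetical names and do not reflect Kedro's actual implementation. The assumption is that `__repr__` keeps the full module path while `__str__` uses only the class name.

```python
from __future__ import annotations

from typing import Any


class ExampleDataset:
    """Hypothetical stand-in for a dataset class; not Kedro's implementation."""

    def __init__(self, filepath: str, load_args: dict[str, Any] | None = None) -> None:
        self._filepath = filepath
        self._load_args = load_args

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "load_args": self._load_args}

    def _format_args(self) -> str:
        # Hypothetical helper: one-line key=value pairs, skipping None values.
        return ", ".join(
            f"{key}={value!r}"
            for key, value in self._describe().items()
            if value is not None
        )

    def __repr__(self) -> str:
        # Full path, so the reader can locate the implementation.
        return f"{type(self).__module__}.{type(self).__name__}({self._format_args()})"

    def __str__(self) -> str:
        # Short form, closer to how entries are written in the catalog.
        return f"{type(self).__name__}({self._format_args()})"


dataset = ExampleDataset("data.csv")
print(repr(dataset))
print(str(dataset))
```

With this arrangement, `print(dataset)` and f-strings pick up the short form, while the interactive prompt and debuggers show the fully qualified one.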
kedro/io/core.py
Outdated
if value is not None  # 3
def _pretty_repr(self, object_description: dict[str, Any]) -> str:
    str_keys = []
    for arg_name in sorted(object_description, key=lambda key: str(key)):
The key is already a string, right?
- for arg_name in sorted(object_description, key=lambda key: str(key)):
+ for key, value in sorted(object_description.items()):
feels a lot more Pythonic.
Nit: obj is a perfectly Pythonic name for these cases, and easier than object_description IMO.
if object_description[arg_name] is not None:
    descr = pprint.pformat(
        object_description[arg_name],
        sort_dicts=False,
sort_dicts=False,
Although, why sort the top-level key, but not subkeys?
I wanted to keep them in the order provided in the config, so removing top-level key sorting for consistency.
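The ordering behaviour under discussion can be checked with the standard library alone. In this sketch the `config` dictionary is a made-up example, not output from this PR.

```python
import pprint

# Hypothetical catalog-style config; key order mirrors how a user might write it.
config = {
    "type": "pandas.CSVDataset",
    "save_args": {"index": False, "header": True},
    "load_args": {"sep": ","},
}

# Sorting the items orders only the top level and needs mutually comparable keys.
top_level_sorted = [key for key, _ in sorted(config.items())]

# pformat sorts keys at every level by default (sort_dicts=True).
sorted_text = pprint.pformat(config)

# sort_dicts=False keeps insertion order at every level.
ordered_text = pprint.pformat(config, sort_dicts=False)

print(top_level_sorted)
print(sorted_text)
print(ordered_text)
```

Mixing a sorted top level with `sort_dicts=False` for nested values gives an inconsistent result, which is why dropping the top-level sort keeps everything in config order.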
Can you include an example of what it looks like?
Here is how
We used the one-line approach described here: #3980 (comment)
Here is the updated representation after removing sorting:
@astrojuanlu, @merelcht, @datajoely
Edit: updated representations
@@ -227,32 +229,23 @@ def save(self, data: _DI) -> None:
            message = f"Failed while saving data to data set {str(self)}.\n{str(exc)}"
            raise DatasetError(message) from exc

    def __str__(self) -> str:
IIUC, we're removing the __str__ format, right? Shouldn't we keep it around for backwards compatibility?
Oh right. Still, we're changing str, right? Not sure if someone depends on str(dataset). It looks like a small, preventable breaking change.
Would make more sense for For
Happy to do that later on.
The idea was to keep
Description
Implementation of one-line approach described here: #3980
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md file