From 5f4330fca1b5ddc7064833d6fa21455d6a2c0c50 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Mon, 6 May 2024 07:50:47 +0200 Subject: [PATCH 01/21] Add a tutorial on code beyond starter files Signed-off-by: Yury Fedotov --- .../code_beyond_starter_files.md | 131 ++++++++++++++++++ docs/source/kedro_project_setup/index.md | 1 + 2 files changed, 132 insertions(+) create mode 100644 docs/source/kedro_project_setup/code_beyond_starter_files.md diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md new file mode 100644 index 0000000000..8ce30c57e8 --- /dev/null +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -0,0 +1,131 @@ +# Adding code beyond starter files + +After you [create a Kedro project](../get_started/new_project.md) and +[add a pipeline](../tutorial/create_a_pipeline.md), you notice that Kedro generates a +few boilerplate files: `nodes.py`, `pipeline.py`, `pipeline_registry.py`... + +While those may be sufficient for a small project, they quickly become large, hard to +read and collaborate on as your codebase grows. +Those files also sometimes make new users think that Kedro requires code +to be located only in those starter files, which is not true. + +This section elaborates what are the Kedro requirements in terms of organizing code +in files and modules. +It also provides examples of common scenarios such as sharing utilities between +pipelines and using Kedro in a monorepo setup. + +## Where does Kedro look for code to be located + +The only technical constraint for arranging code in the project is that `pipeline_registry.py` +file must be located in `/src/` directory, which is where +it is created by default. + +This file must have a `register_pipelines()` function that returns a `tp.Dict[str, Pipeline]` +mapping from pipeline name to corresponding `Pipeline` object. 
+ +Other than that, **Kedro does not impose any constraints on where you should keep files with +`Pipeline`s, `Node`s, or functions wrapped by `node`**. + +```{note} +You actually can make Kedro look for pipeline registry in a different place by modifying the +`__main__.py` file of your project, but such advanced customizations are not in scope of this article. +``` + +This being the only constraint means that you can, for example: +* Add `utils.py` file to a pipeline folder and import utilities defined by multiple + functions in `nodes.py`. +* Delete or rename a default `nodes.py` file, split it into multiple files or modules. +* Instead of having a single `pipeline.py` in your pipeline folder, split it + into `historical_pipeline.py` and `inference_pipeline.py`. +* Instead of registering many pipelines in `register_pipelines()` function one by one, + create a few `tp.Dict[str, Pipeline]` objects in different places of the project + and then make `register_pipelines()` return a union of those. +* Store code that has nothing to do with Kedro `Pipeline` and `Node` concepts, or should + be reused by multiple pipelines of your project, in a module at the same level as the + `pipelines` folder of your project. This scenario is covered in more detail below. + +## Common codebase extension scenarios + +This section provides examples of how you can handle some common cases of adding more +code to or around your Kedro project. +**Provided implementations are by no means the only ways to achieve target scenarios**, +and serve only illustrative purposes. + +### Sharing utilities between pipelines + +Oftentimes you have utilities that have to be imported by multiple `pipelines`. +To keep them as part of a Kedro project, **create a module (e.g. 
`utils`) at the same +level as the `pipelines` folder**, and organize the functionalities there: + +```text +├── conf +├── data +├── notebooks +└── src + ├── my_project + │ ├── __init__.py + │ ├── __main__.py + │ ├── pipeline_registry.py + │ ├── settings.py + │ ├── pipelines + │ └── utils <-- Create a module to store your utilities + │ ├── __init__.py <-- Required to import from it + │ ├── pandas_utils.py <-- Put a file with utility functions here + │ ├── dictionary_utils.py <-- Or a few files + │ ├── visualization_utils <-- Or sub-modules to organize even more utilities + └── tests +``` + +Example of importing a function `find_common_keys` from `dictionary_utils.py` would be: + +```python +from my_project.utils.dictionary_utils import find_common_keys +``` + +```{note} +For imports like this to be displayed in IDE properly, it is required to perform an editable +installation of the Kedro project to your virtual environment. +This is done via `pip install -e `, the easiest way to achieve +which is to `cd` to the root of Kedro project and do `pip install -e .`. +``` + +### Kedro project in a monorepo setup + +The way a Kedro project is generated may build an impression that it should +only be acting as a root of a `git` repo. This is not true: just like you can combine +multiple Python packages in a single repo, you can combine multiple Kedro projects, +or a Kedro project with other parts of your project's software stack. + +```{note} +A practice of combining multiple, often unrelated software components in a single version +control repository is not specific to Python and called [_**monorepo design**_](https://monorepo.tools/). +``` + +A common use case of Kedro is that a software product built by a team has components that +are well separable from the Kedro project. + +Let's use **a recommendation tool for production equipment operators** as an example. 
+It would imply three parts: + +| **#** | **Part** | **Considerations** | +|-------|--------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, ship it to some registry |
  • Here Kedro fits well, as it allows you to develop those pipelines in a modular and extensible way.
| +| 2 | An optimizer that leverages the ML model and implements domain business logic to derive recommendations |
  • A good design consideration might be to make it independent of the UI framework.
| +| 3 | User interface (UI) application |
  • This can be e.g., a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.
  • Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).
  • Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.
| + +A suggested solution in this case would be a **monorepo** design. Below is an example: + +```text +└── repo_root + ├── packages + │ ├── kedro_project <-- A Kedro project for ML model training. + │ │ ├── conf + │ │ ├── data + │ │ ├── notebooks + │ │ ├── ... + │ ├── optimizer <-- Standalone package. + │ └── dashboard <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline. + ├── requirements.txt <-- Linters, code formatters... Not dependencies of packages. + ├── pyproject.toml <-- Settings for those, like `[tool.isort]`. + └── ... +``` diff --git a/docs/source/kedro_project_setup/index.md b/docs/source/kedro_project_setup/index.md index 5112a51094..e8f484cd78 100644 --- a/docs/source/kedro_project_setup/index.md +++ b/docs/source/kedro_project_setup/index.md @@ -6,4 +6,5 @@ dependencies session settings +code_beyond_starter_files ``` From 036cbe0e37b2154bd66d6cbd94bbe3a3bfa3d5e7 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Mon, 6 May 2024 08:00:10 +0200 Subject: [PATCH 02/21] Mention change in RELEASE.md Signed-off-by: Yury Fedotov --- RELEASE.md | 1 + 1 file changed, 1 insertion(+) diff --git a/RELEASE.md b/RELEASE.md index 8d2b98f0b9..d643e65cf4 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -7,6 +7,7 @@ ## Breaking changes to the API ## Documentation changes +* Added a guide on extending a Kedro project beyond files generated by default. 
## Community contributions From 02f02743442982e8a43d8d59fc6e479549bf67f2 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Mon, 6 May 2024 09:37:19 +0200 Subject: [PATCH 03/21] Address Vale comments on UK endings Signed-off-by: Yury Fedotov --- .../code_beyond_starter_files.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 8ce30c57e8..b1e9f780f8 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -9,7 +9,7 @@ read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code to be located only in those starter files, which is not true. -This section elaborates what are the Kedro requirements in terms of organizing code +This section elaborates what are the Kedro requirements in terms of organising code in files and modules. It also provides examples of common scenarios such as sharing utilities between pipelines and using Kedro in a monorepo setup. @@ -54,8 +54,8 @@ and serve only illustrative purposes. ### Sharing utilities between pipelines Oftentimes you have utilities that have to be imported by multiple `pipelines`. -To keep them as part of a Kedro project, **create a module (e.g. `utils`) at the same -level as the `pipelines` folder**, and organize the functionalities there: +To keep them as part of a Kedro project, **create a module (for example, `utils`) at the same +level as the `pipelines` folder**, and organise the functionalities there: ```text ├── conf @@ -93,8 +93,8 @@ which is to `cd` to the root of Kedro project and do `pip install -e .`. The way a Kedro project is generated may build an impression that it should only be acting as a root of a `git` repo. 
This is not true: just like you can combine -multiple Python packages in a single repo, you can combine multiple Kedro projects, -or a Kedro project with other parts of your project's software stack. +multiple Python packages in a single repo, you can combine multiple Kedro projects. +Or a Kedro project with other parts of your project's software stack. ```{note} A practice of combining multiple, often unrelated software components in a single version @@ -107,11 +107,11 @@ are well separable from the Kedro project. Let's use **a recommendation tool for production equipment operators** as an example. It would imply three parts: -| **#** | **Part** | **Considerations** | -|-------|--------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, ship it to some registry |
  • Here Kedro fits well, as it allows you to develop those pipelines in a modular and extensible way.
| -| 2 | An optimizer that leverages the ML model and implements domain business logic to derive recommendations |
  • A good design consideration might be to make it independent of the UI framework.
| -| 3 | User interface (UI) application |
  • This can be e.g., a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.
  • Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).
  • Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.
| +| **#** | **Part** | **Considerations** | +|-------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| 1 | An ML model, or more precisely, a workflow to prepare the data, train an estimator, ship it to some registry |
  • Here Kedro fits well, as it allows you to develop those pipelines in a modular and extensible way.
| +| 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations |
  • A good design consideration might be to make it independent of the UI framework.
| +| 3 | User interface (UI) application |
  • This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.
  • Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).
  • Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.
| A suggested solution in this case would be a **monorepo** design. Below is an example: From 4d7a35a6b731765524aa3513643761ea8535299d Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:31:11 -0500 Subject: [PATCH 04/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md Comment on customizations Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index b1e9f780f8..7dadf473d1 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -28,7 +28,7 @@ Other than that, **Kedro does not impose any constraints on where you should kee ```{note} You actually can make Kedro look for pipeline registry in a different place by modifying the -`__main__.py` file of your project, but such advanced customizations are not in scope of this article. +`__main__.py` file of your project, but such advanced customisations are not in scope for this section. 
``` This being the only constraint means that you can, for example: From ca00b727065f00e8e5ff0bcd506f339866f8a4eb Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:31:45 -0500 Subject: [PATCH 05/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md For example in splitting Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 7dadf473d1..bc7b8bb8b8 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -35,7 +35,7 @@ This being the only constraint means that you can, for example: * Add `utils.py` file to a pipeline folder and import utilities defined by multiple functions in `nodes.py`. * Delete or rename a default `nodes.py` file, split it into multiple files or modules. -* Instead of having a single `pipeline.py` in your pipeline folder, split it +* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, into `historical_pipeline.py` and `inference_pipeline.py`. 
* Instead of registering many pipelines in `register_pipelines()` function one by one, create a few `tp.Dict[str, Pipeline]` objects in different places of the project From ff01643c14349437e116fad5241f59ee3b5a3c81 Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:32:13 -0500 Subject: [PATCH 06/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md Make provided examples title not bold Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index bc7b8bb8b8..9f7f732677 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -48,7 +48,7 @@ This being the only constraint means that you can, for example: This section provides examples of how you can handle some common cases of adding more code to or around your Kedro project. -**Provided implementations are by no means the only ways to achieve target scenarios**, +The provided examples are by no means the only ways to achieve the target scenarios, and serve only illustrative purposes. 
### Sharing utilities between pipelines From b85840651528719d65906ec3c22267222fab8e9c Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:32:33 -0500 Subject: [PATCH 07/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md And as where appropriate Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 9f7f732677..f60ce80e35 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -49,7 +49,7 @@ This being the only constraint means that you can, for example: This section provides examples of how you can handle some common cases of adding more code to or around your Kedro project. The provided examples are by no means the only ways to achieve the target scenarios, -and serve only illustrative purposes. +and serve only as illustrative purposes. 
### Sharing utilities between pipelines From 144783b0dab0518cdba80f66d929eb54af7a277b Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:34:47 -0500 Subject: [PATCH 08/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md Organise spelling Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index f60ce80e35..f9055c1826 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -72,7 +72,7 @@ level as the `pipelines` folder**, and organise the functionalities there: │ ├── __init__.py <-- Required to import from it │ ├── pandas_utils.py <-- Put a file with utility functions here │ ├── dictionary_utils.py <-- Or a few files - │ ├── visualization_utils <-- Or sub-modules to organize even more utilities + │ ├── visualisation_utils <-- Or sub-modules to organise even more utilities └── tests ``` From cf3a65ff95d9636dfe1cf12e945a0360c59a4820 Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:35:49 -0500 Subject: [PATCH 09/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md Proper wording for pip install note Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md 
b/docs/source/kedro_project_setup/code_beyond_starter_files.md index f9055c1826..9cb6c97e2e 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -86,7 +86,7 @@ from my_project.utils.dictionary_utils import find_common_keys For imports like this to be displayed in IDE properly, it is required to perform an editable installation of the Kedro project to your virtual environment. This is done via `pip install -e `, the easiest way to achieve -which is to `cd` to the root of Kedro project and do `pip install -e .`. +this is to `cd` to the root of your Kedro project and run `pip install -e .`. ``` ### Kedro project in a monorepo setup From 729dc1ada5035e50f5f6be896d74b6f6d055dc2d Mon Sep 17 00:00:00 2001 From: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> Date: Thu, 16 May 2024 19:36:22 -0500 Subject: [PATCH 10/21] Update docs/source/kedro_project_setup/code_beyond_starter_files.md Implies to consists replacement Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Yury Fedotov <102987839+yury-fedotov@users.noreply.github.com> --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 9cb6c97e2e..b421dd8509 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -105,7 +105,7 @@ A common use case of Kedro is that a software product built by a team has compon are well separable from the Kedro project. Let's use **a recommendation tool for production equipment operators** as an example. 
-It would imply three parts: +This example consists of three parts: | **#** | **Part** | **Considerations** | |-------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| From a58ff32e4c63a88ce1ad28fe6f6271bc2656e5ed Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Thu, 16 May 2024 20:06:41 -0500 Subject: [PATCH 11/21] Link deepdives to a list of examples Signed-off-by: Yury Fedotov --- .../code_beyond_starter_files.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index b421dd8509..924899b34e 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -32,17 +32,19 @@ You actually can make Kedro look for pipeline registry in a different place by m ``` This being the only constraint means that you can, for example: -* Add `utils.py` file to a pipeline folder and import utilities defined by multiple - functions in `nodes.py`. +* Add `utils.py` file to a pipeline folder and import utilities used by multiple + functions in `nodes.py` from there. +* [Share modules between pipelines](#sharing-modules-between-pipelines). + Those could be utility functionalities, or your standalone module responsible for + the domain logic of the industry you work at. 
+* [Use Kedro in a monorepo setup](#kedro-project-in-a-monorepo-setup) if there are + software components independent of Kedro that you want to keep together in the version control system. * Delete or rename a default `nodes.py` file, split it into multiple files or modules. -* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, +* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, into `historical_pipeline.py` and `inference_pipeline.py`. * Instead of registering many pipelines in `register_pipelines()` function one by one, create a few `tp.Dict[str, Pipeline]` objects in different places of the project and then make `register_pipelines()` return a union of those. -* Store code that has nothing to do with Kedro `Pipeline` and `Node` concepts, or should - be reused by multiple pipelines of your project, in a module at the same level as the - `pipelines` folder of your project. This scenario is covered in more detail below. ## Common codebase extension scenarios @@ -51,9 +53,9 @@ code to or around your Kedro project. The provided examples are by no means the only ways to achieve the target scenarios, and serve only as illustrative purposes. -### Sharing utilities between pipelines +### Sharing modules between pipelines -Oftentimes you have utilities that have to be imported by multiple `pipelines`. +Oftentimes you have machinery that has to be imported by multiple `pipelines`. 
To keep them as part of a Kedro project, **create a module (for example, `utils`) at the same level as the `pipelines` folder**, and organise the functionalities there: From 325489d26e8c09fec48c55bbb64f4af1e09d14ce Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 12:32:05 -0400 Subject: [PATCH 12/21] Revert weird MLflow release note edit Signed-off-by: Yury Fedotov --- RELEASE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RELEASE.md b/RELEASE.md index 4733072a61..e348773522 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -46,7 +46,7 @@ Many thanks to the following Kedroids for contributing PRs to this release: * Updated `kedro pipeline create` and `kedro pipeline delete` to read the base environment from the project settings. * Updated CLI command `kedro catalog resolve` to read credentials properly. * Changed the path of where pipeline tests generated with `kedro pipeline create` from `/src/tests/pipelines/` to `/tests/pipelines/`. -* Updated ``.gitignore`` to prevent pushing Mlflow local runs folder to a remote forge when using mlflow and git. +* Updated ``.gitignore`` to prevent pushing MLflow local runs folder to a remote forge when using MLflow and Git. * Fixed error handling message for malformed yaml/json files in OmegaConfigLoader. * Fixed a bug in `node`-creation allowing self-dependencies when using transcoding, that is datasets named like `name@format`. * Improved error message when passing wrong value to node. 
From 1853e2925e2e17dc82f2afddc9c519caeb7d43d6 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 12:42:03 -0400 Subject: [PATCH 13/21] Remove note on changing registry location Signed-off-by: Yury Fedotov --- .../code_beyond_starter_files.md | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 924899b34e..811f597418 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -17,20 +17,15 @@ pipelines and using Kedro in a monorepo setup. ## Where does Kedro look for code to be located The only technical constraint for arranging code in the project is that `pipeline_registry.py` -file must be located in `/src/` directory, which is where -it is created by default. +and `settings.py` files must be located in `/src/` directory, which is where +they are created by default. -This file must have a `register_pipelines()` function that returns a `tp.Dict[str, Pipeline]` -mapping from pipeline name to corresponding `Pipeline` object. +`pipeline_registry.py` must have a `register_pipelines()` function that returns a `tp.Dict[str, Pipeline]` +mapping from pipeline name to a corresponding `Pipeline` object. Other than that, **Kedro does not impose any constraints on where you should keep files with `Pipeline`s, `Node`s, or functions wrapped by `node`**. -```{note} -You actually can make Kedro look for pipeline registry in a different place by modifying the -`__main__.py` file of your project, but such advanced customisations are not in scope for this section. -``` - This being the only constraint means that you can, for example: * Add `utils.py` file to a pipeline folder and import utilities used by multiple functions in `nodes.py` from there. 
From 9fc168e19b0613adde9329ae553399c6045723a0 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 12:51:15 -0400 Subject: [PATCH 14/21] Simplify domain logic comment Signed-off-by: Yury Fedotov --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 811f597418..54484aacd9 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -30,8 +30,8 @@ This being the only constraint means that you can, for example: * Add `utils.py` file to a pipeline folder and import utilities used by multiple functions in `nodes.py` from there. * [Share modules between pipelines](#sharing-modules-between-pipelines). - Those could be utility functionalities, or your standalone module responsible for - the domain logic of the industry you work at. + Those could be utility functionalities, or a standalone module responsible for + the domain-specific logic. * [Use Kedro in a monorepo setup](#kedro-project-in-a-monorepo-setup) if there are software components independent of Kedro that you want to keep together in the version control system. * Delete or rename a default `nodes.py` file, split it into multiple files or modules. 
From 50a5bba858e1af1eee34e4343c0a6a867104216b Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 12:51:29 -0400 Subject: [PATCH 15/21] Replace tp.Dict by dict Signed-off-by: Yury Fedotov --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 54484aacd9..c5b5aae16b 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -20,7 +20,7 @@ The only technical constraint for arranging code in the project is that `pipelin and `settings.py` files must be located in `/src/` directory, which is where they are created by default. -`pipeline_registry.py` must have a `register_pipelines()` function that returns a `tp.Dict[str, Pipeline]` +`pipeline_registry.py` must have a `register_pipelines()` function that returns a `dict[str, Pipeline]` mapping from pipeline name to a corresponding `Pipeline` object. Other than that, **Kedro does not impose any constraints on where you should keep files with @@ -38,7 +38,7 @@ This being the only constraint means that you can, for example: * Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, into `historical_pipeline.py` and `inference_pipeline.py`. * Instead of registering many pipelines in `register_pipelines()` function one by one, - create a few `tp.Dict[str, Pipeline]` objects in different places of the project + create a few `dict[str, Pipeline]` objects in different places of the project and then make `register_pipelines()` return a union of those. 
## Common codebase extension scenarios From 3024560c06d9d0535ef9ed086693f8e90cd92e4c Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 13:00:08 -0400 Subject: [PATCH 16/21] Remove pyproject.toml from monorepo tree example Signed-off-by: Yury Fedotov --- .../kedro_project_setup/code_beyond_starter_files.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index c5b5aae16b..3c8e58be28 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -115,14 +115,16 @@ A suggested solution in this case would be a **monorepo** design. Below is an ex ```text └── repo_root ├── packages - │ ├── kedro_project <-- A Kedro project for ML model training. + │ ├── kedro_project <-- A Kedro project for ML model training. │ │ ├── conf │ │ ├── data │ │ ├── notebooks │ │ ├── ... - │ ├── optimizer <-- Standalone package. - │ └── dashboard <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline. - ├── requirements.txt <-- Linters, code formatters... Not dependencies of packages. - ├── pyproject.toml <-- Settings for those, like `[tool.isort]`. + │ ├── optimizer <-- Standalone package. + │ └── dashboard <-- Standalone package, may import `optimizer`, but should not know anything about model training pipeline. + ├── ... + ├── ... Examples of what you may want in the repo root: + ├── requirements.txt <-- Linters, code formatters... Not dependencies of packages. + ├── ruff.toml <-- Or configs for other tools that you want to share between packages. └── ... 
``` From 0df55baa8ea71b37da77cfa67640e6eb1b888040 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 13:52:56 -0400 Subject: [PATCH 17/21] Replace historical and inference as pipeline split example Signed-off-by: Yury Fedotov --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 3c8e58be28..1d80d03ca5 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -35,8 +35,9 @@ This being the only constraint means that you can, for example: * [Use Kedro in a monorepo setup](#kedro-project-in-a-monorepo-setup) if there are software components independent of Kedro that you want to keep together in the version control system. * Delete or rename a default `nodes.py` file, split it into multiple files or modules. -* Instead of having a single `pipeline.py` in your pipeline folder, split it, for example, - into `historical_pipeline.py` and `inference_pipeline.py`. +* If you have multiple large `Pipeline` objects defined in a single `pipeline.py`, + split them into separate `.py` files. For example, in `data_processing` pipeline + you may want to have `cleaning_pipeline.py` and `merging_pipeline.py`. * Instead of registering many pipelines in `register_pipelines()` function one by one, create a few `dict[str, Pipeline]` objects in different places of the project and then make `register_pipelines()` return a union of those. 
From a9f0e3af26b02ecfbc27a29d5035d4775051f22d Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 14:11:09 -0400 Subject: [PATCH 18/21] Add a note about find_pipelines() Signed-off-by: Yury Fedotov --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 1d80d03ca5..19c1c22855 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -42,6 +42,11 @@ This being the only constraint means that you can, for example: create a few `dict[str, Pipeline]` objects in different places of the project and then make `register_pipelines()` return a union of those. + ```{note} + While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery), + for large projects you may want a finer control and register pipelines manually. 
+ ``` + ## Common codebase extension scenarios This section provides examples of how you can handle some common cases of adding more From c02309c25e3ebb5a16bb0a35cccd0cad31682c13 Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Sat, 25 May 2024 14:12:04 -0400 Subject: [PATCH 19/21] Remove article before find pipelines Signed-off-by: Yury Fedotov --- docs/source/kedro_project_setup/code_beyond_starter_files.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 19c1c22855..5a6e68101d 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -43,7 +43,7 @@ This being the only constraint means that you can, for example: and then make `register_pipelines()` return a union of those. ```{note} - While Kedro features a [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery), + While Kedro features [`find_pipelines()` functionality for autodiscovery of pipelines](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery), for large projects you may want a finer control and register pipelines manually. 
``` From 6ee9ad20b6597cc10a47f9ee51282fe98595ece1 Mon Sep 17 00:00:00 2001 From: Merel Theisen <49397448+merelcht@users.noreply.github.com> Date: Tue, 9 Jul 2024 10:36:18 +0100 Subject: [PATCH 20/21] Apply suggestions from code review Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> --- .../kedro_project_setup/code_beyond_starter_files.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index 5a6e68101d..ccd393c2cb 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -7,9 +7,9 @@ few boilerplate files: `nodes.py`, `pipeline.py`, `pipeline_registry.py`... While those may be sufficient for a small project, they quickly become large, hard to read and collaborate on as your codebase grows. Those files also sometimes make new users think that Kedro requires code -to be located only in those starter files, which is not true. +to be located in those exact starter files, which is not true. -This section elaborates what are the Kedro requirements in terms of organising code +This section elaborates on what the Kedro requirements are for organising your project code in files and modules. It also provides examples of common scenarios such as sharing utilities between pipelines and using Kedro in a monorepo setup. @@ -17,7 +17,7 @@ pipelines and using Kedro in a monorepo setup. ## Where does Kedro look for code to be located The only technical constraint for arranging code in the project is that `pipeline_registry.py` -and `settings.py` files must be located in `/src/` directory, which is where +and `settings.py` files must be located in the `/src/` directory, which is where they are created by default. 
`pipeline_registry.py` must have a `register_pipelines()` function that returns a `dict[str, Pipeline]` @@ -56,7 +56,7 @@ and serve only as illustrative purposes. ### Sharing modules between pipelines -Oftentimes you have machinery that has to be imported by multiple `pipelines`. +Oftentimes you have functions that have to be imported by multiple `pipelines`. To keep them as part of a Kedro project, **create a module (for example, `utils`) at the same level as the `pipelines` folder**, and organise the functionalities there: @@ -86,7 +86,7 @@ from my_project.utils.dictionary_utils import find_common_keys ``` ```{note} -For imports like this to be displayed in IDE properly, it is required to perform an editable +For imports like this to be displayed in an IDE properly, it is required to perform an editable installation of the Kedro project to your virtual environment. This is done via `pip install -e `, the easiest way to achieve this is to `cd` to the root of your Kedro project and run `pip install -e .`. @@ -116,7 +116,7 @@ This example consists of three parts: | 2 | An optimiser that leverages the ML model and implements domain business logic to derive recommendations |
  • A good design consideration might be to make it independent of the UI framework.
| | 3 | User interface (UI) application |
  • This can be a [`plotly`](https://plotly.com/python/) or [`streamlit`](https://streamlit.io/) dashboard.
  • Or even a full-fledged front-end app leveraging a JS framework like [`React`](https://react.dev/).
  • Regardless, this component may know how to access the ML model, but it should probably not know anything about how it was trained or whether Kedro was involved.
| -A suggested solution in this case would be a **monorepo** design. Below is an example: +A suggested solution in this case would be a **monorepo** design. Below is an example of such a project structure: ```text └── repo_root From 52ed71ea65a16dbc128d024276160b9da7b236bd Mon Sep 17 00:00:00 2001 From: Yury Fedotov Date: Fri, 6 Sep 2024 23:14:04 -0400 Subject: [PATCH 21/21] Implement Nok's comments re: utility functions Signed-off-by: Yury Fedotov --- .../code_beyond_starter_files.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/docs/source/kedro_project_setup/code_beyond_starter_files.md b/docs/source/kedro_project_setup/code_beyond_starter_files.md index ccd393c2cb..51c00ea52e 100644 --- a/docs/source/kedro_project_setup/code_beyond_starter_files.md +++ b/docs/source/kedro_project_setup/code_beyond_starter_files.md @@ -71,18 +71,20 @@ level as the `pipelines` folder**, and organise the functionalities there: │ ├── pipeline_registry.py │ ├── settings.py │ ├── pipelines - │ └── utils <-- Create a module to store your utilities - │ ├── __init__.py <-- Required to import from it - │ ├── pandas_utils.py <-- Put a file with utility functions here - │ ├── dictionary_utils.py <-- Or a few files - │ ├── visualisation_utils <-- Or sub-modules to organise even more utilities + │ └── utils <-- Create a module to store your utilities + │ ├── __init__.py <-- Required to import from it + │ ├── pandas.py <-- Put a file with utility functions here + │ ├── dictionary.py <-- Or a few files + │ ├── visualisation <-- Or sub-modules to organise even more utilities └── tests ``` -Example of importing a function `find_common_keys` from `dictionary_utils.py` would be: +Example of importing and using a function `find_common_keys` from `utils.dictionary` would be: ```python -from my_project.utils.dictionary_utils import find_common_keys +from my_project import utils + +utils.dictionary.find_common_keys(...) ``` ```{note}