diff --git a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md
index 08fa1ab776..121769a11a 100644
--- a/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md
+++ b/docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md
@@ -307,6 +307,7 @@ A resource configuration is used to define a [dlt resource](../../../general-usa
 - `write_disposition`: The write disposition for the resource.
 - `primary_key`: The primary key for the resource.
 - `include_from_parent`: A list of fields from the parent resource to be included in the resource output. See the [resource relationships](#include-fields-from-the-parent-resource) section for more details.
+- `processing_steps`: A list of [processing steps](#processing-steps-filter-and-transform-data) to filter and transform the data.
 - `selected`: A flag to indicate if the resource is selected for loading. This could be useful when you want to load data only from child resources and not from the parent resource.
 
 You can also pass additional resource parameters that will be used to configure the dlt resource. See [dlt resource API reference](../../../api_reference/extract/decorators#resource) for more details.
@@ -638,7 +639,23 @@ The `field` value can be specified as a [JSONPath](https://github.com/h2non/json
 
 Under the hood, dlt handles this by using a [transformer resource](../../../general-usage/resource.md#process-resources-with-dlttransformer).
 
-#### Define a resource which is not a REST endpoint
+#### Include fields from the parent resource
+
+You can include data from the parent resource in the child resource by using the `include_from_parent` field in the resource configuration. For example:
+
+```py
+{
+    "name": "issue_comments",
+    "endpoint": {
+        ...
+    },
+    "include_from_parent": ["id", "title", "created_at"],
+}
+```
+
+This will include the `id`, `title`, and `created_at` fields from the `issues` resource in the `issue_comments` resource data. The name of the included fields will be prefixed with the parent resource name and an underscore (`_`) like so: `_issues_id`, `_issues_title`, `_issues_created_at`.
+
+### Define a resource which is not a REST endpoint
 
 Sometimes, we want to request endpoints with specific values that are not returned by another endpoint. Thus, you can also include arbitrary dlt resources in your `RESTAPIConfig` instead of defining a resource for every path!
 
@@ -685,22 +702,103 @@ def repositories() -> Generator[Dict[str, Any]]:
     yield from [{"name": "dlt"}, {"name": "verified-sources"}, {"name": "dlthub-education"}]
 ```
 
+### Processing steps: filter and transform data
 
-#### Include fields from the parent resource
+The `processing_steps` field in the resource configuration allows you to apply transformations to the data fetched from the API before it is loaded into your destination. This is useful when you need to filter out certain records, modify the data structure, or anonymize sensitive information.
 
-You can include data from the parent resource in the child resource by using the `include_from_parent` field in the resource configuration. For example:
+Each processing step is a dictionary specifying the type of operation (`filter` or `map`) and the function to apply. Steps apply in the order they are listed.
+
+#### Quick example
+
+```py
+def lower_title(record):
+    record["title"] = record["title"].lower()
+    return record
+
+config: RESTAPIConfig = {
+    "client": {
+        "base_url": "https://api.example.com",
+    },
+    "resources": [
+        {
+            "name": "posts",
+            "processing_steps": [
+                {"filter": lambda x: x["id"] < 10},
+                {"map": lower_title},
+            ],
+        },
+    ],
+}
+```
+
+In the example above:
+
+- First, the `filter` step uses a lambda function to include only records where `id` is less than 10.
+- Thereafter, the `map` step applies the `lower_title` function to each remaining record.
+
+#### Using `filter`
+
+The `filter` step allows you to exclude records that do not meet certain criteria. The provided function should return `True` to keep the record or `False` to exclude it:
 
 ```py
 {
-    "name": "issue_comments",
-    "endpoint": {
-        ...
+    "name": "posts",
+    "endpoint": "posts",
+    "processing_steps": [
+        {"filter": lambda x: x["id"] in [10, 20, 30]},
+    ],
+}
+```
+
+In this example, only records with `id` equal to 10, 20, or 30 will be included.
+
+#### Using `map`
+
+The `map` step allows you to modify the records fetched from the API. The provided function should take a record as an argument and return the modified record. For example, to anonymize the `email` field:
+
+```py
+def anonymize_email(record):
+    record["email"] = "REDACTED"
+    return record
+
+config: RESTAPIConfig = {
+    "client": {
+        "base_url": "https://api.example.com",
     },
-    "include_from_parent": ["id", "title", "created_at"],
+    "resources": [
+        {
+            "name": "users",
+            "processing_steps": [
+                {"map": anonymize_email},
+            ],
+        },
+    ],
 }
 ```
 
-This will include the `id`, `title`, and `created_at` fields from the `issues` resource in the `issue_comments` resource data. The name of the included fields will be prefixed with the parent resource name and an underscore (`_`) like so: `_issues_id`, `_issues_title`, `_issues_created_at`.
+#### Combining `filter` and `map`
+
+You can combine multiple processing steps to achieve complex transformations:
+
+```py
+{
+    "name": "posts",
+    "endpoint": "posts",
+    "processing_steps": [
+        {"filter": lambda x: x["id"] < 10},
+        {"map": lower_title},
+        {"filter": lambda x: "important" in x["title"]},
+    ],
+}
+```
+
+:::tip
+#### Best practices
+1. Order matters: Processing steps are applied in the order they are listed. Be mindful of the sequence, especially when combining `map` and `filter`.
+2. Function definition: Define your filter and map functions separately for clarity and reuse.
+3. Use `filter` to exclude records early in the process to reduce the amount of data that needs to be processed.
+4. Combine consecutive `map` steps into a single function for faster execution.
+:::
 
 ## Incremental loading
 