Docs: rest_api: document processing_steps (#1872)
* Document `processing_steps`
* Rearrange sections; update the headings structure

Co-authored-by: Willi Müller <[email protected]>
burnash and willi-mueller authored Sep 25, 2024
1 parent d975aee commit 487f3dc
Showing 1 changed file with 106 additions and 8 deletions.
114 changes: 106 additions & 8 deletions docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md
@@ -307,6 +307,7 @@ A resource configuration is used to define a [dlt resource](../../../general-usa
- `write_disposition`: The write disposition for the resource.
- `primary_key`: The primary key for the resource.
- `include_from_parent`: A list of fields from the parent resource to be included in the resource output. See the [resource relationships](#include-fields-from-the-parent-resource) section for more details.
- `processing_steps`: A list of [processing steps](#processing-steps-filter-and-transform-data) to filter and transform the data.
- `selected`: A flag to indicate if the resource is selected for loading. This can be useful when you want to load data only from child resources and not from the parent resource.

You can also pass additional resource parameters that will be used to configure the dlt resource. See [dlt resource API reference](../../../api_reference/extract/decorators#resource) for more details.
@@ -638,7 +639,23 @@ The `field` value can be specified as a [JSONPath](https://github.com/h2non/json

Under the hood, dlt handles this by using a [transformer resource](../../../general-usage/resource.md#process-resources-with-dlttransformer).

#### Include fields from the parent resource

You can include data from the parent resource in the child resource by using the `include_from_parent` field in the resource configuration. For example:

```py
{
    "name": "issue_comments",
    "endpoint": {
        ...
    },
    "include_from_parent": ["id", "title", "created_at"],
}
```

This will include the `id`, `title`, and `created_at` fields from the `issues` resource in the `issue_comments` resource data. The names of the included fields are prefixed with an underscore (`_`), the parent resource name, and another underscore, like so: `_issues_id`, `_issues_title`, `_issues_created_at`.
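As a rough illustration of the naming pattern described above, the sketch below builds the prefixed keys in plain Python. The helper `prefix_parent_fields` and the sample record are hypothetical and are not part of dlt; the sketch only mirrors the `_<parent>_<field>` naming, not how dlt implements it.

```py
from typing import Any, Dict, List


def prefix_parent_fields(
    parent_name: str, parent_record: Dict[str, Any], fields: List[str]
) -> Dict[str, Any]:
    """Hypothetical helper: mimic the `_<parent>_<field>` naming pattern."""
    return {f"_{parent_name}_{field}": parent_record[field] for field in fields}


# Made-up parent record from an `issues` resource.
issue = {"id": 1, "title": "Bug", "created_at": "2024-09-25", "state": "open"}

# Only the listed fields are carried over; `state` is left out.
included = prefix_parent_fields("issues", issue, ["id", "title", "created_at"])
print(included)
```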

### Define a resource which is not a REST endpoint

Sometimes you need to request endpoints with specific values that are not returned by another endpoint.
For such cases, you can include arbitrary dlt resources in your `RESTAPIConfig` instead of defining a resource for every path.
@@ -685,22 +702,103 @@ def repositories() -> Generator[Dict[str, Any]]:
    yield from [{"name": "dlt"}, {"name": "verified-sources"}, {"name": "dlthub-education"}]
```
### Processing steps: filter and transform data
The `processing_steps` field in the resource configuration allows you to apply transformations to the data fetched from the API before it is loaded into your destination. This is useful when you need to filter out certain records, modify the data structure, or anonymize sensitive information.
Each processing step is a dictionary specifying the type of operation (`filter` or `map`) and the function to apply. Steps apply in the order they are listed.
#### Quick example
```py
def lower_title(record):
    # Normalize the title to lowercase and return the modified record.
    record["title"] = record["title"].lower()
    return record


config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.example.com",
    },
    "resources": [
        {
            "name": "posts",
            "processing_steps": [
                # Steps run top to bottom: filter first, then map.
                {"filter": lambda x: x["id"] < 10},
                {"map": lower_title},
            ],
        },
    ],
}
```
In the example above:
- First, the `filter` step uses a lambda function to include only records where `id` is less than 10.
- Thereafter, the `map` step applies the `lower_title` function to each remaining record.
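To make the ordering concrete, here is a minimal sketch that simulates the steps in plain Python. The `apply_steps` function is a hypothetical stand-in written for this illustration; it is not how dlt applies processing steps internally, and the sample records are made up.

```py
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]
Step = Dict[str, Callable[[Record], Any]]


def apply_steps(records: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Illustrative only: run `filter` and `map` steps in list order."""
    for record in records:
        keep = True
        for step in steps:
            if "filter" in step:
                # A falsy result drops the record; later steps are skipped.
                if not step["filter"](record):
                    keep = False
                    break
            elif "map" in step:
                record = step["map"](record)
        if keep:
            yield record


def lower_title(record: Record) -> Record:
    record["title"] = record["title"].lower()
    return record


steps: List[Step] = [
    {"filter": lambda x: x["id"] < 10},
    {"map": lower_title},
]

posts = [{"id": 1, "title": "Hello"}, {"id": 42, "title": "World"}]
result = list(apply_steps(posts, steps))
print(result)
```

In this sketch the record with `id` 42 is dropped by the filter before `lower_title` ever sees it.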
#### Using `filter`
The `filter` step allows you to exclude records that do not meet certain criteria. The provided function should return `True` to keep the record or `False` to exclude it:
```py
{
    "name": "posts",
    "endpoint": "posts",
    "processing_steps": [
        {"filter": lambda x: x["id"] in [10, 20, 30]},
    ],
}
```
In this example, only records with `id` equal to 10, 20, or 30 will be included.
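A named predicate can be easier to test and reuse than a lambda. The sketch below is one possible variant, assuming that records without an `id` key should be dropped rather than raise a `KeyError`; the names `ALLOWED_IDS` and `has_allowed_id` are made up for this example.

```py
from typing import Any, Dict

ALLOWED_IDS = {10, 20, 30}


def has_allowed_id(record: Dict[str, Any]) -> bool:
    """Keep records whose `id` is in ALLOWED_IDS; drop records without an `id`."""
    return record.get("id") in ALLOWED_IDS


resource_config = {
    "name": "posts",
    "endpoint": "posts",
    "processing_steps": [
        {"filter": has_allowed_id},
    ],
}

# The predicate can be checked on its own, without calling any API.
sample = [{"id": 10}, {"id": 11}, {"title": "no id"}]
kept = [record for record in sample if has_allowed_id(record)]
print(kept)
```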
#### Using `map`
The `map` step allows you to modify the records fetched from the API. The provided function should take a record as an argument and return the modified record. For example, to anonymize the `email` field:
```py
def anonymize_email(record):
    # Overwrite the email address so it never reaches the destination.
    record["email"] = "REDACTED"
    return record


config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.example.com",
    },
    "resources": [
        {
            "name": "users",
            "processing_steps": [
                {"map": anonymize_email},
            ],
        },
    ],
}
```
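If you still need to join or deduplicate on the email address, one alternative to a fixed placeholder is to replace it with a hash. The following is a sketch under stated assumptions: `pseudonymize_email` is a hypothetical function, it assumes `email` is a string when present, and an unsalted SHA-256 digest is pseudonymization rather than true anonymization, so check it against your own privacy requirements.

```py
import hashlib
from typing import Any, Dict


def pseudonymize_email(record: Dict[str, Any]) -> Dict[str, Any]:
    """Replace `email` with a SHA-256 hex digest; leave other records untouched."""
    email = record.get("email")
    if email is not None:
        # Lowercase first so differently cased addresses map to the same digest.
        digest = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
        record["email"] = digest
    return record


first = pseudonymize_email({"id": 1, "email": "Ann@example.com"})
second = pseudonymize_email({"id": 2, "email": "ann@example.com"})
no_email = pseudonymize_email({"id": 3})
```

The function can be used in the same way as `anonymize_email`, for example `{"map": pseudonymize_email}`.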
#### Combining `filter` and `map`
You can combine multiple processing steps to achieve complex transformations:
```py
{
    "name": "posts",
    "endpoint": "posts",
    "processing_steps": [
        {"filter": lambda x: x["id"] < 10},
        {"map": lower_title},
        {"filter": lambda x: "important" in x["title"]},
    ],
}
```
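The position of the last `filter` matters because it matches against the lowercased title. The plain-Python sketch below, using made-up sample records and no dlt code, compares the two orderings.

```py
def lower_title(record):
    record["title"] = record["title"].lower()
    return record


def is_important(record):
    return "important" in record["title"]


posts = [
    {"id": 1, "title": "IMPORTANT update"},
    {"id": 2, "title": "Misc"},
]

# Map first, then filter: the lowercased title matches the substring check.
# `dict(p)` copies each record so that `posts` itself stays unchanged.
map_then_filter = [
    r for r in (lower_title(dict(p)) for p in posts) if is_important(r)
]

# Filter first, then map: the original uppercase title does not match.
filter_then_map = [lower_title(dict(p)) for p in posts if is_important(p)]

print(map_then_filter)
print(filter_then_map)
```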
:::tip
#### Best practices
1. Order matters: Processing steps are applied in the order they are listed. Be mindful of the sequence, especially when combining `map` and `filter`.
2. Function definition: Define your filter and map functions separately for clarity and reuse.
3. Filter early: Use `filter` to exclude records as early as possible to reduce the amount of data that later steps need to process.
4. Merge maps: Combine consecutive `map` steps into a single function for faster execution.
:::
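One way to follow the advice about merging consecutive `map` steps is to chain the functions into a single callable. The helper `compose_maps` below is a hypothetical sketch, not a dlt utility.

```py
from functools import reduce
from typing import Any, Callable, Dict

Record = Dict[str, Any]
MapFunc = Callable[[Record], Record]


def compose_maps(*funcs: MapFunc) -> MapFunc:
    """Hypothetical helper: chain map functions into one, applied left to right."""

    def combined(record: Record) -> Record:
        return reduce(lambda acc, func: func(acc), funcs, record)

    return combined


def lower_title(record: Record) -> Record:
    record["title"] = record["title"].lower()
    return record


def anonymize_email(record: Record) -> Record:
    record["email"] = "REDACTED"
    return record


clean_record = compose_maps(lower_title, anonymize_email)

# A single `map` step now does the work of two.
processing_steps = [{"map": clean_record}]

cleaned = clean_record({"title": "Hi", "email": "a@example.com"})
print(cleaned)
```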
## Incremental loading