
Pipelines — ETL Framework

Quickstart

Initialization

Create a new folder and run pipelines init inside of it.
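For example (test_project matches the NAME used in the generated file):

> mkdir test_project
> cd test_project
> pipelines init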

It will create a file named pipeline.py with the following content:

from pipelines import tasks, Pipeline
from pipelines.tasks import sql, sql_create_table_as, load_file_to_db, save_table_to_file

NAME = 'test_project'
SCHEMA = 'public'
YEAR_SUFFIX = '2023'

pipeline = Pipeline(
    name=NAME,
    schema=SCHEMA,
    version=YEAR_SUFFIX,
    tasks=[
        load_file_to_db(
            input='original/original.csv',
            output='original',
        ),
        sql_create_table_as(
            table='norm',
            query='''
                select *, domain_of_url(url)
                from {original};
            '''
        ),
        tasks.CopyToFile(
            input='norm',
            output='norm',
        ),

        # clean up:
        sql('drop table {original}'),
        sql('drop table {norm}'),
    ]
)

The idea of this pipeline is to load an existing file of URLs, normalize them by extracting the domain name from each URL, and finally save the result back to a CSV file.
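Note that the query assumes a domain_of_url function already exists in the target database; it is not defined by this pipeline. As a rough Python sketch of what it computes (illustrative only, not part of the framework):

from urllib.parse import urlparse

def domain_of_url(url: str) -> str:
    # Return the host part of a URL:
    # 'http://hello.com/home' -> 'hello.com'
    return urlparse(url).netloc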

List available tasks

> pipelines tasks
Tasks:
 1: load_file_to_db [original]: original/original.csv -> original
 2: ctas [norm]: norm
 3: copy_to_file [norm]: norm -> norm.csv.gz

Add files

Let's create a file named original.csv with the following content:

id,name,url
1,hello,http://hello.com/home
2,world,https://world.org/

Formatted:

id  name   url
1   hello  http://hello.com/home
2   world  https://world.org/

Now you can add it to your data sources using the following command:

> pipelines add original.csv
data/original/original.csv
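Judging by the printed path, add copies the file into the project's data directory, into a subfolder named after the file's stem. A minimal sketch of that behaviour, assuming exactly this layout (the real command may do more, e.g. validation):

import shutil
from pathlib import Path

def add(source: str, data_dir: str = 'data') -> Path:
    # original.csv -> data/original/original.csv
    src = Path(source)
    dest = Path(data_dir) / src.stem / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dest)
    return dest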

Running the pipeline

Note that pipelines commands only work from within the project directory:

> pipelines list
Error: No pipeline found in the current directory!

> cd test_project
> pipelines list
...

Now that all the dependencies are in place, we can run the pipeline with pipelines run.

> pipelines run
Running task 1: load_file_to_db [original]: original/original.csv -> original
0 original/original.csv original
Loading file original/original.csv
Dropping table 'public.test_project_2023_original'
drop true
'COPY 2'
Task took 0.385 seconds

Running task 2: ctas [norm]: norm
drop table if exists public.test_project_2023_norm;
create table public.test_project_2023_norm as
select *, domain_of_url(url)
from public.test_project_2023_original;

SELECT 2
Task took 0.812 seconds

--------------------------------------------------------------------------------
Running task 3: copy_to_file [norm]: norm -> norm.csv.gz
Writing data to file: norm.csv.gz
COPY 2
Task took 0.027 seconds
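As the log shows, a placeholder like {original} expands to a schema-qualified table name built from the pipeline's schema, name, version, and the task name. A sketch of that convention, inferred from the output above:

def qualified_name(schema: str, name: str, version: str, task: str) -> str:
    # ('public', 'test_project', '2023', 'original')
    # -> 'public.test_project_2023_original'
    return f'{schema}.{name}_{version}_{task}'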

Results

You can see the result of your work in data/norm.csv.gz.

id,name,url,domain_of_url
1,hello,http://hello.com/home,hello.com
2,world,https://world.org/,world.org

Formatted:

id  name   url                    domain_of_url
1   hello  http://hello.com/home  hello.com
2   world  https://world.org/     world.org
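To inspect the output without unpacking it, you can read the gzipped CSV directly with the Python standard library (the path is the one shown above):

import csv
import gzip

with gzip.open('data/norm.csv.gz', 'rt', newline='') as f:
    for row in csv.DictReader(f):
        print(row['url'], '->', row['domain_of_url'])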
