Skip to content

temirovazat/kafka-to-clickhouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kafka to Clickhouse

python dockerfile lint code style platform

Description

The aim of this project is to implement an ETL system for analysts that stores data about movie views. Since the service needs to handle the constant influx of information from each user, it uses the event streaming platform Kafka. To provide an API layer that sends events to Kafka without any transformations underneath, it leverages the FastAPI framework. The ETL process for loading data into the analytical data store is implemented using the batch and stream data processing library PySpark. The storage must handle very large data and do so within a reasonable time frame for analysts to conduct their research. Therefore, the project involved research to choose the right storage solution, and the best choice was the analytical OLAP system ClickHouse.

Technologies

Python Kafka FastAPI PySpark Clickhouse Vertica Jupyter Notebook Docker

How to Run the Project:

Clone the repository and navigate to the infra directory:

git clone https://github.com/temirovazat/kafka-to-clickhouse.git
cd kafka-to-clickhouse/infra/

Create a .env file and add project settings:

nano .env
# Kafka
KAFKA_HOST=kafka
KAFKA_PORT=9092

# Clickhouse
CLICKHOUSE_HOST=clickhouse-node1
CLICKHOUSE_PORT=9000

Deploy and run the project in containers:

docker-compose up

Send a POST request with the current movie view frame:

http://127.0.0.1/films/<UUID>/video_progress
{
    "frame": <INTEGER>
}

About

♻ Data Synchronization from Kafka to Clickhouse 📊

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published