This project implements an ETL system that stores movie view data for analysts. Because the service must handle a constant stream of events from every user, it uses the Kafka event streaming platform. The API layer, built with the FastAPI framework, forwards events to Kafka without transforming them. The ETL process that loads data into the analytical data store is implemented with PySpark, a library for batch and stream data processing. The store must handle very large volumes of data and return results quickly enough for analysts to conduct their research. The project therefore included research to choose the right storage solution, and the analytical OLAP system ClickHouse proved to be the best choice.
Python
Kafka
FastAPI
PySpark
ClickHouse
Vertica
Jupyter Notebook
Docker
Clone the repository and navigate to the infra directory:
git clone https://github.com/temirovazat/kafka-to-clickhouse.git
cd kafka-to-clickhouse/infra/
Create a .env file and add project settings:
nano .env
# Kafka
KAFKA_HOST=kafka
KAFKA_PORT=9092
# Clickhouse
CLICKHOUSE_HOST=clickhouse-node1
CLICKHOUSE_PORT=9000
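The settings above are plain KEY=VALUE pairs. As a rough illustration of how a service might consume them, the sketch below parses the same text and builds a Kafka bootstrap address and a ClickHouse host/port pair. The helper `parse_env` and the variable names are hypothetical and not taken from the repository; the project itself may load settings differently (for example through Docker Compose or a settings library).

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Hypothetical helper for illustration only; not part of the project.
    """
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            settings[key.strip()] = value.strip()
    return settings


# The same content as the .env file shown above.
ENV_TEXT = """\
# Kafka
KAFKA_HOST=kafka
KAFKA_PORT=9092
# Clickhouse
CLICKHOUSE_HOST=clickhouse-node1
CLICKHOUSE_PORT=9000
"""

settings = parse_env(ENV_TEXT)

# Kafka clients usually expect "host:port" as the bootstrap server.
kafka_bootstrap = f"{settings['KAFKA_HOST']}:{settings['KAFKA_PORT']}"

# Port 9000 is ClickHouse's default native TCP port.
clickhouse_address = (settings["CLICKHOUSE_HOST"], int(settings["CLICKHOUSE_PORT"]))

print(kafka_bootstrap)
print(clickhouse_address)
```

The host names `kafka` and `clickhouse-node1` presumably resolve only inside the Docker network created by the compose file, so these addresses are meant for the containers rather than for the host machine.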
Deploy and run the project in containers:
docker-compose up
Send a POST request with the current frame of the movie being watched:
http://127.0.0.1/films/<UUID>/video_progress
{
"frame": <INTEGER>
}
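The request above can also be built from Python. The sketch below uses only the standard library and prepares the POST request without sending it. The function name `build_progress_request`, the example UUID, and the frame value are made up for illustration, and the sketch assumes the endpoint needs no headers beyond `Content-Type`, which the source does not confirm.

```python
import json
import uuid
from urllib import request


def build_progress_request(
    film_id: uuid.UUID,
    frame: int,
    base_url: str = "http://127.0.0.1",
) -> request.Request:
    """Build (but do not send) the POST request for a video progress event.

    Hypothetical helper: the URL layout and the JSON body follow the
    example above; everything else is an assumption.
    """
    url = f"{base_url}/films/{film_id}/video_progress"
    body = json.dumps({"frame": frame}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example values, not real identifiers from the project.
film_id = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
req = build_progress_request(film_id, frame=1200)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Once the containers are running, passing `req` to `urllib.request.urlopen` would actually send the event.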