kenjihiraoka/data-engineering-guide

Data Engineering Guide

This repo lists tools used in data engineering.

Databases

  • Column
    • AWS Redshift A fully managed, petabyte-scale, cloud-based data warehouse designed for storing and analyzing large data sets.
    • Cassandra A DBMS designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
  • Distributed
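    • Datomic A distributed transactional database that stores data as immutable facts. It is not aimed at high write-throughput workloads such as a time-series database or log store.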
  • Document
    • MongoDB A cross-platform document-oriented database that uses JSON-like documents with optional schemas.
  • Relational
    • MySQL A widely used open-source SQL database management system.
    • Oracle A robust, reliable, and secure DBMS.
    • PostgreSQL A powerful open-source object-relational database system.
    • SQL Server A relational database management system developed by Microsoft.

File Format (serialization)

  • Apache Avro A row-oriented serialization format that uses JSON to define data types and protocols.
  • Apache Parquet An open-source column-oriented file format that specializes in efficiently storing and processing nested data types.
  • Apache ORC A column-oriented file format highly optimized for reading and writing.
  • Delta Lake An open-source storage layer that brings reliability to data lakes.
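
The difference between the row-oriented and column-oriented formats listed above can be sketched with plain Python structures. This is only an illustration of the two layouts, not how Avro, Parquet, or ORC encode data on disk; the records and field names are hypothetical.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "city": "Lisbon", "temp_c": 21.5},
    {"id": 2, "city": "Osaka", "temp_c": 18.0},
    {"id": 3, "city": "Recife", "temp_c": 29.0},
]


def to_row_layout(rows):
    """Row-oriented: the values of one record are stored together."""
    return [tuple(row.values()) for row in rows]


def to_column_layout(rows):
    """Column-oriented: the values of one field are stored together."""
    return {name: [row[name] for row in rows] for name in rows[0]}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# An aggregate over one field reads a single list in the column layout,
# but has to visit every record in the row layout.
avg_from_columns = sum(column_layout["temp_c"]) / len(column_layout["temp_c"])
avg_from_rows = sum(row[2] for row in row_layout) / len(row_layout)
```

Row layouts tend to suit workloads that write or read whole records, while column layouts tend to suit analytical queries that scan a few fields across many records.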

Pipeline Orchestration

  • Apache Airflow An open-source workflow management platform that schedules workflows and lets you monitor them through its own UI.
  • Azkaban A batch workflow job scheduler.
  • Kedro A development workflow framework that implements software engineering best practices for data pipelines.
  • Luigi Helps you build pipelines of batch jobs and monitor them.
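
What the orchestrators above have in common is that they run tasks in dependency order. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9+). It does not use the API of Airflow, Azkaban, Kedro, or Luigi, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}


def run_pipeline(deps, actions):
    """Run each task only after all of its upstream tasks have finished."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        actions[task]()
        executed.append(task)
    return executed


# Stand-in task bodies that only record that they ran.
log = []
actions = {name: (lambda n=name: log.append(f"ran {n}")) for name in dependencies}
order = run_pipeline(dependencies, actions)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.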

Batch Pipelines

  • Hadoop MapReduce A programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
  • Spark An open-source distributed general-purpose cluster-computing framework.
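
The map, shuffle, and reduce phases behind the MapReduce model can be sketched in a single Python process. This is a toy word count under the assumption that everything fits in memory; it is not the Hadoop or Spark API, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["spark and hadoop", "hadoop and more hadoop"])
```

On a cluster, the map and reduce calls run in parallel on different machines and the shuffle moves data between them over the network.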

Stream Pipelines

  • Apache Hudi An open-source framework for managing storage for real-time processing. One of its most interesting features is upsert support.
  • Apache NiFi Automates data flow between systems reliably, with no single point of failure.
  • Apache Flink An open-source stream-processing framework that provides multiple APIs at different levels of abstraction and offers dedicated libraries for common use cases.
  • Spark Streaming An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
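
The upsert mentioned for Apache Hudi means "update the record if its key already exists, otherwise insert it". A minimal sketch of that semantics follows, assuming a plain dictionary stands in for a table. It is not Hudi's API; the `upsert` helper, the `id` key, and the choice to merge fields rather than replace the whole record are assumptions made for illustration.

```python
def upsert(table, records, key="id"):
    """Insert records with new keys and update records with known keys."""
    for record in records:
        existing = table.get(record[key], {})
        # Fields in the incoming record override the stored ones.
        table[record[key]] = {**existing, **record}
    return table


table = {}

# First batch: both keys are new, so both records are inserted.
upsert(table, [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])

# Second batch: key 2 already exists and is updated, key 3 is inserted.
upsert(table, [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}])
```

Applying batches this way keeps one current record per key, which is what makes repeated or late-arriving events safe to replay.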
