Data Engineering Guide

This repo collects tools used in data engineering.

Databases

Column
- AWS Redshift A fully managed, petabyte-scale, cloud-based data warehouse product designed for storing and analyzing large data sets.
- Cassandra A DBMS designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Distributed
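- Datomic A distributed transactional database that stores data as immutable facts and keeps their history. It is not aimed at very high write-throughput workloads such as a time-series database or log store.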
Document
- MongoDB A cross-platform document-oriented database that uses JSON-like documents with optional schemas.
Relational
- MySQL A widely used open-source SQL database management system.
- Oracle A robust, reliable, and secure DBMS.
- PostgreSQL A powerful open-source object-relational database system.
- SQL Server A relational database management system developed by Microsoft.

Apache Avro A row-oriented serialization format that uses JSON to define data types and protocols.
Apache Parquet An open-source column-oriented storage format designed for efficiently storing and processing nested data types.
Apache ORC A column-oriented storage format optimized for fast reads and writes.
Delta Lake An open-source storage layer that brings reliability to data lakes.
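
The difference between the row-oriented and column-oriented formats listed above can be illustrated with a minimal sketch in plain Python. It does not use Avro, Parquet, or ORC themselves; the sample records and the helper names `to_row_layout` and `to_column_layout` are hypothetical and only show how the same data is grouped in each layout.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "city": "Lisbon", "amount": 10.5},
    {"id": 2, "city": "Porto", "amount": 7.0},
    {"id": 3, "city": "Lisbon", "amount": 3.25},
]


def to_row_layout(rows):
    """Row-oriented: all values of one record are stored together."""
    return [tuple(row.values()) for row in rows]


def to_column_layout(rows):
    """Column-oriented: all values of one field are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# An aggregate over one field reads a single list in the column layout,
# but has to visit every record in the row layout.
total_from_columns = sum(column_layout["amount"])
total_from_rows = sum(row[2] for row in row_layout)
```

Both totals are the same; what changes is how much unrelated data has to be read to compute them, which is one reason column-oriented formats tend to suit analytical queries.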

Apache Airflow An open-source workflow management platform that schedules workflows and lets you monitor them through its own UI.
Azkaban A batch workflow job scheduler.
Kedro A development workflow framework that applies software engineering best practices to data pipelines.
Luigi Helps you build pipelines of batch jobs and monitor them.
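
The workflow tools above all run tasks in an order that respects their dependencies. The following is a minimal sketch of that idea using only Python's standard library (`graphlib`, Python 3.9+). It is not how Airflow, Azkaban, Kedro, or Luigi are implemented or used; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_pipeline(deps, tasks):
    """Run each task after all of its dependencies and return the run order."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # execute the callable registered for this task
        order.append(name)
    return order


log = []
# Each hypothetical task just records that it ran.
tasks = {name: (lambda n=name: log.append(f"ran {n}")) for name in dependencies}
run_order = run_pipeline(dependencies, tasks)
```

Real schedulers add retries, scheduling intervals, and monitoring on top of this ordering step.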

Hadoop MapReduce A programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
Spark An open-source distributed general-purpose cluster-computing framework.
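
The MapReduce model can be sketched with a word count in plain Python. This runs sequentially in a single process and does not use Hadoop or Spark; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical labels for the three steps that a cluster framework would distribute across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three phases over a list of input documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data pipeline"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be parallelized, which is what makes the model suitable for clusters.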

Apache Hudi An open-source framework for managing storage for real-time processing; one of its most interesting features is upsert support.
Apache NiFi Automates data flow between systems reliably, with no single point of failure.
Apache Flink An open-source stream-processing framework that provides multiple APIs at different levels of abstraction and offers dedicated libraries for common use cases.
Spark Streaming An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
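
The upsert operation mentioned for Apache Hudi above means "update the record if its key already exists, otherwise insert it". A minimal sketch of those semantics in plain Python follows. It is not Hudi's API; the `upsert` helper, the `id` key, and the sample rows are hypothetical.

```python
def upsert(table, incoming, key="id"):
    """Update rows whose key already exists and insert the rest."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        # Overlay the incoming fields on the existing row, if there is one.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


# Hypothetical current table contents and a batch of incoming changes.
current = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
changes = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}]
result = upsert(current, changes)
```

Here the row with `id` 2 is updated and the row with `id` 3 is inserted, while the row with `id` 1 is left unchanged.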
