This repo is a curated list of tools used in data engineering.
- Column
  - AWS Redshift: A fully managed, petabyte-scale, cloud-based data warehouse designed for large-scale data set storage and analysis.
  - Cassandra: A distributed DBMS designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
- Distributed
  - Datomic: A distributed database that stores an immutable, time-ordered log of facts, so queries can see the full history of the data.
- Document
  - MongoDB: A cross-platform document-oriented database that stores JSON-like documents with optional schemas.
- Relational
  - MySQL: The most popular open-source SQL database management system.
  - Oracle: A robust, reliable commercial DBMS.
  - PostgreSQL: A powerful open-source object-relational database system.
  - SQL Server: A relational database management system developed by Microsoft.
- Apache Avro: A row-oriented serialization format that uses JSON for defining data types and protocols.
- Apache Parquet: An open-source column-oriented serialization format, specialized in efficiently storing and processing nested data structures.
- Apache ORC: A column-oriented serialization format highly optimized for reading and writing.
- Delta Lake: An open-source storage layer that brings reliability to data lakes.
- Apache Airflow: An open-source workflow management platform that schedules workflows and monitors them via its web UI.
- Azkaban: A batch workflow job scheduler.
- Kedro: A development workflow framework that applies software engineering best practices to data pipelines.
- Luigi: Helps you build pipelines of batch jobs and monitor them.
- Hadoop MapReduce: A framework for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
- Spark: An open-source distributed general-purpose cluster-computing framework.
- Apache Hudi: An open-source framework for managing storage for real-time processing; one of its most interesting features is record-level upserts.
- Apache NiFi: Automates data flow between systems reliably, with no single point of failure.
- Apache Flink: An open-source stream-processing framework that provides APIs at multiple levels of abstraction and dedicated libraries for common use cases.
- Spark Streaming: An extension of core Spark that enables scalable, high-throughput, fault-tolerant processing of live data streams.
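The MongoDB entry above says documents are JSON-like with optional schemas. As a minimal sketch (the collection and field names here are made up, and plain Python dicts stand in for stored documents), two documents in the same collection can carry entirely different fields:

```python
import json

# Two documents destined for the same hypothetical "users" collection.
# MongoDB imposes no fixed schema, so each document may carry its own fields.
doc_a = {"name": "Ada", "email": "ada@example.com"}
doc_b = {"name": "Grace", "roles": ["admin"], "last_login": "2024-01-15"}

# MongoDB stores documents as BSON, a binary superset of JSON;
# both of these serialize cleanly to JSON.
collection = [doc_a, doc_b]
print(json.dumps(collection))
```

With a real deployment you would insert these via a driver such as PyMongo, but the flexible-schema idea is the same.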
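The Avro entry above notes that data types are defined in JSON. A small record schema illustrates this (the record and field names are invented for the example; only stdlib `json` is used to parse it):

```python
import json

# A hypothetical Avro record schema: Avro schemas are plain JSON documents
# describing named types and their fields. The ["null", "string"] union
# with a null default makes the email field optional.
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(schema_json)
print(schema["name"], [f["name"] for f in schema["fields"]])
```

In practice a library such as `fastavro` or the official Avro bindings would compile this schema and serialize records against it.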
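The MapReduce model behind Hadoop can be sketched in plain Python: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a single-process illustration of the idea, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; for word count, sum the ones.
    return {key: sum(values) for key, values in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Hadoop the map and reduce functions run in parallel across the cluster, and the shuffle moves data between nodes; the per-phase logic is what the programmer supplies.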