This is the implementation of the Engine contract of Open Data Fabric using the RisingWave open-source streaming database. It is currently in use in the kamu data management tool.
This repository is a fork of the RisingWave repo, since RisingWave currently cannot be used as a library or extended in a modular way with ODF-specific sources and sinks.
This engine is experimental and has limited functionality.
We currently recommend it for queries like:
- Streaming aggregations using window functions
- Streaming aggregations using tumbling windows with GROUP BY
- Top-N aggregations via materialized views
More information and engine comparisons are available here.
See the RisingWave examples and SQL reference for inspiration.
Have a look at the integration tests in this repo for examples of different transform categories.
Pay special attention to EMIT ON WINDOW CLOSE: unlike Flink, RisingWave does not operate in event-time processing mode by default. In the future we may hide this under the hood, but currently you need to specify this clause in your queries explicitly.
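To illustrate what emitting on window close means, here is a minimal Rust sketch of a tumbling-window sum that only releases a window once the watermark has passed its end. It is a toy model, not RisingWave's implementation: the `TumblingSum` type, its method names, and the rule that a window closes when the watermark reaches `window_start + size` are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Toy tumbling-window sum (hypothetical, for illustration only).
/// Results are held back until the watermark closes the window.
pub struct TumblingSum {
    size: i64,
    /// window_start -> running sum of values in that window
    open: BTreeMap<i64, i64>,
}

impl TumblingSum {
    pub fn new(size: i64) -> Self {
        assert!(size > 0, "window size must be positive");
        Self { size, open: BTreeMap::new() }
    }

    /// Start of the tumbling window that contains `event_time`.
    pub fn window_start(&self, event_time: i64) -> i64 {
        event_time - event_time.rem_euclid(self.size)
    }

    /// Adds a value to the window its event time falls into.
    pub fn add(&mut self, event_time: i64, value: i64) {
        let start = self.window_start(event_time);
        *self.open.entry(start).or_insert(0) += value;
    }

    /// Advances the watermark and returns every window whose end is at or
    /// below it, as `(window_start, sum)` pairs ordered by window start.
    /// ASSUMPTION: a window closes once watermark >= window_start + size.
    pub fn advance_watermark(&mut self, watermark: i64) -> Vec<(i64, i64)> {
        // Windows starting at or after this key are still open.
        let first_open_start = watermark - self.size + 1;
        let still_open = self.open.split_off(&first_open_start);
        let closed = std::mem::replace(&mut self.open, still_open);
        closed.into_iter().collect()
    }
}
```

Without this behavior, an aggregation would emit an updated result for every incoming event instead of a single final row per window, which is why the clause matters for append-only outputs.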
- No support for event-time JOINs. Only processing-time joins are supported. There might be some workarounds depending on the use case.
See: https://docs.risingwave.com/docs/current/query-syntax-join-clause/#process-time-temporal-joins
- No support for allowed lateness intervals. Unlike Flink, where lateness and watermarks are separate concepts, RisingWave will drop all events below the current watermark.
See: https://docs.risingwave.com/docs/current/watermarks/
- A SOURCE accepts append-only data. A TABLE with connector accepts both append-only and updatable data, but it persists all data that flows in, meaning that ODF checkpoints would grow significantly.
See: https://docs.risingwave.com/docs/current/data-ingestion/
- Requires row ID assignment for delete/update operations.
Currently ODF does not preserve a link between a retraction and the row being retracted. This is easy to implement for ingest merge strategies via an additional column, but we would need to assess whether this is a reasonable demand for all transform engines that emit retractions/corrections.
An alternative would be to specify a primary key for the RW source and ensure it uses that key to associate a delete with the record being deleted.
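The primary-key alternative from the last limitation can be sketched in Rust as follows. This is only a conceptual model: the `Change` enum, the `apply` function, and the use of a `u64` key with a `String` payload are hypothetical and do not reflect ODF's or RisingWave's actual data structures. It shows how a delete that carries a key can be matched to the record it retracts.

```rust
use std::collections::BTreeMap;

/// Hypothetical change record: each operation carries the primary key
/// that links it to the row it affects.
#[derive(Debug, Clone, PartialEq)]
pub enum Change {
    Insert { key: u64, value: String },
    Delete { key: u64 },
}

/// Applies a change to keyed state and returns the row that was replaced
/// or retracted, if any. A delete for an unknown key returns `None`.
pub fn apply(state: &mut BTreeMap<u64, String>, change: Change) -> Option<String> {
    match change {
        // Inserting over an existing key acts as an update.
        Change::Insert { key, value } => state.insert(key, value),
        // The key is what associates the delete with the stored record.
        Change::Delete { key } => state.remove(&key),
    }
}
```

Without such a key (or an explicit row ID), a retraction would have to be matched by comparing full row contents, which is ambiguous when duplicate rows exist.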
This repository is a fork of the RisingWave engine. Please refer to the upstream repo for developer documentation.
To help with the very long build times, we add this to Cargo.toml:
[profile.dev]
debug = "line-tables-only"
To build the engine run:
# Use CARGO_BUILD_JOBS=N env var if this drains your RAM
./risedev b
Run the tests (note that tests use fixed ports and thus must run one at a time):
cd odf
make test
Key directories:
- odf - build scripts, container image, and other supporting files
- src/odf_adapter - gRPC server that communicates with the ODF coordinator and spawns RW as a subprocess
- src/connectors/src/source/odf - custom ODF source implementation
- src/connectors/src/sink/odf - custom ODF sink implementation
Environment variables:
- RUST_LOG=debug - standard Rust logging
- RW_ODF_SOURCE_DEBUG=1 - enables debug data logging in the source
- RW_ODF_SINK_DEBUG=1 - enables debug data logging in the sink
- RW_ODF_SOURCE_MAX_RECORDS_PER_CHUNK=10 - makes the source split Arrow record batches into smaller chunks
- RW_ODF_SOURCE_SLEEP_CHUNK_MS=500 - makes the source sleep for some time before yielding a chunk
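As a rough picture of what the chunking variable does, the following Rust sketch splits a batch of records into chunks of at most N items. It is not the code of the ODF source: `parse_max_records` and `split_batch` are hypothetical helpers, and treating a missing, non-numeric, or zero value as "no splitting" is an assumption.

```rust
/// Parses the raw value of a variable such as
/// RW_ODF_SOURCE_MAX_RECORDS_PER_CHUNK (hypothetical helper).
/// ASSUMPTION: missing, non-numeric, or zero values disable splitting.
///
/// Typical call site:
/// `parse_max_records(std::env::var("RW_ODF_SOURCE_MAX_RECORDS_PER_CHUNK").ok().as_deref())`
pub fn parse_max_records(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|n| *n > 0)
}

/// Splits a batch into chunks of at most `max_records` items.
/// With no limit, a non-empty batch is returned as a single chunk.
pub fn split_batch<T: Clone>(records: &[T], max_records: Option<usize>) -> Vec<Vec<T>> {
    match max_records {
        Some(n) if n > 0 => records.chunks(n).map(|c| c.to_vec()).collect(),
        _ if records.is_empty() => Vec::new(),
        _ => vec![records.to_vec()],
    }
}
```

Small chunks combined with a sleep between them make it easier to observe how a query reacts to data arriving gradually rather than in one large batch.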