Name		Name	Last commit message	Last commit date
parent directory ..
src		src
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties

README.md

Probabilistic Aggregations for Jet

Collections of probabilistic aggregations. Currently implementation aggregations: HyperLogLog++

HyperLogLog++ Aggregation for Jet

Memory efficient cardinality estimator

Getting Started

HyperLogLog is a probabilistic data structure to estimate cardinality. Translated to plain English it means this: You feed HyperLogLog with items and it can tell you how many unique items it received.

It's extremely memory efficient: The usual way to count unique elements would be to store elements in e.g. HashSet. However this would have space complexity O(n) where n is number of unique element. HyperLogLog has constant space complexity.

What's the trade-off? Precision. There is an error rate. However for many applications it's OK to have approximate data if it means extremely low memory consumption.

Imagine you are processing web server access log and you want to tell how many unique IPs visited your site. Error rate 1-2% is often irrelevant.

You can read more about the data structure at Wikipedia: https://en.wikipedia.org/wiki/HyperLogLog

Usage

The aggregation is typically used in two stages:

Mapping step to transform your stream entries into 64 bits hashes
Aggregation step to estimate no. of unique hashes

In practices in looks like this:

import com.hazelcast.jet.contrib.probabilistic.*;
[...]
pipeline.readFrom(Sources.mySource())
                .mapUsingContext(HashingSupport.hashingContextFactory(), HashingSupport.hashingFn()) // hash items 
                .aggregate(ProbabilisticAggregations.hyperLogLog()) // actual aggregation
                .writeTo(Sinks.mySink()); // write cardinality into sink

Installing

The artifacts are published on the Maven repositories.

Add the following lines to your pom.xml to include it as a dependency to your project:

<dependency>
    <groupId>com.hazelcast.jet.contrib</groupId>
    <artifactId>probabilistic</artifactId>
    <version>${version}</version>
</dependency>

Gradle:

compile group: 'com.hazelcast.jet.contrib', name: 'probabilistic', version: ${version}

Running the tests

To run the tests run the command below:

./gradlew test

Authors

Jaromir Hamala

License

This project is licensed under the Apache 2.0 license - see the LICENSE file for details

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

probabilistic

probabilistic

README.md

Probabilistic Aggregations for Jet

HyperLogLog++ Aggregation for Jet

Getting Started

Usage

Installing

Running the tests

Authors

License

Files

probabilistic

Directory actions

More options

Directory actions

More options

Latest commit

History

probabilistic

Folders and files

parent directory

README.md

Probabilistic Aggregations for Jet

HyperLogLog++ Aggregation for Jet

Getting Started

Usage

Installing

Running the tests

Authors

License