Releases: bullet-db/bullet-core
New bullet-record Supports Float and Integer Types
This release upgrades the bullet-record dependency to version 0.2.0 which supports the Integer and Float data types and changes the BulletRecord class to an abstract class, allowing users to create various different implementations of BulletRecord. It also contains the AvroBulletRecord class which will serve as the default implementation of BulletRecord.
To configure which implementation of BulletRecord will be used in bullet-core this version contains a new setting:
bullet.record.provider.class.name
This setting can be used to specify which BulletRecordProvider class will be used to instantiate BulletRecords that will be used within bullet-core. It will use the AvroBulletRecordProvider class by default.
This version of bullet-core also removed the logic in the Querier to inject timestamps into records. As a result the two settings related to that functionality should be removed from your settings yaml. Namely, the settings
bullet.record.inject.timestamp.enable
bullet.record.inject.timestamp.key
should be removed.
Pre-Start delaying and Buffering changes
This release cleans up how this library was meant to be used in record-by-record streaming implementations - like Bullet Storm. Instead of buffering windows at the end of each window, this adds the capability to buffer it once at the beginning of the query to align windows properly for the duration of the query. In the former case, this extra buffering at the end would eventually add up and cause issues like having one window results appear in the next.
Related to this change, it also stops the time based windows (Tumbling, Additive Tumbling) from starting their new windows when they are reset. This is because the delay from the window's theoretical close time to its reset time could also eventually magnify causing alignment issues. Now, both these windows use the theoretical close times as their close times regardless of the when the reset happens. So the delay to the reset time eats into the next window and makes it shorter. For large data volumes, this is minimal.
The Querier now provides a void restart()
method to help mark the true start of a query after its initialization for the pre-start delay. It also simplifies the boolean shouldBuffer()
to only return true for queries with windows that are time-based. These queries should be buffered when they are finished in record-by-record processing systems. And conversely, all queries that return false for this method, should delay the start of these queries. See Bullet-Storm 0.8.2 for an implementation of this.
And finally, the Validator has been improved to support validating multi-key states.
This release also changes the underlying HTTP client used in the RESTPubsub from asynchttpclient (Netty based) to the Apache HTTPClient
Querier Partition Mode and AdditiveTumbling bug fix
This release removes the com.yahoo.bullet.core.querying.Querier#isClosedForPartition
and unifies it into a new Mode concept for a Querier instance. When a Querier is created, it now possible to pass in an enum Querier.Mode.PARTITION
or Querier.Mode.ALL
to tell the Querier that it is seeing a partition of the data for a Query or all of it. The Querier then decides the right thing to do when you call isClosed
or reset
. Previously, you would use the isClosedForPartition
when you partitioned the data for a Querier.
The AdditiveTumbling window previously never reset the data. This means that if you persisted the Querier instance in your partitioned stage and emitted the data when isClosedForPartition returned true, it would never reset the data. Your aggregation would then grow quadratically and be incorrect for future windows. To fix this, we needed a new resetForPartition method. Rather than expose that, this new Mode concept encapsulates that so you don't have to worry about which method to call.
Finally, the Querier#hasData
method has been renamed to hasNewData
to correctly signify what it means - new data was consumed or combined since the last call to reset
.
Added Headers to REST PubSub Publishers
This release was a small change to add content-type headers ("application/json") to the http requests sent to the REST PubSub endpoints.
In-Memory REST PubSub
This release offers an in-memory REST PubSub.
- The RESTPubSub can be used as a simple PubSub implementation.
- The Bullet default settings are updated to use this new PubSub implementation by default.
- A new separate rest_pubsub_defaults.yaml file was added with the RESTPubSub default settings. These settings should be added to your main yaml settings file when launching Bullet if you are using the REST PubSub.
- IMPORTANT: Some (3) Bullet settings changed to include ".ms" to clarify the time units:
bullet.query.default.duration
becamebullet.query.default.duration.ms
bullet.query.max.duration
becamebullet.query.max.duration.ms
bullet.query.window.min.emit.every
becamebullet.query.window.min.emit.every.ms
These will have to be updated in your Bullet settings file. These changes are reflected in the new Bullet default settings file.
- Some minor fixes to config validation - they are lot quieter. BulletConfig merging now calls the non-static validate method.
The REST PubSub uses two REST endpoints (one for queries and one for results). A new version of bullet-service will be released soon with a setting to expose these endpoints.
Windowing/Incremental Updates
This release brings windowing and incremental updates to Bullet!
This is a major breaking interface change. The changes here are too numerous to list individually. There has been a general refactor of classes and moving them to consolidated package names. The following list summarizes the changes:
- Refactoring - moving of commonly used classes like BulletConfig.java, BulletError.java, and other utility classes to a common package. Removing the
operations
package and movingaggregations
andtypesystem
to a top level package. Renamed thecom.yahoo.bullet.result.Metadata
class to Meta.java to prevent clashes with the pubsub Metadata and other such changes. - Removing of processing logic from the parsing classes. These strictly only deal with parsing logic. The processing logic is now moved to querying.
- Introduction of a Monoidal interface that abstracts out the common additive natures of many Bullet objects.
- Removal and unification of the
FilterQuery
andAggregationQuery
objects into one Monoidal object that lets your run and interact with a Bullet query - the Querier. Take a look at the Javadoc in this class to see how to roughly implement Bullet in a Stream processing framework. - The Monoidal interface is now used for all Aggregation [strategies]( and Windowing schemes.
- A new Windowing element in the Bullet query. The structure of a window in a Bullet JSON query is defined below in the Windowing section.
- Bullet methods, particularly in the Querier, are all reactive. They provide touch points to inquire about the status but leave it to the implementor to decide to honor it or not. For example, Querier provides a way to tell if the window is closed (or closed if the data is partitioned) but will still consume or combine data even if you don't reset and emit the data.
- With the windowing concept, it is now possible to define queries that may emit too much data. For instance, you could define a query with no filters that get each record as they come in! To prevent these situations, Bullet now defines a rate limiting mechanism that you can inquire the status of and fail or not. The rate limiting is only applied for getting data out of the Querier too fast. Sending data into the query is not limited (as it should be).
- Cleanup of sketches classes to unify the usage of dual sketches and metadata.
- All enums that define the FilterType, the AggregationType or the GroupOperationType etc can now be found in the respective classes for the parsing or their implementation.
- Configuration changes: new settings added for windowing and other concepts. Some configuration values have been changed. See below for the most important changes.
- The
aggregation.size
field now refers to the maximum number of result rows that will be returned per window and the aggregation itself is applied on the data specified by the window. For queries without windows, it is as if there is one giant window for the full query duration and the same size usage is applied. However,RAW
queries are an exception in that they know when they are done (if you have 500 records, you can return right away) and will return if the size is met. In a Tumbling window (see below), the RAW query will stop consuming or combining once this is reached (as an optimization). You can still provide data to the Querier - it will just be ignored.
Windowing
{
"window":{
"emit":{
"type":"TIME|RECORD",
"every":5000
},
"include":{
"type":"TIME|RECORD|ALL",
"first":5000
}
}
}
The emit
object specifies the criteria for when the window is emitted. The include
object specifies the criteria for what is in the window. This initial release does not support all combinations of these. It currently supports:
- Tumbling
{
"window":{
"emit":{
"type":"TIME",
"every":5000
}
}
This is a window that returns the result of the query aggregation on your data every 5 seconds. For instance, if you do a COUNT DISTINCT
, you will get a distinct count of the field(s) for your data within that 5 s window. The next result will include the count for the data in that window.
- Additive Tumbling:
{
"window":{
"emit":{
"type":"TIME",
"every":5000
},
"include":{
"type":"ALL",
}
}
This is still a window that returns the result of the query aggregation on your data every 5 seconds. However, the result includes the results since the start of the query. In the COUNT DISTINCT
example above, each window will include all the data since the start of time, so you will get a count of all the unique values since query start. It is an error to use this window for RAW
queries.
- Reactive
{
"window":{
"emit":{
"type":"RECORD",
"every":1
}
}
This is a special case of a Sliding window where the result is emitted for each record as the record arrives. We have currently disallowed providing anything other than 1
for every
. Furthermore, this window may only be used for RAW
queries. This lets you receive records as Bullet finds them making it snappy.
In all cases, if you provide include
and it is not of a ALL
type, it must be the same as emit
. This ensures that you fall into the Tumbling or Reactive cases above.
Other window types and combinations are coming in the future!
Configuration
There are quite a few changes to configuration. The most relevant ones are listed below:
-
The
bullet.query.max.duration
andbullet.query.default.duration
now default to INFINITY. This means that queries can run as long as they want now! This should be tweaked if this is not the behavior you desire. -
You can also disable windows entirely by setting the
bullet.query.window.disable
setting. All queries with windows will now be ignored. Note that this means you will only get the result at the end of your query duration (unless it's a RAW query) is particularly bad if your query has an infinite duration! -
Metadata concepts have been unified, renamed and cleaned up. New ones are also introduced for windowing.
Take a look at the Configuration defaults for a full list of all settings and their meaning.
Adds a BufferingSubscriber helper
Adds a helper Subscriber implementation com.yahoo.bullet.pubsub.BufferingSubscriber
to help with writing Subscribers that buffer a fixed number of uncommitted messages (a backpressure kind of mechanism) before stopping to read from the underlying pubsub. Failed messages are re-added back to the buffer for immediate re-emission. You only need to implement a single List<PubSubMessage> getMessages() throws PubSubException
in your Subscriber that extends this class (along with your regular close
and a constructor that sets the fixed buffer size).
You would use this if you're reading from a PubSub that cannot do resiliency or if want to manage resiliency yourself.
Additional helpers to Config, PubSubMessage, Metadata
- Adds
getAs
,getRequiredConfig
,getWithDefaultAs
tocom.yahoo.bullet.Config
. - Adds a default constructor to
com.yahoo.bullet.BulletConfig
to create a config with defaults. - Adds a new
FAIL
signal tocom.yahoo.bullet.pubsub.Metadata.Signal
to use for failure PubSubMessages. Config
no longer throws IOException when unable to read a file.hasSignal(Signal)
added tocom.yahoo.bullet.pubsub.Metadata
- Adds
hasSignal
andhasSignal(Signal)
tocom.yahoo.bullet.pubsub.PubSubMessage
. com.yahoo.bullet.pubsub.PubSubMessage
now implementscom.yahoo.bullet.parsing.JSONFormatter
and provides methods to write itself to JSON and back usingPubSubMessage#toJSON
andPubSubMessage#fromJSON(String)
.com.yahoo.bullet.parsing.JSONFormatter
provides a static methodfromJSON(String, Class<T>)
to convert a String JSON back to a provided class.
Refactoring PubSub and helpers
In this release:
com.yahoo.bullet.pubsub.PubSub
is no longer required to be serializable- serialVersionUID added to
com.yahoo.bullet.pubsub.PubSubMessage
com.yahoo.bullet.RandomPool
added as a Utility class. It used in live in bullet-servicecom.yahoo.bullet.pubsub.PubSub
methods throw PubSubExceptions and provides helpers for checking required arguments
Equals, hashcode and other helper methods
This adds a couple of helper methods to PubSubMessage and Config.