Apache Druid (incubating) 0.14.0-incubating contains over 200 new features, performance/stability/documentation improvements, and bug fixes from 54 contributors. Major new features and improvements include:

New web console
Amazon Kinesis indexing service
Decommissioning mode for Historicals
Published segment cache in Broker
Bloom filter aggregator and expression
Updated Apache Parquet extension
Force push down option for nested GroupBy queries
Better segment handoff and drop rule handling
Automatically kill MapReduce jobs when Apache Hadoop ingestion tasks are killed
DogStatsD tag support for statsd emitter
New API for retrieving all lookup specs
New compaction options
More efficient cachingCost segment balancing strategy

The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Amerged+milestone%3A0.14.0

Documentation for this release is at: http://druid.io/docs/0.14.0-incubating/

Highlights

New web console

Druid has a new web console that provides functionality that was previously split between the coordinator and overlord consoles.

The new console allows the user to manage datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.

For more details, please see http://druid.io/docs/0.14.0-incubating/operations/management-uis.html

Added by @vogievetsky in #6923.

Kinesis indexing service

Druid now supports ingestion from Kinesis streams, provided by the new druid-kinesis-indexing-service core extension.

Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/kinesis-ingestion.html for details.

Added by @jsun98 in #6431.

Decommissioning mode for Historicals

Historical processes can now be put into a "decommissioning" mode, where the coordinator will no longer consider the Historical process as a target for segment replication. The coordinator will also move segments off the decommissioning Historical.

This is controlled via Coordinator dynamic configuration. For more details, please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#dynamic-configuration.

Added by @egor-ryashin in #6349.
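
As a rough illustration, the sketch below builds a Coordinator dynamic configuration payload that marks one Historical as decommissioning. The field names decommissioningNodes and decommissioningMaxPercentOfMaxSegmentsToMove, the host name, and the endpoint mentioned in the comment are assumptions based on the dynamic configuration documentation linked above; please verify them against your Druid version before use.

```python
import json


def build_decommission_config(hosts, max_percent=70):
    """Return a dynamic-config dict listing Historicals to decommission.

    Both field names are assumptions taken from the dynamic configuration
    docs; they are not confirmed by these release notes.
    """
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    return {
        "decommissioningNodes": sorted(set(hosts)),
        "decommissioningMaxPercentOfMaxSegmentsToMove": max_percent,
    }


# Hypothetical Historical host:port.
payload = build_decommission_config(["historical-01.example.com:8083"])

# Assumed usage: POST this JSON to /druid/coordinator/v1/config on the Coordinator.
print(json.dumps(payload, indent=2))
```

The helper only builds the JSON body; sending it is left to whatever HTTP client you already use.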

Published segment cache on Broker

The Druid Broker now has the ability to maintain a cache of published segments via polling the Coordinator, which can significantly improve response time for metadata queries on the sys.segments system table.

Please see http://druid.io/docs/0.14.0-incubating/querying/sql.html#retrieving-metadata for details.

Added by @surekhasaharan in #6901.

Bloom filter aggregator and expression

A new aggregator for constructing Bloom filters at query time and support for performing Bloom filter checks within Druid expressions have been added to the druid-bloom-filter extension.

Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/bloom-filter.html for details.

Added by @clintropolis in #6904 and #6397.

Updated Parquet extension

druid-parquet-extensions has been moved into the core extension set from the contrib extensions and now supports flattening and int96 values.

Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/parquet.html for details.

Added by @clintropolis in #6360.

Force push down option for nested GroupBy queries

Outer query execution for nested GroupBy queries can now be pushed down to Historical processes; previously, the outer queries would always be executed on the Broker.

Please see #5471 for details.

Added by @samarthjain in #5471.
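
For orientation, here is a minimal sketch of a nested GroupBy query with the push down option enabled. The context key forcePushDownNestedQuery is assumed from the GroupBy query documentation, and the datasource, dimensions, and metric names are hypothetical.

```python
import json

# Inner query: edits per (channel, user). All names are hypothetical.
inner = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "intervals": ["2019-01-01/2019-02-01"],
    "granularity": "all",
    "dimensions": ["channel", "user"],
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"}
    ],
}

# Outer query: re-aggregates the inner result per channel.
outer = {
    "queryType": "groupBy",
    "dataSource": {"type": "query", "query": inner},
    "intervals": ["2019-01-01/2019-02-01"],
    "granularity": "all",
    "dimensions": ["channel"],
    "aggregations": [
        {"type": "longSum", "name": "total_edits", "fieldName": "edits"}
    ],
    # Assumed context flag asking Druid to push the outer query down.
    "context": {"forcePushDownNestedQuery": True},
}

print(json.dumps(outer, indent=2))
```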

Better segment handoff and retention rule handling

Segment handoff will now ignore segments that would be dropped by a datasource's retention rules, avoiding ingestion failures caused by issue #5868.

Period load rules will now include the future by default.

A new "Period Drop Before" rule has been added. Please see http://druid.io/docs/0.14.0-incubating/operations/rule-configuration.html#period-drop-before-rule for details.

Added by @QiuMM in #6676, #6414, and #6415.
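
The sketch below shows how a "Period Drop Before" rule might be paired with a load rule, together with a simplified reading of when such a rule matches. The rule type name dropBeforeByPeriod and the rule fields are assumed from the rule configuration documentation; the matching function is an approximation that uses a fixed number of days in place of an ISO 8601 period, and its boundary handling is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumed rule chain: drop everything older than one month, keep the rest.
rules = [
    {"type": "dropBeforeByPeriod", "period": "P1M"},
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 2}},
]


def drop_before_matches(interval_end, now, retain):
    """Return True if the interval ends at or before now - retain.

    Simplified model of the rule; not Druid's actual implementation.
    """
    return interval_end <= now - retain


now = datetime(2019, 4, 1, tzinfo=timezone.utc)
retain = timedelta(days=30)  # stands in for the P1M period above

old_segment_end = datetime(2019, 1, 1, tzinfo=timezone.utc)
recent_segment_end = datetime(2019, 3, 25, tzinfo=timezone.utc)

print(drop_before_matches(old_segment_end, now, retain))
print(drop_before_matches(recent_segment_end, now, retain))
```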

Automatically kill MapReduce jobs when Hadoop ingestion tasks are killed

Druid will now automatically terminate MapReduce jobs created by Hadoop batch ingestion tasks when the ingestion task is killed.

Added by @ankit0811 in #6828.

DogStatsD tag support for statsd-emitter

The statsd-emitter extension now supports DogStatsD-style tags. Please see http://druid.io/docs/0.14.0-incubating/development/extensions-contrib/statsd.html

Added by @deiwin in #6605, with support for constant tags added by @glasser in #6791.
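
To show what DogStatsD-style tags look like on the wire, the sketch below formats a single datagram by hand. It illustrates the general DogStatsD line format only and is not the emitter's own code; the metric name and tag values are hypothetical.

```python
def dogstatsd_datagram(name, value, metric_type, tags=None):
    """Format one metric as name:value|type, plus |#k:v,... if tags are given."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort keys so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


# Hypothetical metric and tags.
print(dogstatsd_datagram(
    "druid.query.time", 42, "ms",
    {"dataSource": "wikipedia", "type": "timeseries"},
))
```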

New API for retrieving all lookup specs

A new API for retrieving all lookup specs for all tiers has been added. Please see http://druid.io/docs/0.14.0-incubating/querying/lookups.html#get-all-lookups for details.

Added by @jihoonson in #7025.

New compaction options

Auto-compaction now supports the maxRowsPerSegment option. Please see http://druid.io/docs/0.14.0-incubating/design/coordinator.html#compacting-segments for details.

The compaction task now supports a new segmentGranularity option, deprecating the older keepSegmentGranularity option for controlling the segment granularity of compacted segments. Please see the segmentGranularity table in http://druid.io/docs/0.14.0-incubating/ingestion/compaction.html for more information on these properties.

Added by @jihoonson in #6758 and #6780.
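
As an example of the new option, the sketch below builds a compaction task spec that sets segmentGranularity and leaves out the deprecated keepSegmentGranularity. The overall task layout is assumed from the compaction documentation linked above; the datasource name and interval are hypothetical.

```python
import json


def build_compaction_task(data_source, interval, segment_granularity):
    """Return a compaction task spec using the newer segmentGranularity option."""
    return {
        "type": "compact",
        "dataSource": data_source,
        "interval": interval,
        # Replaces the deprecated keepSegmentGranularity option.
        "segmentGranularity": segment_granularity,
    }


# Hypothetical datasource and interval.
task = build_compaction_task("wikipedia", "2019-01-01/2019-02-01", "DAY")
print(json.dumps(task, indent=2))
```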

More efficient cachingCost segment balancing strategy

The cachingCost Coordinator segment balancing strategy will now only consider Historical processes for balancing decisions. Previously the strategy would unnecessarily consider active worker tasks as well, which are not targets for segment replication.

Added by @QiuMM in #6879.

New metrics:

New allocation rate metric jvm/heapAlloc/bytes, added by @egor-ryashin in #6710.
New query count metric query/count, added by @QiuMM in #6473.
SQL query metrics sqlQuery/bytes and sqlQuery/time, added by @gaodayue in #6302.
Apache Kafka ingestion lag metrics ingest/kafka/maxLag and ingest/kafka/avgLag, added by @QiuMM in #6587.
Task count metrics task/success/count, task/failed/count, task/running/count, task/pending/count, task/waiting/count, added by @QiuMM in #6657.

New interfaces for extension developers

RequestLogEvent

It is now possible to control the fields in RequestLogEvent, emitted by EmittingRequestLogger. Please see #6477 for details. Added by @leventov.

Custom TLS certificate checks

An extension point for custom TLS certificate checks has been added. Please see http://druid.io/docs/0.14.0-incubating/operations/tls-support.html#custom-tls-certificate-checks for details. Added by @jon-wei in #6432.

Kafka Indexing Service no longer experimental

The Kafka Indexing Service extension has been moved out of experimental status.

SQL Enhancements

Enhancements to dsql

The dsql command line client now supports CLI history, basic autocomplete, and specifying query timeouts in the query context.

Added in #6929 by @gianm.

Add SQL id, request logs, and metrics

SQL queries now have an ID, and native queries executed as part of a SQL query will have the associated SQL query ID in the native query's request logs. SQL queries will now be logged in the request logs.

Two new metrics, sqlQuery/time and sqlQuery/bytes, are now emitted for SQL queries.

Please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#request-logging and http://druid.io/docs/0.14.0-incubating/querying/sql.html#sql-metrics for details.

Added by @gaodayue in #6302.

More SQL aggregator support

The following aggregators are now supported in SQL:

DataSketches HLL sketch
DataSketches Theta sketch
DataSketches quantiles sketch
Fixed bins histogram
Bloom filter aggregator

Added by @jon-wei in #6951 and @clintropolis in #6502.

Other SQL enhancements

SQL: Add support for queries with project-after-semijoin. #6756
SQL: Support for selecting multi-value dimensions. #6462
SQL: Support AVG on system tables. #601
SQL: Add "POSITION" function. #6596
SQL: Set INFORMATION_SCHEMA catalog name to "druid". #6595
SQL: Fix ordering of sort, sortProject in DruidSemiJoin. #6769

Added by @gianm.

Updating from 0.13.0-incubating and earlier

Kafka ingestion downtime when upgrading

Due to the issue described in #6958, existing Kafka indexing tasks can be terminated unnecessarily during a rolling upgrade of the Overlord. The terminated tasks will be restarted by the Overlord and will function correctly after the initial restart.

Parquet extension changes

The druid-parquet-extensions extension has been moved from contrib to core. When deploying 0.14.0-incubating, please ensure that your extensions-contrib directory does not have any older versions of the Parquet extension.

Additionally, there are now two styles of Parquet parsers in the extension:

parquet-avro: Converts Parquet to Avro, and then parses the Avro representation. This was the existing parser prior to 0.14.0-incubating.
parquet: A new parser that parses the Parquet format directly. Only this new parser supports int96 values.

Prior to 0.14.0-incubating, specifying a parquet type parser would have a task use the Avro-converting parser. In 0.14.0-incubating, to continue using the Avro-converting parser, you will need to update your ingestion specs to use parquet-avro instead.

The inputFormat field in the inputSpec for tasks using Parquet input must also match the choice of parser:

parquet: org.apache.druid.data.input.parquet.DruidParquetInputFormat
parquet-avro: org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat

Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/parquet.html for details.
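
One way to catch a mismatched spec before submitting it is a small check like the sketch below. The class names are taken from the Parquet extension documentation linked above; the helper function itself is hypothetical and not part of Druid.

```python
# Parser type -> inputFormat class expected in the task's inputSpec.
INPUT_FORMAT_BY_PARSER = {
    "parquet": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "parquet-avro": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
}


def parquet_spec_is_consistent(parser_type, input_format):
    """Return True if the inputFormat matches the chosen Parquet parser type."""
    expected = INPUT_FORMAT_BY_PARSER.get(parser_type)
    if expected is None:
        raise ValueError(f"not a Parquet parser type: {parser_type!r}")
    return input_format == expected


# A pre-0.14.0 spec migrated to parquet-avro but still using the wrong class.
print(parquet_spec_is_consistent(
    "parquet-avro",
    "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
))
```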

Running Druid with non-2.8.3 Hadoop

If you plan to use Druid 0.14.0-incubating with Hadoop versions other than 2.8.3, you may need to do the following:

Set the Hadoop dependency coordinates to your target version as described in http://druid.io/docs/0.14.0-incubating/operations/other-hadoop.html under Tip #3: Use specific versions of Hadoop libraries.
Rebuild Druid with your target version of Hadoop by changing hadoop.compile.version in the main Druid pom.xml and then following the standard build instructions.

Other Behavior changes

Old task cleanup

Old task entries in the metadata storage will now be cleaned up automatically together with their task logs. Please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#task-logging and #6592 for details.

Automatic processing buffer sizing

The druid.processing.buffer.sizeBytes property has new default behavior if it is not set. Druid will now automatically choose a value for the processing buffer size using the following formula:

processingBufferSize = totalDirectMemory / (numMergeBuffers + numProcessingThreads + 1)
processingBufferSize = min(processingBufferSize, 1GB)

Where:

totalDirectMemory: The direct memory limit for the JVM specified by -XX:MaxDirectMemorySize
numMergeBuffers: The value of druid.processing.numMergeBuffers.
numProcessingThreads: The value of druid.processing.numThreads.

At most, Druid will use 1GB for the automatically chosen processing buffer size. The processing buffer size can still be specified manually.

Please see #6588 for details.
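
The formula above can be worked through with a short sketch. It assumes that "1GB" means 2^30 bytes and that the division rounds down; both are assumptions, and the function name is hypothetical.

```python
GIB = 1024 ** 3  # assumption: the 1GB cap is treated as 2**30 bytes


def auto_processing_buffer_size(total_direct_memory, num_merge_buffers,
                                num_processing_threads):
    """Apply the sizing formula: divide direct memory evenly, then cap at 1GB."""
    per_buffer = total_direct_memory // (
        num_merge_buffers + num_processing_threads + 1
    )
    return min(per_buffer, GIB)


# 4 GiB of direct memory, 2 merge buffers, 7 processing threads:
# 4 GiB / 10 is below the cap, so the computed value is used.
print(auto_processing_buffer_size(4 * GIB, 2, 7))

# 64 GiB of direct memory with the same settings:
# 64 GiB / 10 exceeds the cap, so the result is limited to 1 GiB.
print(auto_processing_buffer_size(64 * GIB, 2, 7))
```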

Retention rules now include the future by default

Please be aware that new retention rules will now include the future by default. Please see #6414 for details.

Property changes

Segment announcing

The druid.announcer.type property used for choosing between Zookeeper or HTTP-based segment management/discovery has been moved to druid.serverview.type. If you were using http prior to 0.14.0-incubating, you will need to update your configs to use the new druid.serverview.type.

Please see the following for details:

Fix missing property in JsonTypeInfo of SegmentWriteOutMediumFactory

The druid.peon.defaultSegmentWriteOutMediumFactory.@type property has been fixed. The property is now druid.peon.defaultSegmentWriteOutMediumFactory.type without the "@".

Please see #6656 for details.
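
When updating configuration files, the two renames in this section could be applied with a helper along the lines of the sketch below. It assumes both changes are plain key renames with unchanged values, as the descriptions above suggest; the function and the sample values are hypothetical.

```python
# Old property name -> new property name, per the notes in this section.
RENAMED_PROPERTIES = {
    "druid.announcer.type": "druid.serverview.type",
    "druid.peon.defaultSegmentWriteOutMediumFactory.@type":
        "druid.peon.defaultSegmentWriteOutMediumFactory.type",
}


def migrate_properties(props):
    """Return a copy of props with renamed keys replaced and values kept."""
    return {RENAMED_PROPERTIES.get(key, key): value
            for key, value in props.items()}


# Hypothetical pre-0.14.0 settings.
old_props = {
    "druid.announcer.type": "http",
    "druid.processing.numThreads": "7",
}
print(migrate_properties(old_props))
```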

Deprecations

Approximate Histogram aggregator

The ApproximateHistogram aggregator has been deprecated; it is a distribution-dependent algorithm without formal error bounds and has significant accuracy issues.

The DataSketches quantiles aggregator should be used instead for quantile and histogram use cases.

Please see Histogram and Quantiles Aggregators for details.

Cardinality/HyperUnique aggregator

The Cardinality and HyperUnique aggregators have been deprecated in favor of the DataSketches HLL aggregator and Theta Sketch aggregator. These aggregators have better accuracy and performance characteristics.

Please see Count Distinct Aggregators for details.

Query Chunk Period

The chunkPeriod query context configuration is now deprecated, along with the associated query/intervalChunk/time metric. Please see #6591 for details.

`keepSegmentGranularity` for Compaction

The keepSegmentGranularity option for compaction tasks has been deprecated. Please see #6758 and the segmentGranularity table in http://druid.io/docs/0.14.0-incubating/ingestion/compaction.html for more information on these properties.

Interface changes for extension developers

`SegmentId` class

Druid now uses a SegmentId class instead of plain Strings to represent segment IDs. Please see #6370 for details.

Added by @leventov.

`druid-api`, `druid-common`, `java-util` moved to `druid-core`

The druid-api, druid-common, and java-util modules have been moved into druid-core. Please update your dependencies accordingly if your project depended on these libraries.

Please see #6443 for details.

Credits

Thanks to everyone who contributed to this release!

@a2l007
@AlexanderSaydakov
@anantmf
@ankit0811
@asdf2014
@awelsh93
@benhopp
@Caroline1000
@clintropolis
@dclim
@deiwin
@DiegoEliasCosta
@drcrallen
@dyf6372
@Dylan1312
@egor-ryashin
@elloooooo
@evans
@FaxianZhao
@gaodayue
@gianm
@glasser
@Guadrado
@hate13
@hoesler
@hpandeycodeit
@janeklb
@jihoonson
@jon-wei
@jorbay-au
@jsun98
@justinborromeo
@kamaci
@leventov
@lxqfy
@mirkojotic
@navkumar
@niketh
@patelh
@pzhdfy
@QiuMM
@rcgarcia74
@richardstartin
@robertervin
@samarthjain
@seoeun25
@Shimi
@surekhasaharan
@taiii
@thomask
@VincentNewkirk
@vogievetsky
@yunwan
@zhaojiandong

druid-0.14.0-incubating