druid-0.14.0-incubating
Apache Druid (incubating) 0.14.0-incubating contains over 200 new features, performance/stability/documentation improvements, and bug fixes from 54 contributors. Major new features and improvements include:
- New web console
- Amazon Kinesis indexing service
- Decommissioning mode for Historicals
- Published segment cache in Broker
- Bloom filter aggregator and expression
- Updated Apache Parquet extension
- Force push down option for nested GroupBy queries
- Better segment handoff and drop rule handling
- Automatically kill MapReduce jobs when Apache Hadoop ingestion tasks are killed
- DogStatsD tag support for statsd emitter
- New API for retrieving all lookup specs
- New compaction options
- More efficient cachingCost segment balancing strategy
The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Amerged+milestone%3A0.14.0
Documentation for this release is at: http://druid.io/docs/0.14.0-incubating/
Highlights
New web console
Druid has a new web console that provides functionality that was previously split between the coordinator and overlord consoles.
The new console allows the user to manage datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.
For more details, please see http://druid.io/docs/0.14.0-incubating/operations/management-uis.html
Added by @vogievetsky in #6923.
Kinesis indexing service
Druid now supports ingestion from Kinesis streams, provided by the new druid-kinesis-indexing-service core extension.
Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/kinesis-ingestion.html for details.
Decommissioning mode for Historicals
Historical processes can now be put into a "decommissioning" mode, where the coordinator will no longer consider the Historical process as a target for segment replication. The coordinator will also move segments off the decommissioning Historical.
This is controlled via Coordinator dynamic configuration. For more details, please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#dynamic-configuration.
Added by @egor-ryashin in #6349.
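As a rough illustration of the dynamic configuration change, the sketch below builds an updated configuration payload that marks Historicals as decommissioning. It assumes the field is named decommissioningNodes and holds host:port strings; verify both against the dynamic configuration documentation linked above. The host names and the maxSegmentsToMove value are hypothetical. The code only prepares JSON and does not contact a Coordinator.

```python
import json


def with_decommissioning(current_config, hosts):
    """Return a copy of the dynamic config with `hosts` marked as decommissioning.

    Assumption: the dynamic configuration field is named
    "decommissioningNodes"; check the configuration docs for your version.
    """
    updated = dict(current_config)
    existing = list(updated.get("decommissioningNodes", []))
    # Preserve the existing order and avoid duplicate entries.
    for host in hosts:
        if host not in existing:
            existing.append(host)
    updated["decommissioningNodes"] = existing
    return updated


# Hypothetical current config and Historical host:port values.
current = {"maxSegmentsToMove": 5, "decommissioningNodes": ["historical-1:8083"]}
payload = with_decommissioning(current, ["historical-2:8083", "historical-1:8083"])
print(json.dumps(payload, indent=2))
```

Starting from the current configuration rather than a fresh object is a deliberate choice here, so that unrelated settings are carried along unchanged.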
Published segment cache on Broker
The Druid Broker now has the ability to maintain a cache of published segments via polling the Coordinator, which can significantly improve response time for metadata queries on the sys.segments system table.
Please see http://druid.io/docs/0.14.0-incubating/querying/sql.html#retrieving-metadata for details.
Added by @surekhasaharan in #6901
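For context, a metadata query of the kind that benefits from this cache could look like the sketch below. It builds the JSON body for a Druid SQL request over HTTP. The column names (datasource, is_published) and the timeout context key are taken from the SQL documentation as commonly described; treat them as assumptions and check the linked page. The helper name build_sql_request is hypothetical, and no request is actually sent.

```python
import json


def build_sql_request(sql, timeout_ms=None):
    """Build the JSON body for a Druid SQL request (hypothetical helper)."""
    body = {"query": sql}
    if timeout_ms is not None:
        # Assumption: the query timeout is passed in the query context.
        body["context"] = {"timeout": timeout_ms}
    return json.dumps(body)


# Count published segments per datasource using the sys.segments table.
SEGMENTS_PER_DATASOURCE = (
    "SELECT datasource, COUNT(*) AS num_segments "
    "FROM sys.segments WHERE is_published = 1 "
    "GROUP BY datasource"
)

request_body = build_sql_request(SEGMENTS_PER_DATASOURCE, timeout_ms=30000)
print(request_body)
```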
Bloom filter aggregator and expression
A new aggregator for constructing Bloom filters at query time and support for performing Bloom filter checks within Druid expressions have been added to the druid-bloom-filter extension.
Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/bloom-filter.html
Added by @clintropolis in #6904 and #6397
Updated Parquet extension
The druid-parquet-extensions extension has been moved into the core extension set from the contrib extensions and now supports flattening and int96 values.
Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/parquet.html for details.
Added by @clintropolis in #6360
Force push down option for nested GroupBy queries
Outer query execution for nested GroupBy queries can now be pushed down to Historical processes; previously, the outer queries would always be executed on the Broker.
Please see #5471 for details.
Added by @samarthjain in #5471.
Better segment handoff and retention rule handling
Segment handoff will now ignore segments that would be dropped by a datasource's retention rules, avoiding ingestion failures caused by issue #5868.
Period load rules will now include the future by default.
A new "Period Drop Before" rule has been added. Please see http://druid.io/docs/0.14.0-incubating/operations/rule-configuration.html#period-drop-before-rule for details.
Added by @QiuMM in #6676, #6414, and #6415.
Automatically kill MapReduce jobs when Hadoop ingestion tasks are killed
Druid will now automatically terminate MapReduce jobs created by Hadoop batch ingestion tasks when the ingestion task is killed.
Added by @ankit0811 in #6828.
DogStatsD tag support for statsd-emitter
The statsd-emitter extension now supports DogStatsD-style tags. Please see http://druid.io/docs/0.14.0-incubating/development/extensions-contrib/statsd.html
Added by @deiwin in #6605, with support for constant tags added by @glasser in #6791.
New API for retrieving all lookup specs
A new API for retrieving all lookup specs for all tiers has been added. Please see http://druid.io/docs/0.14.0-incubating/querying/lookups.html#get-all-lookups for details.
Added by @jihoonson in #7025.
New compaction options
Auto-compaction now supports the maxRowsPerSegment option. Please see http://druid.io/docs/0.14.0-incubating/design/coordinator.html#compacting-segments for details.
The compaction task now supports a new segmentGranularity option, deprecating the older keepSegmentGranularity option for controlling the segment granularity of compacted segments. Please see the segmentGranularity table in http://druid.io/docs/0.14.0-incubating/ingestion/compaction.html for more information on these properties.
Added by @jihoonson in #6758 and #6780.
More efficient cachingCost segment balancing strategy
The cachingCost Coordinator segment balancing strategy will now only consider Historical processes for balancing decisions. Previously the strategy would unnecessarily consider active worker tasks as well, which are not targets for segment replication.
New metrics:
- New allocation rate metric jvm/heapAlloc/bytes, added by @egor-ryashin in #6710.
- New query count metric query/count, added by @QiuMM in #6473.
- SQL query metrics sqlQuery/bytes and sqlQuery/time, added by @gaodayue in #6302.
- Apache Kafka ingestion lag metrics ingest/kafka/maxLag and ingest/kafka/avgLag, added by @QiuMM in #6587.
- Task count metrics task/success/count, task/failed/count, task/running/count, task/pending/count, task/waiting/count, added by @QiuMM in #6657.
New interfaces for extension developers
RequestLogEvent
It is now possible to control the fields in RequestLogEvent, emitted by EmittingRequestLogger. Please see #6477 for details. Added by @leventov.
Custom TLS certificate checks
An extension point for custom TLS certificate checks has been added. Please see http://druid.io/docs/0.14.0-incubating/operations/tls-support.html#custom-tls-certificate-checks for details. Added by @jon-wei in #6432.
Kafka Indexing Service no longer experimental
The Kafka Indexing Service extension has been moved out of experimental status.
SQL Enhancements
Enhancements to dsql
The dsql command line client now supports CLI history, basic autocomplete, and specifying query timeouts in the query context.
Add SQL id, request logs, and metrics
SQL queries now have an ID, and native queries executed as part of a SQL query will have the associated SQL query ID in the native query's request logs. SQL queries will now be logged in the request logs.
Two new metrics, sqlQuery/time and sqlQuery/bytes, are now emitted for SQL queries.
Please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#request-logging and http://druid.io/docs/0.14.0-incubating/querying/sql.html#sql-metrics for details.
More SQL aggregator support
The following aggregators are now supported in SQL:
- DataSketches HLL sketch
- DataSketches Theta sketch
- DataSketches quantiles sketch
- Fixed bins histogram
- Bloom filter aggregator
Added by @jon-wei in #6951 and @clintropolis in #6502
Other SQL enhancements
- SQL: Add support for queries with project-after-semijoin. #6756
- SQL: Support for selecting multi-value dimensions. #6462
- SQL: Support AVG on system tables. #601
- SQL: Add "POSITION" function. #6596
- SQL: Set INFORMATION_SCHEMA catalog name to "druid". #6595
- SQL: Fix ordering of sort, sortProject in DruidSemiJoin. #6769
Added by @gianm.
Updating from 0.13.0-incubating and earlier
Kafka ingestion downtime when upgrading
Due to the issue described in #6958, existing Kafka indexing tasks can be terminated unnecessarily during a rolling upgrade of the Overlord. The terminated tasks will be restarted by the Overlord and will function correctly after the initial restart.
Parquet extension changes
The druid-parquet-extensions extension has been moved from contrib to core. When deploying 0.14.0-incubating, please ensure that your extensions-contrib directory does not have any older versions of the Parquet extension.
Additionally, there are now two styles of Parquet parsers in the extension:
- parquet-avro: Converts Parquet to Avro, and then parses the Avro representation. This was the existing parser prior to 0.14.0-incubating.
- parquet: A new parser that parses the Parquet format directly. Only this new parser supports int96 values.
Prior to 0.14.0-incubating, specifying a parquet type parser would have a task use the Avro-converting parser. In 0.14.0-incubating, to continue using the Avro-converting parser, you will need to update your ingestion specs to use parquet-avro instead.
The inputFormat field in the inputSpec for tasks using Parquet input must also match the choice of parser:
- parquet: org.apache.druid.data.input.parquet.DruidParquetInputFormat
- parquet-avro: org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat
Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/parquet.html for details.
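The spec change needed to keep the Avro-converting parser can be sketched as a small migration helper. It assumes a Hadoop ingestion spec laid out as spec.dataSchema.parser and spec.ioConfig.inputSpec, and it assumes the Avro input format class is named DruidParquetAvroInputFormat; confirm both against the Parquet extension documentation before relying on them. The helper name and the sample spec are hypothetical, and other fields such as the parseSpec may need their own updates that this sketch does not attempt.

```python
import copy

# Assumption: class name of the Avro-converting input format; verify in the docs.
AVRO_INPUT_FORMAT = "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat"


def keep_avro_parser(spec):
    """Return a copy of a pre-0.14.0 spec that keeps the Avro-converting parser."""
    updated = copy.deepcopy(spec)
    parser = updated["spec"]["dataSchema"]["parser"]
    if parser.get("type") == "parquet":
        # Before 0.14.0, "parquet" meant the Avro-converting parser.
        parser["type"] = "parquet-avro"
        # The inputFormat must match the chosen parser.
        updated["spec"]["ioConfig"]["inputSpec"]["inputFormat"] = AVRO_INPUT_FORMAT
    return updated


# Hypothetical, heavily trimmed pre-0.14.0 spec.
old_spec = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {"parser": {"type": "parquet"}},
        "ioConfig": {
            "inputSpec": {
                "type": "static",
                "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
                "paths": "/path/to/file.parquet",
            }
        },
    },
}

new_spec = keep_avro_parser(old_spec)
print(new_spec["spec"]["dataSchema"]["parser"]["type"])
```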
Running Druid with non-2.8.3 Hadoop
If you plan to use Druid 0.14.0-incubating with Hadoop versions other than 2.8.3, you may need to do the following:
- Set the Hadoop dependency coordinates to your target version as described in http://druid.io/docs/0.14.0-incubating/operations/other-hadoop.html under Tip #3: Use specific versions of Hadoop libraries.
- Rebuild Druid with your target version of Hadoop by changing hadoop.compile.version in the main Druid pom.xml and then following the standard build instructions.
Other Behavior changes
Old task cleanup
Old task entries in the metadata storage will now be cleaned up automatically together with their task logs. Please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#task-logging and #6592 for details.
Automatic processing buffer sizing
The druid.processing.buffer.sizeBytes property has new default behavior if it is not set. Druid will now automatically choose a value for the processing buffer size using the following formula:
processingBufferSize = totalDirectMemory / (numMergeBuffers + numProcessingThreads + 1)
processingBufferSize = min(processingBufferSize, 1GB)
Where:
- totalDirectMemory: The direct memory limit for the JVM specified by -XX:MaxDirectMemorySize.
- numMergeBuffers: The value of druid.processing.numMergeBuffers.
- numProcessingThreads: The value of druid.processing.numThreads.
At most, Druid will use 1GB for the automatically chosen processing buffer size. The processing buffer size can still be specified manually.
Please see #6588 for details.
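To make the sizing formula concrete, here is a small worked sketch. It assumes that "1GB" means 2**30 bytes and that the division rounds down to whole bytes; neither detail is stated above, so check #6588 if the exact value matters. The function name is hypothetical.

```python
GIB = 1024 ** 3  # assumption: "1GB" in the formula means 2**30 bytes


def default_processing_buffer_size(total_direct_memory, num_merge_buffers, num_processing_threads):
    """Apply the automatic sizing formula, with all sizes in bytes."""
    # Assumption: integer division, rounding down to whole bytes.
    size = total_direct_memory // (num_merge_buffers + num_processing_threads + 1)
    return min(size, GIB)


# 8 GiB of direct memory shared across 2 + 7 + 1 = 10 buffers.
print(default_processing_buffer_size(8 * GIB, 2, 7))   # 858993459
# With 64 GiB the uncapped share would be 6.4 GiB, so the 1 GiB cap applies.
print(default_processing_buffer_size(64 * GIB, 2, 7))  # 1073741824
```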
Retention rules now include the future by default
Please be aware that new retention rules will now include the future by default. Please see #6414 for details.
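The effect of including the future can be illustrated with a simplified model of a period-based rule. This is a sketch of the behavior as described here, not Druid's implementation: a timedelta stands in for an ISO-8601 period, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def period_rule_applies(seg_start, seg_end, now, period, include_future=True):
    """Simplified model of whether a period rule matches a segment interval."""
    rule_start = now - period
    if include_future:
        # The rule window is effectively [now - period, +infinity).
        return seg_end > rule_start
    # The rule window is [now - period, now); the segment must overlap it.
    return seg_start < now and seg_end > rule_start


now = datetime(2019, 4, 1)
period = timedelta(days=30)

# A segment holding future-dated data, one month ahead of "now".
future_segment = (datetime(2019, 5, 1), datetime(2019, 5, 2))

print(period_rule_applies(*future_segment, now, period))                        # True
print(period_rule_applies(*future_segment, now, period, include_future=False))  # False
```

In this model, a segment with future timestamps is matched under the new default but would have been skipped by the same rule when the future was not included.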
Property changes
Segment announcing
The druid.announcer.type property used for choosing between Zookeeper or HTTP-based segment management/discovery has been moved to druid.serverview.type. If you were using http prior to 0.14.0-incubating, you will need to update your configs to use the new druid.serverview.type.
Please see the following for details:
- http://druid.io/docs/0.14.0-incubating/configuration/index.html#segment-management
- http://druid.io/docs/0.14.0-incubating/configuration/index.html#segment-discovery
Fix missing property in JsonTypeInfo of SegmentWriteOutMediumFactory
The druid.peon.defaultSegmentWriteOutMediumFactory.@type property has been fixed. The property is now druid.peon.defaultSegmentWriteOutMediumFactory.type without the "@".
Please see #6656 for details.
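The two property renames in this section could be applied to an existing runtime.properties file with a helper along these lines. It is a minimal sketch that assumes simple key=value lines and assumes each value carries over unchanged under the new key; confirm the latter against the linked configuration pages. The helper name and the sample values are hypothetical.

```python
# Old property name -> new property name, as described in this section.
RENAMED_PROPERTIES = {
    "druid.announcer.type": "druid.serverview.type",
    "druid.peon.defaultSegmentWriteOutMediumFactory.@type":
        "druid.peon.defaultSegmentWriteOutMediumFactory.type",
}


def migrate_properties(text):
    """Rename known properties, leaving comments and other lines untouched."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key, value = stripped.split("=", 1)
            new_key = RENAMED_PROPERTIES.get(key.strip())
            if new_key is not None:
                # Assumption: the value is valid as-is under the new key.
                line = f"{new_key}={value.strip()}"
        out.append(line)
    return "\n".join(out)


# Hypothetical pre-0.14.0 configuration.
old_properties = "\n".join([
    "# segment discovery",
    "druid.announcer.type=http",
    "druid.peon.defaultSegmentWriteOutMediumFactory.@type=tmpFile",
    "druid.port=8083",
])

print(migrate_properties(old_properties))
```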
Deprecations
Approximate Histogram aggregator
The ApproximateHistogram aggregator has been deprecated; it is a distribution-dependent algorithm without formal error bounds and has significant accuracy issues.
The DataSketches quantiles aggregator should be used instead for quantile and histogram use cases.
Please see Histogram and Quantiles Aggregators for details.
Cardinality/HyperUnique aggregator
The Cardinality and HyperUnique aggregators have been deprecated in favor of the DataSketches HLL aggregator and Theta Sketch aggregator. These aggregators have better accuracy and performance characteristics.
Please see Count Distinct Aggregators for details.
Query Chunk Period
The chunkPeriod query context configuration is now deprecated, along with the associated query/intervalChunk/time metric. Please see #6591 for details.
keepSegmentGranularity for Compaction
The keepSegmentGranularity option for compaction tasks has been deprecated. Please see #6758 and the segmentGranularity table in http://druid.io/docs/0.14.0-incubating/ingestion/compaction.html for more information on these properties.
Interface changes for extension developers
SegmentId class
Druid now uses a SegmentId class instead of plain Strings to represent segment IDs. Please see #6370 for details.
Added by @leventov.
druid-api, druid-common, java-util moved to druid-core
The druid-api, druid-common, java-util modules have been moved into druid-core. Please update your dependencies accordingly if your project depended on these libraries.
Please see #6443 for details.
Credits
Thanks to everyone who contributed to this release!
@a2l007
@AlexanderSaydakov
@anantmf
@ankit0811
@asdf2014
@awelsh93
@benhopp
@Caroline1000
@clintropolis
@dclim
@deiwin
@DiegoEliasCosta
@drcrallen
@dyf6372
@Dylan1312
@egor-ryashin
@elloooooo
@evans
@FaxianZhao
@gaodayue
@gianm
@glasser
@Guadrado
@hate13
@hoesler
@hpandeycodeit
@janeklb
@jihoonson
@jon-wei
@jorbay-au
@jsun98
@justinborromeo
@kamaci
@leventov
@lxqfy
@mirkojotic
@navkumar
@niketh
@patelh
@pzhdfy
@QiuMM
@rcgarcia74
@richardstartin
@robertervin
@samarthjain
@seoeun25
@Shimi
@surekhasaharan
@taiii
@thomask
@VincentNewkirk
@vogievetsky
@yunwan
@zhaojiandong