druid-0.13.0-incubating
Druid 0.13.0-incubating contains over 400 new features, performance/stability/documentation improvements, and bug fixes from 81 contributors. It is the first release of Druid in the Apache Incubator program. Major new features and improvements include:
- native parallel batch indexing
- automatic segment compaction
- system schema tables
- improved indexing task status, statistics, and error reporting
- SQL-compatible null handling
- result-level broker caching
- ingestion from RDBMS
- Bloom filter support
- additional SQL result formats
- additional aggregators (stringFirst/stringLast, ArrayOfDoublesSketch, HllSketch)
- support for multiple grouping specs in groupBy query
- mutual TLS support
- HTTP-based worker management
- broker backpressure
- maxBytesInMemory ingestion tuning configuration
- materialized views (community extension)
- parser for Influx Line Protocol (community extension)
- OpenTSDB emitter (community extension)
The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Aclosed+milestone%3A0.13.0
Documentation for this release is at: http://druid.io/docs/0.13.0-incubating/
Highlights
Native parallel batch indexing
Introduces the index_parallel supervisor, which manages the parallel batch ingestion of splittable sources without requiring a dependency on Hadoop. See http://druid.io/docs/latest/ingestion/native_tasks.html for more information.
Note: This is the initial single-phase implementation and has limitations on how it expects the input data to be partitioned. Notably, it does not have a shuffle implementation which will be added in the next iteration of this feature. For more details, see the proposal at #5543.
Added by @jihoonson in #5492.
Automatic segment compaction
Previously, compacting small segments into optimally-sized ones to improve query performance required submitting and running compaction or re-indexing tasks. This was often a manual process or required an external scheduler to handle the periodic submission of tasks. This patch implements automatic segment compaction managed by the coordinator service.
Note: This is the initial implementation and has limitations on interoperability with realtime ingestion tasks. Indexing tasks currently require acquisition of a lock on the portion of the timeline they will be modifying to prevent inconsistencies from concurrent operations. This implementation uses low-priority locks to ensure that it never interrupts realtime ingestion, but this also means that compaction may fail to make any progress if the realtime tasks are continually acquiring locks on the time interval being compacted. This will be improved in the next iteration of this feature with finer-grained locking. For more details, see the proposal at #4479.
Documentation for this feature: http://druid.io/docs/0.13.0-incubating/design/coordinator.html#compacting-segments
Added by @jihoonson in #5102.
System schema tables
Adds a system schema to the SQL interface which contains tables exposing information on served and published segments, nodes of the cluster, and information on running and completed indexing tasks.
Note: This implementation contains some known overhead inefficiencies that will be addressed in a future patch.
Documentation for this feature: http://druid.io/docs/0.13.0-incubating/querying/sql.html#system-schema
Added by @surekhasaharan in #6094.
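As a rough illustration, the system schema can be queried like any other SQL table through the broker's SQL HTTP endpoint. The sketch below only builds the request body and does not send it. The table name sys.segments and the /druid/v2/sql/ path follow the SQL documentation linked above; the broker host and port are placeholder assumptions.

```python
import json

# Placeholder broker address (an assumption); substitute your own host and port.
BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql/"


def build_sql_request(sql: str) -> str:
    """Return the JSON body for a Druid SQL HTTP request."""
    return json.dumps({"query": sql})


# Count the segments visible through the system schema.
body = build_sql_request("SELECT COUNT(*) AS segment_count FROM sys.segments")
print(BROKER_SQL_URL)
print(body)
```

The body would be sent as an HTTP POST with a Content-Type of application/json.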
Improved indexing task status, statistics, and error reporting
Improves the performance and detail of the ingestion-related APIs, which were previously quite opaque, making it difficult to determine the cause of parse exceptions, task failures, and the actual output from a completed task. Also adds improved ingestion metric reporting, including moving average throughput statistics.
Added by @surekhasaharan and @jon-wei in #5801, #5418, and #5748.
SQL-compatible null handling
Improves Druid's handling of null values by treating them as missing values instead of being equivalent to empty strings or a zero-value. This makes Druid more SQL compatible and improves integration with external BI tools supporting ODBC/JDBC. See #4349 for proposal.
To enable this feature, you will need to set the system-wide property druid.generic.useDefaultValueForNull=false.
Added by @nishantmonu51 in #5278 and #5958.
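The difference between the two modes can be pictured with a small model. This is an illustrative sketch only, not Druid's implementation; the function names are hypothetical, and the flag parameter simply mirrors the druid.generic.useDefaultValueForNull property.

```python
from typing import Optional


def stored_string(value: Optional[str], use_default_value_for_null: bool) -> Optional[str]:
    """Model how a null string is treated under each mode (illustrative only)."""
    if value is None and use_default_value_for_null:
        # Legacy behavior: null is equivalent to the empty string.
        return ""
    # SQL-compatible behavior: null stays a distinct missing value.
    return value


def stored_number(value: Optional[float], use_default_value_for_null: bool) -> Optional[float]:
    """Model how a null number is treated under each mode (illustrative only)."""
    if value is None and use_default_value_for_null:
        # Legacy behavior: null is equivalent to a zero value.
        return 0.0
    return value


print(stored_string(None, True), stored_string(None, False))
print(stored_number(None, True), stored_number(None, False))
```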
Result-level broker caching
Implements result-level caching on brokers which can operate concurrently with the traditional segment-level cache. See #4843 for proposal.
Documentation for this feature: http://druid.io/docs/0.13.0-incubating/configuration/index.html#broker-caching
Ingestion from RDBMS
Introduces a sql firehose which supports data ingestion directly from an RDBMS.
Bloom filter support
Adds support for optimizing Druid queries by applying a Bloom filter generated by an external system such as Apache Hive. In the future, #6397 will support generation of Bloom filters as the result of Druid queries which can then be used to optimize future queries.
Added by @nishantmonu51 in #6222.
Additional SQL result formats
Adds result formats for line-based JSON and CSV, as well as X-Druid-Column-Names and X-Druid-Column-Types response headers containing a list of the columns in the result.
'stringLast' and 'stringFirst' aggregators
Introduces two complementary aggregators, stringLast and stringFirst, which operate on string columns and return the value with the maximum and minimum timestamp respectively.
Added by @andresgomezfrr in #5789.
ArrayOfDoublesSketch
Adds support for numeric Tuple sketches, which extend the functionality of the count distinct Theta sketches by adding arrays of double values associated with unique keys.
Added by @AlexanderSaydakov in #5148.
HllSketch
Adds a configurable implementation of a count distinct aggregator based on HllSketch from https://github.com/DataSketches. Comparison to Druid's native HyperLogLogCollector shows improved accuracy, efficiency, and speed: https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html
Added by @AlexanderSaydakov in #5712.
Support for multiple grouping specs in groupBy query
Adds support for the subtotalsSpec groupBy parameter, which allows Druid to be efficient by reusing intermediate results at the broker level when running multiple queries that group by subsets of the same set of columns. See proposal in #5179 for more information.
Added by @himanshug in #5280.
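A minimal sketch of what such a query might look like, built here as a Python dictionary and serialized to JSON. The datasource, dimension, and aggregator names are hypothetical, and the list-of-lists shape of subtotalsSpec reflects our reading of the groupBy documentation rather than anything stated in these notes.

```python
import json

# Hypothetical datasource, dimensions, and metric name for illustration.
query = {
    "queryType": "groupBy",
    "dataSource": "example_datasource",
    "granularity": "all",
    "intervals": ["2018-01-01/2018-02-01"],
    "dimensions": ["country", "device", "browser"],
    "aggregations": [{"type": "count", "name": "rows"}],
    # Each inner list is one grouping, drawn from the dimensions above.
    "subtotalsSpec": [
        ["country", "device", "browser"],
        ["country", "device"],
        ["country"],
    ],
}

# Sanity check: every grouping must be a subset of the query's dimensions.
for grouping in query["subtotalsSpec"]:
    assert set(grouping) <= set(query["dimensions"])

print(json.dumps(query, indent=2))
```

One query of this shape stands in for three separate groupBy queries, one per grouping.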
Mutual TLS support
Adds support for mutual TLS (server certificate validation + client certificate validation). See: https://en.wikipedia.org/wiki/Mutual_authentication
HTTP based worker management
Adds an HTTP-based indexing task management implementation to replace the previous one based on ZooKeeper. Part of a set of improvements to reduce and eventually eliminate Druid's dependency on ZooKeeper. See #4996 for proposal.
Added by @himanshug in #5104.
Broker backpressure
Allows the broker to exert backpressure on data-serving nodes to prevent the broker from crashing under memory pressure when results are coming in faster than they are being read by clients.
'maxBytesInMemory' ingestion tuning configuration
Previously, a major tuning parameter for indexing task memory management was the maxRowsInMemory configuration, which determined the threshold for spilling the contents of memory to disk. This was difficult to properly configure since the 'size' of a row varied based on multiple factors. maxBytesInMemory makes this configuration byte-based instead of row-based.
Added by @surekhasaharan in #5583.
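To see why a byte-based threshold is easier to reason about, consider the toy model below. It is not Druid's implementation; the function name and the row sizes are invented for illustration. With the same byte budget, the number of rows buffered before a spill varies with row width, which is exactly what made a fixed row count hard to tune.

```python
from typing import Sequence


def rows_until_spill(row_sizes_bytes: Sequence[int], max_bytes_in_memory: int) -> int:
    """Count rows buffered before the running byte total reaches the limit."""
    total = 0
    for count, size in enumerate(row_sizes_bytes, start=1):
        total += size
        if total >= max_bytes_in_memory:
            return count
    # The limit was never reached, so every row fits in memory.
    return len(row_sizes_bytes)


narrow_rows = [100] * 1000    # hypothetical 100-byte rows
wide_rows = [1000] * 1000     # hypothetical 1000-byte rows

print(rows_until_spill(narrow_rows, 10_000))
print(rows_until_spill(wide_rows, 10_000))
```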
Materialized views
Supports the creation of materialized views which can improve query performance in certain situations at the cost of additional storage. See http://druid.io/docs/latest/development/extensions-contrib/materialized-view.html for more information.
Note: This is a community-contributed extension and is not automatically included in the Druid distribution. We welcome feedback for deciding when to promote this to a core extension. For more information, see Community Extensions.
Added by @zhangxinyu1 in #5556.
Parser for Influx Line Protocol
Adds support for ingesting the Influx Line Protocol data format. For more information, see: https://docs.influxdata.com/influxdb/v1.6/write_protocols/line_protocol_tutorial/
Note: This is a community-contributed extension and is not automatically included in the Druid distribution. We welcome feedback for deciding when to promote this to a core extension. For more information, see Community Extensions.
Added by @njhartwell in #5440.
OpenTSDB emitter
Adds support for emitting Druid metrics to OpenTSDB.
Note: This is a community-contributed extension and is not automatically included in the Druid distribution. We welcome feedback for deciding when to promote this to a core extension. For more information, see Community Extensions.
Updating from 0.12.3 and earlier
Please see below for changes between 0.12.3 and 0.13.0 that you should be aware of before upgrading. If you're updating from an earlier version than 0.12.3, please see release notes of the relevant intermediate versions for additional notes.
MySQL metadata storage extension no longer includes JDBC driver
The MySQL metadata storage extension is now packaged together with the Druid distribution but without the required MySQL JDBC driver (due to licensing restrictions). To use this extension, the driver will need to be downloaded separately and added to the extension directory.
See http://druid.io/docs/latest/development/extensions-core/mysql.html for more details.
AWS region configuration required for S3 extension
As a result of switching from jets3t to the AWS SDK (#5382), users of the S3 extension are now required to explicitly set the target region. This can be done by setting the JVM system property aws.region or the environment variable AWS_REGION.
As an example, to set the region to 'us-east-1' through system properties:
- add -Daws.region=us-east-1 to the jvm.config file for all Druid services
- add -Daws.region=us-east-1 to druid.indexer.runner.javaOpts in middleManager/runtime.properties so that the property will be passed to peon (worker) processes
Ingestion spec changes
As a result of renaming packaging from io.druid to org.apache.druid, ingestion specs that reference classes by their fully-qualified class name will need to be modified accordingly.
As an example, if you are using the Parquet extension with Hadoop indexing, the inputFormat field of the inputSpec will need to change from io.druid.data.input.parquet.DruidParquetInputFormat to org.apache.druid.data.input.parquet.DruidParquetInputFormat.
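For specs with many such references, the rename could be scripted. The following is a minimal sketch, assuming the spec has already been loaded into Python dictionaries and lists (for example with json.load); the helper name and the sample inputSpec fragment are hypothetical. It rewrites only string values that begin with the old package prefix, so review the result before submitting it.

```python
from typing import Any

OLD_PREFIX = "io.druid."
NEW_PREFIX = "org.apache.druid."


def rename_packages(node: Any) -> Any:
    """Recursively rewrite string values that start with the old package prefix."""
    if isinstance(node, dict):
        return {key: rename_packages(value) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_packages(item) for item in node]
    if isinstance(node, str) and node.startswith(OLD_PREFIX):
        return NEW_PREFIX + node[len(OLD_PREFIX):]
    return node


# Hypothetical fragment of a Hadoop indexing inputSpec.
input_spec = {
    "type": "static",
    "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
    "paths": "/path/to/example.parquet",
}

updated = rename_packages(input_spec)
print(updated["inputFormat"])
```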
Metrics changes
New metrics
- task/action/log/time - Milliseconds taken to log a task action to the audit log (#5714)
- task/action/run/time - Milliseconds taken to execute a task action (#5714)
- query/node/backpressure - Nanoseconds the channel is unreadable due to backpressure being applied (#6335) (Note that this is not enabled by default and requires a custom implementation of QueryMetrics to emit)
New dimensions
- taskId and taskType added to task-related metrics (#5664)
Other
- HttpPostEmitterMonitor no longer emits maxTime and minTime if no times were recorded (#6418)
Rollback restrictions
64-bit doubles aggregators
64-bit doubles aggregators are now used by default (see #5478). Support for 64-bit floating point columns was released in Druid 0.11.0, so if this is enabled, versions older than 0.11.0 will not be able to read the data segments.
To disable and keep the old format, you will need to set the system-wide property druid.indexing.doubleStorage=float.
Disabling bitmap indexes
0.13.0 adds support for disabling bitmap indexes on a per-column basis, which can save space in cases where bitmap indexes add no value. This is done by setting the 'createBitmapIndex' field in the dimension schema. Segments written with this option will not be backwards compatible with older versions of Druid (#5402).
Behavior changes
Java package name changes
Druid's package names have all moved from io.druid to org.apache.druid. This affects the name of the Java main class that you should run when starting up services, which is now org.apache.druid.cli.Main. It may also affect installation and configuration of extensions and monitors.
ParseSpec is now a required field in ingestion specs
There is no longer a default ParseSpec (previously the DelimitedParseSpec). Ingestion specs now require parseSpec to be specified. If you previously did not provide a parseSpec, you should use one with "format": "tsv" to maintain the existing behavior (#6310).
Change to default maximum rows to return in one JDBC frame
The default value for druid.sql.avatica.maxRowsPerFrame was reduced from 100k to 5k to minimize out of memory errors (#5409).
Router behavior change when routing to brokers dedicated to different time ranges
As a result of #5595, routers may now select an undesired broker in configurations where there are different tiers of brokers that are intended to be dedicated to queries on different time ranges. See #1362 for discussion.
Ruby TimestampSpec no longer ignores milliseconds
Timestamps parsed using a TimestampSpec with format 'ruby' no longer have their millisecond component truncated. If you were using this parser and wanted a query granularity of SECOND, ensure that it is configured appropriately in your indexing specs (#6217).
Small increase in size of ZooKeeper task announcements
The datasource name was added to TaskAnnouncement, which will result in a small per-task increase in the amount of data stored in ZooKeeper (#5511).
Addition of 'name' field to filtered aggregators
Aggregators of type 'filtered' now support a 'name' field. Previously, the filtered aggregator inherited the name of the aggregator it wrapped. If you have provided the 'name' field for both the filtered aggregator and the wrapped aggregator, it will prefer the name of the filtered aggregator. It will use the name of the wrapped aggregator if the name of the filtered aggregator is missing or empty (#6219).
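The precedence rule described above can be sketched as follows. This models only the naming rule as stated in these notes and is not Druid's code; the helper name and the sample aggregator specs are hypothetical.

```python
def resolve_filtered_aggregator_name(filtered: dict) -> str:
    """Prefer the filtered aggregator's own name; fall back to the wrapped one."""
    own_name = filtered.get("name")
    if own_name:
        # A non-empty name on the filtered aggregator wins.
        return own_name
    # Missing or empty name: inherit from the wrapped aggregator.
    return filtered["aggregator"]["name"]


# Hypothetical filtered aggregator specs.
with_name = {
    "type": "filtered",
    "name": "us_rows",
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
    "aggregator": {"type": "count", "name": "rows"},
}
without_name = {
    "type": "filtered",
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
    "aggregator": {"type": "count", "name": "rows"},
}

print(resolve_filtered_aggregator_name(with_name))
print(resolve_filtered_aggregator_name(without_name))
```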
utf8mb4 is now the recommended metadata storage charset
For upgrade instructions, use the ALTER DATABASE and ALTER TABLE instructions as described here: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html.
For motivation and reference, see #5377 and #5411.
Removed configuration properties
- druid.indexer.runner.tlsStartPort has been removed (#6194).
- druid.indexer.runner.separateIngestionEndpoint has been removed (#6263).
Interface changes for extension developers
- Packaging has been renamed from io.druid to org.apache.druid. All third-party extensions will need to rename their META-INF/io.druid.initialization.DruidModule to org.apache.druid.initialization.DruidModule and update their extension's packaging appropriately (#6266).
- The DataSegmentPuller interface has been removed (#5461).
- A number of functions under java-util have been removed (#5461).
- The constructor of the Metadata class has changed (#5613).
- The 'spark2' Maven profile has been removed (#5581).
API deprecations
Overlord
- The /druid/indexer/v1/supervisor/{id}/shutdown endpoint has been deprecated in favor of /druid/indexer/v1/supervisor/{id}/terminate (#6272 and #6234).
- The /druid/indexer/v1/task/{taskId}/segments endpoint has been deprecated (#6368).
- The status field returned by /druid/indexer/v1/task/ID/status has been deprecated in favor of statusCode (#6334).
- The reportParseExceptions and ignoreInvalidRows parameters for ingestion tuning configurations have been deprecated in favor of logParseExceptions and maxParseExceptions (#5418).
Broker
- The /druid/v2/datasources/{dataSourceName}/dimensions endpoint has been deprecated. A segment metadata query or the INFORMATION_SCHEMA SQL table should be used instead (#6361).
- The /druid/v2/datasources/{dataSourceName}/metrics endpoint has been deprecated. A segment metadata query or the INFORMATION_SCHEMA SQL table should be used instead (#6361).
Credits
Thanks to everyone who contributed to this release!
@a2l007
@adursun
@AK08
@akashdw
@aleksi75
@AlexanderSaydakov
@alperkokmen
@amalakar
@andresgomezfrr
@apollotonkosmo
@asdf2014
@awelsh93
@b-slim
@bolkedebruin
@Caroline1000
@chengchengpei
@clintropolis
@dansuzuki
@dclim
@DiegoEliasCosta
@dragonls
@drcrallen
@dyanarose
@dyf6372
@Dylan1312
@erikdubbelboer
@es1220
@evasomething
@fjy
@Fokko
@gaodayue
@gianm
@hate13
@himanshug
@hoesler
@jaihind213
@jcollado
@jihoonson
@jim-slattery-rs
@jkukul
@jon-wei
@josephglanville
@jsun98
@kaijianding
@KenjiTakahashi
@kevinconaway
@korvit0
@leventov
@lssenthilkumar
@mhshimul
@niketh
@NirajaM
@nishantmonu51
@njhartwell
@palanieppan-m
@pdeva
@pjain1
@QiuMM
@redlion99
@rpless
@samarthjain
@Scorpil
@scrawfor
@shiroari
@shivtools
@siddharths
@SpyderRivera
@spyk
@stuartmclean
@surekhasaharan
@susielu
@varaga
@vineshcpaul
@vvc11
@wysstartgo
@xvrl
@yunwan
@yuppie-flu
@yurmix
@zhangxinyu1
@zhztheplayer