
ray-ng/TierLevelDB


multi-version support (existing value-log reclaim keeps only the latest value in the log); value-log reclaim (reads must be locked so they do not fetch invalid addresses; writes must be locked so the latest value is not overwritten); GC I/O; vLog compression without additional metadata (less I/O overhead); value-log compression (higher cache hit ratio) and an in-memory cache; no keys stored in the value log (less data read and written, higher cache hit ratio); size threshold for KV separation (gate size); deciding whether to reclaim vLogs according to SSTable compaction statistics; crash consistency algorithm; vLog compression, which is vital to column-oriented databases; disk I/O has a minimum unit (set the vLog block size to this value, working with compression and the cache); range query (read-ahead, vLog compression, cache)

Background: MVCC; snapshots; KV separation

Abstract

Key-value (KV) stores built on the LSM-tree offer excellent write throughput, yet the LSM-tree suffers from high write amplification. KV separation mitigates the I/O amplification of the LSM-tree by storing keys in the LSM-tree and values in value logs (vLogs). However, existing KV separation designs have several flaws that constrain their performance and scope of use. We present TierLevelKV, which introduces the notion of assemblies to organize SSTables and vLogs. The KV separation design of TierLevelKV performs garbage collection in vLogs with low I/O overhead and without blocking read or write operations. Meanwhile, TierLevelKV supports vLog compression and multi-versioning, so it can serve as a storage engine for column-oriented databases.

Introduction

We argue that the performance and scope of use of current KV separation designs are constrained by several flaws. First, garbage collection (GC) operations in current designs retain only the latest value for each key, so multi-versioning is not considered. Second, GC operations need to block read operations, which could otherwise fetch invalid addresses pointing into reclaimed storage space. Write operations also need to be blocked during GC, since otherwise the latest values could be overwritten by previous versions written back by GC. Third, GC operations need to scan the items in the selected vLog to check the validity of the KV pairs, leading to high I/O overhead. Fourth, online GC requires keys to be stored in vLogs as well, which consumes extra I/O.

We propose three complementary innovations to remedy the above deficiencies. The first innovation is the notion of assemblies and an algorithm for generating them. We use assemblies to organize SSTables and vLogs: an assembly contains several SSTables as well as the vLogs that these SSTables refer to. In addition, each GC operation selects a single assembly to operate on. In this way, our KV separation design supports multi-versioning and vLog compression. The second innovation is the horizontal compaction operation. Our design uses horizontal compaction to perform GC, which brings several advantages. First, GC can be performed online and asynchronously, without blocking read or write operations. Second, our design eliminates the scans through vLogs for checking the validity of KV pairs, reducing I/O overhead. Third, keys no longer need to be stored in vLogs, leading to less I/O during GC. The third innovation is a crash consistency mechanism that takes assemblies into account and enables fast recovery.

Combining the three innovations above yields TierLevelKV, a new KV store built on top of LevelDB. (experiment results)

To summarize, this paper makes the following key contributions: (1) the design of TierLevelKV, a KV store built using the above three innovations, (2) a publicly available implementation of TierLevelKV, whose source code address is: (3) an evaluation of its benefits in comparison to existing KV stores.

Background

Motivation

While KV separation effectively mitigates the write and read amplification of the LSM-tree, we argue that it alone cannot achieve high performance because of several flaws in current designs. The reasons are described below.

First, garbage collection (GC) operations in current designs retain only the latest value for each key, so multi-versioning is not supported. LevelDB supports the multi-version concurrency control (MVCC) mechanism, which means LevelDB can process multiple concurrent read operations against different versions. However, existing KV separation designs mark only the latest version of every key as valid and reclaim the storage space of all other versions. As a result, KV stores based on existing KV separation designs do not support the MVCC mechanism.
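As a concrete illustration, the sketch below uses LevelDB's public snapshot API to show the versioned reads that a "keep only the latest value" GC policy would break; the database path is an arbitrary placeholder, not part of TierLevelKV.

```cpp
#include <cassert>
#include <string>
#include "leveldb/db.h"

// Minimal sketch of LevelDB's snapshot-based MVCC: a reader pinned to a
// snapshot keeps seeing the version that existed when the snapshot was taken,
// even after newer writes arrive. A KV separation GC that reclaims every
// non-latest value would invalidate such readers.
int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/mvcc_demo", &db);  // placeholder path
  assert(s.ok());

  db->Put(leveldb::WriteOptions(), "k", "v1");

  // Pin the current version.
  const leveldb::Snapshot* snap = db->GetSnapshot();

  // A newer version is written afterwards.
  db->Put(leveldb::WriteOptions(), "k", "v2");

  // A read through the snapshot still returns "v1".
  leveldb::ReadOptions ropts;
  ropts.snapshot = snap;
  std::string value;
  db->Get(ropts, "k", &value);   // value == "v1"

  db->ReleaseSnapshot(snap);
  delete db;
  return 0;
}
```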

Second, current KV separation designs require GC to scan vLogs to check the validity of KV pairs. For example, when WiscKey performs a GC operation, it reads a chunk of KV pairs from the vLog tail and queries the LSM-tree to see whether each KV pair is valid. It then discards the values of invalid KV pairs and writes the valid values back to the vLog head. Since the keys of the KV pairs may be scattered across the entire LSM-tree, the query overhead is high. As another example, during a GC operation HashKV scans the KV pairs in the value store and constructs a temporary in-memory hash table to check their validity. Moreover, HashKV has to scan all items in the selected vLog; otherwise KV pairs written back by GC could overwrite newer ones. As a result, the overhead of GC operations becomes substantial under large workloads.
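The sketch below outlines the vLog-scanning GC loop described above. The LsmTree and VLog interfaces are illustrative assumptions for the sketch, not WiscKey's actual code; the point is that every record scanned costs a point lookup in the LSM-tree.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative interfaces (assumptions for this sketch only).
struct VLogRecord {
  std::string key;
  std::string value;
  uint64_t address;            // offset of this record in the vLog
};

class LsmTree {
 public:
  virtual bool GetValueAddress(const std::string& key, uint64_t* addr) = 0;
  virtual void UpdateValueAddress(const std::string& key, uint64_t addr) = 0;
  virtual ~LsmTree() = default;
};

class VLog {
 public:
  virtual std::vector<VLogRecord> ReadFromTail(size_t bytes) = 0;
  virtual uint64_t AppendToHead(const std::string& key, const std::string& value) = 0;
  virtual void TruncateTail(size_t bytes) = 0;
  virtual ~VLog() = default;
};

// One GC pass over a chunk at the vLog tail: probe the LSM-tree for every
// record and rewrite only the still-valid values to the vLog head.
void GarbageCollectChunk(LsmTree& tree, VLog& vlog, size_t chunk_bytes) {
  for (const VLogRecord& rec : vlog.ReadFromTail(chunk_bytes)) {
    uint64_t current_addr;
    if (tree.GetValueAddress(rec.key, &current_addr) &&
        current_addr == rec.address) {             // still the live version
      uint64_t new_addr = vlog.AppendToHead(rec.key, rec.value);
      tree.UpdateValueAddress(rec.key, new_addr);  // re-point the key
    }
    // Records that fail the check are stale and are simply dropped.
  }
  vlog.TruncateTail(chunk_bytes);                  // reclaim the scanned space
}
```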

Third, multiple concurrent read operations may be in flight against different versions, while a GC operation retains only the latest version of every key. To keep read operations from fetching invalid addresses, a GC operation has to wait until all in-flight read operations finish; conversely, current designs must block read operations while GC is in progress. Therefore, current KV separation designs suffer from low read throughput.

Fourth, LevelDB buffers written KV pairs and flushes them to the device when the buffer fills up. The recently written KV pairs are stored at higher levels, and a key lookup searches from the higher levels down to the lower levels. If a KV separation design did not block write operations during GC, the latest value of a key could be overwritten by an older version written back by GC. To keep read operations from returning stale values, current designs must block write operations while GC is in progress. Therefore, current KV separation designs suffer from low write throughput.

Fifth, current KV separation designs store keys as well as values in vLogs in order to implement online GC. When WiscKey performs a GC operation, it queries the LSM-tree with the associated key to see whether each KV pair is valid. To check the validity of KV pairs during a GC operation, HashKV scans the KV pairs in the value store and constructs a temporary in-memory hash table indexed by keys. In both cases, keys must be stored in vLogs, which leads to additional I/O overhead.

TierLevelKV Design

TierLevelKV is a persistent KV store that minimizes I/O amplification. To realize a high-performance KV store atop KV separation, TierLevelKV builds on three critical ideas. First, TierLevelKV uses assemblies to organize SSTables and vLogs. Second, TierLevelKV proposes the horizontal compaction operation and uses it to perform GC. Third, TierLevelKV employs a unique crash consistency technique to efficiently manage SSTables, vLogs, and assemblies.

Assemblies

Definition of assemblies

TierLevelKV follows KV separation by storing only keys and metadata in the LSM-tree for indexing KV pairs, while storing values separately in vLogs. Figure # depicts the architecture of TierLevelKV. It organizes SSTables and vLogs into units called assemblies. Figure # shows the structure of an assembly: one assembly contains several SSTables and vLogs. To describe the property of assemblies, we first present three new primitives: Ref(sst, vlog), Lset(sst), and Tset(vlog).

We first define Ref(s, v), a function describing the relationship between a given SSTable s and vLog v: if SSTable s refers to vLog v, Ref(s, v) returns true; otherwise it returns false. Definition 1. Ref(s, v) := true, if SSTable s refers to vLog v; false, otherwise.

Let S be the set of all SSTables in the LSM-tree, and V be the set of all vLogs in the LSM-tree. We then formulate Lset(s), the set of vLogs that a given SSTable s refers to. Definition 2. Lset(s) := { v | v ∈ V ∧ Ref(s, v) }.

We further formulate Tset(v), the set of SSTables that refer to a given vLog v. Definition 3. Tset(v) := { s | s ∈ S ∧ Ref(s, v) }.

Let Sa be the set of SSTables in an arbitrary assembly a, and Va be the set of vLogs in the same assembly. We require that assembly a maintain the following property: Property 1. ⋃{ Lset(s) | s ∈ Sa } = Va ∧ ⋃{ Tset(v) | v ∈ Va } = Sa.

In other words, Property 1 states that every assembly in the LSM-tree contains all the vLogs that its SSTables refer to, and all the SSTables that refer to its vLogs.
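A minimal sketch of Definitions 1–3 and Property 1 in code is given below. The identifier type, the Ref callback, and the Assembly struct are assumptions made for illustration, not TierLevelKV's actual data structures.

```cpp
#include <set>
#include <string>

using Id = std::string;          // SSTable or vLog identifier (illustrative)

struct Assembly {
  std::set<Id> sstables;         // Sa
  std::set<Id> vlogs;            // Va
};

// Ref(s, v): true iff SSTable s refers to vLog v (supplied by the caller).
using RefFn = bool (*)(const Id& sst, const Id& vlog);

// Lset(s), restricted to the vLog set V under consideration.
std::set<Id> Lset(const Id& sst, const std::set<Id>& V, RefFn ref) {
  std::set<Id> out;
  for (const Id& v : V)
    if (ref(sst, v)) out.insert(v);
  return out;
}

// Tset(v), restricted to the SSTable set S under consideration.
std::set<Id> Tset(const Id& vlog, const std::set<Id>& S, RefFn ref) {
  std::set<Id> out;
  for (const Id& s : S)
    if (ref(s, vlog)) out.insert(s);
  return out;
}

// Property 1: the union of Lset over Sa equals Va, and the union of Tset
// over Va equals Sa; S and V are all SSTables / vLogs in the LSM-tree.
bool SatisfiesProperty1(const Assembly& a, const std::set<Id>& S,
                        const std::set<Id>& V, RefFn ref) {
  std::set<Id> referred_vlogs, referring_ssts;
  for (const Id& s : a.sstables) {
    std::set<Id> ls = Lset(s, V, ref);
    referred_vlogs.insert(ls.begin(), ls.end());
  }
  for (const Id& v : a.vlogs) {
    std::set<Id> ts = Tset(v, S, ref);
    referring_ssts.insert(ts.begin(), ts.end());
  }
  return referred_vlogs == a.vlogs && referring_ssts == a.sstables;
}
```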

Algorithm of generating assemblies

In this section, we first introduce the algorithm for generating assemblies. Then, we prove that, by following this algorithm, TierLevelKV keeps every assembly in the LSM-tree satisfying Property 1.

For inserts or updates of KV pairs, TierLevelKV first stores the new KV pairs in the memtable. When the memtable gets full, TierLevelKV flushes the keys along with the value addresses to disk at level L0 as an SSTable, and flushes the values to disk as a vLog. A new assembly is then formed from the new SSTable and the new vLog together, as sketched below.
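The following is a hedged sketch of that flush path under assumed, hypothetical types (TierLevelKV's real implementation extends LevelDB's flush and table-building code): values go to a fresh vLog, keys plus value addresses go to a fresh L0 SSTable, and the two are bound into one new assembly.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for the sketch; not TierLevelKV's actual classes.
struct ValueAddress { uint32_t vlog_id; uint64_t offset; uint32_t size; };

struct SSTableBuilderStub {                        // stands in for a table builder
  std::map<std::string, ValueAddress> index;       // key -> value address
  void Add(const std::string& key, const ValueAddress& addr) { index[key] = addr; }
};

struct VLogBuilderStub {
  uint32_t id;
  std::string data;                                // concatenated values; no keys stored
  uint64_t Append(const std::string& value) {
    uint64_t off = data.size();
    data.append(value);
    return off;
  }
};

struct AssemblyStub {
  std::vector<uint32_t> sstables;
  std::vector<uint32_t> vlogs;
};

// Flush a full memtable: the new SSTable refers only to the new vLog and the
// new vLog is referred to only by the new SSTable, so the freshly formed
// assembly satisfies Property 1 trivially.
AssemblyStub FlushMemtable(const std::map<std::string, std::string>& memtable,
                           uint32_t new_sst_id, uint32_t new_vlog_id) {
  SSTableBuilderStub sst;
  VLogBuilderStub vlog{new_vlog_id, ""};
  for (const auto& [key, value] : memtable) {
    uint64_t off = vlog.Append(value);
    sst.Add(key, {new_vlog_id, off, static_cast<uint32_t>(value.size())});
  }
  // ... persist sst and vlog to disk at level L0 (omitted in this sketch) ...
  return AssemblyStub{{new_sst_id}, {new_vlog_id}};
}
```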

Next, we prove that, by following the algorithm above, TierLevelKV keeps every assembly in the LSM-tree satisfying Property 1.

Horizontal compaction

Remove the vLog (if necessary) and the SSTable from the assembly.

Crash consistency

Assembly data structure metadata (vmetadata structure); log; recover.

vLog optimizations

vLog compression; vLog block cache; selective KV separation; no keys stored in vLogs.
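As a rough illustration of how block-level vLog compression and a vLog block cache could fit together on the read path, here is a hedged sketch; all names are illustrative assumptions, the cache has no eviction policy, and the value is assumed not to span block boundaries.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

constexpr uint64_t kVLogBlockSize = 4096;  // align with the device's minimum I/O unit

struct VLogBlockCache {
  // block number -> uncompressed block; eviction omitted in this sketch
  std::unordered_map<uint64_t, std::string> blocks;
  bool Lookup(uint64_t block_no, std::string* out) const {
    auto it = blocks.find(block_no);
    if (it == blocks.end()) return false;
    *out = it->second;
    return true;
  }
  void Insert(uint64_t block_no, std::string block) { blocks[block_no] = std::move(block); }
};

// Stand-ins for reading a compressed block from disk and decompressing it.
std::string ReadCompressedBlockFromDisk(uint64_t block_no);
std::string Decompress(const std::string& compressed);

// Read one value given its (offset, size) address inside the vLog. A hit in
// the block cache skips both the disk read and the decompression; the value
// is assumed to fit inside a single block here.
std::string ReadValue(VLogBlockCache& cache, uint64_t offset, uint32_t size) {
  uint64_t block_no = offset / kVLogBlockSize;
  std::string block;
  if (!cache.Lookup(block_no, &block)) {
    block = Decompress(ReadCompressedBlockFromDisk(block_no));
    cache.Insert(block_no, block);
  }
  return block.substr(offset % kVLogBlockSize, size);
}
```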

Range query

Implementation

Evaluation
