
Refactored restore to better use network resources #62

Open · wants to merge 17 commits into master
Conversation

vitrvvivs

- Staggers write requests to reduce the number of unprocessed items.
- Combines unprocessed items into new batches, so there are no more batches of only a few items (a sketch follows this list).
- Allows restoring from a local file, because S3 likes to close long-running connections.
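
The unprocessed-item handling could look roughly like the sketch below. This is a minimal illustration rather than the PR's exact code; the `sendBatch` and `requestItems` names mirror the description, while the table handling and error path are assumptions. It relies on the AWS SDK for JavaScript v2 `batchWriteItem` call, whose response lists `UnprocessedItems` that DynamoDB declined to write.

```js
// Minimal sketch: merge UnprocessedItems back into the queue so later batches stay full.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

const requestItems = [];            // shared queue of pending { PutRequest: { Item } } entries

function sendBatch(tableName) {
  const batch = requestItems.splice(0, 25);   // BatchWriteItem accepts at most 25 items
  if (batch.length === 0) return;

  dynamodb.batchWriteItem({ RequestItems: { [tableName]: batch } }, (err, data) => {
    if (err) {
      // on a hard error, put the whole batch back so nothing is lost
      requestItems.unshift(...batch);
      return;
    }
    const unprocessed = (data.UnprocessedItems || {})[tableName] || [];
    // instead of retrying a tiny leftover batch on its own,
    // re-queue the leftovers so the next batch is full again
    requestItems.unshift(...unprocessed);
  });
}
```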

Implementation
It now has two completely separate loops:

  1. readline (created in _startDownload), which parses each line and pushes it into an array (requestItems).
  2. _sendBatch (started in _checkTableReady), which pulls items from that array and sends them as batches.

This separation allows _sendBatch to call itself after a fixed amount of time has passed (every 1000 / concurrency milliseconds). The previous implementation allowed a fixed number of concurrent requests regardless of speed; on a fast network (a large EC2 instance), even 1 concurrent request was enough to sustain 2500 writes per second, so limiting concurrency was not an effective throttle. A sketch of this scheduling loop follows.
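
A minimal sketch of the two loops, assuming a newline-delimited JSON dump and a hypothetical writeBatch helper that performs the actual batchWriteItem call (and re-queues any UnprocessedItems as shown above):

```js
// Hypothetical sketch of the two loops described above.
const fs = require('fs');
const readline = require('readline');

const requestItems = [];
const concurrency = 4;              // assumed option; 4 => one batch every 250 ms

// Loop 1: parse the dump line by line and queue each item
// (assumes each line is one JSON-encoded DynamoDB item)
function startDownload(filePath) {
  const rl = readline.createInterface({ input: fs.createReadStream(filePath) });
  rl.on('line', (line) => {
    requestItems.push({ PutRequest: { Item: JSON.parse(line) } });
  });
}

// Loop 2: drain the queue at a fixed rate, independent of request latency
function sendBatch() {
  const batch = requestItems.splice(0, 25);
  if (batch.length > 0) {
    writeBatch(batch);              // hypothetical helper; leftovers get re-queued
  }
  // schedule the next batch by wall-clock time, not by when the request returns
  setTimeout(sendBatch, 1000 / concurrency);
}
```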

Matt Geskey added 17 commits September 26, 2017 10:09
S3 has a chance of randomly closing the connection before the download is finished, which makes restoring directly from large files impossible. This is a hack: download the file quickly first, then do the much slower restore from the local copy.
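
A rough sketch of that workaround, assuming the AWS SDK for JavaScript v2 and a hypothetical downloadToLocalFile helper name; the actual commit may structure this differently:

```js
// Hypothetical sketch: pull the whole dump to disk first, then restore from it.
const fs = require('fs');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

function downloadToLocalFile(bucket, key, localPath, callback) {
  const file = fs.createWriteStream(localPath);
  s3.getObject({ Bucket: bucket, Key: key })
    .createReadStream()
    .on('error', callback)          // e.g. connection closed by S3 mid-download
    .pipe(file)
    .on('finish', () => callback(null, localPath));
}
```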
Most of the time was spent in Node (CPU bound). Timing only how long the request took failed to account for that overhead, and thus throttled down to 20% of the target rate.