Failed to connect to database: Access denied for user 'mogilefs #1

We need to ensure we're sane when dealing with larger files requiring multiple reads.

@paths

This matches the buffer size used by replication, and showed a performance increase when timing the 100M large file test in t/40-httpfile.t.

With the following patch, I was able to note a ~46 -> ~27s time difference with both MD5 methods using this change to increase buffer sizes.

--- a/t/40-httpfile.t
+++ b/t/40-httpfile.t
@@ -125,5 +125,12 @@ $expect = $expect->digest;
 @paths = $mogc->get_paths("largefile");
 $file = MogileFS::HTTPFile->at($paths[0]);
 ok($size == $file->size, "big file size match $size");
+use Time::HiRes qw/tv_interval gettimeofday/;
+
+my $t0;
+$t0 = [gettimeofday];
 ok($file->md5_mgmt(sub {}) eq $expect, "md5_mgmt on big file");
+print "mgmt ", tv_interval($t0), "\n";
+$t0 = [gettimeofday];
 ok($file->md5_http(sub {}) eq $expect, "md5_http on big file");
+print "http ", tv_interval($t0), "\n";

Base64 requires further escaping for our tracker protocol which gets ugly and confusing. It's also easier to interact/verify with existing command-line tools using hex.

We need a place to store mappings for various checksum types we'll support.

This is needed to wire up checksums to classes.

Digest::MD5 and Digest::SHA1 both support the same API for streaming data for the calculation, so we can validate our content as we stream it.
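A minimal sketch of the shared streaming API, assuming the data arrives in chunks. It uses the core Digest::SHA module (with algorithm 1) in place of Digest::SHA1, since both expose the same add/hexdigest methods; the helper name stream_hexdigest is made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5;
use Digest::SHA;

# Feed data to a digest context chunk by chunk, as it would arrive from
# a socket or file, and return the hex digest once the stream ends.
# (stream_hexdigest is a hypothetical helper, not part of MogileFS.)
sub stream_hexdigest {
    my ($ctx, @chunks) = @_;
    $ctx->add($_) for @chunks;
    return $ctx->hexdigest;
}

my @chunks = ("hello ", "world");
my $md5  = stream_hexdigest(Digest::MD5->new,    @chunks);
my $sha1 = stream_hexdigest(Digest::SHA->new(1), @chunks);
print "MD5:$md5\nSHA-1:$sha1\n";
```

Because the context object is the only thing that differs, the caller can pick the algorithm once and reuse the same read loop.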

Checksum usage will be decided on a per-class basis.

This branch is now rebased against my latest clear_cache which allows much faster metadata updates for testing.

Helps me keep my head straight.

This can come in handy.

We'll use the "Digest" class in Perl as a guide for this. Only MD5 is officially supported. However, this *should* support SHA-(1|256|384|512) and it's easy to add more algorithms.

We can now:

* enable checksums for classes
* save client-provided checksums to the database
* verify them on create_close
* read them in file_info

We need to be able to both enable and disable checksumming for a class.

This returns undef if a checksum is missing for a class, and a MogileFS::Checksum object if it exists.

replication now lazily generates checksums if they're not provided by the client (but required by the storage class).

replication may also verify checksums if they're available in the database.

replication now sets the Content-MD5 header on PUT requests, in case the remote server is capable of rejecting corrupt transfers based on it.

replication attempts to verify the checksum of the freshly PUT-ed file.

TODO: monitor will attempt "test-write" with mangled Content-MD5 to determine if storage backends are Content-MD5-capable so replication can avoid reading checksum on destination

This functionality (and a server capable of rejecting bad MD5s) will allow us to skip an expensive MogileFS::HTTPFile->digest request at replication time.

Also testing with the following patch to Perlbal:

--- a/lib/mogdeps/Perlbal/ClientHTTP.pm
+++ b/lib/mogdeps/Perlbal/ClientHTTP.pm
@@ -22,6 +22,7 @@ use fields ('put_in_progress', # 1 when we're currently waiting for an async job
             'content_length', # length of document being transferred
             'content_length_remain', # bytes remaining to be read
             'chunked_upload_state', # bool/obj: if processing a chunked upload, Perlbal::ChunkedUploadState object, else undef
+            'md5_ctx',
             );
 
 use HTTP::Date ();
@@ -29,6 +30,7 @@ use File::Path;
 
 use Errno qw( EPIPE );
 use POSIX qw( O_CREAT O_TRUNC O_WRONLY O_RDONLY ENOENT );
+use Digest::MD5;
 
 # class list of directories we know exist
 our (%VerifiedDirs);
@@ -61,6 +63,7 @@ sub init {
     $self->{put_fh} = undef;
     $self->{put_pos} = 0;
     $self->{chunked_upload_state} = undef;
+    $self->{md5_ctx} = undef;
 }
 
 sub close {
@@ -134,6 +137,8 @@ sub handle_put {
 
     return $self->send_response(403) unless $self->{service}->{enable_put};
 
+    $self->{md5_ctx} = $hd->header('Content-MD5') ? Digest::MD5->new : undef;
+
     return if $self->handle_put_chunked;
 
     # they want to put something, so let's setup and wait for more reads
@@ -421,6 +426,8 @@ sub put_writeout {
 
     my $data = join("", map { $$_ } @{$self->{read_buf}});
     my $count = length $data;
+    my $md5_ctx = $self->{md5_ctx};
+    $md5_ctx->add($data) if $md5_ctx;
 
     # reset our input buffer
     $self->{read_buf} = [];
@@ -460,6 +467,17 @@ sub put_close {
 
     if (CORE::close($self->{put_fh})) {
         $self->{put_fh} = undef;
+
+        my $md5_ctx = $self->{md5_ctx};
+        if ($md5_ctx) {
+            my $actual = $md5_ctx->b64digest;
+            my $expect = $self->{req_headers}->header("Content-MD5");
+            $expect =~ s/=+\s*\z//;
+            if ($actual ne $expect) {
+                return $self->send_response(400,
+                    "Content-MD5 mismatch, expected: $expect actual: $actual");
+            }
+        }
         return $self->send_response(200);
     } else {
         return $self->system_error("Error saving file", "error in close: $!");
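For reference, a small standalone sketch of the Content-MD5 handling the patch relies on. The helper names content_md5_header and content_md5_ok are hypothetical; the only library facts assumed are that Digest::MD5's b64digest omits the trailing "=" padding and that add() returns the context object.

```perl
use strict;
use warnings;
use Digest::MD5;

# Content-MD5 carries the base64 MD5 of the body *with* "=" padding.
# b64digest omits the padding, so a 16-byte MD5 needs "==" appended.
# (Both helpers below are hypothetical, for illustration only.)
sub content_md5_header {
    my ($body) = @_;
    return Digest::MD5->new->add($body)->b64digest . "==";
}

# Compare a received header against the body, ignoring padding and
# trailing whitespace (the same normalization the Perlbal patch applies).
sub content_md5_ok {
    my ($header, $body) = @_;
    (my $expect = $header) =~ s/=+\s*\z//;
    return $expect eq Digest::MD5->new->add($body)->b64digest;
}

my $body   = "hello world";
my $header = content_md5_header($body);
print "Content-MD5: $header\n";
print content_md5_ok($header, $body)       ? "accept\n" : "reject\n";
print content_md5_ok($header, "corrupted") ? "accept\n" : "reject\n";
```

Stripping the padding on the expected side, rather than adding it on the actual side, tolerates clients that send trailing whitespace after the header value.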

Rereading a large file is expensive. If we can monitor and observe our storage nodes for MD5 rejectionability, we can rely on that instead of having to have anybody reread the entire file to calculate its MD5.

Only the fsck part remains to be implemented... And I've never studied/used fsck much :x

Stale rows are bad.

TODO: see if we can use LWP to avoid mistakes like this :x

Fsck behavior is based on existing behavior for size mismatches. Size failures take precedence, since it's much cheaper to verify size match/mismatches than checksum mismatches. Checksum calculations are expensive and fsck is already parallel, so we do not parallelize checksum calculations on a per-FID basis.

It reads more easily this way, at least to me.

I'll be testing checksum functionality on my home installation before testing it on other installations, and I run SQLite at home. ref: http://www.sqlite.org/lang_altertable.html

We need to ensure the worker stays alive during MD5 generation, especially on large files that can take many seconds to verify.

This special-cases "NONE" for no hash for our users.

We don't actually use the BLOB type anywhere, as checksums are definitely not "L"(arge) objects.

The timeout comparison is wrong and causing ping_cb to never fire. This went unnoticed since I have reasonably fast disks on my storage nodes and the <$sock> operation was able to complete before being hit by a watchdog timeout.

Enabling this setting allows fsck to checksum all replicas on all devices and report any corrupted copies regardless of per-class settings. This feature is useful for determining if enabling checksums on certain classes is necessary and will also benefit users who cannot afford to store checksums in the database.

MD5 is faster than SHA1, and much faster than any of the SHA2 variants. Given the time penalty of fsck is already high with MD5, prevent folks from shooting themselves in the foot with extremely expensive hash algorithms.

Unlike the setting it replaces, this new setting can be used to disable checksumming entirely, regardless of per-class options.

fsck_checksum=(class|off|MD5)

class - is the default, fsck based on per-class hashtype
off - skip all checksumming regardless of per-class setting
MD5 - same as the previous fsck_auto_checksum=MD5
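The three values can be read as a small decision function. This is only a sketch of the semantics described above, not the fsck worker's actual code; fsck_hash_for and its arguments are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which hash algorithm fsck should use for a
# FID, given the fsck_checksum server setting and the class hashtype.
# Returns undef when no checksumming should happen.
sub fsck_hash_for {
    my ($setting, $class_hashtype) = @_;
    $setting = "class" unless defined $setting;   # "class" is the default

    return if $setting eq "off";                  # skip regardless of class
    return "MD5" if $setting eq "MD5";            # checksum everything with MD5
    return if !defined $class_hashtype || $class_hashtype eq "NONE";
    return $class_hashtype;                       # per-class behavior
}

for my $case (["class", "MD5"], ["class", undef], ["off", "MD5"], ["MD5", undef]) {
    my ($setting, $hashtype) = @$case;
    my $alg = fsck_hash_for($setting, $hashtype);
    printf "%-5s class=%-4s -> %s\n", $setting, $hashtype // "NONE", $alg // "(skip)";
}
```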

MD5 is I/O-intensive, and having fsck request MD5s concurrently ends up causing I/O contention on rotational drives with high seek latency. So limit fsck MD5 requests to a single job per device.

The digest path relies on having a known file size to calculate the MD5 timeout, so save an HTTP HEAD request since we always check file sizes in fsck before we checksum the file.

Optimized SHA-1 implementations aren't significantly slower than MD5 and some folks (e.g. Tomas Doran) may already have SHA-1 in place for their data.

A liberally licensed, GPL-compatible collection of SHA-1 primitives is available from one of the OpenSSL developers: http://www.openssl.org/~appro/cryptogams/

It would be nice to allow the Perl Digest module to transparently take advantage of architecture-specific optimizations.

Note there is no standardized equivalent to the HTTP Content-MD5 header/trailer for any of the SHA variants, so verification for replication/uploads may take significantly longer.

Requested-by: Tomas Doran

$NAME is potentially ambiguous and $HASHTYPE matches the database column name.

Fsck would print "Status: N / 0 " if it had never been started before. It now finds the max(fid) on its own internally.

If fsck_checksum was set to off, fsck would ignore the checksums deep in the code, but would still attempt to "fix" the FIDs each time, which runs far more code and UPDATEs each FID's devcount even if you tell it not to. Now it does what it should. However, fsck with checksums enabled will still UPDATE devcount on each check.

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index 770e518..5b59d7f 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,12 @@
+2012-03-30: Release version 2.60
+
+    * Fix fsck status when running for the first time (dormando <[email protected]>)
+
+    * Checksum support (Major update!) (Eric Wong <[email protected]>)
+      See doc/checksums.txt for an overview of how the new checksum system
+      works. Also keep an eye on the wiki (http://www.mogilefs.org) for more
+      complete documentation in the coming weeks.
+
 2012-02-29: Release version 2.59
 
     * don't make SQLite error out on lock calls (dormando <[email protected]>)

Buffering log output in memory makes it difficult to view debug and error output during development. Since MogileFS does not write to stdout frequently, there should be no noticeable performance loss from this change. This also prevents mangling of TAP output which caused test failures if DEBUG=1 is set.
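A minimal sketch of enabling autoflush on stdout. Which form the change itself uses is not stated here; IO::Handle's autoflush method is one common way, and setting `$| = 1` while STDOUT is selected is equivalent.

```perl
use strict;
use warnings;
use IO::Handle;

# Disable buffering so each log line reaches the terminal (or the TAP
# harness) as soon as it is printed, instead of when the buffer fills.
STDOUT->autoflush(1);

print "debug: worker started\n";
```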

I noticed that attempting to delete a domain with classes returns an unhelpful "Operation failed" error message.

There's no need to broadcast changes to other workers if there are no changes. Since HTTP servers rarely (if ever) change their ability to toggle Content-MD5 rejection, this was causing needless wakeups in every monitor round.

Tested by running mogilefsd with DEBUG=1 and toggling Content-MD5 rejection in mogstored + perlbal 1.80 via:

  SET mogstored.enable_md5 = (0|1)

to the mgmt port while watching syslog output.

Noticed-by: dormando

Before we make changes to the fsck code, we should ensure we don't break existing use cases. Behavior I'm uncertain about is documented with "XXX".

The LWP::UserAgent module found in my Debian 6.0 installation does not have a "delete" convenience wrapper.

No point in using DBI directly if a task can be done directly via the MogileFS::Store API. Noticed-by: dormando

This was leftover when I was monitoring the test with DEBUG=1 :x

* ensure fsck can handle a stall from an unresponsive mogstored
* ensure over-replicated files are cleaned up
* ensure fsck can work correctly with dead devices if it beats reaper to an FID

A BCNT error is more descriptive than a generic POVI entry and more accurately reflects the change made to an FID entry. This also removes a dependency from /using/ the devcount column and simplifies the code. The devcount column remains invaluable and informative to users, but MogileFS should not trust it for making decisions when it has access to the file_on table.

We need to ensure fsck can resume and returns sane stats output.

Hopefully this can eliminate some random test failures.

After resetting, fsck_highest_fid_checked ends up at zero.

We checked the incorrect return value, so the second mogstored failing would've gone unnoticed.

These race conditions were causing this test to fail occasionally. These test failures were more common on SQLite and Postgres, but not unheard of when using MySQL. Some of these race conditions were due to fsck/job_master not receiving settings changes in time, so we now resort to killing existing processes and forcing them to reload.

Whenever an FID is unfixable, be sure to update devcount (to zero) to easily inform the user via mogstats. If the FID magically reappears in the future, the desperate fallback mode will still find the file.

fix_fid(): we no longer blindly update devcount on every call. This is important because we call fix_fid() on checksum checks regardless, and devcount updates entail unnecessary updates to the `file' table. While we're at it, consolidate the places where we check the skip_devcount flag and log bad devcounts.

Instead of blindly sleeping, we can '!watch' through the tracker and detect the error message fsck sends.

This is a simpler implementation and lets us be notified of worker death (and pending replacement) as soon as the tracker notices it.

FIDs may be created while fsck is running, causing "mogadm fsck status" to report completion above 100% (and thus confusing users). Stopping fsck when it reaches fsck_fid_at_end (set to MAX(fid) at fsck startup) prevents this. This change should also have a pleasant side effect of reducing contention with replicate workers on newly-uploaded FIDs. ref: http://code.google.com/p/mogilefs/issues/detail?id=50

MogileFS::Server and Perlbal both ignore SIGPIPE for us. So there's no need to ever ignore it for socket writes in HTTPFile, either.

Use the replicate lock here to prevent a DevFID from being orphaned by a replicate process. This prevents orphaned requests if a user issues a delete request on a file while a replication worker is copying it.

This avoids an uninitialized value warning from Perl when choosing a value for the deprecated listen_jobs value. Neither the child nor the parent processes are capable of handling undef values from :set_config_from_*.

Most of this was already nuked in the following commits: ebf8a5a 3db8a84

Unused since commit 0be2f97 when the old drain/rebalance code was dropped.

Based on my reading of the code, the conditional assignments of $hostip, $get_port, $devid, $timeout, and $url are needless.

Keeping track of explicit test counts causes needless merge conflicts. done_testing() is sufficient to note test completeness and detect failures.

may annoy centos5 users.

Unused since commit 6c23c9d ("make fsck worker distributed"). Since this always seemed fsck-specific, it's also unlikely plugins are using this.

Queries are computed in parallel and therefore replies may arrive in a different order than the queries. This patch adds an optional callid parameter so a caller can match the replies back to the queries. ERR lines return the callid as the 3rd parameter. If the callid parameter is missing then the protocol is the same as before.

Example:

GET_PATHS callid=1
ERR no_domain No+domain+provided 1
GET_DOMAINS callid=1
OK domains=0&callid=1
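A client-side sketch of matching replies back to queries using the two reply shapes shown in the example. callid_of and the %pending table are hypothetical names; the sketch assumes the callid value needs no URL-decoding (for example, a small integer).

```perl
use strict;
use warnings;

# Extract the callid from a tracker response line (hypothetical helper).
# OK lines carry it as a URL-encoded argument; ERR lines carry it as the
# third space-separated parameter. Returns undef if no callid is present.
sub callid_of {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    if ($line =~ /^OK\s+(.*)\z/) {
        for my $pair (split /&/, $1) {
            my ($k, $v) = split /=/, $pair, 2;
            next unless defined $k;
            return $v if $k eq "callid";
        }
        return;
    }
    if ($line =~ /^ERR\s+\S+\s+\S+\s+(\S+)\z/) {
        return $1;
    }
    return;
}

# Replies may come back in any order; look each one up by its callid.
my %pending = (1 => "GET_PATHS", 2 => "GET_DOMAINS");
for my $reply ("OK domains=0&callid=2\r\n", "ERR no_domain No+domain+provided 1\r\n") {
    my $id = callid_of($reply);
    print "reply to $pending{$id} (callid=$id)\n";
}
```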

This is adapted from the Postgres implementation, but since SQLite runs on one machine, we can use kill 0 to detect if a process got nuked before it could release a lock.
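A minimal sketch of the kill 0 liveness check, assuming the lock row records the holder's PID. pid_alive is a hypothetical helper name, and treating EPERM as "alive" is a choice made for this example.

```perl
use strict;
use warnings;

# Return true if a process with the given PID exists on this machine.
# Signal 0 performs the existence and permission checks without
# delivering a signal. EPERM means the process exists but belongs to
# another user, so it still counts as alive here.
sub pid_alive {
    my ($pid) = @_;
    return 1 if kill(0, $pid);
    return $!{EPERM} ? 1 : 0;
}

print "self: ", (pid_alive($$) ? "alive" : "gone"), "\n";
```

A lock whose recorded holder is no longer alive can then be considered stale. PID reuse can still produce a false "alive", so this is a heuristic rather than a guarantee.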

Since all Store implementations implement get_lock/release_lock, we can safely share the same implementation of both should_begin_replication_fidid() and note_done_replicating().

For rare SQLite setups, drop locks after 3600s regardless of the hostname of the lock holder. This can work around weird setups that change hostnames (frequently) or share SQLite DBs over NFS.

The devcount of a newly uploaded file is always 1, so we do not need another set of trips to the DB to set this in the file row.

Specifying "alivetypo" as the host status would cause mogilefs to implode.

This possible fix could also be hiding another bug, but the original test ordering was suspect...

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index 5b59d7f..de3ba9b 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,33 @@
+2012-05-18: Release version 2.61
+
+    * fix issue #57 by Pyry and Eric (dormando <[email protected]>)
+      (mogadm host status sometimes allowed typos)
+
+    * avoid unnecessary devcount update in create_close (Eric Wong <[email protected]>)
+
+    * sqlite: implement locking via tables (Eric Wong <[email protected]>)
+
+    * worker/query: Add optional callid parameter (Gernot Vormayr <[email protected]>)
+      (allows command pipelining)
+
+    * delete: prevent orphan files from replicator race (Eric Wong <[email protected]>)
+
+    * fsck: prevent running over 100% completion (Eric Wong <[email protected]>)
+
+    * fsck: cleanup and reduce unnecessary devcount updates (Eric Wong <[email protected]>)
+
+    * fsck: update devcount, forget devs on unfixable FIDs (Eric Wong <[email protected]>)
+
+    * fsck: log bad count correctly instead of policy violation (Eric Wong <[email protected]>)
+
+    * tests: add test for fsck functionality (Eric Wong <[email protected]>)
+
+    * monitor: only broadcast reject_bad_md5 on change (Eric Wong <[email protected]>)
+
+    * worker: delete_domain returns has_classes error (Eric Wong <[email protected]>)
+
+    * log: enable autoflush for stdout logging (Eric Wong <[email protected]>)
+
 2012-03-30: Release version 2.60
 
     * Fix fsck status when running for the first time (dormando <[email protected]>)

Schema upgrade needs to use Pg-specific column types for the v15 upgrade adding class.hashtype. Only CREATE TABLE is auto-converted where possible, not ALTER TABLE.

Signed-off-by: Robin H. Johnson <[email protected]>

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index b526b6a..6333f9f 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,9 @@
+2012-05-19: Release version 2.62
+
+    * Critical bugfix for a compilation error (dormando, reported by Robin)
+
+    * Fix for upgrading a Postgres install for checksums (Robin * H. Johnson <[email protected]>)
+
 2012-05-18: Release version 2.61
 
     * fix issue #57 by Pyry and Eric (dormando <[email protected]>)

commit f54e6fd botched the ordering of parameters when updating the file table.

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index 6333f9f..07790f3 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,8 @@
+2012-05-29: Release version 2.63
+
+    * Critical bugfix for Postgres users introduced by 2.61. New file uploads
+      would fail. (noticed by robin H. Johnson, fixed by Eric Wong)
+
 2012-05-19: Release version 2.62
 
     * Critical bugfix for a compilation error (dormando, reported by Robin)

Hosts don't define readability/writability themselves, so the Host::should_get_new_files sub is renamed to "alive" and Device->can_read_from respects host status. Also, queryworker now skips down/dead hosts in cmd_get_paths. ref: http://code.google.com/p/mogilefs/issues/detail?id=46

The observed unreachable state of the host should always supersede the observed state of the device. This is already the case with observed_writeable, but not with observed_readable nor observed_unreachable. The monitor worker does not (and should not, to save bandwidth) update states of all devices when a host goes down.

should_read_from() should replace all uses of can_read_from() in non-Monitor workers. This avoids the overhead of needlessly rechecking devices either the monitor or user marked down. This simplifies queryworker logic a bit.

Avoid needlessly attempting connections for checking files on host/devices the monitor (or user) marked as unreadable. This also makes the Fsck->size_on_disk function redundant.

URLs pointing to devices set to drain are undesirable. Files may disappear off draining devices immediately after we've queried the file_on table and invalidate the paths the client sends us.

This is mainly to prevent bugs like the fix in commit ac5534a from popping up again.

This subroutine has been unused since MogileFS 2.52 commit 18a40d2 ("Throw out old HTTPFile->size code and use LWP")

We cannot break existing case-insensitive behavior for list_keys right now, even if it's broken. This means SQLite/MySQL will use case-insensitive LIKE statements for list_keys and Postgres remains case-sensitive.

Enabling this boolean will make the "after" and "prefix" params of "list_keys" behave case-sensitively. If this setting is /not/ enabled, clients will hit after_mismatch errors when iterating through keys if they are using an uppercase "prefix" argument and a subsequent list_keys is called with an "after" that only matches case-insensitively. If unset, this defaults to false (0) to match historical (buggy) behavior. Historical behavior is preserved (even if broken) as users with small namespaces may rely on case-insensitive matching. Postgres users are not affected by this change, as the LIKE operator in Postgres is always case-sensitive. This change is tested on all three databases: Postgres, MySQL, and SQLite.

Device state reporting functions should respect whatever the underlying Host state is.

Because the memcache TTL is now user-configurable, data in memcached might be valid for a long time, and as such invalid paths might be returned. It would be possible to repopulate memcache instead of just removing entries, but that might be wasteful: when a device is marked as dead, the replicated fids might not need to be in memcached. There is still one TODO left: if someone modifies mindevcount and runs fsck, the mappings might become incorrect, but I reasoned that this would be rather rare.

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index 07790f3..c552089 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,15 @@
+2012-06-21: Release version 2.64
+
+    * Delete memcache data when we replicate fids (Pyry Hakulinen <[email protected]>)
+
+    * implement "case_sensitive_list_keys" server setting (Eric Wong <[email protected]>)
+
+    * get_paths: deprioritize devs in "drain" state (Eric Wong <[email protected]>)
+
+    * make marking a host down cause devices to act as down (Eric Wong <[email protected]>)
+
+    * monitor skips hosts marked dead or down (Eric Wong <[email protected]>)
+
 2012-05-29: Release version 2.63
 
     * Critical bugfix for Postgres users introduced by 2.61. New file uploads

This makes it easier to test mock or alternative iostat implementations. This can be used for emulating the iostat output on other platforms.

The parser now looks for contiguous lines of statistics and (if it's previously captured stats) emits whenever the first non-stats line appears. Relying on the "Device:" line is not portable to FreeBSD (and possibly other iostat implementations).

The parser also ignores leading/trailing whitespace on each statistics line.

Tested on Linux (sysstat 10.0.5) and FreeBSD 9.

For testing iostat output on FreeBSD, I used MOG_IOSTAT_CMD like this on my GNU/Linux system:

  MOG_IOSTAT_CMD="ssh fbsd9vm iostat -dx 1 30" mogstored ...

ref: http://code.google.com/p/mogilefs/issues/detail?id=9
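The grouping strategy can be sketched roughly as follows. This is not the mogstored parser: the name parse_iostat, the regex for what counts as a "statistics line" (a device name starting with a letter, followed only by numeric columns), and the flush at end of input are all assumptions made for this example.

```perl
use strict;
use warnings;

# Group contiguous statistics lines into batches and hand each batch to
# a callback when the first non-stats line appears.
sub parse_iostat {
    my ($lines, $emit) = @_;
    my %stats;
    for my $line (@$lines) {
        # ignore leading/trailing whitespace on each line
        (my $trimmed = $line) =~ s/^\s+|\s+$//g;
        if ($trimmed =~ /^([A-Za-z]\S*)((?:\s+\d+(?:\.\d+)?)+)$/) {
            my ($dev, $rest) = ($1, $2);
            $stats{$dev} = [ split ' ', $rest ];
        } elsif (%stats) {
            # first non-stats line after a run of stats: flush the batch
            my %batch = %stats;
            $emit->(\%batch);
            %stats = ();
        }
    }
    if (%stats) {                 # flush a trailing batch at end of input
        my %batch = %stats;
        $emit->(\%batch);
    }
    return;
}

my @batches;
parse_iostat([
    "Device:  r/s  w/s  %util",
    "  sda   1.00 2.00 3.50  ",
    "sdb 0.00 0.00 0.00",
    "",
    "device r/s w/s",
    "ada0 4 5",
], sub { push @batches, $_[0] });
print scalar(@batches), " batches\n";
```

Header lines such as "Device:" or "device" fail the numeric-columns test, so neither header style needs special handling.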

This hasn't been used since the old rebalance code was nuked for 2.40 (commit 0be2f97)

mogstored gains a --skipconfig switch which we use in tests to ignore the default config file. mogilefsd has had this switch (with identical semantics) since 2004.

Reaper isn't tested anywhere else. We plan on changing it slightly so ensure we don't introduce regressions.

A subroutine with a 20-line comment deserves to be its own sub. This will make it easier to see what the future reaper lock will guard without needing to scroll on small terminals.

We don't want multiple reaper processes stepping on each other during UPDATE/INSERT, causing needless conflicts/failures at the DB level for every single FID. JobMaster already locks its queues in a similar way to prevent conflicts, so this should not noticeably harm performance (and may improve performance due to the DB conflict reduction).

This controls the number of FIDs the reaper can inject into the replication queue for each dead device, per wakeup. This defaults to 1000, the same value it's had since (at least) 2006.

This will make it easier to reuse this constant in other workers that can check the queues (e.g. job_master/reaper).

Users may now configure the queue_size_for_reaper server setting to limit the size of the non-urgent replication queue. The urgent replication queue (nexttry == 0) is unaffected, as are other processes (fsck) which may inject into the replication queue. The default remains unlimited: the reaper will queue as fast as it possibly can, 1000 FIDs every 5 seconds (per process).

Reaper now schedules the first batch of files on a newly-dead device for replication after delaying (on the reaper itself) for DEVICE_SUMMARY_CACHE_TIMEOUT+1 (16) seconds. This allows all subsequent files to replicate sooner, without the 16s delay. Since Danga::Socket is used to schedule this 16s delay, reaping of other dead devices won't be impacted by this delay. The DEVICE_SUMMARY_CACHE_TIMEOUT+1 delay still remains to offer a small level of protection against replicators with out-of-date internal device caches and writable-but-"dead" devices. As an additional countermeasure against out-of-date device caches, reapers will now slowly back off of a device over the course of 4 hours after it fails to find new, unreaped FIDs. Previously, any files that got replicated to an already "dead" device would remain there until a reaper restart.

Since reaper now schedules replications with the same priority as fsck, we will also rely on the replicator to call update_devcount for us, allowing us to avoid making an unnecessary write to the database.

The delay backoff should only occur if we got a successful lock. Backing off the delay on lock failure can result in the delay becoming undef and (incorrectly) making the reaper give up on a device. Fortunately, lock failures with the extremely long (60s) lock timeout are unlikely to be a problem in practice.

Update Pg locking code to use Postgres advisory locks. Now requires Postgres 8.4 as min version.

Signed-off-by: Robin H. Johnson <[email protected]>

Signed-off-by: Robin H. Johnson <[email protected]>

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index c552089..64455f4 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,28 @@
+2012-08-13: Release version 2.65
+
+    * Postgres advisory lock instead of table-based lock (Robin H. Johnson <[email protected]>)
+      Now requires minimum Postgres version of pg8.4.
+
+    * reaper: switch to Danga::Socket for scheduling (Eric Wong <[email protected]>)
+
+    * reaper: add queue_size_for_reaper server setting (Eric Wong <[email protected]>)
+
+    * reaper: add "queue_rate_for_reaper" server setting (Eric Wong <[email protected]>)
+      defaults to 1000, same as previously.
+
+    * reaper: global lock around DB interaction (Eric Wong <[email protected]>)
+      prevents reapers clobbering each other, causing a reduction in DB writes.
+
+    * tests: add basic test for reaper (Eric Wong <[email protected]>)
+
+    * fix tests when /etc/mogilefs/mogstored.conf exists (Eric Wong <[email protected]>)
+
+    * iostat: increase flexibility of iostat parser (Eric Wong <[email protected]>)
+
+    * iostat: allow MOG_IOSTAT_CMD env override (Eric Wong <[email protected]>)
+
+    * When a mogstored is down, die with a more informative message. (Dave Lambley <[email protected]>)
+
 2012-06-21: Release version 2.64
 
     * Delete memcache data when we replicate fids (Pyry Hakulinen <[email protected]>)

…inx backend less architecture dependent

…'s recommendation

tcp_nodelay defaults to on, so there is no need to specify it

remove unnecessary error_page config, there is no need for pretty error pages

…or each configured location

…th other running copies. Thanks Gernot for the heads up

…he variable

this attempts to prevent conflicts with other running instances of nginx

We were updating the devcount field even when skip_devcount was true. We should not use $sto here because we already have an FID object and a nice method available for this.

Signed-off-by: Eric Wong <[email protected]>

The caller expects an array ref; currently, using use_dest_devs will kill JobMaster with:

Oct 18 09:45:25 storage22 mogilefsd[23263]: crash log: rebalance cannot find suitable destination devices at /usr/local/share/perl/5.10.1/MogileFS/Worker/JobMaster.pm line 233
Oct 18 09:45:26 storage22 mogilefsd[22044]: Child 23263 (job_master) died: 256 (UNEXPECTED)

Signed-off-by: Eric Wong <[email protected]>

On certain errors, the queryworker may send two "ERR" responses, causing the ProcManager to terminate the queryworker upon reading the second response if the queryworker is immediately fed another query.

This can affect busy setups, but is also easy to reproduce with a single queryworker that's receiving a pipelined request to an invalid/non-existent domain:

  (
    printf 'list_keys domain=\r\nlist_keys domain=\r\n'
    sleep 2
  ) | socat - TCP:127.0.0.1:7001

The queryworker strace will look like this (writing 4 lines):

  write(14, "4981-1 0.0005 ERR no_domain No+domain+provided\r\n", 48) = 48
  write(14, "4981-1 ERR domain_not_found Domain+not+found\r\n", 46) = 46
  write(14, "4981-2 0.0005 ERR no_domain No+domain+provided\r\n", 48) = 48
  write(14, "4981-2 ERR domain_not_found Domain+not+found\r\n", 46) = 46

And a message like this will appear for "!watch" users:

  Worker responded with id <undef> (line: [4981-1 ERR domain_not_found Domain+not+found]), but expected id 4981-2, killing

This is because ProcManager immediately calls NoteIdleQueryWorker upon writing the first ERR response to the client (at the end of HandleQueryWorkerResponse). This means the idle query worker may immediately start processing a second request before the ProcManager has a chance to process the second ERR response line (from the first request).

Preventing err_line() from calling send_to_parent() with "ERR" if querystarttime is undef prevents this issue, but there may be better ways to fix this bug. A similar, preventative fix may be appropriate for ok_line().

This saves us from reinventing it in every test and will help us detect stuck tests more easily.

I've had this test get stuck intermittently in different places; this should make it easier to track down stuck tests.

Workers do not receive nor respond to messages as soon as the ProcManager dispatches the request to stop/start them, so wait until ProcManager no longer knows about a process before proceeding.

$class->{hashtype} is undef by default for classes where no checksums are configured. Since clients can force checksum verification in create_close regardless of class, we can end up with a Checksum object for FIDs regardless of which class the FID is in.

delete, file_info, get_paths, rename, file_debug, updateclass were all broken when handling a key named "0".
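This class of bug usually comes from testing a key for truth, since the string "0" is false in Perl. The sketch below contrasts the two checks; has_key_buggy and has_key are hypothetical names, and the actual validation code in the query worker may look different.

```perl
use strict;
use warnings;

# Buggy: the string "0" is false in Perl, so a key named "0" is rejected.
sub has_key_buggy {
    my ($args) = @_;
    return $args->{key} ? 1 : 0;
}

# Fixed: accept any defined, non-empty string, including "0".
sub has_key {
    my ($args) = @_;
    my $key = $args->{key};
    my $ok  = defined($key) && length($key);
    return $ok ? 1 : 0;
}

for my $key ("foo", "0", "") {
    printf "key=%-5s buggy=%d fixed=%d\n",
        "'$key'", has_key_buggy({ key => $key }), has_key({ key => $key });
}
```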

We need to ensure neither replicate (nor delete) are changing the devids list when fixing an FID. This should ensure we're safely modifying the devid list for a given FID when forgetting about bad ones.

We should not waste time stat()-ing FIDs that no longer exist at all.

In replicate, we validate via existing FID checksum regardless of class. This failed when the class.hashtype was altered after uploading but before replication.

The following sequence of events caused replication to fail:

1. class.hashtype = MD5
2. FID created and stores MD5:...
3. FID enqueued for replication
4. class.hashtype = NONE
5. FID begins replicating

Replication (Step 5) failed since the existing MD5 digest is trying to validate against a (now) non-existent class digest.

Since we stored the checksum in the database anyways, calculate and validate anyways as an admin could've only wanted to alter a class temporarily, not permanently. An admin may also decide to switch checksum algorithms.

Fsck now logs hash algorithm mismatches as "BALG" and emits a descriptive message to syslog

This helps prevent the reaper process from dying if the DB disconnected us for idleness. This should fix #75: ("reaper dies if DB connection closes") http://code.google.com/p/mogilefs/issues/detail?id=75

I missed this when moving try_for() into Test.pm

We called MogileFS::Client::update_class incorrectly without the key. Additionally, the test for checking the number of copies was also incorrect.

[ew: added trivial test]

Signed-off-by: Eric Wong <[email protected]>

Signed-off-by: Eric Wong <[email protected]>

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index 64455f4..bd7e38a 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,29 @@
+2013-01-06: Release version 2.66
+
+    * add a hook to cmd_updateclass (Daniel Frett <[email protected]>)
+
+    * support updating the class to the default class which has an id of 0 (Daniel Frett <[email protected]>)
+
+    * reaper: validate DB connection before reaping (Eric Wong <[email protected]>)
+      Fixes occasional crash in reaper process.
+
+    * improve handling of classes which change hash algorithm (Eric Wong <[email protected]>)
+
+    * fsck: skip non-existent FIDs properly (Eric Wong <[email protected]>)
+
+    * fsck: use replicate lock when fixing FID (Eric Wong <[email protected]>)
+
+    * query: allow "0" key on all commands which take keys (Eric Wong <[email protected]>)
+
+    * prevent reqid mismatches (and queryworker death) (Eric Wong <[email protected]>)
+      Fixes crash case with specific error types.
+
+    * fix use_dest_devs for rebalance (Pyry Hakulinen <[email protected]>)
+      Fixes "use_dest_devs" argument during rebalance.
+
+    * Fix "skip_devcount" during rebalance (Pyry Hakulinen <[email protected]>)
+      Now actually skips updating devcount column during rebalance.
+
 2012-08-13: Release version 2.65
 
     * Postgres advisory lock instead of table-based lock (Robin H. Johnson <[email protected]>)

Signed-off-by: Eric Wong <[email protected]>

This way the queryworker will know it won't have to wait again for the monitor to run. This allows users to (manually) set higher intervals in Monitor.pm without noticing ill effects.

* set_observed_utilization() is a no-op, and can be safely removed.
* Looking up the device via factory does not incur DB hit since the factory changes of May 2011.
* Really avoids propagating invalid devids with correct ordering of the hash assignment. This prevents an invalid devid from hitting even the {devutil}->{cur} hash which lasts the lifetime of the monitor process.

As HTTP/1.1 servers tend to disconnect idle connections over time, recently-used queryworkers are more likely to have reusable HTTP connections. This can reduce the number of open HTTP sockets across the cluster during non-peak periods. This may improve performance in two ways:
  * recently-used worker processes should have better memory locality
  * can avoid the chance for TCP slow-start-after-idle behavior to kick in for DB connections.
The downside of this patch is memory/CPU usage between workers may appear lopsided and probably confuse users. This change should also make potential memory leaks more noticeable.
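The scheduling idea can be sketched as a LIFO stack of idle workers. This is an illustration only; the names @idle, mark_idle and take_worker are hypothetical and do not come from ProcManager.

```perl
use strict;
use warnings;

# Hypothetical idle pool: the most recently idled worker is last.
my @idle;

sub mark_idle   { my ($worker) = @_; push @idle, $worker; return; }

# pop() hands out the most recently used worker (LIFO); shift() would
# instead rotate through every worker in FIFO order.
sub take_worker { return pop @idle; }

mark_idle($_) for qw(w1 w2 w3);
my $first = take_worker();     # most recently idled worker
mark_idle($first);             # finishes its request, goes idle again
my $second = take_worker();    # same worker is chosen again
```

Under light load the same few workers keep getting picked, which is what makes per-worker usage look lopsided.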

Normally, disabling Nagle's algorithm would have little effect on typical MogileFS traffic:
  < read one request from client
  - process request in queryworker
  > write one response to client
  < read one request from client
  - process request in queryworker
  > write one response to client
  < read one request from client
  - process request in queryworker
  > write one response to client
  ...
However, in certain cases, clients may pipeline requests (and sort responses on the client side). This causes tracker traffic to end up like this:
  < read multiple requests from client
  - process requests in parallel on multiple queryworkers
  > write one response to client
  > write one response to client
  > write one response to client
  ...
Since Nagle's algorithm waits for an ACK from each response the server writes before sending the next response, it limits the rate at which the client can receive responses. Informal testing over loopback running the "file_info" command on two batches of 1000 keys each (2000 keys total) consistently reveals a small, ~20-60ms reduction (580-600ms -> 540-580ms) on a somewhat active machine with four queryworkers (and four cores). Like SO_KEEPALIVE, TCP_NODELAY is inherited from the listener by accepted sockets in every system I've checked, so there's no additional overhead in userspace when accepting new clients.
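For illustration, setting TCP_NODELAY once on a listening socket might look like the sketch below. The address, port and backlog are placeholder assumptions, and this is not the tracker's actual listener setup.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use Socket qw(IPPROTO_TCP TCP_NODELAY);

# Placeholder listener on an ephemeral loopback port (assumption).
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Listen    => 16,
    Proto     => 'tcp',
    ReuseAddr => 1,
) or die "listen failed: $!";

# Set the option once on the listener; per the message above, accepted
# sockets were observed to inherit it, so no per-client call is needed.
setsockopt($listener, IPPROTO_TCP, TCP_NODELAY, pack("i", 1))
    or die "setsockopt(TCP_NODELAY) failed: $!";

# Read the option back to confirm it took effect.
my $nodelay = unpack("i", getsockopt($listener, IPPROTO_TCP, TCP_NODELAY));
```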

The default (deferred) transaction mode in SQLite delays locking, potentially leading to "database is locked" errors on concurrent access. Immediate transactions lock the database immediately, preventing unnecessary errors at the cost of reduced concurrency. I've still occasionally encountered a "database is locked" or two on my SQLite deployment with many workers over the months. Tested on MySQL, Postgres, and DBD::SQLite 1.29 and 1.37. This feature appeared in DBD::SQLite 1.30, but the extra attribute for DBI->connect is harmless for drivers which do not support this attribute. ref: http://search.cpan.org/dist/DBD-SQLite/lib/DBD/SQLite.pm Using the following instrumentation patch, I have not hit busy/locked errors while putting my SQLite-based MogileFS instance through heavy activity (fsck, uploads, deletes): --- a/lib/MogileFS/Store/SQLite.pm +++ b/lib/MogileFS/Store/SQLite.pm @@ -164,7 +164,12 @@ use constant SQLITE_LOCKED => 6; # A table in the database is locked sub was_deadlock_error { my $err = $_[0]->dbh->err or return 0; - ($err == SQLITE_BUSY || $err == SQLITE_LOCKED); + if ($err == SQLITE_BUSY || $err == SQLITE_LOCKED) { + Mgd::log('info', "DB locked"); + 1; + } else { + 0; + } } sub was_duplicate_error {

* pull/26/head:
  if one really wants to be root - let him be
  Moved utf-8 config to http block
  [#58] support nginx server type in command line options
  [#58] relocate all temp_path's to a temp path specific to mogstored
  [#58] utilize non-daemon mode for nginx >= 1.0.9
  [#58] die if nginx fails to start
  clean up formatting, no functional changes
  [#58] store the nginx pid in the prefix dir and reduce the scope of the variable
  [#58] relocate the prefix directory to keep nginx from conflicting with other running copies. Thanks Gernot for the heads up
  [#58] only specify the root once in the server directive instead of for each configured location
  [#58] remove a couple unnecessary configuration directives per Gernot's recommendation
  [#58] disable logging and move the pid to the data docroot to make nginx backend less architecture dependent
  [#58] fix the code generating sections for each device
  [#58] load the Nginx server file so it can be used
  [#58] load the latest version of the nginx module

Debian squeeze (stable as of 2013/01) uses nginx 0.7.67, so there are likely many users still using this older version. Attempting to specify a dummy {uwsgi,scgi}_temp_path causes errors at startup for me. According to the nginx CHANGES file, uwsgi appeared in 0.8.40 and scgi appeared in 0.8.42. ref: http://nginx.org/en/CHANGES

also, only calculate the actual version once

This removes two DB calls from the latency-critical queryworker process. This may widen a race condition with reused explicit FIDs, but explicit FIDs are a bad idea anyways and reusing FIDs likely had problems before this change. I've also removed the Postgres-specific delete_fidid() function. commit 7dbfb44 (Make postgres use new delete worker code) removed the Postgres-specific code path and made it functionally identical to the generic version.

The lack of a mogstored sidechannel listener should not be fatal to a replication worker (or any other worker). This bug only affects checksums users who misconfigure mogstored.

Signed-off-by: Eric Wong <[email protected]>

Additionally, log redundant calls to err_line so we have a chance at figuring out what is causing redundant calls to err_line()

Failed set_server_settings calls just die with errors, so we need to track and log that to syslog. We'll also report we had a database error back to the client (but avoid propagating the exact error message, in case there is any sensitive information).

Signed-off-by: Eric Wong <[email protected]>

Calling Mogstored::HTTPServer::Perlbal->start() creates a kqueue descriptor. kqueue descriptors are invalidated across fork, so we must avoid kqueue creation until after daemonization. We continue starting non-Perlbal HTTP servers before daemonization, as error reporting can be easier if stderr/stdout are not redirected to /dev/null. ref: http://code.google.com/p/mogilefs/issues/detail?id=72 Cc: [email protected]

This version is still racy after several years. More importantly, it's missing the change in commit 5d01811 which allows the default class to be overridden.

Race conditions in create_class are unlikely to be a problem in normal usage, but this will discourage code duplication which can lead to maintainability issues.

A default class may enter the class table if its settings (e.g. mindevcount) are altered. The queryworker does not allow removing the default class, so the only way to remove it is to remove it when the domain goes away.

This is used in several places, and will make code easier to maintain going forward.

Now, all of our job classes may be controlled via "!want"

Prevents dogpiling on some slowish queries if your DB is hosed, or if you have many tempfile rows that need processing.

Changelog diff is: diff --git a/CHANGES b/CHANGES index bd7e38a..f0f578c 100644 --- a/CHANGES +++ b/CHANGES @@ -1,3 +1,36 @@ +2013-02-02: Release version 2.67 + + * Serialize tempfile reaping (dormando <[email protected]>) + + * reaper: ensure worker can be stopped via "!want" (Eric Wong <[email protected]>) + + * domain removal also removes its default class (Eric Wong <[email protected]>) + + * store: wrap create_class in a transaction to avoid races (Eric Wong <[email protected]>) + + * mogstored: fix kqueue usage with daemonization (Eric Wong <[email protected]>) + + * Filter the devices before we do an expensive sort. (Dave Lambley <[email protected]>) + + * httpfile: avoid killing worker on down sidechannel (Eric Wong <[email protected]>) + + * move checksum and tempfile delete to delete worker (Eric Wong <[email protected]>) + + * sqlite: use immediate transactions to prevent busy errors (Eric Wong <[email protected]>) + + * disable Nagle's algorithm for accepted clients (Eric Wong <[email protected]>) + + * ProcManager: favor using recently-used queryworkers (Eric Wong <[email protected]>) + + * Do both sorts in one method, to save on shared initialisation. (Dave Lambley <[email protected]>) + + * Pull out device sorting into it's own method for overriding. (Dave Lambley <[email protected]>) + + * Reseed the random number generator after forking. (Dave Lambley <[email protected]>) + + * support nginx server type in mogstored command line options (Daniel Frett <[email protected]>) + (also Gernot Vormayr <[email protected]>, others) + 2013-01-06: Release version 2.66 * add a hook to cmd_updateclass (Daniel Frett <[email protected]>)

Without specifying an ESCAPE character for LIKE queries, the '\' we use for escaping is treated as a literal and improperly matched keys with '\' in them under SQLite. This is only needed for SQLite, as the SQLite language reference makes no reference of a default ESCAPE character in http://www.sqlite.org/lang_expr.html ESCAPE is supported in MySQL and Postgres, too; and defaults to '\'. We specify it anyways to reduce code differences between different databases. Tested on MySQL 5.1.66 and Postgres 8.4.13 on Debian 6.0 and SQLite 3.7.13 on Debian 7.0

If we support non-SQL DBs in the future, escaping rules could become store-specific, so Worker/Query is not the right place for it. Since '%' and '\' may be escaped just like any other character, we may also allow these characters as prefixes by properly escaping them. Tested on MySQL 5.1.66 and Postgres 8.4.13 on Debian 6.0 and SQLite 3.7.13 on Debian 7.0
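A rough sketch of store-side prefix escaping with an explicit ESCAPE character follows. The helper name like_prefix is hypothetical, escaping '_' as well is an assumption beyond what the message states, and the query text is illustrative rather than the statement the Store actually issues.

```perl
use strict;
use warnings;

# Hypothetical helper: escape LIKE metacharacters in a key prefix with
# '\' and append the trailing wildcard. Escaping '_' is an assumption;
# the commit message only mentions '%' and '\'.
sub like_prefix {
    my ($prefix) = @_;
    (my $escaped = $prefix) =~ s/([%_\\])/\\$1/g;
    return $escaped . '%';
}

# Naming the escape character explicitly makes SQLite, MySQL and
# Postgres interpret the pattern the same way.
my $sql  = q{SELECT dkey FROM file WHERE dmid = ? AND dkey LIKE ? ESCAPE ?};
my @bind = (1, like_prefix('50%_off'), '\\');
```

With this, a prefix such as "50%" matches only keys that literally start with "50%" instead of everything starting with "50".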

Although never officially supported in MogileFS, some users will manage to change "dead" devices to another state. When running fsck, this may cause the desperate search to continually fail as any files found and added to file_on table will just be reaped. Reported-by: Ask Bjørn Hansen <[email protected]> Subject: fsck/FOND not adding a row to file_on Message-ID: <[email protected]>

note_on_device may croak, so avoid logging FOND until we've successfully called note_on_device to ensure the fsck log is consistent with what was done.

[ew: squashed Dave's change to make IO::AIO optional] Signed-off-by: Eric Wong <[email protected]>

…e don't find space on devices with known space free, try the unknowns. Signed-off-by: Eric Wong <[email protected]>

Logging times_out_of_qworkers in ProcessQueues is not accurate: recently-idle queryworkers may not be noticed and marked idle while ProcessQueues is looping and draining the @IdleQueryWorkers pool. Instead, only log times_out_of_qworkers when new requests are enqueued.

* bogomips/list_keys:
  list_keys: escape in Store, allow [%\\] as prefix
  list_keys: consistent ESCAPE usage across DB types

* bogomips/pending_queries: ProcManager: only log times_out_of_qworkers for new queries

IO::AIO 2.4 on Debian stable lacks IO::AIO::FADV_SEQUENTIAL constant, causing compilation to fail on the bareword. Accessing the constant as a subroutine call (via "()") avoids the bareword and defers the error to runtime (which is trapped by eval). Tested under IO::AIO 2.4 on Debian stable and IO::AIO 4.15 on Debian testing (verified fadvise64() syscall under strace).
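The bareword-versus-call distinction can be shown without IO::AIO installed. In this sketch Old::Module and MISSING_CONST are made-up stand-ins for a module version that lacks a newer constant.

```perl
use strict;
use warnings;

# Made-up stand-in for an older module that lacks a newer constant.
package Old::Module;
sub KNOWN_CONST () { 1 }

package main;

# Written as a bareword (Old::Module::MISSING_CONST), this would abort
# compilation under "strict subs". With "()" it compiles as a sub call
# and only fails at runtime, where eval can trap it.
my $value     = eval { Old::Module::MISSING_CONST() };
my $error     = $@;
my $supported = defined $value ? 1 : 0;

print $supported ? "constant available\n" : "constant missing, skipping\n";
```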

This is unlikely to be an issue in fsck, as fsck checks file size/existence before digesting the file.

fsck digests are deprioritized and serialized in mogstored, so it's nearly impossible to tell what's in the queue before our request. Since fsck is not latency critical, extend the timeout for that. We also need to account for normal seek/network latency for non-fsck digest requests, so add node_timeout to that. These bugs were mostly hidden since we are relying on <> to read, which may incur watchdog timeouts.

MogileFS::DeviceState was never updated for the 2.40 drain changes. The broken-since-2.40 should_have_files sub caused ReplicationPolicy::MultipleHosts to overreplicate files, as it was not counting drain devices in the total disks check. Thanks to Tim for reporting this to the mailing list at [email protected]

With many fsck workers and slow fsck (due to checksumming large files and/or high network latency), it may be possible for fsck workers to start working on the same FID without a lock. ref: ML Subject: "FSCK Status/Log Entries"

These subroutines are unused.

Log the correct name of the failed function and the error string associated with the OS errno to aid in debugging. ML Ref: Date: Mon, 8 Jul 2013 17:14:40 -0700 (PDT) From: Tim <[email protected]> To: [email protected] Message-Id: <[email protected]> Subject: MogileFS crashes

Mogstored/SideChannelClient.pm may hit the following on I/O error: $self->write("ERR read $uri at $offset failed\r\n"); Be prepared to show that error to tracker watchers (and any other possible errors mogstored may return in the future).

This can be useful when MultipleHosts is too noisy when hosts differ greatly in storage capacity. The intended target of this policy is a low-priority backup cluster where a single host contains the bulk of the storage with a handful of random machines helping out. The MultipleHosts policy can be too noisy with log messages about running out of suggestions in this case.

Auto-reconnect is probably always unsafe while holding a lock on all networked databases. While we do not use the builtin auto-reconnect functionality of MySQL, any auto-reconnect implementation should be affected by the same issues upon connection failure: https://dev.mysql.com/doc/refman/5.6/en/auto-reconnect.html With auto-reconnect, we could be operating under the false assumption we have a lock after the reconnect when we do not. For now, the easiest method of recovery is to just let the worker die while working on the current task and have the ProcManager restart it.

Dropping a connection while holding an advisory lock with MySQL or Postgres will cause a fatal error, so hold the connection open until the next time the dbh is requested without holding a lock.

And when running without a job_master, do not spawn job_master-dependent workers (delete, fsck, replicate) as those workers will never get work. Running a queryworker+monitor in a remote datacenter makes sense with the MogileFS::Network plugins since the "create_close" size verification is faster and more reliable if the queryworker is in the same datacenter as the client, even if the master DB is in a remote datacenter. Being in a remote datacenter, (master)DB-intensive operations from delete, fsck and replicate workers can encounter high latency and an unreliable link, so admins may disable those workers in this situation. However, disabling delete, fsck, and replicate workers individually still allows the job_master to fill the initial queues (which is never processed) and prevent other trackers from processing items for 1000 seconds. Future commits may allow job_master to ignore certain queues if there are zero workers for that queue, but for now, stopping job_master entirely should be sufficient for most users with trackers in a different datacenter than the DB. P.S. It also makes sense to disable reaper in remote datacenters, too, but reaper does not rely on job_master.

The monitor may send large state events for large installations with many hosts, devices, domains, or classes. The 1K default is too small and leads to excessive syscalls and string operations. This increases startup performance for a mock instance with 10K domains and 10K non-default classes. Using the parent_ping function and "No simple reply" warning as an informal benchmark, this change reduces the loop time from 12 to 10 loops.

This join() takes about 20ms on my mock instance with 10K domains and 10K classes, so it has some impact on startup performance.

This was excessively expensive for my instance with 10K domains and 10K classes. Applying state information without incurring IPC/scheduling costs allows non-monitor workers to start up within ~4 seconds of the monitor starting up.

Changelog diff is: diff --git a/CHANGES b/CHANGES index f0f578c..b74f7f4 100644 --- a/CHANGES +++ b/CHANGES @@ -1,3 +1,35 @@ +2013-08-07: Release version 2.68 + + * optimize monitor worker for large installs (Eric Wong <[email protected]>) + + * allow startup without job_master (and dependent workers) (Eric Wong <[email protected]>) + + * store: do not disconnect for max_handles while locked (Eric Wong <[email protected]>) + + * store: do not auto-reconnect while holding a lock (Eric Wong <[email protected]>) + + * add naive MultipleDevice replication policy (Eric Wong <[email protected]>) + + * httpfile: log mogstored I/O errors when checksumming (Eric Wong <[email protected]>) + + * ProcManager: log socketpair errors correctly (Eric Wong <[email protected]>) + + * fix "drain" handling used by MultipleHosts replpolicy (Eric Wong <[email protected]>) + + * httpfile: correct timeouts for sidechannel digest (Eric Wong <[email protected]>) + + * httpfile: correct FILE_MISSING check in digest_mgmt (Eric Wong <[email protected]>) + + * mogstored: avoid bareword on IO::AIO w/o fadvise (Eric Wong <[email protected]>) + + * ProcManager: only log times_out_of_qworkers for new queries (Eric Wong <[email protected]>) + + * Don't emit warnings if we're lacking the space free of a device. If we don't find space on devices with known space free, try the unknowns. (Dave Lambley <[email protected]>) + + * list_keys: escape in Store, allow [%\\] as prefix (Eric Wong <[email protected]>) + + * list_keys: consistent ESCAPE usage across DB types (Eric Wong <[email protected]>) + 2013-02-02: Release version 2.67 * Serialize tempfile reaping (dormando <[email protected]>)

We will be using Danga::Socket in more (possibly all) workers, not just the Monitor and Reaper. Resetting in workers that do not use Danga::Socket is harmless and will not allocate epoll/kqueue descriptors until the worker actually uses Danga::Socket.

In order to migrate to the upcoming Danga::Socket-based HTTP API, we'll first refactor monitor to use the new API (but preserve LWP usage behind-the-scenes). DEBUG=1 users will see the elapsed time for all device refreshes each time monitor runs. While we're at it, also guard against race conditions on the PUT/GET test by double-checking on failure. (A long-standing TODO item.) Also squashed the following commit: "use conn_timeout in monitor, node_timeout in other workers". This matches the behavior in MogileFS::Server 2.65. It makes sense to use a different, lower timeout in monitor to quickly detect overloaded nodes and avoid propagating their liveness for a monitoring period. It also makes sense to use a higher value for node_timeout in other workers since other actions are less fault-tolerant. For example, a timed-out size check in create_close may cause a client to eventually reupload the file, creating even more load on the cluster.

Net::HTTP::NB is usable with Danga::Socket and may be used to make HTTP requests in parallel. The new connection pool supports persistent connection pooling similar to LWP::ConnCache. Total connection capacity is enforced to prevent out-of-FD situations on the workers. Unlike LWP::ConnCache, MogileFS::ConnectionPool is designed for use with concurrent, active connections. It also supports queueing (when any enforced capacity or system limits are reached) and relies on Danga::Socket for scheduling queued connections. In addition to total capacity limits, MogileFS::ConnectionPool also supports limiting concurrency on a per-destination basis to avoid potentially overloading a single destination. Currently, we limit ourselves to 20 connections from a single worker (matching the old LWP limit) and also limit ourselves to 20 connections to a single host (again matching our previous LWP behavior).

In the future, this will allow JobMaster to write concurrently to ProcManager (or even individual workers) without blocking. (tweaked to accommodate "!want 0 job_master" support)

This backoff handling in HTTPFile is redundant for several reasons:
  * We rely on the monitor worker anyways to inform us of unreachable hosts
  * Monitor runs much faster nowadays, giving us a smaller window for out-of-date information about host reachability
  * HTTPFile->size no longer connects to the sidechannel port, only HTTP, so we waste fewer syscalls on failure if a host went down before the last monitor run.

This allows us to speed up fsck on high latency clusters by issuing parallel HEAD requests.

This allows us to use the same HTTP connections between digest and HTTP size checks, reducing the number of open connections we need in the Fsck worker.

This simplifies the delete subroutine and should reduce the number of sockets created during rebalance.

This allows us to avoid running ourselves out of local ports when handling massive delete storms. Eventually, we can parallelize deletes in a manner similar to fsck size checking.

This can reduce latency for folks still stuck with MKCOL. This creates no new sockets for replicate and monitor in all cases, as connections to the HTTP DAV server are already used in those workers. This only adds new persistent connections to the queryworker if GET-only HTTP ports are configured (queryworker already may call HTTPFile->size).

For setups stuck needing MKCOL, we can parallelize directory vivification for multi-destination uploads.

There's no reason we should ever skip Content-Length validation if we know which FID we're replicating and have an FID object handy. Conflicts: lib/MogileFS/Worker/Replicate.pm

This should reduce the amount of TIME-WAIT sockets and TCP handshakes when replicating, especially with small files. An attempt was previously made to use the Net::HTTP::NB API directly, but that resulted in complicated callback nesting and state management needed to throttle the reader if the sender socket were blocked in any way. There were many bugs in the early version of this code as a result of the complicated code. Even after all the bugs got fixed, a small performance reduction due to the extra buffer copies was difficult to avoid. Thus I started using the synchronous version to keep the code simple and fast while preserving the ability to use persistent sockets to avoid excessive TIME-WAIT and handshaking for small file replication.

MogileFS::ConnectionPool::conn_get may return undef on some errors, so we must account for that and not kill the replicate worker.

Send the entire error message (including intended host:port) so it is more informative when it propagates to Connection::HTTP::err_response. We also do not need to log the error in ConnectionPool, as the error will be logged by the caller. While we're at it, fix the documentation and a spelling error in err_response, too.

We need to ensure we don't blow up a worker process if a server is shutdown and a connection attempted before the monitor notices.

We will want similar logic for Mogstored sidechannel to avoid retrying on timeout.

String representations of small floating point values may be in (scientific) E notation, so we must ensure the entire string is free of decimal digits before considering it a configuration key.

Workers only need to inherit the minimum amount necessary from the parent ProcManager. Keeping the socket of unrelated workers in each worker is wasteful and may contribute to premature resource exhaustion. Additionally, we will be using Danga::Socket in more (possibly all) workers, not just the Monitor and Reaper. Resetting in workers that do not use Danga::Socket is harmless and will not allocate epoll/kqueue descriptors until the worker actually uses Danga::Socket.

The timeout we're removing includes time spent in the queue waiting to even start, so reporting it in the syslog is confusing, especially since we already log the timeout via Connection::Poolable. This avoids a confusing sequence of error messages like the following:
  [monitor(666)] node_timeout: 2 (elapsed: 2.00099802017212): GET http://127.0.0.1:7500/dev666/usage
  [monitor(666)] Timeout contacting 127.0.0.1 dev 666 (http://127.0.0.1:7500/dev666/usage): took 2.25 seconds out of 2 allowed
Now, we only display the first message.

Changelog diff is: diff --git a/CHANGES b/CHANGES index b74f7f4..a6b2872 100644 --- a/CHANGES +++ b/CHANGES @@ -1,3 +1,26 @@ +2013-08-18: Release version 2.70 + + * This release features a very large rewrite to the Monitor worker to run + checks in parallel. There are no DB schema changes. + + * replicate: use persistent connection from pool if possible (Eric Wong <[email protected]>) + + * replicate: enforce expected Content-Length in http_copy (Eric Wong <[email protected]>) + + * create_open: parallelize directory vivification (Eric Wong <[email protected]>) + + * device: reuse HTTP connections for MKCOL (Eric Wong <[email protected]>) + + * delete worker uses persistent HTTP connections (Eric Wong <[email protected]>) + + * httpfile: use HTTP connection pool for DELETE (Eric Wong <[email protected]>) + + * httpfile: use Net::HTTP::NB, remove LWP::UserAgent (Eric Wong <[email protected]>) + + * fsck: parallelize size checks for any given FID (Eric Wong <[email protected]>) + + * monitor: refactor/rewrite to use new async API (Eric Wong <[email protected]>) + 2013-08-07: Release version 2.68 * optimize monitor worker for large installs (Eric Wong <[email protected]>)

Clarified by Brad Fitzpatrick

Marking an entire host as "readonly" before a host maintenance window can be useful and is easier than marking each device "readonly". It also reduces the likelihood a device will be incorrectly marked as "alive" again when it is intended to stay down.

This allows the monitor to eventually notice a client socket is totally gone if a machine death was not detected earlier. We enable TCP keepalive everywhere else, too.

This defines the size of the HTTP connection pool. This affects all workers at the moment, but is likely most interesting to the Monitor as it affects the number of devices the monitor may concurrently update. This defaults to 20 (the long-existing, hard-coded value). In the future, there may be a(n easy) way to specify this on a per-worker basis, but for now it affects all workers.

Blindly attempting to write to a socket before a TCP connection can be established returns EAGAIN on Linux, but not on FreeBSD 8/9. This causes Danga::Socket to error out, as it won't attempt to buffer on anything but EAGAIN on write() attempts. Now, we buffer writes explicitly after the initial socket creation and connect(), and only call Danga::Socket::write when we've established writability. This works on Linux, too, and avoids an unnecessary syscall in most cases. Reported-by: Alex Yakovenko <[email protected]>

Otherwise we'll end up constantly waking up when there's nothing to write.

The timeout check may run on a socket before epoll_wait/kevent has a chance to run, giving the application no chance for any readiness callbacks to fire. This prevents timeouts in the monitor if the database is slow during synchronous UPDATE device calls (or there are just thousands of active connections).

HTTP requests time out because we had to wait synchronously for DBI; this is very noticeable on a high-latency connection. So avoid running synchronous code while asynchronous code (which is subject to timeouts) is running.

With enough devices and high enough network latency to the DB, we bump into the watchdog timeout of 30s easily.

Issuing many UPDATE statements slows down monitoring on high latency connections between the monitor and DB. Under MySQL, it is possible to do multiple UPDATEs in a single statement using CASE/WHEN syntax. We limit ourselves to 10000 devices per update for now; this should keep us comfortably under the max_allowed_packet size of most MySQL deployments (where the default is 1M). A compatibility function is provided for SQLite and Postgres users. SQLite users are not expected to run this over high-latency NFS, and interested Postgres users should submit their own implementation.
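As a sketch of the CASE/WHEN batching described above (not the Store::MySQL code itself), the statement could be assembled along these lines. The function names, the choice of the mb_used column and the input shape are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical builder: fold many per-device updates into one UPDATE.
# $updates maps devid => new mb_used value (illustrative column choice).
sub build_batch_update {
    my ($updates) = @_;
    my @devids = sort { $a <=> $b } keys %$updates;
    return unless @devids;

    my (@when, @bind);
    for my $devid (@devids) {
        push @when, "WHEN ? THEN ?";
        push @bind, $devid, $updates->{$devid};
    }
    my $in  = join(',', ('?') x @devids);
    my $sql = "UPDATE device SET mb_used = CASE devid "
            . join(' ', @when)
            . " ELSE mb_used END WHERE devid IN ($in)";
    return ($sql, @bind, @devids);
}

# Hypothetical chunker: cap the number of devices per statement.
sub chunk_list {
    my ($size, @items) = @_;
    my @chunks;
    push @chunks, [ splice(@items, 0, $size) ] while @items;
    return @chunks;
}

my ($sql, @bind) = build_batch_update({ 1 => 10, 2 => 20 });
print "$sql\n";
```

One round trip then carries the whole batch, which is what matters on a high-latency link between the monitor and the DB.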

mark_fidid_unreachable has not been used since MogileFS 2.35 commit 53528c7 ("Wipe out old replication code.", r1432)

No longer used since commit ebf8a5a ("Mass nuke unused code and fix most tests") in MogileFS 2.50

"is not unique" => "UNIQUE constraint failed". String matching is lovely.

Changelog diff is: diff --git a/CHANGES b/CHANGES index a6b2872..441b328 100644 --- a/CHANGES +++ b/CHANGES @@ -1,3 +1,29 @@ +2014-12-15: Release version 2.72 + + * Work with DBD::SQLite's latest lock errors (dormando <[email protected]>) + + * remove update_host_property (Eric Wong <[email protected]>) + + * remove users of unreachable_fids table (Eric Wong <[email protected]>) + + * monitor: batch MySQL device table updates (Eric Wong <[email protected]>) + + * monitor: defer DB updates until all HTTP requests are done (Eric Wong <[email protected]>) + + * connection/poolable: defer expiry of timed out connections (Eric Wong <[email protected]>) + + * connection/poolable: disable watch_write before retrying write (Eric Wong <[email protected]>) + + * connection/poolable: do not write before event_write (Eric Wong <[email protected]>) + + * add conn_pool_size configuration option (Eric Wong <[email protected]>) + + * enable TCP keepalives for iostat watcher sockets (Eric Wong <[email protected]>) + + * host: add "readonly" state to override device "alive" state (Eric Wong <[email protected]>) + + * add LICENSE file to distro (dormando <[email protected]>) + 2013-08-18: Release version 2.70 * This release features a very large rewrite to the Monitor worker to run

Due to a bug in the MultipleNetworks replication policy <[email protected]>, a network split caused an instance to explode with overreplicated files. Since every too_happy pruning increases failcount, it could end up taking days to clean up a file with far too many replicas.

The readonly host state was not enabled via mogdbsetup and could not be used although the code supports it, making the schema version bump to 16 a no-op. This bumps the schema version to 17. Add a test using mogadm to ensure the setting is changeable, as the existing test for this state did not rely on the database. This was also completely broken with Postgres before, as Postgres currently offers no way to modify constraints in-place. Constraints must be dropped and re-added instead. Note: it seems the upgrade_add_device_* functions in Postgres.pm are untested as well and never got used. Perhaps they ought to be removed entirely since those device columns predate Postgres support.

Perl buffered IO is only reading 8K at a time (or only 4K on older versions!) despite us requesting to read in 1MB chunks. This wastes syscalls and can affect TCP window scaling when MogileFS is replicating across long fat networks (LFN). While we're at it, this fixes a long-standing FIXME item to perform proper timeouts when reading headers as we're forced to do sysread instead of line-buffered I/O. ref: https://rt.perl.org/Public/Bug/Display.html?id=126403 (and confirmed by strace-ing replication workers)
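A simplified illustration of reading with sysread in large chunks is below. It reads from a temporary file rather than a socket, and the helper name read_all_sysread is hypothetical; the real replication code also has to deal with headers and timeouts.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: drain a handle using unbuffered sysread calls of
# up to $chunk_size bytes each, bypassing Perl's buffered I/O layer.
sub read_all_sysread {
    my ($fh, $chunk_size) = @_;
    my ($data, $calls) = ('', 0);
    while (1) {
        my $n = sysread($fh, my $buf, $chunk_size);
        die "sysread failed: $!" unless defined $n;
        last if $n == 0;          # EOF
        $data .= $buf;
        $calls++;
    }
    return ($data, $calls);
}

# Demo against a temporary file standing in for the HTTP socket.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} 'x' x 2_500_000;
close $tmp or die "close failed: $!";

open(my $in, '<:raw', $path) or die "open failed: $!";
my ($data, $calls) = read_all_sysread($in, 1024 * 1024);
close $in;
```

Running a script like this under strace should show read() calls sized by $chunk_size rather than by Perl's internal buffer.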

* bogomips/fix-readonly: enable DB upgrade for host readonly state

* bogomips/fsck-recheck: fsck: this avoid redundant fsck log entries

* bogomips/fsck-found-order: fsck: do not log FOND if note_on_device croaks

* bogomips/prune-too_happy-v3: replicate: reduce backoff for too_happy FIDs

* bogomips/resurrect-device: reaper: detect resurrection of "dead" devices

Perl 5.18 stable and later (commit a7b39f85d7caac) introduced a warning for restarting `each` after hash modification. While we accounted for this undefined behavior and documented it in the past, this may still cause maintenance problems in the future despite our current workarounds being sufficient. In any case, keeping idle sockets around is cheap with modern APIs, and conn_pool_size was introduced in 2.72 to avoid dropping idle connections at all; so _conn_drop_idle may never be called on a properly configured tracker. Mailing list references: <CABJfL5jiAGC+5JzZjuW7R_NXs1DShHPGsKnjzXrPbjWOy2wi3g@mail.gmail.com> <[email protected]>
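For context, one common way to sidestep `each` iterator pitfalls is to loop over a snapshot of the keys. This is a generic sketch and not necessarily what this commit does; the %idle_since table and $cutoff value are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical idle table: connection key => last-used timestamp.
my %idle_since = (conn_a => 100, conn_b => 250, conn_c => 90);
my $cutoff     = 200;

# keys() builds the full key list before the loop body runs, so
# deleting entries inside the loop never disturbs a live iterator
# the way modifying a hash during each() can.
my @dropped;
for my $key (sort keys %idle_since) {
    next if $idle_since{$key} >= $cutoff;
    delete $idle_since{$key};
    push @dropped, $key;
}
```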

On *BSD platforms, the accept()-ed clients inherit the O_NONBLOCK file flag from the listen socket. This is not true on Linux, and I noticed sockets blocking on write() syscalls via strace. Checking the octal 04000 (O_NONBLOCK) flag in /proc/$PID/fdinfo/$FD for client TCP sockets confirms O_NONBLOCK was not set. Setting O_NONBLOCK explicitly on accepted client sockets also makes us resilient to spurious wakeups causing event_read to get stuck, as documented in the Linux select(2) manpage.
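A minimal sketch of the portable approach, assuming a loopback listener on an ephemeral port purely for demonstration: rather than relying on what accept() inherits, set the flag on every accepted socket and verify it with fcntl. The variable names are illustrative and not taken from the tracker code.

```perl
use strict;
use warnings;
use Fcntl qw(F_GETFL O_NONBLOCK);
use IO::Socket::INET;

# Loopback listener on an ephemeral port (demo assumption)
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Listen    => 5,
    Proto     => 'tcp',
) or die "listen: $@";

my $peer = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $listener->sockport,
    Proto    => 'tcp',
) or die "connect: $@";

my $client = $listener->accept or die "accept: $!";

# Do not depend on platform-specific inheritance: always set it explicitly.
$client->blocking(0);

my $flags = fcntl($client, F_GETFL, 0) or die "fcntl: $!";
my $nonblocking = ($flags & O_NONBLOCK) ? 1 : 0;
print "O_NONBLOCK set: $nonblocking\n";
```

The fcntl check reads the same flag that the /proc/$PID/fdinfo/$FD inspection shows on Linux.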

Make client query processing less aggressive and more fair by only enqueueing a single worker request at a time. Pipelined requests in the read buffer will only be handled after successful writes, and any incomplete writes will block further request processing. Furthermore, add a watchdog to expire clients which are not reading the responses we write to them. Danga::Socket allows clients to use an infinite amount of space for buffering, and it's possible for dead sockets to go undetected for hours by the OS. Use a watchdog to kick out any sockets which have made no forward progress after two minutes.
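The expiry rule can be sketched independently of the event loop. Everything below is hypothetical bookkeeping (`%last_progress`, `note_progress`, `expire_stalled`) meant only to illustrate "no forward progress after two minutes"; the real tracker hooks this into Danga::Socket rather than a plain hash.

```perl
use strict;
use warnings;

use constant WRITE_TIMEOUT => 120;    # two minutes, as described above

# Hypothetical bookkeeping: client id => time of last successful write progress
my %last_progress;

sub note_progress {
    my ($id, $now) = @_;
    $last_progress{$id} = $now;
}

# Return the ids that made no progress for WRITE_TIMEOUT seconds
# and stop tracking them.
sub expire_stalled {
    my ($now) = @_;
    my @expired = grep { $now - $last_progress{$_} >= WRITE_TIMEOUT }
                  keys %last_progress;
    @expired = sort @expired;
    delete @last_progress{@expired};
    return @expired;
}

note_progress('fast', 100);    # wrote something 25 seconds ago
note_progress('slow', 0);      # stalled for 125 seconds
my @kicked = expire_stalled(125);
print "expired: @kicked\n";
```

In a real event loop the check would run from a periodic timer, and expiring a client would also close its socket and release its write buffers.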

This avoids the odd case where the first write completes, but the second one (for 3 bytes: ".\r\n") does not complete, causing a client to have both read and write watchability enabled after the previous commit to stop reads when writes do not complete. This would not be fatal, but breaks the rule where clients should only be reading or writing exclusively, never doing both, as that could lead to pathological memory usage. This also reduces client wakeups and TCP overhead with TCP_NODELAY sockets by avoiding a small packet (".\r\n") after the main response.

Otherwise it'll be possible to pipeline admin (!) commands and event_read will trigger EOF before all the admin commands are processed in read_buf.

* client-backpressure:
  client: always disable watch_read after a command
  client: use single write for admin commands
  tracker: client fairness, backpressure, and expiry
  client connection should always be nonblocking

* bogomips/replicate-nobuf: replicate: avoid buffered IO on reads

* bogomips/conn-pool-each: ConnectionPool: avoid undefined behavior for hash iteration

If DevFID::size_on_disk encounters an unreadable (dead) device AND there are no HTTP requests pending, we must ensure Danga::Socket runs the PostLoopCallback to check if the event loop is complete. Do that by scheduling another timer to run immediately.

* fsck-timeout: fsck: avoid infinite wait on dead devices

Changelog diff is:

diff --git a/CHANGES b/CHANGES
index 441b328..e053851 100644
--- a/CHANGES
+++ b/CHANGES
@@ -1,3 +1,29 @@
+2018-01-18: Release version 2.73
+
+  * fsck: avoid infinite wait on dead devices (Eric Wong <[email protected]>)
+
+  * client: always disable watch_read after a command (Eric Wong <[email protected]>)
+
+  * client: use single write for admin commands (Eric Wong <[email protected]>)
+
+  * tracker: client fairness, backpressure, and expiry (Eric Wong <[email protected]>)
+
+  * client connection should always be nonblocking (Eric Wong <[email protected]>)
+
+  * ConnectionPool: avoid undefined behavior for hash iteration (Eric Wong <[email protected]>)
+
+  * replicate: avoid buffered IO on reads (Eric Wong <[email protected]>)
+
+  * enable DB upgrade for host readonly state (Eric Wong <[email protected]>)
+
+  * replicate: reduce backoff for too_happy FIDs (Eric Wong <[email protected]>)
+
+  * fsck: this avoid redundant fsck log entries (Eric Wong <[email protected]>)
+
+  * fsck: do not log FOND if note_on_device croaks (Eric Wong <[email protected]>)
+
+  * reaper: detect resurrection of "dead" devices (Eric Wong <[email protected]>)
+
 2014-12-15: Release version 2.72
 
   * Work with DBD::SQLite's latest lock errors (dormando <[email protected]>)

Commits on Dec 23, 2012

Moved utf-8 config to http block

notti committed Dec 23, 2012

02ed229