Remote snapshotter locks cache during microVM vsock dial #687

Open
austinvazquez opened this issue Jun 24, 2022 · 1 comment
@austinvazquez
Contributor

Context:

The demux snapshotter utilizes a snapshotter caching mechanism for funneling requests to the appropriate remote snapshotter.

The cache serves two purposes. First, performance: creating the proxy object can be an expensive operation. Second, the cache backs service discovery for the metrics proxy.
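
For reference, a minimal sketch of the cache shape (the mutex and snapshotters fields appear in the code further down; anything beyond that is assumed):

type SnapshotterCache struct {
	// Guards snapshotters; readers take RLock, writers take Lock.
	mutex sync.RWMutex
	// Maps a request key to its cached remote snapshotter proxy.
	snapshotters map[string]*proxy.RemoteSnapshotter
}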

Snapshotter requests can occur in parallel, so access to the cache must be protected. The existing implementation performs the following operations (a sketch of this follows the list):

  1. Acquire reader's lock.
  2. Fetch snapshotter from cache.
  3. Release reader's lock.
  4. If cache hit, done.
  5. If cache miss, acquire writer's lock.
  6. Fetch snapshotter from cache.
  7. If cache hit, jump to step 10.
  8. If cache miss, continue.
  9. Create cache entry using fetch function.
  10. Release writer's lock.
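
For reference, roughly what the current Get looks like (a sketch reconstructed from the steps above, not the exact code in the repository):

// Sketch of the current behavior: the potentially expensive fetch (the vsock
// dial into the microVM) runs at step 9 while the writer's lock is still held.
func (cache *SnapshotterCache) Get(ctx context.Context, key string, fetch SnapshotterProvider) (*proxy.RemoteSnapshotter, error) {
	cache.mutex.RLock()                        // step 1
	snapshotter, ok := cache.snapshotters[key] // step 2
	cache.mutex.RUnlock()                      // step 3

	if !ok { // steps 4 and 5
		cache.mutex.Lock()
		defer cache.mutex.Unlock() // step 10

		// Steps 6-8: double check under the writer's lock.
		if snapshotter, ok = cache.snapshotters[key]; !ok {
			newSnapshotter, err := fetch(ctx, key) // step 9: expensive, lock held
			if err != nil {
				return nil, err
			}
			cache.snapshotters[key] = newSnapshotter
			snapshotter = newSnapshotter
		}
	}
	return snapshotter, nil
}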

We use the double-checked lock to ensure no system resources are leaked when two threads race to populate the same cache entry. For example, without the second check:
Thread A - acquires reader's lock, cache miss, releases reader's lock, gets context switched.
Thread B - acquires reader's lock, cache miss, releases reader's lock, gets context switched.
Note: at this point both threads have had a cache miss and are on course to populate the cache.
Thread A - acquires writer's lock, populates cache entry, releases writer's lock.
Thread B - acquires writer's lock, populates cache entry, releases writer's lock.
Note: at this point the cache entry created by Thread A has been overwritten and is leaked. While garbage collection will eventually reclaim the object itself, these entries manage system resources that back the metrics proxy, in this case the system port on which the metrics proxy HTTP server listens.
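
To make the leak concrete, here is a purely hypothetical shape for a cached entry; the real proxy.RemoteSnapshotter differs, but the point is that only Close() returns the metrics proxy port to the system:

// Hypothetical illustration only: the cached entry owns the listener backing
// its metrics proxy HTTP server. If a duplicate entry is overwritten in the
// cache and never Close()d, the port it bound is never released.
type remoteSnapshotterEntry struct {
	metricsListener net.Listener
}

func (e *remoteSnapshotterEntry) Close() error {
	return e.metricsListener.Close() // frees the bound system port
}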

Challenge:

The issue is that the writer's lock is held during the cache entry fetch (the vsock dial into the microVM), which we have observed can be an expensive operation on some systems. The ideal solution would release the lock after a writer's-lock cache miss rather than holding it through the fetch; however, we must stay mindful of the scenario above and avoid leaking resources.

@austinvazquez
Contributor Author

If we are to stick with the current mechanism of a single RWMutex and map, then the solution may look like this:

// Get fetches and caches the snapshotter for a given key.
func (cache *SnapshotterCache) Get(ctx context.Context, key string, fetch SnapshotterProvider) (*proxy.RemoteSnapshotter, error) {
	cache.mutex.RLock()
	snapshotter, ok := cache.snapshotters[key]
	cache.mutex.RUnlock()

	if !ok {
		// Fetch outside the lock; the dial into the microVM can be slow.
		newSnapshotter, err := fetch(ctx, key)
		if err != nil {
			return nil, err
		}

		cache.mutex.Lock()
		defer cache.mutex.Unlock()

		// Double check: another thread may have populated the entry while we fetched.
		if snapshotter, ok = cache.snapshotters[key]; !ok {
			cache.snapshotters[key] = newSnapshotter
			snapshotter = newSnapshotter
		} else {
			// Another thread won the race; release the resources we allocated.
			newSnapshotter.Close()
		}
	}
	return snapshotter, nil
}

Even as the author, I admit it looks somewhat ugly. It could be improved by breaking the separate pieces of functionality out into named functions. I also have a cache refactor out for review that makes the fetch function a required instance variable, which will eliminate the need to pass it through. For example, with the pull-through path broken out:

// Get fetches and caches the snapshotter for a given key.
func (cache *SnapshotterCache) Get(ctx context.Context, key string, fetch SnapshotterProvider) (*proxy.RemoteSnapshotter, error) {
	cache.mutex.RLock()
	snapshotter, ok := cache.snapshotters[key]
	cache.mutex.RUnlock()

	if !ok {
		var err error
		if snapshotter, err = cache.pullThrough(ctx, key, fetch); err != nil {
			return nil, err
		}
	}

	return snapshotter, nil
}

func (cache *SnapshotterCache) pullThrough(ctx context.Context, key string, pull SnapshotterProvider) (*proxy.RemoteSnapshotter, error) {
	// Pull outside the lock; the vsock dial into the microVM can be slow.
	snapshotter, err := pull(ctx, key)
	if err != nil {
		return nil, err
	}
	cache.mutex.Lock()
	defer cache.mutex.Unlock()

	if s, ok := cache.snapshotters[key]; ok {
		// Entry pulled through by another thread. Clean up the resources allocated.
		snapshotter.Close()
		return s, nil
	}
	cache.snapshotters[key] = snapshotter
	return snapshotter, nil
}
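
Call sites stay the same either way; for example (dialRemoteSnapshotter below is a stand-in for the real fetch function, not something in the repository):

// Illustrative call site only. dialRemoteSnapshotter is a hypothetical
// SnapshotterProvider that performs the vsock dial into the microVM; the
// double-checked locking is hidden entirely behind Get.
snapshotter, err := cache.Get(ctx, key, dialRemoteSnapshotter)
if err != nil {
	return nil, err
}
// Use snapshotter to forward the request to the remote snapshotter in the VM.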
