Remote output health improvements #4185

michel-laterman · 2024-12-06T21:09:31Z

Describe the enhancement:

Currently, remote output health is reported by the policy self monitor whenever updateState is called:

fleet-server/internal/pkg/policy/self.go

Lines 262 to 264 in cf41f38

    
    func reportOutputHealth(ctx context.Context, bulker bulk.Bulk, zlog zerolog.Logger) {
    	//pinging logic
    	bulkerMap := bulker.GetBulkerMap()

This creates a document in the primary ES instance with the output health status:

fleet-server/internal/pkg/dl/output_health.go

Lines 17 to 41 in cf41f38

    
    func CreateOutputHealth(ctx context.Context, bulker bulk.Bulk, doc model.OutputHealth) error {
    	return createOutputHealth(ctx, bulker, FleetOutputHealth, doc)
    }

    func createOutputHealth(ctx context.Context, bulker bulk.Bulk, index string, doc model.OutputHealth) error {
    	if doc.Timestamp == "" {
    		doc.Timestamp = time.Now().UTC().Format(time.RFC3339)
    	}
    	doc.DataStream = &model.DataStream{
    		Dataset:   "fleet_server.output_health",
    		Type:      "logs",
    		Namespace: "default",
    	}
    	body, err := json.Marshal(doc)
    	if err != nil {
    		return err
    	}
    	id, err := uuid.NewV4()
    	if err != nil {
    		return err
    	}
    	_, err = bulker.Create(ctx, index, id.String(), body, bulk.WithRefresh())
    	return err
    }


However, the policy self monitor may not be a good place for these updates, as the monitor does not actually use the output bulker health signal.
Additionally, gathering a reference to all bulkers may cause concurrency issues, as seen in #4170.
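To illustrate the concurrency concern, here is a minimal sketch of one common way to avoid handing out a shared map: guard it with a mutex and return a copy. `BulkerRegistry`, `RemoteBulker`, and `Snapshot` are hypothetical names for this example, not fleet-server types, and this is not necessarily how the bulker map is actually handled.

```go
package main

import "sync"

// RemoteBulker is a hypothetical stand-in for a remote output bulker.
type RemoteBulker struct{ Name string }

// BulkerRegistry is a hypothetical holder for the remote bulker map.
// All access goes through the mutex, so the map itself is never shared.
type BulkerRegistry struct {
	mu      sync.RWMutex
	bulkers map[string]*RemoteBulker
}

// NewBulkerRegistry returns an empty registry.
func NewBulkerRegistry() *BulkerRegistry {
	return &BulkerRegistry{bulkers: make(map[string]*RemoteBulker)}
}

// Set adds or replaces the bulker registered under name.
func (r *BulkerRegistry) Set(name string, b *RemoteBulker) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.bulkers[name] = b
}

// Delete removes the bulker registered under name, if any.
func (r *BulkerRegistry) Delete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.bulkers, name)
}

// Snapshot returns a copy of the map taken under the read lock.
// Callers can range over the copy while other goroutines call Set or Delete.
// The bulker values are still shared pointers; only the map is copied.
func (r *BulkerRegistry) Snapshot() map[string]*RemoteBulker {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]*RemoteBulker, len(r.bulkers))
	for name, b := range r.bulkers {
		out[name] = b
	}
	return out
}
```

A health reporter that ranges over `Snapshot()` never touches the live map, at the cost of possibly reporting on an output that was removed a moment earlier.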

We may want to have remote bulkers start a heartbeat goroutine that uses the primary bulker to write their status directly; this would address both issues.


cmacknz · 2024-12-06T21:38:45Z

> We may want to have remote bulkers start a heartbeat goroutine that would use the primary bulker to write their status directly; This would address both issues.

This is also the first alternative I thought of when I first saw what the code was doing. I don't think we'd have to worry about the number of goroutines, because there aren't going to be 1000s of remote outputs unless there is some crazy bug somewhere.

michel-laterman added the enhancement and Team:Elastic-Agent-Control-Plane labels Dec 6, 2024

michel-laterman mentioned this issue Dec 6, 2024

Remove race condition when accessing remote bulker map #4171

Merged

3 tasks
