Port ErrorMonitor class from ruby connectors #2671

Open
wants to merge 19 commits into main from artem/add-error-monitor
Conversation

@artem-shelkovnikov (Member) commented Jul 2, 2024

This PR ports the ErrorMonitor class from https://github.com/elastic/connectors-ruby/blob/main/lib/utility/error_monitor.rb.

The ErrorMonitor class adds transient error handling: it logs and ignores errors until a certain threshold is reached.

For example, the default error monitor can be used from any class implementing BaseDataSource with the following code:

async def fetch_some_info(self, id):
    with self.with_error_monitoring():
        yield self.client.fetch_some_not_critical_entity(id)

An alternative usage pattern:

try:
    entity = self.client.query(id=id)
    self.error_monitor.track_success()
except Exception as ex:
    self.error_monitor.track_error(ex)

What it will do:

  • Log the error
  • Ignore it UNLESS too many errors have happened.

What does "too many errors" mean?

  1. Too many errors happened over the whole sync - the default limit is 1000
  2. Too many errors happened consecutively within with self.with_error_monitoring() blocks - the default limit is 10
  3. The error rate within a window of the last X with self.with_error_monitoring() blocks is too high - the default rate is 15%
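
For reference, here is a minimal sketch of how these three thresholds could be enforced. The class name, parameter names and window handling are illustrative assumptions for this description, not the actual ErrorMonitor implementation in the PR:

class TooManyErrors(Exception):
    pass


class ErrorMonitorSketch:
    def __init__(
        self,
        max_total_errors=1000,      # threshold 1: errors over the whole sync
        max_consecutive_errors=10,  # threshold 2: errors in a row
        max_error_rate=0.15,        # threshold 3: error rate within a window
        window_size=100,            # hypothetical number of recent operations
    ):
        self.max_total_errors = max_total_errors
        self.max_consecutive_errors = max_consecutive_errors
        self.max_error_rate = max_error_rate
        self.window_size = window_size
        self.total_errors = 0
        self.consecutive_errors = 0
        self.window = []  # True for success, False for error
        self.last_error = None

    def track_success(self):
        self.consecutive_errors = 0
        self._record(True)

    def track_error(self, error):
        self.total_errors += 1
        self.consecutive_errors += 1
        self.last_error = error
        self._record(False)
        self._raise_if_necessary()

    def _record(self, ok):
        self.window.append(ok)
        if len(self.window) > self.window_size:
            self.window.pop(0)

    def _raise_if_necessary(self):
        if self.total_errors > self.max_total_errors:
            raise TooManyErrors(f"Too many errors in total, last error: {self.last_error}")
        if self.consecutive_errors > self.max_consecutive_errors:
            raise TooManyErrors(f"Too many consecutive errors, last error: {self.last_error}")
        if len(self.window) >= self.window_size:
            error_rate = self.window.count(False) / len(self.window)
            if error_rate > self.max_error_rate:
                raise TooManyErrors(f"Error rate is too high, last error: {self.last_error}")

In this sketch the raised exception carries the last underlying error, which lines up with the Kibana behaviour described below (the termination reason is shown together with the last error).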

This PR already makes use of ErrorMonitor for generic framework operations.

One error monitor is created and passed to the Connector, Sink and Extractor. The Sink and Extractor already track errors; connectors can track errors on a per-connector basis.

The Sink uses the error monitor to track errors while ingesting data: each bulk operation is checked, each document that was not ingested into Elasticsearch is tracked as an "error", and each successfully ingested document is tracked as a "success". In practice, with default settings, this means for example that if 100 documents are sent to Elasticsearch and 16 of them fail, the sync will fail.
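
As a rough sketch of what per-document tracking from a bulk response can look like (the function and the way the response is walked are illustrative, not the actual sink.py code; the "items", "status" and "error" fields follow the Elasticsearch bulk API):

def track_bulk_results(bulk_response, error_monitor):
    # Walk the Elasticsearch bulk API response and feed each per-document
    # result into the shared error monitor (illustrative sketch only).
    for item in bulk_response.get("items", []):
        # each item is keyed by the operation type, e.g. {"index": {...}}
        result = next(iter(item.values()))
        if "error" in result or result.get("status", 200) >= 400:
            error_monitor.track_error(Exception(str(result.get("error"))))
        else:
            error_monitor.track_success()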

The same happens for downloads: if a download succeeds, we track it as a "success" in the Error Monitor; if it fails, it's a "failure" and the Error Monitor will skip it as a transient error unless too many errors have happened.

If the sync is terminated for this reason, the Kibana UI will show the reason (too many errors in total, too many consecutive errors, or error rate too high) along with the last error that happened.

So, as an example, if during a sync 200 documents fail to be downloaded and 801 documents fail to be ingested into Elasticsearch, the sync will fail: 1001 errors in total exceeds the default limit of 1000.

@seanstory previously approved these changes Jul 2, 2024

@seanstory (Member) left a comment:


Love it.
I'm fine to have the configuration be a follow-up. Just as long as it isn't used before it's configurable.

connectors/utils.py (resolved review comment)
@seanstory (Member) left a comment:

A few little nits, but largely looks good!

config.yml.example (3 resolved review comments)
Comment on lines +484 to +485
except ForceCanceledError:
    raise

Member:

Should we also re-raise asyncio's CancelledError?

Member Author:

I'm not 100% sure it's needed. I'll double-check and add it if it is.

connectors/es/sink.py (2 resolved review comments)
@@ -767,12 +780,14 @@ async def download_and_extract_file(
doc = await self.handle_file_content_extraction(
    doc, source_filename, temp_filename
)
self.error_monitor.track_success()

Member:

Won't this end up double-counting, since sink.py also uses the monitor for the lazy downloads?

Member Author:

True, will fix!

Member Author:

I don't really know what to do with that; we have several potential points of failure/success:

Extraction:

  • File was successfully downloaded
  • File was not successfully downloaded

Ingestion:

  • File was successfully ingested
  • File was not successfully ingested

It's possible to have all the permutations!

For example, the file was not downloaded successfully, but we still choose to upload its metadata to Elasticsearch and everything works out - is that a failure?

I'll open this up to Slack

Comment on lines 590 to 596
def with_error_monitoring(self):
    try:
        yield
        self.error_monitor.track_success()
    except Exception as ex:
        self._logger.error(ex)
        self.error_monitor.track_error(ex)

Member:

Something we had in ent-search was the concept of "fatal exceptions". See: https://github.com/elastic/ent-search/blob/a1ddc884ad1543f61a91da7b9511a01090adcac5/connectors/lib/connectors/content_sources/base/extractor.rb#L132-L139

It might be worth adding that here too, for things like asyncio's CancelledError, OOM errors, etc., to make sure we fail fast when necessary.

Member Author:

Yeah, good call. An OOM error can't be caught, but stuff like CancelledError plus a custom fatal exception class should help.
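
A minimal sketch of what that could look like, with a hypothetical FatalError class. Note that on Python 3.8+ asyncio.CancelledError derives from BaseException, so the existing except Exception would not swallow it anyway, but an explicit clause documents the intent:

import asyncio
from contextlib import contextmanager


class FatalError(Exception):
    """Hypothetical marker for errors that must never be swallowed."""


class BaseDataSourceSketch:
    # Sketch of the context manager from the diff above, extended to re-raise
    # cancellation and fatal errors instead of tracking them.
    @contextmanager
    def with_error_monitoring(self):
        try:
            yield
            self.error_monitor.track_success()
        except (asyncio.CancelledError, FatalError):
            # fail fast: cancellation and fatal errors are re-raised, not tracked
            raise
        except Exception as ex:
            self._logger.error(ex)
            self.error_monitor.track_error(ex)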

Member Author:

Okay, since we don't have code that uses this, I'd rather address it later once we actually use this code somewhere; for now it doesn't make any difference.

I'd also like to separately add handlers for terminal exceptions: for example, our download function and some others catch Exception, which can catch and "handle" cancellation errors in a weird way.

@seanstory (Member):

Looks like this may need a rebase - I think you've pulled in other commits you may not have meant to.

@artem-shelkovnikov force-pushed the artem/add-error-monitor branch from 4a92c61 to 7dba43e on July 24, 2024 at 10:05
@seanstory (Member):

Related: #1582 (maybe even "fixes"?)
