
Integrate DIAMetrics principles into a benchmarking suite for brackit #39

AlvinKuruvilla opened this issue May 24, 2022 · 0 comments
Background on DIAMetrics

DIAMetrics is an end-to-end benchmarking and performance framework for query engines developed by Google.

Components

Note that there are more details than mentioned here; this is only an overview. If we need to add details about more parts, we can do that further down the line.

Workload Extractor:

According to the paper, this component extracts a "representative workload" from a live production workload. "DIAMetrics employs a workload extractor and summarizer, which is a feature-based way to ‘mine’ the query logs of a customer and extract a subset of queries that adequately represent the workload of the customer."
For our current purposes, the best way to utilize a component like this is probably to pinpoint a set of heavy workloads, keep a list of them, and run just those for the time being. To this end, I am working on a PR that will hopefully bring in more XQuery files for us to run against from this repository. I will update this issue with the PR number so that we can keep track of everything.
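As a starting point, the feature-based "mining" idea can be sketched very simply: map each logged query to the set of language features it exercises, then greedily pick the smallest subset of queries that covers every feature seen in the log. This is only a sketch under my own assumptions; the feature list and the `extract_representative` helper below are hypothetical, not anything from the DIAMetrics paper or from brackit.

```python
# Hypothetical feature set: which XQuery constructs a query uses.
# A real extractor would use a much richer feature space (the paper
# mentions feature-based summarization but we pick features ourselves).
FEATURES = ("for", "let", "where", "order by", "group by", "//")

def features(query: str) -> frozenset:
    """Map a query to the set of features it exercises."""
    return frozenset(f for f in FEATURES if f in query)

def extract_representative(log: list[str]) -> list[str]:
    """Greedy set cover: pick a few queries whose combined feature
    sets cover every feature observed in the whole query log."""
    if not log:
        return []
    uncovered = set().union(*(features(q) for q in log))
    chosen = []
    while uncovered:
        best = max(log, key=lambda q: len(features(q) & uncovered))
        gain = features(best) & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```

The greedy choice is just the classic set-cover heuristic; for our purposes it gives a small, feature-diverse list of queries we can rerun repeatedly.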

Data and Query Scrambler

This component aims to help protect sensitive data and create variations of the representative sets to prevent sensitive data leakage. The paper lists off a few ways that they achieve this, but for the time being, we can put less emphasis on this part since we will use this internally for the moment.
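Even though we are deprioritizing this part, the core idea is easy to illustrate: rewrite the constants in a query so its shape (and therefore its plan and cost profile) is preserved while the possibly sensitive literal values are not. The `scramble_literals` helper below is a hypothetical sketch of mine, not the paper's mechanism; a real scrambler would also preserve types and value distributions.

```python
import re

def scramble_literals(query: str, seed: int = 0) -> str:
    """Replace string and numeric literals with generated placeholders.
    The query's structure is kept intact so it still exercises the same
    operators, but its (possibly sensitive) constants are removed."""
    counter = iter(range(seed, seed + 10_000))
    # Replace double-quoted string literals first...
    out = re.sub(r'"[^"]*"', lambda m: f'"v{next(counter)}"', query)
    # ...then bare integer literals.
    out = re.sub(r'\b\d+\b', lambda m: str(next(counter)), out)
    return out
```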

Workload Runner

According to the paper, this component "allows users to specify various combinations of workloads and systems to be benchmarked. For instance, we may want to run TPC-H on various query engines over various storage formats to see which storage format is the best option for which engine." The runner can either schedule runs on specific engines or spin up and manage (including cleanup and shutdown) entire engine instances for the runs.
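The "various combinations" part is essentially a cross product of workloads, engines, and storage formats, with timings collected per cell. A minimal sketch, assuming a caller-supplied `execute` callback that actually runs a workload on a given engine and format (that callback, and the `run_matrix` name, are my own placeholders, not anything from the paper or brackit):

```python
import itertools
import time

def run_matrix(workloads, engines, formats, execute):
    """Run every (workload, engine, format) combination and record
    wall-clock time. `execute(workload, engine, fmt)` is assumed to
    perform one benchmark run; setup/teardown of engine instances
    would live inside it in a fuller implementation."""
    results = {}
    for w, e, f in itertools.product(workloads, engines, formats):
        start = time.perf_counter()
        execute(w, e, f)
        results[(w, e, f)] = time.perf_counter() - start
    return results
```

Keeping the execution callback separate from the scheduling loop mirrors the paper's split between specifying combinations and managing engine instances.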

Monitoring

There are two parts to this:

  1. Visualization Framework - which brings up dashboards
  2. Alerting Framework - which compares workload performance to historical data and alerts when there are concerns
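For the alerting side, one simple way to compare a run against historical data is a mean-plus-k-standard-deviations threshold. The paper does not spell out a thresholding policy, so the rule below is an assumption of mine, just to show the shape of the check:

```python
import statistics

def check_regression(history, latest, k=3.0):
    """Flag `latest` (e.g. a query's latency) as a regression if it
    exceeds the historical mean by more than k standard deviations.
    The k-sigma rule is an assumed policy, not the paper's."""
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return latest > mean + k * stdev
```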

TODO (more to come as we get further along)

  • Merge in more XQuery files from xquerl
  • Figure out workloads that do not perform well and add them to brackit
  • Extract representative workloads somehow