Determine how release statistics should be stored #30

preaction · 2018-06-20T16:44:20Z

Presently, the per-release summary statistics are stored in two tables: release_data and release_summary. These two tables have the exact same schema, but slightly different uses:

- The release_data table stores one row per test report. One of the pass, fail, na, unknown columns will have a 1 in it.
- The release_summary table stores one row per distribution version. The pass, fail, na, and unknown columns will have the count of each test report grade.

In essence, each release_summary row is the sum of all the related release_data rows. This data is also technically a duplication of the cpanstats table, which is itself a duplication of the test_report table with data extracted from the JSON.
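To make the relationship between the two tables concrete, here is a minimal sketch in plain Python rather than SQL. Only the pass, fail, na, and unknown columns come from the description above; the `dist` and `version` keys, the helper names, and the sample distribution are assumptions made for illustration.

```python
from collections import Counter

GRADES = ("pass", "fail", "na", "unknown")


def release_data_row(dist, version, grade):
    """One row per test report: exactly one grade column holds a 1."""
    row = {"dist": dist, "version": version}
    row.update({g: int(g == grade) for g in GRADES})
    return row


def summarize(rows):
    """One row per distribution version: each grade column is a count."""
    totals = {}
    for row in rows:
        counts = totals.setdefault((row["dist"], row["version"]), Counter())
        for g in GRADES:
            counts[g] += row[g]
    return [
        {"dist": d, "version": v, **{g: c[g] for g in GRADES}}
        for (d, v), c in sorted(totals.items())
    ]


# Hypothetical reports for a made-up distribution.
reports = [
    release_data_row("Foo-Bar", "1.0", "pass"),
    release_data_row("Foo-Bar", "1.0", "pass"),
    release_data_row("Foo-Bar", "1.0", "fail"),
    release_data_row("Foo-Bar", "1.1", "na"),
]
summary = summarize(reports)
```

Because the summary is fully derivable from the per-report rows, storing both is a caching decision rather than a data-model requirement.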

Now that we have a dedicated database server with a few more CPU cycles than we had previously, we can look at how we store this data: Do we need the intermediate state of the release_data table, or can we just store the release_summary? Or, should we avoid the further step of summing the values and storing them in release_summary and just keep release_data? Or can we get rid of these tables entirely and just build this data on-the-fly from cpanstats?

barbie · 2018-06-20T21:23:23Z

Hi,

The only reason these tables existed was for caching purposes, as querying the old db for this info could take down the site ... especially for ADAMK! The colour bar on the site (green-amber-red = pass-na-fail) used the release_summary to build the cache (summary) to avoid expensive calls for each page load.

I think the release_summary was also used by MetaCPAN, to generate their view of CPAN Testers for each distro. See release-summary.cgi

Cheers,
Barbie.

preaction · 2018-06-20T21:44:35Z

Yep, presently the release summary APIs use the release_summary table directly, and that's still how MetaCPAN is getting their data. Even if the underlying data storage is changed, the APIs must remain the same.

But, we may not need to generate and store the derived data anymore. I find it highly unlikely, for the same reasons you mentioned, but it might be possible to do all of this on-the-fly.

But, if it ends up that we do need to generate and store the derived data, we might not need both steps to be stored. It might be possible to drop release_data and keep release_summary. When new reports come in, they increment the correct value in the release_summary table. I'm not sure I like this idea, because it's a lot more work for the database to validate that the summary data is correct.
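A rough sketch of the increment-on-arrival idea, using Python's built-in `sqlite3` as a stand-in for the MySQL server. The simplified schema (`dist`, `version`, and the four grade columns) and the `record_report` helper are assumptions for illustration, not the project's actual code.

```python
import sqlite3

GRADES = ("pass", "fail", "na", "unknown")


def connect():
    """Create an in-memory database with a simplified release_summary table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        'CREATE TABLE release_summary ('
        ' dist TEXT NOT NULL, version TEXT NOT NULL,'
        ' "pass" INTEGER NOT NULL DEFAULT 0,'
        ' "fail" INTEGER NOT NULL DEFAULT 0,'
        ' "na" INTEGER NOT NULL DEFAULT 0,'
        ' "unknown" INTEGER NOT NULL DEFAULT 0,'
        ' PRIMARY KEY (dist, version))'
    )
    return db


def record_report(db, dist, version, grade):
    """Bump the counter for one incoming report, creating the row if needed."""
    if grade not in GRADES:  # the column name is interpolated, so restrict it
        raise ValueError(f"unexpected grade: {grade!r}")
    with db:  # one transaction per report
        cur = db.execute(
            f'UPDATE release_summary SET "{grade}" = "{grade}" + 1'
            ' WHERE dist = ? AND version = ?',
            (dist, version),
        )
        if cur.rowcount == 0:
            db.execute(
                f'INSERT INTO release_summary (dist, version, "{grade}")'
                ' VALUES (?, ?, 1)',
                (dist, version),
            )


db = connect()
for grade in ("pass", "pass", "fail"):
    record_report(db, "Foo-Bar", "1.0", grade)
row = db.execute(
    'SELECT "pass", "fail", "na", "unknown" FROM release_summary'
).fetchone()
```

The drawback is visible here: once only counters are stored, the only way to check them is to recount every report for that release from the source table.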

More likely, the release summary data could be generated on-the-fly from the release_data table using a bunch of SUM(...) functions in MySQL. It's easy to validate release_data against cpanstats: Two simple queries (one 1:1 LEFT JOIN, one 1:1 RIGHT JOIN) can find which records are incorrect.
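Both halves of that idea can be sketched with Python's built-in `sqlite3` standing in for MySQL. The column layout is an assumption: a shared report `id` as the join key, `dist` and `version`, and a `state` column on cpanstats. The RIGHT JOIN is written as a second LEFT JOIN with the tables swapped, since older SQLite versions lack RIGHT JOIN; the result is the same.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript('''
CREATE TABLE cpanstats (
    id INTEGER PRIMARY KEY, dist TEXT, version TEXT, state TEXT
);
CREATE TABLE release_data (
    id INTEGER PRIMARY KEY, dist TEXT, version TEXT,
    "pass" INTEGER, "fail" INTEGER, "na" INTEGER, "unknown" INTEGER
);
INSERT INTO cpanstats VALUES
    (1, 'Foo-Bar', '1.0', 'pass'),
    (2, 'Foo-Bar', '1.0', 'fail'),
    (3, 'Foo-Bar', '1.0', 'pass');
-- Report 3 is deliberately missing, and report 9 has no source row.
INSERT INTO release_data VALUES
    (1, 'Foo-Bar', '1.0', 1, 0, 0, 0),
    (2, 'Foo-Bar', '1.0', 0, 1, 0, 0),
    (9, 'Foo-Bar', '1.0', 1, 0, 0, 0);
''')

# Summary built on the fly: one row per distribution version.
summary = db.execute('''
    SELECT dist, version, SUM("pass"), SUM("fail"), SUM("na"), SUM("unknown")
    FROM release_data
    GROUP BY dist, version
''').fetchall()

# Reports in cpanstats that never made it into release_data.
missing = db.execute('''
    SELECT s.id FROM cpanstats s
    LEFT JOIN release_data r ON r.id = s.id
    WHERE r.id IS NULL
''').fetchall()

# Rows in release_data with no matching cpanstats report.
orphaned = db.execute('''
    SELECT r.id FROM release_data r
    LEFT JOIN cpanstats s ON s.id = r.id
    WHERE s.id IS NULL
''').fetchall()
```

Whether the aggregate is fast enough to run per request on the real data set is exactly the open question in this issue; the sketch only shows that the queries themselves are small.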

I'm not confident that any improvements can be made here, but it's something we can look into. The smaller the schema we have, the easier it will be to start deriving all this data for other languages (like Perl 6). Also, if we can derive this data easily, we can offer more query options from the API side.
