improve couchbase queries for selected subset of stations #822
Replies: 15 comments 3 replies
-
Just starting to look at CouchBase and try to understand the data model, so please pardon me if these questions/opinions/ideas seem irrelevant/trivial ... I kind of lean towards a usage-driven schema design rather than domain driven schema design,. What I mean by that is design schema based on what queries our applications are going to use, rather than a domain (meteorological) understanding of the data structure. Pretty much a de-normalized schema. This definitely means more duplicated data in the DB, but generally results in better application performance, of-course at the cost of more ingest overhead. Noticed that we are only using 1 bucket named mdata. More later ... Let the discussion begin :) |
Beta Was this translation helpful? Give feedback.
-
We had a development meeting on Friday October 7th and I want to recap the meeting, and various communications that have happened since then here.
[3:32](https://oar-gsl.slack.com/archives/C03SARCF6AH/p1665178325185829)
Need to pull everything because of subrepos, but I’m glad it all works!
|
Beta Was this translation helpful? Give feedback.
-
I did work recently (while waiting for my mdatatest bucket to duplicate and made some interesting discoveries which I documented in Molly's google document on the subject here.... https://docs.google.com/document/d/1XzCJrFlDerKTyHD1_snyrLPGsxKfN5fORai0-MtbxHM/edit# |
Beta Was this translation helpful? Give feedback.
-
The bottom line was recorded in this slack verification-dev communication....
|
Beta Was this translation helpful? Give feedback.
-
Randy, thanks for providing this summary.
…On Mon, Oct 10, 2022 at 2:15 PM randytpierce ***@***.***> wrote:
We had a development meeting on Friday October 7th and I want to recap the
meeting, and various communications that have happened since then here.
Please refer to NOAA-GSL/MATS#755
<#755>.
Meeting notes:
1. We were able to resolve the aggregation problem and Molly has now
tested the solution, from slack channel verification-dev (Molly)...
[3:32](https://oar-gsl.slack.com/archives/C03SARCF6AH/p1665178325185829)
Need to pull everything because of subrepos, but I’m glad it all works!
1. We devised a three prong approach to resolve the speed of
substation plots. see issue #755
We now have a three prong approach to this problem. (Jeff and Gopa please correct any mis-statements about the approaches)
1) I (randy) will create a new bucket named 'mdatatest' that has a subset of our data in it with the mdata.data portion of each data object a map instead of an array. We will see if we can make the queries operate faster on that data structure. If that works we can change the ingest and modify all the data either in place or in a new bucket or in some other way that we figure out later.
2) Jeff is going to explore modifying our existing query by possibly breaking it up into subqueries.
3) Gopa is going to try to explore better indexing methods.
It makes sense to me to make these into new issues so the first one will be VxIngest issue#755
—
Reply to this email directly, view it on GitHub
<https://github.com/NOAA-GSL/VxIngest/discussions/165#discussioncomment-3842518>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AG6HZOQWO4AIXZTHGBRSGU3WCR2ODANCNFSM6AAAAAAQ43YYYI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
--
Bonny Strong
NOAA/GSL and CIRA
office: (719) 301-6195 or home: (970) 669-1188
|
Beta Was this translation helpful? Give feedback.
-
Sure,
I'm trying to capture all of this in one place and also trying to move us
toward using discussions. We had a very productive meeting, but ended up
with notes in a zillion places. Felt a little like herding cats, and I was
one of the cats.
randy
On Mon, Oct 10, 2022 at 3:21 PM bonnystrong ***@***.***>
wrote:
… Randy, thanks for providing this summary.
On Mon, Oct 10, 2022 at 2:15 PM randytpierce ***@***.***>
wrote:
> We had a development meeting on Friday October 7th and I want to recap
the
> meeting, and various communications that have happened since then here.
> Please refer to NOAA-GSL/MATS#755
> <#755>.
> Meeting notes:
>
> 1. We were able to resolve the aggregation problem and Molly has now
> tested the solution, from slack channel verification-dev (Molly)...
>
> [3:32](https://oar-gsl.slack.com/archives/C03SARCF6AH/p1665178325185829)
> Need to pull everything because of subrepos, but I’m glad it all works!
>
>
> 1. We devised a three prong approach to resolve the speed of
> substation plots. see issue #755
>
> We now have a three prong approach to this problem. (Jeff and Gopa
please correct any mis-statements about the approaches)
>
> 1) I (randy) will create a new bucket named 'mdatatest' that has a
subset of our data in it with the mdata.data portion of each data object a
map instead of an array. We will see if we can make the queries operate
faster on that data structure. If that works we can change the ingest and
modify all the data either in place or in a new bucket or in some other way
that we figure out later.
>
> 2) Jeff is going to explore modifying our existing query by possibly
breaking it up into subqueries.
>
> 3) Gopa is going to try to explore better indexing methods.
>
> It makes sense to me to make these into new issues so the first one will
be VxIngest issue#755
>
>
> —
> Reply to this email directly, view it on GitHub
> <
https://github.com/NOAA-GSL/VxIngest/discussions/165#discussioncomment-3842518
>,
> or unsubscribe
> <
https://github.com/notifications/unsubscribe-auth/AG6HZOQWO4AIXZTHGBRSGU3WCR2ODANCNFSM6AAAAAAQ43YYYI
>
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: ***@***.***>
>
--
Bonny Strong
NOAA/GSL and CIRA
office: (719) 301-6195 or home: (970) 669-1188
—
Reply to this email directly, view it on GitHub
<https://github.com/NOAA-GSL/VxIngest/discussions/165#discussioncomment-3842924>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGDVQPRLIQRYNZGHLNGUP5LWCSCF7ANCNFSM6AAAAAAQ43YYYI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
Randy Pierce
|
Beta Was this translation helpful? Give feedback.
-
Finally I think this might help us. Of course we still have to add the actual contingency calculations but I don't expect them to be time intensive.
success 2 min ago 34.7s 504 docs 566893 bytes This for a month of data. What it gives is an array of objects 'stats' for each fcstValidEpoch. In the array are objects each of which is keyed by a station name with a value which is an object that has the model ceiling value and the obs ceiling value. This is where the contingencies should be put. Please try it out. I demonstrated to myself that is is linear with regards performance because a week was somewhere around 10 seconds and a month is somewhere around 35 seconds, which seems rational to me.
|
Beta Was this translation helpful? Give feedback.
-
I got the hint for this from 'vsr1' here the trick is two FOR statements. We might want to discuss this. |
Beta Was this translation helpful? Give feedback.
-
The plan is still implying that the majority of time is being spent in the LET statement. |
Beta Was this translation helpful? Give feedback.
-
I think we have it. This query aggregates all contingency table elements by valid epoch. I'll work on aggregating by station next:
|
Beta Was this translation helpful? Give feedback.
-
You are never satisfied! ;)
…On Wed, Oct 12, 2022 at 10:04 AM Molly Smith ***@***.***> wrote:
Yay!!! If you want to test the full range of MATS plot types, can you also
produce versions that aggregate by forecast lead time and hour of the day?
—
Reply to this email directly, view it on GitHub
<https://github.com/NOAA-GSL/VxIngest/discussions/165#discussioncomment-3862072>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGDVQPQ3F4AJNRCCEA7GU63WC3OPRANCNFSM6AAAAAAQ43YYYI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
Randy Pierce
|
Beta Was this translation helpful? Give feedback.
-
I think it ias this... (previous one had missing *)
SELECT m.mfve,
ARRAY_SUM(stats[*].hit) as hits,
ARRAY_SUM(stats[*].miss) as misses,
ARRAY_SUM(stats[*].false_alarm) as fa,
ARRAY_SUM(stats[*].correct_negative) as cn
FROM (
SELECT trimData AS odata,
ofve
FROM mdata
LET ofve = mdata.fcstValidEpoch,
trimData = ARRAY d FOR d IN mdata.data WHEN d.name IN
['KEWR','KJFK','KJRB','KLDJ','KLGA','KNYC','KTEB'] END
WHERE mdata.type="DD"
AND mdata.docType='obs'
AND mdata.version="V01"
AND mdata.subset="METAR"
AND mdata.fcstValidEpoch BETWEEN 1661990400 AND 1661990400 + 3600 * 24 *
30) o,
(
SELECT trimData AS m0data,
mfve
FROM mdata
LET mfve = mdata.fcstValidEpoch,
trimData = ARRAY d FOR d IN mdata.data WHEN d.name IN
['KEWR','KJFK','KJRB','KLDJ','KLGA','KNYC','KTEB'] END
WHERE mdata.type="DD"
AND mdata.docType='model'
AND model ="HRRR_OPS"
AND fcstLen=6
AND mdata.version="V01"
AND mdata.subset="METAR"
AND mdata.fcstValidEpoch BETWEEN 1661990400 AND 1661990400 + 3600 * 24 *
30) m
LET stats = ARRAY (FIRST {'hit':CASE WHEN mv.Ceiling < 3000.0 AND
ov.Ceiling < 3000.0 THEN 1 ELSE 0 END,
'miss':CASE WHEN mv.Ceiling < 3000.0 AND NOT ov.Ceiling < 3000.0 THEN 1
ELSE 0 END,
'false_alarm':CASE WHEN NOT mv.Ceiling < 3000.0 AND ov.Ceiling < 3000.0
THEN 1 ELSE 0 END,
'correct_negative':CASE WHEN NOT mv.Ceiling < 3000.0 AND NOT ov.Ceiling <
3000.0 THEN 1 ELSE 0 END}
FOR ov IN o.odata WHEN ov.name = mv.name END) FOR mv IN m.m0data END
WHERE m.mfve=o.ofve
…On Wed, Oct 12, 2022 at 9:44 AM Jeff Hamilton ***@***.***> wrote:
I think we have it. This query aggregates all contingency table elements
by valid epoch. I'll work on aggregating by station next:
SELECT m.mfve,
ARRAY_SUM(stats[
*].hit) as hits, ARRAY_SUM(stats[*].miss) as misses,
ARRAY_SUM(stats[
*].false_alarm) as fa, ARRAY_SUM(stats[*].correct_negative) as cn
FROM (
SELECT trimData AS odata,
ofve
FROM mdata
LET ofve = mdata.fcstValidEpoch,
trimData = ARRAY d FOR d IN mdata.data WHEN d.name IN
['KEWR','KJFK','KJRB','KLDJ','KLGA','KNYC','KTEB'] END
WHERE mdata.type="DD"
AND mdata.docType='obs'
AND mdata.version="V01"
AND mdata.subset="METAR"
AND mdata.fcstValidEpoch BETWEEN 1661990400 AND 1661990400 + 3600 * 24 *
30) o,
(
SELECT trimData AS m0data,
mfve
FROM mdata
LET mfve = mdata.fcstValidEpoch,
trimData = ARRAY d FOR d IN mdata.data WHEN d.name IN
['KEWR','KJFK','KJRB','KLDJ','KLGA','KNYC','KTEB'] END
WHERE mdata.type="DD"
AND mdata.docType='model'
AND model ="HRRR_OPS"
AND fcstLen=6
AND mdata.version="V01"
AND mdata.subset="METAR"
AND mdata.fcstValidEpoch BETWEEN 1661990400 AND 1661990400 + 3600 * 24 *
30) m
LET stats = ARRAY (FIRST {'hit':CASE WHEN mv.Ceiling < 3000.0 AND
ov.Ceiling < 3000.0 THEN 1 ELSE 0 END,
'miss':CASE WHEN mv.Ceiling < 3000.0 AND NOT ov.Ceiling < 3000.0 THEN 1
ELSE 0 END,
'false_alarm':CASE WHEN NOT mv.Ceiling < 3000.0 AND ov.Ceiling < 3000.0
THEN 1 ELSE 0 END,
'correct_negative':CASE WHEN NOT mv.Ceiling < 3000.0 AND NOT ov.Ceiling <
3000.0 THEN 1 ELSE 0 END}
FOR ov IN o.odata WHEN ov.name = mv.name END) FOR mv IN m.m0data END
WHERE m.mfve=o.ofve
—
Reply to this email directly, view it on GitHub
<https://github.com/NOAA-GSL/VxIngest/discussions/165#discussioncomment-3861908>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGDVQPXW3BA2MCRRN3BSDALWC3MEHANCNFSM6AAAAAAQ43YYYI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
Randy Pierce
|
Beta Was this translation helpful? Give feedback.
-
Just kidding, of course. We will certainly do them all!
…On Wed, Oct 12, 2022 at 10:21 AM Molly Smith ***@***.***> wrote:
Well, good to make sure they all work! :)
—
Reply to this email directly, view it on GitHub
<https://github.com/NOAA-GSL/VxIngest/discussions/165#discussioncomment-3862201>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGDVQPVHE2JYJKO3DTSWZDDWC3QQJANCNFSM6AAAAAAQ43YYYI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
--
Randy Pierce
|
Beta Was this translation helpful? Give feedback.
-
Expected results for a station map query aggregated over the stations themselves over time (only two stations shown): [ |
Beta Was this translation helpful? Give feedback.
-
This is the old and slow map station query but it does return a result. |
Beta Was this translation helpful? Give feedback.
-
This discussion is for improving the couchbase queries that select for a specific subset of stations, especially for model obs pairs.
Beta Was this translation helpful? Give feedback.
All reactions