Stats updates #94

sbesson · 2020-08-28T15:09:40Z

Summary of changes

Recompute the stats for studies published between prod72 and prod86. This depends on the changes in "stats.py: fix calculation of Sets" (idr-utils#18).

/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod72 idr0065-camsund-crispri idr0067-king-yeastmeiosis -vv >> /tmp/prod72.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod73 idr0075-cabirol-honeybee -vv >> /tmp/prod73.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod80 idr0056-stojic-lncrnas idr0073-schaadt-immuneinfiltrates -vv >> /tmp/prod80.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod81 idr0064-goglia-erkdynamics idr0083-lamers-sarscov2/ > /tmp/prod81.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod82 idr0081-georgi-adenovirus -v > /tmp/prod82.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod84 idr0070-kerwin-hdbr idr0077-valuchova-flowerlightsheet idr0079-hartmann-lateralline > /tmp/prod84.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod85 idr0084-oudelaar-alphaglobin idr0086-miron-micrographs idr0087-paci-nuclearimport -vv >> /tmp/prod85.tsv
/opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod86 idr0048-abdeladim-chroms idr0085-walsh-mfhrem -vv >> /tmp/prod84.tsv

Most changes should be minor and only adjust the data size (and the conversion to TB), number of files, average file size and average image dimensions.

Manually adjust the raw data size and number of files for idr0043 using the numbers from "Add size/number of files for all the published and upcoming HPA runs" (idr0043-uhlen-humanproteinatlas#32).

Regenerate the release stats for prod73 to prod86 using the releases.py script introduced in "Add first version of script computing the aggregated stats for a release" (idr-utils#19).

venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod73 --release-date 2020-01-16 --db-size 423 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv 
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod80 --release-date 2020-03-03 --db-size 431 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod81 --release-date 2020-04-27 --db-size 366 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod82 --release-date 2020-05-19 --db-size 367  >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod83 --release-date 2020-06-15  --db-size 367  >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod84 --release-date 2020-06-30 --db-size 358  >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod85 --release-date 2020-07-22 --db-size 359 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod86 --release-date 2020-08-12 --db-size 385 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
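To illustrate what aggregating per-release stats from studies.tsv could look like, here is a minimal sketch. It is not the releases.py implementation: the column names `Study` and `Size`, the helper names, and the rule "include every row introduced at or before the release" are all assumptions made for this example (only the `Introduced` column is named in this conversation).

```python
import re
from typing import Dict, Iterable


def release_number(release: str) -> int:
    """Turn a label such as 'prod86' into 86 so releases compare numerically."""
    match = re.fullmatch(r"prod(\d+)", release)
    if match is None:
        raise ValueError(f"unexpected release label: {release!r}")
    return int(match.group(1))


def aggregate(rows: Iterable[Dict[str, str]], release: str) -> Dict[str, float]:
    """Count studies and sum bytes over rows introduced up to `release`.

    Hypothetical column names: 'Study', 'Introduced', 'Size' (bytes).
    """
    cutoff = release_number(release)
    selected = [r for r in rows if release_number(r["Introduced"]) <= cutoff]
    # A study can span several rows (one per experiment/screen),
    # so count the distinct 'idrNNNN' prefixes.
    studies = {r["Study"].split("-")[0] for r in selected}
    total_bytes = sum(int(r["Size"]) for r in selected)
    return {"studies": len(studies), "size_tb": total_bytes / 1e12}


# Sizes taken from the rows added to studies.tsv in this PR.
rows = [
    {"Study": "idr0086-miron-micrographs", "Introduced": "prod85", "Size": "4104873039"},
    {"Study": "idr0087-paci-nuclearimport", "Introduced": "prod85", "Size": "48485850230"},
    {"Study": "idr0048-abdeladim-chroms", "Introduced": "prod86", "Size": "129647463006"},
    {"Study": "idr0085-walsh-mfhrem", "Introduced": "prod86", "Size": "115773571855"},
]
print(aggregate(rows, "prod85"))
```

Comparing the numeric part of the release label avoids string-ordering surprises if release labels ever differ in digit count.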

Run the latest version of the stats.py script

dominikl · 2020-08-31T15:19:10Z

Looks good. There might be just one issue. I should have mentioned this on IDR/idr0043-uhlen-humanproteinatlas#32 but got confused by the IDR release numbers. If you used the command from that PR, then the figures for HPA run 08 (20200609-ftp) are already included, but that run is only going to be in the next IDR release, 0.8.7.

sbesson · 2020-08-31T19:13:59Z

Sorry @dominikl, I should have clarified my intent. This PR should adjust the size of idr0043 to match the current state of IDR, i.e. the raw data size for runs 1-7. I will open a follow-up PR to update releases.tsv between prod72 and prod86.

For the imminent prod87, these numbers will need to be readjusted to include the data from run 8, including the new number of images, planes, etc. and the new raw data metrics. This can be done either as part of this PR or as a follow-up.

pwalczysko · 2020-09-02T16:05:16Z

_data/studies.tsv

+idr0086-miron-micrographs	experimentD	prod85	1161	2	0			0	11	10546	0.004104873039	4104873039	34	120.73155997058824	1018 x 602 x 503 x 2 x 1
+idr0087-paci-nuclearimport	experimentA	prod85	1157	38	0			0	456	50976	0.04848585023	48485850230	1370	35.391131554744526	640 x 640 x 1 x 3 x 37
+idr0048-abdeladim-chroms	experimentA	prod86	1201	1	0			0	2	4479	0.129647463006	129647463006	127	1020.8461654015748	11034 x 9271 x 747 x 3 x 1
+idr0085-walsh-mfhrem	experimentA	prod86	1202	3	0			0	7	15206	0.115773571855	115773571855	25	4630.9428742	2076 x 1681 x 1160 x 2 x 1


I can see that these two lines were added at the end of the document, but should they not rather be inserted at the correct position to keep the rows in ascending order by study number?

sbesson · 2020-09-02

I don't think it is a requirement. My assumption is that this TSV file should grow as studies get released, so the natural order will rather be to have the Introduced column in ascending order.

Consumers of the TSV file like https://idr.openmicroscopy.org/about/studies.html should be able to do the filtering and sorting by their column of choice.
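As an illustration of consumer-side sorting, here is a minimal sketch that reorders rows by any column after reading the file, so the on-disk order does not matter. The header names `Study` and `Introduced` in the sample and the function name are assumptions for this example; the real header of studies.tsv may differ.

```python
import csv
import io
from typing import Dict, List


def sort_studies(tsv_text: str, column: str) -> List[Dict[str, str]]:
    """Parse tab-separated text with a header row and sort by one column.

    The sort is a plain string sort, which is enough for zero-padded
    study names such as 'idr0048-...'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted(reader, key=lambda row: row[column])


# Rows in the order they were appended in this PR (hypothetical header).
sample = (
    "Study\tIntroduced\n"
    "idr0086-miron-micrographs\tprod85\n"
    "idr0087-paci-nuclearimport\tprod85\n"
    "idr0048-abdeladim-chroms\tprod86\n"
    "idr0085-walsh-mfhrem\tprod86\n"
)
for row in sort_studies(sample, "Study"):
    print(row["Study"], row["Introduced"])
```

Because `sorted` is stable, sorting by `Introduced` keeps the file order within each release.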

pwalczysko · 2020-09-02T16:34:38Z

As for studies.tsv: when I followed the workflow above and inserted my changes into studies.tsv, I got the following discrepancies with the changes in this PR:

-idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        92.622284676108 92622284676108   3433139        26.978891526415913      3000 x 3000 x 1 x 3 x 1
+idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        105.931394470523998     105931394470524 3927370 27.62532601     3000 x 3000 x 1 x 3 x 1

The lower line is my change, the upper one is the change in this PR.
I just added up all the numbers from IDR/idr0043-uhlen-humanproteinatlas#32 and corrected the number of files, the size and the size in TB (3 columns).
Not sure how this should have been done.
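For anyone adjusting these columns by hand, the sketch below shows how the derived values appear to relate to the raw ones. It assumes decimal units (TB = bytes / 10^12, average file size in MB = bytes / files / 10^6), which is consistent with the rows quoted in this conversation; the function name is hypothetical and not part of stats.py.

```python
from typing import Tuple


def derived_columns(size_bytes: int, n_files: int) -> Tuple[float, float]:
    """Return (size in TB, average file size in MB), using decimal units."""
    size_tb = size_bytes / 1e12
    # Guard against studies with no files recorded yet.
    avg_mb = size_bytes / n_files / 1e6 if n_files else 0.0
    return size_tb, avg_mb


# idr0043 as recorded in this PR: 92622284676108 bytes in 3433139 files.
size_tb, avg_mb = derived_columns(92622284676108, 3433139)
print(f"{size_tb:.6f} TB, {avg_mb:.6f} MB per file")
```

If the file count and byte count change, both derived columns need to be recomputed together, otherwise the average file size goes stale.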

pwalczysko · 2020-09-02T16:41:58Z

Correction to my studies.tsv comment above (#94 (comment)):

When I take just runs 1-7 for the HPA numbers, as indicated in the comment above (#94 (comment)), the only diff with this PR is

-idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        92.622284676108 92622284676108   3433139        26.978891526415913      3000 x 3000 x 1 x 3 x 1
+idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        92.622284676107995      92622284676108  3433139 26.97889153     3000 x 3000 x 1 x 3 x 1

which amounts to a rounding error only, as far as I can see.

Edit: No, sorry, there is a discrepancy in the second number from the left, but this is because I did not change that one at all (not sure how to count that)

Edit 2: I have recomputed the average file size by dividing the size by the number of files, and now it really is just rounding errors.
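A check like this can be automated by comparing two rows field by field and allowing a small relative tolerance on numeric fields. The sketch below is an illustration only: the function name and the 1e-9 tolerance are assumptions, and the sample rows keep just a few of the columns from the diff above.

```python
import math


def rows_match(a: str, b: str, rel_tol: float = 1e-9) -> bool:
    """Compare two tab-separated rows, tolerating float rounding differences."""
    fields_a = a.rstrip("\n").split("\t")
    fields_b = b.rstrip("\n").split("\t")
    if len(fields_a) != len(fields_b):
        return False
    for x, y in zip(fields_a, fields_b):
        if x == y:
            continue
        try:
            if not math.isclose(float(x), float(y), rel_tol=rel_tol):
                return False
        except ValueError:
            # Non-numeric fields must match exactly.
            return False
    return True


# Size in TB, size in bytes, number of files, average file size (values from the diff).
pr_row = "\t".join(["idr0043-uhlen-humanproteinatlas", "92.622284676108",
                    "92622284676108", "3433139", "26.978891526415913"])
my_row = "\t".join(["idr0043-uhlen-humanproteinatlas", "92.622284676107995",
                    "92622284676108", "3433139", "26.97889153"])
print(rows_match(pr_row, my_row))
```

A genuine difference, such as the file count changing once run 8 is included, still fails the comparison because it is far outside the tolerance.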

pwalczysko · 2020-09-02T17:25:48Z

After running the script from IDR/idr-utils#19 on my studies.tsv file, removing the empty lines and overwriting the existing lines with the new block created by the script, I have a perfect match on releases.tsv with the diff in this PR.

I made 2 comments on the other PR IDR/idr-utils#19 (comment) and IDR/idr-utils#19 (comment)

sbesson · 2020-09-02T18:17:50Z

@pwalczysko so barring the rounding errors on HPA (which will be updated with prod87) and the RFEs for the releases.py script, which I will handle separately, any objections to merging this and continuing to improve the logic as new studies get added? Adding idr0082 would be a nice next step.

pwalczysko · 2020-09-02T18:36:44Z

objections to merging this and keep improving the logic as new studies get added?

No objections

Adding idr0082 would be a nice next step.

I will try tomorrow

pwalczysko · 2020-09-03T11:43:50Z

Re

Adding idr0082 would be a nice next step.

@sbesson I tried the following.

On idr-next, as the /uod/idr/metadata/idr0082... subfolder is not present (yet?), I attempted the following:

clone the most recent state of idr-utils into the home directory (~/idr-utils)
copy the bulk.yml from /uod/idr/metadata/ into /tmp
clone the idr0082... gitlab repo into /tmp

See below

cd /tmp
git clone ...  # clone the idr0082 repo from gitlab
scp /uod/idr/metadata/bulk.yml .
/opt/omero/server/venv3/bin/python ~/idr-utils/scripts/stats.py --release prod87 idr0082-pennycuick-lesions -vv >> /tmp/prod87-pw.tsv

This results in a smooth run of the script and the creation of the prod87-pw.tsv file.
But, as shown below, the TSV file does not contain the expected numbers.
Stopping here for now.

idr0082-pennycuick-lesions      experimentA     prod87  MISSING 0       0                       0       0       0       0       0
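A quick way to catch a row like this before it is committed is to flag rows that contain the `MISSING` placeholder or whose numeric fields are all zero. The sketch below is an assumption-laden illustration, not part of stats.py: the function names are hypothetical, and it assumes the first three whitespace-separated tokens are the study name, the container name and the release label.

```python
from typing import Optional


def _as_float(token: str) -> Optional[float]:
    """Return the token as a float, or None if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return None


def looks_unpopulated(row: str) -> bool:
    """Flag a stats row that has a MISSING placeholder or only zero values."""
    tokens = row.split()[3:]  # skip study, container and release label
    if "MISSING" in tokens:
        return True
    numbers = [n for n in map(_as_float, tokens) if n is not None]
    return bool(numbers) and all(n == 0 for n in numbers)


row = "idr0082-pennycuick-lesions\texperimentA\tprod87\tMISSING\t0\t0\t\t\t0\t0\t0\t0\t0"
print(looks_unpopulated(row))
```

Non-numeric tokens, such as the `x` separators in the image dimensions column, are simply ignored by the zero check.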

sbesson added 9 commits August 28, 2020 15:56

Update published stats for prod85

29acfd1

Run the latest version of the stats.py script

Update set numbers for idr0048 and idr0085

043e366

Update stats for prod84

7d992ef

Update stats for prod82

ac09842

Update raw numbers for idr0043 up to prod86

5ccd55c

Update stats for prod81

a81c511

Update stats for prod80

4c90f27

Update stats for prod73

766b488

Update stats for prod72

b02b633

sbesson added 2 commits August 31, 2020 22:26

Recompute release stats for prod73 -> prod85

214a42f

Add release stats for prod86

8a74332

sbesson mentioned this pull request Sep 1, 2020

Add first version of script computing the aggregated stats for a release IDR/idr-utils#19

Merged

pwalczysko reviewed Sep 2, 2020

sbesson merged commit 3e9d629 into IDR:master Sep 2, 2020

sbesson deleted the stats_updates branch September 3, 2020 07:16
