Stats updates #94

Merged
merged 11 commits into from
Sep 2, 2020

sbesson
Member

@sbesson sbesson commented Aug 28, 2020

Summary of changes

  • recomputed the stats for studies published between prod72 and prod86; this depends on the changes in "stats.py: fix calculation of Sets" (idr-utils#18)

    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod72 idr0065-camsund-crispri idr0067-king-yeastmeiosis -vv >> /tmp/prod72.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod73 idr0075-cabirol-honeybee -vv >> /tmp/prod73.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod80 idr0056-stojic-lncrnas idr0073-schaadt-immuneinfiltrates -vv >> /tmp/prod80.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod81 idr0064-goglia-erkdynamics idr0083-lamers-sarscov2/ > /tmp/prod81.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod82 idr0081-georgi-adenovirus -v > /tmp/prod82.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod84 idr0070-kerwin-hdbr idr0077-valuchova-flowerlightsheet idr0079-hartmann-lateralline > /tmp/prod84.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod85 idr0084-oudelaar-alphaglobin idr0086-miron-micrographs idr0087-paci-nuclearimport -vv >> /tmp/prod85.tsv
    /opt/omero/server/venv3/bin/python idr-utils/scripts/stats.py --release prod86 idr0048-abdeladim-chroms idr0085-walsh-mfhrem -vv >> /tmp/prod84.tsv
    

    Most changes should be minor and only adjust the data size (and the conversion to TB), number of files, average file size and average image dimensions.

  • manually adjusted the raw data size and number of files for idr0043 using the numbers from "Add size/number of files for all the published and upcoming HPA runs" (idr0043-uhlen-humanproteinatlas#32)

  • regenerated the release stats for prod73 to prod86 using the releases.py script introduced in "Add first version of script computing the aggregated stats for a release" (idr-utils#19)

    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod73 --release-date 2020-01-16 --db-size 423 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv 
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod80 --release-date 2020-03-03 --db-size 431 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod81 --release-date 2020-04-27 --db-size 366 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod82 --release-date 2020-05-19 --db-size 367  >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod83 --release-date 2020-06-15  --db-size 367  >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod84 --release-date 2020-06-30 --db-size 358  >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod85 --release-date 2020-07-22 --db-size 359 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    venv/bin/python scripts/releases.py  /opt/IDR/idr.openmicroscopy.org/_data/studies.tsv --release prod86 --release-date 2020-08-12 --db-size 385 >> /opt/IDR/idr.openmicroscopy.org/_data/releases.tsv
    

@dominikl
Member

Looks good. There might be only one issue. I should have remarked that on IDR/idr0043-uhlen-humanproteinatlas#32 but got myself confused with the IDR release numbers. If you used the command of that PR, then the figures for HPA run 08 are already included (20200609-ftp), but that's going to be in the next IDR release 0.8.7.

@sbesson
Member Author

sbesson commented Aug 31, 2020

Sorry @dominikl, I should have clarified my intent. This should be adjusting the size of idr0043 to match the current state of IDR i.e. the raw data size for runs 1-7. I will open a follow-up PR to update releases.tsv between prod72 and prod86.

For the imminent prod87, these numbers will need to be readjusted to include the data from run 8 including the new number of images/planes/etc and the new raw data metrics. This can be either done as part of this PR or as a follow-up.

idr0086-miron-micrographs experimentD prod85 1161 2 0 0 11 10546 0.004104873039 4104873039 34 120.73155997058824 1018 x 602 x 503 x 2 x 1
idr0087-paci-nuclearimport experimentA prod85 1157 38 0 0 456 50976 0.04848585023 48485850230 1370 35.391131554744526 640 x 640 x 1 x 3 x 37
idr0048-abdeladim-chroms experimentA prod86 1201 1 0 0 2 4479 0.129647463006 129647463006 127 1020.8461654015748 11034 x 9271 x 747 x 3 x 1
idr0085-walsh-mfhrem experimentA prod86 1202 3 0 0 7 15206 0.115773571855 115773571855 25 4630.9428742 2076 x 1681 x 1160 x 2 x 1
Contributor

I can see that these two lines were added at the end of the document, but should they not rather be inserted at the correct position to keep the file in ascending order by study number?

Member Author

I don't think it is a requirement. My assumption is that this TSV file should grow as studies get released, so the natural order would rather be to keep the Introduced column in ascending order.

Consumers of the TSV file like https://idr.openmicroscopy.org/about/studies.html should be able to do the filtering and sorting by their column of choice.
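
To illustrate the point about consumers sorting on their side, here is a minimal sketch in Python. It is not how the studies page is implemented. Only the `Introduced` column name comes from this thread; the `Study` column name, the two-column layout, and the `sort_rows` helper are assumptions made for the example, with values taken from the four rows quoted above.

```python
import csv
import io

# Hypothetical two-column excerpt. "Introduced" is the column named in this
# thread; "Study" is an assumed header used only for this example.
SAMPLE = (
    "Study\tIntroduced\n"
    "idr0086-miron-micrographs\tprod85\n"
    "idr0087-paci-nuclearimport\tprod85\n"
    "idr0048-abdeladim-chroms\tprod86\n"
    "idr0085-walsh-mfhrem\tprod86\n"
)


def sort_rows(tsv_text, column):
    """Parse TSV text and return its rows sorted by the given column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # sorted() is stable, so rows with equal keys keep their file order.
    return sorted(reader, key=lambda row: row[column])


# The file stays in release order; a consumer can still list it by study.
for row in sort_rows(SAMPLE, "Study"):
    print(row["Study"], row["Introduced"])
```

Note that this compares the values as plain strings, which is only safe while all release names have the same number of digits.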

@pwalczysko
Contributor

As for studies.tsv: when I followed the workflow above and inserted my changes into studies.tsv, I got the following discrepancies with the changes in this PR

-idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        92.622284676108 92622284676108   3433139        26.978891526415913      3000 x 3000 x 1 x 3 x 1
+idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        105.931394470523998     105931394470524 3927370 27.62532601     3000 x 3000 x 1 x 3 x 1

The lower line being my changes, the upper one the changes in this PR.
I just added all the numbers from IDR/idr0043-uhlen-humanproteinatlas#32 and corrected the number of Files and the size and size in TB (3 columns).
Not sure how this should have been done.

@pwalczysko
Contributor

pwalczysko commented Sep 2, 2020

Correction on the studies.tsv comment #94 (comment) above:

When I take just runs 1-7 for the HPA numbers, as indicated in the comment #94 (comment) above, I have as a diff with this PR only

-idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        92.622284676108 92622284676108   3433139        26.978891526415913      3000 x 3000 x 1 x 3 x 1
+idr0043-uhlen-humanproteinatlas        experimentA     prod52  501     7000    0       7000            0       3432371 10297113        92.622284676107995      92622284676108  3433139 26.97889153     3000 x 3000 x 1 x 3 x 1

which amounts to a rounding error only, as far as I can see.

Edit: No, sorry, there is a discrepancy in the second number from the left, but this is because I did not change that one at all (not sure how to count that)

Edit 2: I have recomputed the average file size by dividing the size by the number of files, and now it is really just rounding errors.
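
The check described in Edit 2 can be sketched as follows. This is a minimal illustration, not the logic of stats.py: the decimal unit conventions (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) are inferred from the rows in this thread, and the `same_up_to_rounding` helper and its tolerance are assumptions chosen for the example.

```python
import math

# Values copied from the idr0043 row in this PR's studies.tsv diff.
size_bytes = 92622284676108
n_files = 3433139

# Assumed conventions (inferred from the rows, not taken from stats.py):
# decimal units, i.e. 1 TB = 1e12 bytes and 1 MB = 1e6 bytes.
size_tb = size_bytes / 1e12
avg_file_size_mb = size_bytes / n_files / 1e6


def same_up_to_rounding(a, b, rel_tol=1e-9):
    """Return True when two values differ only by a small relative error."""
    return math.isclose(a, b, rel_tol=rel_tol)


# Compare the recomputed values with the figures on the "+" line of the diff.
print(same_up_to_rounding(size_tb, 92.622284676107995))
print(same_up_to_rounding(avg_file_size_mb, 26.97889153))
```

Comparing with a relative tolerance rather than exact equality avoids flagging values that differ only in how many digits were written to the file.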

@pwalczysko
Contributor

After running the script from IDR/idr-utils#19 on my studies.tsv file, removing the empty lines, and overwriting the existing lines with the new block created by the script, I have a perfect match on releases.tsv with the diff in this PR.

I made 2 comments on the other PR IDR/idr-utils#19 (comment) and IDR/idr-utils#19 (comment)

@sbesson
Member Author

sbesson commented Sep 2, 2020

@pwalczysko so barring the rounding errors on HPA (which will be updated with prod87) and the RFEs for the releases.py script, which I will handle separately, any objections to merging this and keep improving the logic as new studies get added? Adding idr0082 would be a nice next step.

@pwalczysko
Contributor

objections to merging this and keep improving the logic as new studies get added?

No objections

Adding idr0082 would be a nice next step.

I will try tomorrow

@sbesson sbesson merged commit 3e9d629 into IDR:master Sep 2, 2020
@sbesson sbesson deleted the stats_updates branch September 3, 2020 07:16
@pwalczysko
Contributor

Re

Adding idr0082 would be a nice next step.

@sbesson I tried the following:

On idr-next, since the /uod/idr/metadata/idr0082... subfolder is not present (yet?), I attempted these steps:

  • clone the most recent state of idr-utils into the home directory, ~/idr-utils
  • copy the bulk.yml from /uod/idr/metadata/ into /tmp
  • clone the idr0082... gitlab repo into /tmp

See below

cd /tmp
git clone ...  # clone the idr0082 repo from gitlab
scp /uod/idr/metadata/bulk.yml .
/opt/omero/server/venv3/bin/python ~/idr-utils/scripts/stats.py --release prod87 idr0082-pennycuick-lesions -vv >> /tmp/prod87-pw.tsv

This results in a smooth run of the script and the creation of the prod87-pw.tsv file.
But, as shown below, the TSV file does not contain the expected numbers.
Stopping here for now.

idr0082-pennycuick-lesions      experimentA     prod87  MISSING 0       0                       0       0       0       0       0
