Part of this is the split between "Plates" and "Datasets". I also often have to figure it out by context. Happy to have the output format from the script be made more explicit.
What is Bytes, does that have to be used for Size (TB) and Size?
Bytes from stats.py was my first attempt at a size via SQL. It was pointed out that 1) my query was wrong and 2) it doesn't match what fs usage was providing. Best option is likely to remove it.
What about this size?
Size in TB is just an easier to read version of Size
And is the 25 files the # of Files?
Yes.
And how to get Targets?
This is a difficult one, and it likely hasn't been maintained or even defined since Eleanor left.
But where to get Files (Million) from?
Again, this is just an easier to read version of Files.
And how to get DB Size (GB)?
I think we have some diversity here. I'd suggest select pg_database_size('idr') is the basis for most of the values.
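Since `pg_database_size` returns a size in bytes, the DB Size (GB) column needs a conversion step. A minimal sketch of that step, assuming decimal gigabytes by default (the thread does not say which unit convention the column uses); the function name `bytes_to_gb` is hypothetical:

```python
def bytes_to_gb(size_bytes: int, binary: bool = False) -> float:
    """Convert a byte count (e.g. the result of
    select pg_database_size('idr')) to gigabytes.

    Assumption: decimal GB (10**9 bytes) unless binary=True,
    in which case GiB (2**30 bytes) is used.
    """
    unit = 2**30 if binary else 10**9
    return round(size_bytes / unit, 2)


if __name__ == "__main__":
    # Illustrative value only, not a measured IDR database size.
    print(bytes_to_gb(5_000_000_000))
```

Whichever convention is chosen, using the same one for the TSV files and the spreadsheet would keep the two comparable.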
In addition we have a spreadsheet which is almost but not quite the same format as these tsv files. It'd be good to make sure the solution here is also correct for the spreadsheet (or maybe we can get rid of it?)
👍 for having the solution work for both. I still use the spreadsheet, so until we have everything in one place I'd be 👎 for getting rid of it.
I think xxx of yyy computes the difference between the number of rows in the filepaths or plates tsv and the actual number of datasets/plates imported in the resource. I'd vote for keeping only the second value as it is the one we are reporting.
Re Bytes, as mentioned above stats.py returns an estimate of the pixel volume using an OMERO query (sum(sizeX*sizeY*sizeZ*sizeC*sizeT*2) currently). The known caveats are the pixel type and resolution handling, and the fact that it returns the byte size of an uncompressed full-resolution 5D volume, which likely explains the huge difference from the current value. I would stick to having Size report the file size on disk of the raw data imported into the resource, i.e. the output of omero fs usage. Proposing to remove Bytes from stats.py to reduce the confusion. Maybe rename Size as Raw data size to be explicit?
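To make the pixel type caveat concrete, here is a small sketch of the same estimate computed in Python rather than in an OMERO query. It is not the code in stats.py; the function name, the input tuple layout, and the bytes-per-pixel table are assumptions made for illustration:

```python
# Assumed mapping from pixel type name to bytes per pixel (illustrative).
BYTES_PER_PIXEL = {
    "int8": 1, "uint8": 1,
    "int16": 2, "uint16": 2,
    "int32": 4, "uint32": 4,
    "float": 4, "double": 8,
}


def estimated_pixel_bytes(images, fixed_bytes_per_pixel=None):
    """Sum the uncompressed, full-resolution 5D volume of each image.

    images: iterable of (sizeX, sizeY, sizeZ, sizeC, sizeT, pixel_type).
    fixed_bytes_per_pixel: when set (e.g. 2), every image is assumed to use
    that many bytes per pixel, mirroring the hard-coded *2 in the query.
    """
    total = 0
    for size_x, size_y, size_z, size_c, size_t, pixel_type in images:
        if fixed_bytes_per_pixel is not None:
            bpp = fixed_bytes_per_pixel
        else:
            bpp = BYTES_PER_PIXEL[pixel_type]
        total += size_x * size_y * size_z * size_c * size_t * bpp
    return total


if __name__ == "__main__":
    # Made-up dimensions, not real IDR data.
    images = [
        (512, 512, 1, 3, 1, "uint8"),
        (1024, 1024, 10, 2, 5, "uint16"),
    ]
    print(estimated_pixel_bytes(images, fixed_bytes_per_pixel=2))
    print(estimated_pixel_bytes(images))
```

Either way, the result is a pixel volume and not a file size on disk, so it would not be expected to match omero fs usage.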
Re Targets, this is a metric that is quite valuable but cannot simply be queried for the reasons described above as it requires some knowledge on the study itself. Given it has not been maintained for a while, happy to discuss removing it from the maintained stats format for now until we properly get back to it.
Re tsv vs spreadsheet, I am pretty sure the headers were matching when I created the tsv files. If that's not the case, I am all for re-aligning them, as it should work as cut-n-paste.
Proposed actions:
review and agree on the column names and definitions of studies.tsv/releases.tsv and the spreadsheet. Candidates to discuss: Targets, Size, Files; anything else?
review and adjust stats.py to produce an output matching the decisions above and which can be used directly and effectively for filling the studies rows in the TSV/spreadsheet. Can we include the output from omero fs usage and the average dimension calculation in the output? Can we simply generate the stats for one study (which might reduce the generation time)?
do we need stats.py or another script to create releases.tsv from studies.tsv with the extra information (database size)? Or work from the spreadsheet?
Reviewing quickly the various columns with the various scripts output
Study/Container are covered by stats.py - maybe an RFE is for the script to split them
Introduced is the version of the current deployment - we could consider querying the Release date value on the container map annotation instead
Internal ID/Sets/Wells are covered by stats.py
Experiments/Targets are probably the two concepts we need to stop maintaining for now and review as part of a separate project, maybe return empty columns for now?
Acquisitions can be queried
5D Images/Planes are covered by stats.py
Size (TB)/Size/# of Files/avg. size (MB) - size and number of files are returned by omero fs usage and the other columns are derived
Avg. Image Dim (XYZCT) can be queried as mentioned below
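The derived columns mentioned above (Size (TB), Files (Million), avg. size (MB)) can be sketched as a single helper that takes the two raw values reported by omero fs usage. This assumes decimal units and two-decimal rounding, neither of which is specified in the thread, and the function name is hypothetical:

```python
def derived_columns(size_bytes: int, n_files: int) -> dict:
    """Compute the easier-to-read columns from the raw size and file count.

    Assumptions: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes)
    and rounding to two decimals.
    """
    avg_mb = round(size_bytes / n_files / 10**6, 2) if n_files else 0.0
    return {
        "Size (TB)": round(size_bytes / 10**12, 2),
        "Files (Million)": round(n_files / 10**6, 2),
        "avg. size (MB)": avg_mb,
    }


if __name__ == "__main__":
    # Illustrative inputs only, not values from a real study.
    print(derived_columns(2_500_000_000_000, 1_250_000))
```

Generating these from the raw values in one place would avoid the derived columns drifting between the TSV files and the spreadsheet.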
All answers in one place
IDR/idr.openmicroscopy.org#92 (comment)
IDR/idr.openmicroscopy.org#92 (comment)
A few additional comments,
https://github.com/IDR/SubmissionWorkflow/pull/23#discussion_r472469583
Tested on
prod86
vs
so I'd say these are equivalent.
https://github.com/IDR/SubmissionWorkflow/pull/23#discussion_r472492881