Script to set channel names #41

Closed

dominikl
Member

Would something like this be useful? I think we have a few cases where we don't explicitly set the channel names via rendering settings. This script would help to set them afterwards, taking the names from the 'Channel' map annotations.

@dominikl
Member Author

Note: I've not tested it with SPW yet...

@sbesson
Member

sbesson commented May 14, 2021

A few quick thoughts. The PR is timely as we have discussed related concepts during the OMERO.figure workshop preparation. In general, I think it would be extremely useful to be more specific on the Channels metadata.

Note this script has a lot of overlap with channel_names_from_maps.py which is used for the training workshops to populate channel names. For me this indicates there is a clear need for this type of script when complex rendering settings are not needed.

Trying to go quickly across the IDR to try and identify the variant patterns for the Channels:

As a general rule, I am all for reducing the number of such variants and making the Channels annotation less free-text and more systematic for clients like OMERO.figure but also re-analysis.

In terms of format, I am happy to spend a bit of time reviewing all the IDR studies but I roughly assume all our use cases could be represented by a structure of type:

<Channel1>[:<Target1>][;<Channel2>[:<Target2>]]...
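
A value in this shape could be parsed roughly as follows. This is a minimal sketch, assuming `;` separates channels and the first `:` in each item separates the channel from its optional target. The function name `parse_channels` is hypothetical and not part of the PR.

```python
from typing import List, Optional, Tuple


def parse_channels(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split a ';'-separated Channels value into (channel, target) pairs."""
    pairs = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled separators
        # Only the first ':' is treated as the channel/target separator.
        channel, sep, target = item.partition(":")
        pairs.append((channel.strip(), target.strip() if sep else None))
    return pairs


print(parse_channels("lynEGFP:cell membranes; atoh1a:atoh1a expression marker"))
```

Items without a `:` yield a target of `None`, so values such as `GFP` still parse.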

In terms of tooling, I want to re-emphasize the value of this script (I assume the training one could eventually be superseded). I would vote for having a version that asserts rather than sets the values. This would allow us to flag inconsistencies between channel names and metadata. For studies with rendering settings, this could also be used to confirm the consistency of the channel metadata.
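
An asserting variant might look something like the sketch below. It takes the current channel names as a plain list (how they are read from the server is deliberately left out), assumes the `;` and `:` separators, and compares each name against the part after `:`. `check_channel_names` is a hypothetical helper, not code from this PR.

```python
from typing import List


def check_channel_names(names: List[str], channels_value: str) -> List[str]:
    """Return human-readable mismatches instead of overwriting anything."""
    expected = []
    for item in channels_value.split(";"):
        if item.strip():
            # Use the part after the first ':' when present, else the whole item.
            head, sep, tail = item.partition(":")
            expected.append((tail if sep else head).strip())
    problems = []
    if len(names) != len(expected):
        problems.append(f"expected {len(expected)} channels, found {len(names)}")
    for index, (name, want) in enumerate(zip(names, expected)):
        if name != want:
            problems.append(f"channel {index}: {name!r} != {want!r}")
    return problems


print(check_channel_names(["nuclei", "Cy3"], "DAPI:nuclei;CY3:Actin"))
```

An empty list means the names and the metadata agree; anything else can be reported or turned into a failing assertion.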

/cc @francesw @will-moore @joshmoore @jburel

@will-moore
Member

The channel_names_from_maps.py script splits the Channels value with ;. The default in this script is ,. Maybe the default should be ;?
By default, each channel is then split on :, with the latter part being used as the channel name. E.g. the name would be endocytic patch from Sla1-yEGFP: endocytic patch. This PR uses the whole channel string.

I have a PR that adds the option to choose the first part of the channel, e.g. Sla1-yEGFP; see ome/training-scripts@8ddbc5d

That PR also adds the option to create Map-Annotations of the form:

Ch0_Stain: lynEGFP
Ch0_Label: cell membranes
Ch1_Stain: atoh1a
Ch1_Label: atoh1a expression marker

which is something I wanted for the OMERO.figure workshop, but is probably a bit too workshop-specific to be part of this script.
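
For illustration, key/value pairs of that form could be derived from a Channels value as in this sketch. It assumes the `;` and `:` separators and zero-based channel indices; `stain_label_pairs` is a hypothetical name and this is not the code from the training-scripts PR.

```python
from typing import List


def stain_label_pairs(value: str) -> List[List[str]]:
    """Build ['Ch<N>_Stain', ...] and ['Ch<N>_Label', ...] key/value pairs."""
    pairs = []
    items = [item for item in value.split(";") if item.strip()]
    for index, item in enumerate(items):
        stain, colon, label = item.partition(":")
        pairs.append([f"Ch{index}_Stain", stain.strip()])
        if colon:  # only emit a label when the item has a ':' part
            pairs.append([f"Ch{index}_Label", label.strip()])
    return pairs


for key, val in stain_label_pairs("lynEGFP:cell membranes; atoh1a:atoh1a expression marker"):
    print(f"{key}: {val}")
```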

@sbesson
Member

sbesson commented Jun 4, 2021

To be complete, I have compiled representative examples of the value of the Channels for all studies currently published in IDR (up to prod97). Happy to turn it into an issue on the relevant repository if this is too noisy.

Study Channels
idr0001 GFP:endogenous alpha tubulin 2;Cascade blue:growth media
idr0002 H2B- mCherry/Cy3:chromatin;eGFP:nuclear lamina and report on nuclear envelope breakdown
idr0003 H2B-mCherry:cytosol;GFP:tagged protein;bright field/transmitted:cell
idr0004 DIC:cell structure;YFP:Rad52-YFP protein
idr0005 Hoechst:DNA
idr0006 DAPI:nuclei;TRITC:HA_Flag tagged protein
idr0007 Exp1Cam1:various;Exp1Cam2:various
idr0008 TRITC:phallodin/F-actin;TRITC2:phallodin/F-actin;FITC:alpha-tubulin;Dapi:DNA
idr0009 dapi: DNA;vsvg-cfp: CFP-tsO45G ;pm-647: cell surface tsO45G
idr0010 Dapi/Hoechst 33258: DNA;53bp1/Alexa Fluor 488:53bp1
idr0011 YFP:DAD4; mRFP1:SPC42; DIC: whole cell
idr0012 Alexa 488:tubulin;Hoechst:DNA;Tritc:actin
idr0013 GFP: core histone 2B tagged with GFP to monitor chromosomes
idr0015
idr0016 Hoechst 33342:nucleus;concanavalin A (con A) AlexaFluor488 conjugate:endoplasmic reticulumn;SYTO 14 green fluorescent nucleic acid stain:nucleoli;wheat germ agglutinin (WGA) AlexaFluor594 conjugate:Golgi apparatus and plasma membrane;phalloidin AlexaFluoraFluor594 conjugate: F-actin;MitoTracker Deep Red: mitochondria
idr0017 DAPI:DNA;CY3:Actin
idr0018 RGB
idr0019 DAPI: nuclei;Alexa-488: NF-kappaB;dihydroethidium(DHE): cell bodies
idr0020 Hoescht: nuclei;Anti-Ser10 PhosphoHistone H3: mitotic nuclei;Anti-alpha-tubulin: microtubules;RFP: whole cell
idr0021 442:CENT2; 525:PCNT; 615:CDK5RAP2-C
idr0022
idr0023 greyscale
idr0025 ch00:DAPI(nuclei);ch01:Alexa 488(target protein);ch02:Alexa 555(microtubules)
idr0026 FD5_BLUE:SHG (collagen);FD6_GREEN:dsRed2 (CTL);BD7_RED:Alexa750 (vessels, 70kDa-dextran);BD8_RED:mCherry(Histone-2B-mCherry, B16F10/OVA nuclei)
idr0027
idr0028 AlexaFluor647:YAP/TAZ; AlexaFluor568:alphaTubulin;Phalloidin488: F-actin;Hoechst: nuclei
idr0030 Exp1Cam1:Hoechst:DNA;Exp2Cam2:mouse anti-YAP/TAZ plus Alexa488 anti-mouse:YAP/TAZ;Exp3Cam3:Alexa647 phalloidin:F-actin;Exp4Cam2:rabbit anti-CD44 plus Alexa568 anti-rabbit:CD44
idr0032 RGB
idr0033 Hoechst 33342:nucleus;concanavalin A/AlexaFluor488 conjugate:endoplasmic reticulum;SYTO14 green fluorescent nucleic acid stain:nucleoli and cytoplasmic RNA;wheat germ agglutinin/AlexaFluor594 conjugate (WGA):Golgi apparatus and plasma membrane;phalloidin/AlexaFluor594 conjugate:F_actin;MitoTracker Deep Red:mitochondria
idr0034 DAPI:nuclei;Alexa 488:EdU, proliferation;CellMask:plasma membrane;Brightfield:cell outline
idr0035 DAPI:DNA;Phallodin:F-actin;B-tubulin
idr0036 Hoechst 33342:nucleus;concanavalin A (con A) AlexaFluor488 conjugate:endoplasmic reticulumn;SYTO 14 green fluorescent nucleic acid stain:nucleoli;wheat germ agglutinin (WGA) AlexaFluor594 conjugate:Golgi apparatus and plasma membrane;phalloidin AlexaFluor594 conjugate:F-actin;MitoTracker Deep Red: mitochondria
idr0037 DAPI:nuclei;Alexa 488:EdU, proliferation;CellMask:plasma membrane;Brightfield:cell outline
idr0038 Wt1-GFP:Wt1tm1Nhsn cells expressing GFP in cytoplasm, green;PNA-rh:Peanut agglutanin conjugated with rhodamine labelling basement membranes, red
idr0040 BF: Brightfield; CFP:nuclei; YFP: pAGA1-dPSTR; RFP:pFIG1-dPSTR; BF1: Brightfield out of focus
idr0041 490-552:GFP; 587-621:mCherry | 622-695:SiR-DNA; 622-695:Dy-481XL
idr0042 RGB
idr0043
idr0044
idr0045 EB3
idr0047 TRANS: brightfield of the cells; DAPI: fluoresecently stained DNA; TMR: TAMRA labeld onligo nucleotide probes that bind to an STL1 mRNA; CY5: Cy5 labeld onligo nucleotide probes that bind to an CTT1 mRNA
idr0048 Red: Brainbow Red; Green: Brainbow Green; Blue: Brainbow Blue
idr0050 Ch1: Actin, Ch2: Cell, Ch3: Microtubules
idr0051 GFP
idr0052 NCAPD2, DNA, NEG_Dextran
idr0053
idr0054 CD3-170Er, CD19-169Tm, CD324/E-Cadherin-158Gd, CD206-168Er, Bcl6-163Dy, CD141/BDCA3-165Ho, alphaSMA-141Pr, IL-21-164Dy, CD185/CXCR5-151Eu, CD45-152Sm, empty, CXCL13-157Gd, CD1c/BDCA1-biotin + Neutravidin-173Yb, CD303/BDCA2-147Sm, CD11b-149Sm, CD45RA-155Gd, CD123-143Nd, CD68-171Yb, HLA-DR-174Yb, CD279/PD-1-175Lu, CD370/Clec9A-161Dy, CD11c-159Tb, ICOS-148Nd, DNA1-191/193Ir, CD56-176Yb, DNA2-191/193Ir, CD14-156Gd
idr0056 alpha-tubulin (microtubule cytoskeleton), CEP215/CDK5RAP272 (centrosomes) Alexa-Fluor 568 Phalloidin (actin cytoskeleton), Hoechst (DNA).
idr0061 Alexa 555
idr0062 LaminB1 / Dapi
idr0063 GFP = URA3
idr0064 405:ErkKTR-BFP; 561:H2B-RFP
idr0065 phase:Phase contrast,Cy3:amiC-Sp1-bc2,Cy5:kilR-Sp2-bc12,TxR:pbpG-Sp4-bc22,fam:yabI-Sp1-bc32
idr0066 EGFP:GlyT2positive neurons
idr0067 FITC = Hsp104-eGFP, mCherry = Htb1-mCherry
idr0069
idr0070 Brightfield
idr0071 DAPI:Cy3:A594:Cy5:Cy7
idr0072 EGFP (protein of interest), DRAQ5 (DNA)
idr0073 RGB
idr0075 Alexa488
idr0076 Total HH3-In113, Xe126, I127, Xe131, Xe134, H3K27me3-La139, Ce140, CK5-Pr141, Fibronectin-Nd142, CK19-Nd143, CK8_18-Nd144, Twist-Nd145, CD68-Nd146, CK14-Sm147, SMA-Nd148, Vimentin-Sm149, C-myc-Nd150, HER2-Eu151, CD3-Sm152, p-Total HH3-Eu153, p-ERK1/2-Sm154, Slug-Gd155, ER-Gd156, PR-Gd158, p53-Tb159, CD44-Gd160, EpCAM-Dy161, CD45-Dy162, GATA3-Dy163, CD20-Dy164, Beta-catenin-Ho165, CAIX-Er166, E_cadherin-Er167, Ki67-Er168, EGFR-Tm169, pS6-Er170, Sox9-Yb171, vWF-CD31-Yb172, mTOR-Yb173, CK7-Yb174, panCK-Lu175, cPARP-cCasp3-Yb176, DNA1-Ir191, DNA2-Ir193, Hg202, Pb204, Pb206, Pb207, Pb208, ArAr80
idr0077 561nm L, 488nm L, 561nm R, 488nm R
idr0078 Sla1-yEGFP: endocytic patch; Sac6-tdTomato: actin patch; Dextran Alexa 647: cell outline
idr0079 lynEGFP:cell membranes; atoh1a:atoh1a expression marker
idr0080 Hoechst 33342 (DNA); Concanavalin A/Alexa 488 (endoplasmic reticulum); 488 Long (nucleoli and cytoplasmic RNA); Phalloidin/Alexa 568 and wheat-germ agglutinin/Alexa 555 (actin cytoskeleton, golgi, and plasma membrane (AGP)); MitoTracker Deep Red/Alexa 647 (mitochondria)
idr0081 Hoechst:nuclei; GFP:infection
idr0082
idr0083
idr0084 blue: DAPI nuclear stain; green: FITC alpha-globin nascent transcripts
idr0085 Ch1: CMDiI staining, Ch2: microvascular staining
idr0086 594: EdU, 488: IF signal, DAPI
idr0087 C0 (Hoechst), C1(mitoTracker), C2(cargo fluorescence)
idr0088 Ch1 (blue): Nuclei/Cytoplasm, Ch2 (green): TUBA1B, Ch3 (red): RELA
idr0089 H3K4me3, H3K27me3, DAPI
idr0090 BF:Brightfield; DAPI:DNA; GFP:Cytosolic GFP; Cy3:Red Blood Cell; Cy5:Mitochondria
idr0091 Phase Contrast, GFP, GFP-raw
idr0092 bright-field
idr0093 DNA;Nascent RNA;PCNA;Succinimidyl ester
idr0094 cell body
idr0095 Phase, mCherry, YFP
idr0097 green:GFP; yellow:base T & base A; red:base C & base A
idr0098 grayscale
idr0099 eGFP (488nm)
idr0100 Axon [green]; Nucleus [blue]; Oligodendrocyte [red]
idr0103 Two channels (blue 440-480 nm and green 500-540 nm) CCF2
idr0106 alpha-SMA-FITC; VEGFR3-Alexa546; A549-mCherry
idr0109 Phase contrast: Cells

@gwaybio

gwaybio commented Nov 24, 2021

I am looking to compile a metadata matrix of stain by label (as they are defined here) where every entry in the matrix indicates how many wells exist in IDR with this stain:label combination. I am retrieving channel info for all studies using the IDR API.
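
A matrix like this could be accumulated as in the following sketch. It assumes each record is a `Stain:Label;...` string paired with a well count, and it leaves out how the records are fetched from the IDR API. `stain_label_counts` is a hypothetical helper and the well counts in the example are placeholders, not real IDR numbers.

```python
from collections import Counter
from typing import Iterable, Tuple


def stain_label_counts(records: Iterable[Tuple[str, int]]) -> Counter:
    """Sum well counts per (stain, label) pair over 'Stain:Label;...' values."""
    counts: Counter = Counter()
    for channels_value, n_wells in records:
        for item in channels_value.split(";"):
            stain, colon, label = item.partition(":")
            if colon:  # skip entries that do not follow the stain:label form
                counts[(stain.strip(), label.strip())] += n_wells
    return counts


# Well counts below are placeholders, not real IDR numbers.
records = [("DAPI:DNA;CY3:Actin", 10), ("DAPI:DNA", 5)]
print(stain_label_counts(records))
```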

However, I am running into many of the metadata issues that this issue describes.

Namely, there are many different ways that IDR records the study submitters' coding of this information (many examples are listed in #41 (comment)).

I see 5 fundamentally different ways (there could be others too) that channel info is coded:

| Structure | Example |
| --- | --- |
| Stain1:Label1;Stain2:Label2;... | GFP:endogenous alpha tubulin 2;Cascade blue:growth media |
| Stain1: Label1;Stain2: Label2;... | dapi: DNA;vsvg-cfp: CFP-tsO45G ;pm-647: cell surface tsO45G |
| StainNumberIndicator:StainLabelCombo | ch00:DAPI(nuclei);ch01:Alexa 488(target protein) and Exp1Cam1:various;Exp1Cam2:various |
| Stain1 (Label1); Stain2 (Label2);... | Hoechst 33342 (DNA); Concanavalin A/Alexa 488 (endoplasmic reticulum) |
| Channels value missing from annotations API but it is listed in free text | E.g. idr0069, which was previously documented in image.sc |
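
These patterns could be told apart with a few heuristics, as sketched below. The regular expressions and the category names are assumptions made for illustration; the two `Stain:Label` variants (with and without a space after the colon) are treated as one category, and anything that fits none of the rules is reported as `other`.

```python
import re

# Keys such as 'ch00' or 'Exp1Cam1' that number a channel rather than name a stain.
INDEX_KEY = re.compile(r"^(ch\d+|exp\d+cam\d+)$", re.IGNORECASE)
# Items such as 'Hoechst 33342 (DNA)': text followed by one parenthesised label.
PAREN_ITEM = re.compile(r"^[^:()]+\([^()]*\)$")


def classify(value: str) -> str:
    """Guess which coding style a Channels value follows."""
    items = [item.strip() for item in value.split(";") if item.strip()]
    if not items:
        return "missing"
    if all(":" in item for item in items):
        keys = [item.split(":", 1)[0].strip() for item in items]
        if all(INDEX_KEY.match(key) for key in keys):
            return "index:stain-label combo"
        return "stain:label"
    if all(PAREN_ITEM.match(item) for item in items):
        return "stain (label)"
    return "other"


print(classify("ch00:DAPI(nuclei);ch01:Alexa 488(target protein)"))
```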

Furthermore, many of the stains and labels, I assume, refer to the same thing, but they are coded slightly differently (e.g. DAPI:nuclei vs. dapi:DNA).

My purpose for writing this note is to help provide what I'm seeing as a user, to describe how I would like to use the channel metadata parameter specifically, and to let the folks contributing to this PR know that I am interested in its resolution (in case it makes any difference!)

My ultimate goal is to use IDR metadata to help me select specific datasets to re-analyze.

@sbesson
Member

sbesson commented Nov 24, 2021

@gwaygenomics thanks for raising this important issue. Summarising briefly the state of the channel metadata in IDR:

  • the Channels column in the annotation file primarily reflects the representation of the submitter
  • some minimal curation happens but there is currently no authoritative set of ontologies used for channels, unlike other concepts like Organism or Compound; see https://idr.openmicroscopy.org/about/linked-resources.html
  • the channel metadata is always transformed into tables and in the majority of the cases as map annotations (Others)
  • another relevant location is the channel name, e.g. as displayed under Image Details or in the viewer. The name is either read from the original image file format or set via the API

Overall, I think our biggest challenge comes from the heterogeneity of use cases. Primarily we are dealing with diverse imaging modalities so the channel metadata of a brightfield RGB dataset is fundamentally different from the channel metadata of a fluorescent cell-painting assay. As you pointed out there are also various concepts associated with a channel including the marker, the stain, the filter, the biological structure.

We agree that reducing the divergence and effectively moving towards a standard IDR representation of the channel metadata is key to allow consumers like you to effectively mine the data. My postulate is that trying to solve all IDR use cases at once is impractical and this partly explains why there is no progress here.

Trying to think how to move this forward, I suspect we need to build a first implementation, probably for a subset of data, and start iterating over it. I think the role of resource consumers like you is absolutely key and it would be great if you had the capacity to help drive this specification effort. A few initial questions:

  • can we restrict the scope of this work to either a study type or a subset of studies that would be the most useful to you?
  • given the existing metadata content but ignoring the current encoding, could you express a representation that would effectively communicate what you need to query?

@gwaybio

gwaybio commented Nov 24, 2021

I would be delighted to serve as a use-case for improving channel metadata standards, and I agree with the challenges you've presented.

Focus

I am interested in IDR screens, particularly those with imaging of multiple fluorescent channels. For example, I would like to analyze heterogeneous imaging datasets that have at least nuclei stained as a common structure. Simplistically, I'm thinking something like this:

| Study | Nuclei | Mito | Other (including brightfield) | Cell Type | Perturbation |
| --- | --- | --- | --- | --- | --- |
| idrxxx | | | | Cancer | Drugs |
| idryyy | | | | Neuron | Media |
| idrzzz | | | | Fibroblast | CRISPR |

Metadata coding

Key elements of an effective metadata coding that would be helpful for me are:

  • Consistent nomenclature describing marker, stain, filter, and biological structure/organelle
    • Case (e.g. DAPI vs. dapi)
    • Plurality (e.g. Nucleus vs. Nuclei)
    • Structure resolution (e.g. Nucleus vs. DNA)
  • Consistent data structure presenting channel information in the API
    • Key label pair (e.g. DAPI:DNA vs. DAPI (DNA))
    • More specific nomenclature distinguishing the specific data structure (see below)

To achieve my aim, and if I could influence the most effective setup for me (without knowing all of the current limitations of course!), I would have liked to see the information coded in the API as the following:

# Wishlist API
{
  'link': {
    'parent': {
      'id': 14529,
      'class': 'ImageI',
      # The name could also be split out by well, field, and spot separately, although IIRC, this info is elsewhere too
      'name': 'DTT p1 [Well 77, Field 1 (Spot 229)]'
    },
    'date': '2016-12-13T23:14:34+00:00',
  },
  'class': 'MapAnnotationI',
  'values': {
    'strain': 'Y6545',
    'environmental_stress': 'dithiothreitol',
    # Add category to distinguish fluorescent from brightfield
    'category': 'fluorescent',
    'channels': {
      # To enable future filtering of datasets with specific channel counts
      'count': 3,
      # To enable proper indexing to channels info
      'channel_keys': ['ch1', 'ch2', 'ch3'],
      # Ordering is arbitrary, but having a key will enable faster indexing in future search functionality
      'ch1': {
        # Translate user input into a common dictionary, or enforce form fill out by drop down selection menus
        # Use a simple "key: label" pair, as to not worry about parsing delimiters in strings and/or lists
        'stain': 'H2B-mCherry',
        'structure': 'cytosol',
        'filter': '561',
        'filter_unit': 'nm'
      },
      'ch2': {
        'stain': 'GFP',
        # It would be great if the API specified the specific protein
        'structure': 'protein',
        'filter': '469',
        'filter_unit': 'nm'
      },
      'ch3': {
        'stain': 'brightfield',
        'structure': 'wholecell',
        'filter': '',
        'filter_unit': ''
      }
    }
  }
}

Current annotation API (see IDR/idr.openmicroscopy.org#149 (comment))

# Current API output
 {
  'id': 6631107,
  'ns': 'openmicroscopy.org/omero/bulk_annotations',
  'description': None,
  'owner': {
    'id': 2
  },
  'date': '2016-12-13T23:14:34+00:00',
  'permissions': {
    'canDelete': False,
    'canAnnotate': False,
    'canLink': False,
    'canEdit': False
  },
  'link': {
    'id': 23067151,
    'owner': {
      'id': 2
    },
   'parent': {
      'id': 14529,
      'class': 'ImageI',
      'name': 'DTT p1 [Well 77, Field 1 (Spot 229)]'
    },
    'date': '2016-12-13T23:14:34+00:00',
    'permissions': {
      'canDelete': False,
      'canAnnotate': False,
      'canLink': False,
      'canEdit': False
    }
  },
  'class': 'MapAnnotationI',
  'values': [
    ['Strain', 'Y6545'],
    ['Environmental Stress', 'dithiothreitol'],
    ['Channels', 'H2B-mCherry:cytosol;GFP:tagged protein;bright field/transmitted:cell'],
    ['Has Phenotype', 'yes'],
    ['Phenotype Annotation Level', 'experimental condition and gene']
  ]
}
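
To make the gap between the two shapes concrete, here is a rough sketch that turns the current `values` list into a nested structure along the lines of the wishlist. It assumes the `Stain:Structure;...` coding, and it omits `filter`, `filter_unit` and `category` because the current output does not carry them. `restructure` is a hypothetical helper, not an existing API.

```python
from typing import Any, Dict, List


def restructure(values: List[List[str]]) -> Dict[str, Any]:
    """Convert [[key, value], ...] pairs into a nested, channel-indexed dict."""
    out: Dict[str, Any] = {}
    for key, value in values:
        if key != "Channels":
            # e.g. 'Environmental Stress' -> 'environmental_stress'
            out[key.lower().replace(" ", "_")] = value
            continue
        items = [item for item in value.split(";") if item.strip()]
        channels: Dict[str, Any] = {"count": len(items), "channel_keys": []}
        for number, item in enumerate(items, start=1):
            stain, _, structure = item.partition(":")
            channel_key = f"ch{number}"
            channels["channel_keys"].append(channel_key)
            channels[channel_key] = {
                "stain": stain.strip(),
                "structure": structure.strip(),
            }
        out["channels"] = channels
    return out


values = [
    ["Strain", "Y6545"],
    ["Environmental Stress", "dithiothreitol"],
    ["Channels", "H2B-mCherry:cytosol;GFP:tagged protein;bright field/transmitted:cell"],
]
print(restructure(values)["channels"]["channel_keys"])
```

Note that repeated keys in the input list would overwrite each other in this sketch, which is one reason a flat key/value list and a nested dict are not interchangeable.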

Other comment

I'd also like, in general, to be able to describe the wealth of data that currently exists in IDR. This means cataloging the biological and technical diversity of the publicly available images, and doing so requires API hits, which are complicated by metadata inconsistencies. I believe that one barrier to reanalyzing these data is low awareness, and a timely description of what's available will help raise awareness.

@sbesson
Member

sbesson commented Nov 30, 2021

Thanks @gwaygenomics, definitely good to start looking at structure. Some feedback and a few questions to continue this discussion:

From the storage perspective

  • OMERO MapAnnotations are effectively ordered lists of key/value pairs. This means there is no easy way to represent hierarchies. For e.g. genes, we are using separate map annotations with the same namespace but in the case of channels, there is also the problem of channel indexing
  • a potential alternative representation for this relationship would be to store the channel metadata as MapAnnotation i.e. key/value pairs associated with the Channel objects themselves
  • implementation-wise, the last solution will likely require additional API endpoints, to filter annotations by channel, e.g. https://idr.openmicroscopy.org/webclient/api/annotations/?type=map&channel=<channel_id>, and possibly to expose channels and map annotations in the JSON API.

From the specification perspective:

  • the minimal set of keys mentioned in #41 (comment) is: stain, structure, filter and filter_unit
  • the latter two are, I believe, covered by the properties of the Channel element, namely EmissionWavelength and EmissionWavelengthUnit. This is partly where my proposal of Channel annotation comes from, to avoid duplicating these structures
  • for stain and structure, are you aware of a reference controlled vocabulary? I totally second unifying the terms e.g. DAPI vs dapi but what should be the reference to decide which variant should be used?

@gwaybio

gwaybio commented Nov 30, 2021

Thanks @sbesson! I appreciate the context. It seems this change will be difficult, but I also think that it is worthwhile to standardize.

are you aware of a reference controlled vocabulary?

Structure

In chatting with Melissa Haendel's group, they recommend using gene ontology (GO) as the canonical standard for subcellular anatomy.

I put together this: https://github.com/WayScience/organelles/blob/main/organelles.tsv which could serve as a starting point for standardizing structure.
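
As a rough idea of how such a table could be used, the sketch below maps free-text structure names to GO cellular component identifiers. The small in-code tables are an illustrative subset; the synonym entries are assumptions and are not taken from organelles.tsv, and `to_go` is a hypothetical helper.

```python
from typing import Dict, Optional

# Illustrative subset of GO cellular component terms.
GO_TERMS: Dict[str, str] = {
    "nucleus": "GO:0005634",
    "mitochondrion": "GO:0005739",
    "endoplasmic reticulum": "GO:0005783",
}
# Assumed synonyms; a real list would need to be curated and agreed on.
SYNONYMS: Dict[str, str] = {
    "nuclei": "nucleus",
    "mitochondria": "mitochondrion",
    "er": "endoplasmic reticulum",
}


def to_go(structure: str) -> Optional[str]:
    """Return the GO identifier for a structure name, or None if unknown."""
    name = structure.strip().casefold()  # ignore case and outer whitespace
    name = SYNONYMS.get(name, name)
    return GO_TERMS.get(name)


print(to_go("Nuclei"), to_go("cell body"))
```

Returning `None` for unknown terms keeps unmapped values visible, so they can be reviewed rather than silently guessed.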

Stain

I am not aware of any standardization efforts. Maybe we can start one?

I found two resources that might be helpful here:
