Determining Studies and Assays for ISA #33

ptth222 · 2023-08-11T19:28:40Z

ptth222
Aug 11, 2023
Maintainer

We have previously discussed how to separate studies and assays from a protocol sequence (work backward from the end until you find a collection protocol), but ISA data sets can have multiple studies and assays, and we have not discussed how to classify records in the MESSES JSON accordingly. For the MWTab format there is essentially one study and one assay per file, so we have not had to deal with this before. We have a studies table, so we can identify and name multiple studies that way, but we do not connect the other tables' records to a study. A fairly simple solution is to add a "study.id" to records, which is something the extraction used to do automatically and is still available through extract tags. Similarly, we could add an "assay_id" to records. This would require the user to do the extraction a little differently for ISA than for MWTab, and these fields would be required. I think I may have a way around this though.

I think there is a way to examine the entities and work out how many studies and assays there should be. First, you compute all of the sample lineages. Then, you use the lineages to find all of the unique protocol sequences. The number of unique protocol sequences should equal the number of assays. Next, you split each protocol sequence into a study sequence and an assay sequence; the number of unique study sequences should equal the number of studies. There is one complication: if a protocol is a factor, you can get extra unique study protocol sequences, but this can be factored out. I have already written some code to do this, so I am going to provide some examples now.

Example sample sequences:

[
['15_C1-20_allogenic_7days_UKy_GCH_rep3',
 '15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3',
 '15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICSM_A'],

['15_C1-20_allogenic_7days_UKy_GCH_rep3',
 '15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3',
 '15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-protein']
]
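
Lineages like these can be built by following parent links from each childless entity back to its root. The following is only a minimal sketch, not the code mentioned in this post: it assumes each entity record carries an optional "parent_id" field, and the `sample_lineages` helper and the shortened entity IDs are made up for illustration.

```python
def sample_lineages(entities):
    """Return one root-to-leaf list of entity IDs for every childless entity."""
    # "parent_id" is an assumed field name; a missing value marks a root.
    parents = {eid: record.get("parent_id") for eid, record in entities.items()}
    has_child = {parent for parent in parents.values() if parent is not None}

    lineages = []
    for eid in entities:
        if eid in has_child:
            continue  # only start from the ends of the lineages
        chain = []
        current = eid
        # Walk up to the root; the membership test guards against cycles.
        while current is not None and current not in chain:
            chain.append(current)
            current = parents.get(current)
        lineages.append(chain[::-1])
    return lineages


# Hypothetical, shortened entity IDs standing in for the real ones.
entities = {
    "subject_1": {},
    "subject_1_colon": {"parent_id": "subject_1"},
    "subject_1_colon-polar": {"parent_id": "subject_1_colon"},
    "subject_1_colon-protein": {"parent_id": "subject_1_colon"},
}
lineages = sample_lineages(entities)
```

Each lineage starts at the subject and ends at a childless entity, which mirrors the shape of the two example sequences.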

Example protocol sequences with protocol factor issue:

[['naive',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'polar_extraction',
  'IC-FTMS_preparation'],
 ['naive',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'protein_extraction'],
 ['syngenic',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'polar_extraction',
  'IC-FTMS_preparation'],
 ['syngenic',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'protein_extraction'],
 ['allogenic',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'polar_extraction',
  'IC-FTMS_preparation'],
 ['allogenic',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'protein_extraction']]

Notice the treatment factor protocols, naive, allogenic, and syngenic, create extra sequences.
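
One way protocol sequences like these could be derived from the lineages is to concatenate the protocols on each entity in lineage order and keep only the distinct results. This is a sketch under assumptions: the "protocol.id" field is assumed to hold either a single ID or a list of IDs, and `protocol_sequence`, `unique_sequences`, and the entity IDs are hypothetical names.

```python
def protocol_sequence(lineage, entities):
    """Concatenate the protocols of each entity in a lineage, root first."""
    sequence = []
    for eid in lineage:
        # "protocol.id" is an assumed field name; accept a string or a list.
        protocols = entities[eid].get("protocol.id", [])
        if isinstance(protocols, str):
            protocols = [protocols]
        sequence.extend(protocols)
    return sequence


def unique_sequences(lineages, entities):
    """Return the distinct protocol sequences in first-seen order."""
    unique = []
    for lineage in lineages:
        sequence = protocol_sequence(lineage, entities)
        if sequence not in unique:
            unique.append(sequence)
    return unique


entities = {
    "subject": {"protocol.id": ["naive"]},
    "tissue": {"protocol.id": ["mouse_tissue_collection", "tissue_quench",
                               "frozen_tissue_grind"]},
    "polar_a": {"protocol.id": ["polar_extraction", "IC-FTMS_preparation"]},
    "polar_b": {"protocol.id": ["polar_extraction", "IC-FTMS_preparation"]},
    "protein": {"protocol.id": "protein_extraction"},
}
lineages = [
    ["subject", "tissue", "polar_a"],
    ["subject", "tissue", "polar_b"],
    ["subject", "tissue", "protein"],
]
sequences = unique_sequences(lineages, entities)
```

The two polar lineages collapse into one sequence, so three lineages yield two unique sequences here.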

Example protocol sequences with the factor protocols normalized:

[['factor_protocol_0',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'polar_extraction',
  'IC-FTMS_preparation'],
 ['factor_protocol_0',
  'mouse_tissue_collection',
  'tissue_quench',
  'frozen_tissue_grind',
  'protein_extraction']]
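
The normalization could look something like the sketch below, which replaces every protocol that belongs to a factor with a shared placeholder and then removes duplicates. How factor protocols are actually identified is not stated in this post, so the `factor_protocols` mapping (factor name to protocol IDs) and the `normalize_factor_protocols` helper are assumptions.

```python
def normalize_factor_protocols(sequences, factor_protocols):
    """Replace factor protocols with placeholders and drop duplicate sequences."""
    # All protocols of the same factor share one placeholder name.
    placeholder = {}
    for index, protocols in enumerate(factor_protocols.values()):
        for protocol in protocols:
            placeholder[protocol] = f"factor_protocol_{index}"

    unique = []
    for sequence in sequences:
        normalized = [placeholder.get(protocol, protocol) for protocol in sequence]
        if normalized not in unique:
            unique.append(normalized)
    return unique


common = ["mouse_tissue_collection", "tissue_quench", "frozen_tissue_grind"]
polar = common + ["polar_extraction", "IC-FTMS_preparation"]
protein = common + ["protein_extraction"]

# Six sequences: three treatments times two extraction branches.
sequences = [[treatment] + tail
             for treatment in ("naive", "syngenic", "allogenic")
             for tail in (polar, protein)]

# Assumed input: which protocols are values of which factor.
factor_protocols = {"treatment": ["naive", "syngenic", "allogenic"]}
normalized = normalize_factor_protocols(sequences, factor_protocols)
```

With the three treatments collapsed into one placeholder, the six input sequences reduce to the two shown in the example.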

You can see that this would indicate 2 assays: 1 for the polar extraction and 1 for the protein extraction. To separate the assay and study sequences, you would work backward until you hit "mouse_tissue_collection", and you would end up with a single study sequence of ["factor_protocol_0", "mouse_tissue_collection"]. Looking at this again, I feel like "tissue_quench" and "frozen_tissue_grind" should be part of the study since they are common to both assays. We might need to rethink how to separate studies and assays.
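
The backward split described here can be sketched as follows. The set of collection protocol IDs is an assumed input (this post does not say how a collection protocol is recognized), `split_sequence` is a hypothetical helper, and treating a sequence with no collection protocol as all assay is my own fallback choice.

```python
def split_sequence(sequence, collection_protocols):
    """Split a protocol sequence at the last collection protocol.

    Returns (study_sequence, assay_sequence). The collection protocol itself
    ends the study part.
    """
    # Work backward from the end until a collection protocol is found.
    for index in range(len(sequence) - 1, -1, -1):
        if sequence[index] in collection_protocols:
            return sequence[:index + 1], sequence[index + 1:]
    # Assumed fallback: no collection protocol means everything is assay.
    return [], list(sequence)


sequences = [
    ["factor_protocol_0", "mouse_tissue_collection", "tissue_quench",
     "frozen_tissue_grind", "polar_extraction", "IC-FTMS_preparation"],
    ["factor_protocol_0", "mouse_tissue_collection", "tissue_quench",
     "frozen_tissue_grind", "protein_extraction"],
]
splits = [split_sequence(seq, {"mouse_tissue_collection"}) for seq in sequences]

# Distinct study sequences; their count is the predicted number of studies.
study_sequences = []
for study, _assay in splits:
    if study not in study_sequences:
        study_sequences.append(study)
```

Under this rule "tissue_quench" and "frozen_tissue_grind" land in both assay sequences, which is exactly the behavior being questioned above.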

With this breakdown we would be able to add study IDs and assay IDs to the entities, protocols, and measurements. There is still the issue of connecting the studies in the study table to the groups of entities, protocols, and measurements we can identify as being in the same study. For instance, let's say we determined there are at least 2 studies based on the protocol sequences. We can identify which entities should be grouped into the 2 different studies, but we still don't have a way of knowing which study from the study table should go with each identified group. I think if we required "study.id" on the subjects this would solve the problem without too much burden. The assays can just be numbered since we don't have an assay table.

Basically, my idea is that there would be a preprocess step for ISA that would do the above and add "study.id" and "assay_id" fields to records as appropriate, so that the conversion directives could rely on these fields and be simpler. I'm thinking this could be tied to an option such as "--determine_studies". If the user does not give the option, the expectation is that they would provide the ID fields themselves.

I think we are definitely going to have to meet to talk through this.

Should the protein and polar extractions be 2 different assays? It fits with ISA, but when you compare it to how we handle things in MWTab it's strange. The protein weight is just 1 measurement that's on the entity; we don't create measurement records for it, and there is no data file.

Also something to note: the polar extraction and protein extraction both have the same sample as input, so when translated to ISA they will end up in 2 different processes, and each process will have the same input but different outputs. I think this should be an error for ISA, but I tried it and it worked fine. I also checked their validation code and they don't validate anything like that. I have added it to the list of things to ask them about. It is kind of strange. The way it works out, it looks like you apply 2 different protocols to the same sample, which is obviously not possible. What actually happens is that the sample is split into polar, lipids, and protein, but we created 3 protocols instead of 1. Those protocols have files describing them, so if you read those it's probably clear, but just looking at the JSON it's kind of strange.

hunter-moseley · 2023-08-12T03:13:02Z

hunter-moseley
Aug 12, 2023
Maintainer

I like the idea of a "--determine_studies" option. But the study.id may need to be either on subjects or samples. In the Hildebrandt datasets, we had 6 tissues from the same subjects. So, I believe the separate studies were determined at the sample level and not the subject level.



ptth222 · 2023-08-14T18:49:44Z

ptth222
Aug 14, 2023
Maintainer Author

We met and discussed this. The conclusion was that the method for determining studies and assays is good, but instead of adding an option we will check whether the records already have those IDs and not overwrite them. We also decided that even when the method indicates there should be 1 study, the user can break it into more studies by simply specifying them on the samples. Since they have to put the study IDs on the samples anyway for us to connect the predicted studies to the actual studies in the study table, this is the simplest way to do it. Having the same input on multiple protocols was deemed not a problem.
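
The "don't overwrite" rule amounts to filling in a predicted ID only when the record has none. A minimal sketch, with `add_predicted_ids` and the record layout as assumptions rather than the actual implementation:

```python
def add_predicted_ids(records, predicted_ids, field):
    """Add a predicted ID to each record unless the user already supplied one."""
    for record_id, predicted in predicted_ids.items():
        # setdefault leaves an existing user-supplied value untouched.
        records[record_id].setdefault(field, predicted)
    return records


samples = {
    "sample_a": {"study.id": "user_study"},  # user-supplied, must be kept
    "sample_b": {},                          # missing, gets the prediction
}
predicted = {"sample_a": "study_0", "sample_b": "study_0"}
add_predicted_ids(samples, predicted, "study.id")
```

Here "sample_a" keeps "user_study" while "sample_b" receives "study_0", so a user can split one predicted study into several just by labeling the samples.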


ptth222 · 2023-11-07T04:08:15Z

ptth222
Nov 7, 2023
Maintainer Author

I have now spent a lot of time coding this to create the processSequences for studies and assays. How I ended up doing it is a little different from what is described above.

The way the ISA code works, it is easier to have complex or lengthy sample inheritance in the studies than in the assays. The assays really want to be a simple document with just 1 sample-to-extract step (or just 1 sample) followed by analytical or statistical processes and files. What I ended up doing was to assume that the entities connected to a measurement are extracts and need to be in an assay. An example is '15_C1-20_Colon_allogenic_7days_170427_UKy_GCH_rep3-polar-ICSM_A', which is an extract from the colon tissue. If the extract assumption is not good, the user can put an "isa_type" field on the entity to indicate that it is a sample and not an extract. This means the "-protein" samples would be in a study and not an assay, since we don't have a measurement object for them, but it is simple enough to add dummy measurements for them so they would be in an assay. To determine the number of assays per study, I find all the unique protocol sequences, like what is shown above, and assume each unique sequence is a separate assay. Users can override this by adding an "assay_id" to the measurement.
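
A rough sketch of that assay assignment is below. It is not the actual processSequences code: the "entity.id" field on measurements, the "sample" value for "isa_type", the precomputed `sequences` mapping, and the generated "assay_N" names are all assumptions, and for brevity it ignores the per-study grouping.

```python
def assign_assays(measurements, entities, sequences):
    """Map each measurement ID to an assay ID.

    sequences maps an entity ID to its protocol sequence (a list).
    """
    assay_by_sequence = {}
    assignments = {}
    for measurement_id, measurement in measurements.items():
        # A user-supplied "assay_id" always wins.
        if "assay_id" in measurement:
            assignments[measurement_id] = measurement["assay_id"]
            continue
        entity_id = measurement["entity.id"]  # assumed field name
        # Entities flagged as samples are not extracts, so no assay for them.
        if entities[entity_id].get("isa_type") == "sample":
            continue
        # Each unique protocol sequence becomes its own numbered assay.
        key = tuple(sequences[entity_id])
        if key not in assay_by_sequence:
            assay_by_sequence[key] = f"assay_{len(assay_by_sequence)}"
        assignments[measurement_id] = assay_by_sequence[key]
    return assignments


entities = {"polar_1": {}, "polar_2": {}, "lipid_1": {}, "tissue_1": {"isa_type": "sample"}}
sequences = {
    "polar_1": ["polar_extraction", "IC-FTMS_preparation"],
    "polar_2": ["polar_extraction", "IC-FTMS_preparation"],
    "lipid_1": ["lipid_extraction"],
    "tissue_1": ["mouse_tissue_collection"],
}
measurements = {
    "m1": {"entity.id": "polar_1"},
    "m2": {"entity.id": "polar_2"},
    "m3": {"entity.id": "lipid_1"},
    "m4": {"entity.id": "polar_1", "assay_id": "custom"},
    "m5": {"entity.id": "tissue_1"},
}
assignments = assign_assays(measurements, entities, sequences)
```

Measurements on entities that share a protocol sequence end up in the same assay, the explicit "assay_id" is kept, and the measurement on the entity marked as a sample gets no assay.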

Studies are determined by finding all childless entities and then looking at the required "study.id" attribute on them to group those lineages into studies.
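
As a sketch of this grouping, the code below walks up from every childless entity and files the resulting lineage under that entity's "study.id". The "parent_id" field name, the `lineages_by_study` helper, and raising an error when "study.id" is missing are assumptions for illustration.

```python
def lineages_by_study(entities):
    """Group root-to-leaf lineages by the "study.id" of their childless entity."""
    # "parent_id" is an assumed field name; a missing value marks a root.
    parents = {eid: record.get("parent_id") for eid, record in entities.items()}
    has_child = {parent for parent in parents.values() if parent is not None}

    studies = {}
    for eid, record in entities.items():
        if eid in has_child:
            continue
        if "study.id" not in record:
            raise ValueError(f'childless entity "{eid}" is missing "study.id"')
        lineage = []
        current = eid
        # Walk up to the root; the membership test guards against cycles.
        while current is not None and current not in lineage:
            lineage.append(current)
            current = parents.get(current)
        studies.setdefault(record["study.id"], []).append(lineage[::-1])
    return studies


entities = {
    "subject_1": {},
    "colon_1": {"parent_id": "subject_1"},
    "liver_1": {"parent_id": "subject_1"},
    "colon_1-polar": {"parent_id": "colon_1", "study.id": "colon_study"},
    "liver_1-polar": {"parent_id": "liver_1", "study.id": "liver_study"},
}
studies = lineages_by_study(entities)
```

Two tissues from the same subject fall into different studies here, which matches the case raised earlier in the thread of several tissues per subject.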

