Fix header detection for tables with sparse numerical data #77

sbesson · 2022-06-05T20:23:41Z

Fixes #76

Reproducible scenario

First create a minimal dataset/image hierarchy e.g. as follows:

touch test1.fake test2.fake
dataset=$(omero obj new Dataset name=sparse_table)
omero import -T $dataset test1.fake test2.fake

CSV files with sparse string data such as the one below are correctly handled by the current HEAD of omero-metadata.

$ cat sparse_string_column.csv 
Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1,high,low
test2.fake,0.5,2,,low

The columns with missing values are mapped as s/StringCOlumn and the missing value are turned into empty strings where running omero metadata populate --file sparse_string_column.csv $dataset e.g.

CSV files with sparse numerical columns such as the one below currently fail during the population command:

$ cat sparse_numeric_column.csv
Image name,meas1,meas2,meas3,meas4
test1.fake,1.1,1.2,high,low
test2.fake,,2.1,,low

Here, the meas1 column is currently mapped into a d header type/DoubleColumn by the pandas detection logic, With the default omero metadata populate --file command, the table population fails with ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN`.

Proposed changes

Since the library already includes some logic allowing the user to control whether NaN values are allowed in the OMERO.table (introduced in #60), this PR proposes the following changes

the MetadataControl.detect_headers API supports an extra keep_default_na argument (True by default). Its value is forwarded to pandas.read_csvand determines controls how pandas should handle NA (missing) values and whether the column is detected asdvss` (see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
unit tests are added to cover the new MetadataControl.detect_headers API with different combination of tables/arguments
the populate command is updated to use the value of args.allow_nan (False by default) as pass it as keep_default_na to MetadataControl.detect_headers

fe73a17 adds a cosmetic change defining GNU-style aliases of the command-line arguments (--manual-header, --allow-nan) using hyphen as separator. The existing underscore separated flags are preserved.

Testing

With these changes, annotating of sparse CSV tables using the default header detection should be functional in all cases.

if the tabular data is dense or containing sparse string columns, the behavior of the command should be unchanged
if the tabular data is dense or containing sparse numerical columns,, the behavior of the command will depend on the --allow-nan flag
```
omero metadata populate --file sparse_numeric_column.csv $data
```
will detect the sparse numeric column as aStringColumn and store the missing values as empty strings
```
omero metadata populate --file sparse_numeric_column.csv --allow-nan $data
```
will detect the sparse numeric column as a DoubleColumn and store missing values as nan

snoopycrimecop · 2022-06-06T00:52:46Z

Conflicting PR. Removed from build OMERO-plugins-push#1169. See the console output for more details.
Possible conflicts:

PR Support images_by_name for Screen and Plate #64 will-moore 'Support images_by_name for Screen and Plate'
- test/integration/metadata/test_populate.py

--conflicts

snoopycrimecop · 2022-06-11T00:50:16Z

Conflicting PR. Removed from build OMERO-plugins-push#1176. See the console output for more details.
Possible conflicts:

PR Support images_by_name for Screen and Plate #64 will-moore 'Support images_by_name for Screen and Plate'
- test/integration/metadata/test_populate.py

~~--conflicts~~ Conflict resolved in build OMERO-plugins-push#1179. See the console output for more details.

muhanadz

Review summary

In conclusion, what was described in the PR description is reasonable, valid, and in my opinion necessary.

Following the steps described, without the PR there are cases where the script will fail (with a ValueError exception) in the presence of a sparse numeric column and the lack of --allow-nan. This PR fixes that behavior such that in the case of a sparse numeric column you can choose to have the sparse numeric column read as 'd' (DoubleColumn) or 's' (StringColumn) by passing --allow-nan or not

My current understanding is that nan values or empty values will be processed regardless of if they're in a numeric or string column. And the default is to read the nan values as string type.

If so I'd suggest adding something like "without --allow-nan, by default, the column will be read as a string column in the presence of nans in the column" somewhere either in the help of the argument or the README.

Review details

Using `sparse_string_column.csv` (4th column sparse):

Without PR detected ['s', 'd', 'l', 's', 's']. Added empty string to final result.
With PR changes, detected ['s', 'd', 'l', 's', 's'] and with --allow-nan, similary detected ['s', 'd', 'l', 's', 's']. Added empty string to final result.

Using `sparse_numeric_column.csv (2nd and 4th column sparse)`

Without PR detected ['s', 'd', 'd', 's', 's'], however will get the following exception:

ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN

With PR changes, detected ['s', 's', 'd', 's', 's'] which a dded empty string to final result and with --allow-nan detected ['s', 'd', 'd', 's', 's'] which added 'nan' to final result.

sbesson · 2022-06-28T08:31:10Z

Thanks @muhanadz for the review. 27050f2 should amend the README where most of the information about the library usage is captured at the moment.

muhanadz

Looks good to me

sbesson · 2022-06-28T10:44:56Z

Thanks @muhanadz. The new column detection behavior is now released as omero-metadata 0.11.0 🎉

sbesson added 7 commits June 2, 2022 15:23

Convert detect_headers unit tests into class

03ebc18

Allow MetadataControl.detect_header to control default NA

7ff94a2

Add unit tests for the detection of sparse columns

31d562b

Use --allow_nan flag in MetadataControl.detect_headers

1f637bd

Allow fixtures to override the value assertion with their expectations

1ffac2a

Add GNU-style hyphen-separated version of CLI arguments

fe73a17

Use --allow-nan flag to determine NA behavior in detect_headers

89b8084

sbesson requested review from will-moore and muhanadz June 5, 2022 20:23

sbesson added this to the 0.11.0 milestone Jun 10, 2022

sbesson mentioned this pull request Jun 21, 2022

Updating README #71

Merged

muhanadz reviewed Jun 23, 2022

View reviewed changes

sbesson added 2 commits June 28, 2022 09:15

Merge remote-tracking branch 'origin/master' into detect_headers_sparse

72afd32

Add note to README about the missing value handling

27050f2

sbesson requested a review from muhanadz June 28, 2022 08:31

muhanadz approved these changes Jun 28, 2022

View reviewed changes

sbesson merged commit dea8834 into ome:master Jun 28, 2022

sbesson deleted the detect_headers_sparse branch September 12, 2022 10:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix header detection for tables with sparse numerical data #77

Fix header detection for tables with sparse numerical data #77

sbesson commented Jun 5, 2022

snoopycrimecop commented Jun 6, 2022

snoopycrimecop commented Jun 11, 2022 •

edited

Loading

muhanadz left a comment •

edited

Loading

sbesson commented Jun 28, 2022

muhanadz left a comment

sbesson commented Jun 28, 2022 •

edited

Loading

Fix header detection for tables with sparse numerical data #77

Fix header detection for tables with sparse numerical data #77

Conversation

sbesson commented Jun 5, 2022

Reproducible scenario

Proposed changes

Testing

snoopycrimecop commented Jun 6, 2022

snoopycrimecop commented Jun 11, 2022 • edited Loading

muhanadz left a comment • edited Loading

Choose a reason for hiding this comment

Review summary

Review details

Using sparse_string_column.csv (4th column sparse):

Using sparse_numeric_column.csv (2nd and 4th column sparse)

sbesson commented Jun 28, 2022

muhanadz left a comment

Choose a reason for hiding this comment

sbesson commented Jun 28, 2022 • edited Loading

snoopycrimecop commented Jun 11, 2022 •

edited

Loading

muhanadz left a comment •

edited

Loading

Using `sparse_string_column.csv` (4th column sparse):

Using `sparse_numeric_column.csv (2nd and 4th column sparse)`

sbesson commented Jun 28, 2022 •

edited

Loading