Promote the Enum Labels and Ordering pattern to the Table Schema spec? #875
Comments
Would V2 be an opportunity to implement this pattern as a top-level field type rather than as an enum constraint on other types? e.g., instead of:
something like:
As a top-level type, this makes the definition much simpler / cleaner / easier to parse & implement in type systems, because you can immediately detect it is a categorical / ordinal enum type rather than needing to drill into constraints. Also, we can drop the "enum" prefix on enumLabels and enumOrdered because it's clear the attributes are being applied in an "enum" field scope (and these props don't make sense on other primitive types!). Given that we already have types like "year" implemented as top-level types rather than constraints on primitive types, I think there's an argument for categorical & ordinal enums to receive similar top-level standing given their extensive use in the biobehavioral, medical and social sciences as @pschumm referenced... The addition of a new field type like this would not violate @peterdesmet's proposed rules for spec changes (#858 comment), that is:
Thoughts? |
@khusmann It's quite a substantial change, but it would indeed be nice to give some more love to What we lose is not being able to declare a non-string
{
"name": "project_start",
"type": "enum",
"values": [
"01/03/2023",
"01/04/2023"
],
"format": "%d/%m/%Y"
}
An implementation would have to guess somehow that
Thoughts? |
@peterdesmet excellent points. Conceptually, I'm imagining the By contrast, I see an
With a definition like this, I'd expect implementations to read Perhaps the use of the The key argument here is that categorical / ordinal fields are a conceptual type distinct from string, number, date, etc. They are not just constrained string or numeric fields... they are fundamentally a different type of field with different properties, which I think affords them their own first-class field definition, rather than having their existence be implied from a validation constraint. |
I’m in favour of a |
I very much like this idea, and note that it will emphasize the correspondence between the proposed For encoded data, I presume the corresponding specification would be:
I suppose there is some ambiguity between |
Thanks for bringing up the encoded categoricals @pschumm, I think this is worth more discussion. I think the By contrast, as we've discussed before in previous threads on categoricals, numeric encodings of categoricals are physical values. When gender is encoded 1: Female, 2: Male, the logical values of the categorical field are Therefore, for the proposed categorical type, I would argue encodings should be expressed in a field more akin to
This way, the That said, I see two issues with this approach, which I outline below with potential solutions:
Conceptually, how do we identify the levels of this kind of field? Relatedly, how do we envision this variable being imported into Pandas / R / Julia? One approach could be to make it a categorical with levels ["Poor", "2", "3", "4", "Excellent"], which would be represented by the following field definition:
(Where physical values without codes simply pass through untransformed)
(Where Alternatively, we could support logical missing values in the same manner as the encoded categorical approach I'm proposing above, by adding a
This would enable (This would also work on a per-field I realize adding a
Thoughts? |
@khusmann thanks for pushing this forward. Some thoughts below. TL;DR: I think I prefer @pschumm's original approach: I find it more straightforward and think it can be simplified even further. I do agree with the correction you made, where missing values should have the physical values (
# Data are strings, values are not defined
data <- c("Male", "Man", "Male", "Female", "Lady", "Undefined")
factor(data)
#> [1] Male Man Male Female Lady Undefined
#> Levels: Female Lady Male Man Undefined
# Data are strings, values are defined (but don't map entirely)
values <- c("Male", "Man", "Female", "Lady", "Nonbinary", "Declined")
factor(data, levels = values)
#> [1] Male Man Male Female Lady <NA>
#> Levels: Male Man Female Lady Nonbinary Declined
# Data are integers, values are not defined
data <- c(1, 2, 1, 3, 4, -99)
factor(data)
#> [1] 1 2 1 3 4 -99
#> Levels: -99 1 2 3 4
# Data are integers, values are defined (but don't map entirely)
values <- c(1, 2, 3, 4, 5, 6)
factor(data, levels = values)
#> [1] 1 2 1 3 4 <NA>
#> Levels: 1 2 3 4 5 6
Created on 2024-02-16 with reprex v2.1.0
data <- c(1, 2, 1, 3, 4, -99)
values <- c(1, 2, 3, 4, 5, 6)
labels <- c("Male", "Male", "Female", "Female", "Nonbinary", "Declined")
factor(data, levels = values, labels = labels)
#> [1] Male Male Male Female Female <NA>
#> Levels: Male Female Nonbinary Declined
Created on 2024-02-16 with reprex v2.1.0
We would opt to simplify the current
Resulting syntax:
{
"fields": [
{
"name": "gender",
"type": "categorical",
"values": [1, 2, 3, 4, 5, 6],
"labels": ["Male", "Male", "Female", "Female", "Nonbinary", "Declined"],
"ordered": false,
"missingValues": ["NA"]
}
]
} |
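The collapsing shown in the R output above (six physical values, four distinct labels) can be reproduced outside R. The following TypeScript is a minimal sketch, assuming parallel `values` and `labels` arrays as in the syntax just proposed. `labelFor` and `distinctLevels` are hypothetical helper names, not part of any Frictionless implementation.

```typescript
// Hypothetical helpers for illustration; not part of any Frictionless library.

// Look up the label for one physical value; null plays the role of R's <NA>.
function labelFor(values: number[], labels: string[], physical: number): string | null {
  const index = values.indexOf(physical);
  if (index === -1) return null;
  const label = labels[index];
  return label === undefined ? null : label;
}

// Distinct labels in first-seen order, i.e. the levels left after collapsing.
function distinctLevels(labels: string[]): string[] {
  const levels: string[] = [];
  for (const label of labels) {
    if (levels.indexOf(label) === -1) levels.push(label);
  }
  return levels;
}

const genderValues = [1, 2, 3, 4, 5, 6];
const genderLabels = ["Male", "Male", "Female", "Female", "Nonbinary", "Declined"];
const genderData = [1, 2, 1, 3, 4, -99];

const decodedGender = genderData.map((v) => labelFor(genderValues, genderLabels, v));
console.log(decodedGender); // the decoded values: Male, Male, Male, Female, Female, null
console.log(distinctLevels(genderLabels)); // the levels: Male, Female, Nonbinary, Declined
```

Six declared values yield only four levels here, so a reader has to scan the labels to learn how many levels the field really has.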
I agree with @peterdesmet that values should contain the physical values present in the data—that is, as he notes, more consistent with other elements of the standard and more intuitive. It also makes the schema easier to read by a human being if the data contain a mix of fields, some represented in the file by their logical values and others represented by codes. I also like the proposed simplification of changing the The one thing I find counterintuitive would be to permit Finally, I would just note that I agree Finally, let me say how much I appreciate you guys engaging so deeply here. I feel strongly that once we arrive at a final resolution, this will have an enormous impact on the utility of Frictionless in the disciplines within which I work (and probably others). And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!). |
I find my proposal to have {
"fields": [
{
"name": "gender",
"type": "categorical",
"categories": [
{"value": 1, "label": "Male"},
{"value": 2, "label": "Male"},
{"value": 3, "label": "Female"},
{"value": 4, "label": "Female"},
{"value": 5, "label": "Nonbinary"},
{"value": 6 }
],
"ordered": false,
"missingValues": ["NA"]
}
]
}
My suggestion to allow fewer values was based on
Yes (even if we do talk about it 😄 ) 🍻 |
Love it @peterdesmet! Personally, I would prefer the restriction that we list all of the categories and don't have to secondarily include an |
I like where we're going with this! Especially @peterdesmet's proposed I agree, putting the levels in a map with I also agree with @pschumm that we should list all the values / categories rather than requiring a second (Side note: For Where I still have reservations with @peterdesmet's latest approach, however, is that from the field definition it is not immediately clear what the logical levels of the categorical should be. In the example, it at first looks like there are 6 logical levels, and it requires sorting through the labels to find that there are actually only 4, because two get collapsed. I'd argue that R's Therefore, I think we should require that
Yes, Male/Man and Female/Lady can be grouped, but they are still qualitatively different responses and therefore distinct logical levels. If a user wants to collapse those levels, they can do so in a subsequent transformation step. I'll also note that another advantage of @peterdesmet's list-of-objects approach is that logical level objects can be extended via user defined properties – for example, the text of the item in the survey question:
You'll notice I also put level 6 (Declined) in And as @peterdesmet said, when no
Would be imported in R as
I hope we can make that happen one of these days! I really appreciate everyone's engagement on this as well :) |
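To make the physical/logical distinction concrete, here is a minimal TypeScript sketch of how a reader might decode one cell of a categorical field. It rests on several assumptions that are not settled in the thread: physical values arrive as strings, a category without a `label` uses its value as the level, and an undeclared value is treated as an error. `decodeValue` is a hypothetical name.

```typescript
// Field shape follows the thread; decodeValue is a hypothetical helper.
type Category = { value: string | number; label?: string };
type CategoricalField = {
  name: string;
  type: "categorical";
  categories: Category[];
  ordered?: boolean;
  missingValues?: string[];
};

// Map one physical (as-read-from-file) string to a logical level.
// Returns null for declared missing values; throws for undeclared values.
function decodeValue(field: CategoricalField, physical: string): string | null {
  const missing = field.missingValues === undefined ? [] : field.missingValues;
  if (missing.indexOf(physical) !== -1) return null;
  for (const category of field.categories) {
    if (String(category.value) === physical) {
      // Assumption: a category without a label uses its value as the level.
      return category.label === undefined ? String(category.value) : category.label;
    }
  }
  throw new Error(`"${physical}" is not a declared category of field "${field.name}"`);
}

const genderField: CategoricalField = {
  name: "gender",
  type: "categorical",
  categories: [
    { value: 1, label: "Male" },
    { value: 2, label: "Man" },
    { value: 3, label: "Female" },
    { value: 4, label: "Lady" },
    { value: 5 },
  ],
  ordered: false,
  missingValues: ["NA", "6"],
};

console.log(decodeValue(genderField, "2")); // Man
console.log(decodeValue(genderField, "5")); // 5 (no label, so the value is used)
console.log(decodeValue(genderField, "6")); // null (declared missing)
```

Because every label is distinct in this example, the number of entries in `categories` equals the number of logical levels, which is the property argued for above.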
Ah, good to know! Then for the example I just gave we'd have:
For a REDCap export. That looks quite nice. |
👍 With the minor correction that |
All of this looks good to me. I agree with @khusmann that collapsing is a transformation; while Stata permits you to label two different integer values with the same label, those are still treated as separate analytically and appear separately in output (just with the same label). Note that in the discussion above, both @khusmann and @peterdesmet are using field-specific
I don't know if this addresses your comment above, but other field types already invoke validation (e.g., a field with type |
Regarding |
Agreed! I am also strongly in favor of field-specific
Also agree. Well said. One more thought – do we want to offer the shortcut of string levels? So:
{
"fields": [
{
"name": "gender",
"type": "categorical",
"categories": [
"Male",
"Man",
"Female",
"Lady",
"Nonbinary"
],
"ordered": false,
"missingValues": ["NA", "Declined"]
}
]
}
would be syntactic sugar for:
{
"fields": [
{
"name": "gender",
"type": "categorical",
"categories": [
{ "value": "Male" },
{ "value": "Man" },
{ "value": "Female" },
{ "value": "Lady" },
{ "value": "Nonbinary" }
],
"ordered": false,
"missingValues": ["NA", "Declined"]
}
]
}
In summary, this would make the complete type signature of the proposed field as follows:
type CategoricalField = {
name: string,
title?: string,
description?: string,
example?: string,
format?: "default",
type: "categorical",
categories: ({ value: string | number, label?: string } | string)[],
ordered?: boolean,
constraints?: {
"required"?: boolean,
"unique"?: boolean
},
missingValues?: string[]
} |
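The "syntactic sugar" reading of the string shortcut can be sketched as a small normalization step: an implementation expands each bare string into the object form before doing anything else. This is a minimal TypeScript illustration, not an agreed implementation. The type names `CategoryObject` and `CategoryEntry` and the function `normalizeCategories` are hypothetical.

```typescript
// Hypothetical names; only the shape of `categories` comes from the thread.
type CategoryObject = { value: string | number; label?: string };
type CategoryEntry = CategoryObject | string;

// Expand the string shortcut into the explicit object form.
function normalizeCategories(categories: CategoryEntry[]): CategoryObject[] {
  return categories.map((entry) =>
    typeof entry === "string" ? { value: entry } : entry
  );
}

const shortcutCategories: CategoryEntry[] = ["Male", "Man", "Female", "Lady", "Nonbinary"];
const expandedCategories = normalizeCategories(shortcutCategories);

console.log(JSON.stringify(expandedCategories[0])); // {"value":"Male"}
console.log(expandedCategories.length); // 5
```

After this step the rest of a reader only has to handle one shape, which limits the cost of the union type to a single function.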
Thanks for suggesting this @khusmann; I was so focused on the other details that it didn't even occur to me. Indeed, as I think I've mentioned before, this is my preferred way to distribute data (i.e., labels rather than integer codes) since it makes them useable with the broadest range of software. So this simplified specification would be very nice (not to mention very readable). I definitely favor including this option. |
My 2 cents:
|
Correction: I guess the shortcut is a case of a union type (array of strings vs array of objects). Other than my readability concerns, it does allow backward compatibility for missingValues. In general, union types offer a lot of flexibility to keep things backward compatible and are often elegant (e.g. not having to add a new |
Just added comments to #873 re: union types. There, I argue that the kind of "union type syntactic sugar" I'm proposing here should be considered on a case-by-case basis. I would argue that the proposed syntactic sugar here is worth using because:
|
With this proposal (I'd call it inline-categories), the list of categories must be given on each field again and again. If categories have many values and/or if they are used in multiple fields, it may make sense to allow referencing {
"name": "suportStatement1",
"type": "categorical",
"categories": "agreementLevel"
}, {
"name": "suportStatement2",
"type": "categorical",
"categories": "agreementLevel"
}
and elsewhere
{
"categoryTypes": {
"agreementLevel": [
{ "value": 1, "label": "Strongly Disagree" },
{ "value": 2 },
{ "value": 3 },
{ "value": 4 },
{ "value": 5, "label": "Strongly Agree" }
]
}
} The value of See Codes and Codelists in Avram schema language for the same idea (codes == categories). |
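The referencing idea can be sketched as a lookup step performed before a field is used. The TypeScript below is a minimal illustration under stated assumptions: `categories` is either an inline list or a string key, and the shared lists live in a `categoryTypes` object as in the example above. Where that object sits in the descriptor is not decided in the thread, so it is passed in explicitly. `resolveCategories` and the type names are hypothetical.

```typescript
// Hypothetical types and helper; the property names follow the example above.
type SharedCategory = { value: string | number; label?: string };
type CategoryTypes = { [name: string]: SharedCategory[] };
type ReferencingField = {
  name: string;
  type: "categorical";
  categories: SharedCategory[] | string;
};

// Return the inline list, or look up the named list in categoryTypes.
function resolveCategories(field: ReferencingField, categoryTypes: CategoryTypes): SharedCategory[] {
  const ref = field.categories;
  if (typeof ref !== "string") return ref;
  const shared = categoryTypes[ref];
  if (shared === undefined) {
    throw new Error(`Unknown category type "${ref}" in field "${field.name}"`);
  }
  return shared;
}

const sharedCategoryTypes: CategoryTypes = {
  agreementLevel: [
    { value: 1, label: "Strongly Disagree" },
    { value: 2 },
    { value: 3 },
    { value: 4 },
    { value: 5, label: "Strongly Agree" },
  ],
};

const supportFields: ReferencingField[] = [
  { name: "suportStatement1", type: "categorical", categories: "agreementLevel" },
  { name: "suportStatement2", type: "categorical", categories: "agreementLevel" },
];

const resolvedLists = supportFields.map((f) => resolveCategories(f, sharedCategoryTypes));
console.log(resolvedLists[0] === resolvedLists[1]); // true: both fields share one list
```

Inline lists pass through unchanged, so the inline-categories form stays valid and the reference form is purely additive.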
@nichtich Agreed -- I think that's exactly the direction we want to go, and "inline-categories" gives us the first step. |
Recall that #888 covers similar ground; we might want to consider both ideas together. |
DONE in #68 |
This pattern includes two additions to the Table Schema to facilitate working with categorical data across a broad range of commonly used analytic software. It is fully backward compatible and would substantially increase the usability of Frictionless data packages in biomedical, epidemiological and social research. It has been discussed and revised extensively here.