Promote the Enum Labels and Ordering pattern to the Table Schema spec? #875

Closed
pschumm opened this issue Jan 30, 2024 · 24 comments

@pschumm
Contributor

pschumm commented Jan 30, 2024

This pattern includes two additions to the Table Schema to facilitate working with categorical data across a broad range of commonly used analytic software. It is fully backward compatible and would substantially increase the usability of Frictionless data packages in biomedical, epidemiological and social research. It has been discussed and revised extensively here.

@khusmann
Contributor

khusmann commented Feb 14, 2024

Would V2 be an opportunity to implement this pattern as a top-level field type rather than as an enum constraint on other types? E.g., instead of:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "string",
      "constraints": {
        "enum": [
          "Poor",
          "Fair",
          "Good",
          "Very good",
          "Excellent"
        ]
      },
      "enumOrdered": true
    }
  ],
  "missingValues": ["Don't know","Refused","Not applicable"]
}

something like:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "enum",
      "values": [
        "Poor",
        "Fair",
        "Good",
        "Very good",
        "Excellent"
      ],
      "ordered": true
    }
  ],
  "missingValues": ["Don't know","Refused","Not applicable"]
}

As a top-level type, this makes the definition much simpler / cleaner / easier to parse & implement in type systems, because you can immediately detect it is a categorical / ordinal enum type rather than needing to drill into constraints. Also, we can drop the "enum" prefix on enumLabels and enumOrdered because it's clear the attributes are being applied in an "enum" field scope. (and these props don't make sense on other primitive types!)

Given that we already have types like "year" implemented as top-level types rather than constraints on primitive types, I think there's an argument for categorical & ordinal enums to receive similar top-level standing given their extensive use in the biobehavioral, medical and social sciences as @pschumm referenced...

The addition of a new field type like this would not violate @peterdesmet 's proposed rules for spec changes (#858 comment), that is:

  1. It would not invalidate previous datapackage.json files
  2. A datapackage with the new enum field type would be invalid for software that did not support the new field type yet, but this is OK

Thoughts?

@peterdesmet
Member

@khusmann It's quite a substantial change, but it would indeed be nice to give some more love to enum, making values, labels and ordered properties at the same level as bareNumber and groupChar are for numbers.

What we lose is the ability to declare a non-string type, like date. Let's say project_start is a non-ISO date and can only have 2 values (an enum; min/max is no replacement):

  {
      "name": "project_start",
      "type": "enum",
      "values": [
        "01/03/2023",
        "01/04/2023"
      ],
      "format": "%d/%m/%Y"
   }

An implementation would have to guess somehow that project_start is to be interpreted as a date field. The alternatives are:

  • Leave type as date, but lift enum up from constraints as values.
  • Use type enum, but only allow string values.

Thoughts?

@khusmann
Contributor

khusmann commented Feb 14, 2024

@peterdesmet excellent points.

Conceptually, I'm imagining the enum type to be specifically for representing categorical / ordinal variables, that is, a field with a distinct set of levels that may or may not be ordered. So I would argue we should only allow string values in enum types, where they act as labels to represent these abstract levels.

By contrast, I see an enum constraint to be a validation rule on an existing type. So in your project_start example, I'd represent this the traditional way -- it's a date field, but with a validation rule:

  {
      "name": "project_start",
      "type": "date",
      "constraints": {
        "enum": ["01/03/2023", "01/04/2023"]
      },
      "format": "%d/%m/%Y"
   }

With a definition like this, I'd expect implementations to read project_start as a date type, not as a categorical / factor type. It's a "date with validation constraints"; not a categorical / ordinal variable.

Perhaps using the enum keyword in both these contexts is confusing, and we should use a different name for the type? (e.g. "type": "categorical")

The key argument here is that categorical / ordinal fields are a conceptual type distinct from string, number, date, etc. They are not just constrained string or numeric fields... they are fundamentally a different type of field with different properties, which I think affords them their own first-class field definition, rather than having their existence be implied from a validation constraint.

@peterdesmet
Member

I'm in favour of "type": "categorical" as a concept separate from constraints.enum. It's non-breaking and will make implementation more contained and straightforward.

@pschumm
Contributor Author

pschumm commented Feb 15, 2024

I very much like this idea, and note that it will emphasize the correspondence between the proposed categorical type in Frictionless and the dtype category in Pandas, factors in R, or a CategoricalVector in Julia. It will also simplify the necessary code, a point @khusmann made above.

For encoded data, I presume the corresponding specification would be:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [1,2,3,4,5],
      "ordered": true,
      "labels": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent"
      }
    }
  ],
  "missingValues": ["Don't know","Refused","Not applicable"]
}

I suppose there is some ambiguity between "values": [1,2,3,4,5] and "values": ["1","2","3","4","5"]; the quotes around the keys in the labels property being necessary merely due to the JSON spec. But I think this is a very minor issue that can be dealt with in a note to implementors (my instinct would be to permit either).
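
For illustration only, here is a minimal Python sketch of how an implementation might permit either form, by comparing values through their string representation. The helper name `label_for` is hypothetical and not part of any Frictionless library, and float values such as 1.0 would need extra handling that is not shown here.

```python
def label_for(value, labels):
    """Look up a label whether the value arrives as 1 or "1".

    JSON object keys are always strings, so the lookup key is
    normalized with str() before consulting the labels mapping.
    Returns None when no label is defined for the value.
    """
    return labels.get(str(value))


# Labels keyed by strings, as JSON requires.
labels = {"1": "Poor", "2": "Fair", "3": "Good", "4": "Very good", "5": "Excellent"}

poor_from_int = label_for(1, labels)
poor_from_str = label_for("1", labels)
```

With this normalization, `"values": [1,2,3,4,5]` and `"values": ["1","2","3","4","5"]` resolve to the same labels.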

@khusmann
Contributor

khusmann commented Feb 15, 2024

Thanks for bringing up the encoded categoricals @pschumm, I think this is worth more discussion.

I think the values array in the categorical type should always be specified as logical values, that is, string labels that represent the abstract levels of the categorical. This would also have the nice side effect of always defining the levels of the Pandas / R / Julia categorical type when it is imported.

By contrast, as we've discussed before in previous threads on categoricals, numeric encodings of categoricals are physical values. When gender is encoded 1: Female, 2: Male, the logical values of the categorical field are ["Female", "Male"], even though its physical values are ["1", "2"].

Therefore, for the proposed categorical type, I would argue encodings should be expressed in a field more akin to trueValues and falseValues on boolean types, that is, mappings from logical values to physical values. Something like:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
           "Poor",
           "Fair",
           "Good",
           "Very good",
           "Excellent"
       ],
       "ordered": true,
       "codes": {
          "Poor": "1",
          "Fair": "2",
          "Good": "3",
          "Very good": "4",
          "Excellent": "5"
       }
    }
  ]
}

This way, the values field is always holding logical values. (and we can always use those as the names of levels when importing into Pandas / R / Julia / etc.)

That said, I see two issues with this approach, which I outline below with potential solutions:

  1. Partially labeled scales. Some scales do not have labels for all their levels. For example, suppose physical_health only had "Poor" and "Excellent" anchors, and the rest of the levels were unnamed. (e.g. the question was "On a scale from 1 to 5, 1 being Poor, and 5 being Excellent, how do you rate your health?")

Conceptually, how do we identify the levels of this kind of field? Relatedly, how do we envision this variable being imported into Pandas / R / Julia?

One approach could be to make it a categorical with levels ["Poor", "2", "3", "4", "Excellent"], which would be represented by the following field definition:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
           "Poor",
           "2",
           "3",
           "4",
           "Excellent"
      ],
      "ordered": true,
      "codes": {
        "Poor": "1",
        "Excellent": "5"
      }
    }
  ]
}

(Where physical values without codes simply pass through untransformed)
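
As a rough sketch of the pass-through behaviour described above (an illustration under assumptions, not a reference implementation), decoding could invert the `codes` mapping and fall back to the physical value when no code is defined. The function name `decode` is hypothetical.

```python
def decode(physical, codes):
    """Map a physical value to its logical level.

    `codes` maps logical -> physical, as in the field definition above,
    so it is inverted first. Physical values without a code pass through
    unchanged.
    """
    physical_to_logical = {code: level for level, code in codes.items()}
    return physical_to_logical.get(physical, physical)


# Only the two anchors of the scale carry a code.
codes = {"Poor": "1", "Excellent": "5"}
column = ["1", "3", "5"]
decoded = [decode(value, codes) for value in column]
```

Here "1" and "5" become "Poor" and "Excellent", while "3" is kept as is.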

  2. Labeled missingness. In the current spec, missingValues are always defined as physical values. So I think @pschumm 's earlier example (using the labels approach) would actually look something like:
{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [1,2,3,4,5],
      "ordered": true,
      "labels": {
        "1": "Poor",
        "2": "Fair",
        "3": "Good",
        "4": "Very good",
        "5": "Excellent",
        "-97": "Don't know",
        "-98": "Refused",
        "-99": "Not applicable"
      }
    }
  ],
  "missingValues": ["-97", "-98", "-99"]
}

(Where missingValues now holds physical values (codes), instead of logical values (labels)) (edit: changed missing values to all negative for clarity)

Alternatively, we could support logical missing values in the same manner as the encoded categorical approach I'm proposing above, by adding a missingCodes field:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
           "Poor",
           "Fair",
           "Good",
           "Very good",
           "Excellent"
      ],
      "ordered": true,
      "codes": {
        "Poor": "1",
        "Fair": "2",
        "Good": "3",
        "Very good": "4",
        "Excellent": "5"
      }
    }
  ],
  "missingValues": ["Don't know","Refused","Not applicable"],
  "missingCodes": {
    "Don't know": "-97",
    "Refused": "-98",
    "Not applicable": "-99"
  }
}

This would enable missingValues to now be specified with logical values, and mirror the behavior of the categorical type. I think this is actually really nice and consistent, because missing values are indeed a categorical type!

(This would also work on a per-field missingValues basis as well, and be useful for all field types, not just categorical fields)

I realize adding a missingCodes field would be another big change but I think it is intertwined with the spec for categorical types for the reasons I mention above. That said, it is still within @peterdesmet 's proposed rules for V2 spec changes:

  1. It would not invalidate previous datapackage.json files
  2. A datapackage with the new missingCodes field would be invalid for software that did not support the new field yet, but this is OK

Thoughts?

@peterdesmet
Member

@khusmann thanks for pushing this forward. Some thoughts below.

TL;DR: I think I prefer @pschumm's original approach: I find it more straightforward and think it can be simplified even further. I do agree with the correction you made, where missing values should have the physical values ("missingValues": ["-97", "-98", "-99"]).

  1. I think values should hold the physical values. This is already the case for trueValues and falseValues (e.g. lists the physical "True" for logical true) and missingValues (e.g. lists the physical "-99" for logical null) and it would be good to align with that.

  2. It is still possible to derive levels directly from values:

  • It would be the values as they appear in the data (i.e. physical).
  • It can list more values than are present in the data.
  • It can list fewer values than are present in the data. Those extra values would be considered null.
  • It is an optional property. Default behaviour could be similar to the default behaviour of the levels parameter in factor():

    The default is the unique set of values taken by as.character(x), sorted into increasing order of x.

# Data are strings, values are not defined
data <- c("Male", "Man", "Male", "Female", "Lady", "Undefined")
factor(data)
#> [1] Male      Man       Male      Female    Lady      Undefined
#> Levels: Female Lady Male Man Undefined
# Data are strings, values are defined (but don't map entirely)
values <- c("Male", "Man", "Female", "Lady", "Nonbinary", "Declined")
factor(data, levels = values)
#> [1] Male   Man    Male   Female Lady   <NA>  
#> Levels: Male Man Female Lady Nonbinary Declined

# Data are integers, values are not defined
data <- c(1, 2, 1, 3, 4, -99)
factor(data)
#> [1] 1   2   1   3   4   -99
#> Levels: -99 1 2 3 4
# Data are integers, values are defined (but don't map entirely)
values <- c(1, 2, 3, 4, 5, 6)
factor(data, levels = values)
#> [1] 1    2    1    3    4    <NA>
#> Levels: 1 2 3 4 5 6

Created on 2024-02-16 with reprex v2.1.0

  3. You might notice that some levels should be bundled (e.g. Male/Man, Female/Lady). This can be achieved with the labels property. factor() in R supports this with the labels parameter, where:

    Duplicated values in labels can be used to map different values of x to the same factor level.

data <- c(1, 2, 1, 3, 4, -99)
values <- c(1, 2, 3, 4, 5, 6)
labels <- c("Male", "Male", "Female", "Female", "Nonbinary", "Declined")
factor(data, levels = values, labels = labels)
#> [1] Male   Male   Male   Female Female <NA>  
#> Levels: Male Female Nonbinary Declined

Created on 2024-02-16 with reprex v2.1.0

We could opt to simplify the current value: label proposal for labels to an array in the same order and with as many elements as values. That would also avoid the 9 vs "9" issue. I'm not sure whether there is functionality we would lose, or whether it is clear enough.

  4. If we consider any value in the data not listed in values as null, then there is likely no need for missingValues either:
  • The field could be used as it is in any other field, to directly consider a physical value as null, before any further processing steps.
  • If you want to guarantee that any data value is also listed, you can use constraints.enum
  • Or we could have more requirements for values. Personally I prefer that it is optional and doesn't need to encompass all present values.

Resulting syntax:

{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "values": [1, 2, 3, 4, 5, 6],
      "labels": ["Male", "Male", "Female", "Female", "Nonbinary", "Declined"],
      "ordered": false,
      "missingValues": ["NA"]
    }
  ]
}
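
To make the parallel-array semantics concrete, here is a small Python sketch that mirrors the factor(data, levels = values, labels = labels) call shown earlier. It is only an illustration: `apply_labels` is a hypothetical name, and treating unlisted values as null (None) is the behaviour proposed in this comment, not settled spec.

```python
def apply_labels(data, values, labels):
    """Replace each physical value with the label at the same position.

    Values absent from `values` become None, mirroring <NA> in R.
    Repeated labels collapse several physical values into one level.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    lookup = dict(zip(values, labels))
    return [lookup.get(item) for item in data]


values = [1, 2, 3, 4, 5, 6]
labels = ["Male", "Male", "Female", "Female", "Nonbinary", "Declined"]
data = [1, 2, 1, 3, 4, -99]
result = apply_labels(data, values, labels)
```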

@pschumm
Contributor Author

pschumm commented Feb 16, 2024

I agree with @peterdesmet that values should contain the physical values present in the data—that is, as he notes, more consistent with other elements of the standard and more intuitive. It also makes the schema easier to read by a human being if the data contain a mix of fields, some represented in the file by their logical values and others represented by codes.

I also like the proposed simplification of changing the labels property to an array of the same length as values. The only problem I see is that it will be more difficult to read in cases where the number of values/labels is large (e.g., prescription medications in a drug database). I believe this is the exception rather than the rule, but even in the case of a modest number of value/labels (e.g., ~6 or more), it could make manual edits to a schema more error-prone. I could go either way here.

The one thing I find counterintuitive would be to permit values to have fewer values than are present in the data, once all values declared in missingValues have been accounted for. IMO that would make the schema more difficult to read and interpret, and perhaps more importantly, would permit potentially serious errors to pass silently during validation. More values than are present in the data, sure, that makes sense. But fewer values strikes me as being too implicit. What would the harm be in requiring that all of the observed values be present in the case of a categorical variable?

I would also note that I agree missingValues should always contain physical values (as they do now). The example I gave above was intentional; for example, this is how REDCap exports data by default (i.e., categorical variables get exported with their numeric codes, except for defined missing values which are represented by their labels). So we should accommodate that case even if we wouldn't choose to write data that way. Thus, in the example here, software such as Python or R that cannot represent multiple types of missing values would ignore (i.e., treat as null) the values in missingValues, while software such as Stata, SAS or SPSS that can represent multiple types of missing values could automatically incorporate the values in missingValues as extended missing values (Stata or SAS) or negative integers (SPSS).

Finally, let me say how much I appreciate you guys engaging so deeply here. I feel strongly that once we arrive at a final resolution, this will have an enormous impact on the utility of Frictionless in the disciplines within which I work (and probably others). And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!).

@peterdesmet
Member

I find my proposal to have values and labels to be two arrays of the same length a bit clunky and hard to read (especially for many values). I think we can combine it into one property:

{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        {"value": 1, "label": "Male"},
        {"value": 2, "label": "Male"},
        {"value": 3, "label": "Female"},
        {"value": 4, "label": "Female"},
        {"value": 5, "label": "Nonbinary"},
        {"value": 6 }
      ],
      "ordered": false,
      "missingValues": ["NA"]
    }
  ]
}
  • categories as a name aligns well with "type": "categorical".
  • categories (levels/values) is still an array, so it is possible to order them (ordered).
  • value is the physical value. In contrast with the first proposal for labels/enumLabels, values don't need to be wrapped in double quotes, since they are not keys.
  • label is directly associated with value, vastly improving readability.
  • label should be optional (since not all data providers will want to provide it), but now it is optional at a value level. Implementations should just use the value if a label is not provided.
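
The fallback in the last bullet could look roughly like this in Python. This is a sketch under the assumptions of this comment only; `level_names` is a hypothetical helper, and converting the value with str() is one possible choice for naming unlabeled levels.

```python
def level_names(categories):
    """Return the display name for each category, in declared order.

    Uses the label when one is given and falls back to the physical
    value (converted to a string) otherwise.
    """
    return [str(category.get("label", category["value"])) for category in categories]


categories = [
    {"value": 1, "label": "Male"},
    {"value": 2, "label": "Male"},
    {"value": 3, "label": "Female"},
    {"value": 4, "label": "Female"},
    {"value": 5, "label": "Nonbinary"},
    {"value": 6},  # no label: the value itself is used
]
names = level_names(categories)
```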

But fewer values strikes me as being too implicit. What would the harm be in requiring that all of the observed values be present in the case of a categorical variable?

My suggestion to allow fewer values was based on factor() being able to deal with those. We already have constraints.enum to validate unexpected values, but it might be good to include that functionality in categories as well. I'm a bit on the fence if it is a good design decision to have both constraints.enum and categories as methods of defining that or if we should reserve validation for constraints.enum only.

And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!).

Yes (even if we do talk about it 😄 ) 🍻

@pschumm
Contributor Author

pschumm commented Feb 16, 2024

Love it @peterdesmet! Personally, I would prefer the restriction that we list all of the categories and don't have to secondarily include an enum property to do validation, but I could live with a group decision on that.

@khusmann
Contributor

khusmann commented Feb 16, 2024

I like where we're going with this! Especially @peterdesmet 's proposed categories prop!

I agree, putting the levels in a map with value props greatly improves readability and is a lot closer in function to trueValues than what I was proposing earlier.

I also agree with @pschumm that we should list all the values / categories rather than requiring a second enum constraint. Per @peterdesmet 's point, although R's factor allows the data to have values not specified in levels, the "fixed" version fct in forcats does not. I definitely prefer the conservative / strict / explicit approach here.

(Side note: For boolean types, do we consider it a validation error if a value comes up that is not contained in true/falseValues? I cannot find mention in the spec…)

Where I still have reservations with @peterdesmet 's latest approach, however, is that from the field definition it is not immediately clear what the logical levels of the categorical should be. In the example, it at first looks like there are 6 logical levels, and it requires sorting through the labels to find that there are actually only 4, because two get collapsed.

I'd argue that R's factor usage of labels as a way to collapse levels is a transformation of the data, rather than a description of it. (And note that forcats' more strict implementation fct also does not allow collapsing via labels for this reason to encourage explicit use of fct_collapse instead).

Therefore, I think we should require that labels be unique, so that we always have a 1-1 correspondence between items in the categories array, and logical levels of the resulting categorical:

{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        {"value": 1, "label": "Male"},
        {"value": 2, "label": "Man"},
        {"value": 3, "label": "Female"},
        {"value": 4, "label": "Lady"},
        {"value": 5, "label": "Nonbinary"}
      ],
      "ordered": false,
      "missingValues": ["NA", "6"]
    }
  ]
}

Yes, Male/Man and Female/Lady can be grouped, but they are still qualitatively different responses and therefore distinct logical levels. If a user wants to collapse those levels, they can do so in a subsequent transformation step.
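
A validator could enforce the uniqueness requirement with a check along these lines. This is a hedged Python sketch, not an existing Frictionless API: `duplicate_labels` is a hypothetical name, and unlabeled categories are assumed to be named by their value.

```python
from collections import Counter


def duplicate_labels(categories):
    """Return the level names that occur more than once, sorted.

    An empty result means there is a 1-1 correspondence between
    entries in `categories` and logical levels.
    """
    names = [str(c.get("label", c["value"])) for c in categories]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


collapsed = [
    {"value": 1, "label": "Male"},
    {"value": 2, "label": "Male"},
    {"value": 3, "label": "Female"},
    {"value": 4, "label": "Female"},
    {"value": 5, "label": "Nonbinary"},
]
distinct = [
    {"value": 1, "label": "Male"},
    {"value": 2, "label": "Man"},
    {"value": 3, "label": "Female"},
    {"value": 4, "label": "Lady"},
    {"value": 5, "label": "Nonbinary"},
]
```

The `collapsed` definition would be rejected because "Male" and "Female" each appear twice, while `distinct` passes.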

I'll also note that another advantage of @peterdesmet's list-of-objects approach is that logical level objects can be extended via user defined properties – for example, the text of the item in the survey question:

{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "description": "Select the statement you most agree with",
      "categories": [
        {"value": 1, "label": "Male", "text": "I identify as male"},
        {"value": 2, "label": "Man", "text": "I identify as a man"},
        {"value": 3, "label": "Female", "text": "I identify as female"},
        {"value": 4, "label": "Lady", "text": "I identify as a lady"},
        {"value": 5, "label": "Nonbinary", "text": "I identify as nonbinary"}
      ],
      "ordered": false,
      "missingValues": ["NA", "6"]
    }
  ]
}

You'll notice I also put level 6 (Declined) in missingValues. I think it should go here instead of categories because it should not be considered one of the logical values of the categorical… it is a missing value instead. Yes, we lose the label, but given this current direction, I think we can safely make "missing labels" a separate proposal / discussion. (We will want "missing labels" to be available for all field types, not just categorical fields!)

And as @peterdesmet said, when no label is given, implementations can just use value. For example:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
           { "value": 1, "label": "Poor" },
           { "value": 2 },
           { "value": 3 },
           { "value": 4 },
           { "value": 5, "label": "Excellent" }
      ],
      "ordered": true
    }
  ]
}

Would be imported in R as factor(c("Poor", "2", "3", "4", "Excellent"))

And I hope if we're ever physically together we can still enjoy sharing a pint together (with no talk of categoricals or value labels!).

I hope we can make that happen one of these days! I really appreciate everyone's engagement on this as well :)

@khusmann
Contributor

khusmann commented Feb 16, 2024

The example I gave above was intentional; for example, this is how REDCap exports data by default (i.e., categorical variables get exported with their numeric codes, except for defined missing values which are represented by their labels).

Ah, good to know! Then for the example I just gave we'd have:

{
  "fields": [
    {
      "name": "physical_health",
      "type": "categorical",
      "values": [
           { "value": 1, "label": "Poor" },
           { "value": 2 },
           { "value": 3 },
           { "value": 4 },
           { "value": 5, "label": "Excellent" }
      ],
      "ordered": true,
      "missingValues": ["Don't know","Refused","Not applicable"]
    }
  ]
}

For a REDCap export. That looks quite nice.

@peterdesmet
Member

👍 With the minor correction that values is categories.

@pschumm
Contributor Author

pschumm commented Feb 17, 2024

All of this looks good to me. I agree with @khusmann that collapsing is a transformation; while Stata permits you to label two different integer values with the same label, those are still treated as separate analytically and appear separately in output (just with the same label).

Note that in the discussion above, both @khusmann and @peterdesmet are using field-specific missingValues (which at present are not part of the spec). This reinforces my original contention that the issue of field-specific missingValues is closely related to efficient description of categorical variables (in fact, I had originally included it in the pattern but then dropped it to simplify things and because it had already been proposed as a separate pattern). The proposal here would still work without field-specific missingValues, but not quite as well. So I'd like to put in a plug for tackling #861 too.

We already have constraints.enum to validate unexpected values, but it might be good to include that functionality in categories as well. I'm a bit on the fence if it is a good design decision to have both constraints.enum and categories as methods of defining that or if we should reserve validation for constraints.enum only.

I don't know if this addresses your comment above, but other field types already invoke validation (e.g., a field with type integer has to be an integer, even without specifying any further constraints). IMO the value of what we're doing here is defining a categorical variable as a first class type (not just a string with constraints), so for me at least, it doesn't seem inconsistent for it to invoke validation. Perhaps I'm missing something here.

@peterdesmet
Member

Regarding field.missingValues: there is a PR for v2 now: #24

@khusmann
Contributor

This reinforces my original contention that the issue of field-specific missingValues is closely related to efficient description of categorical variables
The proposal here would still work without field-specific missingValues, but not quite as well.

Agreed! I am also strongly in favor of field-specific missingValues for these reasons.

IMO the value of what we're doing here is defining a categorical variable as a first class type (not just a string with constraints), so for me at least, it doesn't seem inconsistent for it to invoke validation.

Also agree. Well said.


One more thought – do we want to offer the shortcut of string levels? So:

{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        "Male",
        "Man",
        "Female",
        "Lady",
        "Nonbinary"
      ],
      "ordered": false,
      "missingValues": ["NA", "Declined"]
    }
  ]
}

would be syntactic sugar for:

{
  "fields": [
    {
      "name": "gender",
      "type": "categorical",
      "categories": [
        { "value": "Male" },
        { "value": "Man" },
        { "value": "Female" },
        { "value": "Lady" },
        { "value": "Nonbinary" }
      ],
      "ordered": false,
      "missingValues": ["NA", "Declined"]
    }
  ]
}
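
On the implementation side, the shortcut could be expanded once at load time so the rest of the code only sees the object form. A minimal Python sketch of that idea, with `normalize_categories` as a hypothetical name:

```python
def normalize_categories(categories):
    """Expand the string shortcut into the object form.

    Plain entries become {"value": entry}; entries that are already
    objects are kept as they are, so mixed arrays also work.
    """
    return [
        category if isinstance(category, dict) else {"value": category}
        for category in categories
    ]


shortcut = ["Male", "Man", "Female", "Lady", "Nonbinary"]
expanded = normalize_categories(shortcut)
```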

In summary, this would make the complete type signature of the proposed field as follows:

type CategoricalField = {
  name: string,
  title?: string,
  description?: string,
  example?: string,
  format?: "default",
  type: "categorical",
  categories: ({ value: string | number, label?: string } | string)[],
  ordered?: boolean,
  constraints?: {
    "required"?: boolean,
    "unique"?: boolean
  },
  missingValues?: string[]
}

@pschumm
Contributor Author

pschumm commented Feb 18, 2024

Thanks for suggesting this @khusmann; I was so focused on the other details that it didn't even occur to me. Indeed, as I think I've mentioned before, this is my preferred way to distribute data (i.e., labels rather than integer codes) since it makes them useable with the broadest range of software. So this simplified specification would be very nice (not to mention very readable). I definitely favor including this option.

@peterdesmet
Member

peterdesmet commented Feb 19, 2024

My 2 cents:

  1. I'm fine with validation for categories and not requiring constraints.enum to do that. As in: all values in the data should be present in categories or it is invalid.
  2. I'm not a big fan of the shortcut. It is readable indeed, but:
    • In the shortcut it is less clear to data publishers if the array should contain values or labels, properties that are easy to confuse to begin with. I think that actually hampers readability.
    • It is more complex for implementations.
    • Are there precedents for allowing two approaches for the same thing in the spec?
    • I think we want to avoid multiple approaches (see Discourage usage of unnecessary union types #873) going forward. And this one goes a bit further than union types.
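
For point 1, the validation rule could be sketched in Python as follows. This is only an illustration of the rule as stated here: `invalid_values` is a hypothetical helper, cells are assumed to arrive as strings (as read from a CSV), and values listed in missingValues are assumed to be excluded before the check.

```python
def invalid_values(data, categories, missing_values=()):
    """Return the cells that are neither a declared category nor missing.

    Category values are compared as strings because physical values
    read from a CSV file are strings.
    """
    allowed = {str(category["value"]) for category in categories}
    missing = set(missing_values)
    return [cell for cell in data if cell not in missing and cell not in allowed]


categories = [{"value": 1}, {"value": 2}, {"value": 3}]
column = ["1", "2", "NA", "7", "3"]
errors = invalid_values(column, categories, missing_values=["NA"])
```

An empty result would mean the column is valid; here "7" is reported.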

@peterdesmet
Member

Correction: I guess the shortcut is a case of a union type (array of strings vs array of objects). Other than my readability concerns, it does allow backward compatibility for missingValues.

In general, union types offer a lot of flexibility to keep things backward compatible and are often elegant (e.g. not having to add a new roles property over role). I think we need a higher-level discussion on whether we want to allow or discourage these (#873) before we can move this further.

@khusmann
Contributor

khusmann commented Feb 19, 2024

In general, union types offer a lot of flexibility to keep things backward compatible and are often elegant (e.g. not having to add a new roles property over role). I think we need a higher-level discussion on whether we want to allow or discourage these (#873) before we can move this further.

Just added comments to #873 re: union types. There, I argue that the kind of "union type syntactic sugar" I'm proposing here should be considered on a case-by-case basis. I would argue that the proposed syntactic sugar here is worth using because:

  1. It's a special case, so won't confuse other parts of the spec

  2. Actually makes things more consistent broadly across the spec, because it would match the signature of missingValues (assuming we accept Support for labeled missingness #880 as well)

  3. As @pschumm mentioned, categorical fields with meaningful string physical values are extremely common, probably more so than encoded categoricals. So I think providing an easy shortcut to this definition makes a lot of sense.

khusmann added a commit to khusmann/datapackage-v2-draft that referenced this issue Apr 2, 2024
@nichtich
Contributor

nichtich commented Apr 3, 2024

With this proposal (I'd call it inline-categories), the list of categories must be given on each field again and again. If categories have many values and/or if they are used in multiple fields, it may make sense to allow referencing:

{
  "name": "suportStatement1",
  "type": "categorical",
  "categories": "agreementLevel"
}, {
  "name": "suportStatement2",
  "type": "categorical",
  "categories": "agreementLevel"
}

and elsewhere

{
  "categoryTypes": {
   "agreementLevel": [
      { "value": 1, "label": "Strongly Disagree" },
      { "value": 2 },
      { "value": 3 },
      { "value": 4 },
      { "value": 5, "label": "Strongly Agree" }
    ]
  }
}

The value of categories could also be a URI to reference a large external list of allowed values.

See Codes and Codelists in Avram schema language for the same idea (codes == categories).
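
A resolver for such named references might look like the Python sketch below. It assumes the `categoryTypes` property name used in the example above, which is only a suggestion in this thread; `resolve_categories` is hypothetical, and the URI case is not handled.

```python
def resolve_categories(field, category_types):
    """Return the category list for a field.

    A string is treated as a name to look up in `category_types`;
    anything else is assumed to be an inline list and returned as is.
    An unknown name raises KeyError.
    """
    categories = field["categories"]
    if isinstance(categories, str):
        return category_types[categories]
    return categories


category_types = {
    "agreementLevel": [
        {"value": 1, "label": "Strongly Disagree"},
        {"value": 2},
        {"value": 3},
        {"value": 4},
        {"value": 5, "label": "Strongly Agree"},
    ]
}
referenced = {"name": "statement", "type": "categorical", "categories": "agreementLevel"}
inline = {"name": "answer", "type": "categorical", "categories": [{"value": "Yes"}, {"value": "No"}]}
```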

@khusmann
Contributor

khusmann commented Apr 3, 2024

@nichtich Agreed -- I think that's exactly the direction we want to go, and "inline-categories" gives us the first step.

@pschumm
Contributor Author

pschumm commented Apr 11, 2024

If categories have many values and/or if they are used in multiple fields, it may make sense to allow referencing

Recall that #888 covers similar ground; we might want to consider both ideas together.

@roll
Member

roll commented Jun 5, 2024

DONE in #68

@roll roll closed this as completed Jun 5, 2024