Extend support of CSV with CSV Dialect #528
Comments
More specifically, I'd first:
More support of CSV dialect requires at least someone with experience in actually working with messy CSV data (e.g. users of mr), because authors of standards tend to add features without common use cases.
Hey, that is great and very helpful research, I didn't know about any of those CSV dialect standards. I think your suggestions make sense to do. Is it something you would like to help out with coding-wise? It might speed things up a bit.
Yep, good idea. The only (not great) reason it's called comma now is that this is what it's called in the CSV parser used at the moment: https://pkg.go.dev/encoding/csv#Reader
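For context, a minimal sketch of how the standard library reader exposes these settings: the Reader type in encoding/csv has a Comma field for the delimiter and a Comment field for the comment character. The helper name readAll and the sample input are made up for illustration and are not fq code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readAll is a hypothetical helper (not part of fq). It parses input with
// the given delimiter and comment character; a comment of 0 disables comments.
func readAll(input string, delimiter, comment rune) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = delimiter // field delimiter, ',' by default
	r.Comment = comment // lines starting with this rune are skipped
	return r.ReadAll()
}

func main() {
	records, err := readAll("a;b\n# note\n1;2\n", ';', '#')
	if err != nil {
		panic(err)
	}
	fmt.Println(records) // prints [[a b] [1 2]]
}
```

Renaming the fq option would then only change the name exposed to users; the value would still end up in the Comma field.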
Ok, so all lines will be treated as data?
👍 We could possibly also move the convert-to-object code into Go if doing it in jq is slow. Maybe the csv decoder could have a "dialect" option that is either a string naming a dialect or an object with settings? One thing to figure out is whether we can still use the csv parser in the Go standard library, or need to find another existing one or write one ourselves.
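A rough sketch of what such a string-or-object "dialect" option could look like when decoded from JSON. Everything here is an assumption for illustration: the Dialect type, its two fields (named after the csvddf properties delimiter and commentChar), and the dialect names "rfc4180" and "tsv" are not existing fq options.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dialect is a hypothetical settings type, not an existing fq option.
type Dialect struct {
	Delimiter   string `json:"delimiter"`
	CommentChar string `json:"commentChar"`
}

// named maps dialect names to settings. The entries and their keys are
// illustrative assumptions.
var named = map[string]Dialect{
	"rfc4180": {Delimiter: ","},
	"tsv":     {Delimiter: "\t"},
}

// UnmarshalJSON accepts either a dialect name (JSON string) or an object
// with explicit settings.
func (d *Dialect) UnmarshalJSON(data []byte) error {
	var name string
	if err := json.Unmarshal(data, &name); err == nil {
		known, ok := named[name]
		if !ok {
			return fmt.Errorf("unknown dialect %q", name)
		}
		*d = known
		return nil
	}
	type plain Dialect // same fields, no methods, so no infinite recursion
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	*d = Dialect(p)
	return nil
}

func main() {
	for _, option := range []string{`"tsv"`, `{"delimiter": ";", "commentChar": "#"}`} {
		var d Dialect
		if err := json.Unmarshal([]byte(option), &d); err != nil {
			panic(err)
		}
		fmt.Printf("%q\n", d.Delimiter)
	}
}
```

The cost of the string form is that the set of names has to be maintained somewhere, here the named map.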
I'm very motivated but have not coded in Go yet (should be doable and happy to learn) so the "might speed things up" does not apply. So it depends :-) But data formats are my research topic and I heavily use jq so sooner or later I need to dig deeper into fq anyway.
Yes, most CSV parsers don't enable comments by default.
Yes but then you need to manage names of dialects. The only commonly agreed names I know are RFC 4180 and TSV (probably better as
The more dialect aspects are supported, the greater the danger of having to write your own CSV library. That's why I'd first limit implementation to compatibility with a subset of CSVD and CSVW.
Great, there is no hurry, it was more if you wanted something fast :) I can help out with both Go and jq stuff. Maybe a possible route is that I start looking at it and see how much work it seems to be, possibly with some initial PR etc., and then we figure something out? What kind of research are you doing? As a student, PhD etc.? Curious. And I'm of course happy to help out with other fq or format related things.
Aha, I see. But it's nice that both csvddf and csvw have default values, so an fq decoder could always use those as a quite safe fallback for properties not set? I had no idea there were even efforts to standardize CSV like this; it seems like a good idea, since CSV is quite confusing. I've had to explain at least a couple of times that "export it as CSV" is sadly not that straightforward :) I've also run into issues with numbers in CSV, e.g. which decimal symbol to use; that seems to be out of scope for csvddf and csvw?
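The fallback idea could be sketched roughly like this: properties that are not set stay nil and are filled from defaults. The Settings and Resolved types and the resolve function are hypothetical, and only three properties are shown (delimiter, quoteChar and header, with the defaults ",", a double quote and true); this is an assumption-laden sketch, not fq's decoder.

```go
package main

import "fmt"

// Settings holds optional dialect properties; nil means "not set".
// Hypothetical type, not part of fq.
type Settings struct {
	Delimiter *string
	QuoteChar *string
	Header    *bool
}

// Resolved is the dialect after defaults have been applied.
type Resolved struct {
	Delimiter string
	QuoteChar string
	Header    bool
}

// resolve starts from the defaults and overrides only what was set.
func resolve(s Settings) Resolved {
	r := Resolved{Delimiter: ",", QuoteChar: `"`, Header: true}
	if s.Delimiter != nil {
		r.Delimiter = *s.Delimiter
	}
	if s.QuoteChar != nil {
		r.QuoteChar = *s.QuoteChar
	}
	if s.Header != nil {
		r.Header = *s.Header
	}
	return r
}

func main() {
	semicolon := ";"
	fmt.Printf("%+v\n", resolve(Settings{Delimiter: &semicolon}))
	// prints {Delimiter:; QuoteChar:" Header:true}
}
```

Pointer fields are used so that an explicitly set false or empty value can be told apart from a property that was never given.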
Yes, true, good point. So maybe try to stick with the standard library csv reader/writer and see how far it can go?
I did my PhD thesis on patterns in data formats some years ago and I manage a structured register of data formats (in German, with focus on bibliographic data).
Interesting, and the thesis looks like something I'd like to have a look at. As you might have noticed, fq currently does not support much when it comes to schemas or generic format description languages like Kaitai Struct. But I think it should be possible to add in some form, at least for decoding; encoding is a different kind of beast, at least for complex formats like mp4.
Did some research on good test suites; csvw seems to have one in a nice format: https://github.com/w3c/csvw/tree/gh-pages/tests
Did an initial PR to try some things out: #546, see comments.
First, thanks for this great work! The CSV format can be troublesome because of its many dialects. If support of CSV is going to be extended, I recommend using the CSV Dialect Specification or a compatible subset of it. By now, fq supports two CSV Dialect properties with differing name and default. An alternative is the CSVW dialect description (thanks xkcd #972!). Here is a comparison:
,
,
,
#
#
(spec is ambiguous)
Names can be adjusted by aliases; I prefer short names anyway. Remaining properties found in csvddf and csvw are:
"
"
true
true
\r\n
["\r\n", "\n"]
true
true
false
0
0
utf-8
Property doubleQuote differs in meaning between the two (csvw uses it to also set escapeChar). csvddf further has property caseSensitiveHeader (default false) but there are discussions to remove it.