[SPARK-50616][SQL] Add File Extension Option to CSV DataSource Writer #49233

Open: wants to merge 1 commit into base: master

jabbaugh

What changes were proposed in this pull request?

The existing CSV DataSource lets the user set the delimiter/separator but not the file extension. As a result, a file can have values separated by tabs yet still be marked as a ".csv" file. This change lets the user set the file extension to match the delimiter/separator (e.g. ".tsv" for a tab-separated value file).

Why are the changes needed?

This PR adds an option, fileExtension, that sets the file extension. As a result, when the separator is not a comma, the output file can be given an extension that matches the separator (e.g. file.tsv, file.psv, etc.).
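
As a rough illustration of pairing a separator with a matching extension, here is a minimal sketch. The helper `extensionFor` and the object name are hypothetical and are not part of Spark or of this PR; the mapping simply follows the tsv/psv examples mentioned above.

```scala
object SeparatorExtensionSketch {
  // Hypothetical helper (not a Spark API): pick a conventional extension
  // for a given separator, falling back to "csv" as the PR's default.
  def extensionFor(sep: String): String = sep match {
    case "\t" => "tsv" // tab-separated values
    case "|"  => "psv" // pipe-separated values
    case _    => "csv" // default, as in the PR
  }

  def main(args: Array[String]): Unit = {
    // Print the extension chosen for a few common separators.
    Seq(",", "\t", "|").foreach { sep =>
      println(extensionFor(sep))
    }
  }
}
```

Assuming this patch is applied, the result could presumably be passed to the writer as `.option("fileExtension", extensionFor(sep))` alongside the existing `.option("sep", sep)`; the option name is the one proposed in this PR and may change during review.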

Notes on Previous Pull Request #17973
A pull request adding this option was discussed 7 years ago. One reason it wasn't added was:
"I would like to suggest to leave this out if there is no better reason for now. Downside of this is, it looks this allows arbitrary name and it does not gurantee the extention is, say, tsv when the delmiter is a tab. It is purely up to the user."

I don't believe this is a good reason not to let the user set the extension. If we let them set the delimiter/separator to an arbitrary string/char, then why not also let them set the file extension to indicate the separator that the file uses (e.g. tsv, psv, etc.)? This addition keeps "csv" as the default file extension and lets the extension match other separators.

Does this PR introduce any user-facing change?

Yes. This PR adds one row to the options table for the CSV DataSource documentation to include the "fileExtension" option.

How was this patch tested?

One unit test was added to validate a file is written with the new extension.

Was this patch authored or co-authored using generative AI tooling?

No

@jabbaugh jabbaugh force-pushed the jbaugh-add-csv-file-ext branch 2 times, most recently from cede15e to acd7d3e Compare December 18, 2024 21:18
@jabbaugh jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from eee47c5 to de3b891 Compare December 18, 2024 21:50
@jabbaugh jabbaugh changed the title Add File Extension Option to CSV DataSource Writer [SPARK-50616] Add File Extension Option to CSV DataSource Writer Dec 18, 2024
@HyukjinKwon HyukjinKwon changed the title [SPARK-50616] Add File Extension Option to CSV DataSource Writer [SPARK-50616][SQL] Add File Extension Option to CSV DataSource Writer Dec 19, 2024
@@ -86,7 +86,7 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister {
   }

   override def getFileExtension(context: TaskAttemptContext): String = {
-    ".csv" + CodecStreams.getCompressionExtension(context)
+    "." + csvOptions.fileExtension + CodecStreams.getCompressionExtension(context)
Member

In this way, the user can overwrite almost anything in the file system:

./../../../../../path/<file to overwrite>.xz

Should we do some sanity check of the option?
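
To make the concern concrete, here is a minimal sketch of how an unchecked extension containing path separators could resolve outside the output directory. The helper `outputFile`, the directory `/data/out`, and the file names are hypothetical; this only mimics the `"." + extension` concatenation from the diff using `java.nio.file` and is not Spark's actual write path. It assumes Unix-style paths.

```scala
import java.nio.file.{Path, Paths}

object ExtensionTraversalSketch {
  // Hypothetical stand-in for a writer that builds a file name as
  // base name + "." + user-supplied extension, then resolves it.
  def outputFile(outputDir: String, baseName: String, extension: String): Path =
    Paths.get(outputDir, baseName + "." + extension).normalize()

  def main(args: Array[String]): Unit = {
    // A plain extension stays inside the output directory.
    val safe = outputFile("/data/out", "part-00000", "tsv")
    // An "extension" with ".." segments walks back up the tree.
    val unsafe = outputFile("/data/out", "part-00000", "/../../../etc/target")

    println(safe)
    println(unsafe)
    println(unsafe.startsWith("/data/out")) // whether the result is still under the output directory
  }
}
```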

Author

That seems reasonable to me. I can add a check that limits the file extension to three characters in the range a-z. If any other characters are found, we can throw an exception stating the limitation.

Author

I have added a limit of 3 letters when we set the option.

See: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
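
A check of this kind might look roughly like the sketch below. It is not the code in CSVOptions.scala: the object and method names are hypothetical, and it assumes the limit means exactly three lowercase letters a-z, which the actual patch may define differently.

```scala
object FileExtensionCheckSketch {
  // Assumption: exactly three lowercase ASCII letters are allowed.
  private val Allowed = "[a-z]{3}"

  // Hypothetical validator: returns the extension unchanged if it is valid,
  // otherwise throws IllegalArgumentException (via require).
  def validateExtension(ext: String): String = {
    require(
      ext.matches(Allowed),
      s"File extension must be exactly three letters a-z, got: '$ext'")
    ext
  }

  def main(args: Array[String]): Unit = {
    println(validateExtension("tsv"))
    // A value containing path separators is rejected before it reaches the file name.
    val rejected = scala.util.Try(validateExtension("/../../../etc/target"))
    println(rejected.isFailure)
  }
}
```

Rejecting anything outside a small letter-only pattern rules out path separators and dots entirely, which addresses the overwrite concern raised in the review without needing to inspect the resolved path.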

@jabbaugh jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from de3b891 to 2ff5f81 Compare December 20, 2024 17:31
@github-actions github-actions bot added the CORE label Dec 20, 2024
@jabbaugh jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from 2ff5f81 to 3c63274 Compare December 20, 2024 22:04
@jabbaugh jabbaugh force-pushed the jbaugh-add-csv-file-ext branch from 3c63274 to ec7d8cc Compare December 20, 2024 22:11
@github-actions github-actions bot removed the CORE label Dec 20, 2024
2 participants