Unable to read csv with "\t" as the separator #20342

Open
2 tasks done
nerdai opened this issue Dec 18, 2024 · 4 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars

Comments

@nerdai

nerdai commented Dec 18, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

use polars::prelude::*;

let parse_options = CsvParseOptions::default().with_separator(b'\t');
let df = CsvReadOptions::default()
    .with_parse_options(parse_options)
    .with_has_header(false)
    .try_into_reader_with_file_path(Some("SMSSpamCollection.tsv".into()))
    .unwrap()
    .finish()
    .unwrap();

Log output

thread 'polars-0' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-io-0.45.1/src/csv/read/read_impl.rs:455:33:
assertion failed: df.height() <= count
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'polars-1' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-io-0.45.1/src/csv/read/read_impl.rs:455:33:
assertion failed: df.height() <= count

Issue description

The .tsv file that I'm trying to read is the SMS-Spam dataset, which can be downloaded from the UC Irvine data repository. After downloading the .zip file and adding the .tsv extension to the SMSSpamCollection file, I ran the code above, which panics instead of reading the dataset.

Perhaps I'm not doing this correctly?

Expected behavior

The .tsv file should load into a DataFrame without error.

Installed versions

csv, lazy

@nerdai nerdai added bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars labels Dec 18, 2024
@cmdlineluser
Contributor

It loads for me if I set quote_char to None:

import polars as pl

pl.read_csv("Downloads/sms+spam+collection/SMSSpamCollection", separator="\t", has_header=False, quote_char=None)
# shape: (5_574, 2)
# ┌──────────┬─────────────────────────────────┐
# │ column_1 ┆ column_2                        │
# │ ---      ┆ ---                             │
# │ str      ┆ str                             │
# ╞══════════╪═════════════════════════════════╡
# │ ham      ┆ Go until jurong point, crazy..… │
# │ ham      ┆ Ok lar... Joking wif u oni...   │
# │ spam     ┆ Free entry in 2 a wkly comp to… │
# │ ham      ┆ U dun say so early hor... U c … │
# │ ham      ┆ Nah I don't think he goes to u… │
# │ …        ┆ …                               │
# │ spam     ┆ This is the 2nd time we have t… │
# │ ham      ┆ Will ü b going to esplanade fr… │
# │ ham      ┆ Pity, * was in mood for that. … │
# │ ham      ┆ The guy did some bitching but … │
# │ ham      ┆ Rofl. Its true to its name      │
# └──────────┴─────────────────────────────────┘

A minimal example that fails with the default quote_char:

data = b'foo\t"bar"...'

pl.read_csv(data, separator="\t", has_header=False)
# ComputeError: could not parse `"bar".` as dtype `str` at column 'column_2' (column number 2)

Related issue: #19432
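
To illustrate why a field such as `"bar"...` is ambiguous once `"` is treated as a quote character, here is a minimal sketch using Python's standard-library csv module. It is an analogy only: it does not use Polars, and Polars' parser may behave differently. The `parse` helper and the variable names are made up for this illustration.

```python
import csv
import io

# Same shape as the minimal example: the second field opens with a quote,
# but more text follows the closing quote.
raw = 'foo\t"bar"...\n'


def parse(text, **kwargs):
    """Parse tab-separated text with the standard-library csv reader."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", **kwargs))


# Default settings: '"' is the quote character, so the quotes are consumed
# and the trailing dots are appended to the field.
lenient = parse(raw)

# Quoting disabled: the quotes are kept as ordinary characters.
unquoted = parse(raw, quoting=csv.QUOTE_NONE)

# Strict mode rejects text that follows a closing quote.
try:
    parse(raw, strict=True)
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)

print(lenient)   # [['foo', 'bar...']]
print(unquoted)  # [['foo', '"bar"...']]
print(strict_error is not None)  # True
```

Disabling quoting is the stdlib counterpart of passing quote_char=None: the text is read verbatim, which suits a dataset of free-form messages where quotes are part of the content.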

@nerdai
Author

nerdai commented Dec 18, 2024

Thanks, that worked for me as well.

let parse_options = CsvParseOptions::default()
    .with_separator(b'\t')
    .with_quote_char(None);
let df = CsvReadOptions::default()
    .with_parse_options(parse_options)
    .with_has_header(false)
    .try_into_reader_with_file_path(Some("SMSSpamCollection.tsv".into()))
    .unwrap()
    .finish()
    .unwrap();

Should I have known to set this option to None? This is my first time attempting to use Polars 😅

@cmdlineluser
Contributor

cmdlineluser commented Dec 18, 2024

I don't think so.

Polars previously parsed this without error; the behavior changed very recently. (That's why #19432 was filed as a bug.)

I think an additional "Hint: " message could be added to suggest that quote_char may need changing.

I'm also not sure why you're getting a panic; perhaps that's a bug as well. (I only use py-polars.)
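
Until such a hint exists, a rough pre-check can flag the lines likely to trip the parser. The sketch below is a hypothetical helper (`find_ambiguous_quotes` is not part of Polars) and makes a simplifying assumption: it splits each line on the separator directly, so it will misreport files where quoted fields legitimately contain separators or newlines.

```python
def find_ambiguous_quotes(lines, separator="\t", quote_char='"'):
    """Return (line_number, field) pairs for fields that open with the quote
    character but are not a single, cleanly quoted value."""
    suspects = []
    for line_number, line in enumerate(lines, start=1):
        for field in line.rstrip("\r\n").split(separator):
            if not field.startswith(quote_char):
                continue
            # A cleanly quoted field ends with the quote character and,
            # ignoring doubled (escaped) quotes, has no quotes inside.
            inner = field[1:-1].replace(quote_char * 2, "")
            clean = (
                len(field) >= 2
                and field.endswith(quote_char)
                and quote_char not in inner
            )
            if not clean:
                suspects.append((line_number, field))
    return suspects


sample = [
    "ham\tOk lar...\n",
    'foo\t"bar"...\n',      # text after the closing quote
    'spam\t"quoted"\n',     # cleanly quoted
    'x\t"a ""b"" c"\n',     # escaped quotes inside a quoted field
]

hits = find_ambiguous_quotes(sample)
for line_number, field in hits:
    print(f"Hint: line {line_number} has field {field!r}; quote_char may need changing")
```

If the check reports hits on a file of free-form text, reading it with quote_char set to None (as in the workaround earlier in this thread) is a reasonable first thing to try.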

@nerdai
Author

nerdai commented Dec 18, 2024

Ah, I see. Thanks! I agree a more helpful error message would be nice.
