RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()] #4
Comments
I have the same problem while doing preprocessing locally. I cd'ed to the gpt-2-tensorflow2.0 dir and ran the following command: Tried it with the data from the "scraped" dir provided with the repo. Please find the log in the attached file. I've installed the dependencies using conda, as follows:
Hi @vincsous and @RomanPlusPlus, thanks for reporting the issue. Thanks
Hi @akanyaani and thank you. Preprocessing is working for me now, but I have another problem with the training. Thanks again
Hi @akanyaani, thank you for your speedy response. Unfortunately, the problem persists. I still get the same Please find the log in the attached file.
But it's working on my system. Could you please print the files in that directory? Add a print in the train method of pre_process.py.
This error occurs when text_files is empty, i.e. no text files were found.
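One way to make that situation visible before training starts is to fail early with a readable message. The sketch below is only an illustration and not the repo's code: `list_text_files` is a hypothetical helper, and it assumes the data files sit directly inside the data directory.

```python
import glob
import os


def list_text_files(data_dir):
    """Return the regular files directly inside data_dir, or raise if none exist.

    Hypothetical debugging helper; not part of the repo.
    """
    pattern = os.path.join(data_dir, "*")
    # Keep regular files only; sub-directories cannot be read as text.
    text_files = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not text_files:
        raise FileNotFoundError(
            f"No files matched {pattern!r}; check the data directory path."
        )
    return text_files
```

Printing the returned list shows exactly which files the preprocessing step would see.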
Hi @vincsous, I will look into that. Thanks
Hi @akanyaani, I added the line you suggested.
I also checked the "processed.txt" file. It's empty.
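An empty intermediate file fits the `[!sentences_.empty()]` check in the error message, which reads as the trainer having received no sentences at all. A minimal sketch of a sanity check to run before training follows; `check_corpus` and `count_nonblank_lines` are hypothetical names, and the path is whatever file you pass to the tokenizer training step.

```python
import os


def count_nonblank_lines(path):
    """Count lines that contain something other than whitespace."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_corpus(path):
    """Raise early if the corpus file is missing or has no usable lines.

    Hypothetical check; not part of the repo.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} does not exist")
    n = count_nonblank_lines(path)
    if n == 0:
        raise ValueError(f"{path} contains no non-blank lines")
    return n
```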
You are getting this error because you are passing the wrong data directory. This repo has sample data in /data/scraped, so now try this.
I am also getting this error. My command: Checked the Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine. My OS: Running in a custom conda environment. My conda env.yaml file:
You can run into this error even if your path is correct because the I'd recommend that the

    def train(data_dir, vocab_size, min_seq_len, max_seq_len):
        text_files = glob.glob((data_dir + "/*"))
        process_text(text_files)
        train_byte_pair_encoding(vocab_size)
        create_tf_records(min_seq_len, max_seq_len)
        print("Pre-processing is done............")

In other words, change Better yet, gather the file paths recursively like so:

    text_files = glob.glob(data_dir + "/**/*", recursive=True)

Note that `**` only descends through multiple directory levels when `recursive=True` is passed; without it, `**` behaves like a single `*`. This allows you to have your data files within their own directories, which is useful if you have thousands of them and sometimes want to work with subsets of those thousands.
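As a sketch of the recursive approach described above: the pattern `**/*` also matches directories, so it helps to filter the results down to regular files. `gather_files` is a hypothetical helper name, not something defined in the repo.

```python
import glob
import os


def gather_files(data_dir):
    """Collect regular files under data_dir at any depth (hypothetical helper)."""
    pattern = os.path.join(data_dir, "**", "*")
    # recursive=True makes "**" match zero or more directory levels;
    # without it, "**" behaves like a single "*".
    matches = glob.glob(pattern, recursive=True)
    # The pattern also matches directories, so keep regular files only.
    return sorted(p for p in matches if os.path.isfile(p))
```

The result could then be passed wherever `text_files` is used, assuming that code expects a list of file paths.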
I encountered this error when running the code on Windows. I fixed this by editing all calls to
The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
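The comment above does not say which calls were edited, so the following is only a guess at the kind of change involved. It assumes the underlying issue is that Python's `open()` falls back to the platform's preferred encoding when none is given, which on Windows is often a legacy code page rather than UTF-8. `read_utf8` and `find_non_utf8` are hypothetical helpers.

```python
def read_utf8(path):
    """Read a text file as UTF-8 regardless of the platform's default encoding."""
    # Without encoding=..., open() uses the locale's preferred encoding,
    # which on Windows is often a legacy code page rather than UTF-8.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def find_non_utf8(paths):
    """Return the paths whose contents do not decode as UTF-8."""
    bad = []
    for path in paths:
        try:
            read_utf8(path)
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Running `find_non_utf8` over the input files before preprocessing points out any file that would need re-encoding first.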
Hi,
First, thanks for your work.
When I try to run preprocessing, I get the following error message:
RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]
I am using a *.txt file uploaded to my Colab.
I would like to know what it means and how to fix it.
Thanks
Vincent