
Cannot read json file for run_pre_training.py #23

Open
NguyenThanhAI opened this issue Jun 25, 2023 · 1 comment

Comments

@NguyenThanhAI

Hi. Thank you for your great project, but I have run into a problem that I cannot fix.
I ran helper/create_train.py and it produced a JSON file in this format:

{"text": [432, 17738, 6713, 5237, 2925, 30225, 10838, 4848, 1713, 23, 455, 376, 386, 5883, 4, 2870, 4582, 1218, 9569, 4331, 432, 12, 303, 1500, 3959, 15, 9, 454, 4331, 490, 11, 1389, 34376, 1384, 4, 1389, 181, 63, 4026, 2608, 35, 432, 163, 7, 761, 19154, 59480, 28463]}
{"text": [432, 19126, 3269, 1766, 2792, 32059, 10838, 17738, 1383, 29, 71, 303, 1387, 56630, 4, 1494, 21505, 4, 32384, 1231, 718, 1362, 452, 181, 176, 189, 10, 4331, 41, 1391, 1766, 2792, 525, 1750, 2697, 35, 4439, 1607, 24, 386, 5883, 4, 2870, 4582, 4331, 6, 311, 5089, 34, 9, 40406, 2870, 4331, 151, 69, 452, 316, 5191, 124, 4331, 14157, 3959, 15, 21, 316, 5191, 102, 10, 40406, 2870, 36793, 37272, 4, 26, 10, 441, 2697, 1500, 39, 181, 1555, 682, 72, 3959, 15, 454, 490, 10, 4331, 14157, 53, 97, 328, 2135, 8, 2792, 525, 386, 5883, 4, 2870, 4582, 36793, 15831, 19126, 7838, 525, 386, 5883, 4, 2870, 4582, 4331, 11, 1391, 1766, 302, 9, 15012, 1384, 4, 91, 32384, 1231, 718, 1362, 452, 181, 176, 40, 4436, 2608, 302, 59035, 65, 39, 3160, 30, 3974, 44654, 4331, 302, 740, 3160, 15012, 1384, 65, 205, 226, 10, 39, 445, 30, 890, 31485, 1384, 4, 12, 37, 10, 3160, 226, 676, 6, 3160, 151, 3974, 226, 33099, 5, 6663, 713, 302, 1430, 226, 386, 5883, 4, 2870, 4582, 4331, 11, 45247, 40210, 35, 40406, 2870, 4331, 187, 4439, 509, 11, 1389, 13394, 10838, 16939, 251, 1494, 21505, 4, 1189, 10, 229, 1188, 33869, 16580, 1487, 4, 1300, 363, 29986, 1581, 4, 34128, 718, 1362, 452, 181, 176, 10, 4623, 4436, 2608, 49, 5665, 9324, 143, 302, 1430, 13, 226, 386, 5883, 4, 2870, 4582, 36793, 5]}
...
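
For reference, here is a minimal sanity check (not from the repo; train.json is a placeholder path) that confirms every line of the file parses as standalone JSON, since a stray BOM, a blank first line, or a truncated record would make a JSON loader reject the file at the first character:

import json

path = "train.json"  # placeholder: the file written by helper/create_train.py

with open(path, encoding="utf-8-sig") as f:  # utf-8-sig also strips a UTF-8 BOM if present
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"line {i} is not standalone JSON: {e}")
            break
        if not isinstance(record.get("text"), list):
            print(f"line {i} has an unexpected schema: {line[:80]}")
            break
    else:
        print("every line parsed as a JSON object with a 'text' list")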

When I use this file with run_pre_training.py, I get the following error:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json. │
│ py:152 in _generate_tables                                                   │
│                                                                              │
│   149 │   │   │   │   │   │   except pa.ArrowInvalid as e:                   │
│   150 │   │   │   │   │   │   │   try:                                       │
│   151 │   │   │   │   │   │   │   │   with open(file, encoding="utf-8") as f │
│ ❱ 152 │   │   │   │   │   │   │   │   │   dataset = json.load(f)             │
│   153 │   │   │   │   │   │   │   except json.JSONDecodeError:               │
│   154 │   │   │   │   │   │   │   │   logger.error(f"Failed to read file '{f │
│   155 │   │   │   │   │   │   │   │   raise e                                │
│                                                                              │
│ /opt/conda/lib/python3.10/json/__init__.py:293 in load                       │
│                                                                              │
│   290 │   To use a custom ``JSONDecoder`` subclass, specify it with the ``cl │
│   291 │   kwarg; otherwise ``JSONDecoder`` is used.                          │
│   292 │   """                                                                │
│ ❱ 293 │   return loads(fp.read(),                                            │
│   294 │   │   cls=cls, object_hook=object_hook,                              │
│   295 │   │   parse_float=parse_float, parse_int=parse_int,                  │
│   296 │   │   parse_constant=parse_constant, object_pairs_hook=object_pairs_ │
│                                                                              │
│ /opt/conda/lib/python3.10/json/__init__.py:346 in loads                      │
│                                                                              │
│   343 │   if (cls is None and object_hook is None and                        │
│   344 │   │   │   parse_int is None and parse_float is None and              │
│   345 │   │   │   parse_constant is None and object_pairs_hook is None and n │
│ ❱ 346 │   │   return _default_decoder.decode(s)                              │
│   347 │   if cls is None:                                                    │
│   348 │   │   cls = JSONDecoder                                              │
│   349 │   if object_hook is not None:                                        │
│                                                                              │
│ /opt/conda/lib/python3.10/json/decoder.py:337 in decode                      │
│                                                                              │
│   334 │   │   containing a JSON document).                                   │
│   335 │   │                                                                  │
│   336 │   │   """                                                            │
│ ❱ 337 │   │   obj, end = self.raw_decode(s, idx=_w(s, 0).end())              │
│   338 │   │   end = _w(s, end).end()                                         │
│   339 │   │   if end != len(s):                                              │
│   340 │   │   │   raise JSONDecodeError("Extra data", s, end)                │
│                                                                              │
│ /opt/conda/lib/python3.10/json/decoder.py:355 in raw_decode                  │
│                                                                              │
│   352 │   │   try:                                                           │
│   353 │   │   │   obj, end = self.scan_once(s, idx)                          │
│   354 │   │   except StopIteration as err:                                   │
│ ❱ 355 │   │   │   raise JSONDecodeError("Expecting value", s, err.value) fro │
│   356 │   │   return obj, end                                                │
│   357                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1860 in          │
│ _prepare_split_single                                                        │
│                                                                              │
│   1857 │   │   │   )                                                         │
│   1858 │   │   │   try:                                                      │
│   1859 │   │   │   │   _time = time.time()                                   │
│ ❱ 1860 │   │   │   │   for _, table in generator:                            │
│   1861 │   │   │   │   │   if max_shard_size is not None and writer._num_byt │
│   1862 │   │   │   │   │   │   num_examples, num_bytes = writer.finalize()   │
│   1863 │   │   │   │   │   │   writer.close()                                │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json. │
│ py:155 in _generate_tables                                                   │
│                                                                              │
│   152 │   │   │   │   │   │   │   │   │   dataset = json.load(f)             │
│   153 │   │   │   │   │   │   │   except json.JSONDecodeError:               │
│   154 │   │   │   │   │   │   │   │   logger.error(f"Failed to read file '{f │
│ ❱ 155 │   │   │   │   │   │   │   │   raise e                                │
│   156 │   │   │   │   │   │   │   # If possible, parse the file as a list of │
│   157 │   │   │   │   │   │   │   if isinstance(dataset, list):  # list is t │
│   158 │   │   │   │   │   │   │   │   try:                                   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/packaged_modules/json/json. │
│ py:131 in _generate_tables                                                   │
│                                                                              │
│   128 │   │   │   │   │   │   try:                                           │
│   129 │   │   │   │   │   │   │   while True:                                │
│   130 │   │   │   │   │   │   │   │   try:                                   │
│ ❱ 131 │   │   │   │   │   │   │   │   │   pa_table = paj.read_json(          │
│   132 │   │   │   │   │   │   │   │   │   │   io.BytesIO(batch), read_option │
│   133 │   │   │   │   │   │   │   │   │   )                                  │
│   134 │   │   │   │   │   │   │   │   │   break                              │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/pyarrow/_json.pyx:259 in                       │
│ pyarrow._json.read_json                                                      │
│                                                                              │
│ [Errno 2] No such file or directory:                                         │
│ '/kaggle/working/zalo_ltr_2021/pyarrow/_json.pyx'                            │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/pyarrow/error.pxi:144 in                       │
│ pyarrow.lib.pyarrow_internal_check_status                                    │
│                                                                              │
│ [Errno 2] No such file or directory:                                         │
│ '/kaggle/working/zalo_ltr_2021/pyarrow/error.pxi'                            │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/pyarrow/error.pxi:100 in                       │
│ pyarrow.lib.check_status                                                     │
│                                                                              │
│ [Errno 2] No such file or directory:                                         │
│ '/kaggle/working/zalo_ltr_2021/pyarrow/error.pxi'                            │
╰──────────────────────────────────────────────────────────────────────────────╯
ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /kaggle/working/zalo_ltr_2021/Condenser/run_pre_training.py:202 in <module>  │
│                                                                              │
│   199                                                                        │
│   200                                                                        │
│   201 if __name__ == "__main__":                                             │
│ ❱ 202 │   main()                                                             │
│   203                                                                        │
│                                                                              │
│ /kaggle/working/zalo_ltr_2021/Condenser/run_pre_training.py:95 in main       │
│                                                                              │
│    92 │   # Set seed before initializing model.                              │
│    93 │   set_seed(training_args.seed)                                       │
│    94 │                                                                      │
│ ❱  95 │   train_set = load_dataset(                                          │
│    96 │   │   'json',                                                        │
│    97 │   │   data_files=data_args.train_path,                               │
│    98 │   │   block_size=2**25,                                              │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/load.py:1782 in             │
│ load_dataset                                                                 │
│                                                                              │
│   1779 │   try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES          │
│   1780 │                                                                     │
│   1781 │   # Download and prepare data                                       │
│ ❱ 1782 │   builder_instance.download_and_prepare(                            │
│   1783 │   │   download_config=download_config,                              │
│   1784 │   │   download_mode=download_mode,                                  │
│   1785 │   │   verification_mode=verification_mode,                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:872 in           │
│ download_and_prepare                                                         │
│                                                                              │
│    869 │   │   │   │   │   │   │   prepare_split_kwargs["max_shard_size"] =  │
│    870 │   │   │   │   │   │   if num_proc is not None:                      │
│    871 │   │   │   │   │   │   │   prepare_split_kwargs["num_proc"] = num_pr │
│ ❱  872 │   │   │   │   │   │   self._download_and_prepare(                   │
│    873 │   │   │   │   │   │   │   dl_manager=dl_manager,                    │
│    874 │   │   │   │   │   │   │   verification_mode=verification_mode,      │
│    875 │   │   │   │   │   │   │   **prepare_split_kwargs,                   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:967 in           │
│ _download_and_prepare                                                        │
│                                                                              │
│    964 │   │   │                                                             │
│    965 │   │   │   try:                                                      │
│    966 │   │   │   │   # Prepare split will record examples associated to th │
│ ❱  967 │   │   │   │   self._prepare_split(split_generator, **prepare_split_ │
│    968 │   │   │   except OSError as e:                                      │
│    969 │   │   │   │   raise OSError(                                        │
│    970 │   │   │   │   │   "Cannot find data file. "                         │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1749 in          │
│ _prepare_split                                                               │
│                                                                              │
│   1746 │   │   │   gen_kwargs = split_generator.gen_kwargs                   │
│   1747 │   │   │   job_id = 0                                                │
│   1748 │   │   │   with pbar:                                                │
│ ❱ 1749 │   │   │   │   for job_id, done, content in self._prepare_split_sing │
│   1750 │   │   │   │   │   gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_ │
│   1751 │   │   │   │   ):                                                    │
│   1752 │   │   │   │   │   if done:                                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/datasets/builder.py:1892 in          │
│ _prepare_split_single                                                        │
│                                                                              │
│   1889 │   │   │   # Ignore the writer's error for no examples written to th │
│   1890 │   │   │   if isinstance(e, SchemaInferenceError) and e.__context__  │
│   1891 │   │   │   │   e = e.__context__                                     │
│ ❱ 1892 │   │   │   raise DatasetGenerationError("An error occurred while gen │
│   1893 │   │                                                                 │
│   1894 │   │   yield job_id, True, (total_num_examples, total_num_bytes, wri │
│   1895                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset

I suspect the problem may be an incompatible combination of datasets and transformers versions, but I have tried many versions of datasets and still hit the same error. Can you help me fix this? Thank you so much!
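
For what it's worth, run_pre_training.py passes a builder-specific block_size kwarg to load_dataset; as a sketch (train.json is a placeholder path, and I am not sure which datasets versions still accept block_size, so I drop it here to rule it out), the same file should load with just:

from datasets import load_dataset

# Placeholder path; split="train" returns the Dataset directly
# instead of a DatasetDict keyed by split name.
train_set = load_dataset(
    "json",
    data_files="train.json",
    split="train",
)
print(train_set[0]["text"][:10])  # first ten token ids of the first example

If this plain call fails on the same file too, the problem is the file itself rather than the datasets/transformers version pairing.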

@loinh1106

@NguyenThanhAI Hi, I am running into this error as well. If you managed to fix it, could you share how you did it?
