Commit
1 parent
bfd1d1a
commit 152e010
Showing
7 changed files
with
160 additions
and
1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,37 @@ | ||
# Dutch-LLMs | ||
# Dutch LLMs | ||
Training, inference, and validation code and results for open LLMs that were pretrained (fully or partially) on the Dutch language. | ||
|
||
## Training | ||
|
||
<< TODO >> | ||
|
||
## Evaluation | ||
|
||
<< TODO >> | ||
|
||
|
||
## Datasets | ||
|
||
Below is a description of the dataset(s) used for training and evaluation. | ||
|
||
### Alpaca Dutch Cleaned | ||
|
||
Alpaca is a dataset of roughly 51K rows that can be used to fine-tune a large language model. The original dataset is English-only. | ||
|
||
I recently came across a version of the dataset that has been fully translated into Dutch: [Alpaca Dutch Cleaned](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) | ||
|
||
During training in the first Colab notebook, the dataset was split into a training set and a validation set. The validation set contains 2048 rows. | ||
To keep the various training runs and evaluation results comparable, the training and validation datasets are stored in a subfolder (alpaca_clean_dutch) of this GitHub repo. | ||
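
A minimal sketch of how such a reproducible split could be produced and stored is shown here. The notebook itself is not part of this commit, so the seed, the helper names `split_indices` and `split_and_save`, and the `train` / `validation` subfolder names are assumptions for illustration only. It relies on the Hugging Face `datasets` calls `load_dataset`, `Dataset.select`, and `Dataset.save_to_disk`.

```python
import random


def split_indices(n_rows, val_size, seed):
    """Return (train_indices, val_indices) for a reproducible random split."""
    if not 0 < val_size < n_rows:
        raise ValueError("val_size must be between 1 and n_rows - 1")
    indices = list(range(n_rows))
    # A seeded generator makes every rerun produce the same split.
    random.Random(seed).shuffle(indices)
    return sorted(indices[val_size:]), sorted(indices[:val_size])


def split_and_save(out_dir="alpaca_clean_dutch", val_size=2048, seed=42):
    """Download the dataset, split it, and store both parts under out_dir."""
    # Imported lazily so split_indices works without the `datasets` package.
    from datasets import load_dataset

    dataset = load_dataset("BramVanroy/alpaca-cleaned-dutch", split="train")
    train_idx, val_idx = split_indices(len(dataset), val_size, seed)
    # "train" and "validation" are hypothetical subfolder names.
    dataset.select(train_idx).save_to_disk(f"{out_dir}/train")
    dataset.select(val_idx).save_to_disk(f"{out_dir}/validation")
```

Calling `split_and_save()` once and committing the resulting folders means later runs can reload the same rows with `datasets.load_from_disk` instead of re-splitting, which is what keeps the evaluation results comparable.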
|
||
## References | ||
|
||
``` | ||
@misc{https://doi.org/10.57967/hf/0530, | ||
doi = {10.57967/HF/0530}, | ||
url = {https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch}, | ||
author = {{Bram Vanroy}}, | ||
title = {{A}lpaca {C}leaned {D}utch}, | ||
publisher = {Hugging Face}, | ||
year = {2023} | ||
} | ||
``` |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
{ | ||
"builder_name": "json", | ||
"citation": "", | ||
"config_name": "BramVanroy--alpaca-cleaned-dutch", | ||
"dataset_size": 22014685, | ||
"description": "", | ||
"download_checksums": { | ||
"https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch/resolve/79e4cc109558b26e8f30f44adb768b8f9709dfba/alpaca_data_cleaned-dutch.jsonl": { | ||
"num_bytes": 24355992, | ||
"checksum": null | ||
} | ||
}, | ||
"download_size": 24355992, | ||
"features": { | ||
"id": { | ||
"dtype": "int64", | ||
"_type": "Value" | ||
}, | ||
"instruction": { | ||
"dtype": "string", | ||
"_type": "Value" | ||
}, | ||
"input": { | ||
"dtype": "string", | ||
"_type": "Value" | ||
}, | ||
"output": { | ||
"dtype": "string", | ||
"_type": "Value" | ||
} | ||
}, | ||
"homepage": "", | ||
"license": "", | ||
"size_in_bytes": 46370677, | ||
"splits": { | ||
"train": { | ||
"name": "train", | ||
"num_bytes": 22014685, | ||
"num_examples": 51712, | ||
"dataset_name": "json" | ||
} | ||
}, | ||
"version": { | ||
"version_str": "0.0.0", | ||
"major": 0, | ||
"minor": 0, | ||
"patch": 0 | ||
} | ||
} |
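
The size fields in this metadata can be cross-checked against each other. The sketch below embeds an excerpt of the values shown above and tests two relations that hold for them: `size_in_bytes` equals `download_size` plus `dataset_size`, and the split byte counts add up to `dataset_size`. That these relations hold in general is an assumption, and `check_dataset_info` is a hypothetical helper, not part of any library.

```python
import json

# Excerpt of the dataset_info.json above (only the fields used here).
DATASET_INFO = """
{
  "dataset_size": 22014685,
  "download_size": 24355992,
  "size_in_bytes": 46370677,
  "splits": {
    "train": {"name": "train", "num_bytes": 22014685, "num_examples": 51712}
  }
}
"""


def check_dataset_info(info):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    expected_total = info["download_size"] + info["dataset_size"]
    if info["size_in_bytes"] != expected_total:
        problems.append(
            f"size_in_bytes is {info['size_in_bytes']}, expected {expected_total}"
        )
    split_bytes = sum(s["num_bytes"] for s in info["splits"].values())
    if split_bytes != info["dataset_size"]:
        problems.append(
            f"splits total {split_bytes} bytes, dataset_size is {info['dataset_size']}"
        )
    return problems


info = json.loads(DATASET_INFO)
problems = check_dataset_info(info)
print(problems if problems else "size fields are consistent")
```

Note that `num_examples` here is 51712, the row count of the full source dataset, so this file alone does not reveal how many rows the stored split actually holds.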
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
{ | ||
"_data_files": [ | ||
{ | ||
"filename": "data-00000-of-00001.arrow" | ||
} | ||
], | ||
"_fingerprint": "0f7260bc068c7e5e", | ||
"_format_columns": null, | ||
"_format_kwargs": {}, | ||
"_format_type": null, | ||
"_output_all_columns": false, | ||
"_split": "train" | ||
} |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
{ | ||
"builder_name": "json", | ||
"citation": "", | ||
"config_name": "BramVanroy--alpaca-cleaned-dutch", | ||
"dataset_size": 22014685, | ||
"description": "", | ||
"download_checksums": { | ||
"https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch/resolve/79e4cc109558b26e8f30f44adb768b8f9709dfba/alpaca_data_cleaned-dutch.jsonl": { | ||
"num_bytes": 24355992, | ||
"checksum": null | ||
} | ||
}, | ||
"download_size": 24355992, | ||
"features": { | ||
"id": { | ||
"dtype": "int64", | ||
"_type": "Value" | ||
}, | ||
"instruction": { | ||
"dtype": "string", | ||
"_type": "Value" | ||
}, | ||
"input": { | ||
"dtype": "string", | ||
"_type": "Value" | ||
}, | ||
"output": { | ||
"dtype": "string", | ||
"_type": "Value" | ||
} | ||
}, | ||
"homepage": "", | ||
"license": "", | ||
"size_in_bytes": 46370677, | ||
"splits": { | ||
"train": { | ||
"name": "train", | ||
"num_bytes": 22014685, | ||
"num_examples": 51712, | ||
"dataset_name": "json" | ||
} | ||
}, | ||
"version": { | ||
"version_str": "0.0.0", | ||
"major": 0, | ||
"minor": 0, | ||
"patch": 0 | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
{ | ||
"_data_files": [ | ||
{ | ||
"filename": "data-00000-of-00001.arrow" | ||
} | ||
], | ||
"_fingerprint": "40ef988d605ceaae", | ||
"_format_columns": null, | ||
"_format_kwargs": {}, | ||
"_format_type": null, | ||
"_output_all_columns": false, | ||
"_split": "train" | ||
} |