The custom dataset documentation is hard to understand #1323

Closed
danxuan2022 opened this issue Jul 8, 2024 · 5 comments


danxuan2022 commented Jul 8, 2024

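Could the custom dataset section be written in more detail? The docs recommend passing parameters directly on the command line, but where does the dataset_id in that approach come from? Do I make up a dataset_id myself, or does it come from somewhere else? I'd rather not read the source code....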
https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#-%E6%8E%A8%E8%8D%90%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0%E7%9A%84%E5%BD%A2%E5%BC%8F

@Jintao-Huang
Collaborator

dataset_id is simply the ModelScope dataset_id.

@Jintao-Huang
Collaborator

dataset_path is simply a local file path.

Alternatively, use a Hugging Face dataset_id: HF::{dataset_id}


zodiacg commented Jul 9, 2024

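It's hard to know what to say about not wanting to read the source... swift uses exactly this mechanism internally to maintain its datasets, and template.py is packed with code you can modify or copy.

A custom dataset fundamentally needs one function: get_function. When swift calls get_function, it always passes the parameters below, plus any other custom parameters specified in function_kwargs. Overall, get_function needs to match this signature: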

def get_custom_dataset(
    # These are passed positionally, in this order
    dataset_id: str,
    subsets,
    preprocess_func,
    splits,
    dataset_sample,
    # The rest are passed as keyword arguments
    *,
    random_state,
    dataset_test_ratio,
    remove_useless_columns,
    use_hf,
    # Your own custom parameters
    **kwargs,
):
    ...  # return one HfDataset, or a tuple of two HfDatasets

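get_function only needs to return one HfDataset or a tuple of two HfDatasets, and the data inside must be in the query/response/history format. dataset_id is only used to map a dataset name you specify to your get_function. Unless you use or need to handle them, all of the intermediate parameters can be ignored; just capture them with *args, **kwargs.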
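A minimal sketch of such a get_function, assuming the rows are built in memory. The sample rows, the helper name build_rows, and the layout of history as a list of [query, response] pairs are illustrative assumptions, not taken from swift; the only library call used is `Dataset.from_list` from Hugging Face `datasets`.

```python
def build_rows():
    """Return example rows in the query/response/history layout."""
    return [
        {"query": "What is 1 + 1?", "response": "2", "history": []},
        {
            "query": "And doubled?",
            "response": "4",
            # Earlier turns as [query, response] pairs (assumed layout).
            "history": [["What is 1 + 1?", "2"]],
        },
    ]


def get_custom_dataset(dataset_id, *args, **kwargs):
    """Hypothetical get_function: ignores every argument it does not need."""
    # Imported here so build_rows() can be tried without `datasets` installed.
    from datasets import Dataset as HfDataset

    return HfDataset.from_list(build_rows())


rows = build_rows()
print(sorted(rows[0]))  # ['history', 'query', 'response']
```

Capturing everything after dataset_id with *args, **kwargs keeps the function compatible with the call described above without having to name parameters it never reads.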

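If you only want to register a dataset from ModelScope or HF, use get_dataset_from_repo as the get_function. Then use a custom preprocess_func (sometimes you don't even need one) to convert the format. swift also ships many Processors internally that can do this quickly.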
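As a rough illustration of the format conversion a preprocess_func performs, here is a sketch that maps Alpaca-style rows to query/response. The source column names (instruction, input, output) are hypothetical, and the assumption that preprocess_func receives and returns an HfDataset should be checked against the swift source; `Dataset.map` and `column_names` are from Hugging Face `datasets`.

```python
OLD_COLUMNS = ("instruction", "input", "output")  # hypothetical source columns


def to_query_response(row):
    """Map one Alpaca-style row to the query/response layout."""
    instruction = row["instruction"]
    extra = row.get("input") or ""
    # Append the optional input on its own line when it is present.
    query = f"{instruction}\n{extra}" if extra else instruction
    return {"query": query, "response": row["output"]}


def preprocess_func(dataset):
    """Apply the row mapping to an HfDataset and drop the old columns."""
    old = [name for name in OLD_COLUMNS if name in dataset.column_names]
    return dataset.map(to_query_response, remove_columns=old)


example = to_query_response(
    {"instruction": "Translate to French", "input": "hello", "output": "bonjour"}
)
print(example)
```

The row-level function can be tried on plain dicts first, which makes it easy to check the mapping before running it over a whole dataset.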


Xiongbenyang commented Jul 15, 2024

@Jintao-Huang @zodiacg Hello, I am working on the SFT of InternVL. Do I have to specify the absolute path for the images attribute in the data definition file *.jsonl? What if I want to use relative paths? I tried the command line parameter --custom_train_dataset_path, but it didn't work, and this command line parameter is deprecated.

@tastelikefeet
Collaborator

Just use --dataset; that works.
If you want to use relative paths, make sure your working directory matches the image paths.
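If matching the working directory is inconvenient, another option is to rewrite the *.jsonl once so that every image path is absolute. The sketch below assumes each line is a JSON object whose images field is a list of path strings; that layout and the helper names are assumptions for illustration, not something swift provides.

```python
import json
from pathlib import Path


def absolutize_images(row, base_dir):
    """Return a copy of row with relative `images` entries made absolute."""
    base = Path(base_dir).resolve()
    images = []
    for item in row.get("images", []):
        path = Path(item)
        # Leave absolute paths alone; anchor relative ones at base_dir.
        images.append(str(path if path.is_absolute() else base / path))
    return {**row, "images": images}


def rewrite_jsonl(src, dst, base_dir):
    """Copy a JSONL file, resolving image paths against base_dir."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():  # skip blank lines
                row = absolutize_images(json.loads(line), base_dir)
                fout.write(json.dumps(row, ensure_ascii=False) + "\n")
```

For example, `rewrite_jsonl("train.jsonl", "train_abs.jsonl", "path/to/image/root")` would produce a copy that no longer depends on where the training command is launched; the file names here are placeholders.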
