What should be the right data format for fine-tuning and inference? #5

Open
nonstopfor opened this issue Aug 16, 2020 · 7 comments

@nonstopfor

I want to fine-tune MaUde on my own data and use the fine-tuned model for inference, but I don't know the required data format (for both training and test data). Does anyone know it?

@koustuvsinha (Contributor)

Currently, the data is extracted through ParlAI using ParlAIExtractor. If you run the standalone script (online_dialog_eval/data.py), you'll see the data format the code expects.
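
For orientation, one quick way to see the raw dialogues that the extractor consumes is ParlAI's own display_data script. This is only a sketch: it assumes a reasonably recent ParlAI install (the repo may pin an older version where this is only a command-line script) and uses the personachat task as an example.

```python
# Sketch: print a few PersonaChat episodes in ParlAI's internal message format.
# Assumes a recent ParlAI; older versions expose display_data only on the command line.
from parlai.scripts.display_data import DisplayData

DisplayData.main(task="personachat", num_examples=5)
```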

@nonstopfor (Author)

Could you give a complete pipeline? Suppose I have original train, valid, and test data (just some plain dialogues). What steps do I need to follow to fine-tune MaUde and run inference?

@nonstopfor (Author)

Could you tell me which function in ParlAIExtractor is used to read data from the file? Because working out the data format from more than 1,000 lines of code in data.py is really hard work...

@nonstopfor (Author)

Also, when computing the backtranslation and corruption files, what should the data format be?

@koustuvsinha (Contributor)

The extract_interactions function from ParlAIExtractor is used to build the data. I would suggest reading the ParlAI docs to understand how the data is internally represented, as this repo is heavily dependent on it (in the sense that we don't have a standard input/output file format).
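
For reference, a ParlAI episode is internally a sequence of per-turn message dicts. The sketch below shows the generic shape ('text', 'labels', 'episode_done'); whether extract_interactions reads exactly these keys is best verified in data.py.

```python
# Sketch of ParlAI's generic message format: an episode is a list of turns,
# each a dict carrying the context in 'text', the gold response in 'labels',
# and an 'episode_done' flag marking the end of the conversation.
episode = [
    {"text": "hi, how are you doing today?",
     "labels": ["i am great, i just got back from the gym."],
     "episode_done": False},
    {"text": "nice, do you go often?",
     "labels": ["almost every day, it keeps me sane."],
     "episode_done": True},
]

for turn in episode:
    print(turn["text"], "->", turn["labels"][0])
```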

@koustuvsinha (Contributor)

@nonstopfor I just released the entire dataset used and processed for PersonaChat dialogues (backtranslation / corruption), which is linked in the readme. You can see the data format from these files.
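
If you only need the column layout of the released files, inspecting one of them with pandas is enough. The filename below is a placeholder; substitute any of the .csv files linked in the readme.

```python
# Sketch: peek at one of the released backtranslation / corruption CSVs.
# "personachat_corruption.csv" is a placeholder, not an actual filename from the release.
import pandas as pd

df = pd.read_csv("personachat_corruption.csv")
print(df.columns.tolist())  # the column names reveal which fields are expected
print(df.head())            # the first rows show how contexts and responses are stored
```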

@nonstopfor (Author)

In this directory, which file is the original data file? And where do these .csv files come from?
