GitHub - mynlp/Sem4Dialogue: This repository contains the files for Syntactic and Semantic Uniformity for Semantic Parsing and Task-Oriented Dialogue Systems

Uniform semantics for Semantic Parsing and Task-Oriented Dialogue Systems

This is the repository for the paper Syntactic and Semantic Uniformity for Semantic Parsing and Task-Oriented Dialogue Systems (EMNLP Findings 2022).

Introduction

For now, the implemented models are a Transformer model and an LSTM with a copy mechanism, both of which you can use directly to get some initial results. The paper's key idea is a framework that unifies the semantics of the different machine-readable formats used in Semantic Parsing and Task-Oriented Dialogue. Performance is not the main focus of the paper. If you are interested in this research, refer to the paper for details.

Datasets

Current datasets covered in the repository are:

Semantic Parsing

GeoQuery-FunQL Version
GeoQuery-SQL Version
SCAN-Si & SCAN-Len
SParC
Spider
NLmaps v2
ATIS-SQL Version

Task-Oriented Dialogue

TreeDST
SMCalFlow
CoSQL
MultiWoz
M2M
DSTC2

Codes Related to the Proposed Format

The script sem4diag_rewriting.py contains the processing code we used to convert the different formats of the different datasets into the format proposed in the paper. To make the script work, you may need some of the following packages:

mo_sql_parsing for SQL language parsing
pyparsing for parsing nested parentheses
anytree and treelib for tree representation
onmt for some default special tokens

For New Task Processing

You need to add your own parsing code if your data is not in one of the formats that the script can process directly, which include SQL queries, FunQL queries, dialogue states, NLmaps, etc.

In that case, you will need to write new code that parses your data and linearizes the resulting tree following the method used in the paper. The code may not need to be rewritten completely: this is mostly case by case, and you may be able to reuse parts of it.

How much you can reuse largely depends on the nature of your data, so use the code in the repository with care. It may not apply to your format, in which case you will need to write your own.
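As a rough illustration of what "parse your data and linearize the tree" can look like for a new format, here is a minimal sketch using only the standard library. It is not the processing used in sem4diag_rewriting.py and it does not reproduce the linearization defined in the paper. The `Node` class, the `parse` and `linearize` functions, the bracketed output layout, and the example query string are all assumptions made for illustration. A hand-written recursive parser stands in for pyparsing here so that the example has no dependencies.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node: a label plus an ordered list of children (hypothetical)."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Tokens are "(", ")", "," or any run of other non-space characters.
TOKEN = re.compile(r"[(),]|[^\s(),]+")


def parse(text: str) -> Node:
    """Parse a functional expression such as f(g(a), b) into a tree.

    Assumes balanced parentheses; unbalanced input raises IndexError.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node(tokens[pos])
        pos += 1
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume "("
            while tokens[pos] != ")":
                if tokens[pos] == ",":
                    pos += 1  # skip argument separators
                    continue
                node.children.append(parse_node())
            pos += 1  # consume ")"
        return node

    root = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return root


def linearize(node: Node) -> List[str]:
    """Pre-order traversal; internal nodes are wrapped in brackets."""
    if not node.children:
        return [node.label]
    out = ["(", node.label]
    for child in node.children:
        out.extend(linearize(child))
    out.append(")")
    return out


tree = parse("answer(count(river(loc_2(stateid(texas)))))")
print(" ".join(linearize(tree)))
```

For a new dataset, the parsing step is usually the part that has to be rewritten, while a traversal like `linearize` can often be reused once the data is in tree form.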

Codes for Experiments

File Introduction

Each dataset folder contains the following files:

processe_xxx.py: the pre-processing code for the dataset. Run it to pre-process the data.
processed_xxx folder: holds the pre-processed dataset.
vocab_xxx folder: contains the vocabulary files of the dataset.
Any other folders mainly hold the raw data of the dataset.

Each dataset has a YAML configuration file, such as SCAN.yaml or GeoQuery.yaml, which specifies the parameter and model settings for that dataset. These YAML files follow the format of the OpenNMT package; refer to its documentation to learn how to use them.

The utils folder contains the necessary metric code, and metric_test.py is built on top of it. The current metrics are:

Word Level Exact Match
Sentence Level Exact Match
BLEU Score. The BLEU metric is built on the sacrebleu package, which you should also install.
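To make the two exact-match metrics concrete, here is a minimal sketch of one common way to compute them. The definitions are assumptions, not the repository's implementation: sentence-level exact match is taken as the fraction of predictions whose whitespace tokens equal the gold tokens, and word-level exact match as the fraction of position-wise matching tokens, normalized by the longer of the two sequences. The code in the utils folder may define them differently, and the function names below are hypothetical.

```python
from typing import Sequence


def sentence_exact_match(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions whose token sequence equals the gold one."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(p.split() == g.split() for p, g in zip(preds, golds))
    return hits / len(preds)


def word_exact_match(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of positions where the predicted token equals the gold token.

    Each pair is normalized by the longer sequence, so missing or extra
    tokens count as errors (an assumption of this sketch).
    """
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    matched = 0
    total = 0
    for p, g in zip(preds, golds):
        p_tok, g_tok = p.split(), g.split()
        total += max(len(p_tok), len(g_tok))
        matched += sum(a == b for a, b in zip(p_tok, g_tok))
    return matched / total if total else 0.0


# Made-up example data, for illustration only.
preds = ["select name from city", "select count ( * ) from river"]
golds = ["select name from city", "select count ( * ) from lake"]

print("Sent EM:", sentence_exact_match(preds, golds))
print("Word EM:", round(word_exact_match(preds, golds), 4))
```

In this example one of the two sentences matches exactly, while 10 of the 11 token positions match, which shows why the word-level score is usually the more lenient of the two.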

This repository also supports SQL execution evaluation, based on part of test-suite-sql-eval. The original code has some problems, so I modified parts of it; use the version in this repository rather than the original. To run the SQL evaluation code, you need to download the SQL databases for the different datasets, which are available on Google Drive. Download the databases and unzip them into the sql_eval directory. For details on how to use the code, refer to the original repository, which is also the official metric repository for the CoSQL, SParC, and Spider datasets.

For the official metric of the SMCalFlow dataset, refer to its repository for details on how to run the evaluation.

train.sh is the training script. It performs the following steps:

train the model
translate the given source input
evaluate the model output

Running the code

Running the code is relatively easy. For example, to run an experiment on the GeoQuery dataset with the Transformer model and evaluate the checkpoint saved at step 10000, use the following command:

sh train.sh GeoQuery Transformer 10000

Refer to train.sh for details on the parameters it accepts.
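If you want to launch several runs from Python rather than typing the command by hand, a small wrapper like the following sketch may help. It only assembles the three positional arguments shown in the example above (dataset, model, step); the helper name `build_train_command` is hypothetical, and any further arguments that train.sh may accept are not covered.

```python
import shlex
from typing import List


def build_train_command(dataset: str, model: str, step: int) -> List[str]:
    """Return the argv list for `sh train.sh <dataset> <model> <step>`."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return ["sh", "train.sh", dataset, model, str(step)]


cmd = build_train_command("GeoQuery", "Transformer", 10000)
print(shlex.join(cmd))

# To actually start the run, execute this from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with dataset names that contain special characters.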

If you want to use another model, note that changing the model parameter of train.sh does not by itself select the model to run. Before training, you must edit the YAML file of the task to change the model selection. For example, to use the LSTM model on GeoQuery, comment out the Transformer settings in GeoQuery.yaml and uncomment the LSTM settings. Otherwise, even if you run sh train.sh GeoQuery LSTM 10000, a Transformer model will still be trained. For details about the YAML configuration, refer to the OpenNMT documentation.

The last numeric parameter selects the checkpoint to evaluate by its training step. In the example above, the Transformer checkpoint saved at step 10000 is used for evaluation. The script automatically translates the test file, evaluates the translation against the gold file, and reports the results once the evaluation is finished. The automatic evaluation only covers Word EM, Sent EM, and BLEU. SQL execution accuracy is not reported automatically; to execute the SQL queries against the database, run the evaluation yourself, referring to test-suite-sql-eval for details.

