
update readMe #5

Merged · 39 commits · Jan 2, 2024
30 changes: 14 additions & 16 deletions README.md
@@ -11,24 +11,22 @@
</div>

-----------------------------------------------------------------------
Data is one of the basic elements in the development of artificial intelligence. With continuous breakthroughs in large-scale pre-trained models and related technologies, using efficient data processing tools to improve data quality has become increasingly important. We therefore launched FlagData, an easy-to-use and easily extensible data processing toolkit. FlagData integrates several data processing tools and algorithms covering data acquisition, data preparation, data preprocessing, and data analysis, providing strong data-level support for model training and deployment in natural language processing, computer vision, and other fields.

FlagData supports the following features:

* It can be used with a simple configuration after installation, and custom features can be implemented with little code.
* High-quality structured data can be quickly cleaned from raw html/text/pdf/epub, and sensitive information can be filtered out to avoid the risk of privacy disclosure.
* Massive text data deduplication is supported, with detailed deployment documents for multi-machine distributed data processing systems.
* Data quality assessment and common data analysis are supported.

The complete pipeline and its features are shown in the figure below:
![pipeline](pipeline.png)

## News

- [Dec 31st, 2023] FlagData upgraded to v2.0.0
- [Jan 31st, 2023] FlagData v1.0.0 is online!

--------------------------------------------------------------------------------
@@ -85,10 +83,10 @@ pip install -r requirements.txt

### Data acquisition phase

The LLM interface is utilized to construct a series of single-round SFT data for different abilities with three different strategies. The strategies include:

+ ImitateGenerator: augments data using several case samples as templates. Supports generating data in multiple languages simultaneously.
+ AbilityExtractionGenerator: uses the LLM interface to generalize the abilities contained in several case samples, then generates new samples and answers based on this collection of abilities.
+ AbilityDirectGenerator: generates new samples directly related to a specified ability or task type. For example, if you specify the ability as "Logical Reasoning", a series of logical reasoning questions and answers can be generated. To increase the diversity of generated samples, already generated samples can be excluded.
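The few-shot idea behind ImitateGenerator can be sketched as a plain prompt builder. This is an illustrative sketch only: the function name and prompt wording are hypothetical, not FlagData's actual API (see flagdata/data_gen for the real generators).

```python
def build_imitation_prompt(case_samples, language="English"):
    """Assemble a few-shot prompt asking an LLM to imitate case samples.

    Hypothetical illustration of the ImitateGenerator idea; the real
    implementation lives in flagdata/data_gen.
    """
    lines = [
        f"Write one new instruction-response pair in {language}, "
        "imitating the style of the examples below.",
        "",
    ]
    for i, (instruction, response) in enumerate(case_samples, 1):
        lines += [
            f"Example {i}:",
            f"Instruction: {instruction}",
            f"Response: {response}",
            "",
        ]
    lines.append("New example:")
    return "\n".join(lines)
```

The resulting string would be sent to the LLM interface, and the completion parsed back into a new SFT sample.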


@@ -105,7 +103,7 @@ Image, Formula, etc. Tool scripts provide two forms: keeping full text and savin

See [ReadMe under all2txt Module](flagdata/all2txt/README.md) for an example.

### Data preprocessing phase

#### Language recognition

@@ -135,7 +133,7 @@ Currently, the following cleaning rules are included:

It takes only two steps to use the data cleaning feature of FlagData:

1. Modify the data path and format in the YAML configuration file. Detailed comments on each parameter in the configuration file template explain its meaning. You can also refer to the [Configuration](#Configuration) chapter.

2. Specify the configuration file path in the following code and run it:
```python
# ... (snippet collapsed in the diff view)
```

@@ -162,19 +160,19 @@ See [ReadMe under quality_assessment Module](flagdata/quality_assessment/README.

#### Data deduplication

The deduplication module can deduplicate large amounts of text data. It uses MinHashLSH (MinHash Locality-Sensitive Hashing), which converts each text into a series of hash values in order to compare similarities between texts.

The parameter threshold controls the similarity threshold, with values ranging from 0 to 1. A setting of 1 requires an exact match, so no text is filtered out; a lower value also retains texts with slightly lower similarity. A higher threshold can be set as needed to retain only very similar texts and discard the rest; the empirical default value is 0.87. We also use Spark's distributed computing power with the MapReduce idea to remove duplicates, tuned to handle large-scale text datasets efficiently.
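The mechanics behind MinHash similarity estimation can be sketched with a stdlib-only toy (the actual module uses MinHashLSH on Spark; the function names here are illustrative, not FlagData's API):

```python
import hashlib


def shingles(text, k=3):
    """Character k-grams of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=64):
    """One minimum per seeded hash function; similar texts share many minima."""
    grams = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(num_perm)
    ]


def estimated_jaccard(a, b, num_perm=64):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    sa = minhash_signature(a, num_perm)
    sb = minhash_signature(b, num_perm)
    return sum(x == y for x, y in zip(sa, sb)) / num_perm
```

Pairs whose estimated similarity exceeds the threshold (0.87 by default) would be treated as duplicates, and LSH banding lets this comparison scale without checking every pair.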
The following shows two similar texts encountered during deduplication; they differ slightly in line breaks and editors' names, but the deduplication algorithm identifies the two passages as highly similar.

```json lines
{
"__id__": 3023656977259,
"content": "\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)\n(责任编辑:单晓冰)"
}
{
"__id__": 3934190045072,
"content": "记者 潘世鹏\n\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆。......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)"
}
```

24 changes: 11 additions & 13 deletions README_zh.md
@@ -15,20 +15,18 @@

FlagData supports the following features:

* It can be used with a simple configuration after installation, and custom features can be implemented with little code.
* High-quality structured data can be quickly cleaned from raw html/text/pdf/epub, with an emphasis on filtering out sensitive information to avoid the risk of privacy disclosure.
* Massive text data deduplication is supported, with detailed deployment documents for multi-machine distributed data processing systems.
* Data quality assessment and common data analysis are supported.

The complete pipeline and its features are shown in the figure below:
![pipeline](pipeline_zh.png)

## News

- [Dec 31st, 2023] FlagData upgraded to v2.0.0
- [Jan 31st, 2023] FlagData v1.0.0 is online!

--------------------------------------------------------------------------------
@@ -85,11 +83,11 @@ pip install -r requirements.txt

### 2.1. Data acquisition phase

We provide a data augmentation module based on an LLM interface.
Using the LLM interface, a series of single-round SFT data for different abilities is constructed with three different strategies. The strategies include:

+ ImitateGenerator: augments data using several case samples as templates. Supports generating data in multiple languages simultaneously.
+ AbilityExtractionGenerator: uses the LLM interface to generalize the abilities contained in several case samples, then generates new samples and answers based on this collection of abilities.
+ AbilityDirectGenerator: generates new samples directly related to a specified ability or task type. For example, if you specify the ability as "Logical Reasoning", a series of logical reasoning questions and answers can be generated. To increase the diversity of generated samples, already generated samples can be excluded.

See the [ReadMe under the data_gen module](flagdata/data_gen/README_zh.md) for an example.
@@ -169,12 +167,12 @@ Image(图)", "Formula(公式)", etc. Tool scripts provide keeping the full text, as well as

```json lines
{
"__id__": 3023656977259,
"content": "\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)\n(责任编辑:单晓冰)"
}
{
"__id__": 3934190045072,
"content": "记者 潘世鹏\n\"2022海口三角梅花展\"已接待游客3万多名——\n三角梅富了边洋村\n一年四季,美丽的海南岛始终春意盎然、鲜花盛开,而作为海南省省花的三角梅就是其中最引人注目的鲜花品种之一,成为海南的一道亮丽风景线。\n\"可别小看这一盆盆普通的三角梅花,特别受游客喜爱。仅最近一个多月,我们就卖出了200多万元,盆栽三角梅销路火爆。......吸引更多本地和外地游客来赏花、买花。(经济日报 记者 潘世鹏)"
}
```

6 changes: 3 additions & 3 deletions flagdata/data_gen/README.md
@@ -1,11 +1,11 @@
# Data acquisition phase

### Data enhancement module based on LLM interface

The LLM interface is utilized to construct a series of single-round SFT data for different abilities with three different strategies. The strategies include:

+ ImitateGenerator: augments data using several case samples as templates. Supports generating data in multiple languages simultaneously.
+ AbilityExtractionGenerator: uses the LLM interface to generalize the abilities contained in several case samples, then generates new samples and answers based on this collection of abilities.
+ AbilityDirectGenerator: generates new samples directly related to a specified ability or task type. For example, if you specify the ability as "Logical Reasoning", a series of logical reasoning questions and answers can be generated. To increase the diversity of generated samples, already generated samples can be excluded.

See `example.py` for an example.
6 changes: 3 additions & 3 deletions flagdata/data_gen/README_zh.md
@@ -1,11 +1,11 @@
# Data acquisition phase

### Data augmentation module based on LLM interface

Using the LLM interface, a series of single-round SFT data for different abilities is constructed with three different strategies. The strategies include:

+ ImitateGenerator: augments data using several case samples as templates. Supports generating data in multiple languages simultaneously.
+ AbilityExtractionGenerator: uses the LLM interface to generalize the abilities contained in several case samples, then generates new samples and answers based on this collection of abilities.
+ AbilityDirectGenerator: generates new samples directly related to a specified ability or task type. For example, if you specify the ability as "Logical Reasoning", a series of logical reasoning questions and answers can be generated. To increase the diversity of generated samples, already generated samples can be excluded.

See `example.py` for an example.
4 changes: 2 additions & 2 deletions flagdata/deduplication/README.md
@@ -1,7 +1,7 @@
# Data preprocessing phase > Data deduplication
The following describes in detail how to use distributed capabilities for data deduplication.

First, build a Spark standalone cluster (1 master, 2 workers):
1. Install the JDK

a. Download the JDK package
2 changes: 1 addition & 1 deletion flagdata/language_identification/README.md
@@ -1,4 +1,4 @@
# Data preprocessing phase > Language recognition

LID stands for Language IDentification.
+ It uses fastText's language classifier, which is trained on Wikipedia, Tatoeba, and SETimes, uses character n-grams as features, and uses a hierarchical softmax. It classifies 176 languages and outputs a score from 0 to 1.
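The character-n-gram idea behind such classifiers can be illustrated with a tiny frequency-profile toy. This is a sketch only, not the fastText model: all function names and the training snippets are made up for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Character trigrams, padded so word boundaries count as features."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def train_profiles(samples):
    """samples: {language: [example texts]} -> n-gram frequency profile per language."""
    profiles = {}
    for lang, texts in samples.items():
        profile = Counter()
        for t in texts:
            profile.update(char_ngrams(t))
        profiles[lang] = profile
    return profiles


def identify(text, profiles):
    """Pick the language whose profile overlaps most with the text's n-grams."""
    grams = char_ngrams(text)

    def score(profile):
        total = sum(profile.values())
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))
```

The real model replaces these raw frequency profiles with learned embeddings and a hierarchical softmax, which is what lets it scale to 176 languages with calibrated scores.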
4 changes: 2 additions & 2 deletions flagdata/quality_assessment/README.md
@@ -1,7 +1,7 @@
# Data preprocessing phase > Quality assessment
BERT and fasttext were chosen as evaluation models because they have the following advantages:

1. The BERT model performs well in text categorization and comprehension tasks, has strong language understanding and representation capabilities, and can effectively assess text quality.
2. FastText models (version 0.9.2) have efficient training and inference speeds while maintaining classification performance, which can significantly reduce training and inference time.