update v3.0.0
wuchengwei committed Jun 13, 2024
1 parent ccac94b commit 84a3d7e
Showing 175 changed files with 10,597 additions and 24,832 deletions.
21 changes: 21 additions & 0 deletions .gitignore
@@ -0,0 +1,21 @@
# Ignore the .idea directory
.idea/

# Build and Release Folders
bin-debug/
bin-release/
[Oo]bj/
[Bb]in/

# Other files and folders
.settings/

# Executables
*.swf
*.air
*.ipa
*.apk

# Project files, i.e. `.project`, `.actionScriptProperties` and `.flexProperties`
# should NOT be excluded as they contain compiler settings and other important
# information for Eclipse / Flash Builder.
Empty file modified FlagOpen.png
100644 → 100755
Empty file modified LICENSE
100644 → 100755
Empty file.
140 changes: 78 additions & 62 deletions README.md
100644 → 100755
@@ -2,7 +2,6 @@

![FlagData](flagdata_logo.png)
[![Pypi Package](https://img.shields.io/pypi/v/flagdata?label=pypi%20package)](https://pypi.org/project/flagdata/)
[![Python Application](https://github.com/FlagOpen/FlagData/actions/workflows/python-app.yml/badge.svg)](https://github.com/FlagOpen/FlagData/actions/workflows/python-app.yml)
[![License](https://img.shields.io/github/license/FlagOpen/FlagData.svg?color=blue)](https://github.com/FlagOpen/FlagData/blob/main/LICENSE)
![GitHub release (release name instead of tag name)](https://img.shields.io/github/v/release/FlagOpen/FlagData?include_prereleases&style=social)

@@ -30,7 +29,7 @@ The complete pipeline process and features such as
![pipeline](pipeline.png)

## News

- [June 13th, 2024] FlagData v3.0.0 update: supports multiple data types, offers dozens of operator pools for DIY, and generates high-quality data with one click
- [Dec 31st, 2023] FlagData v2.0.0 has been upgraded
- [Jan 31st, 2023] FlagData v1.0.0 is online!

@@ -49,10 +48,29 @@ The complete pipeline process and features such as
- [Configuration](#Configuration)
- [Data cleaning](#Data-cleaning)
- [Data Quality assessment](#Data-Quality-assessment)
- [Contact us](#Contact-us)
- [Operator Pool](#Operator-Pool)
- [Strong community support](#Strong-community-support)
- [Users](#Users)
- [Reference project](#Reference-project)
- [License](#License)

# V3.0.0 UPDATE
Based on feedback from the community, FlagData has been upgraded. This update provides a set of foolproof tools for constructing language-model pre-training data. For each data type (Html, Text, Book, Arxiv, Qa, etc.) we provide a one-click data quality improvement task, so both novice and advanced users can easily generate high-quality data.
- Novice users: Just confirm the data type to generate high-quality data.
- Advanced users: We provide dozens of operator pools for users to DIY their own LLM pre-training data construction process.

**Project Features:**

- Ease of use: Foolproof operation; a simple configuration is all that is needed to generate high-quality data.
- Flexibility: Advanced users can customize the data construction process through various operator pools.
- Diversity: Supports multiple data types (HTML, Web, Wiki, Book, Paper, QA, Redpajama, Code)

**Key highlights**

- 🚀 Generate high-quality data with one click
- 🔧 Dozens of operator pools for DIY
- 🌐 Support for multiple data types

## Installation

- The requirements.txt file lists all the dependency packages of the FlagData project
@@ -61,29 +79,13 @@ The complete pipeline process and features such as
pip install -r requirements.txt
```

Optionally, install only the `cleaner` module of FlagData. This installs just the dependency packages for that
module, which suits users who want to use the `cleaner` module without installing the dependency packages of
other modules.

```bash
pip install flagdata[cleaner]
```

**Install the latest version of the main branch**

The main branch is officially released by FlagData. If you want to install / update to the latest version of the main
branch, use the following command:

```
git clone https://github.com/FlagOpen/FlagData.git
cd FlagData
pip install .[all]
```

**Secondary development based on source code**

```bash
git clone https://github.com/FlagOpen/FlagData.git
cd FlagData
pip install -r requirements.txt
```

## Quick Start
@@ -102,7 +104,7 @@ different strategies. The strategies include:
answers. In order to increase the diversity of generated samples, it is supported to exclude already generated
samples.

See [ReadMe under data_gen Module](flagdata/data_gen/README.md) for an example.
See [Instructions for using the Data Enhancement Module](flagdata/data_gen/README.md) for an example.

### Data preparation phase

@@ -115,7 +117,7 @@ Title [Chapter Title]", "Address [E-mail]","PageBreak", "Header [Header]", "Foot
UncategorizedText [arxiv vertical number]", "
Image, Formula, etc. Tool scripts provide two forms: keeping full text and saving by category resolution.

See [ReadMe under all2txt Module](flagdata/all2txt/README.md) for an example.
See [Instructions for using all2txt modules](flagdata/all2txt/README.md) for an example.

### Data preprocessing phase

@@ -131,43 +133,33 @@ finally outputs a score of 0: 1.
+ Under the general cleaning rules, a page whose score is greater than 0.5 is classified as the specific language;
otherwise the language of the page is considered uncertain and the page is discarded.
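
The threshold rule above can be sketched as follows. This is a minimal illustration, not the module's actual code: only the 0.5 threshold comes from the text, while `keep_language` and the sample `(label, score)` pairs are hypothetical stand-ins for real classifier output.

```python
from typing import Optional

# Threshold taken from the rule above; everything else here is illustrative.
CONFIDENCE_THRESHOLD = 0.5


def keep_language(label: str, score: float,
                  threshold: float = CONFIDENCE_THRESHOLD) -> Optional[str]:
    """Return the language label if the score exceeds the threshold, else None (discard)."""
    return label if score > threshold else None


# Hypothetical (label, score) pairs standing in for classifier output.
predictions = [("zh", 0.93), ("en", 0.41), ("en", 0.50)]
kept = [keep_language(label, score) for label, score in predictions]
print(kept)  # ['zh', None, None]
```

Note that a score of exactly 0.5 is discarded here, because the rule says "greater than 0.5".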

See [ReadMe under language_identification Module](flagdata/language_identification/README.md) for an example.
See [Instructions for using the language identification module](flagdata/language_identification/README.md) for an example.

#### Data cleaning

The cleaner module uses a multi-process pool (`mp.Pool`) to process data in parallel, and uses
`SharedMemoryManager` to create shareable data structures so that multiple processes can share data during processing.

Efficient data cleaning is achieved through multi-processes and shared memory:
We provide one-click data quality improvement tasks such as Html, Text, Book, Arxiv, Qa, etc. For more customized functions, users can refer to the "data_operator" section.
##### TextCleaner
TextCleaner provides a fast and extensible text data cleaning tool. It provides commonly used text cleaning modules.
Users only need to select the text_clean.yaml file in cleaner_builder.py to process text data.
For details, see [Instructions for using TextCleaner](flagdata/cleaner/docs/Text_Cleaner.md)

Currently, the following cleaning rules are included:
##### ArxivCleaner
ArxivCleaner provides a commonly used arxiv text data cleaning tool.
Users only need to select the arxiv_clean.yaml file in cleaner_builder.py to process arxiv data.

+ Remove emoticons and meaningless characters (via regular expressions)
+ Remove reprint and copyright notices (Zhihu, CSDN, Jianshu, Cnblogs)
+ Remove unreasonable consecutive punctuation marks, and unify newline characters as `\n`
+ Remove personal information (such as mobile phone numbers and ID numbers), URLs, and extra spaces
+ Remove irrelevant content at the beginning and end, and remove text whose length is less than n (currently n = 100)
+ Convert simplified Chinese to traditional Chinese (OpenCC library)
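
A few of the rules listed above can be sketched with the standard `re` module. This is a rough illustration under stated assumptions rather than the cleaner's real implementation: the function name `clean_text`, the regular expressions, and the punctuation set are all hypothetical, and only the default minimum length of 100 comes from the list.

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+")
# Same punctuation mark repeated two or more times (ASCII and full-width).
REPEATED_PUNCT_RE = re.compile(r"([!?,;:！？，。；：])\1+")
SPACES_RE = re.compile(r"[ \t]+")


def clean_text(text: str, min_length: int = 100) -> Optional[str]:
    """Apply a few of the listed rules; return None when the text is too short."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify newlines as \n
    text = URL_RE.sub("", text)                              # remove URLs
    text = REPEATED_PUNCT_RE.sub(r"\1", text)                # collapse repeated punctuation
    text = SPACES_RE.sub(" ", text).strip()                  # remove extra spaces
    return text if len(text) >= min_length else None


sample = "Great post!!!   See https://example.com for details。。。\r\nThanks"
print(clean_text(sample, min_length=10))
```

Returning `None` for short texts lets the caller drop the record, which matches the "remove text whose length is less than n" rule.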
##### HtmlCleaner
HtmlCleaner provides commonly used Html format text extraction and data cleaning tools.
Users only need to run the main method to process Html data.

It takes only two steps to use the data cleaning feature of FlagData:

1. Modify the data path and format in the YAML configuration file. The configuration file template has detailed
comments explaining the meaning of each parameter. You can also refer to
the [Configuration](#Configuration) section.

2. Specify the configuration file path in the following code and run it:
```python
from flagdata.cleaner.text_cleaner import DataCleaner

if __name__ == "__main__":  # Guard so the main module can be imported safely by worker processes
    cleaner = DataCleaner("config.yaml")
    cleaner.clean()
```
##### QaCleaner
QaCleaner provides commonly used Qa format text extraction and data cleaning tools.
Users only need to run the main method to process Qa data.
For details, see [Instructions for using Qa](flagdata/cleaner/docs/Qa_Cleaner.md)

The cleaned file will be saved in the format `jsonl` to the path corresponding to the `output` parameter specified in
the configuration file.

See [Tutorial 1: Clean the original text obtained from the Internet](/flagdata/cleaner/tutorial_01_cleaner.md) for an
example.
##### BookCleaner
BookCleaner provides a common book format text extraction and data cleaning tool.
Users only need to run the main method to process the book data.
For details, see [Instructions for using Book](flagdata/cleaner/docs/Book_Cleaner.md)

#### Quality assessment

@@ -182,7 +174,7 @@ This paper compares different text classification models, including logical regr
their performance. In the experiment, BERTEval and FastText models perform well in text classification tasks, and
FastText model performs best in terms of accuracy and recall rate. [experimental results are from ChineseWebText]

See [ReadMe under quality_assessment Module](flagdata/quality_assessment/README.md) for an example.
See [Instructions for using the quality assessment module](flagdata/quality_assessment/README.md) for an example.

#### Data deduplication

@@ -196,6 +188,7 @@ to retain only those texts that are very similar, while discard those texts with
default value is 0.87. We also use the distributed computing power of Spark: deduplication follows the
MapReduce model and is tuned on Spark so that large-scale text data sets can be processed
efficiently.
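
To make the role of the similarity threshold concrete, here is a small single-machine sketch. It is not the Spark implementation: it assumes Jaccard similarity over character shingles as the similarity measure, which the visible text does not specify, and the names `shingles`, `jaccard`, and `deduplicate` are hypothetical. Only the 0.87 default comes from the description above.

```python
from typing import List, Set

SIMILARITY_THRESHOLD = 0.87  # default value mentioned above


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams of whitespace-normalised text."""
    normalised = " ".join(text.split())
    return {normalised[i:i + k] for i in range(max(len(normalised) - k + 1, 1))}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def deduplicate(texts: List[str], threshold: float = SIMILARITY_THRESHOLD) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox\njumps over the lazy dog.",  # differs only in line wrapping
    "An entirely different sentence about data cleaning.",
]
print(len(deduplicate(docs)))  # 2
```

This pairwise comparison is quadratic in the number of texts, which is one reason a distributed approach is needed at scale.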

The following is an example of similar text found during data deduplication. The two paragraphs differ slightly in line
wrapping and name editing, but the deduplication algorithm still identifies them as highly similar.

@@ -253,13 +246,13 @@ The analysis data analysis module provides the following functions:

+ length analysis of the text.

See [ReadMe under analysis Module](flagdata/analysis/README.md) for an example.
See [Instructions for using the analysis module](flagdata/analysis/README.md) for an example.

## Configuration

For the `data cleaning` and `data quality assessment` modules,
we provide a configuration file
template: [cleaner_config.yaml](https://dorc.baai.ac.cn/resources/projects/FlagData/cleaner_config.yaml), [bert_config.yaml](flagdata/quality_assessment/Bert/bert_config.yaml)
template: [text_clean.yaml, arxiv_clean.yaml](flagData/cleaner/configs), [bert_config.yaml](flagdata/quality_assessment/Bert/bert_config.yaml)
The configuration file is in readable [YAML](https://yaml.org) format and provides detailed comments. Please make sure
that you have modified the parameters in the configuration file before using these modules.

@@ -268,10 +261,16 @@ Here are some important parameters you need to pay attention to:
### Data cleaning

```yaml
# Raw data to be cleaned
input: ./demo/demo_input.jsonl
# Save path of data after cleaning
output: ./demo/output.jsonl
# Field to be processed
source_key: text
# Key in the output file for saving
result_key: cleanedContent
# Pipeline class to use
cleaner_class: ArxivCleaner
```
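
As a rough illustration of how `source_key` and `result_key` relate, the sketch below reads JSONL records, cleans the field named by `source_key`, and stores the result under `result_key`. It assumes the input is one JSON object per line; `apply_cleaner` and the whitespace-collapsing `clean` function are hypothetical and are not part of FlagData. The configuration is written as a plain dict so the example needs no YAML parser.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Mirrors two keys of the YAML template above as a plain dict.
config = {
    "source_key": "text",
    "result_key": "cleanedContent",
}


def apply_cleaner(lines: Iterable[str], config: Dict[str, str],
                  clean: Callable[[str], str]) -> Iterator[str]:
    """Read JSONL records, clean `source_key`, and store the result under `result_key`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record[config["result_key"]] = clean(record[config["source_key"]])
        yield json.dumps(record, ensure_ascii=False)


raw = ['{"text": "  Hello   world  "}']
results = list(apply_cleaner(raw, config, clean=lambda s: " ".join(s.split())))
for out in results:
    print(out)
```

The original field is left untouched, so the raw and cleaned text can be compared side by side in the output file.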
### Data Quality assessment
@@ -283,20 +282,37 @@ Here are some important parameters you need to pay attention to:
# The text_key field is the field being evaluated
text_key: "raw_content"
```
## Contact us
If you have any questions about the use and code of this project, you can submit issue. At the same time, you can
contact us directly through [email protected].
An active community is inseparable from your contribution, if you have a new idea, welcome to join our community, let us
become a part of open source, together to contribute our own efforts for open source!
## Operator Pool
We provide some basic operators for data cleaning, filtering, format conversion, etc. to help users build their own data construction process.
The operators provided are divided into three types: Formatter, Pruner, and Filter. Formatter is used to process structured data and can be used for mutual conversion of data in different formats; Pruner is used to clean text data; Filter is used for sample filtering.
The figure below shows these operators in different processing locations and a list of some of the operators.
<img src="pic/data_operator.png" width="50%" height="auto">
<img src="pic/some_operator.png" width="50%" height="auto">
For detailed description, see [Instructions for using the data operator](flagdata/data_operator/Operator_ZH.md)
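
To illustrate how the three operator types described for the Operator Pool (Formatter, Pruner, Filter) might be combined, here is a minimal sketch. All names in it (`jsonl_formatter`, `whitespace_pruner`, `min_length_filter`, `run_pipeline`) are hypothetical and do not correspond to the operators shipped in `flagdata/data_operator`; it only mirrors the division of roles.

```python
import json
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def jsonl_formatter(lines: Iterable[str]) -> List[Record]:
    """Formatter: convert JSONL lines into dict records."""
    return [json.loads(line) for line in lines if line.strip()]


def whitespace_pruner(record: Record) -> Record:
    """Pruner: clean the text of one record (here, collapse whitespace)."""
    return {**record, "text": " ".join(record["text"].split())}


def min_length_filter(min_chars: int) -> Callable[[Record], bool]:
    """Filter: build a predicate that keeps records with enough text."""
    return lambda record: len(record["text"]) >= min_chars


def run_pipeline(lines: Iterable[str],
                 pruners: List[Callable[[Record], Record]],
                 filters: List[Callable[[Record], bool]]) -> List[Record]:
    """Format the input, apply every pruner in order, then keep records passing all filters."""
    records = jsonl_formatter(lines)
    for prune in pruners:
        records = [prune(record) for record in records]
    return [record for record in records if all(keep(record) for keep in filters)]


lines = ['{"text": "  a   longer   sample  text "}', '{"text": " hi "}']
result = run_pipeline(lines, [whitespace_pruner], [min_length_filter(10)])
print(result)  # [{'text': 'a longer sample text'}]
```

Keeping each operator a plain function makes it easy to reorder or swap steps when building a custom data construction process.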
## Strong community support
### Community Support
If you have any questions about the use and code of this project, you can submit an issue. You can also contact us directly via email at [email protected].
An active community depends on your contributions. If you have a new idea, you are welcome to join our community, become part of open source, and contribute to open source together with us!
<img src="contact_me.png" width="50%" height="auto">
Or follow Zhiyuan FlagOpen open source system, FlagOpen official website https://flagopen.baai.ac.cn/
Or follow the FlagOpen open source system, FlagOpen official website https://flagopen.baai.ac.cn/
![contact_me](FlagOpen.png)
### Questions and Feedback
- Please report issues and make suggestions through GitHub Issues, and we will respond within 24 hours.
- You are also welcome to join the discussion in GitHub Discussions.
- If GitHub is inconvenient for you, you are welcome to share your thoughts freely in the FlagData open source community. We will incorporate reasonable suggestions in the next version.
We will regularly invite experts in the field to hold online and offline exchanges and share the latest LLM research results.
## Users
<img src="pic/users.png" width="50%" height="auto">
## Reference project
Part of this project is referenced from the following code: