Data quality issues #4

Open
huangfw opened this issue Aug 6, 2024 · 3 comments

Comments


huangfw commented Aug 6, 2024

Hello, thank you for your impressive work! While visualizing the data, I found that some documents have problems such as missing detection boxes and incomplete formula detection boxes. Could you please confirm whether these are data quality problems? Thank you!

I downloaded the data from Hugging Face. The randomly selected files are:
astro-ph.CO/1804.05921
astro-ph.CO/1005.1278

Contributor

BOBrown commented Aug 7, 2024

@huangfw
Thank you for your attention to our work. Note that DocGenome-train is annotated automatically, so its data quality is not uniform. We checked the two examples you mentioned (astro-ph.CO/1804.05921 and astro-ph.CO/1005.1278), and the detection boxes for the formulas are indeed inaccurate. This is mainly because some macro definitions in the LaTeX source code are not correctly recognized by our automated annotation tool. We will address these issues in the next version.

Contributor

BOBrown commented Aug 7, 2024

@huangfw In addition, we have graded the quality of all automatically annotated data in DocGenome and assigned each item to one of three quality levels: tier-1, tier-2, and tier-3, as shown in Figure 3(b) of the DocGenome paper. You can restrict your selection to tier-1 data.
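
As a rough illustration only, selecting by quality level could look like the sketch below. This thread does not describe how the tier labels are stored, so the record layout, the `quality_tier` field name, the `select_by_tier` helper, and the sample IDs are all hypothetical placeholders rather than the actual DocGenome metadata format.

```python
from typing import Dict, Iterable, List


def select_by_tier(
    records: Iterable[Dict[str, str]],
    allowed: Iterable[str] = ("tier-1",),
) -> List[Dict[str, str]]:
    """Keep only records whose (hypothetical) 'quality_tier' field is allowed."""
    allowed_set = set(allowed)
    # Records without a tier label are dropped rather than assumed to be good.
    return [r for r in records if r.get("quality_tier") in allowed_set]


# Placeholder records; these IDs and tiers are made up for the example.
samples = [
    {"paper_id": "example-a", "quality_tier": "tier-1"},
    {"paper_id": "example-b", "quality_tier": "tier-3"},
    {"paper_id": "example-c", "quality_tier": "tier-2"},
    {"paper_id": "example-d"},  # no tier label recorded
]

tier1_only = select_by_tier(samples)
tier1_and_2 = select_by_tier(samples, allowed=("tier-1", "tier-2"))
```

Dropping unlabeled records is a conservative choice for this sketch. If the real metadata uses a different field name or encoding, only the lookup inside `select_by_tier` would need to change.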

Contributor

sky-fly97 commented Aug 7, 2024

@huangfw Hello, we have added information about the different quality levels of the training set for reference. Later on, we will also describe the various components of the training set in more detail.
