Data quality issues #4

huangfw · 2024-08-06T12:50:29Z

Hello, thank you for your impressive work! While visualizing the data, I found that some documents have problems such as missing detection boxes and incomplete formula detection boxes. Could you please confirm whether there are data quality problems? Thank you!

I downloaded the data from Hugging Face. The randomly selected files are:
astro-ph.CO/1804.05921
astro-ph.CO/1005.1278

BOBrown · 2024-08-07T02:02:42Z

@huangfw
Thank you for your attention to our work. Please note that DocGenome-train is annotated automatically, so the data quality in the DocGenome-train dataset is not consistent. We checked the two examples you mentioned (astro-ph.CO/1804.05921 and astro-ph.CO/1005.1278), and the detection boxes for the formulas are indeed inaccurate. This happens mainly because some macro definitions in the LaTeX source code are not recognized correctly by our automated annotation tool. We will address these issues in the next version.

BOBrown · 2024-08-07T02:05:44Z

@huangfw In addition, we have graded the quality of all automatically annotated data in DocGenome, assigning each item to one of three quality levels: tier-1, tier-2, and tier-3, as shown in Figure 3-(b) of the DocGenome paper. You can choose data with a quality level of tier-1.
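
As a rough sketch of selecting tier-1 data: this thread does not say how the tier labels are stored, so the snippet below assumes a hypothetical JSON file that maps each document ID to its tier string. The file layout, the function names `load_tier_map` and `select_tier`, and the placeholder IDs are illustrative assumptions, not part of the DocGenome release.

```python
import json
from pathlib import Path


def load_tier_map(path: str) -> dict[str, str]:
    """Load a document-ID -> tier mapping from a JSON file.

    Assumed (hypothetical) layout: {"<category>/<paper-id>": "tier-1", ...}
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def select_tier(tier_map: dict[str, str], wanted: str = "tier-1") -> list[str]:
    """Return the sorted document IDs whose quality label equals `wanted`."""
    return sorted(doc_id for doc_id, tier in tier_map.items() if tier == wanted)


# Placeholder IDs and labels, made up purely for illustration.
example_map = {
    "cat.A/0001": "tier-1",
    "cat.B/0002": "tier-3",
    "cat.A/0003": "tier-1",
    "cat.C/0004": "tier-2",
}
tier1_docs = select_tier(example_map)
```

The resulting list can then be used to decide which document folders to load for visualization or training, skipping the lower-quality tiers.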

sky-fly97 · 2024-08-07T10:11:15Z

@huangfw Hello, we have added information about the different quality levels of the training set for reference. Later on, we will also describe the various components of the training set in more detail.
