Is the dataset publicly available? #9

JasonKitty · 2024-08-23T01:26:49Z

I can only find the DocGenome dataset. Is table recognition trained on this dataset?

Thank you!

PrinceVictor · 2024-08-23T03:08:19Z

Yes. Our model is trained on the DocGenome dataset. Specifically, we extracted the table data from DocGenome to fine-tune our model.

Thank you for your interest in our work!
Let me know if you have any further questions.

JasonKitty · 2024-08-23T16:10:34Z

Thank you for your reply! I have two more questions.

The article mentions that table recognition and formula recognition both use the same model as Pix2Struct. Are these models trained separately for each task?
For formula recognition, MinerU uses UniMERNet, which adds a length embedding to the decoder. Is a similar improvement applied in table recognition?

PrinceVictor · 2024-08-26T03:28:36Z

Thank you for your questions.

Yes, separate models are trained for table recognition and formula recognition.
Unlike UniMERNet, there is no length embedding added to the decoder.

JasonKitty · 2024-08-26T09:24:32Z

Thank you! One more question: the paper mentions using the tokenizer from Nougat. Has this changed? I found that the two tokenizers are not the same.

PrinceVictor · 2024-08-30T02:48:50Z

We currently utilize the tokenizer from Pix2Struct, but we have expanded the vocabulary to support the Chinese language better.

JasonKitty · 2024-08-30T06:59:57Z

Table recognition is a token-intensive task, and I think a dedicated tokenizer could make the output more compact and improve both inference speed and training performance.
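To illustrate the point about token counts, here is a minimal sketch in plain Python. It is not the StructEqTable or Pix2Struct tokenizer: `greedy_tokenize`, `GENERIC_VOCAB`, and `TABLE_VOCAB` are hypothetical names, and the vocabularies are toy examples chosen only to show how adding table-specific LaTeX tokens can shorten the sequence.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; falls back to single characters."""
    max_len = max((len(token) for token in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies, for illustration only.
GENERIC_VOCAB = {"begin", "end", "tabular", "line"}
TABLE_VOCAB = GENERIC_VOCAB | {
    "\\begin{tabular}",
    "\\end{tabular}",
    "\\hline",
    "\\\\",
    " & ",
}

# A tiny LaTeX table used as sample input.
SAMPLE = "\\begin{tabular}{ll}\\hline a & b \\\\\\hline\\end{tabular}"

generic_tokens = greedy_tokenize(SAMPLE, GENERIC_VOCAB)
table_tokens = greedy_tokenize(SAMPLE, TABLE_VOCAB)

print("generic vocabulary:", len(generic_tokens), "tokens")
print("table vocabulary:  ", len(table_tokens), "tokens")
```

With the table-specific entries, whole commands such as `\begin{tabular}` become single tokens, so the decoder has fewer steps to generate for the same table. Whether this helps in practice would depend on the real vocabulary and training data.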

PrinceVictor · 2024-09-02T07:05:18Z

Thank you for your valuable suggestion. We will continue to improve the model for better performance.

JasonKitty · 2024-09-04T03:21:14Z

Questions regarding data preparation:

The article mentions that the training data consists of 500k articles. How many table image-LaTeX pairs were used to train StructEqTable?
Table LaTeX often contains cross-references (e.g., \ref{}, \cite{}, \citep{}) and non-unique expressions (e.g., \textbf{} and \bf{}, different formatting controls that produce similar visual effects). Will such noise negatively affect the model's learning?
Was any data cleaning performed on the table LaTeX?
How were the table images annotated with the corresponding LaTeX text?
What are the shortcomings of this model?

Looking forward to your reply.
Thank you!
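
To make the noise question concrete, here is a rough sketch of the kind of cleaning I have in mind. It is only an illustration, not the authors' pipeline: `normalize_table_latex` and the regex rules are hypothetical, and treating \bf{} and {\bf ...} as equivalent to \textbf{} is an assumption based on their similar visual effect.

```python
import re

# Hypothetical cleaning rules, for illustration only.
# Matches \ref{...} and \cite{...} variants such as \citep{...} and \citet{...}.
CROSS_REF = re.compile(r"\\(?:ref|cite[a-zA-Z]*)\{[^{}]*\}")
# Matches the old-style group form {\bf text}.
OLD_BOLD = re.compile(r"\{\\bf\s+([^{}]*)\}")
# Matches the braced form \bf{text}.
BF_BRACED = re.compile(r"\\bf\{([^{}]*)\}")


def normalize_table_latex(source):
    """Drop cross-references and map bold variants to \\textbf{...}."""
    text = CROSS_REF.sub("", source)
    text = OLD_BOLD.sub(r"\\textbf{\1}", text)
    text = BF_BRACED.sub(r"\\textbf{\1}", text)
    # Collapse the runs of spaces left behind by removed commands.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw = (
    r"\textbf{Model} & {\bf Acc} & \bf{F1} \\ "
    r"Ours~\citep{x2024} & 0.9 & see \ref{tab:a} \\"
)
print(normalize_table_latex(raw))
```

After this pass, visually similar rows map to one canonical string, so the model would see a single target for each rendering instead of several.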
