Problem with superscript/subscript #1005

Open
keto33 opened this issue Apr 30, 2023 · 5 comments
Labels
enhancement pdfalto Issue related to pdfalto

Comments


keto33 commented Apr 30, 2023

The documentation states PDFALTO recognises superscript/subscript.

First, how does GROBID format superscript/subscript? I have not seen <sub> or <sup> in the output.

Second, in my experience, superscripts/subscripts are printed separated by spaces, even in formula blocks, e.g. H 2 O, with no markup to distinguish them. Is this the intended behaviour?

Third, I noticed that superscript/subscript is sometimes misplaced. For example, MnO<sub>2</sub> film is printed as MnO film 2. I can share examples but cannot upload the PDFs as I am not the copyright holder.

I use GROBID 0.7.2 with the following command:

curl -sS --form [email protected] --form segmentSentences=1 --form includeRawCitations=1 \
--form includeRawAffiliations=1 --form teiCoordinates=persName --form teiCoordinates=figure \
--form teiCoordinates=ref --form teiCoordinates=biblStruct --form teiCoordinates=formula \
127.0.0.1:8070/api/processFulltextDocument > output.xml

lfoppiano (Collaborator) commented May 1, 2023

@keto33 pdfalto does indeed recognise superscript/subscript. GROBID does too (the information can be accessed via the LayoutToken objects), but it does not yet output it in the XML (this change is currently being worked on in PR #936). The change is quite complex and will take some time to be merged into master, but you can try it out nevertheless.
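
To illustrate what "serializing" the script information could look like, here is a minimal Python sketch. It is not the code from PR #936: the `Token` class, its `script` field and the choice of `<hi rend="...">` as output markup are all assumptions made for the example, standing in for whatever the LayoutToken objects and the TEI serializer actually do.

```python
from dataclasses import dataclass
from itertools import groupby
from xml.sax.saxutils import escape


@dataclass
class Token:
    """Hypothetical stand-in for a layout token (not GROBID's LayoutToken)."""
    text: str
    script: str = "normal"  # "normal", "subscript" or "superscript"


def serialize(tokens):
    """Join tokens, wrapping runs of sub/superscript tokens in <hi rend="...">.

    The <hi rend="..."> markup is an assumed output format for illustration.
    """
    parts = []
    # Consecutive tokens sharing the same script flag are merged into one run.
    for script, run in groupby(tokens, key=lambda t: t.script):
        text = escape("".join(t.text for t in run))
        if script == "normal":
            parts.append(text)
        else:
            parts.append(f'<hi rend="{script}">{text}</hi>')
    return "".join(parts)


tokens = [Token("MnO"), Token("2", "subscript"), Token(" film")]
print(serialize(tokens))
```

Grouping consecutive tokens keeps a multi-token subscript such as 1+x inside a single element instead of emitting one element per token.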

keto33 (Author) commented May 1, 2023

@lfoppiano, I had seen #936 before when I was looking for issues related to subscript. However, I did not quite understand how to use it. Is it already implemented in 0.7.3?

kermitt2 (Owner) commented May 1, 2023

Hi @keto33

> I had seen #936 before when I was looking for issues related to subscript. However, I did not quite understand how to use it. Is it already implemented in 0.7.3?

Serialization of superscript/subscript in the TEI XML is implemented in PR #936, but it is planned to be merged in version 0.8.0, not 0.7.3. The superscript/subscript part should work well, but in this branch serializing bold/italic is more complicated and requires more testing.

> Second, in my experience, superscripts/subscripts are printed separated by spaces, even in formula blocks, e.g. H 2 O, with no markup to distinguish them. Is this the intended behaviour?

Currently yes.

> Third, I noticed that superscript/subscript is sometimes misplaced. For example, MnO2 film is printed as MnO film 2. I can share examples but cannot upload the PDFs as I am not the copyright holder.

Yes, that is possible: some document editors/publishers generate PDFs in which the element flow does not follow the reading order for some special tokens.
In pdfalto, blocks can be re-ordered based on their spatial distribution, but within a block the PDF flow order is currently left unmodified, because re-ordering this flow creates many more errors than it solves. However, we could add a special process in pdfalto just for superscript/subscript if the problem is more frequent on these specific elements.
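
A rough Python sketch of what such a special process might do, for illustration only. It is not pdfalto code: the `Tok` class, its fields and the single-line assumption are all hypothetical. Baseline tokens keep their PDF flow order, and only the tokens flagged as superscript/subscript are re-inserted according to their horizontal position.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical token: text, left x coordinate, and a script flag."""
    text: str
    x: float
    script: bool = False  # True for superscript/subscript tokens


def reinsert_scripts(tokens):
    """Re-place only script tokens by x position; assumes a single text line.

    Baseline tokens are left in their original (PDF flow) order.
    """
    result = [t for t in tokens if not t.script]
    scripts = sorted((t for t in tokens if t.script), key=lambda t: t.x)
    for s in scripts:
        # Insert before the first token that starts to the right of s,
        # or at the end of the line if there is none.
        idx = next((i for i, t in enumerate(result) if t.x > s.x), len(result))
        result.insert(idx, s)
    return result


# PDF flow order as in the reported case: "MnO film 2"
flow = [Tok("MnO", 10.0), Tok("film", 50.0), Tok("2", 42.0, script=True)]
print([t.text for t in reinsert_scripts(flow)])
```

A real implementation would also have to check the vertical position so that a script token is matched to the correct line; that part is left out here.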

Could you maybe share the landing page of the articles where you saw these problems? I might have a subscription to access them and reproduce the error.

keto33 (Author) commented May 1, 2023

@kermitt2, thanks for following up. I encountered the problem of subscript/superscript in several papers. More complicated subscripts (e.g., 1+x) are often totally ignored.

This paper https://doi.org/10.1016/S0167-2738(00)00327-1 has most of the problems I mentioned. You just need to take a look at the abstract. If you do not have a subscription, I can post it here. I just didn't want to upload copyrighted materials on your project page without your permission.

And if you are interested in more examples, I can provide them.

lfoppiano (Collaborator) commented

@keto33 There is a version that is accessible without subscription here: https://zenodo.org/record/1259881

In general, the subscripts are recognized; however, it does depend on how the PDF document was constructed. Are these papers all from Elsevier?
In this particular case, it seems that the subscripts are placed on a separate line, and not in the correct order.

See the picture below: if you highlight the text, you can see that the subscripts are ordered after the line and not within it:

image

If I'm not wrong, this issue should be filed in pdfalto.

@lfoppiano lfoppiano added pdfalto Issue related to pdfalto enhancement and removed pdfalto Issue related to pdfalto labels Aug 16, 2023