processFulltextDocument fails on 0.23% arXiv PDFs #1113

Open
MarksonChen opened this issue May 9, 2024 · 6 comments
Labels: bug, implemented

@MarksonChen

MarksonChen commented May 9, 2024

I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed.

Running on macOS (M2 chip)
Java version: 17.0.10
Server started with Gradle (./gradlew run)

An example error log:

ERROR [2024-05-09 13:13:55,538] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
! at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
! at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
! at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
! at java.base/java.util.Objects.checkIndex(Objects.java:359)
! at java.base/java.util.ArrayList.get(ArrayList.java:427)
! at org.grobid.core.data.Note.getPageNumber(Note.java:77)
! at org.grobid.core.document.TEIFormatter.lambda$toTEITextPiece$0(TEIFormatter.java:1460)
! at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
! at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
! at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
! at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
! at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
! at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
! at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1461)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 78 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2708)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:320)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:842)
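
For readers following the trace: the innermost GROBID frame, Note.getPageNumber, ends in ArrayList.get on an empty list. The sketch below reproduces that failure mode in isolation and shows one possible defensive accessor. NoteLike, tokenPages, firstPageUnchecked and firstPage are hypothetical names invented for illustration; this is not GROBID's Note class and not necessarily the fix that was applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

public class EmptyListGuardSketch {

    /** Hypothetical stand-in for a note; NOT GROBID's actual Note class. */
    static class NoteLike {
        // Page number of each token attached to the note (assumed layout).
        private final List<Integer> tokenPages = new ArrayList<>();

        void addTokenPage(int page) {
            tokenPages.add(page);
        }

        // Unguarded access: throws IndexOutOfBoundsException when the list is empty.
        int firstPageUnchecked() {
            return tokenPages.get(0);
        }

        // Guarded access: an empty result instead of an exception.
        OptionalInt firstPage() {
            return tokenPages.isEmpty()
                    ? OptionalInt.empty()
                    : OptionalInt.of(tokenPages.get(0));
        }
    }

    public static void main(String[] args) {
        NoteLike empty = new NoteLike();
        try {
            empty.firstPageUnchecked();
        } catch (IndexOutOfBoundsException e) {
            // Same exception type as in the log above.
            System.out.println("unguarded: " + e.getMessage());
        }
        System.out.println("guarded present? " + empty.firstPage().isPresent());
    }
}
```

A guard like this only hides the symptom; a note with no tokens may still point to a problem earlier in the pipeline.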

The 50 PDFs that failed:

https://arxiv.org/pdf/2202.03169
https://arxiv.org/pdf/2007.10408
https://arxiv.org/pdf/2008.08076
https://arxiv.org/pdf/2203.00397
https://arxiv.org/pdf/2202.00145
https://arxiv.org/pdf/2110.13423
https://arxiv.org/pdf/2006.16218
https://arxiv.org/pdf/2305.01868
https://arxiv.org/pdf/2206.11939
https://arxiv.org/pdf/1711.05715
https://arxiv.org/pdf/2110.11222
https://arxiv.org/pdf/2006.13025
https://arxiv.org/pdf/1902.00450
https://arxiv.org/pdf/2109.04212
https://arxiv.org/pdf/2105.14849
https://arxiv.org/pdf/cs/9906002
https://arxiv.org/pdf/2101.09398
https://arxiv.org/pdf/1911.00536
https://arxiv.org/pdf/1912.02762
https://arxiv.org/pdf/2104.07857
https://arxiv.org/pdf/2106.15093
https://arxiv.org/pdf/1901.09401
https://arxiv.org/pdf/2201.10129
https://arxiv.org/pdf/2010.04879
https://arxiv.org/pdf/1206.5241
https://arxiv.org/pdf/2203.14101
https://arxiv.org/pdf/1905.06214
https://arxiv.org/pdf/2205.05789
https://arxiv.org/pdf/1810.00953
https://arxiv.org/pdf/1910.11856
https://arxiv.org/pdf/1501.02876
https://arxiv.org/pdf/2202.01987
https://arxiv.org/pdf/2303.02186
https://arxiv.org/pdf/2010.05761
https://arxiv.org/pdf/2204.11918
https://arxiv.org/pdf/2002.12361
https://arxiv.org/pdf/1810.07311
https://arxiv.org/pdf/1905.03817
https://arxiv.org/pdf/1901.07846
https://arxiv.org/pdf/2202.03798
https://arxiv.org/pdf/1711.01244
https://arxiv.org/pdf/2006.03040
https://arxiv.org/pdf/2004.10964
https://arxiv.org/pdf/1803.00590
https://arxiv.org/pdf/1612.06109
https://arxiv.org/pdf/1704.03651
https://arxiv.org/pdf/1610.09534
https://arxiv.org/pdf/2202.03555
https://arxiv.org/pdf/2008.04990

@kermitt2
Owner

kermitt2 commented May 9, 2024

Hi @MarksonChen

This should normally be fixed by #1075.
Are you using the latest master version?

@MarksonChen
Author

Hi kermitt2,

Thank you for your reply. I was using 0.8.0.

However, after switching to the latest master version (using git clone https://github.com/kermitt2/grobid.git), 49 out of 50 papers listed above still cannot be parsed with processFulltextDocument.

kermitt2 added a commit that referenced this issue May 9, 2024
kermitt2 added a commit that referenced this issue May 9, 2024
@kermitt2
Owner

kermitt2 commented May 9, 2024

Thank you @MarksonChen for checking and reporting these arXiv error cases.

Indeed, the problem is not related to the issue addressed by #1075, sorry. I just pushed a quick fix, and these files should now work too.

@MarksonChen
Author

Hi, kermitt2, thank you so much for your speedy fix! The amount of continual work put into this open-source project has been remarkable. All 22085 fetchable arXiv PDFs can be parsed successfully with processFulltextDocument.

@lfoppiano
Collaborator

@kermitt2 I have a sense of déjà vu about this issue from working on PRs #1097 and #1099.

As far as I remember, this happens when a note with the same "label" is identified more than once in the text. When the list of notes is collected from the text, calling int idx = clusterTokens.indexOf(matching.get()); without updating the search position returns the same index every time, so the same note ends up being collected repeatedly with the same positions.

int idx = clusterTokens.indexOf(matching.get());

For the first article in the list, 2202.03169, the failure occurs because there are three notes with the same intervals. Maybe we could just filter them out as an additional precaution.
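
A minimal sketch of the behaviour described above, assuming plain strings in place of GROBID's layout tokens. List.indexOf always returns the first match, so looking up a repeated label yields the same index again; the second method shows the "filter them" precaution by dropping repeated positions. The class and method names are made up for illustration and do not come from the GROBID code base.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DuplicateIntervalSketch {

    // Mirrors the pattern idx = tokens.indexOf(label): every lookup restarts
    // from the beginning of the list, so a repeated label maps to one index.
    static List<Integer> naivePositions(List<String> tokens, List<String> labels) {
        List<Integer> positions = new ArrayList<>();
        for (String label : labels) {
            positions.add(tokens.indexOf(label));
        }
        return positions;
    }

    // Precaution: keep each position once, preserving first-seen order.
    static List<Integer> withoutDuplicates(List<Integer> positions) {
        return new ArrayList<>(new LinkedHashSet<>(positions));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("This", "note", "1", ",", "and", "this",
                "note", "2", ",", "but", "back", "to", "the", "first", "note", "1");
        List<String> labels = List.of("1", "2", "1");

        List<Integer> naive = naivePositions(tokens, labels);
        System.out.println(naive);                    // [2, 7, 2]
        System.out.println(withoutDuplicates(naive)); // [2, 7]
    }
}
```

Filtering removes the duplicated interval but loses the second occurrence of the label; it does not locate it.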

I'm also writing down some additional information here, as I will forget it within the hour.
I've checked just one example, which is quite messed up:

image
TakeasanexamplethesetupinFigure2,whereaballcaxn,xsetup2oRfSercetpiorens3en.1t,thceauosbaslerfvaacttiornswatithimouetsateupntiqanudesetoTfakeasanexamplethes1e0tu2pin3F.2ig.uLrea2r,nwinhgerweitahbIanlltecravnentionsoverTime
t
t+
t t+1 t+1 106 Note that when two variables Ci and Cj can only be inter-
We consider a dataset D of tuples {x , x , I } where 

@lfoppiano lfoppiano reopened this Jun 10, 2024
@lfoppiano
Collaborator

I'm reopening this to follow up on my last comment.

The duplicated interval is avoided by narrowing the search space of indexOf, that is, by reducing the list of tokens after each match.
However, I noticed that with

labels2Notes.put(note.getLabel(), note);
if the label is repeated in the same sentence, we overwrite the note. I think it would work fine to use the note identifier as the key instead, since it should be unique among notes.
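
To illustrate the overwrite, here is a small sketch under stated assumptions: NoteLike and its identifier and label fields are hypothetical stand-ins, not GROBID's Note class, and the identifier values are invented. Map.put replaces any existing entry for the same key, so a map keyed by label keeps only the last note carrying that label, whereas a map keyed by a unique identifier keeps all of them.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NoteKeySketch {

    /** Hypothetical stand-in for a note; NOT GROBID's actual Note class. */
    static final class NoteLike {
        final String identifier; // assumed unique per note
        final String label;      // may repeat, e.g. "1" on two different notes

        NoteLike(String identifier, String label) {
            this.identifier = identifier;
            this.label = label;
        }
    }

    // Keyed by label: a later note with the same label replaces the earlier one.
    static Map<String, NoteLike> indexByLabel(List<NoteLike> notes) {
        Map<String, NoteLike> map = new LinkedHashMap<>();
        for (NoteLike note : notes) {
            map.put(note.label, note);
        }
        return map;
    }

    // Keyed by identifier: every note keeps its own entry.
    static Map<String, NoteLike> indexByIdentifier(List<NoteLike> notes) {
        Map<String, NoteLike> map = new LinkedHashMap<>();
        for (NoteLike note : notes) {
            map.put(note.identifier, note);
        }
        return map;
    }

    public static void main(String[] args) {
        List<NoteLike> notes = List.of(
                new NoteLike("n0", "1"),
                new NoteLike("n1", "2"),
                new NoteLike("n2", "1"));

        System.out.println(indexByLabel(notes).size());      // 2: "n0" was replaced by "n2"
        System.out.println(indexByIdentifier(notes).size()); // 3
    }
}
```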

I'm submitting the PR with two fixes:

  1. Avoid collecting the same position in the text when the note label is the same. For example, given "This note1, and this note2, but back to the first note1", we would previously collect the offset of the first 1 label twice.
  2. Update labels2Notes so that it is keyed by the note identifier instead of the label.
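
The first fix can be sketched as follows, assuming plain strings in place of layout tokens and using the example sentence from item 1. Each search runs on the part of the token list after the previous match, so a repeated label resolves to its next occurrence rather than the first one. The names are illustrative only and the actual PR may implement this differently.

```java
import java.util.ArrayList;
import java.util.List;

public class AdvancingSearchSketch {

    // Finds each label in order, restarting the search just after the
    // previous match. Returns -1 for a label that is not found.
    static List<Integer> advancingPositions(List<String> tokens, List<String> labels) {
        List<Integer> positions = new ArrayList<>();
        int start = 0;
        for (String label : labels) {
            // indexOf on the sub-list is relative to 'start'.
            int relative = tokens.subList(start, tokens.size()).indexOf(label);
            if (relative < 0) {
                positions.add(-1);
                continue;
            }
            int absolute = start + relative;
            positions.add(absolute);
            start = absolute + 1; // shrink the search space for the next label
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("This", "note", "1", ",", "and", "this",
                "note", "2", ",", "but", "back", "to", "the", "first", "note", "1");
        List<String> labels = List.of("1", "2", "1");

        System.out.println(advancingPositions(tokens, labels)); // [2, 7, 15]
    }
}
```

This sketch assumes the labels arrive in reading order; labels collected out of order would need a different strategy.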

lfoppiano added a commit that referenced this issue Jun 12, 2024
@lfoppiano lfoppiano added this to the 0.8.1 milestone Jun 12, 2024
@lfoppiano lfoppiano added the implemented and bug labels Jun 12, 2024