Because PDF encodings are a mess, parsing often produces poor results for non-ASCII text of all kinds. For example, scientific notation of the form 10^3 is typically parsed as "103" by PyMuPDF and as "10 3" by grobid, which is really unfortunate when one tries to extract quantitative information (though the latter is fixable by post-processing). Similarly, non-ASCII characters are a bit of an RNG: in one paper I was looking at, IFN-β was parsed as IFN-\x02 by PyMuPDF and as IFN- by grobid.
A few questions on this:
1. I think grobid results are generally more fixable than PyMuPDF results. The newest paper for this repo uses grobid for some of its results - are you planning to add your own implementation of grobid parsing to this repo?
2. As far as I can tell, there's currently no post-processing for the PyMuPDF parsing, and the recent paper also doesn't mention post-processing for grobid results. Do you currently plan to implement post-processing to fix some of this?
Some of these things would be relatively easy to fix. For example, when using grobid, scientific notation could be rescued with something like this:
import re

def rescue_scientific_notation(text):
    # Replace the space grobid inserts between "10" and its exponent with "^",
    # e.g. "10 3" -> "10^3"; \d+ avoids touching a bare "10 " with no digits after.
    text = re.sub(r'([*x]?)(10) (\d+)', r'\1\2^\3', text)
    return text
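On a string like "5 x10 3 cells" (a made-up example), this returns "5 x10^3 cells". It will still misfire on prose where "10" happens to precede a standalone number, so it is probably best applied only to grobid output rather than arbitrary text.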
Similarly, weird non-ASCII encodings could probably be rescued by asking ChatGPT to guess the correct character (this is a really crude example implementation):
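Something along these lines, as a sketch (assuming the openai v1 Python client; the model name, prompt wording, and control-character heuristic are all placeholders rather than a tested implementation):

import re
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY env var

client = OpenAI()

def rescue_non_ascii(text, model="gpt-4o-mini"):  # model choice is a placeholder
    # Control characters such as \x02 are a telltale sign of a glyph that
    # failed to decode (often a Greek letter in biology papers).
    if not re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", text):
        return text
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "The following text was extracted from a scientific PDF. "
                    "Control characters mark glyphs that failed to decode, "
                    "often Greek letters. Replace each control character "
                    "with your best guess and change nothing else."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content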
This approach correctly fixes most of the worst offenders, including the interferon names.

Are there plans to implement parsing post-processing?

@loesinghaus Hello! I'm here to help you with any issues or questions you have. Let's work together to solve the problem.

Currently, there is no specific mention of post-processing for handling scientific notation or non-ASCII characters in the paper-qa codebase. The existing implementation uses libraries like PyMuPDF and grobid for parsing, but there is no documented post-processing logic for correcting issues like those you've described [1].

Your suggestion to implement post-processing, such as using regular expressions to correct scientific notation or leveraging AI models to handle non-ASCII characters, could be a valuable addition. If you are considering contributing these improvements, integrating them into the existing parsing functions or creating a dedicated post-processing module would be a logical approach.
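For instance, a dedicated module could be as simple as chaining the two rescue functions sketched above (a hypothetical hook, not an existing paper-qa API):

def postprocess_parsed_text(text):
    # Hypothetical post-processing hook chaining the sketches above:
    # cheap regex fixes first, then the (slower, paid) LLM pass, which
    # is a no-op when no control characters remain.
    text = rescue_scientific_notation(text)
    text = rescue_non_ascii(text)
    return text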