Fixing the parsing of scientific notation and non-ASCII characters #744

loesinghaus opened this issue Dec 4, 2024 · 2 comments

@loesinghaus
Contributor

Because PDF encodings are a mess, parsing often leads to poor results with non-ASCII text of all kinds. For example, scientific notation in the form 10^3 is typically parsed as "103" by PyMuPDF and as "10 3" by grobid, which is really unfortunate when one tries to extract quantitative information (though the latter is fixable by post-processing). Similarly, parsing non-ASCII characters is a bit of a lottery. For example, in one paper I was looking at, IFN-β was parsed as IFN-\x02 by PyMuPDF and as IFN-␤ by grobid.
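
To make the PyMuPDF symptom concrete, here is a minimal sketch ("paper.pdf" stands in for any affected paper):

import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")  # a paper that typesets 10^3 with a superscript
text = doc[0].get_text()
# superscripts are flattened onto the baseline, so "10^3" comes out as "103"
print("103" in text)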

A few questions on this:

  1. I think grobid results are generally more fixable than PyMuPDF results. The newest paper for this repo uses grobid for some of its results - are you planning to add your own implementation of grobid parsing to this repo?
  2. As far as I can tell, there's currently no post-processing for the PyMuPDF parsing. In the recent paper, there's also no mention of post-processing for grobid results. Do you currently plan to implement post-processing to fix some of this?

Some of these things would be relatively easy to fix. For example, when using grobid, scientific notation could be rescued with something like this:

import re

def rescue_scientific_notation(text):
    # re-insert the caret between "10" and its exponent, e.g. "x 10 3" -> "x 10^3"
    text = re.sub(r'([*x]?)(10) (\d+)', r'\1\2^\3', text)
    return text
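
For example (the numbers here are made up; note the pattern will also join any "10 <digits>" pair, such as "10 12 samples", so some false positives are possible):

print(rescue_scientific_notation("a titer of 3 x 10 5 PFU/mL"))
# -> "a titer of 3 x 10^5 PFU/mL"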

Similarly, weird non-ASCII encodings could probably be rescued by asking ChatGPT to guess the correct character (this is a really crude example implementation):

import asyncio

from openai import AsyncOpenAI
from pydantic import BaseModel

def get_client():
    api_key = "an-api-key"
    # the async client lets asyncio.gather actually run requests concurrently
    return AsyncOpenAI(api_key=api_key)

def extract_gpt_response(response):
    # pull the parsed pydantic object out of the completion
    return response.choices[0].message.parsed

async def fetch_structured_response(query, model, response_format, system_prompt, client):
    response = await client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            query,
        ],
        response_format=response_format,
    )
    return response

async def get_structured_responses(queries, model, response_format, system_prompt, client):
    # run all requests concurrently
    responses = await asyncio.gather(
        *(fetch_structured_response(query, model, response_format, system_prompt, client)
          for query in queries)
    )
    return [extract_gpt_response(r) for r in responses]

def return_queries(contents):
    return [{"role": "user", "content": content} for content in contents]

def replace_text(text, replacements):
    # apply each replacement pair in order
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

class TranslationDictionary(BaseModel):
    original_char: list[str]
    new_char: list[str]

system_prompt_conversion = """
You are looking at a text from a publication that contains potentially falsely encoded non-ASCII characters.
You are given examples in the format character, ord(char), hex(ord(char)), and the 20 characters before and after the character.
Your task is to create a python dictionary that maps each non-ASCII character to its correct ASCII equivalent.
Note that you can also match to the original character. Only do so if it genuinely makes sense in context.
If you cannot find a match, replace the character with U+2205.
"""

async def preclean_parsed_text(text, client):
    ascii_chars = [chr(i) for i in range(128)]
    greek_chars = [chr(i) for i in range(945, 970)]
    other_known_chars = ["©", "±", "°", "′", "Δ", "ï", "ö", "ü", "ä", "\n", "\r", "\t"]
    known_chars = ascii_chars + greek_chars + other_known_chars

    # fix some known replacements
    known_replacements = {"×": "x", "ï": "i", "  ": " ", "u ¨": "ü", "a ¨": "ä", "o ¨": "ö"}
    text = replace_text(text, known_replacements)

    # rescue scientific notation (only works for grobid parsing; defined above)
    text = rescue_scientific_notation(text)

    # collect every unknown character together with +/-20 characters of context
    unusual_chars = []
    for char_index, char in enumerate(text):
        if char not in known_chars:
            context = text[max(0, char_index - 20):char_index + 20]
            unusual_chars.append((char, ord(char), hex(ord(char)), context))
    # order by the ord() value so examples of the same character are adjacent
    unusual_chars.sort(key=lambda x: x[1])

    # keep at most 10 examples per character
    unusual_chars_filtered = []
    curr_char = ""
    i = 0
    for char, e1, e2, e3 in unusual_chars:
        if char != curr_char:
            curr_char = char
            i = 0
        if i < 10:
            unusual_chars_filtered.append((char, e1, e2, e3))
            i += 1

    queries = return_queries([str(unusual_chars_filtered)])
    responses = await get_structured_responses(
        queries,
        model="gpt-4o",
        response_format=TranslationDictionary,
        system_prompt=system_prompt_conversion,
        client=client,
    )
    response = responses[0]

    # build the replacement dictionary from the model's parallel lists
    replacement_dict = dict(zip(response.original_char, response.new_char))

    # apply it to the text
    text = replace_text(text, replacement_dict)

    return text
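
Hypothetical usage (assumes a valid API key; the input string mimics the grobid output described above):

async def main():
    client = get_client()
    raw = "IFN-␤ expression increased 3 x 10 2 -fold"
    print(await preclean_parsed_text(raw, client))

asyncio.run(main())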

This correctly fixes most of the worst offenders and the interferon names:

replacement_dict = {
    'ϩ': '+', 'Ϫ': '-', 'ϫ': 'x', 'Ϯ': '+', 'Ͻ': '<', 'Ͼ': '>', 'Ј': "'",
    '؉': '+', '؊': '-', 'ء': 'U+2205', '\n': '\n', '†': '\n', '‡': '\n',
    '␣': 'α', '␤': 'β', '␥': 'γ',
}
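
Applied via the replace_text helper from above:

print(replace_text("IFN-␣, IFN-␤, and IFN-␥", {'␣': 'α', '␤': 'β', '␥': 'γ'}))
# -> "IFN-α, IFN-β, and IFN-γ"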

Are there plans to implement parsing post-processing?

dosubot bot commented Dec 4, 2024

@loesinghaus Hello! I'm here to help you with any issues or questions you have. Let's work together to solve the problem.

Currently, there is no specific mention of post-processing for handling scientific notation or non-ASCII characters in the paper-qa codebase. The existing implementation uses libraries like PyMuPDF and grobid for parsing, but there is no documented post-processing logic for correcting issues like those you've described [1].

Your suggestion to implement post-processing, such as using regular expressions to correct scientific notation or leveraging AI models to handle non-ASCII characters, could be a valuable addition. If you are considering contributing these improvements, integrating them into the existing parsing functions or creating a dedicated post-processing module would be a logical approach.

To continue talking to Dosu, mention @dosu.

dosubot bot added the question label on Dec 4, 2024
@jamesbraza
Collaborator

Hi @loesinghaus, yes, we can add a post-parsing callback; it sounds like a good idea. It will need to be async so we can call LLMs in it.

We have started a callback section here: https://github.com/Future-House/paper-qa/blob/v5.7.0/paperqa/settings.py#L509-L536

Feel free to open a PR for this
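
For reference, a rough sketch of the shape such a hook could take (a hypothetical signature, not the actual paper-qa callback interface; see the settings link above for the real extension point):

from collections.abc import Awaitable, Callable

# an async post-parsing hook: takes the parsed text, returns the cleaned text
PostParseCallback = Callable[[str], Awaitable[str]]

async def clean_parsed_text(text: str) -> str:
    # chain the fixes proposed earlier in this thread
    text = rescue_scientific_notation(text)
    return await preclean_parsed_text(text, get_client())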
