Data preparation phase

The all2txt module provides scripts such as pdf2txt and epub2txt that convert unstructured or semi-structured files into plain text (txt). It is designed to keep the extracted text in a coherent reading order for single-column and double-column layouts, and for Chinese text interspersed with charts.

The parsed elements are labelled with types including "Table", "FigureCaption", "NarrativeText", "ListItem", "Title" (chapter titles), "Address" (e-mail addresses), "PageBreak", "Header" (page headers), "Footer" (page footers), "UncategorizedText" (for example, the vertical arXiv number in the page margin), "Image", and "Formula", among others. The tool scripts offer two output modes: keeping the full text, or saving the parsed elements grouped by category.

Take pdf2txt as an example (the same goes for epub2txt):

Keep the full text (default)

python pdf2txt.py -i "input_path" -o "output_file"

The result is

Fig. 1: The overall architecture of LayoutParser...
Fig. 2: The relationship between the three types of...
Fig. 3: Layout detection and OCR results visualization...
[1] Abadi, M., Agarwal, A., Barham, P., Brevdo...
[2] Alberti, M., Pondenkandath, V., W¨ursch...
[3] Antonacopoulos, A., Bridson, D., Papadopoulos...

Save the elements grouped by category

python pdf2txt.py -i "input_path" -o "output_file" --process_all

The result is

{
    "FigureCaption":[
        "Fig. 1: The overall architecture of LayoutParser...",
        "Fig. 2: The relationship between the three types of...",
        "Fig. 3: Layout detection and OCR results visualization..."
    ],
    "ListItem":[
        "[1] Abadi, M., Agarwal, A., Barham, P., Brevdo...",
        "[2] Alberti, M., Pondenkandath, V., W¨ursch...",
        "[3] Antonacopoulos, A., Bridson, D., Papadopoulos..."
    ]
}

With the category-grouped output, users can choose which element types to extract.
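As a minimal sketch of that selection step, the Python snippet below picks out chosen categories from a dictionary shaped like the category-grouped result shown above. The helper name `select_categories` is hypothetical and not part of the all2txt scripts, and the snippet assumes the categorized output can be loaded as JSON with category names as keys and lists of strings as values.

```python
import json


def select_categories(parsed, wanted):
    """Return the text items of the requested categories, in the order requested.

    Hypothetical helper, not part of all2txt. Categories that are absent
    from the parsed data are skipped silently.
    """
    selected = []
    for category in wanted:
        selected.extend(parsed.get(category, []))
    return selected


# Sample data shaped like the category-grouped result shown above.
sample = json.loads("""
{
    "FigureCaption": [
        "Fig. 1: The overall architecture of LayoutParser..."
    ],
    "ListItem": [
        "[1] Abadi, M., Agarwal, A., Barham, P., Brevdo..."
    ]
}
""")

# Keep only the figure captions.
captions = select_categories(sample, ["FigureCaption"])
print(captions)
```

If the output file written by the script is plain JSON (an assumption here), the same helper could be applied to the result of `json.load` on that file instead of the inline sample.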
