Default pipeline generates many "unreadable" documents #433

arnaudstiegler · 2024-05-08T13:18:42Z

Hi,
I use Augraphy extensively but I've noticed that:

- the default pipeline can be too destructive on my documents, to the point where a human cannot read the text on them (see example below)
- the only way to get a "milder" augmentation pipeline is to create a custom pipeline, which requires listing out all the augmentations and is a bit cumbersome to experiment with (so many options).

It'd be great to provide an option like "mild/strong" for the default pipeline, to give some control over it without needing to deep-dive into the internals of the package.

For instance, this doc is almost unreadable, and training models on unreadable docs can lead to really damaging behaviors, such as completely hallucinating answers on docs that they can't read.
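To make the request concrete, here is a minimal sketch of what a "mild/strong" switch could look like. Everything in it is hypothetical: `StubAugmentation`, `STRENGTH_PRESETS`, and `apply_strength` are not part of Augraphy, and the preset values and augmentation names are placeholders. The only assumption about real augmentations is that each one exposes its application probability as an attribute named `p`.

```python
from dataclasses import dataclass

# Hypothetical presets: per-augmentation probability for each strength level.
STRENGTH_PRESETS = {"mild": 0.1, "strong": 0.3}


@dataclass
class StubAugmentation:
    """Stand-in for a real augmentation; only the `p` attribute matters here."""
    name: str
    p: float = 0.2


def apply_strength(augmentations, strength):
    """Set `p` on every augmentation according to the chosen preset."""
    if strength not in STRENGTH_PRESETS:
        raise ValueError(
            f"unknown strength {strength!r}; expected one of {sorted(STRENGTH_PRESETS)}"
        )
    for aug in augmentations:
        aug.p = STRENGTH_PRESETS[strength]
    return augmentations


# Placeholder names, not real Augraphy augmentations.
phase = [StubAugmentation("blur"), StubAugmentation("ink_bleed")]
apply_strength(phase, "mild")
print([(a.name, a.p) for a in phase])
```

A single knob like this would keep the list of augmentations fixed and only change how often each one fires, which is the kind of control I'm after.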

kwcckw · 2024-05-12T07:45:11Z

Hi,

Thanks for your suggestion. We will consider it for a future update.

Alternatively, you may try to create a new pipeline or use some other predefined pipelines. There are 11 types of predefined pipelines, from pipeline_archetype1 to pipeline_archetype11:

https://github.com/sparkfish/augraphy/blob/dev/augraphy/default/pipeline.py
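A rough sketch for looking up each archetype by name. It assumes the functions are importable from the `augraphy.default.pipeline` module (inferred from the path of the linked file) under the names `pipeline_archetype1` through `pipeline_archetype11`. `archetype_names` and `load_archetype` are helper names made up for this example, and how each archetype function is called is deliberately not assumed; check the linked source for that.

```python
import importlib

# Module path inferred from the linked file (augraphy/default/pipeline.py).
ARCHETYPE_MODULE = "augraphy.default.pipeline"


def archetype_names(first=1, last=11):
    """Return the names pipeline_archetype1 .. pipeline_archetype11."""
    return [f"pipeline_archetype{i}" for i in range(first, last + 1)]


def load_archetype(name, module_name=ARCHETYPE_MODULE):
    """Look up one archetype function by name; needs Augraphy installed."""
    module = importlib.import_module(module_name)
    return getattr(module, name)


print(archetype_names())
# With Augraphy installed, something like:
#   build = load_archetype("pipeline_archetype3")
# The call signature of `build` is not assumed here; see the linked source.
```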

arnaudstiegler · 2024-05-12T18:47:04Z

Thanks for the answer! I didn't know about the predefined pipelines, not sure whether I missed them in the documentation. Are those just "random" pipelines or is there a specific use case / logic for each one?

kwcckw · 2024-05-14T01:03:26Z

> Thanks for the answer! I didn't know about the predefined pipelines, not sure whether I missed them in the documentation. Are those just "random" pipelines or is there a specific use case / logic for each one?

Each pipeline is meant to reproduce a specific kind of real-life dirty document effect. Its output should be consistent, so you will not see much variation within each archetype pipeline.

Travvy88 · 2024-10-25T12:37:56Z

@arnaudstiegler I faced the same problem. I took the source code of the default pipeline and changed the probability of every augmentation from 0.2 to 0.1. The output is less corrupted and more readable.
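A back-of-the-envelope sketch of why halving the probability helps. It assumes, for illustration only, that the augmentations fire independently of each other and that there are 30 of them; the real default pipeline may group or chain augmentations differently, and 30 is a made-up count.

```python
from math import comb


def expected_applied(n, p):
    """Expected number of augmentations applied when each fires with probability p."""
    return n * p


def prob_at_least(n, p, k):
    """Probability that at least k of n independent augmentations fire."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below_k


N_AUGMENTATIONS = 30  # hypothetical count, not taken from the default pipeline

for p in (0.2, 0.1):
    print(
        f"p={p}: expected {expected_applied(N_AUGMENTATIONS, p):.1f} applied, "
        f"P(at least 5) = {prob_at_least(N_AUGMENTATIONS, p, 5):.3f}"
    )
```

Under these assumptions, going from 0.2 to 0.1 halves the expected number of augmentations stacked on a single image and makes heavily stacked outputs much rarer.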

jboarman · 2024-10-25T14:14:52Z

> @arnaudstiegler I faced the same problem. I took the source code of the default pipeline and changed the probability of every augmentation from 0.2 to 0.1. The output is less corrupted and more readable.

@Travvy88 If you want to push that update to the default pipeline as a PR, we'd love to include you as a contributor to the Augraphy library!
