Highlight text in documents
txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.
Current file formats supported: PDF
The easiest way to install is via pip and PyPI:
pip install txtmarker
Python 3.9+ is supported. Using a Python virtual environment is recommended.
txtmarker can also be installed directly from GitHub to access the latest, unreleased features.
pip install git+https://github.com/neuml/txtmarker
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
Notebook | Description |
---|---|
Introducing txtmarker | Overview of the functionality provided by txtmarker |
Highlighting with Transformers | AI-driven highlighting with Transformers |
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
Creates a new highlighter instance.
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")
extension: string
Type of highlighter to create (e.g. pdf)
formatter: callable
Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
chunks: int
Splits queries into multiple chunks. This is designed for very long text matches.
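As an illustration of the formatter parameter, here is a minimal sketch of a cleanup function. The function name clean is hypothetical, and it assumes the formatter receives a single string and returns a string, which is not confirmed above. The commented-out Factory.create call also assumes the keyword name matches the documented parameter name.

```python
import re


def clean(text):
    """Hypothetical formatter: drops symbols, collapses whitespace, lowercases."""
    # Replace anything that is not a word character or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


# Assumed usage (requires txtmarker; keyword name taken from the parameter list above):
# from txtmarker.factory import Factory
# highlighter = Factory.create("pdf", formatter=clean)
```

Because the same function is described as applying to both queries and input text, a formatter like this is best paired with plain-text queries rather than regular expressions, since it would strip regex metacharacters.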
Extracts page text from infile and returns it as a generator. This enables analysis on the text exactly as it will appear to the highlighter.
highlighter.pages("input.pdf")
infile: string
Full path to input file
Highlights infile using the provided annotations. The annotated file is stored as outfile.
highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])
infile: string
Full path to input file
outfile: string
Full path to output file, i.e. the highlighted file
highlights: list of (string, string|regex)
List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. When matching a literal string, make sure to escape any regular expression special characters (e.g. call re.escape).
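The escaping advice above can be sketched as a small helper that builds the highlights list. The helper name build_highlights and the sample values are hypothetical. Literal strings are passed through re.escape, while entries that are already regular expressions are kept unchanged.

```python
import re


def build_highlights(literals, patterns=None):
    """Builds (name, text) pairs. Literal text is escaped; regex patterns pass through."""
    # Escape literal strings so characters such as ( ) $ . are matched verbatim
    highlights = [(name, re.escape(text)) for name, text in literals]
    # Entries that are already regular expressions are appended as-is
    highlights.extend(patterns or [])
    return highlights


# Sample data for illustration only
literals = [("cost", "Total cost (USD): $1,200.50")]
patterns = [("year", r"\b(19|20)\d{2}\b")]
highlights = build_highlights(literals, patterns)

# Assumed usage (requires txtmarker and an input file):
# highlighter.highlight("input.pdf", "output.pdf", highlights)
```

Without escaping, the parentheses, dollar sign, and period in the sample text would be interpreted as regex syntax and the literal string would not match.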