GitHub - neuml/txtmarker: 🖍️ Highlight text in documents

Highlight text in documents

txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.

Current file formats supported:

pdf

Installation

The easiest way to install is via pip and PyPI:

pip install txtmarker

Python 3.9+ is supported. Using a Python virtual environment is recommended.

txtmarker can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/txtmarker

Examples

The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.

Notebooks

Notebook	Description
Introducing txtmarker	Overview of the functionality provided by txtmarker
Highlighting with Transformers	AI-driven highlighting with Transformers

Configuration

The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.

Create a new highlighter

Creates a new highlighter instance.

from txtmarker.factory import Factory
highlighter = Factory.create("pdf")

extension

extension: string

Type of highlighter to create (e.g. pdf)

Optional constructor arguments:

formatter

formatter: callable

Formats queries and input text using this method. Helps with cleanup of files with lots of symbols and other content.
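As a rough illustration, a formatter can be pictured as a plain function that takes a string and returns a cleaned string. The function name `clean` and its specific cleanup rules are hypothetical, chosen only for this sketch; how the callable is handed to `Factory.create` is shown in a comment and is an assumption based on the argument order listed in this section.

```python
import re

def clean(text):
    """Hypothetical formatter: normalizes text before matching."""
    # Replace line breaks (and surrounding whitespace) with a single space
    text = re.sub(r"\s*\n\s*", " ", text)
    # Drop characters other than word characters, whitespace and basic punctuation
    text = re.sub(r"[^\w\s.,;:()'\-]", "", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean("Results:\n  accuracy ↑ 95%  •  loss ↓"))

# Assumed usage (argument position not confirmed by this section):
# highlighter = Factory.create("pdf", clean)
```

Because the same formatter is applied to both the queries and the input text, a query only needs to match the cleaned form of the page text.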

chunks

chunks: int

Splits queries into multiple chunks. This is designed for very long text matches.

Page text

Extracts page text from infile and returns it as a generator. This enables analysis of the text exactly as it will appear to the highlighter.

highlighter.pages("input.pdf")

infile

infile: string

Full path to input file
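One possible use of the page generator is checking where a term occurs before building highlights. The sketch below assumes each item yielded by `highlighter.pages` is a string holding one page of text; the helper `count_term` is hypothetical, and a small in-memory iterator stands in for the real generator so the example runs without a PDF.

```python
def count_term(pages, term):
    """Returns a list of (page number, count) for pages that mention term."""
    term = term.lower()
    hits = []
    for number, text in enumerate(pages, start=1):
        # Case-insensitive count of non-overlapping occurrences on this page
        count = text.lower().count(term)
        if count:
            hits.append((number, count))
    return hits

# Stand-in for highlighter.pages("input.pdf"), assumed to yield one string per page
sample = iter([
    "Introduction to highlighting",
    "Methods",
    "Highlighting results. More highlighting.",
])
print(count_term(sample, "highlighting"))
```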

Highlight text

Highlights infile using the provided annotations. The annotated file is written to outfile.

highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])

infile

infile: string

Full path to input file

outfile

outfile: string

Full path to output file, i.e. the highlighted file

highlights

highlights: list of (string, string|regex)

List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a literal string or a regular expression. To match a literal string, escape any regular expression special characters first (e.g. call re.escape).
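The difference between literal and regular expression highlights can be sketched as follows. The helper `literal` and the sample names and texts are hypothetical; the sketch only builds the list of pairs, and the final call is left as a comment since it needs a real input file.

```python
import re

def literal(name, text):
    """Builds a highlight pair that matches text exactly by escaping regex metacharacters."""
    return (name, re.escape(text))

highlights = [
    # Literal text containing characters that are special in regular expressions
    literal("Cost", "Total cost (USD): $1,200.00"),
    # Regular expression passed through unescaped: four-digit years starting with 19 or 20
    ("Year", r"\b(?:19|20)\d{2}\b"),
    # The name may be None
    (None, re.escape("F1-score")),
]

# highlighter.highlight("input.pdf", "output.pdf", highlights)
```

Without escaping, the parentheses, `$` and `.` in the first entry would be interpreted as pattern syntax rather than as literal characters.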
