This package provides a class to extract text from a PDF.
This is a fork of Spatie/pdftotext.
<?php
use Bakame\Pdftotext\Pdftotext;
$pdftotext = Pdftotext::fromUnix();
$text = $pdftotext->extract('/path/to/file.pdf');
PHP 7.2 or later is required, but the latest stable version of PHP is recommended.
Behind the scenes this package leverages pdftotext. You can verify that the binary is installed on your system by issuing this command:
which pdftotext
If it is installed, the command prints the path to the binary.
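In a script, the same check can be written with the POSIX `command -v` builtin, which resolves an executable much like `which` does. The following is a minimal sketch; `find_binary` is a hypothetical helper name used for illustration and is not part of this package.

```shell
#!/bin/sh
# find_binary is a hypothetical helper, not part of the package.
# It prints the resolved path of an executable and returns 0,
# or prints a message to stderr and returns 1 when it is missing.
find_binary() {
    if path=$(command -v "$1" 2>/dev/null); then
        printf '%s\n' "$path"
    else
        printf '%s was not found in PATH\n' "$1" >&2
        return 1
    fi
}
```

Calling `find_binary pdftotext` prints the path, which can then be passed to the Pdftotext constructor.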
To install the binary, use one of the following commands:
- On apt-based systems:
apt-get install poppler-utils
- On yum-based systems:
yum install poppler-utils
- On macOS:
brew install poppler
You can install the package via composer:
composer require bakame/pdftotext
Extracting text from a PDF is easy. You only need to specify:
- the path to the pdftotext binary;
- the path to the PDF file to extract.
<?php
use Bakame\Pdftotext\Pdftotext;
$text = (new Pdftotext('/path/to/pdftotext'))
->extract('/path/to/file.pdf')
;
If you are on a Linux-based system, you can use the fromUnix named constructor, which tries to locate the binary and returns an instance using the correct executable path.
<?php
use Bakame\Pdftotext\Pdftotext;
$text = Pdftotext::fromUnix()->extract('/path/to/file.pdf');
Sometimes you may want to use pdftotext options.
You can pass them as the second argument to the extract method, as shown below:
<?php
use Bakame\Pdftotext\Pdftotext;
$text = Pdftotext::fromUnix()->extract('table.pdf', ['layout', 'r 96']);
If you need default options that apply to every extraction call, you can set them with the setDefaultOptions method or pass them to the class constructor:
<?php
use Bakame\Pdftotext\Pdftotext;
$text = (new Pdftotext('/path/to/pdftotext', ['layout', 'r 96']))
->extract('table.pdf', ['f 1'])
;
// will return the same data as
$text = Pdftotext::fromUnix(['layout', 'r 96'])->extract('table.pdf', ['f 1']);
// will return the same data as
$pdftotext = new Pdftotext('/path/to/pdftotext');
$pdftotext->setDefaultOptions(['layout', 'r 96']);
$text = $pdftotext->extract('table.pdf', ['f 1']);
Default options are merged with the individual options passed when calling the extract method.
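To make the merge concrete, the sketch below assembles the pdftotext command line that a call such as the one above would plausibly produce. It rests on two assumptions that this document does not confirm: that each option is handed to the binary with a leading dash, and that default options come before per-call options. `build_command` is a hypothetical helper for illustration and is not part of this package.

```shell
#!/bin/sh
# build_command is a hypothetical helper, not part of the package.
# Assumption: each option string is passed to pdftotext with a leading
# dash, in the order given (defaults first, per-call options last).
build_command() {
    pdf=$1
    shift
    cmd=pdftotext
    for opt in "$@"; do
        cmd="$cmd -$opt"
    done
    # A trailing "-" tells pdftotext to write the text to stdout.
    printf '%s %s -\n' "$cmd" "$pdf"
}

# Defaults ('layout', 'r 96') followed by the per-call option ('f 1').
build_command table.pdf layout 'r 96' 'f 1'
# prints: pdftotext -layout -r 96 -f 1 table.pdf -
```

Here `-layout` keeps the original physical layout, `-r 96` sets the resolution, and `-f 1` sets the first page to convert.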
You can also save the extracted text directly to a file using the save method. This method takes the same arguments as the extract method but requires a destination file as its second argument.
<?php
use Bakame\Pdftotext\Pdftotext;
$bytes = Pdftotext::fromUnix(['layout', 'r 96'])->save('table.pdf', 'table.txt', ['f 1']);
The returned $bytes is the number of bytes written to the file.
If you are dealing with large PDF files, you can set a timeout using the setTimeout method. By default, the timeout is set to 60 seconds.
<?php
use Bakame\Pdftotext\Pdftotext;
$pdftotext = new Pdftotext('/path/to/pdftotext', ['layout', 'r 96']);
$pdftotext->setTimeout(120); // the extraction will time out after 2 minutes
$bytes = $pdftotext->save('table.pdf', 'table.txt', ['f 1']);
The package has:
- a coding style compliance test suite using PHP CS Fixer.
- a code analysis compliance test suite using PHPStan.
- a PHPUnit test suite.
To run the tests, run the following command from the project folder:
composer test
Contributions are welcome and will be fully credited. Please see CONTRIBUTING and CONDUCT for details.
If you discover any security-related issues, please email [email protected] instead of using the issue tracker.
Please see CHANGELOG for more information on what has changed recently.
The MIT License (MIT). Please see License File for more information.