The 'undefined' option for pdfToCairo's second parameter does not produce valid output for tiff files. #430

Open
3 tasks done
wunderkind2k1 opened this issue Jun 16, 2022 · 16 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@wunderkind2k1

wunderkind2k1 commented Jun 16, 2022

Prerequisites

  • I have written a descriptive issue title

  • I have searched existing issues to ensure it has not already been reported

  • I agree to follow the Code of Conduct that this project adheres to

API/app/plugin version

5.1.6

Node.js version

v16.13.2

Operating system

macOS

Operating system version (i.e. 20.04, 11.3, 10)

12.4

Description

I am trying to fetch a single page from a PDF without writing it to a separate image file. This works beautifully for JPGs and PNGs, as documented in https://github.com/Fdawgs/node-poppler/blob/master/README.md#popplerpdftocairo.
For TIFF files it's a different story, though: it works only if I pass an output file as the second parameter of the pdfToCairo function, but not when I pass 'undefined'. The result is way too small, as if it were only the header of the TIFF file or something.

I checked whether my poppler version (22.05.0) works correctly on the command line. It does.
pdftocairo -tiff -f 1 -l 1 -singlefile example.pdf - > example.tiff works perfectly in the shell.

As far as I can see, node-poppler does send the correct params to the spawned child process. But the result is a very short string, something like this:
image

From debugging index.js in node-poppler I can see that

image

is only called once for the whole child process. This could be the problem.

Steps to Reproduce

This code should be enough to see that foo.tif is not a valid TIFF file.
You can use any PDF file or the provided one: example.pdf

import { Poppler } from 'node-poppler';
import fs from 'fs';
import path from 'path';

const file = fs.readFileSync(path.join(__dirname, 'example.pdf'));

(async () => {
  const poppler = new Poppler('/usr/bin');
  const res: string | Error = await poppler.pdfToCairo(file, undefined, {
    firstPageToConvert: 1,
    lastPageToConvert: 1,
    singleFile: true,
    tiffCompression: 'jpeg',
    tiffFile: true,
    // pngFile: true
  });
  if (res instanceof Error) {
    console.log('Error: ' + JSON.stringify(res));
    return;
  }
  fs.writeFileSync('foo.tif', res, { encoding: 'binary' })
})();

Additional information:
Though I wrote this code on macOS, I also tried it in a Docker container running Alpine Linux, expecting the behaviour to be a macOS glitch. But I could reproduce the problem on Linux as well.

Expected Behaviour

The behaviour should be the same for all possible output formats: when the output file is 'undefined' and the singleFile option is set, the resulting string should contain valid image data.

@wunderkind2k1 wunderkind2k1 added the bug Something isn't working label Jun 16, 2022
@wunderkind2k1 wunderkind2k1 changed the title from "using the 'undefined' option for pdfToCairo's second parameter does not produce valid output for tiff files." to "The 'undefined' option for pdfToCairo's second parameter does not produce valid output for tiff files." Jun 16, 2022
@Fdawgs
Owner

Fdawgs commented Jun 16, 2022

Thanks for reporting this @wunderkind2k1, I'll take a look!

@wunderkind2k1
Author

Let me know if I can be of any help. Sven

@Fdawgs
Owner

Fdawgs commented Jun 18, 2022

Thanks Sven, I plan on looking at this on Monday. You're more than welcome to have a crack at fixing it yourself and opening a PR if you wish!

@Fdawgs Fdawgs added the help wanted Extra attention is needed label Jun 20, 2022
@Fdawgs
Owner

Fdawgs commented Jun 20, 2022

Can confirm this is also an issue using Windows 10 with the included binaries:

image

  • pdf_1.3_NHS_Constitution.tif is the result when passing a filename to write to, and can be opened
  • foo.tif is the result when writing to stdout and then writing to file using fs.writeFileSync(), and Windows Photo Viewer throws an error when opening it

As you can see, both files are the same size, so the output is not being cut short.
Not really sure how to fix this. 😞
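
Since the two files have the same size but only one opens, comparing them byte by byte may show where they start to diverge. The following is a minimal sketch, not part of node-poppler: the helper names `firstDifference` and `firstDifferenceInFiles` are hypothetical, and it assumes both files fit comfortably in memory.

```javascript
const fs = require("node:fs");

// Hypothetical helper: returns the offset of the first differing byte,
// or -1 when both buffers are identical. If one buffer is a prefix of
// the other, the length of the shorter buffer is returned.
function firstDifference(a, b) {
	const shared = Math.min(a.length, b.length);
	for (let i = 0; i < shared; i += 1) {
		if (a[i] !== b[i]) {
			return i;
		}
	}
	return a.length === b.length ? -1 : shared;
}

// Hypothetical convenience wrapper that reads both files into memory first.
function firstDifferenceInFiles(pathA, pathB) {
	return firstDifference(fs.readFileSync(pathA), fs.readFileSync(pathB));
}
```

Calling `firstDifferenceInFiles("pdf_1.3_NHS_Constitution.tif", "foo.tif")` with the two files from the screenshot would give the offset to inspect in a hex viewer. A very early offset would point at the header, a late one at the image data.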

@wunderkind2k1
Author

OK. Thank you for verifying it. I think it's kind of complicated to track down... maybe it is related to the stream events of the child process? I will try to find some time to take a deeper look.

@wunderkind2k1
Author

wunderkind2k1 commented Jan 3, 2023

Hi. A little update on this:
I had the chance to dig deeper into this during the holidays.

I changed src/index.js a little to get an error:

image

After running this test code:

const fs = require("fs");
const path = require("path");
const { Poppler } = require("./index");
const findCairoBinaryPath = require("./findCairoBinaryPath");

const file = fs.readFileSync(
	path.join(__dirname, "../test_files/issue_430.pdf")
);

(async () => {
	try {
		const currentPathToPOPPLER = findCairoBinaryPath();
		const poppler = new Poppler(currentPathToPOPPLER);
		const res = await poppler.pdfToCairo(file, undefined, {
			tiffFile: true,
			tiffCompression: "jpeg",
			//pngFile: true,
			singleFile: true,
			firstPageToConvert: 1,
			lastPageToConvert: 1,
		});
		if (res instanceof Error) {
			console.log(`Error:\n${JSON.stringify(res)}`);
			return;
		}
		console.log(res.length);
		fs.writeFileSync("foo.tif", res, { encoding: "binary" });
	} catch (e) {
		console.log(`Error: ${e.toString()}`);
	}
})();

This is the resulting error:

Error: Error: TIFFAppendToStrip: Maximum TIFF file size exceeded.
TiffWriter: Error writing tiff row 16

// repetitions removed

TIFFAppendToStrip: Maximum TIFF file size exceeded.
TiffWriter: Error writing tiff row 1648
JPEGLib: Application transferred too few scanlines.

To me this looks like a problem with the underlying libs, such as libtiff. But I haven't been able to find the reason yet, especially as I am wondering why a call to

cat test_files/issue_430.pdf | pdftocairo -tiff -tiffcompression jpeg -singlefile -f 1 -l 1 - - > bar.tif

produces a perfectly good bar.tif file.

@Fdawgs
Owner

Fdawgs commented Jan 10, 2023

Thanks for continuing to look at this @wunderkind2k1!

@msageryd

msageryd commented Sep 19, 2023

@wunderkind2k1 Did you find a solution to this?

I have just added streaming support to poppler.pdfToCairo in a fork. I tried rendering to TIFF and got the same problem. Just wanted to chime in to say that this does not seem to be a buffer problem, since my stream version behaves the same.

The size of each of my TIFF files is 8 bytes.
Rendering to JPG files is no problem, with file sizes of about 3-9 MB.

I can confirm that the exact same pdftocairo call works fine manually. The resulting TIFF file is 58 MB in my case.
That is, the last argument to my manually called pdftocairo is "-", which directs the output to stdout. Piping this into a file works great.

The only difference between my manual call and poppler.pdfToCairo is the spawn command.

  • stdout ends after 8 bytes when -tiff is used
  • I attached a listener for the 'finish' event and it fired. This means that stream.end is called from somewhere.
  • stdout works great when -jpeg is used
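
A classic TIFF header is exactly 8 bytes long (a 2-byte byte-order mark, the 2-byte magic number 42, and a 4-byte offset to the first image file directory), so an 8-byte result may be nothing more than a header. The sketch below is a rough way to check that. The function name `describeTiffHeader` is hypothetical, BigTIFF (magic number 43) is deliberately not handled, and whether the 8 bytes seen here really are a bare header is an assumption.

```javascript
// Hypothetical helper: inspects the 8-byte classic TIFF header of a buffer.
// Layout: bytes 0-1 byte order ("II" little-endian, "MM" big-endian),
// bytes 2-3 magic number 42, bytes 4-7 offset of the first IFD.
function describeTiffHeader(buf) {
	if (buf.length < 8) {
		return { valid: false, reason: "shorter than a TIFF header" };
	}
	const order = buf.toString("latin1", 0, 2);
	if (order !== "II" && order !== "MM") {
		return { valid: false, reason: "unknown byte order mark" };
	}
	const littleEndian = order === "II";
	const magic = littleEndian ? buf.readUInt16LE(2) : buf.readUInt16BE(2);
	if (magic !== 42) {
		return { valid: false, reason: "magic number is not 42" };
	}
	const ifdOffset = littleEndian ? buf.readUInt32LE(4) : buf.readUInt32BE(4);
	return {
		valid: true,
		littleEndian,
		ifdOffset,
		// The first IFD must start after the header and inside the data.
		ifdInsideData: ifdOffset >= 8 && ifdOffset < buf.length,
	};
}
```

For an 8-byte buffer the header can parse as valid while `ifdInsideData` is false, which would mean the directory and the image data were never written.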

This post might relate to this: nodejs/node#12921
But I didn't find anything useful.

My fork: https://github.com/msageryd/node-poppler

@wunderkind2k1
Author

I haven't found a solution to this so far. I ended up doing the TIFF conversion with libvips, which worked but was unnecessarily complex.

@msageryd

msageryd commented Sep 19, 2023

Are you using libvips for converting from pdf->tiff?
Or are you using pdftocairo for png export and libvips for png->tiff?

If the latter is true, you might want to use my stream version, since this makes it easy to pipe the result from poppler.pdfToCairo straight into Sharp (Sharp uses libvips).

@wunderkind2k1
Author

I extracted a PNG with poppler and used sharp to convert it to a TIFF, which was then fed into an OCR... But I am not working on that project anymore. Thanks for sharing your lib - I might use it if needed <3

@mvz

mvz commented Jan 11, 2024

This issue may be related: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1334.

TL;DR: The tiff library needs random access to the output file.
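
If that is the cause, it would explain why the shell redirect works: with `> file`, stdout is a regular file that can be seeked, whereas a spawned child normally gets a pipe. A possible workaround, sketched below with Node's `spawnSync`, is to hand the child a file descriptor as stdout and read the file afterwards. The helper name `runWithSeekableStdout` is hypothetical, this is not how node-poppler works, and it has not been tested against pdftocairo.

```javascript
const { spawnSync } = require("node:child_process");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: runs a command with stdout attached to a temporary
// regular file (seekable) instead of a pipe, then returns the file contents.
function runWithSeekableStdout(command, args, input) {
	const dir = fs.mkdtempSync(path.join(os.tmpdir(), "seekable-stdout-"));
	const outPath = path.join(dir, "stdout.bin");
	try {
		const fd = fs.openSync(outPath, "w+");
		try {
			const result = spawnSync(command, args, {
				input, // written to the child's stdin, if given
				stdio: ["pipe", fd, "pipe"], // stdout goes to the file descriptor
			});
			if (result.error) {
				throw result.error;
			}
			if (result.status !== 0) {
				throw new Error(
					`${command} exited with ${result.status}: ${result.stderr}`
				);
			}
		} finally {
			fs.closeSync(fd);
		}
		return fs.readFileSync(outPath);
	} finally {
		// Remove the temporary directory whether or not the command succeeded.
		fs.rmSync(dir, { recursive: true, force: true });
	}
}
```

With the arguments from the shell command earlier in this thread, a call might look like `runWithSeekableStdout("pdftocairo", ["-tiff", "-tiffcompression", "jpeg", "-singlefile", "-f", "1", "-l", "1", "-", "-"], pdfBuffer)`, where `pdfBuffer` holds the PDF bytes. The trade-off is that the output touches the disk after all, which is what the original report wanted to avoid.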

@wunderkind2k1
Author

That sounds like the explanation! Thank you for sharing!
