tabula-py CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', #349

Closed
7 tasks done
kdshreyas opened this issue Jun 21, 2023 · 3 comments
kdshreyas commented Jun 21, 2023

Summary of your issue

I encountered an issue while processing a PDF file where a specific page consistently triggers a "CalledProcessError" with the following command: ['java', '-Dfile.encoding=UTF8', '-jar']. This error disrupts the processing flow and prevents further execution.

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'D:\Anaconda\envs\dev_env\lib\site-packages\tabula\tabula-1.0.5-jar-with-dependencies.jar', '--pages', '1', '--lattice', '--format', 'JSON'

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: ?
    test_pdf_output.pdf

  • Paste the output of import tabula; tabula.environment_info() on Python REPL: ?
    Python version:
    3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)]
    Java version:
    java version "1.8.0_371"
    Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
    tabula-py version: 2.3.0
    platform: Windows-10-10.0.19045-SP0
    uname:
    uname_result(system='Windows', node='IND-CHN-LT11760', release='10', version='10.0.19045', machine='AMD64')
    linux_distribution: ('MSYS_NT-10.0-19045', '3.1.7', '')
    mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

  • Paste the output of python --version command on your terminal: ?
  • Paste the output of java -version command on your terminal: ?
  • Does java -h command work well?; Ensure your java command is included in PATH
  • Write your OS and its version: ?

What did you do when you faced the problem?

Code:

import tabula

inputpdf = 'test_pdf_output.pdf'
page = 1
# lattice=True passes --lattice to tabula-java; this is the mode that fails
tables = tabula.read_pdf(inputpdf, pages=page, lattice=True, guess=False)
df = tables[0]

Expected behavior:

The command should execute successfully on the page of the PDF file, without encountering any errors.

Actual behavior:

The error "CalledProcessError" is encountered when processing the specified page within the PDF file.

Error from tabula-java:
Exception in thread "main" java.lang.IllegalArgumentException: lines must be orthogonal, vertical and horizontal
	at technology.tabula.Ruling.intersectionPoint(Ruling.java:214)
	at technology.tabula.Ruling.findIntersections(Ruling.java:378)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.findCells(SpreadsheetExtractionAlgorithm.java:134)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:63)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:41)
	at technology.tabula.CommandLineApp$TableExtractor.extractTablesSpreadsheet(CommandLineApp.java:452)
	at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:410)
	at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:180)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)


---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In [237], line 3
      1 inputpdf = 'output.pdf'
      2 page = 1
----> 3 tables = tabula.read_pdf(inputpdf, pages = page, lattice = True, guess = False)
      4 df = tables[0]
      5 df

File D:\Anaconda\envs\dev_env\lib\site-packages\tabula\io.py:322, in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
    317     raise ValueError(
    318         "{} is empty. Check the file, or download it manually.".format(path)
    319     )
    321 try:
--> 322     output = _run(java_options, kwargs, path, encoding)
    323 finally:
    324     if temporary:

File D:\Anaconda\envs\dev_env\lib\site-packages\tabula\io.py:80, in _run(java_options, options, path, encoding)
     77     args.append(path)
     79 try:
---> 80     result = subprocess.run(
     81         args,
     82         stdout=subprocess.PIPE,
     83         stderr=subprocess.PIPE,
     84         stdin=subprocess.DEVNULL,
     85         check=True,
     86     )
     87     if result.stderr:
     88         logger.warning("Got stderr: {}".format(result.stderr.decode(encoding)))

File D:\Anaconda\envs\dev_env\lib\subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    526     retcode = process.poll()
    527     if check and retcode:
--> 528         raise CalledProcessError(retcode, process.args,
    529                                  output=stdout, stderr=stderr)
    530 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'D:\\Anaconda\\envs\\dev_env\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', '1', '--lattice', '--format', 'JSON', 'output.pdf']' returned non-zero exit status 1.
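
The Python traceback on its own hides the underlying Java error. As the frames above show, tabula-py runs tabula-java through `subprocess.run` with `stderr=subprocess.PIPE` and `check=True`, so the raised `CalledProcessError` carries the Java output in its `stderr` attribute. A minimal sketch of extracting the first line of that output follows; `java_error_summary` is a hypothetical helper name, not part of tabula-py.

```python
import subprocess


def java_error_summary(exc: subprocess.CalledProcessError, encoding: str = "utf-8") -> str:
    """Return the first non-empty line of the captured stderr, or "" if none.

    Hypothetical helper: assumes the failing process was started with
    stderr captured, as in the subprocess.run call shown in the traceback.
    """
    raw = exc.stderr or b""
    # stderr is bytes unless the process was run in text mode
    text = raw.decode(encoding, errors="replace") if isinstance(raw, bytes) else raw
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""
```

Wrapping the `tabula.read_pdf` call in `try` / `except subprocess.CalledProcessError as exc` and printing `java_error_summary(exc)` would surface the "lines must be orthogonal" message directly instead of only the non-zero exit status.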

Related Issues:

kdshreyas changed the title on Jun 21, 2023
from: CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'D:\\Anaconda\\envs\\dev_env\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', '11', '--lattice', '--format', 'JSON'
to: tabula-py CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar',
Owner

chezou commented Jun 21, 2023

Thanks for reporting the issue.

This looks like a tabula-java issue that occurs with this specific PDF. I found a similar issue in their repo:
tabulapdf/tabula-java#218

Would you mind providing the PDF and reporting it on tabula-java?

chezou added the "tabula-java limitation" and "PDF required" labels and removed the "PDF required" label on Jun 21, 2023
Owner

chezou commented Jun 21, 2023

Okay, I confirmed that the issue occurs when tabula-java is run with the --lattice option on this file. Without the --lattice option, it does not raise an error.

$ java  -Dfile.encoding=UTF8 -jar tabula/tabula-1.0.5-jar-with-dependencies.jar --pages 1 --lattice ~/Downloads/test_pdf_output.pdf
Exception in thread "main" java.lang.IllegalArgumentException: lines must be orthogonal, vertical and horizontal
	at technology.tabula.Ruling.intersectionPoint(Ruling.java:214)
	at technology.tabula.Ruling.findIntersections(Ruling.java:378)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.findCells(SpreadsheetExtractionAlgorithm.java:134)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:63)
	at technology.tabula.extractors.SpreadsheetExtractionAlgorithm.extract(SpreadsheetExtractionAlgorithm.java:41)
	at technology.tabula.CommandLineApp$TableExtractor.extractTablesSpreadsheet(CommandLineApp.java:452)
	at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:410)
	at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:180)
	at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
	at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
	at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)
$ java  -Dfile.encoding=UTF8 -jar tabula/tabula-1.0.5-jar-with-dependencies.jar --pages 1  ~/Downloads/test_pdf_output.pdf
"","Utah Medicaid Preferred Drug List - Effective April 1, 2023"
"",Quinolones
"",Last Brand
Preferred Drugs,Status Type Limits Mandatory 3-Month Additional Note
"",Update Required
Cipro suspension,Preferred Brand 02/01/10 Cipro susp
"ciprofloxacin 250, 500, 750mg Preferred",Generic 02/01/10
levofloxacin,Preferred Generic 02/01/16
moxifloxacin,Preferred Generic 01/01/21
"",Last Required Prior Brand
Non Preferred Drugs,Status Type Limits Additional Note
"",Update Authorization Form Required
Baxdela,Non Preferred Brand 10/01/17 Medication Coverage Exception
Cipro tablet,Non Preferred Brand 02/01/10 Medication Coverage Exception
ciprofloxacin 100mg tablet,Non Preferred Generic 01/01/22 Medication Coverage Exception
ciprofloxacin suspension,Non Preferred Generic 01/01/20 Medication Coverage Exception Cipro susp
ofloxacin tablet,Non Preferred Generic 02/01/10 Medication Coverage Exception
"",Tetracyclines
"",Last Brand
Preferred Drugs,Status Type Limits Mandatory 3-Month Additional Note
"",Update Required
doxycycline monohydrate,
"",Preferred Generic 01/01/20
"50, 100mg capsule",
doxycycline hyclate,
"",Preferred Generic 01/01/20
"50, 100mg",
minocycline,
"",Preferred Generic 01/01/20
"50, 75, 100mg capsule",
"",Last Required Prior Brand
Non Preferred Drugs,Status Type Limits Additional Note
"",Update Authorization Form Required
demeclocycline,Non Preferred Generic 01/01/20 Medication Coverage Exception
Doryx,Non Preferred Brand 01/01/20 Medication Coverage Exception
doxycycline (unless listed preferred),Non Preferred Generic 01/01/20 Medication Coverage Exception
Minocin,Non Preferred Brand 01/01/20 Medication Coverage Exception
minocycline ER capsule,Non Preferred Generic 12/01/22 Medication Coverage Exception
minocycline tablet,Non Preferred Generic 01/01/20 Medication Coverage Exception
Minolira,Non Preferred Brand 01/01/20 Medication Coverage Exception
Nuzyra,Non Preferred Brand 01/01/20 Medication Coverage Exception
Solodyn,Non Preferred Brand 01/01/20 Medication Coverage Exception
tetracycline,Non Preferred Generic 01/01/20 Medication Coverage Exception
Vibramycin,Non Preferred Brand 01/01/20 Medication Coverage Exception
Ximino,Non Preferred Brand 01/01/20 Medication Coverage Exception
"",Page 11 of 111

This appears to be a bug on the tabula-java side.

Closing, as tabula-py doesn't have any workaround for it.
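
Because the error only appears with `--lattice`, calling code can catch the failure and retry without lattice mode. This is a caller-side sketch, not a tabula-py feature, and the non-lattice result may split columns differently, as the CSV output above suggests. `read_with_fallback` and its `reader` parameter are hypothetical names; `reader` is assumed to accept a `lattice` keyword and to raise `subprocess.CalledProcessError` on failure, as `tabula.read_pdf` does in the report.

```python
import subprocess


def read_with_fallback(reader, path, **kwargs):
    """Try lattice mode first; if the Java subprocess fails, retry without it.

    `reader` is any callable with a tabula.read_pdf-like signature
    (assumption: it takes `lattice=` and raises CalledProcessError when
    the underlying Java process exits with a non-zero status).
    """
    try:
        return reader(path, lattice=True, **kwargs)
    except subprocess.CalledProcessError:
        # Lattice extraction failed (e.g. "lines must be orthogonal");
        # fall back to the default extraction mode for this file.
        return reader(path, lattice=False, **kwargs)
```

With tabula-py installed, a call might look like `read_with_fallback(tabula.read_pdf, "test_pdf_output.pdf", pages=1, guess=False)`. Passing the reader in as an argument keeps the helper independent of tabula-py and easy to exercise with a stand-in function.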

Author

Hey @chezou,

Thanks for the quick reply. I have created an issue, tabulapdf/tabula-java#529, as suggested.
