spark-submit-project (SSP) is a script that calls `spark-submit` but takes away the hassle of manually adding files to the `--py-files`, `--files`, and `--archives` arguments.

There are many alternatives out there, but for small-scale development they can be annoying to set up. SSP, though a bit unprofessional, is quick and gets you started right away.
Method 1
- Create a folder. This is your project folder.
- Copy the contents of `dist/linux/` or `dist/windows/` to your project folder, according to your OS.
Method 2
- Download the folder `dist/linux/` or `dist/windows/`, according to your OS.
- Create an environment variable `SSP_HOME_DIR` that has the path of the downloaded folder.
- Also add the path of the downloaded folder to your `PATH` environment variable.
- Now you can run `ssp` or `ssp.sh` from anywhere in a terminal. `ssp init` or `ssp.sh init` can be run to make your current directory your project folder (a sample Linux setup is sketched below).
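For example, on Linux the setup might look like this (the download location and project folder are placeholders):

```sh
$ export SSP_HOME_DIR=/opt/ssp/dist/linux   # placeholder: wherever you put the folder
$ export PATH="$SSP_HOME_DIR:$PATH"
$ cd ~/my-project && ssp.sh init            # turn the current directory into a project folder
```

To make the variables permanent, add the two `export` lines to your shell profile.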
Requires python>=3.6 and pip. If you do not wish to change your environment's Python version, look at the configuration section to learn how to define what Python SSP uses.
In a console/terminal with the present working directory as your project folder, use the following commands:

Linux

```sh
$ ./ssp.sh <args>
```

Windows

```bat
> .\ssp.bat <args>
```

`<args>` are the same args you would pass to `spark-submit`. You can even pass your own `--py-files`, `--files`, and `--archives` arguments. See the examples.
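For instance, a submission to a standalone cluster might look like the following sketch; the master URL and `main.py` are placeholders for your own values:

```sh
$ ./ssp.sh --master spark://localhost:7077 --deploy-mode client main.py
```

SSP collects your requirements, code, and assets, appends them to the appropriate `--py-files`, `--files`, and `--archives` arguments, and then hands everything to `spark-submit`.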
SSP needs to know what you want to submit with `spark-submit` and what you do not. For this, it uses several directories and files. Default values assume your project folder is the working directory. To change the defaults, look at the configuration section.
| Item | Description | Default |
|---|---|---|
| Requirements File | Contains package names, one per line. See this link for details. | `requirements.txt` |
| Libraries Directory | [1] A private directory used by the SSP script to store downloaded files described in the Requirements File. | `.spark-submit-project/lib` |
| Source Code Directory | [2] The directory where your code files reside. It is not necessary to have this directory. | `src` |
| Distribution Directory | [1] A private directory used by SSP to store archived files created during submission. | `.spark-submit-project/dist` |
| Include Code File | [3] A file where each line is a file path of a code-related file, e.g. a py file, a package zip, egg, or whl. These are submitted as the `--py-files` arg. | None |
| Include Code Directory | [2] A directory whose files and directories are submitted as the `--py-files` arg. | None |
| Include Assets File | [3][4] A file where each line is a file path of a non-code-related file. | None |
| Include Assets Directory | [2][4] A directory containing non-code files and directories. | None |
Paths are relative to the present working directory of the shell where the script is run.
[1] This directory should only be used by SSP. Do not copy your files into it.
[2] Top-level files are included directly, whereas top-level directories are archived (zip) and then included in the list of files to send with `spark-submit`.

[3] Paths must not be directories.

[4] Zip files are sent as the `--archives` arg, while other files are sent as the `--files` arg. You can send all files as `--files` by changing the configuration file.
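As an illustration, a hypothetical Include Code File could look like the sketch below; both paths are made-up examples of a top-level module and a prebuilt wheel:

```text
src/utils.py
libs/mypackage-0.1.0-py3-none-any.whl
```

Each listed file would then be appended to the `--py-files` argument at submission time.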
NOTE: Files from all these locations and directories (as zip archives) will be placed in the working directory of the executors after `spark-submit`. This means you can import/access files without manually adding the directories to the path.
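For instance, if a hypothetical `utils.py` was submitted as a code file and a hypothetical `config.json` as an asset, code running on an executor could use both directly:

```python
import json

import utils  # hypothetical py-file: importable without any sys.path tweaks

# hypothetical asset file: readable from the executor's working directory
with open("config.json") as f:
    cfg = json.load(f)
```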
ATTENTION: Python package dependencies that require C/C++ compilation are likely to fail when shared through this method, though I have only tested on a standalone-mode Spark cluster. See this link to learn more.
Configuration

If you do not want to install python>=3.6 in your working environment, you can set an environment variable `SSP_PYTHON` with the path of the Python to use. Alternatively, you can edit the bash/batch script.
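For example, on Linux (the interpreter path is a placeholder for whatever Python >= 3.6 you have installed):

```sh
$ export SSP_PYTHON=/usr/bin/python3.9   # placeholder path; must have pip available
$ ./ssp.sh <args>
```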
`.spark-submit-project/ssp.conf` is the configuration file used by the script. There are three sections in the file:
Here you will set up the paths of the directories and files mentioned above.
Here you define the level of logging. Logs are stored in `.spark-submit-project/log.txt`. The 'Level' is an integer. See this link for more details.
Here you set whether you want to use the `--archives` option. If you prefer not to use this option, then assets' zip files will be sent as `--files` args.
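As a rough, hypothetical sketch of what such a file could contain (the section and key names below are invented for illustration; consult your generated `ssp.conf` for the actual schema):

```ini
; hypothetical layout -- real key names may differ
[PATHS]
RequirementsFile = requirements.txt
SourceCodeDirectory = src

[LOGGING]
Level = 20            ; integer log level

[SUBMISSION]
UseArchives = true    ; if false, asset zips are sent as --files
```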
Find examples in the example folder.