spark-submit-project (SSP) is a script that calls `spark-submit` but takes away the hassle of manually adding files to the `--py-files`, `--files`, and `--archives` arguments.

There are many alternatives out there, but for small-scale development they can be annoying to set up. SSP, though a bit unprofessional, is quick and gets you started right away.
Method 1
- Create a folder. This is your project folder.
- Copy the contents of `dist/linux/` or `dist/windows/` to your project folder, according to your OS.
Method 2
- Download the folder `dist/linux/` or `dist/windows/`, according to your OS.
- Create an environment variable `SSP_HOME_DIR` that has the path of the downloaded folder.
- Also add the path of the downloaded folder to your `PATH` environment variable.
- Now you can run `ssp` or `ssp.sh` from anywhere in a terminal. `ssp init` or `ssp.sh init` can be run to make your current directory your project folder (a sample Linux setup is sketched below).
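For example, on Linux the setup might look like this (the download location and project folder are placeholders):

```sh
$ export SSP_HOME_DIR=/opt/ssp/dist/linux   # placeholder: wherever you put the folder
$ export PATH="$SSP_HOME_DIR:$PATH"
$ cd ~/my-project && ssp.sh init            # turn the current directory into a project folder
```

To make the variables permanent, add the two `export` lines to your shell profile.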
Requires python>=3.6 and pip. If you do not wish to change your environment's Python version, look at the configuration section to learn how to define what Python SSP uses.
In a console/terminal with the present working directory as your project folder, use the following commands:

Linux

```sh
$ ./ssp.sh <args>
```

Windows

```bat
> .\ssp.bat <args>
```

`<args>` are the same args you would pass to `spark-submit`. You can even pass your own `--py-files`, `--files`, and `--archives` arguments. See the examples.
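For instance, a submission to a standalone cluster might look like the following sketch; the master URL and `main.py` are placeholders for your own values:

```sh
$ ./ssp.sh --master spark://localhost:7077 --deploy-mode client main.py
```

SSP collects your requirements, code, and assets, appends them to the appropriate `--py-files`, `--files`, and `--archives` arguments, and then hands everything to `spark-submit`.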
SSP needs to know what you want to submit with `spark-submit` and what you do not. For this, it uses several directories and files. Default values assume your project folder is the working directory. To change the defaults, look at the configuration section.
| Item | Description | Default |
|---|---|---|
| Requirements File | Contains package names, one per line. See this link for details. | `requirements.txt` |
| Libraries Directory | [1] A private directory used by the SSP script to store downloaded files described in the Requirements File. | `.spark-submit-project/lib` |
| Source Code Directory | [2] The directory where your code files reside. It is not necessary to have this directory. | `src` |
| Distribution Directory | [1] A private directory used by SSP to store archived files created during submission. | `.spark-submit-project/dist` |
| Include Code File | [3] A file where each line is a file path of a code-related file, e.g. a py file, a package zip, egg, or whl. These are submitted as the `--py-files` arg. | None |
| Include Code Directory | [2] A directory whose files and directories are submitted as the `--py-files` arg. | None |
| Include Assets File | [3][4] A file where each line is a file path of a non-code-related file. | None |
| Include Assets Directory | [2][4] A directory containing non-code files and directories. | None |
Paths are relative to the present working directory of the shell where the script is run.
[1] This directory should only be used by SSP. Do not copy your files into it.
[2] Top-level files are included directly, whereas top-level directories are archived (zip) and then included in the list of files to send with `spark-submit`.

[3] Paths must not be directories.

[4] Zip files are sent as the `--archives` arg, while other files are sent as the `--files` arg. You can send all files as `--files` by changing the configuration file.
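As an illustration, a hypothetical Include Code File could look like the sketch below; both paths are made-up examples of a top-level module and a prebuilt wheel:

```text
src/utils.py
libs/mypackage-0.1.0-py3-none-any.whl
```

Each listed file would then be appended to the `--py-files` argument at submission time.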
NOTE: Files from all these locations and directories (as zip archives) will be placed in the working directory of the executors after `spark-submit`. This means you can import/access files without manually adding the directories to the path.
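For instance, if a hypothetical `utils.py` was submitted as a code file and a hypothetical `config.json` as an asset, code running on an executor could use both directly:

```python
import json

import utils  # hypothetical py-file: importable without any sys.path tweaks

# hypothetical asset file: readable from the executor's working directory
with open("config.json") as f:
    cfg = json.load(f)
```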
ATTENTION: Python package dependencies that require C/C++ compilation are likely to fail when shared through this method, though I have only tested on a standalone-mode Spark cluster. See this link to learn more.
Configuration

If you do not want to install python>=3.6 in your working environment, you can set an environment variable `SSP_PYTHON` with the path of the Python to use. Alternatively, you can edit the bash/batch script.
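For example, on Linux (the interpreter path is a placeholder for whatever Python >= 3.6 you have installed):

```sh
$ export SSP_PYTHON=/usr/bin/python3.9   # placeholder path; must have pip available
$ ./ssp.sh <args>
```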
`.spark-submit-project/ssp.conf` is the configuration file used by the script. There are three sections in the file:
Here you will set up the paths of the directories and files mentioned above.
Here you define the level of logging. Logs are stored in `.spark-submit-project/log.txt`. The 'Level' is an integer. See this link for more details.
Here you set whether you want to use the `--archives` option. If you prefer not to use this option, then assets' zip files will be sent as `--files` args.
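As a rough, hypothetical sketch of what such a file could contain (the section and key names below are invented for illustration; consult your generated `ssp.conf` for the actual schema):

```ini
; hypothetical layout -- real key names may differ
[PATHS]
RequirementsFile = requirements.txt
SourceCodeDirectory = src

[LOGGING]
Level = 20            ; integer log level

[SUBMISSION]
UseArchives = true    ; if false, asset zips are sent as --files
```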
Find examples in the example folder.