This package is an implementation of the Vector-free L-BFGS solver and some scalable machine learning algorithms for Apache Spark.
Apache Spark MLlib provides scalable implementation of popular machine learning algorithms, which lets users train models from big dataset and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications require solving problems with billions of parameters on a huge amount of data such as Ads CTR prediction and deep neural network. This requirement far exceeds the capacity of exisiting MLlib algorithms many of which use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. Vector-free L-BFGS avoids the expensive dot product operations in the two loop recursion and greatly improves computation efficiency with a great degree of parallelism. It can solve optimization problems with billions of parameters in the Spark SQL framework where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets.
spark-vlbfgs currently supports the following algorithms:
- Logistic Regression
- Linear Regression
- Softmax Regression
with regularization:
- L1
- L2
- Elastic Net
To be supported:
- Multilayer Perceptron Classifier
spark-vlbfgs is built using Apache Maven. To build spark-vlbfgs and its example programs, run:
mvn clean package -DskipTests
by default this project will be built against spark-2.0.0 with scala-2.11,
if you want to specify other version, use maven -D
parameter such as:
mvn clean package -Dscala.binary.version=2.10 -Dspark.version=2.0.0
then run example:
spark-submit
--master yarn
--num-executors 10
--executor-cores 2
--class org.apache.spark.ml.example.VLORExample
/path/to/spark-vlbfgs-0.1-SNAPSHOT.jar [paramlist]
You can train a logistic regression model via spark-vlbfgs API which is consistent with Apache Spark MLlib:
val dataset: Dataset[_] = spark.read.format("libsvm").load("data/a9a")
val trainer = new VLogisticRegression()
.setColsPerBlock(100)
.setRowsPerBlock(10)
.setColPartitions(3)
.setRowPartitions(3)
.setRegParam(0.5)
val model = trainer.fit(dataset)
println(s"Vector-free logistic regression coefficients: ${model.coefficients}")
- Scaling Apache Spark MLlib to Billions of Parameters (Spark Summit East 2017)
- Large-scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned (Spark Summit 2017)
If you have any questions or encounter bugs, feel free to submit an issue or contact:
We are immensely grateful to Xiangrui Meng for the initial work and guidance during the design and development of spark-vlbfgs.