sramirez/spark-RELIEFFC-fselection

RELIEF-F feature selection for Apache Spark

This algorithm, called BELIEF, implements Feature Weighting (FW) on Spark for Big Data problems. The repository contains an improved implementation of the RELIEF-F algorithm [1], extended with a cheap but effective technique for eliminating redundant features. BELIEF reuses the distances computed in earlier steps to estimate inter-feature redundancy at virtually no extra cost. BELIEF also scales across sample sizes, from hundreds to thousands of samples.
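
To make the idea of feature weighting concrete, here is a minimal sketch of the classic RELIEF weight update in plain Scala. It is not the BELIEF implementation from this repository: it runs on a single machine, uses one nearest hit and one nearest miss per instance instead of k neighbors, and has no redundancy step. The names `ReliefSketch`, `distance` and `weights` are hypothetical, and features are assumed to be already scaled to [0, 1].

```scala
object ReliefSketch {
  // Manhattan distance between two feature vectors (assumed scaled to [0, 1]).
  def distance(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => math.abs(a(i) - b(i))).sum

  // Simplified RELIEF: one nearest hit and one nearest miss per instance.
  // Assumes at least two classes and at least two instances per class.
  def weights(xs: Array[Array[Double]], ys: Array[Int]): Array[Double] = {
    val nFeatures = xs.head.length
    val w = Array.fill(nFeatures)(0.0)
    for (i <- xs.indices) {
      val others = xs.indices.filter(_ != i)
      // Nearest neighbor with the same label (hit) and with a different label (miss).
      val hit = others.filter(j => ys(j) == ys(i)).minBy(j => distance(xs(i), xs(j)))
      val miss = others.filter(j => ys(j) != ys(i)).minBy(j => distance(xs(i), xs(j)))
      for (f <- 0 until nFeatures) {
        // Reward features that differ on the miss, penalize those that differ on the hit.
        val diffMiss = math.abs(xs(i)(f) - xs(miss)(f))
        val diffHit = math.abs(xs(i)(f) - xs(hit)(f))
        w(f) += (diffMiss - diffHit) / xs.length
      }
    }
    w
  }
}
```

A feature that separates the classes ends up with a larger weight than one that varies mostly within a class, which is the ranking that a selector then truncates to the top features.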

Spark package: https://spark-packages.org/package/sramirez/spark-RELIEFFC-fselection.

Main features:

  • Compatible with Spark 2.2.0 and the ml API.
  • Supports sparse data and high-dimensional datasets (millions of features).
  • Includes a new heuristic that removes redundant features from the final selection set.
  • Scales to large sample sets.

This software has been tested on several large-scale datasets.

Example (ml):

import org.apache.spark.ml.feature._

// df is assumed to be a DataFrame with a vector column "features" and a label column "labelCol"
val selector = new ReliefFRSelector()
  .setNumTopFeatures(10)
  .setEstimationRatio(0.1)
  .setSeed(123456789L) // for sampling
  .setNumNeighbors(5) // k-NN used in RELIEF
  .setDiscreteData(true)
  .setInputCol("features") // this must be a feature vector
  .setLabelCol("labelCol")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

Prerequisites:

RELIEF requires normalized input data so that feature ranks and nearest-neighbor searches are comparable. Additionally, continuous data should have zero mean and unit standard deviation for better redundancy estimation. We recommend using the MLlib StandardScaler to standardize the data:

https://spark.apache.org/docs/latest/ml-features.html#standardscaler
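
As a rough illustration of that recommendation, the sketch below standardizes a continuous feature column with Spark's `StandardScaler` before it would be passed to the selector. The toy values and the column name `rawFeatures` are assumptions made for the example. Note that `setWithMean(true)` produces dense vectors, so it may not suit very sparse data.

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("scaler-sketch").getOrCreate()
import spark.implicits._

// Toy data (hypothetical values): a label and a continuous feature vector.
val raw = Seq(
  (0.0, Vectors.dense(1.0, 100.0)),
  (1.0, Vectors.dense(2.0, 300.0)),
  (0.0, Vectors.dense(3.0, 500.0))
).toDF("labelCol", "rawFeatures")

val scaler = new StandardScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setWithMean(true) // center each feature to zero mean (output becomes dense)
  .setWithStd(true)  // scale each feature to unit standard deviation

// The scaled column is named "features" so it can serve as the selector's input column.
val scaled = scaler.fit(raw).transform(raw)
```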

Likewise, a one-hot encoder is recommended for nominal features (unordered discrete data).
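
One possible way to do this is sketched below: the nominal column is indexed with `StringIndexer` and then encoded with `OneHotEncoder`. The column names and toy values are hypothetical. Both stages are wrapped in a `Pipeline` because `OneHotEncoder` is a transformer in Spark 2.2 and an estimator in later releases, and the pipeline is meant to handle either case.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("onehot-sketch").getOrCreate()
import spark.implicits._

// Toy data (hypothetical values): a label and one nominal feature.
val nominal = Seq(
  (0.0, "red"), (1.0, "green"), (0.0, "blue"), (1.0, "red")
).toDF("labelCol", "color")

// Map each category string to a numeric index, then to a sparse indicator vector.
val indexer = new StringIndexer().setInputCol("color").setOutputCol("colorIndex")
val encoder = new OneHotEncoder().setInputCol("colorIndex").setOutputCol("colorVec")

val encoded = new Pipeline()
  .setStages(Array(indexer, encoder))
  .fit(nominal)
  .transform(nominal)
```

By default the encoder drops the last category, so three colors yield indicator vectors of length two. The encoded column would still need to be assembled into the feature vector, for example with `VectorAssembler`.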

Contributors

References

[1] I. Kononenko, E. Simec, M. Robnik-Sikonja, Overcoming the myopia of inductive learning algorithms with RELIEFF, Applied Intelligence 7 (1) (1997) 39–55.