Size of fat jars #340

Open
aalexandrov opened this issue May 4, 2017 · 5 comments
@aalexandrov
Contributor

This is a general discussion question regarding the size of the fat jars produced by the emma-examples-spark and emma-examples-flink modules.

Running

find . -name '*.jar' | grep -v original | grep -v nexus | xargs du -hs

in the project root shows the following output:

65M	./emma-examples/emma-examples-spark/target/emma-examples-spark-0.2-SNAPSHOT.jar
64M	./emma-examples/emma-examples-flink/target/emma-examples-flink-0.2-SNAPSHOT.jar
440K	./emma-examples/emma-examples-library/target/emma-examples-library-0.2-SNAPSHOT.jar
420K	./emma-examples/emma-examples-library/target/emma-examples-library-0.2-SNAPSHOT-tests.jar
148K	./emma-spark/target/emma-spark-0.2-SNAPSHOT.jar
148K	./emma-flink/target/emma-flink-0.2-SNAPSHOT.jar
20K	./emma-gui/target/emma-gui-0.2-SNAPSHOT.jar
56K	./emma-quickstart/target/emma-quickstart-0.2-SNAPSHOT.jar
3,7M	./emma-language/target/emma-language-0.2-SNAPSHOT.jar
3,9M	./emma-language/target/emma-language-0.2-SNAPSHOT-tests.jar

The emma-examples-flink and emma-examples-spark jars are ~65M each, which also indicates the size to expect from any future client jar that bundles emma-language with either emma-flink or emma-spark.

A closer look at emma-examples-spark reveals the root causes (the output is similar for emma-examples-flink).

mvn dependency:list -DincludeScope=runtime -DoutputAbsoluteArtifactFilename=true \
  | grep "$HOME/.m2/repository" \
  | awk -F":compile:" '{print $2}' \
  | xargs du -hs \
  | sort -r -h \
  | sed "s|$HOME/.m2/repository/||"

The list looks as follows.

14M	org/scalanlp/breeze_2.11/0.12/breeze_2.11-0.12.jar
12M	org/scalaz/scalaz-core_2.11/7.2.7/scalaz-core_2.11-7.2.7.jar
7,0M	org/spire-math/spire_2.11/0.7.4/spire_2.11-0.7.4.jar
4,4M	org/typelevel/cats-kernel_2.11/0.9.0/cats-kernel_2.11-0.9.0.jar
3,7M	org/emmalanguage/emma-language/0.2-SNAPSHOT/emma-language-0.2-SNAPSHOT.jar
3,4M	com/chuusai/shapeless_2.11/2.3.2/shapeless_2.11-2.3.2.jar
3,3M	org/typelevel/cats-core_2.11/0.9.0/cats-core_2.11-0.9.0.jar
3,0M	org/scalacheck/scalacheck_2.11/1.13.4/scalacheck_2.11-1.13.4.jar
2,0M	org/apache/commons/commons-math3/3.4.1/commons-math3-3.4.1.jar
1,2M	org/typelevel/cats-laws_2.11/0.9.0/cats-laws_2.11-0.9.0.jar
1,2M	net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1.jar
1,1M	org/xerial/snappy/snappy-java/1.1.2.6/snappy-java-1.1.2.6.jar
1,0M	org/apache/parquet/parquet-jackson/1.9.0/parquet-jackson-1.9.0.jar
944K	org/apache/parquet/parquet-column/1.9.0/parquet-column-1.9.0.jar
780K	org/apache/parquet/parquet-encoding/1.9.0/parquet-encoding-1.9.0.jar
764K	org/codehaus/jackson/jackson-mapper-asl/1.9.11/jackson-mapper-asl-1.9.11.jar
748K	com/github/rwl/jtransforms/2.4.0/jtransforms-2.4.0.jar
724K	org/scalactic/scalactic_2.11/3.0.3/scalactic_2.11-3.0.3.jar
480K	log4j/log4j/1.2.17/log4j-1.2.17.jar
440K	org/emmalanguage/emma-examples-library/0.2-SNAPSHOT/emma-examples-library-0.2-SNAPSHOT.jar
384K	org/apache/parquet/parquet-format/2.3.1/parquet-format-2.3.1.jar
344K	com/univocity/univocity-parsers/2.4.1/univocity-parsers-2.4.1.jar
288K	io/spray/spray-json_2.11/1.3.3/spray-json_2.11-1.3.3.jar
280K	org/typelevel/cats-free_2.11/0.9.0/cats-free_2.11-0.9.0.jar
276K	com/typesafe/config/1.3.1/config-1.3.1.jar
268K	org/apache/parquet/parquet-hadoop/1.9.0/parquet-hadoop-1.9.0.jar
244K	io/verizon/quiver/core_2.11/5.5.14-scalaz-7.2/core_2.11-5.5.14-scalaz-7.2.jar
228K	org/codehaus/jackson/jackson-core-asl/1.9.11/jackson-core-asl-1.9.11.jar
208K	org/typelevel/cats-kernel-laws_2.11/0.9.0/cats-kernel-laws_2.11-0.9.0.jar
180K	org/scalanlp/breeze-macros_2.11/0.12/breeze-macros_2.11-0.12.jar
164K	com/github/mpilquist/simulacrum_2.11/0.10.0/simulacrum_2.11-0.10.0.jar
164K	com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar
148K	org/emmalanguage/emma-spark/0.2-SNAPSHOT/emma-spark-0.2-SNAPSHOT.jar
144K	com/github/scopt/scopt_2.11/3.5.0/scopt_2.11-3.5.0.jar
108K	com/jsuereth/scala-arm_2.11/2.0/scala-arm_2.11-2.0.jar
96K	commons-pool/commons-pool/1.5.4/commons-pool-1.5.4.jar
88K	org/spire-math/spire-macros_2.11/0.7.4/spire-macros_2.11-0.7.4.jar
72K	commons-codec/commons-codec/1.5/commons-codec-1.5.jar
44K	org/typelevel/discipline_2.11/0.7.2/discipline_2.11-0.7.2.jar
44K	org/slf4j/slf4j-api/1.7.25/slf4j-api-1.7.25.jar
44K	org/apache/parquet/parquet-common/1.9.0/parquet-common-1.9.0.jar
36K	org/typelevel/machinist_2.11/0.6.1/machinist_2.11-0.6.1.jar
24K	com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar
20K	net/sf/opencsv/opencsv/2.3/opencsv-2.3.jar
16K	org/scala-sbt/test-interface/1.0/test-interface-1.0.jar
12K	org/typelevel/catalysts-macros_2.11/0.0.5/catalysts-macros_2.11-0.0.5.jar
12K	org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar
8,0K	org/typelevel/cats-macros_2.11/0.9.0/cats-macros_2.11-0.9.0.jar
8,0K	com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar
4,0K	org/typelevel/macro-compat_2.11/1.1.1/macro-compat_2.11-1.1.1.jar
4,0K	org/typelevel/cats-jvm_2.11/0.9.0/cats-jvm_2.11-0.9.0.jar
4,0K	org/typelevel/cats_2.11/0.9.0/cats_2.11-0.9.0.jar
4,0K	org/typelevel/catalysts-platform_2.11/0.0.5/catalysts-platform_2.11-0.0.5.jar
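
To put numbers on which dependencies dominate, the listing can be turned into percentages of the total. The following is a minimal sketch, assuming the sizes are collected with `du -k` (plain kilobytes) instead of `du -hs`, since human-readable values with locale-specific decimal commas are awkward to sum. The helper name `size_share` is hypothetical.

```shell
# Reads "SIZE_KB<TAB>PATH" lines (the format printed by `du -k`) on stdin
# and prints each entry with its share of the total size, largest first.
size_share() {
  sort -k1,1nr | awk -F'\t' '
    { size[NR] = $1; path[NR] = $2; total += $1 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%5.1f%%\t%s\n", (total ? 100 * size[i] / total : 0), path[i]
    }'
}
```

In the pipeline above, this would amount to replacing `xargs du -hs | sort -r -h` with `xargs du -k | size_share` and keeping the final `sed` step.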

It might be better to rely on the breeze version shipped with the dataflow engine rather than bundling our own. @ParkL could you check the versions bundled with Spark 2.1.0 and Flink 1.2.1?

I am not sure what to do with scalaz. It seems that we only use it because of quiver, and I am not aware of any alternative that has a smaller footprint or, say, relies on cats.

I am open to suggestions.

@aalexandrov aalexandrov added this to the May 2017 milestone May 4, 2017
@aalexandrov aalexandrov self-assigned this May 4, 2017
@aalexandrov
Contributor Author

aalexandrov commented May 4, 2017

Flink 1.2.0 bundles breeze 0.12.

Spark 2.1.0 bundles breeze 0.12 as well, but with some exclusions.

I am not sure whether those are available in the classpath when submitting a job against a running cluster.
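
One rough way to check this on a cluster node is to look for the jar in the engine's library directory. The sketch below assumes a Spark 2.x layout with jars under `$SPARK_HOME/jars` and a Flink layout with jars under `$FLINK_HOME/lib`. Both paths and the helper name `bundled_jars` are assumptions to adjust for the actual installation. It matches stand-alone jar files by name only, so classes repackaged inside an uber jar would not be detected.

```shell
# List jars under DIR whose file name starts with PREFIX.
# Only stand-alone jar files are matched; classes shaded into another
# jar are not detected by this check.
bundled_jars() {
  dir=$1
  prefix=$2
  find "$dir" -type f -name "${prefix}*.jar" 2>/dev/null | sort
}

# Hypothetical usage on a cluster node:
#   bundled_jars "$SPARK_HOME/jars" breeze
#   bundled_jars "$FLINK_HOME/lib" breeze
```

An empty result suggests the library is not shipped as a separate jar with that installation, but it does not rule out a shaded copy.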

@joroKr21
Member

joroKr21 commented May 5, 2017

I can access Breeze in the Spark REPL, but not in the Flink REPL. I have a few questions:

  1. How should libraries available in Spark and Flink be scoped - as provided?
  2. Why are test libraries like scalacheck and cats-laws submitted with the jar?
  3. Currently quiver is only needed at compile time. Can't we exclude it from the jar?

@aalexandrov
Contributor Author

  1. Yes, things that can be found in the Flink or Spark classpath should be marked as provided. My understanding is that those are excluded from the fat jar built by the shade plugin.
  2. I guess that those can be reached along some path in the dependency tree that is neither test nor provided.
  3. This is a great idea!
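
To find the path that pulls a test library such as scalacheck into the runtime classpath, the Maven dependency plugin can print only the branches of the tree that lead to a given artifact. This is a small sketch using the plugin's `includes` filter. The wrapper name `why_bundled` is hypothetical, and the command is meant to be run from the module directory (for example emma-examples-spark).

```shell
# Print the branches of the dependency tree that lead to artifacts matching
# PATTERN, given as groupId or groupId:artifactId.
why_bundled() {
  mvn dependency:tree -Dincludes="$1"
}

# Hypothetical usage:
#   why_bundled org.scalacheck
#   why_bundled org.typelevel:cats-laws_2.11
```

The parent chain printed above each match shows which direct dependency would need a scope change or an exclusion.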

@aalexandrov
Contributor Author

aalexandrov commented May 5, 2017

Breeze is not in the Flink REPL because it's not a top level dependency in Flink (it's only listed in Flink's ML library).

@joroKr21
Member

joroKr21 commented May 8, 2017

So MLlib and Flink-ML have a dependency on Breeze and Breeze has a dependency on Shapeless. The newest version of Breeze depends on the newest version of Shapeless, but MLlib and Flink-ML reference older versions of Breeze.

@aalexandrov aalexandrov modified the milestones: May 2017, Jun 2017 Jun 1, 2017
@aalexandrov aalexandrov modified the milestones: Jun 2017, Jul 2017 Jul 2, 2017
@aalexandrov aalexandrov modified the milestone: Jul 2017 Aug 20, 2017