
Data Analysis

Will Smith edited this page Dec 12, 2019 · 1 revision

Data Exploration and Analysis in Boatnet [ Work in Progress ]

We need new tools and techniques to analyze the complex JSON data that the Boatnet application suite stores in CouchDB (NoSQL).

Our approach is to utilize tools designed for Big Data:

For analysis (read-only)

  • Apache Drill: http://drill.apache.org/ gives us full ANSI SQL capabilities. We will run it against CouchDB data dumped to a local or distributed filesystem.

  • Apache Spark: https://spark.apache.org/ gives us a direct Cloudant/CouchDB data connector (https://bahir.apache.org/docs/spark/current/spark-sql-cloudant/) but requires significant infrastructure to be performant.

  • Apache Zeppelin: https://zeppelin.apache.org/ as a notebook-style frontend to this data. This allows interactive SQL queries, visualizations, and exporting to Excel and CSV data formats. It supports SQL, R, Python, and Scala languages, and has numerous data adapters to access our data via Drill, Spark, and many others.

In our initial tests, Apache Drill has performed better than Spark. The initial rollout of a Boatnet data analysis example suite will be a Zeppelin instance communicating with Apache Drill, which reads a static dump of the CouchDB data.
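Outside of Zeppelin, a Drill instance can also be queried over its REST interface, which is handy for quick checks of a dump. The following is a minimal sketch, not our deployed setup: it assumes Drill is running locally on its default web port (8047), and the dump path and field name in the example query are hypothetical.

```python
import json
import urllib.request


def build_drill_request(sql, base_url="http://localhost:8047"):
    """Build a POST request for Drill's REST query endpoint (/query.json)."""
    payload = {"queryType": "SQL", "query": sql}
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, base_url="http://localhost:8047"):
    """Send the query to a running Drill instance and return its result rows."""
    with urllib.request.urlopen(build_drill_request(sql, base_url)) as resp:
        return json.load(resp).get("rows", [])


# Hypothetical dump location and field name, for illustration only.
EXAMPLE_SQL = "SELECT t.`_id` FROM dfs.`/data/boatnet-dump` t LIMIT 10"
```

Calling `run_drill_query(EXAMPLE_SQL)` requires a reachable Drill instance; `build_drill_request` can be inspected without one.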

Data Maintenance: Write access for data stewards

  • Currently we have tools in progress that communicate directly with our CouchDB, such as the Lookups editor in obs-web. https://github.com/nwfsc-fram/boatnet
  • These tools will probably also allow batch updates of data in CouchDB. Details TBD.

Dumping CouchDB JSON Data for Analysis

  • See https://github.com/wsmith-nwfsc/couchdb-dump which is a port to export data in a simple format.
  • Note that Drill performs better on well-partitioned data. Detailed notes on this are TBD, but so far we have seen good performance when partitioning by year.
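As an illustration of partitioning by year, the sketch below splits a list of dumped documents into one subdirectory per year. It is an assumption-laden example rather than part of couchdb-dump: the `createdDate` field name, the ISO 8601 timestamp format, and the `docs.json` file name are all hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def partition_by_year(docs, out_dir, date_field="createdDate"):
    """Write docs to out_dir/<year>/docs.json, one JSON object per line.

    date_field is a hypothetical field holding an ISO 8601 timestamp such as
    "2019-12-12T10:00:00Z". Docs without a usable value go to "unknown".
    Returns a dict mapping each year to the number of docs written.
    """
    by_year = defaultdict(list)
    for doc in docs:
        value = doc.get(date_field)
        # The first four characters of an ISO 8601 timestamp are the year.
        if isinstance(value, str) and len(value) >= 4 and value[:4].isdigit():
            year = value[:4]
        else:
            year = "unknown"
        by_year[year].append(doc)

    written = {}
    for year, group in by_year.items():
        year_dir = Path(out_dir) / year
        year_dir.mkdir(parents=True, exist_ok=True)
        with (year_dir / "docs.json").open("w", encoding="utf-8") as fh:
            for doc in group:
                fh.write(json.dumps(doc) + "\n")
        written[year] = len(group)
    return written
```

With this layout, Drill exposes the first directory level as the implicit column `dir0`, so a query can restrict itself to a single year with a filter such as `WHERE dir0 = '2019'`.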

[TODO] Configuring Apache Drill locally

[TODO - Optional] Configuring Apache Spark SQL locally and production deployment notes

[TODO] Configuring Apache Zeppelin

[TODO] Getting Started with Apache Zeppelin on CouchDB Data