Data Analysis

Data Exploration and Analysis in Boatnet [ Work in Progress ]

We will need new tools and techniques to analyze the complex JSON data that the Boatnet application suite stores in CouchDB (a NoSQL database).

Our approach is to utilize tools designed for Big Data:

For analysis (read-only)

Apache Drill: http://drill.apache.org/ provides full ANSI SQL capabilities. We will run it on CouchDB data dumped to a local or distributed filesystem.
Apache Spark: https://spark.apache.org/ provides a direct Cloudant/CouchDB data connector (https://bahir.apache.org/docs/spark/current/spark-sql-cloudant/) but requires significant infrastructure to be performant.
Apache Zeppelin: https://zeppelin.apache.org/ serves as a notebook-style frontend to this data. It allows interactive SQL queries, visualizations, and export to Excel and CSV formats. It supports SQL, R, Python, and Scala, and has numerous data adapters that can reach our data via Drill, Spark, and many others.

In our initial tests, Apache Drill has performed better than Spark. The initial rollout of a Boatnet data analysis example suite will be a Zeppelin instance communicating with Apache Drill on a static dump of the CouchDB data.
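As a rough illustration of what a query against the dumped data could look like outside Zeppelin, the sketch below builds a request for Drill's REST query endpoint. It assumes a local Drill instance listening on the default web port 8047 and a dump under /data/boatnet-dump with one subdirectory per year. The path, the function names, and the example query are hypothetical and are not part of Boatnet.

```python
import json
import urllib.request

# Assumption: a local Drill instance with its REST API on the default port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, url=DRILL_URL):
    """Send the query to Drill and return the result rows as a list of dicts."""
    with urllib.request.urlopen(build_drill_request(sql, url)) as response:
        return json.load(response).get("rows", [])


# Hypothetical dump location. Drill exposes the first level of subdirectories
# under the queried path as the implicit column dir0, so a per-year layout
# can be filtered without reading the other years.
EXAMPLE_SQL = (
    "SELECT COUNT(*) AS doc_count "
    "FROM dfs.`/data/boatnet-dump` "
    "WHERE dir0 = '2019'"
)
```

With a Drill instance running, run_drill_query(EXAMPLE_SQL) would return the matching rows. In Zeppelin the same SQL could be issued through whichever interpreter is configured for Drill.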

Data Maintenance: Write access for data stewards

Tools currently in progress, such as the Lookups editor in obs-web (https://github.com/nwfsc-fram/boatnet), communicate directly with our CouchDB instance.
These tools will probably also allow batch updates of data in CouchDB. Details TBD.
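Since the batch update details are still TBD, the following is only a sketch of one possible mechanism: CouchDB's _bulk_docs endpoint, which accepts a JSON body of the form {"docs": [...]}. The helper name and the example document fields are hypothetical. Nothing here reflects how the Boatnet tools are actually implemented.

```python
import json


def build_bulk_update(docs, changes):
    """Return a _bulk_docs request body that applies `changes` to every doc.

    Each input doc must carry the _id and _rev that CouchDB returned when it
    was read. CouchDB rejects an update whose _rev is stale, so a concurrent
    edit by another data steward is reported as a conflict, not overwritten.
    """
    updated = []
    for doc in docs:
        if "_id" not in doc or "_rev" not in doc:
            raise ValueError("each document needs _id and _rev to be updated")
        # Copy the document so the caller's originals are left untouched.
        updated.append({**doc, **changes})
    return {"docs": updated}


# Hypothetical lookup documents, as they might look after being read.
lookups = [
    {"_id": "lookup-1", "_rev": "1-abc", "type": "lookup", "isActive": True},
    {"_id": "lookup-2", "_rev": "3-def", "type": "lookup", "isActive": True},
]
request_body = json.dumps(build_bulk_update(lookups, {"isActive": False}))
```

The resulting request_body would be sent with a POST to the database's _bulk_docs URL. The response lists a result per document, so conflicts can be reviewed individually.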

Dumping CouchDB JSON Data for Analysis

See https://github.com/wsmith-nwfsc/couchdb-dump, a port that exports data in a simple format.
Note that Drill performs better on well-partitioned data. Detailed notes on this are TBD, but so far we seem to get good performance by partitioning by year.
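To make the partitioning idea concrete, here is a minimal sketch that splits already-dumped documents into one directory per year, writing one JSON object per line. The date field name (createdDate), the ISO 8601 date format, and the output file name are assumptions made for illustration. The real Boatnet documents and the couchdb-dump output may differ.

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical field holding an ISO 8601 date string such as "2019-05-01T...".
DATE_FIELD = "createdDate"


def year_of(doc, date_field=DATE_FIELD):
    """Return the four-digit year prefix of the date field, or "unknown"."""
    year = str(doc.get(date_field, ""))[:4]
    return year if len(year) == 4 and year.isdigit() else "unknown"


def partition_by_year(docs, out_dir, date_field=DATE_FIELD):
    """Write docs to <out_dir>/<year>/docs.json and return counts per year."""
    by_year = defaultdict(list)
    for doc in docs:
        by_year[year_of(doc, date_field)].append(doc)

    written = {}
    for year, group in by_year.items():
        year_dir = Path(out_dir) / year
        year_dir.mkdir(parents=True, exist_ok=True)
        # One JSON object per line keeps each file easy to stream.
        with (year_dir / "docs.json").open("w", encoding="utf-8") as handle:
            for doc in group:
                handle.write(json.dumps(doc) + "\n")
        written[year] = len(group)
    return written
```

Documents without a usable date land in an "unknown" directory instead of being dropped, so they can be inspected later.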

[TODO] Configuring Apache Drill locally

http://drill.apache.org/


[TODO- Optional] Configuring Apache Spark SQL locally and production deployment notes

[TODO] Configuring Apache Zeppelin

[TODO] Getting Started with Apache Zeppelin on CouchDB Data
