Data Analysis
We need new tools and techniques to analyze the complex JSON data that the Boatnet application suite stores in CouchDB (NoSQL).
Our approach is to use tools designed for big data:
- Apache Drill (http://drill.apache.org/) gives us full ANSI SQL capabilities. We will run it on CouchDB data dumped to a local or distributed filesystem.
- Apache Spark (https://spark.apache.org/) gives us a direct Cloudant/CouchDB data connector (https://bahir.apache.org/docs/spark/current/spark-sql-cloudant/) but requires significant infrastructure to be performant.
- Apache Zeppelin (https://zeppelin.apache.org/) serves as a notebook-style frontend to this data. It allows interactive SQL queries, visualizations, and export to Excel and CSV formats. It supports SQL, R, Python, and Scala, and has numerous data adapters for accessing our data via Drill, Spark, and many others.
In our initial tests, Apache Drill performed better than Spark. The initial rollout of a Boatnet data analysis example suite will be a Zeppelin instance that queries Apache Drill against a static dump of the CouchDB data.
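Outside of Zeppelin, a query against Drill could be sketched in Python through Drill's REST endpoint (`POST /query.json`). This is only a minimal sketch: the host and port (`localhost:8047`, Drill's default web port), the `dfs` path `/data/couchdb-dump`, and the helper names are assumptions for illustration, not part of the Boatnet setup.

```python
import json
import urllib.request

# Assumption: a Drill instance running locally with its default web port.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical query: the dfs path is a placeholder for wherever the dump lives.
EXAMPLE_SQL = "SELECT t.`_id` FROM dfs.`/data/couchdb-dump` t LIMIT 10"


def build_drill_request(sql, url=DRILL_URL):
    """Build (but do not send) the HTTP request for a Drill SQL query."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_drill_query(sql, url=DRILL_URL):
    """Send the query to Drill and return the result rows as a list of dicts."""
    with urllib.request.urlopen(build_drill_request(sql, url)) as resp:
        return json.load(resp).get("rows", [])
```

Calling `run_drill_query(EXAMPLE_SQL)` requires a reachable Drill instance; `build_drill_request` can be inspected without one.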
- We currently have tools in progress that communicate directly with our CouchDB, such as the Lookups editor in obs-web: https://github.com/nwfsc-fram/boatnet
- These tools will probably also allow batch updates of data in CouchDB. Details TBD.
- See https://github.com/wsmith-nwfsc/couchdb-dump, a port that exports data in a simple format.
- Note that Drill performs better on well-partitioned data. Detailed notes are TBD; so far, partitioning by year has given good performance.
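One way partitioning by year could look is sketched below: documents from a dump are grouped into one subdirectory per year, which Drill can then filter on through its implicit directory column (`dir0`). The date field name `createdDate`, the ISO-style date string, and the output layout are all assumptions for illustration; the actual dump format and the partitioning scheme we use may differ.

```python
import json
from collections import defaultdict
from pathlib import Path


def partition_by_year(docs, out_dir, date_field="createdDate"):
    """Write docs as newline-delimited JSON into <out_dir>/<year>/docs.json.

    Assumption: each doc is a dict whose `date_field` (hypothetical name)
    holds a string starting with a four-digit year, e.g. "2019-05-01T...".
    Docs without such a value go to an "unknown" directory.
    Returns a dict mapping each year to the number of docs written.
    """
    by_year = defaultdict(list)
    for doc in docs:
        value = doc.get(date_field)
        if isinstance(value, str) and len(value) >= 4 and value[:4].isdigit():
            year = value[:4]
        else:
            year = "unknown"
        by_year[year].append(doc)

    written = {}
    for year, group in by_year.items():
        year_dir = Path(out_dir) / year
        year_dir.mkdir(parents=True, exist_ok=True)
        with (year_dir / "docs.json").open("w", encoding="utf-8") as fh:
            for doc in group:
                fh.write(json.dumps(doc) + "\n")
        written[year] = len(group)
    return written
```

With this layout, a query such as ``SELECT * FROM dfs.`/path/to/out_dir` WHERE dir0 = '2019'`` (path hypothetical) would let Drill read only the matching year's directory.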