Home

Short Name

Analyse historical shopping data with Spark and PixieDust in a Jupyter notebook

Short Description

Use Jupyter Notebooks with IBM Watson Studio to analyse historical shopping data with the open-source Python packages Apache Spark and PixieDust. Create bar charts, line charts, scatter plots, pie charts, histograms and maps without any coding.

Offering Type

Cognitive

Introduction

This code pattern shows how to analyse historical shopping data with Jupyter Notebooks in IBM Watson Studio and the open-source Python packages Apache Spark and PixieDust. Users can quickly analyse data and produce charts and maps.

Author

By Patrick Titzler and Margriet Groenendijk

Code

https://github.com/IBM/analyze-customer-data-spark-pixiedust

Demo

link to demo video

Video

link to youtube video

Overview

In this code pattern historical shopping data is analysed in a Jupyter notebook with the open-source Python packages Apache Spark and PixieDust.

When the reader has completed this code pattern, they will understand how to:

Use Jupyter Notebooks in IBM Watson Studio
Load data with PixieDust and clean data with Spark
Create charts and maps with PixieDust

Flow

Log in to IBM Watson Studio
Load the provided notebook into Watson Studio
Load the customer data in the notebook
Transform the data with Apache Spark
Create charts and maps with PixieDust

Included components

IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
IBM Apache Spark: an open-source cluster computing framework optimized for extremely fast, large-scale data processing

Featured Technologies

Jupyter notebooks: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text
PixieDust: an open-source Python helper library for Jupyter notebooks that simplifies loading, exploring and visualising data

Blog

Blog Title: Speeding up data exploration with PixieDust and Jupyter notebooks

Blog Author: Margriet Groenendijk

Blog Content - see below

With PixieDust you can use the power of Python and Jupyter notebooks when you:

have never coded before
are an experienced data analyst or data scientist
are a developer with little Python experience wanting to quickly explore some data

Jupyter notebooks are a tool used by many data scientists to wrangle and clean data, visualise data, build and test machine learning models and even write talks. The reason is that text, code, figures and tables can all be combined in one document, which makes it easy to keep the code structured by adding comments and explanations of your thought process and the decisions you made.

Many Python packages are available for visualising data. When you are just getting started this can be overwhelming. Even when you are experienced it still takes time to create charts, because the syntax of each package is slightly different, and it is easy to spend a lot of time tweaking your code to create the perfect chart. I have to admit I tend to do this as it is so much fun, but it is definitely not always necessary.

With PixieDust you can explore data in a simpler way and spend your time on the data itself, instead of going down the rabbit hole of tweaking code to change the colours, fonts, line styles, axes and anything else you can change manually.

The main command to create charts from Spark or pandas DataFrames is display(df). When you run this command in a notebook cell, the data is displayed in a table. You can then scroll through the data, filter it or create a chart from a menu, all by clicking a few buttons.
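
As a minimal sketch of such a cell, the code below builds a small pandas DataFrame and passes it to display(). The column names and values are invented for illustration and are not taken from the code pattern's data set. The fallback to print() is also an assumption, added only so that the cell still runs where PixieDust is not installed.

```python
import pandas as pd

# Hypothetical shopping records; column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "product": ["shoes", "shirt", "hat", "shoes"],
        "quantity": [1, 2, 1, 3],
        "unit_price": [50.0, 20.0, 15.0, 50.0],
    }
)

# Derived column so that there is a numeric value to chart.
df["total"] = df["quantity"] * df["unit_price"]

try:
    # In a Jupyter notebook with PixieDust installed, importing the package
    # makes PixieDust's display() available in the notebook.
    import pixiedust

    display(df)  # renders the interactive table with its chart and filter menus
except (ImportError, NameError):
    # Assumed fallback: no PixieDust or no notebook kernel, so print plain text.
    print(df.to_string(index=False))
```

In the table that appears in the notebook, you would typically open the chart menu, pick a chart type such as a bar chart, and choose product as the key and total as the value.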

PixieDust uses other visualisation packages to create the charts, such as matplotlib, bokeh, seaborn and Brunel. You can see it as a clever wrapper around these libraries that will save you time while exploring data.
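
To give a sense of what the wrapper saves you, here is roughly what a hand-written bar chart looks like when you call matplotlib directly. This is an illustrative sketch with made-up data and labels, not the code PixieDust runs internally.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales records; the values are made up for illustration.
df = pd.DataFrame(
    {
        "product": ["shoes", "shirt", "hat", "shoes"],
        "total": [50.0, 40.0, 15.0, 150.0],
    }
)

# Aggregate by hand: total sales per product, largest first.
totals = df.groupby("product")["total"].sum().sort_values(ascending=False)

# Every visual detail is set explicitly in code.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(totals.index, totals.values)
ax.set_xlabel("Product")
ax.set_ylabel("Total sales")
ax.set_title("Sales by product")
fig.tight_layout()
```

Each change to the chart, such as a different aggregation, chart type or label, means editing and rerunning this code, whereas in the PixieDust table the same choices are made from the menu.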

To explore PixieDust you can go through this code pattern, in which historical shopping data is analysed with Spark and PixieDust. The data is loaded, cleaned and then analysed by creating various charts and maps. The Jupyter notebooks run in IBM Watson Studio. The code pattern guides you through all the steps to set up your IBM Cloud account, create the notebook and run it.

If you want to jump straight to the code, the GitHub repository contains the notebook, which you can run either in the cloud or locally.

To learn more about PixieDust and Jupyter notebooks these are a few resources to get you started:

Learn more

Watson Studio: Master the art of data science with IBM's Watson Studio
Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
With Watson: Want to take your Watson app to the next level? Looking to utilize Watson Brand assets? Join the With Watson program to leverage exclusive brand, marketing, and tech resources to amplify and accelerate your Watson embedded commercial solution.
