In this code pattern, historical shopping data is analyzed with Spark and PixieDust. The data is loaded, cleaned, and then analyzed by creating various charts and maps.
When you have completed this code pattern, you will understand how to:
- Use Jupyter Notebooks in IBM Watson Studio
- Load data with PixieDust and clean data with Spark
- Create charts and maps with PixieDust
The intended audience is anyone interested in quickly analyzing data in a Jupyter notebook.
- Log in to IBM Watson Studio
- Load the provided notebook into Watson Studio
- Load the customer data in the notebook
- Transform the data with Apache Spark
- Create charts and maps with PixieDust
- x19_income_select.csv: Household income statistics for many categories of income, including wages, interest, social security, public assistance, and retirement. Compiled at the zip code geography level by the United States Census Bureau. Available as a data set on Watson Studio
- customers_orders1_opt.csv: Fictitious customer demographics and sales data. Published by IBM. Available as a data set on Watson Studio
- IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
- PixieDust: an open source Python helper library for Jupyter notebooks that simplifies loading, exploring, and visualizing data.
- Create a project
- Create a notebook
- Load customer data in the notebook
- Transform the data with Apache Spark
- Create charts and maps with PixieDust
- Log in to IBM Watson Studio. Once in, you'll land on the dashboard.
- Create a new project by clicking `+ New project` and choosing `Data Science`.
- Enter a name for the project and click `Create`.
- NOTE: By creating a project in Watson Studio, a free tier `Object Storage` service and a `Watson Machine Learning` service will be created in your IBM Cloud account. Select the `Free` storage type to avoid fees.
- Upon successful project creation, you are taken to a dashboard view of your project. Take note of the `Assets` and `Settings` tabs; we'll use them to associate the project with external assets (datasets and notebooks) and IBM Cloud services.
- From the new project `Overview` panel, click `+ Add to project` on the top right and choose the `Notebook` asset type.
- Fill in the following information:
  - Select the `From URL` tab.
  - Enter a `Name` for the notebook and optionally a description.
  - Under `Notebook URL` provide the following URL: https://raw.githubusercontent.com/IBM/analyze-customer-data-spark-pixiedust/master/notebooks/analyze-customer-data.ipynb
  - For `Runtime` select the `Spark Python 3.6` option.
- Click the `Create` button.
- TIP: Once successfully imported, the notebook should appear in the `Notebooks` section of the `Assets` tab.
- Run the cells one at a time. Select the first cell and press the `(►) Run` button to start stepping through the notebook.
- Load the data set customers_orders1_opt.csv into the notebook.
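One way to do this load is PixieDust's `sampleData` helper, which reads a CSV from a URL into a DataFrame. The following is a minimal sketch, assuming it runs inside the notebook runtime where `pixiedust` is installed. `load_customers` is a hypothetical wrapper name, and the URL of the CSV file is left for you to supply.

```python
def load_customers(url):
    """Load a CSV file from `url` into a DataFrame with PixieDust.

    `load_customers` is a hypothetical helper name used for illustration.
    """
    # Imported lazily because pixiedust is only expected to be available
    # inside the notebook runtime.
    import pixiedust

    return pixiedust.sampleData(url)


# Usage inside a notebook cell (replace the placeholder with the real URL):
# df = load_customers("<URL of customers_orders1_opt.csv>")
```

With a Spark runtime selected, the returned object is expected to be a Spark DataFrame, which is what the cleaning steps that follow operate on.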
Before analyzing the data, it needs to be cleaned and formatted. This can be done with a few pyspark commands:
- Select only the columns you are interested in with `df.select()`.
- Convert the AGE column to a numeric data type with a user-defined function (`udf`) so you can run calculations on customer age.
- Derive the gender of each customer from the salutation with a second `udf`, and rename the GenderCode column to GENDER.
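The three cleaning steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's exact code: the salutation-to-gender mapping, the helper names (`age_to_int`, `gender_from_salutation`, `clean_customers`), and the default column list are hypothetical. Only the AGE and GenderCode column names come from the text. The two conversion functions are plain Python so they can be checked without a Spark cluster.

```python
# Assumed salutation values; adjust to what the data actually contains.
MALE_SALUTATIONS = {"mr", "master"}
FEMALE_SALUTATIONS = {"mrs", "ms", "miss"}


def age_to_int(age):
    """Convert a raw AGE value to int, or None if it cannot be parsed."""
    try:
        return int(float(str(age).strip()))
    except (TypeError, ValueError, OverflowError):
        return None


def gender_from_salutation(salutation):
    """Map a salutation such as 'Mr.' or 'Mrs.' to a gender label."""
    key = (salutation or "").strip().rstrip(".").lower()
    if key in MALE_SALUTATIONS:
        return "male"
    if key in FEMALE_SALUTATIONS:
        return "female"
    return "unknown"


def clean_customers(df, columns=("AGE", "GenderCode")):
    """Apply select, AGE conversion, and gender derivation to a Spark DataFrame."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType, StringType

    to_int = udf(age_to_int, IntegerType())
    to_gender = udf(gender_from_salutation, StringType())
    return (
        df.select(*columns)
        .withColumn("AGE", to_int(col("AGE")))
        .withColumnRenamed("GenderCode", "GENDER")
        .withColumn("GENDER", to_gender(col("GENDER")))
    )


print(age_to_int("42"), gender_from_salutation("Mrs."))
```

If the AGE values are already clean numeric strings, `col("AGE").cast("int")` would likely be a simpler alternative to a udf for the numeric conversion.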
The data can now be explored with PixieDust:
- With `display()`, explore the data in a table.
- Then click the chart button to choose one of the charts in the list.
- Drag and drop the variables you want to display into the `Keys` and `Values` fields. Select the aggregation from the drop-down menu and click `OK`.
- From the menu on the right of the chart, you can select which renderer to use; each one visualizes the data in a different way. Other options include clustering by a variable, the size and orientation of the chart, and the display of a legend.
- The notebook includes two examples: a bar chart and a map.
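As a rough mental model of the chart options, choosing `Keys`, `Values`, and an aggregation amounts to grouping the rows by the key column and aggregating the value column per group. The sketch below shows that idea in plain Python. The `aggregate` function, the set of aggregation names, and the sample rows are assumptions made up for illustration, not PixieDust internals or real customer data.

```python
from collections import defaultdict


def aggregate(rows, key, value, how="AVG"):
    """Group `rows` (a list of dicts) by `key` and aggregate `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        # Skip rows where the value is missing so they do not skew the result.
        if row.get(value) is not None:
            groups[row[key]].append(row[value])
    funcs = {
        "SUM": sum,
        "AVG": lambda xs: sum(xs) / len(xs),
        "MIN": min,
        "MAX": max,
        "COUNT": len,
    }
    return {k: funcs[how](vs) for k, vs in groups.items()}


# Invented sample rows, only to show the shape of the computation.
sample = [
    {"GENDER": "female", "AGE": 30},
    {"GENDER": "female", "AGE": 40},
    {"GENDER": "male", "AGE": 50},
]
print(aggregate(sample, key="GENDER", value="AGE", how="AVG"))
```

On a Spark DataFrame, a comparable result could be computed with `df.groupBy("GENDER").avg("AGE")`; the chart then plots one bar per key.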
- Build a recommender with Apache Spark and Elasticsearch
- Create a web-based mobile health app using Watson services on IBM Cloud and IBM Watson Studio
- Use machine learning to predict U.S. opioid prescribers with Watson Studio and scikit-learn
- Watson Studio: Master the art of data science with IBM's Watson Studio
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.