Analyze customer data using Jupyter notebooks, Apache Spark, and PixieDust

In this code pattern, historical shopping data is analyzed with Spark and PixieDust. The data is loaded, cleaned, and then analyzed by creating various charts and maps.

When you have completed this code pattern, you will understand how to:

Use Jupyter Notebooks in IBM Watson Studio
Load data with PixieDust and clean data with Spark
Create charts and maps with PixieDust

The intended audience is anyone interested in quickly analyzing data in a Jupyter notebook.

Flow

Log in to IBM Watson Studio
Load the provided notebook into Watson Studio
Load the customer data in the notebook
Transform the data with Apache Spark
Create charts and maps with PixieDust

About the data

x19_income_select.csv: Household income statistics for many categories of income, including wages, interest, social security, public assistance, and retirement. Compiled at the zip code geography level by the United States Census Bureau. Available as a data set on Watson Studio
customers_orders1_opt.csv: Fictitious customer demographics and sales data. Published by IBM. Available as a data set on Watson Studio

Included Components

IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
PixieDust: Open source Python package, providing support for Javascript/Node.js code.

1. Create a project and add the Spark services

Log in to IBM's Watson Studio. Once in, you'll land on the dashboard.
Create a new project by clicking + New project and choosing Data Science.
Enter a name for the project and click Create.
NOTE: When you create a project in Watson Studio, a free tier Object Storage service and a Watson Machine Learning service are created in your IBM Cloud account. Select the Free storage type to avoid fees.
After the project is created, you are taken to a dashboard view of your project. Take note of the Assets and Settings tabs; we'll use them to associate the project with external assets (datasets and notebooks) and IBM Cloud services.

2. Create a notebook

From the new project Overview panel, click + Add to project on the top right and choose the Notebook asset type.
Fill in the following information:
- Select the From URL tab.
- Enter a Name for the notebook and optionally a description.
- Under Notebook URL provide the following URL: https://raw.githubusercontent.com/IBM/analyze-customer-data-spark-pixiedust/master/notebooks/analyze-customer-data.ipynb
- For Runtime select the Spark Python 3.6 option.
Click the Create button.
TIP: Once successfully imported, the notebook should appear in the Notebooks section of the Assets tab.

3. Load customer data in the notebook

Run the cells one at a time. Select the first cell and press the (►) Run button to start stepping through the notebook.
Load the data set customers_orders1_opt.csv into the notebook.
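
As a minimal sketch of this step, the cell that loads the data might look roughly like the following. It assumes PixieDust's `sampleData` helper is used to read a CSV from a URL; the function name `load_customers` and its `loader` parameter are hypothetical and only exist to keep the example self-contained. The notebook itself may load the file differently.

```python
def load_customers(csv_url, loader=None):
    """Load a CSV file from csv_url and return whatever the loader produces.

    `loader` is a hypothetical hook for this sketch. When it is omitted,
    PixieDust's sampleData helper is used, which is assumed to return a
    DataFrame for a CSV URL.
    """
    if loader is None:
        # Imported lazily: PixieDust is only available in environments
        # where it is installed, such as the Watson Studio notebook runtime.
        import pixiedust
        loader = pixiedust.sampleData
    return loader(csv_url)
```

In the notebook this would be called with the location of customers_orders1_opt.csv, for example `raw_df = load_customers(csv_url)`, where `csv_url` is a variable you set to the URL of the data set.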

4. Transform the data with Apache Spark

Before the data can be analyzed, it needs to be cleaned and formatted. This can be done with a few PySpark commands:

Select only the columns you are interested in with df.select()
With a user-defined function (udf), convert the AGE column to a numeric data type so you can run calculations on customer age.
With a second udf, derive the gender of each customer from the salutation, then rename the GenderCode column to GENDER.
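
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names (`age_to_int`, `gender_from_salutation`, `clean_customers`) are hypothetical, the salutation values are guesses at what the GenderCode column contains, and only the two columns named in this section are selected.

```python
def age_to_int(age):
    """Convert a raw AGE value such as '42' to an int, or None if it cannot be parsed."""
    try:
        return int(float(age))
    except (TypeError, ValueError, OverflowError):
        return None


def gender_from_salutation(salutation):
    """Map a salutation to a gender label. The salutation values are assumptions."""
    s = (salutation or "").strip().rstrip(".").lower()
    if s in ("mr", "master"):
        return "male"
    if s in ("mrs", "miss", "ms"):
        return "female"
    return "unknown"


def clean_customers(df):
    """Apply the select / convert / derive-and-rename steps to a Spark DataFrame."""
    # Imported lazily so the plain-Python helpers above work without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType, StringType

    to_int = udf(age_to_int, IntegerType())
    to_gender = udf(gender_from_salutation, StringType())

    return (
        df.select("AGE", "GenderCode")  # a real selection would keep more columns
          .withColumn("AGE", to_int(col("AGE")))
          .withColumn("GenderCode", to_gender(col("GenderCode")))
          .withColumnRenamed("GenderCode", "GENDER")
    )
```

Keeping the conversion logic in plain Python functions and wrapping them with `udf` only inside `clean_customers` makes the rules easy to check without a Spark session.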

5. Create charts and maps with PixieDust

The data can now be explored with PixieDust:

Explore the data in a table with display().
Then click the chart button to create one of the charts in the list.

Drag and drop the variables you want to display into the Keys and Values fields. Select the aggregation from the drop-down menu and click OK.
From the menu on the right of the chart you can select which renderer to use; each renderer visualises the data in a different way. Other options include clustering by a variable, the size and orientation of the chart, and whether to display a legend.
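
As a rough mental model of the Keys, Values, and aggregation options (not PixieDust's implementation), a chart groups the rows by the key and reduces the values in each group with the chosen aggregation. The sketch below shows this in plain Python; the function name `aggregate`, the aggregation names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def aggregate(rows, key, value, how="AVG"):
    """Group rows by `key` and reduce the `value` field of each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])

    # Illustrative aggregation names; the chart dialog may offer others.
    funcs = {
        "SUM": sum,
        "AVG": lambda values: sum(values) / len(values),
        "MIN": min,
        "MAX": max,
        "COUNT": len,
    }
    return {k: funcs[how](v) for k, v in groups.items()}


# Made-up sample rows: key = GENDER, value = AGE, aggregation = AVG.
rows = [
    {"GENDER": "female", "AGE": 30},
    {"GENDER": "female", "AGE": 40},
    {"GENDER": "male", "AGE": 50},
]
average_age = aggregate(rows, "GENDER", "AGE", "AVG")
```

Each entry in the resulting dictionary corresponds to one bar in a bar chart: the key is the bar's label and the aggregated value is its height.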
Below are two examples of a bar chart and a map created in the notebook.

Histogram

Map

Learn more

Watson Studio: Master the art of data science with IBM's Watson Studio
Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ
