This repository has been archived by the owner on Jul 3, 2024. It is now read-only.

IBM/analyze-customer-data-spark-pixiedust


Analyze customer data using Jupyter notebooks, Apache Spark, and PixieDust

In this code pattern, historical shopping data is analyzed with Apache Spark and PixieDust. The data is loaded, cleaned, and then analyzed by creating various charts and maps.

When you have completed this code pattern, you will understand how to:

  • Load customer data in a Jupyter notebook
  • Transform the data with Apache Spark
  • Create charts and maps with PixieDust

The intended audience is anyone interested in quickly analyzing data in a Jupyter notebook.

Flow

(Architecture diagram)

  1. Log in to IBM Watson Studio
  2. Load the provided notebook into Watson Studio
  3. Load the customer data in the notebook
  4. Transform the data with Apache Spark
  5. Create charts and maps with PixieDust

About the data

  • x19_income_select.csv: Household income statistics for many categories of income, including wages, interest, social security, public assistance, and retirement. Compiled at the ZIP code level by the United States Census Bureau. Available as a data set on Watson Studio
  • customers_orders1_opt.csv: Fictitious customer demographics and sales data. Published by IBM. Available as a data set on Watson Studio

Included Components

  • IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
  • PixieDust: an open source Python helper library for Jupyter notebooks that simplifies working with data and creating visualizations

Steps

  1. Create a project
  2. Create a notebook
  3. Load customer data in the notebook
  4. Transform the data with Apache Spark
  5. Create charts and maps with PixieDust

1. Create a project

  • Log into IBM's Watson Studio. Once in, you'll land on the dashboard.

  • Create a new project by clicking + New project and choosing Data Science:


  • Enter a name for the project and click Create.

  • NOTE: By creating a project in Watson Studio, a free-tier Object Storage service and a Watson Machine Learning service will be created in your IBM Cloud account. Select the Free storage type to avoid fees.


  • Upon successful project creation, you are taken to a dashboard view of your project. Take note of the Assets and Settings tabs; we'll be using them to associate our project with external assets (data sets and notebooks) and IBM Cloud services.


2. Create a notebook

3. Load customer data in the notebook

  • Run the cells one at a time. Select the first cell and press the (►) Run button to start stepping through the notebook.

  • Load the data set customers_orders1_opt.csv into the notebook.
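Before working with the data in Spark, it can help to see the CSV's shape. Here is a minimal stdlib sketch; only the GenderCode and AGE columns are named in this pattern, so the other column names and values are made up for illustration:

```python
import csv
import io

# Hypothetical sample mimicking a few columns of customers_orders1_opt.csv;
# only GenderCode and AGE appear in this pattern, the rest are illustrative.
sample = io.StringIO(
    "CUST_ID,GenderCode,AGE,CITY\n"
    "1001,Mr.,34,Boston\n"
    "1002,Mrs.,29,Austin\n"
)

rows = list(csv.DictReader(sample))
print(rows[0]["GenderCode"], rows[0]["AGE"])  # Mr. 34
```

Note that `csv.DictReader` reads every field as a string, which is why the AGE column needs a numeric conversion in the next step.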

4. Transform the data with Apache Spark

Before it can be analyzed, the data needs to be cleaned and formatted. This can be done with a few PySpark commands:

  • Select only the columns you are interested in with df.select()

  • Convert the AGE column to a numeric data type, so you can run calculations on customer age, with a user-defined function (UDF).

  • Derive the gender information for each customer based on the salutation, and rename the GenderCode column to GENDER, with a second UDF.
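The two UDFs can be sketched as plain Python functions. In the notebook they would be registered with `pyspark.sql.functions.udf` and applied with `withColumn`; the exact salutation-to-gender mapping below is an assumption for illustration:

```python
# Plain-Python versions of the two transformations; in the notebook these
# would be wrapped with pyspark.sql.functions.udf and applied via withColumn.

def parse_age(value):
    """Convert the AGE column from string to int; None if not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def derive_gender(salutation):
    """Derive gender from a salutation such as 'Mr.' or 'Mrs.'.

    The mapping here is illustrative, not the pattern's exact rule.
    """
    s = (salutation or "").strip().rstrip(".").lower()
    if s in ("mr", "master"):
        return "Male"
    if s in ("mrs", "ms", "miss"):
        return "Female"
    return "Unknown"

print(parse_age("34"), derive_gender("Mrs."))  # 34 Female
```

Keeping the logic in ordinary functions like this also makes it easy to unit-test before wrapping it in a Spark UDF.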

5. Create charts and maps with PixieDust

The data can now be explored with PixieDust:

  • Explore the data in a table with display().

  • Then click the chart button to create one of the charts in the list.


  • Drag and drop the variables you want to display into the Keys and Values fields. Select the aggregation from the drop-down menu and click OK.

  • From the menu on the right of the chart, you can select which renderer to use; each one visualizes the data in a different way. Other options include clustering by a variable, the size and orientation of the chart, and whether to display a legend.

  • Below are two examples of a bar chart and a map created in the notebook.

(Bar chart created in the notebook)

(Map created in the notebook)
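The Keys / Values / aggregation selection above amounts to a group-by over the data. A minimal stdlib sketch of what an AVG aggregation of AGE keyed by GENDER computes (the rows here are made up for illustration):

```python
from collections import defaultdict

# Group rows by the key column (GENDER) and aggregate the value column (AGE),
# as PixieDust does when you pick Keys, Values, and an aggregation.
# The data below is illustrative, not from the actual data set.
rows = [
    {"GENDER": "Female", "AGE": 29},
    {"GENDER": "Male", "AGE": 34},
    {"GENDER": "Female", "AGE": 41},
]

groups = defaultdict(list)
for row in rows:
    groups[row["GENDER"]].append(row["AGE"])

# AVG aggregation per key, one bar per key in the resulting chart
averages = {k: sum(v) / len(v) for k, v in groups.items()}
print(averages)  # {'Female': 35.0, 'Male': 34.0}
```

Swapping `sum(v) / len(v)` for `sum(v)`, `len(v)`, `max(v)`, or `min(v)` mirrors the SUM, COUNT, MAX, and MIN choices in the aggregation drop-down.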

Related links

Learn more

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ