Analyse historical shopping data with Spark and PixieDust in a Jupyter notebook
Use Jupyter Notebooks with IBM Watson Studio to analyse historical shopping data with the open-source Python packages Apache Spark and PixieDust. Create bar charts, line charts, scatter plots, pie charts, histograms and maps without any coding.
Cognitive
This code pattern shows how to analyse historical shopping data with Jupyter Notebooks in IBM Watson Studio and the open-source Python packages Apache Spark and PixieDust. Users can quickly analyse data and produce charts and maps.
By Patrick Titzler and Margriet Groenendijk
In this code pattern, historical shopping data is analysed in a Jupyter notebook with the open-source Python packages Apache Spark and PixieDust.
When the reader has completed this code pattern, they will understand how to:
- Use Jupyter Notebooks in IBM Watson Studio
- Load data with PixieDust and clean data with Spark
- Create charts and maps with PixieDust
- Log in to IBM Watson Studio
- Load the provided notebook into Watson Studio
- Load the customer data in the notebook
- Transform the data with Apache Spark
- Create charts and maps with PixieDust
- IBM Watson Studio: a suite of tools and a collaborative environment for data scientists, developers and domain experts
- Apache Spark: an open-source cluster computing framework optimized for extremely fast and large-scale data processing
- Jupyter notebooks: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text
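- PixieDust: an open-source Python package for Jupyter notebooks that makes it easier to load, explore and visualise data; the pixiedust_node add-on provides support for JavaScript/Node.js code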
With PixieDust you can use the power of Python and Jupyter notebooks when you:
- have never coded before
- are an experienced data analyst or data scientist
- are a developer with little Python experience wanting to quickly explore some data
Many data scientists use Jupyter notebooks to wrangle and clean data, visualise data, build and test machine learning models and even write talks. The reason is that text, code, figures and tables can be combined in one document, which makes it easy to keep the code structured by adding comments and explanations of your thought processes and the decisions you made.
Many Python packages are available for visualising data. When you are just getting started, this can be overwhelming. Even when you are experienced, it still takes time to create charts, because the syntax of each package is slightly different and it is easy to spend a lot of time tweaking your code to create the perfect chart. I have to admit I tend to do this because it is so much fun, but it is definitely not always necessary.
With PixieDust you can explore data in a simpler way and also spend more time exploring the data instead of going down the rabbit hole of tweaking the code to change the colours, fonts, line styles, axes and anything else you can manually change.
The main command to create charts from Spark or pandas DataFrames is display(df). When you run this command in a notebook cell, the data is displayed in a table. You can then scroll through the data, filter it or create a chart from a menu, all by clicking a few buttons.
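As a minimal sketch of this command, the cell below builds a small pandas DataFrame and passes it to PixieDust. The column names and values are made up for illustration and are not taken from the shopping dataset. The helper show() is hypothetical too: it only exists so that the snippet still runs, printing the first rows as plain text, where PixieDust or a Jupyter kernel is not available.

```python
import pandas as pd

# Hypothetical sample rows; the column names and values are illustrative only.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "age": [34, 51, 27, 42],
        "total_spent": [120.50, 89.99, 240.00, 15.75],
    }
)


def show(frame):
    """Render the frame with PixieDust if possible, else print the first rows."""
    try:
        # PixieDust needs to be installed and running inside a Jupyter kernel.
        from pixiedust.display import display

        display(frame)
        return "pixiedust"
    except Exception:
        # Fallback for plain Python sessions or missing PixieDust.
        print(frame.head())
        return "text"


show(df)
```

In a notebook with PixieDust installed, the usual pattern is simply to import pixiedust and call display(df) directly; the table view and chart menu then appear below the cell.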
PixieDust uses other visualisation packages to create the charts, such as matplotlib, bokeh, seaborn and Brunel. You can see it as a clever wrapper around these libraries that will save you time while exploring data.
To explore PixieDust, you can go through this code pattern, in which historical shopping data is analysed with Spark and PixieDust. The data is loaded, cleaned and then analysed by creating various charts and maps. The Jupyter notebooks run in IBM Watson Studio. The code pattern guides you through all the steps to set up your IBM Cloud account, create the notebook and run it.
If you want to jump straight to the code, the GitHub repository contains the notebook, which you can run either in the cloud or locally.
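The load, clean and analyse sequence can be sketched in a few lines. The notebook itself uses Spark for the cleaning step; the sketch below swaps in pandas purely so that it runs without a Spark installation. The column names and values are hypothetical and do not come from the code pattern's dataset.

```python
import pandas as pd

# Load: hypothetical raw rows standing in for the shopping data.
raw = pd.DataFrame(
    {
        "CUST_ID": [1, 2, 3, 4],
        "AGE": [34, None, 27, 42],
        "GENDER": ["F", "M", "F", "M"],
        "TOTAL_SPENT": [120.5, 89.99, 240.0, 15.75],
    }
)

# Clean: keep only the columns of interest and drop rows with missing values.
clean = raw[["AGE", "GENDER", "TOTAL_SPENT"]].dropna()

# Analyse: average spend per gender, the kind of summary a bar chart would show.
summary = clean.groupby("GENDER")["TOTAL_SPENT"].mean()
print(summary)
```

A summary like this is what a bar chart of average spend per group would plot; with PixieDust the same aggregation can be chosen from the chart menu rather than written by hand.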
To learn more about PixieDust and Jupyter notebooks, here are a few resources to get you started:
- Use Node.js in Jupyter notebooks with pixiedust_node
- Notebooks for Excel users
- Wrangle data in Jupyter notebooks with PixieDust Rosie
- PixieDebugger - a visual Python debugger for Jupyter notebooks
- Contributing to the PixieDust project
- Watson Studio: Master the art of data science with IBM's Watson Studio
- Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
- With Watson: Want to take your Watson app to the next level? Looking to utilize Watson Brand assets? Join the With Watson program to leverage exclusive brand, marketing, and tech resources to amplify and accelerate your Watson embedded commercial solution.