STAT GU4243/GR5243 Fall 2016 Applied Data Science

Project 4 Association mining of music and text

- from the million song data project

In this project we will explore the association between music features and lyrics words from a subset of songs in the million song data. Association rule minging has a wide range of applications that include marketing research (on co-purchasing), natural language processing, finance, public health and etc. Here the word "rules" is really as general as any interesting and meaningful patterns. Based on the association patterns identified, we will create lyric words recommender algorithms for a piece of music (using its music features).

May the next Nobel prize in Literature be with you!

Challenge

For this project, you will receive a set of 2350 songs from the million song data project-[coursework login required]. This is a hacking challenge where the organizer is interested in exploring a collection of creative lyrics recommender methods that the participating data scientists can come with. The participants are encouraged to discuss online and exchange ideas.

The data set released contain:

Common_id.txt: ids for the songs that have both lyrics and sound analysis information. 2350 in total;
lyr.Rdata: dim: 2350*5001. bag-of-words for 2350 songs stored in an R dataframe;
data.zip: h5 format music feature files for the 2350 songs;
msm_dataset_train.txt original text of the lyrics data. (Potentially can be used for n-gram models).

Submission

On 11/16/2016, you will receive music features of 100 songs. We will provide you with a dictionary of 5000 words. For each song, you need to produce a ranked list (with the most likely being the first) of 100 suggested lyric words from the given dictionary.

Project organization

This is a very short project. So we will not form teams. A GitHub starter codes repo will be posted online for you to fork and start your own project.

You can consider using the following data mining tools for this project. This is only a suggestive list.

Representation learning
Topic modeling
Clustering
Itemset mining or Association rule mining (More resources)
Word prediction
Recommendation based on similarities

On 11/9/2016, we will give tutorials and give comments on:

Topic modeling
Clustering algorithms
An example analysis of the million song dataset from Spring 2016. [Jingying Zhou]

Repositary requirement

The final repo should be under our class github organization (TZStatsADS) and be organized according to the structure of the starter codes.

proj/
├──doc/
├──figs/
├──lib/
├──output/
├── README

The data is too big to be hosted on GitHub.
The doc folder should have documentations for this project, presentation files and other supporting materials.
The figs folder contains figure files produced during the project and running of the codes.
The lib folder contain computation codes for your data analysis. Make sure your README.md is informative about what are the programs found in this folder.
The output folder is the holding place for intermediate and final computational results.

The root README.md should contain your name and an abstract of your findings.

Suggested workflow

This is a relatively short project. We only have about two weeks of working time.

[wk1] Week 1 is the data processing and mining week. Read data description, project requirement, browse data and the starter codes, and think about what to do and try out different tools you find related to this task.
[wk1] Try out ideas on a subset of the data set to get a sense of computational burden of this project.
[wk2] Based on outcomes from week 1 think carry out predictive modeling for the recommender algorithm.
[wk2] The data analysis is likely to take a lot of time. Start early.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project4_desc.md

Project4_desc.md

STAT GU4243/GR5243 Fall 2016 Applied Data Science

Project 4 Association mining of music and text

- from the million song data project

May the next Nobel prize in Literature be with you!

Challenge

Submission

Project organization

Repositary requirement

Suggested workflow

Files

Project4_desc.md

Latest commit

History

Project4_desc.md

File metadata and controls

STAT GU4243/GR5243 Fall 2016 Applied Data Science

Project 4 Association mining of music and text

- from the million song data project

May the next Nobel prize in Literature be with you!

Challenge

Submission

Project organization

Repositary requirement

Suggested workflow