big-data-intro

Using Hadoop and Spark for some basic exercises

Task 1:

Preprocess the data. Process the provided user query log (search_data.sample) with Hadoop, stripping each clickUrl down to the part before the first '/' (i.e. keeping only the host).

Example input: zhidao.baidu.com/question/48881311
Example output: zhidao.baidu.com

Output from the MapReduce operation:

[screenshot: MapReduce output]
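A minimal Hadoop Streaming mapper sketch for this step is shown below. The exact column layout of search_data.sample is not documented here, so the assumption that the file is tab-separated with the clickUrl as the last field is illustrative, not taken from the repository.

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper sketch for Task 1.
# ASSUMPTION (not confirmed by the repo): each line of search_data.sample is
# tab-separated and the clickUrl is the last field.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    click_url = fields[-1]
    # Keep only the part before the first '/', e.g.
    # "zhidao.baidu.com/question/48881311" -> "zhidao.baidu.com"
    fields[-1] = click_url.split("/", 1)[0]
    print("\t".join(fields))
```

Such a mapper could be run as a map-only Streaming job, roughly along the lines of `hadoop jar hadoop-streaming.jar -files strip_url.py -mapper strip_url.py -numReduceTasks 0 -input search_data.sample -output task1_out` (the script name and paths here are placeholders).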

Task 2:

Rank the tokens that appear most often in the queried URLs. Tokenize the clickUrls in the query log, then rank the tokens by how many times they appear. The output should be the top ten tokens and their counts.

Output:

[screenshot: top-ten token counts]
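Since the repository also mentions Spark, a PySpark sketch of the token ranking is shown below. The tokenization rule (splitting each clickUrl on any non-alphanumeric character) and the tab-separated layout with clickUrl as the last field are assumptions for illustration, not taken from the repository.

```python
# PySpark sketch for Task 2: top ten tokens in the clicked URLs.
# ASSUMPTIONS: tab-separated input with clickUrl as the last field;
# tokens are the pieces of the URL split on non-alphanumeric characters.
import re
from pyspark import SparkContext

sc = SparkContext(appName="TopUrlTokens")

top_tokens = (
    sc.textFile("search_data.sample")
      .map(lambda line: line.rstrip("\n").split("\t")[-1])   # clickUrl field
      .flatMap(lambda url: re.split(r"[^0-9A-Za-z]+", url))  # tokenize
      .filter(lambda tok: tok)                               # drop empty strings
      .map(lambda tok: (tok, 1))
      .reduceByKey(lambda a, b: a + b)                       # count per token
      .takeOrdered(10, key=lambda kv: -kv[1])                # top ten by count
)

for token, count in top_tokens:
    print(token, count)
```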

Task 3:

Rank the time periods (by minute) with the most queries. Count the number of queries in each minute, then rank the minutes in descending order. The output should be the top ten minutes with the most queries and the number of queries in each.

Output:

[screenshot: top-ten minutes by query count]
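A PySpark sketch of the per-minute ranking is shown below. It assumes the first tab-separated field is a timestamp such as 20111230000005 (YYYYMMDDHHMMSS), so taking the first 12 characters groups queries by minute; both the field position and the timestamp format are assumptions about the log layout, not facts from the repository.

```python
# PySpark sketch for Task 3: top ten busiest minutes.
# ASSUMPTIONS: the first tab-separated field is a YYYYMMDDHHMMSS timestamp,
# so its first 12 characters identify the minute.
from pyspark import SparkContext

sc = SparkContext(appName="BusiestMinutes")

top_minutes = (
    sc.textFile("search_data.sample")
      .map(lambda line: line.split("\t")[0][:12])   # truncate timestamp to the minute
      .map(lambda minute: (minute, 1))
      .reduceByKey(lambda a, b: a + b)              # queries per minute
      .takeOrdered(10, key=lambda kv: -kv[1])       # top ten busiest minutes
)

for minute, count in top_minutes:
    print(minute, count)
```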
