Big-Data-Top-K-Words

A project comparing various techniques for finding the top K words in a very large file, i.e. different approaches to processing Big Data.

Introduction

In recent years, data has become abundant due to the rapid growth of the internet, and it is increasingly challenging to read, store and process large datasets. Input size is one of the key factors determining how efficiently a program runs: the larger the input, the worse the performance. Reading, storing and processing huge amounts of data is the central problem of Big Data. Traditional database and data processing systems exist, but datasets have grown so large that managing them efficiently with these systems is difficult. Beyond input size, several other factors affect how a program executes: the data structures used, the memory available and the algorithm chosen. This project analyses how these factors affect performance on three datasets of different sizes, i.e. three input sizes.

Aim

In this project, the main objective was to find the top K words in an input file: count how many times each word occurs in the text file, then print the K most frequent words, where K is some integer. Three text files were used as input, each of a different size: 400MB, 8GB and 32GB. First, plain Python techniques are used; MapReduce and Hive are then used to showcase the improvement in performance on the same task.

Methods & Results

Python

Case 1: Read the entire file into memory and count the top K words with a loop.
Case 2: Read the entire file into memory and count the top K words with Python's Counter.
Case 3: Read the file line by line and count the top K words with a loop.
Case 4: Read the file line by line and count the top K words with Python's Counter (sketched below).
Case 5: Read the file in chunks and process the chunks in parallel to find the top K words.
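
As a concrete reference point, here is a minimal sketch of Case 4: streaming the file line by line and tallying words with collections.Counter. The file name, the value of K and the whitespace tokenisation are illustrative assumptions, not the project's exact code.

```python
# Minimal sketch of Case 4: stream the file line by line and tally words
# with collections.Counter, so memory is bounded by the vocabulary size
# rather than the file size.
from collections import Counter

def top_k_words(path, k):
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            # Naive lowercase + whitespace split; the real project may
            # normalise punctuation differently.
            counts.update(line.lower().split())
    # most_common(k) selects the k largest counts via a heap.
    return counts.most_common(k)

if __name__ == "__main__":
    for word, freq in top_k_words("input.txt", 10):
        print(word, freq)
```

Case 2 differs only in reading the whole file at once (f.read().split()), and Case 5 would shard the file and merge per-chunk Counters, e.g. with multiprocessing.Pool.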

(Screenshot: Python results for Cases 1–5.)

MapReduce

Case 1: 1 Reducer
Case 2: Many Reducers (96)

Subcases:
Case A: Mapper & Reducer
Case B: Mapper, Reducer & Combiner (a Hadoop Streaming sketch of this case follows the list)
Case C: Mapper, Reducer & Combiner with Partitioner
Case D: Mapper, Reducer & Combiner with the text file compressed
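
For reference, below is a hedged sketch of what a Hadoop Streaming mapper/reducer pair for this job might look like in Python (Case B; the reducer can be reused as the combiner). The script names and tokenisation are assumptions, not the project's exact code.

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.lower().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum counts per word. Hadoop delivers keys sorted, so equal
# words arrive contiguously; the same script can serve as the combiner.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        total += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

A job of this shape is typically submitted with the streaming jar, along the lines of hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=96 -files mapper.py,reducer.py -mapper mapper.py -combiner reducer.py -reducer reducer.py -input ... -output ... (illustrative flags); the same mapreduce.job.reduces property is what Case 3 below varies. The top K would then be selected from the job's output, e.g. with a second job or a local sort.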

(Screenshot: MapReduce results for Cases 1–2, Subcases A–D.)

Case 3: Varying the number of reducers

(Screenshot: results for varying numbers of reducers.)

Hive
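
Hive expresses the whole job as a single SQL query over the text loaded into a one-column table. A minimal sketch, assuming a table lines(line STRING) has already been created and loaded, and running the query from Python through the hive CLI:

```python
# Sketch of the Hive approach: split each line into words, group, count,
# and keep the K most frequent. Table/column names and K are assumptions.
import subprocess

K = 10
query = f"""
SELECT word, COUNT(*) AS freq
FROM (SELECT explode(split(lower(line), '\\\\s+')) AS word FROM lines) words
GROUP BY word
ORDER BY freq DESC
LIMIT {K};
"""

# Assumes a working Hive installation; `hive -e` runs an inline query.
subprocess.run(["hive", "-e", query], check=True)
```

Hive compiles such a query down to MapReduce jobs itself, which is what makes the comparison with the hand-written jobs below meaningful.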

(Screenshots: Hive results.)

MapReduce vs. Hive

(Screenshot: MapReduce vs. Hive comparison.)

Conclusion

(Screenshot: conclusion summary.)
