The objective of this homework was to process the output of a log generator using the Hadoop MapReduce framework, which allows the log data to be processed in parallel.
The project was developed using the following environment:
- Windows OS
- IDE: IntelliJ IDEA 2021
- VMWare Workstation 16 Pro
- Hortonworks 3.0.1 Sandbox
- Java 1.8 must be installed on your system
- Set up the HDP Sandbox
- SBT must be installed on your system
Click on this link to see how to deploy your MapReduce job on AWS EMR.
We start by generating a log dataset: a log file consisting of 50,000 log messages. This file is the input for all four jobs. Each job, described below, implements one of the four required functionalities.
-
Job 1:
Mapper Class: Mapper_Job1
Reducer Class: Reducer_Job1
Goal: Show the distribution of the different message types (ERROR, DEBUG, INFO, WARN) across predefined time intervals, for messages that contain string instances of the designated regex pattern.
-
Job 2:
Mapper Class: Mapper_Job2
Reducer Class: Reducer_Job2
Goal: Display the time intervals that contain messages of type ERROR with string instances of the designated regex pattern, sorted in descending order.
-
Job 3:
Mapper Class: Mapper_Job3
Reducer Class: Reducer_Job3
Goal: Compute the number of messages produced for each message type, e.g. (ERROR, 16), (INFO, 22).
-
Job 4:
Mapper Class: Mapper_Job4
Reducer Class: Reducer_Job4
Goal: For each message type, compute the total number of characters in its string instances that match the designated regex pattern.
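The four goals above share one shape: the mapper turns each log line into a key/value pair, and the reducer sums the values per key. The following is a minimal plain-Java sketch of that logic, not the project's actual `Mapper_JobN`/`Reducer_JobN` classes, and it deliberately uses no Hadoop APIs. The log line format, the one-minute interval, the `[a-c][e-g][0-3]` pattern, the sample lines, and all names in the block (`LogJobsSketch`, `map`, `reduce`, `byCountDescending`, `run`) are assumptions made for illustration. Ordering Job 2 by count is also an assumption, inferred from the sample output rather than stated in the goal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java sketch of the four jobs' map and reduce logic (no Hadoop classes).
class LogJobsSketch {
    // ASSUMPTION: lines look like "HH:mm:ss.SSS [thread] LEVEL logger - message".
    static final Pattern LINE = Pattern.compile(
        "^(\\d{2}:\\d{2}):\\d{2}\\.\\d{3}\\s+\\[[^\\]]*\\]\\s+"
        + "(ERROR|DEBUG|INFO|WARN)\\s+\\S+\\s+-\\s+(.*)$");

    // ASSUMPTION: hypothetical stand-in for the designated regex pattern.
    static final Pattern DESIGNATED = Pattern.compile("[a-c][e-g][0-3]");

    // Hypothetical sample input; the last entry is malformed on purpose.
    static final List<String> SAMPLE = Arrays.asList(
        "10:15:01.123 [main] ERROR Gen - xx af1 yy",
        "10:15:02.000 [main] ERROR Gen - bg2 and ce0",
        "10:16:00.500 [main] INFO Gen - nothing here",
        "10:16:03.250 [main] WARN Gen - token af3",
        "10:17:00.000 [main] ERROR Gen - af0",
        "not a log line");

    // Map step: emit at most one "key<TAB>value" pair for a line and a job number.
    static List<String> map(int job, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return out; // skip lines that do not fit the assumed format
        }
        String interval = m.group(1); // ASSUMPTION: one interval per minute
        String type = m.group(2);
        if (job == 3) { // ASSUMPTION: Job 3 counts every message, match or not
            out.add(type + "\t1");
            return out;
        }
        Matcher d = DESIGNATED.matcher(m.group(3));
        int matchedChars = 0;
        while (d.find()) {
            matchedChars += d.group().length();
        }
        if (matchedChars == 0) {
            return out; // Jobs 1, 2 and 4 only consider messages with a match
        }
        if (job == 1) {
            out.add(interval + "=" + type + "\t1");
        } else if (job == 2 && type.equals("ERROR")) {
            out.add(interval + "\t1");
        } else if (job == 4) {
            out.add(type + "\t" + matchedChars);
        }
        return out;
    }

    // Reduce step shared by all four jobs: sum the values of each key.
    static Map<String, Integer> reduce(List<String> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String pair : pairs) {
            String[] kv = pair.split("\t");
            totals.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return totals;
    }

    // Job 2 post-processing: "count,interval" lines, highest count first.
    static List<String> byCountDescending(Map<String, Integer> totals) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            out.add(e.getValue() + "," + e.getKey());
        }
        return out;
    }

    // Runs the map step over all lines, then the reduce step, in one process.
    static Map<String, Integer> run(int job, List<String> lines) {
        List<String> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(job, line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        for (int job = 1; job <= 4; job++) {
            System.out.println("Job " + job + ": " + run(job, SAMPLE));
        }
        System.out.println("Job 2 sorted: " + byCountDescending(run(2, SAMPLE)));
    }
}
```

In a real Hadoop job the grouping of pairs by key happens in the shuffle phase between mapper and reducer; here `reduce` does the grouping itself with a `TreeMap` so the sketch can run on a single machine.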
-
Clone this repo onto your system
-
Open your OS command line and navigate to the project directory
-
Build using (in the IntelliJ terminal or cmd on Windows):
sbt clean compile assembly
-
Using VS Code, open the folder containing the generated jar file and click Go Live.
-
Start VMWare Workstation Pro
-
Run using:
hadoop jar jarname.jar inp_dir out_dir
This is the sample output I received for each of the jobs:
14610=DEBUG,1
14611=INFO,1
14611=WARN,1
14612=DEBUG,1
14614=INFO,1
14614=WARN,1
14615=ERROR,1
14617=ERROR,1
14617=INFO,1
14617=WARN,1
7,14618
7,14669
7,14886
7,14881
7,14848
7,14761
7,14641
4,14740
4,14671
4,14692
DEBUG=10,8737
ERROR=10,843
INFO,10
WARN,10