This repository contains my solution for Task 1 of the Introduction to Machine Learning course at Innopolis University. The key techniques are data preprocessing and training an ANN on a highly imbalanced dataset.

PodYapolskiy/where-is-waldo

Task 1: Where's Waldo? (50%)

Fingerprinting

Browser fingerprinting is a technique used to identify and track individuals based on characteristics of their web browser configuration. These characteristics can include the browser type, version, installed plugins, and screen resolution, among others. By combining these attributes, websites can build a digital fingerprint that tracks user behavior across multiple sites, even if users clear their cookies. This has raised privacy concerns, since the technique can be used for targeted advertising, surveillance, and other purposes.

Read more about Fingerprinting

What You Need to Do

In this task, you are required to employ a fully connected feed-forward Artificial Neural Network (ANN) to tackle a classification problem. This involves several key steps, each critical to the development and performance of your model:

  • Exploratory Data Analysis (EDA) (10%): Begin by conducting a thorough exploratory analysis of the provided dataset. Your goal here is to uncover patterns, anomalies, relationships, or trends that could influence your modeling decisions. Share the insights you gather from this process and explain how they informed your subsequent steps.

  • Data Preprocessing and Feature Engineering (10%): Based on your EDA insights, choose and implement the most appropriate data preprocessing steps and feature engineering techniques. This may include handling missing values, encoding categorical variables, normalizing data, and creating new features that could enhance your model's ability to learn from the data.

  • Model Design and Training (10%): Design a fully connected feed-forward ANN model. You will need to experiment with different architectures, layer configurations, and hyperparameters to find the most effective solution for the classification problem at hand.

  • Feature Importance Analysis (10%): After developing your model, analyze which features are most important for making predictions. Discuss how this analysis aligns with your initial EDA insights and what it reveals about the characteristics most indicative of specific user behaviors or identities.

  • Evaluation (10%): You must submit your model's predictions on a hidden dataset.
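
Because Waldo's sessions are a small minority of the data, a plain loss tends to favor the majority class. One common remedy, shown here only as a minimal sketch and not necessarily what this repository does, is to weight each class inversely to its frequency and apply that weight inside the binary cross-entropy. The function names and the toy labels below are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def weighted_bce(y_true, y_prob, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(y_true)

# Toy labels: 1 marks a Waldo session, 0 marks everyone else (made-up data).
labels = [1] + [0] * 9
weights = balanced_class_weights(labels)
```

With these toy labels the rare class receives a weight nine times larger than the common one, so a missed Waldo session costs correspondingly more during training.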

Data

You will be using the data in Task_1.json to identify Waldo (user_id=0). The dataset includes:

  • "browser", "os" and "locale": Information about the software used.
  • "user_id": A unique identifier for each user.
  • "location": Geolocation based on the IP address used.
  • "sites": A list of visited URLs and the time spent there in seconds.
  • "time" and "date": When the session started in GMT.
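
As a rough illustration of how one session record might be flattened into model-ready features, the sketch below uses only the standard library. The helper name `session_features`, the sample values, and the assumption that each entry in "sites" is an object with "site" and "length" keys are all hypothetical; the task text only says the list holds URLs and the seconds spent on them.

```python
from datetime import datetime

def session_features(record):
    """Flatten one session record into a flat dict of features.

    Assumption: each item in record["sites"] looks like
    {"site": <url>, "length": <seconds>}; adjust the keys to the real data.
    """
    started = datetime.strptime(
        f'{record["date"]} {record["time"]}', "%Y-%m-%d %H:%M:%S"
    )
    # "location" is assumed to follow the "Country/City" pattern.
    country, _, city = record["location"].partition("/")
    lengths = [visit["length"] for visit in record["sites"]]
    return {
        "browser": record["browser"],
        "os": record["os"],
        "locale": record["locale"],
        "country": country,
        "city": city,
        "hour": started.hour,
        "weekday": started.weekday(),  # Monday == 0
        "n_sites": len(lengths),
        "total_seconds": sum(lengths),
    }

# Hypothetical record; the site entries are invented for illustration.
sample = {
    "browser": "Chrome", "os": "Debian", "locale": "ur-PK",
    "location": "Russia/Moscow",
    "sites": [{"site": "example.com", "length": 120},
              {"site": "example.org", "length": 30}],
    "time": "04:12:00", "date": "2017-06-29",
}
features = session_features(sample)
```

The categorical fields (browser, os, locale, country, city) would still need an encoding step such as one-hot encoding before they can be fed to an ANN.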

Evaluation

After training, evaluate your model by printing the classification report for your test set. Then predict whether each user in task_1_verify.json is Waldo, and record the result by adding a boolean is_waldo property to each entry in task_1_verify.json:

  [
    {
+     "is_waldo": false,
      "browser": "Chrome",
      "os": "Debian",
      "locale": "ur-PK",
      "location": "Russia/Moscow",
      "sites": [
          // ...
      ],
      "time": "04:12:00",
      "date":"2017-06-29"
    }
    // ...
  ]
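
The two evaluation steps can be sketched in plain Python, under stated assumptions. `positive_class_report` computes the Waldo-class precision, recall, and F1 that a classification report would show, and `add_is_waldo` attaches predictions to the verification records. Both function names and the sample inputs are hypothetical, and the predictions are assumed to come from an already trained model.

```python
def positive_class_report(y_true, y_pred):
    """Precision, recall and F1 for the positive (Waldo) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

def add_is_waldo(records, predictions):
    """Return copies of the records with a boolean is_waldo key placed first."""
    if len(records) != len(predictions):
        raise ValueError("need exactly one prediction per record")
    return [{"is_waldo": bool(pred), **rec}
            for rec, pred in zip(records, predictions)]

# Made-up labels and predictions, for illustration only.
report = positive_class_report([1, 0, 0, 1], [1, 1, 0, 0])
annotated = add_is_waldo([{"browser": "Chrome"}], [False])
```

On an imbalanced dataset these per-class numbers are more informative than overall accuracy, which stays high even if the model never predicts Waldo. The annotated list could then be written back to disk with `json.dump`.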

Learning Objectives

  • Exploratory Data Analysis: Apply suitable analysis techniques to gain insights and better understand the dataset.
  • Classification Approach: Identify the most appropriate method for the given problem.
  • Data Preprocessing: Select and execute proper preprocessing and encoding techniques.
  • Model Implementation: Utilize ANNs to address a classification problem, including training, validation, and testing phases.
  • Feature Importance Analysis: Determine and report which features are most critical for the model's predictions to uncover insights into specific user behaviors.
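
The task does not prescribe a feature importance method. Permutation importance is one common model-agnostic option: shuffle one feature column at a time and measure how much the score drops. The sketch below is a minimal version under that assumption; `permutation_importance`, `toy_predict`, and the data are hypothetical stand-ins, not the model from this repository.

```python
import random

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(1 for row, y in zip(data, labels) if predict(row) == y)
        return hits / len(labels)

    baseline = accuracy(rows)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild every row with only column j replaced by its shuffled value.
        shuffled = [row[:j] + [value] + row[j + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(shuffled))
    return drops

# Stand-in "model" that only looks at feature 0 (hypothetical, not a real ANN).
def toy_predict(row):
    return int(row[0] > 0.5)

rows = [[0.9, 3.0], [0.1, 7.0], [0.8, 1.0], [0.2, 5.0]]
labels = [1, 0, 1, 0]
drops = permutation_importance(toy_predict, rows, labels, n_features=2)
```

Shuffling the ignored feature leaves accuracy unchanged, so its drop is zero. For a heavily imbalanced dataset, swapping accuracy for the Waldo-class F1 score would likely give a more meaningful ranking.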
