Task 1: Where's Waldo? (50%)
Browser fingerprinting is a technique for identifying and tracking individuals based on the characteristics of their web browser configuration, such as the browser type and version, installed plugins, and screen resolution. Combined, these attributes form a digital fingerprint that websites can use to track a user's behavior across multiple sites, even after the user clears their cookies. This has raised privacy concerns, because the same technique can serve targeted advertising, surveillance, and other purposes.
In this task, you will use a fully connected feed-forward Artificial Neural Network (ANN) to solve a classification problem. The work is split into the following steps, each of which affects the performance of your model:
- Exploratory Data Analysis (EDA) (10%): Begin by conducting a thorough exploratory analysis of the provided dataset. Your goal here is to uncover patterns, anomalies, relationships, or trends that could influence your modeling decisions. Share the insights you gather from this process and explain how they informed your subsequent steps.
- Data Preprocessing and Feature Engineering (10%): Based on your EDA insights, choose and implement the most appropriate data preprocessing steps and feature engineering techniques. This may include handling missing values, encoding categorical variables, normalizing data, and creating new features that could enhance your model's ability to learn from the data.
- Model Design and Training (10%): Design a fully connected feed-forward ANN model. You will need to experiment with different architectures, layer configurations, and hyperparameters to find the most effective solution for the classification problem at hand.
- Feature Importance Analysis (10%): After developing your model, analyze which features are most important for making predictions. Discuss how this analysis aligns with your initial EDA insights and what it reveals about the characteristics most indicative of specific user behaviors or identities.
- Evaluation (10%): You will be required to submit your model's predictions on a hidden dataset.
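To make the "Model Design and Training" step concrete, here is a minimal, dependency-free sketch of a fully connected feed-forward network with one hidden layer, trained with stochastic gradient descent on binary cross-entropy. Everything in it is an assumption made for illustration: the class name `TinyMLP`, the layer sizes, the learning rate, and the toy two-feature dataset, which stands in for the encoded session features. In practice you would most likely build the model with a deep learning framework and monitor a separate validation set.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """Hypothetical example: one tanh hidden layer and one sigmoid output unit."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        # Every hidden unit sees every input (fully connected).
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, sigmoid(logit)

    def train_step(self, x, y, lr=0.1):
        hidden, p = self.forward(x)
        # For a sigmoid output with binary cross-entropy, dLoss/dlogit = p - y.
        d_logit = p - y
        for j, h in enumerate(hidden):
            # Backpropagate through tanh, whose derivative is 1 - h^2.
            d_pre = d_logit * self.w2[j] * (1.0 - h * h)
            self.w2[j] -= lr * d_logit * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_pre * xi
            self.b1[j] -= lr * d_pre
        self.b2 -= lr * d_logit
        # Loss measured before the update; clamped to avoid log(0).
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        return -math.log(p if y == 1.0 else 1.0 - p)


# Toy data (an assumption, not the assignment data): label is 1 when x0 + x1 > 1.
rng = random.Random(1)
points = [[rng.random(), rng.random()] for _ in range(200)]
data = [(x, 1.0 if x[0] + x[1] > 1.0 else 0.0) for x in points]

model = TinyMLP(n_in=2, n_hidden=8)
losses = []
for epoch in range(100):
    losses.append(sum(model.train_step(x, y) for x, y in data) / len(data))

correct = sum((model.forward(x)[1] > 0.5) == (y == 1.0) for x, y in data)
accuracy = correct / len(data)
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

The number of hidden units, the activation function, and the learning rate are exactly the kind of choices you are expected to experiment with and justify.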
You will be using the data in Task_1.json to identify Waldo (user_id=0). The dataset includes:
- "browser", "os" and "locale": Information about the software used.
- "user_id": A unique identifier for each user.
- "location": Geolocation based on the IP address used.
- "sites": A list of visited URLs and the time spent there in seconds.
- "time" and "date": When the session started in GMT.
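As a starting point for EDA and preprocessing, the nested records can be flattened into one row of features per session. The sketch below is only one possible approach. The two sample records are made up, and the keys inside each "sites" entry (`site`, `length`) are an assumption about the file layout that you should check against the real data. The helper name `flatten` and the derived features are hypothetical as well.

```python
import json
import math
from collections import Counter
from datetime import datetime

# Made-up sample in the shape described above. The keys inside each
# "sites" entry ("site", "length") are an assumption; check the real file.
SAMPLE = json.loads("""
[
  {"user_id": 0, "browser": "Chrome", "os": "Debian", "locale": "ur-PK",
   "location": "Russia/Moscow",
   "sites": [{"site": "example.com", "length": 120},
             {"site": "example.org", "length": 30}],
   "time": "04:12:00", "date": "2017-06-29"},
  {"user_id": 7, "browser": "Firefox", "os": "Windows 10", "locale": "en-US",
   "location": "USA/Chicago",
   "sites": [{"site": "example.com", "length": 45}],
   "time": "18:30:00", "date": "2017-07-01"}
]
""")


def flatten(session):
    """Turn one nested session record into a flat dict of features."""
    start = datetime.strptime(
        f"{session['date']} {session['time']}", "%Y-%m-%d %H:%M:%S"
    )
    # Encode the hour of day on a circle so 23:00 and 00:00 end up close together.
    angle = 2 * math.pi * (start.hour + start.minute / 60) / 24
    durations = [entry["length"] for entry in session["sites"]]
    country, _, city = session["location"].partition("/")
    row = {
        "browser": session["browser"],
        "os": session["os"],
        "locale": session["locale"],
        "country": country,
        "city": city,
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        "weekday": start.weekday(),  # Monday is 0
        "n_sites": len(durations),
        "total_seconds": sum(durations),
    }
    # Only labelled records carry user_id, so the target is optional.
    if "user_id" in session:
        row["is_waldo"] = session["user_id"] == 0
    return row


rows = [flatten(s) for s in SAMPLE]
# Class balance is worth checking first: it decides which metrics are meaningful.
print(Counter(r["is_waldo"] for r in rows))
```

The categorical fields (browser, os, locale, country, city) still need an encoding such as one-hot vectors before they can be fed to a network, and the numeric fields usually benefit from scaling.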
After training, evaluate your model by printing the classification report on your test set. Then, predict whether each user in task_1_verify.json is Waldo or not, by adding the boolean is_waldo property to each entry in task_1_verify.json:
[
  {
+   "is_waldo": false,
    "browser": "Chrome",
    "os": "Debian",
    "locale": "ur-PK",
    "location": "Russia/Moscow",
    "sites": [
      // ...
    ],
    "time": "04:12:00",
    "date": "2017-06-29"
  }
  // ...
]
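One way to add the is_waldo property is sketched below. The function names `annotate` and `annotate_file` are hypothetical, and `placeholder_predict` is a stand-in that you would replace with a call to your trained model, including whatever preprocessing it expects.

```python
import json


def annotate(sessions, predict):
    """Return copies of the sessions with a boolean is_waldo field added first."""
    annotated = []
    for session in sessions:
        # bool() converts framework-specific scalars (for example numpy.bool_),
        # which the json module cannot serialize, into a plain Python bool.
        annotated.append({"is_waldo": bool(predict(session)), **session})
    return annotated


def annotate_file(path, predict):
    """Read a JSON list of sessions from path and write it back with predictions."""
    with open(path, encoding="utf-8") as f:
        sessions = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(annotate(sessions, predict), f, indent=2)


def placeholder_predict(session):
    """Stand-in for a trained model; always answers 'not Waldo'."""
    return False


# In-memory demonstration on a made-up record.
demo = annotate([{"browser": "Chrome", "os": "Debian"}], placeholder_predict)
print(json.dumps(demo, indent=2))
```

With a real model in place, the call would look like `annotate_file("task_1_verify.json", my_predict)`, where `my_predict` is your own function.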
- Exploratory Data Analysis: Apply suitable analysis techniques to gain insights and better understand the dataset.
- Classification Approach: Identify the most appropriate method for the given problem.
- Data Preprocessing: Select and execute proper preprocessing and encoding techniques.
- Model Implementation: Utilize ANNs to address a classification problem, including training, validation, and testing phases.
- Feature Importance Analysis: Determine and report which features are most critical for the model's predictions to uncover insights into specific user behaviors.
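For the feature importance analysis, one model-agnostic option is permutation importance: shuffle one feature at a time and measure how much the accuracy drops. The sketch below assumes rows stored as dictionaries and uses a toy rule (`toy_predict`) in place of a trained network. The function names and the toy data are hypothetical, and other techniques may suit your model better.

```python
import random


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when a single feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            column = [r[feature] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this feature's values permuted.
            shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances


def toy_predict(row):
    """Toy stand-in for a trained model: it looks only at the locale."""
    return row["locale"] == "ur-PK"


# Made-up evaluation data; in the assignment this would be a held-out set.
rows = [
    {"locale": "ur-PK", "browser": "Chrome"},
    {"locale": "en-US", "browser": "Chrome"},
    {"locale": "ur-PK", "browser": "Firefox"},
    {"locale": "en-US", "browser": "Firefox"},
] * 5
labels = [toy_predict(r) for r in rows]

scores = permutation_importance(toy_predict, rows, labels)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

A feature the model ignores scores zero, while a feature it relies on shows a clear drop. Computing the scores on held-out data rather than on the training set gives a more honest picture.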