-
Notifications
You must be signed in to change notification settings - Fork 0
/
repo_logs
45 lines (33 loc) · 2.94 KB
/
repo_logs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
Original File: POC_1_URL_Classification.ipynb
Lacked: 1. User-defined keywords to influence classification.
2. Website content language barrier, i.e, for languages other than English.
Modified File (1): POC_1_URL_Classification_mod.ipynb
Added: 1. User-defined productive and non-productive keywords to influence classification.
2. Langdetect Googletrans Translator to override website content language barrier, i.e, for languages other than English.
Lacked: 1. Simplicity due to being a Notebook file.
2. User input interface.
Modified File (2): POC_1_URL_Classification_menu-driven.ipynb
Added: 1. Self-created functions, which allowed me to break down my code into smaller, manageable pieces, promoting code organization, reusability, and modularity.
2. Menu-driven for simplicity and user-friendly code.
Lacked: 1. Sufficient data in dataset.
2. User input interface.
Modified File (3): POC_1_URL_Classification_menu-driven_Add_Data.ipynb
Added: 1. Self-created functions to push the resulting output into the original dataset to counter lack of data problem.
2. Choice to push data into dataset, so as to prevent wrong results into the dataset, which will hamper further classifications.
3. Newly updated dataset is then used to train the model for better results.
Lacked: 1. Discards the wrong resulting output (user judges the response/result to be right or wrong :: Initially).
2. Small dataset problem prevails if wrong results are discarded, so thinking of inserting the opposite classification in case of wrong output.
Modified File (4): POC_1_URL_Classification_menu-driven_AddCorrectData.ipynb -> POC_1_URL_Classification_Close2Final.ipynb
Added: 1. Choice to push correct data into dataset (judged by user), so as to prevent wrong results into the dataset, which will hamper further classifications.
Lacked: 1. Time consuming to add corrected data into dataset to enlargen it.
On Further Testing:
1. Some websites return Error 400 or Error 403 on Requests, even after using User_agents.
2. Presence of such url with a label in dataset results in huge classification errors.
3. Small dataset size results in the rise of importance of each and every url present.
4. Further Testing required using varieties of url of the same domain to check whether domain affects classification or not.
Modified File (4): POC_1_URL_Classification_Final.ipynb
Added: 1. Increasing sample url in dataset removes maximum error.
2. Included domain checker with sample domain inputs from user.
3. Same procedure as keywords_management, domain_management adds or deletes sample productive or non-productive domains.
4. Before classification, url's domain is extracted. If present in either of the provided sample domain categories, it is classified as such.
5. The classifier model is then skipped, else normal procedure.