This example shows how to classify URLs as phishy or normal using the Phishing Website Dataset. Because each URL is assigned to one of two groups, phishy or normal, this is a binary classification problem.
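As a minimal sketch of what the binary decision looks like, assume a trained model outputs a phishing probability between 0 and 1. The helper name `classifyUrl`, the 0.5 threshold, and the label strings are assumptions made for illustration and are not taken from the example's source code.

```javascript
// Hypothetical helper: maps a model's phishing probability to one of two labels.
// The default threshold of 0.5 and the label strings are illustrative assumptions.
function classifyUrl(phishingProbability, threshold = 0.5) {
  // Reject NaN and anything outside [0, 1].
  if (!(phishingProbability >= 0 && phishingProbability <= 1)) {
    throw new RangeError('phishingProbability must be between 0 and 1');
  }
  return phishingProbability >= threshold ? 'phishy' : 'normal';
}

console.log(classifyUrl(0.9)); // prints "phishy"
console.log(classifyUrl(0.1)); // prints "normal"
```

Raising the threshold trades missed phishing URLs for fewer false alarms, so the right value depends on how the classifier is used.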
The dataset provides the following features. The values each feature can take are shown in braces:
HAVING_IP_ADDRESS: Whether an IP address is used as an alternative to a domain name. {-1, 1}
URL_LENGTH: Whether the URL length is legitimate, suspicious or phishing. {1, 0, -1}
SHORTINING_SERVICE: Whether the URL uses a URL shortening service. {1, -1}
HAVING_AT_SYMBOL: Whether the URL contains the "@" symbol. {-1, 1}
DOUBLE_SLASH_REDIRECTING: Whether the URL contains double-slash redirecting. {-1, 1}
PREFIX_SUFFIX: Whether the URL contains a prefix or suffix separated by "-". {-1, 1}
HAVING_SUB_DOMAIN: Whether the number of subdomains in the URL is legitimate, suspicious or phishing. {-1, 0, 1}
SSLFINAL_STATE: Whether the URL uses https with a trusted issuer, uses https with an untrusted issuer, or does not use https. {-1, 0, 1}
DOMAIN_REGISTERATION_LENGTH: Whether the domain expires in less than a year. {-1, 1}
FAVICON: Whether the favicon is loaded from an external domain. {-1, 1}
PORT: Whether the port is of preferred status. {-1, 1}
HTTPS_TOKEN: Whether the "https" token is part of the domain. {-1, 1}
REQUEST_URL: Whether the percentage of requests made to external domains falls in the legitimate or suspicious category. {-1, 1}
URL_OF_ANCHOR: Whether the percentage of URLs in anchor tags that reference an external domain or the page itself falls in the legitimate, suspicious or phishy category. {-1, 0, 1}
LINKS_IN_TAGS: Whether the percentage of links in meta, script and link tags that reference an external domain falls in the legitimate, suspicious or phishy category. {-1, 0, 1}
SFH: Whether the server form handler is empty or contains "about:blank", refers to a different domain, or is normal. {1, 0, -1}
SUBMITTING_TO_EMAIL: Whether the form submits information to an email address. {-1, 1}
ABNORMAL_URL: Whether the URL contains the host name. {-1, 1}
REDIRECT: Whether the number of redirects is at most 1, between 2 and 4, or greater than 4. {0, 1}
ON_MOUSEOVER: Whether onMouseOver changes the status bar. {-1, 1}
RIGHTCLICK: Whether right click is disabled. {-1, 1}
POPUPWIDNOW: Whether the pop-up window contains a text field. {-1, 1}
IFRAME: Whether the page contains an iframe tag. {-1, 1}
AGE_OF_DOMAIN: Whether the domain is less than 6 months old. {-1, 1}
DNSRECORD: Whether a DNS record exists for the domain. {-1, 1}
WEB_TRAFFIC: Whether the website's ranking is less than 100,000, greater than 100,000, or the site is not recognized by Alexa and/or has no web traffic. {-1, 0, 1}
PAGE_RANK: Whether the page rank is less than 0.2. {-1, 1}
GOOGLE_INDEX: Whether the web page is indexed by Google. {-1, 1}
LINKS_POINTING_TO_PAGE: Whether the number of links pointing to the page is 0, between 0 and 2, or greater than 2. {-1, 0, 1}
STATISTICAL_REPORT: Whether the host belongs to the top phishing IPs or domains. {-1, 1}
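Because every feature is restricted to a small set of values, a row can be sanity-checked before it is fed to a model. The following is a sketch only: `ALLOWED_VALUES` covers just a handful of the features listed above, `findInvalidFeatures` is a hypothetical helper, and `sampleRow` is made-up data rather than a record from the dataset.

```javascript
// Allowed values for a subset of the features listed above (kept short for brevity).
const ALLOWED_VALUES = {
  HAVING_IP_ADDRESS: [-1, 1],
  URL_LENGTH: [1, 0, -1],
  SHORTINING_SERVICE: [1, -1],
  SSLFINAL_STATE: [-1, 0, 1],
  REDIRECT: [0, 1],
};

// Hypothetical helper: returns the names of features whose value is missing
// from the row or falls outside the allowed set.
function findInvalidFeatures(row) {
  return Object.entries(ALLOWED_VALUES)
    .filter(([name, allowed]) => !allowed.includes(row[name]))
    .map(([name]) => name);
}

// Made-up row for illustration; not taken from the dataset.
const sampleRow = {
  HAVING_IP_ADDRESS: -1,
  URL_LENGTH: 1,
  SHORTINING_SERVICE: 1,
  SSLFINAL_STATE: 0,
  REDIRECT: 2, // not in {0, 1}
};

console.log(findInvalidFeatures(sampleRow)); // prints [ 'REDIRECT' ]
```

Extending `ALLOWED_VALUES` to all features would follow the same pattern, one entry per feature name.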
Prepare the environment:
$ npm install
# Or
$ yarn
To build and watch the example, run:
$ yarn watch