URL:
https://www.activestate.com/blog/phishing-url-detection-with-python-and-ml/
Submission: On July 04 via api (July 4th 2023, 7:01:01 am UTC) from SG — Scanned from SG

Last Updated: August 5, 2022


PHISHING URL DETECTION WITH PYTHON AND ML



Phishing is a form of fraudulent attack where the attacker tries to gain
sensitive information by posing as a reputable source. In a typical phishing
attack, a victim opens a compromised link that poses as a credible website. The
victim is then asked to enter their credentials, but since it is a “fake”
website, the sensitive information is routed to the hacker and the victim gets
“hacked.”

Phishing is popular since it is a low-effort, high-reward attack. Most modern
web browsers, antivirus software and email clients are pretty good at detecting
phishing websites at the source, helping to prevent attacks. To understand how
they work, this blog post will walk you through a tutorial that shows you how to
build your own phishing URL detector using Python and machine learning:

 1. Identify the criteria that can recognize fake URLs
 2. Build a decision tree that can iterate through the criteria
 3. Train our model to recognize fake vs real URLs
 4. Evaluate our model to see how it performs
 5. Check for false positives/negatives


GET STARTED: INSTALL ML TOOLS WITH THIS READY-TO-USE PYTHON ENVIRONMENT

To follow along with the code in this Python phishing detection tutorial, you’ll
need to have a recent version of Python installed, along with all the packages
used in this post. The quickest way to get up and running is to install the
Phishing URL Detection runtime for Windows or Linux, which contains a version of
Python and all the packages you’ll need. 



In order to download the ready-to-use phishing detection Python environment, you
will need to create an ActiveState Platform account. Just use your GitHub
credentials or your email address to register. Signing up is easy and it unlocks
the ActiveState Platform’s many benefits for you!



For Windows users, run the following at a CMD prompt to automatically download
and install our CLI, the State Tool, along with the Phishing URL Detection
runtime into a virtual environment:

powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Phishing-URL-Detection"

For Linux users, run the following to automatically download and install our
CLI, the State Tool, along with the Phishing URL Detection runtime into a
virtual environment:

sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Phishing-URL-Detection




1 — HOW TO IDENTIFY A FRAUDULENT URL

A fraudulent domain or phishing domain is a URL that looks suspicious for a
variety of reasons. Most commonly, the URL: 

 * Is misspelled
 * Points to the wrong top-level domain
 * Is a combination of a valid and a fraudulent URL
 * Is incredibly long 
 * Is just an IP address
 * Has a low pagerank
 * Has a young domain age
 * Ranks poorly on the Alexa Top 1 Million Sites

All these are characteristics of a phishing URL that can help us distinguish it
from a valid URL. These characteristics can be converted into machine learning
feature sets such as numbers, labels and booleans.

The University of California, Irvine put together a dataset identifying
fraudulent versus valid URLs. Feature sets are divided into four main
categories:

 1. Address Bar-Based Features – these are features extracted from the URL
    itself, like URL length >54 characters, or whether it contains an IP
    address, uses a URL shortening service like TinyURL or Bitly, or employs
    redirection. Additional features may also include:
    * Adding a prefix or suffix separated by (-) to the domain
    * Having sub-domain and multi-sub-domains
    * Existence of HTTPS
    * Domain registration age
    * Favicon loading from a different domain
    * Using a non-standard port
 2. Abnormal Features – these may include: 
    * Images in the body loaded from a different URL
    * Minimal use of meta tags
    * The use of a Server Form Handler (SFH)
    * Submitting information to email
    * An abnormal URL
 3. HTML and JavaScript-Based Features – these can include things like: 
    * Website forwarding 
    * Status bar customization typically using JavaScript to display a fake URL 
    * Disabling the ability to right-click so users can’t view page source code
    * Using pop-up windows
    * iFrame redirection
 4. Domain-Based Features – these can include:
    * Unusually young domains
    * Suspicious DNS record
    * Low volume of website traffic
    * PageRank, where 95% of phishing webpages have no PageRank
    * Whether the site has been indexed by Google


2 — BUILDING A DECISION TREE

Given all the criteria that can help us identify phishing URLs, we can use a
machine learning algorithm, such as a decision tree classifier, to help us
decide whether a URL is valid or not. 

First, let’s download the UC Irvine dataset and explore its contents. The
feature list contains:

 * having_IP_Address  { -1,1 }
 * URL_Length   { 1,0,-1 }
 * Shortining_Service { 1,-1 }
 * having_At_Symbol   { 1,-1 }
 * double_slash_redirecting { -1,1 }
 * Prefix_Suffix  { -1,1 }
 * having_Sub_Domain  { -1,0,1 }
 * SSLfinal_State  { -1,1,0 }
 * Domain_registeration_length { -1,1 }
 * Favicon { 1,-1 }
 * port { 1,-1 }
 * HTTPS_token { -1,1 }
 * Request_URL  { 1,-1 }
 * URL_of_Anchor { -1,0,1 }
 * Links_in_tags { 1,-1,0 }
 * SFH  { -1,1,0 }
 * Submitting_to_email { -1,1 }
 * Abnormal_URL { -1,1 }
 * Redirect  { 0,1 }
 * on_mouseover  { 1,-1 }
 * RightClick  { 1,-1 }
 * popUpWidnow  { 1,-1 }
 * Iframe { 1,-1 }
 * age_of_domain  { -1,1 }
 * DNSRecord   { -1,1 }
 * web_traffic  { -1,0,1 }
 * Page_Rank { -1,1 }
 * Google_Index { 1,-1 }
 * Links_pointing_to_page { 1,0,-1 }
 * Statistical_report { -1,1 }

And finally, the Result designates whether the URL is valid or not:

 * Result  { -1,1 }

Where -1 denotes an invalid URL and 1 is a valid URL.

Now let’s jump into the code. First, we load the required modules:

# To perform operations on dataset
import pandas as pd
import numpy as np

# Machine learning model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Evaluation and visualization
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

Next we read the dataset and set the paths of the output files:

df = pd.read_csv('.../dataset.csv')
dot_file = '.../tree.dot'
confusion_matrix_file = '.../confusion_matrix.png'

And then print the first five rows:

print(df.head())

-1  1  1.1  1.2  -1.1  -1.2  -1.3  -1.4  -1.5  1.3  1.4  -1.6  1.5  -1.7  1.6     ...    -1.9  -1.10  0  1.7  1.8  1.9  1.10  -1.11  -1.12  -1.13  -1.14  1.11  1.12  -1.15  -1.16
0   1  1    1    1     1    -1     0     1    -1    1    1    -1    1     0   -1  ...       1      1  0    1    1    1     1     -1     -1      0     -1     1     1      1     -1
1   1  0    1    1     1    -1    -1    -1    -1    1    1    -1    1     0   -1  ...      -1     -1  0    1    1    1     1      1     -1      1     -1     1     0     -1     -1
2   1  0    1    1     1    -1    -1    -1     1    1    1    -1   -1     0    0  ...       1      1  0    1    1    1     1     -1     -1      1     -1     1    -1      1     -1
3   1  0   -1    1     1    -1     1     1    -1    1    1     1    1     0    0  ...       1      1  0   -1    1   -1     1     -1     -1      0     -1     1     1      1      1
4  -1  0   -1    1    -1    -1     1     1    -1    1    1    -1    1     0    0  ...      -1     -1  0    1    1    1     1      1      1      1     -1     1    -1     -1      1

This output shows the first 5 rows of the dataset, which has 31 columns (pandas
abbreviates the middle ones with "..."): one for each of the attributes we
discussed in the above section, plus the Result.
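
One thing to watch in the output above: the column labels (-1, 1, 1.1, ...)
look like values from the first record rather than attribute names, which
suggests the CSV has no header row. Here is a minimal sketch of one way to
handle that, assuming the file really is headerless and its columns follow the
order of the feature list above. COLUMN_NAMES, first_row_is_data and
load_dataset are hypothetical names, not part of the original tutorial.

```python
import csv
import io

# Names copied from the feature list above, in the same order, plus Result.
COLUMN_NAMES = [
    "having_IP_Address", "URL_Length", "Shortining_Service",
    "having_At_Symbol", "double_slash_redirecting", "Prefix_Suffix",
    "having_Sub_Domain", "SSLfinal_State", "Domain_registeration_length",
    "Favicon", "port", "HTTPS_token", "Request_URL", "URL_of_Anchor",
    "Links_in_tags", "SFH", "Submitting_to_email", "Abnormal_URL",
    "Redirect", "on_mouseover", "RightClick", "popUpWidnow", "Iframe",
    "age_of_domain", "DNSRecord", "web_traffic", "Page_Rank",
    "Google_Index", "Links_pointing_to_page", "Statistical_report",
    "Result",
]


def first_row_is_data(csv_text):
    """Return True if every field in the first row parses as an integer."""
    first_row = next(csv.reader(io.StringIO(csv_text)))
    try:
        for field in first_row:
            int(field)
    except ValueError:
        return False
    return True


def load_dataset(path):
    """Read the CSV without treating its first record as a header."""
    import pandas as pd  # local import: only this function needs pandas
    return pd.read_csv(path, header=None, names=COLUMN_NAMES)
```

With named columns, the exported decision tree would show attribute names
instead of labels like 1.1 or -1.2. The rest of this post keeps the original
loading code, so its outputs match what is shown.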


3 — TRAIN THE MODEL

As always, the first step in training a machine learning model is to split the
dataset into testing and training data:

# Features: every column except the last
X = df.iloc[:, :-1]
# Target: the last column (Result)
y = df.iloc[:, -1]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

Since every feature is categorical (encoded as -1, 0 or 1) and the target is
binary, a classifier such as a Decision Tree, Random Forest or Logistic
Regression is a natural fit. In this case, I chose to work with a Decision
Tree because it’s straightforward, needs no feature scaling, and produces rules
that are easy to inspect.

model = DecisionTreeClassifier()

model.fit(Xtrain, ytrain)
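
By default, scikit-learn's DecisionTreeClassifier scores candidate splits with
Gini impurity. To give a feel for what fitting the tree involves, here is a
simplified sketch of that score for a single feature and threshold. The helper
names gini and split_impurity are hypothetical, and the real implementation
also searches over all features and thresholds and recurses on each branch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def split_impurity(feature_values, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


if __name__ == "__main__":
    # Toy data: the feature separates the two classes perfectly.
    feature = [-1, -1, 1, 1]
    result = [-1, -1, 1, 1]
    print(gini(result))                        # impurity before the split
    print(split_impurity(feature, result, 0))  # impurity after the split
```

A split is attractive when the weighted impurity after it is much lower than
the impurity before it.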


4 — EVALUATE THE MODEL

Now that the model is trained, let’s see how well it does on the test data:

ypred = model.predict(Xtest)

# classification_report expects the true labels first, then the predictions
print(metrics.classification_report(ytest, ypred))
print("\n\nAccuracy Score:", round(metrics.accuracy_score(ytest, ypred), 2) * 100, "%")

We used the model to predict labels for the Xtest data. Now let’s compare the
results to ytest and see how well we did:

              precision    recall  f1-score   support

          -1       0.95      0.95      0.95      1176
           1       0.96      0.96      0.96      1588

   micro avg       0.96      0.96      0.96      2764
   macro avg       0.96      0.96      0.96      2764
weighted avg       0.96      0.96      0.96      2764

Accuracy Score: 96.0 %

Not bad! We made literally no modifications to the data and achieved an accuracy
score of 96%. From here, you can dive deeper into the data and see if there’s
any transformation that can be done to further improve the accuracy of
prediction.


5 — IDENTIFY FALSE POSITIVES & FALSE NEGATIVES

The results of any decision tree evaluation are likely to contain both false
positives (URLs that are actually valid, but that our model indicates are not),
as well as false negatives (URLs that are actually bad, but our model indicates
are fine). To help resolve these instances, let’s draw out a confusion matrix (a
table with 4 different combinations of predicted and actual values) for our
results. The matrix will help us identify:

 * True Positives
 * True Negatives
 * False Positives (Type 1 Error)
 * False Negatives (Type 2 Error)

mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.savefig(confusion_matrix_file)

As you can see, the numbers of false positives and false negatives are pretty
low compared to our true positives and negatives, so we can be pretty sure of
our results.
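
The four cells listed above can also be counted by hand, which makes the
definitions concrete. This is a minimal sketch in plain Python, assuming that
the "positive" class is a phishing URL (label -1), to match the description of
false positives given earlier. The function name confusion_counts and the toy
labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=-1):
    """Count TP, FP, TN and FN, treating `positive` as the phishing label."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # The model flagged the URL as phishing.
            counts["tp" if actual == positive else "fp"] += 1
        else:
            # The model let the URL through as valid.
            counts["fn" if actual == positive else "tn"] += 1
    return counts


if __name__ == "__main__":
    # Toy labels for illustration only, not results from the dataset.
    actual = [-1, -1, 1, 1, 1]
    predicted = [-1, 1, 1, -1, 1]
    print(confusion_counts(actual, predicted))
```

For a phishing detector, false negatives (bad URLs that slip through) are
usually the more costly error, so it is worth looking at that cell separately
rather than relying on accuracy alone.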

To see how the decision tree arrived at these decisions, we can export it with
sklearn and render it with the Graphviz dot tool.

export_graphviz(model, out_file=dot_file, feature_names=X.columns.values)
>> dot -Tpng tree.dot -o tree.png

We use export_graphviz to create a dot file of the decision tree, which is a
text file that lets us visualize the actual bifurcations in decisions. Then,
using the command line tool dot we convert the text file to a PNG image which
shows our final “tree” of decisions (open it in a new tab to view the details):





PHISHING URL DETECTION WITH PYTHON: SUMMARY

These days, when everyone is working from home, there’s a lot less opportunity to
just casually ask your office colleagues if they’ve received a suspicious email
like the one you just got. And attackers know it, driving a 300% increase in
cybercrime since the start of the pandemic. It’s always good practice to check
every link before you click on it, but of course, busy employees can get
careless.

This blog post showed you how, given a set of criteria that can typically
identify phishing URLs, you can build and train a simple decision tree model to
evaluate any given URL, and indicate whether it is actually valid or not with
96% accuracy. Now, if only it was as easy as this to prevent people from
clicking fraudulent links in the first place!

 * You can find the criteria for evaluating phishing URLs in UC Irvine’s
   dataset.
 * To get started building your own URL phishing detector, sign up for a free
   ActiveState Platform account so you can download our Phishing URL Detection
   runtime environment and get started faster.


SWAATHI KAKARLA

Guest blogger: Swaathi Kakarla is the co-founder and CTO at Skcript. She enjoys
talking and writing about code efficiency, performance, and startups. In her
free time, she finds solace in yoga, bicycling and contributing to open source.




