www.activestate.com
2600:9000:229f:800:a:6be0:b800:93a1
Public Scan
URL:
https://www.activestate.com/blog/phishing-url-detection-with-python-and-ml/
Submission: On July 04 via api from SG — Scanned from SG
Text Content
Last Updated: August 5, 2022

PHISHING URL DETECTION WITH PYTHON AND ML

Phishing is a form of fraudulent attack where the attacker tries to gain sensitive information by posing as a reputable source. In a typical phishing attack, a victim opens a compromised link that poses as a credible website. The victim is then asked to enter their credentials, but since it is a “fake” website, the sensitive information is routed to the attacker and the victim gets “hacked.”

Phishing is popular since it is a low-effort, high-reward attack. Most modern web browsers, antivirus software and email clients are pretty good at detecting phishing websites at the source, helping to prevent attacks. To understand how they work, this blog post will walk you through a tutorial that shows you how to build your own phishing URL detector using Python and machine learning:

1. Identify the criteria that can recognize fake URLs
2. Build a decision tree that can iterate through the criteria
3. Train our model to recognize fake vs real URLs
4. Evaluate our model to see how it performs
5. Check for false positives/negatives

GET STARTED: INSTALL ML TOOLS WITH THIS READY-TO-USE PYTHON ENVIRONMENT

To follow along with the code in this Python phishing detection tutorial, you’ll need to have a recent version of Python installed, along with all the packages used in this post. The quickest way to get up and running is to install the Phishing URL Detection runtime for Windows or Linux, which contains a version of Python and all the packages you’ll need.

In order to download the ready-to-use phishing detection Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register.

For Windows users, run the following at a CMD prompt to automatically download and install our CLI, the State Tool, along with the Phishing URL Detection runtime into a virtual environment:

    powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Phishing-URL-Detection"

For Linux users, run the following to automatically download and install our CLI, the State Tool, along with the Phishing URL Detection runtime into a virtual environment:

    sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Phishing-URL-Detection

1 — HOW TO IDENTIFY A FRAUDULENT URL

A fraudulent or phishing URL is one that looks suspicious for a variety of reasons. Most commonly, the URL:

* Is misspelled
* Points to the wrong top-level domain
* Is a combination of a valid and a fraudulent URL
* Is incredibly long
* Is just an IP address
* Has a low PageRank
* Has a young domain age
* Ranks poorly on the Alexa Top 1 Million Sites

All these are characteristics of a phishing URL that can help us distinguish it from a valid URL.
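A few of the characteristics listed above can be checked directly from the URL string. The following is a minimal sketch using only the Python standard library, not the method used to build the dataset. The `TRUSTED_DOMAINS` list, the `url_features` helper and the 0.8 similarity threshold are assumptions made up for this illustration.

```python
import ipaddress
from difflib import SequenceMatcher
from urllib.parse import urlsplit

# Hypothetical allow-list, used only for this illustration.
TRUSTED_DOMAINS = ["paypal.com", "google.com"]


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def closest_trusted(host: str) -> float:
    """Highest similarity ratio (0 to 1) between the host and any trusted domain."""
    return max(SequenceMatcher(None, host, d).ratio() for d in TRUSTED_DOMAINS)


def url_features(url: str) -> dict:
    """Extract a few simple, URL-only signals."""
    host = urlsplit(url).hostname or ""
    is_ip = host_is_ip(host)
    return {
        "length": len(url),
        "host_is_ip": is_ip,
        # Last dot-separated label of the host; empty for IP addresses.
        "tld": host.rsplit(".", 1)[-1] if "." in host and not is_ip else "",
        # Close to a trusted name without being identical: possible misspelling.
        # The 0.8 threshold is an arbitrary choice for this sketch.
        "lookalike": host not in TRUSTED_DOMAINS and closest_trusted(host) >= 0.8,
    }


print(url_features("https://paypa1.com/signin"))
print(url_features("http://192.168.0.1/login"))
```

Signals such as PageRank, domain age and traffic rank cannot be derived from the string alone; they need external lookups, which this sketch leaves out.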
These characteristics can be converted into machine learning feature sets such as numbers, labels and booleans. The University of California, Irvine put together a dataset identifying fraudulent versus valid URLs. Feature sets are divided into four main categories:

1. Address Bar-Based Features – these are features extracted from the URL itself, like URL length >54 characters, or whether it contains an IP address, uses a URL shortening service like TinyURL or Bitly, or employs redirection. Additional features may also include:
* Adding a prefix or suffix separated by (-) to the domain
* Having sub-domain and multi-sub-domains
* Existence of HTTPS
* Domain registration age
* Favicon loading from a different domain
* Using a non-standard port

2. Abnormal Features – these may include:
* Images in the body loaded from a different URL
* Minimal use of meta tags
* The use of a Server Form Handler (SFH)
* Submitting information to email
* An abnormal URL

3. HTML and JavaScript-Based Features – these can include things like:
* Website forwarding
* Status bar customization, typically using JavaScript to display a fake URL
* Disabling the ability to right-click so users can’t view page source code
* Using pop-up windows
* iFrame redirection

4. Domain-Based Features – these can include:
* Unusually young domains
* Suspicious DNS record
* Low volume of website traffic
* PageRank, where 95% of phishing webpages have no PageRank
* Whether the site has been indexed by Google

2 — BUILDING A DECISION TREE

Given all the criteria that can help us identify phishing URLs, we can use a machine learning algorithm, such as a decision tree classifier, to help us decide whether a URL is valid or not. First, let’s download the UC Irvine dataset and explore its contents.
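Before modelling, a quick look at the raw file's shape and label balance can be useful. Here is a minimal sketch using only the standard library. `SAMPLE` is a made-up three-row stand-in for the real file, and the `summarize` helper is hypothetical. It assumes the file has no header row and that the last column holds the label; check both against your copy of the dataset.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for the dataset file: no header row, last column is the label.
SAMPLE = "1,0,-1,1\n-1,1,1,-1\n1,1,0,1\n"


def summarize(handle):
    """Return (row_count, column_count, label_counts) for a headerless CSV."""
    rows = [[int(v) for v in row] for row in csv.reader(handle) if row]
    label_counts = Counter(row[-1] for row in rows)
    n_cols = len(rows[0]) if rows else 0
    return len(rows), n_cols, label_counts


n_rows, n_cols, labels = summarize(io.StringIO(SAMPLE))
print(n_rows, n_cols, dict(labels))  # prints: 3 4 {1: 2, -1: 1}
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open(path, newline="")`.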
The feature list contains:

* having_IP_Address { -1,1 }
* URL_Length { 1,0,-1 }
* Shortining_Service { 1,-1 }
* having_At_Symbol { 1,-1 }
* double_slash_redirecting { -1,1 }
* Prefix_Suffix { -1,1 }
* having_Sub_Domain { -1,0,1 }
* SSLfinal_State { -1,1,0 }
* Domain_registeration_length { -1,1 }
* Favicon { 1,-1 }
* port { 1,-1 }
* HTTPS_token { -1,1 }
* Request_URL { 1,-1 }
* URL_of_Anchor { -1,0,1 }
* Links_in_tags { 1,-1,0 }
* SFH { -1,1,0 }
* Submitting_to_email { -1,1 }
* Abnormal_URL { -1,1 }
* Redirect { 0,1 }
* on_mouseover { 1,-1 }
* RightClick { 1,-1 }
* popUpWidnow { 1,-1 }
* Iframe { 1,-1 }
* age_of_domain { -1,1 }
* DNSRecord { -1,1 }
* web_traffic { -1,0,1 }
* Page_Rank { -1,1 }
* Google_Index { 1,-1 }
* Links_pointing_to_page { 1,0,-1 }
* Statistical_report { -1,1 }

And finally, the Result designates whether the URL is valid or not:

* Result { -1,1 }

Where -1 denotes an invalid (phishing) URL and 1 is a valid URL.

Now let’s jump into the code. First, we load the required modules:

    # To perform operations on dataset
    import pandas as pd
    import numpy as np

    # Machine learning model
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Visualization
    from sklearn import metrics
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.tree import export_graphviz

Next we read the dataset and set the paths for the output files:

    df = pd.read_csv('.../dataset.csv')
    dot_file = '.../tree.dot'
    confusion_matrix_file = '.../confusion_matrix.png'

And then print the first few rows:

    print(df.head())

           -1     1   1.1   1.2  -1.1  -1.2  -1.3  -1.4  -1.5   1.3   1.4  -1.6   1.5  -1.7   1.6   ...  -1.9 -1.10     0   1.7   1.8   1.9  1.10 -1.11 -1.12 -1.13 -1.14  1.11  1.12 -1.15 -1.16
    0       1     1     1     1     1    -1     0     1    -1     1     1    -1     1     0    -1   ...     1     1     0     1     1     1     1    -1    -1     0    -1     1     1     1    -1
    1       1     0     1     1     1    -1    -1    -1    -1     1     1    -1     1     0    -1   ...    -1    -1     0     1     1     1     1     1    -1     1    -1     1     0    -1    -1
    2       1     0     1     1     1    -1    -1    -1     1     1     1    -1    -1     0     0   ...     1     1     0     1     1     1     1    -1    -1     1    -1     1    -1     1    -1
    3       1     0    -1     1     1    -1     1     1    -1     1     1     1     1     0     0   ...     1     1     0    -1     1    -1     1    -1    -1     0    -1     1     1     1     1
    4      -1     0    -1     1    -1    -1     1     1    -1     1     1    -1     1     0     0   ...    -1    -1     0     1     1     1     1     1     1     1    -1     1    -1    -1     1

This output shows the first 5 rows of the dataset. There are 31 columns: one for each of the 30 attributes discussed in the above section, plus the Result. Note that the column labels in this output (-1, 1, 1.1, 1.2 and so on) appear to be the file’s first data row, which pandas used as the header because the CSV has no header row of its own; passing header=None to read_csv avoids losing that row.

3 — TRAIN THE MODEL

As always, the first step in training a machine learning model is to split the dataset into testing and training data:

    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

Since every feature in the dataset takes one of a few discrete values (-1, 0 or 1) and the task is classification, a Decision Tree, Random Forest classifier or Logistic Regression algorithm is a good fit. In this case, I chose to work with a Decision Tree, because it’s straightforward and generally gives good results on this kind of data.

    model = DecisionTreeClassifier()
    model.fit(Xtrain, ytrain)

4 — EVALUATE THE MODEL

Now that the model is trained, let’s see how well it does on the test data:

    ypred = model.predict(Xtest)
    print(metrics.classification_report(ypred, ytest))
    print("\n\nAccuracy Score:", metrics.accuracy_score(ytest, ypred).round(2)*100, "%")

We used the model to predict labels for the Xtest data and compared them to ytest. Here is how well we did:

                  precision    recall  f1-score   support

              -1       0.95      0.95      0.95      1176
               1       0.96      0.96      0.96      1588

       micro avg       0.96      0.96      0.96      2764
       macro avg       0.96      0.96      0.96      2764
    weighted avg       0.96      0.96      0.96      2764

    Accuracy Score: 96.0 %

Note that classification_report expects the true labels as its first argument and the predictions as its second. With the arguments in the order used above, precision and recall are swapped and the support column counts predicted rather than actual labels.

Not bad! We made no modifications to the data and achieved an accuracy score of 96%. From here, you can dive deeper into the data and see if there’s any transformation that can be done to further improve the accuracy of prediction.

5 — IDENTIFY FALSE POSITIVES & FALSE NEGATIVES

The results of any decision tree evaluation are likely to contain both false positives (URLs that are actually valid, but that our model indicates are not), as well as false negatives (URLs that are actually bad, but our model indicates are fine).
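Under the definitions above, phishing (-1) is the "positive" class. The four outcomes can be counted by hand, as in this minimal sketch. The `count_errors` helper and the label lists are made up for illustration and are not taken from the model's results.

```python
PHISHING, VALID = -1, 1  # label encoding used by the dataset's Result column


def count_errors(y_true, y_pred):
    """Count outcomes, treating PHISHING as the positive class."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == PHISHING:
            # Flagged as phishing: correct (tp) or a valid URL wrongly flagged (fp).
            counts["tp" if actual == PHISHING else "fp"] += 1
        else:
            # Passed as valid: correct (tn) or a phishing URL that slipped through (fn).
            counts["tn" if actual == VALID else "fn"] += 1
    return counts


# Made-up labels for illustration only.
y_true = [-1, -1, 1, 1, 1, -1]
y_pred = [-1, 1, 1, -1, 1, -1]
print(count_errors(y_true, y_pred))  # prints: {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```

For a phishing detector, a false negative (a phishing URL passed as valid) is usually the costlier mistake, so it is worth looking at the two error counts separately rather than at accuracy alone.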
To help resolve these instances, let’s draw out a confusion matrix (a table with 4 different combinations of predicted and actual values) for our results. The matrix will help us identify:

* True Positives
* True Negatives
* False Positives (Type 1 Error)
* False Negatives (Type 2 Error)

    mat = confusion_matrix(ytest, ypred)
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
    plt.xlabel('true label')
    plt.ylabel('predicted label')
    plt.savefig(confusion_matrix_file)

As you can see, the number of false positives and false negatives is pretty low compared to our true positives and negatives, so we can be fairly confident in our results.

To see how the decision tree arrived at these decisions, we can visualize it with sklearn’s export_graphviz and the Graphviz dot tool:

    export_graphviz(model, out_file=dot_file, feature_names=X.columns.values)

Then, at the command line:

    dot -Tpng tree.dot -o tree.png

We use export_graphviz to create a dot file of the decision tree, which is a text file that describes each split the tree makes. Then, using the command line tool dot, we convert the text file to a PNG image which shows our final “tree” of decisions (open it in a new tab to view the details).

PHISHING URL DETECTION WITH PYTHON: SUMMARY

These days, when everyone is working from home, there’s a lot less opportunity to just casually ask your office colleagues if they’ve received a suspicious email like the one you just got. And attackers know it, driving a 300% increase in cybercrime since the start of the pandemic. It’s always good practice to check every link before you click on it, but of course, busy employees can get careless.

This blog post showed you how, given a set of criteria that can typically identify phishing URLs, you can build and train a simple decision tree model to evaluate a URL described by those criteria, and indicate whether it is actually valid or not with 96% accuracy on the test data.

Now, if only it was as easy as this to prevent people from clicking fraudulent links in the first place!
* You can find the criteria for evaluating phishing URLs in UC Irvine’s dataset.
* To get started building your own URL phishing detector, sign up for a free ActiveState Platform account so you can download our Phishing URL Detection runtime environment and get started faster.

RECOMMENDED READS

> Top 5 Cybersecurity Tools for a Work-from-Home World
> Using Python for CyberSecurity Testing

BLOG AUTHOR

Guest blogger: Swaathi Kakarla is the co-founder and CTO at Skcript. She enjoys talking and writing about code efficiency, performance, and startups. In her free time, she finds solace in yoga, bicycling and contributing to open source.

© 2022 ActiveState Software Inc. All rights reserved.
ActiveState®, ActivePerl®, ActiveTcl®, ActivePython®, Komodo®, ActiveGo™, ActiveRuby™, ActiveNode™, ActiveLua™, and The Open Source Languages Company™ are all trademarks of ActiveState.