Submitted URL: http://mj12bot.com/
Effective URL: https://mj12bot.com/
Submission: On October 28 via manual from IN — Scanned from GB

ABOUT MJ12BOT

Bot Type: Good crawler (always identifies itself)
IP Range: Distributed, Worldwide
Obeys Robots.txt: Yes
Obeys Crawl Delay: Yes
Data served at: Majestic.com

Majestic is a UK based specialist search engine used by hundreds of thousands of
businesses in 13 languages and over 60 countries to paint a map of the Internet
independent of the consumer based search engines. Majestic also powers other
legitimate technologies that help to understand the continually changing fabric
of the web.

Web site owners can see data about their own websites on majestic.com.

MJ12Bot does not currently cache web content or personal data. Instead it maps
the link relationships between websites to build a search engine. This data is
available to technologies and the public, either by searching for a keyword or a
website at Majestic. Details about the community project behind the crawlers are
at Majestic12.co.uk.

WHAT IS MJ12BOT DOING ON MY SITE(S)?

We spider the Web to build a search engine. Our fast, efficient, downloadable
distributed crawler lets people with broadband connections contribute to what
we hope will become the biggest search engine in the world. Production of a
full text search engine at Majestic-12 is currently in the research phase,
funded in part by the commercialisation of research at Majestic.

WHAT HAPPENS TO THE CRAWLED DATA?

Crawled data (currently only a web graph of links) is added to the largest
public backlinks search engine index that we maintain as a dedicated tool called
Site Explorer. Learn about your own backlinks from the extensive backlinks
index.

MY WEB HOST IS BLOCKING YOUR BOT, WHY?

Some ISPs and badly configured firewalls may stop MJ12Bot from crawling your
website. This is usually because the ISP or firewall operator does not
understand that, in doing so, they are blocking genuine visitors to your
website at a later date. Some also do this to minimise bandwidth use. In these
instances, some ISPs can remove the block for all their users once they
understand the purpose of the bot. If your ISP will not allow our bot, we
recommend that you consider moving ISPs.

WHY DO YOU KEEP CRAWLING 404 OR 301 PAGES?

We have a long memory and want to ensure that temporary errors, website down
pages or other temporary changes to sites do not cause irreparable changes to
your site profile when they shouldn't. Also, if there are still links to these
pages, they will continue to be found and followed. Google are also asked this
question and have published a statement. Their reason is the same as ours, and
their answer can be found here: Google 404 policy.

YOU ARE CRAWLING LINKS WITH REL=NOFOLLOW

This is a common misunderstanding of the (perhaps poorly named) nofollow
attribute. Google introduced the 'rel=nofollow' attribute in 2005, stating that
links so marked would not influence the target's PageRank. It does not stop the
crawler from visiting the target page. This becomes particularly obvious if the
target page has several links to it: some may have this attribute and some may
not. If you wish to stop bots from crawling a page, then the robots.txt file
should be used to disallow the target page.

More information on rel=nofollow can be found here: Wikipedia Nofollow

HOW CAN I BLOCK MJ12BOT?

MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from
crawling your website, add the following text to your robots.txt:

User-agent: MJ12bot
Disallow: /

Please do not block our bot via IP in htaccess - we do not use any consecutive
IP blocks as we are a community based distributed crawler. Please always make
sure the bot can actually retrieve robots.txt itself. If it can't then it will
assume that it is okay to crawl your site.
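
Before deploying a block like the one above, it can help to check that it parses
the way you expect. The following is a minimal sketch using Python's standard
urllib.robotparser module. It only shows how that parser reads the rules, not
how MJ12bot itself evaluates them, and the example.com URL and the "OtherBot"
name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The block recommended above, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /
"""

def parse_robots(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = parse_robots(ROBOTS_TXT)

# MJ12bot is refused everywhere; agents with no matching section are allowed.
print(parser.can_fetch("MJ12bot", "https://example.com/any/page"))
print(parser.can_fetch("OtherBot", "https://example.com/any/page"))
```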

If you have reason to believe that MJ12bot did NOT obey your robots.txt
commands, then please let us know via email: bot@majestic12.co.uk. Please
provide the URL of your website and log entries showing the bot trying to
retrieve pages that it was not supposed to.

WHAT COMMANDS IN ROBOTS.TXT DOES MJ12BOT SUPPORT?

The current crawler supports the following non-standard extensions to
robots.txt:

 * Crawl-Delay for up to 20 seconds (higher values will be rounded down to the
   maximum our bot supports)
 * Redirects (within the same site) when trying to fetch robots.txt
 * Simple pattern matching in Disallow directives compatible with Yahoo's
   wildcard specification
 * Allow directives can override Disallow if they are more specific (longer in
   length)
 * Certain failures to fetch robots.txt, such as 403 Forbidden, will be treated
   as a blanket disallow directive
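
To illustrate how the wildcard and "longer rule wins" behaviour in the list
above can interact, here is a rough Python sketch. It is not MJ12bot's actual
code. It assumes that the wildcard syntax means "*" for any run of characters
and a trailing "$" for end of URL, and that ties between rules of equal length
go to the rule listed first. The function names are made up for this example.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the pattern to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) pairs, where directive is
    'allow' or 'disallow'. The longest matching pattern wins; if nothing
    matches, the path is allowed.
    """
    best_length = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty Disallow line restricts nothing
        if pattern_to_regex(pattern).match(path) and len(pattern) > best_length:
            best_length = len(pattern)
            allowed = directive == "allow"
    return allowed

rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public/"),  # longer, so it overrides the Disallow
    ("disallow", "/*.pdf$"),
]
for path in ["/private/x", "/private/public/x", "/docs/a.pdf", "/index.html"]:
    print(path, is_allowed(path, rules))
```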

WHY DID MY ROBOTS.TXT BLOCK NOT WORK ON MJ12BOT?

We are keen to see any reports of potential violations of robots.txt by MJ12bot.

There are a number of false positives raised - this can be a useful checklist
when configuring a web server:

 1. Off site redirects when requesting robots.txt - MJ12Bot follows redirects,
    but only on the same domain. The ideal is for robots.txt to be available at
    "/robots.txt" as specified in the standard.
 2. Multiple domains running on the same server. Modern webservers such as
    Apache can log accesses to a number of domains to one file - this can cause
    confusion when attempting to see which website was accessed at which point.
    You may wish to consider adding domain information to the access log, or
    splitting access logs on a per domain basis.
 3. Robots.txt out of sync with developer copy. We have had complaints that
    MJ12Bot has disobeyed robots.txt - only to find out that the developer was
    testing against a development server, which was not in sync with the live
    version.
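
For the second item in the checklist, once the domain is recorded in the access
log it becomes straightforward to see which site each bot request belonged to.
The sketch below assumes Apache's "vhost_combined" style of log line, where the
virtual host and port come first. The regular expression, the function name and
the sample log lines are illustrative only, so adjust them to your own log
format.

```python
import re
from collections import defaultdict

# Assumed layout: vhost:port ip ident user [time] "request" status bytes
# "referer" "user-agent" (Apache's vhost_combined style).
LINE_RE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

def mj12bot_paths_by_vhost(lines):
    """Group the paths requested by MJ12bot under the virtual host they hit."""
    hits = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "mj12bot" in match.group("agent").lower():
            hits[match.group("vhost")].append(match.group("path"))
    return dict(hits)

# Made-up sample lines, for illustration only.
SAMPLE = [
    'www.example.com:80 203.0.113.7 - - [28/Oct/2023:10:00:00 +0000] '
    '"GET /robots.txt HTTP/1.1" 200 68 "-" '
    '"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    'shop.example.com:80 203.0.113.7 - - [28/Oct/2023:10:00:05 +0000] '
    '"GET /basket HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    'www.example.com:80 198.51.100.9 - - [28/Oct/2023:10:00:09 +0000] '
    '"GET / HTTP/1.1" 200 1024 "-" "SomeBrowser/1.0"',
]
print(mj12bot_paths_by_vhost(SAMPLE))
```

If a disallowed path shows up under a host whose robots.txt does not actually
contain the block, the report is likely one of the false positives described
above.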

HOW CAN I SLOW DOWN MJ12BOT?

You can easily slow down the bot by adding the following to your robots.txt file:

User-Agent: MJ12bot
Crawl-Delay: 5

Crawl-Delay should be an integer and gives the number of seconds to wait
between requests. MJ12bot will wait up to 20 seconds between requests to your
site. Note, however, that while it is unlikely, it is still possible for your
site to be crawled by multiple MJ12bots at the same time. Setting a high
Crawl-Delay should minimise the impact on your site. The Crawl-Delay parameter
also applies if it is set in the * wildcard section.

If you have an MJ12bot section, it will be used in preference to the * wildcard
section, not in addition to it. So if you have a Crawl-Delay in your * wildcard
section and a bot specific MJ12bot section also exists, the Crawl-Delay must be
copied into the MJ12bot section for it to be conveyed to our bot.
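
The section precedence described above can be checked locally. The sketch below
uses Python's standard urllib.robotparser, which happens to pick sections in
the same "specific section instead of wildcard" way. It is not MJ12bot's own
parser, and the effective_delay helper, with its 20 second cap taken from the
text above, is a hypothetical illustration.

```python
from urllib.robotparser import RobotFileParser

MAX_DELAY_SECONDS = 20  # the maximum MJ12bot is stated to support

def crawl_delay_for(robots_txt, agent="MJ12bot"):
    """Return the Crawl-Delay the parser reports for `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)

def effective_delay(requested):
    """Clamp a requested delay to the supported maximum (0 if unset)."""
    return 0 if requested is None else min(requested, MAX_DELAY_SECONDS)

# The wildcard delay is NOT inherited by the MJ12bot section.
MISSING = """\
User-agent: *
Crawl-Delay: 5

User-agent: MJ12bot
Disallow: /private/
"""

# The delay is copied into the MJ12bot section, so it applies.
COPIED = """\
User-agent: *
Crawl-Delay: 5

User-agent: MJ12bot
Crawl-Delay: 5
Disallow: /private/
"""

print(crawl_delay_for(MISSING))  # None: the bot section has no delay
print(crawl_delay_for(COPIED))   # 5
print(effective_delay(60))       # 20: higher values are capped
```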

WHAT ARE THE CURRENT VERSIONS OF MJ12BOT?

Current v1.4.x series operating versions of MJ12bot are:

 * v1.4.8 (Current - April 2017)
 * v1.4.7 (Being Replaced with 1.4.8 - End 2018)
 * v1.4.6 (Being Replaced with 1.4.7 - June 2016)
 * v1.4.5 (Phased out - June 2016)
 * v1.4.4 (Phased out - May 2014)

HOW DO I VERIFY REQUESTS ARE FROM YOU?

As a community project, we unfortunately don't have the ability to restrict our
bots to a limited number of IP addresses, as some of our better funded
counterparts do. However, we can send a pre-arranged ident string with all
requests to your site. This can be sent as part of the http or https headers in
the 'CRAWLER-IDENT' field, or as part of the User-Agent string. We will not
share this string with anyone else or send it to any domain or subdomain other
than the one you request, so requests including this string can be validated as
coming from our network. To make use of this facility please contact
bot@majestic12.co.uk with details of your site and the ident you would like us
to send, or if you prefer we can generate a random ident string for you.
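
On the receiving side, validating the ident could look something like the
following Python sketch. Only the 'CRAWLER-IDENT' header name comes from the
description above. The function name, the plain dictionary of headers and the
sample ident value are assumptions made for illustration, and a real web
framework will expose request headers in its own way.

```python
import hmac

def is_verified_mj12bot(headers, expected_ident):
    """Check that a request claims to be MJ12bot and carries the agreed ident.

    `headers` is a mapping of HTTP header names to values; `expected_ident`
    is the string agreed with the bot operators.
    """
    if not expected_ident:
        return False  # nothing agreed, so nothing can be verified
    normalized = {name.lower(): value for name, value in headers.items()}
    agent = normalized.get("user-agent", "")
    if "mj12bot" not in agent.lower():
        return False
    # Option 1: ident sent in the CRAWLER-IDENT header (constant-time compare).
    ident = normalized.get("crawler-ident", "")
    if hmac.compare_digest(ident.encode(), expected_ident.encode()):
        return True
    # Option 2: ident embedded in the User-Agent string.
    return expected_ident in agent

request_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)",
    "CRAWLER-IDENT": "example-ident-123",  # made-up ident value
}
print(is_verified_mj12bot(request_headers, "example-ident-123"))
```

Requests that claim to be MJ12bot but fail this check can then be treated as
unverified rather than being blocked by IP address.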

If you have not been satisfied with the information above then feel free to
contact us: bot@majestic12.co.uk

Majestic-12 Ltd



Faraday Wharf, Holt Street, Birmingham, West Midlands, B7 4BB, UK

© Copyright Majestic-12 Ltd registered in England with company number 05269210.
All Rights Reserved. VAT registration number: GB894864750