
URL: https://www.techradar.com/news/the-story-of-the-fight-to-archive-the-internet


THE STORY OF THE FIGHT TO ARCHIVE THE INTERNET

By Joel Khalili
published December 18, 2021

Meet Brewster Kahle, the internet’s chief librarian


(Image credit: Shutterstock / 300 librarians)


On the same day in 1996, Brewster Kahle founded two separate but closely
connected organizations. The first went on to make him very wealthy, and the
second has earned him not a single dime.

Alexa Internet (often confused with Alexa, the voice assistant) was a service
that crawled the web for metadata and other information, which was then served
up via the browser to help people make sense of the content on a website. 



A few years later, the company was acquired by Amazon in a deal worth $250
million, and converted into an SEO service. However, despite the change of
ownership, Alexa Internet continued to supply the data it collected to the
second organization Kahle had founded: a non-profit called the Internet Archive.

It was Kahle’s vision that the Internet Archive would become a modern version of
the Library of Alexandria, and provide “universal access to all knowledge,” he
told TechRadar Pro.

This digital library, over which he still presides, is now home to many billions
of archived web pages (accessible for free via a service called the Wayback
Machine) and millions of digitized books.

Earlier this year, the Archive celebrated a landmark 25th anniversary, but Kahle
is still unsatisfied with its scope. The project is also facing threats unlike
any it has encountered before.



Brewster Kahle, founder of the Internet Archive (Image credit: Brewster Kahle)


AN EARLY TASTE

Kahle’s preoccupation with both the internet and the exchange of information can
be traced back to the Massachusetts Institute of Technology (MIT), where he
studied for a degree in computer science in the 1980s.

At MIT, Kahle and his cohort had access to the Advanced Research Projects Agency
Network (more commonly known as ARPANET), a precursor to the internet as it
exists today and the source of the first ever email.

ARPANET allowed computers to communicate with one another over telephone lines
using a technique called packet switching, whereby data is broken down into
small chunks, fired across a network and reassembled at its destination. ARPANET
quickly became a hotbed for innovation in the fields of computing and
networking.
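
The packet switching idea described above can be illustrated with a minimal Python sketch. It is not ARPANET's actual protocol: the function names `to_packets` and `reassemble`, the 8-byte chunk size and the use of a simple sequence number are all assumptions made for illustration. The message is cut into numbered chunks, the chunks are shuffled to mimic packets arriving out of order, and the receiver sorts them back into place.

```python
import random


def to_packets(message: bytes, size: int = 8) -> list[tuple[int, bytes]]:
    """Split a message into (sequence number, chunk) pairs of at most `size` bytes."""
    return [
        (seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the message by ordering the chunks by sequence number."""
    return b"".join(chunk for _, chunk in sorted(packets))


message = b"Universal access to all knowledge"
packets = to_packets(message)

# Hypothetical stand-in for the network: packets may arrive in any order.
random.shuffle(packets)

print(reassemble(packets) == message)  # prints True
```

A real network also has to cope with lost, duplicated and corrupted packets, which this sketch leaves out.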

“We were using the ARPANET intranet for pretty much everything,” said Kahle.
“And already we were witnessing some of the problems that would end up playing
out over the next 40 years.”

He described an experiment whereby a mailing list was created that included all
ARPANET users. The idea was to see what would happen if different virtual
communities (represented at the time by a series of smaller mailing lists and
Usenet groups) were thrown into one space.

“It was chaos, anarchy and misinformation - it was terrible!” explained Kahle,
with a wry smile. “We could basically see civil discourse dissolving in front of
our eyes.”

“However, we also saw the power of connecting people across institutions and
across the world, with minimal friction and delay.”

From this time onwards, Kahle says, constructing a grand digital repository for
knowledge became his primary focus. But he lacked almost all of the tools that
would make this possible.

After leaving MIT, he channeled his ambitions into a company called Thinking
Machines, which aimed to commercialize research into parallel computing
architectures. Here, Kahle was lead engineer on a supercomputer called the
Connection Machine (the fastest in the world at the time), which he later used
to devise a form of search engine.



Brewster Kahle (second from the right) and his team, next to a prototype of
Connection Machine-1. (Image credit: Tamiko Thiel)

The next step was to build a network publishing system that could be used to
disseminate digital information. To fill this gap Kahle developed WAIS (short
for Wide Area Information Server), an open system that was adopted by companies
like the New York Times and Britannica, which wanted to control the distribution
of their content in the coming digital age. All of this took place before the
World Wide Web had come into widespread use, it must be remembered.

“I think we were seen as visionaries, but the goal was always to build the
digital Library of Alexandria,” Kahle told us. “And this was not a new concept;
there was already As We May Think, a key paper by Vannevar Bush from 1945, and
Ted Nelson was already doing hypertext and Project Xanadu.”

“In the 1980s, [the library] was something that I thought was already promised,
just not yet delivered. So I set out to build it.”


THE LIBRARY OF ALEXANDRIA 2.0

Since its conception, the Internet Archive has amassed an impressive 70 petabyte
(70,000 terabyte) library of content, comprising 635 billion webpages, but also
34 million books, 14 million audio recordings and more.

This treasure trove of content is stored in high-capacity hard drives at the
Internet Archive headquarters, but is also backed up partially in the
Netherlands and (as a symbolic gesture) in Alexandria, Egypt.

The non-profit has so far preserved the writings of more than 100 million
people, and Kahle has ambitions to increase this figure by a factor of ten. But
with more content now published online than the Archive can hope to keep up
with, the central question becomes: what is worthy of preservation?

“The Internet Archive crawls the World Wide Web in the same way search engines
do,” Kahle explained. “To figure out what to crawl, we work with hundreds of
libraries and librarians, who determine what is important to scrape and at what
frequency. These people build collections on the subjects they are expert in.”

Approximately 3,000 crawls are performed simultaneously every day, each with
different mandates. Some specialize in news, social media or a particular
region, for example, and others are steered by the recommendations of the
public, who submit web pages they believe are worth archiving.



The TechRadar homepage on January 11, 2008 - the first day the site was captured
by the Internet Archive. (Image credit: Internet Archive)

These crawls capture a main web page, but also a number of offshoots that users
can navigate between via the Wayback Machine, creating something that feels much
more alive than a static screenshot.

“It’s a massive undertaking by thousands, if not hundreds of thousands, of
people to decide what should be saved,” said Kahle. “We’re interested in any
signal that can show us what’s worth preserving.”

As well as archiving web pages for posterity, the organization also sees its
role as a tool for safeguarding digital evidence. It has been used by
journalists, for example, to access material an individual or company has later
removed from the public web. It is also fertile ground for students and
academics studying the evolution of online culture and digital communication.

However, keeping the Wayback Machine updated with current data is just one way
in which the organization seeks to achieve its ultimate goal; the digitization
of books is another important facet.


THE BUSINESS OF BOOKS

Asked whether the mission or purpose of the Internet Archive has changed over
its quarter-century history, Kahle returned a resounding “no”. But while the
core mission has remained the same, the way in which people use the resource has
certainly evolved.

During the pandemic, for example, students were locked out of their libraries
and school rooms, and forced to rely on e-learning services and the valiant
efforts of parents. Kahle says the Archive saw the use of its digital book
lending service skyrocket, and received a flood of messages from libraries that
wanted to lend their collections in digital form.

Spurred into action, the Internet Archive launched the National Emergency
Library. Usually, the organization lends one digital book for every physical
copy it owns (a practice known as controlled digital lending), which means a
digital copy can only be loaned out to one person at a time. But under this
emergency scheme, the waitlist-based system was discarded for a period of
fourteen weeks.
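
The owned-to-loaned rule behind controlled digital lending can be sketched in a few lines of Python. This is only an illustration of the principle as described above, not the Internet Archive's actual system: the class `ControlledLending`, its method names and the first-come, first-served waitlist are all hypothetical.

```python
from collections import deque


class ControlledLending:
    """Toy model: at most one digital loan per physical copy owned."""

    def __init__(self, copies_owned: int) -> None:
        self.copies_owned = copies_owned
        self.on_loan: set[str] = set()
        self.waitlist: deque[str] = deque()

    def borrow(self, reader: str) -> bool:
        """Lend a copy if one is free; otherwise add the reader to the waitlist."""
        if reader in self.on_loan:
            return True
        if len(self.on_loan) < self.copies_owned:
            self.on_loan.add(reader)
            return True
        if reader not in self.waitlist:
            self.waitlist.append(reader)
        return False

    def give_back(self, reader: str) -> None:
        """Return a copy and pass it to the next reader waiting, if any."""
        self.on_loan.discard(reader)
        if self.waitlist and len(self.on_loan) < self.copies_owned:
            self.on_loan.add(self.waitlist.popleft())


book = ControlledLending(copies_owned=1)
print(book.borrow("first reader"))   # prints True
print(book.borrow("second reader"))  # prints False, the reader joins the waitlist
book.give_back("first reader")
print("second reader" in book.on_loan)  # prints True
```

In this model, suspending the waitlist amounts to dropping the check against `copies_owned`, so that every request is granted immediately.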

Many students, teachers and other readers celebrated the initiative, but the
Emergency Library was met with disgust by copyright organizations that saw it as
a flagrant breach of the rights of authors, who were also struggling due to the
pandemic. A collective of publishers (including Penguin Random House,
HarperCollins, Hachette and Wiley) is also taking the Internet Archive to court
over “wilful mass copyright infringement”.

“The Internet Archive does not seek to ‘free knowledge’; it seeks to destroy the
carefully calibrated ecosystem that makes books possible in the first place —
and to undermine the copyright law that stands in its way,” assert the
publishers.

As you might imagine, Kahle disagrees. “We’ve been lending books for ten years.
These publishers contend that we are not allowed to lend - and it’s outrageous,”
he said, with uncharacteristic forcefulness.

“What libraries do is buy, preserve and lend materials. But these lawsuits
represent a massive threat to the core function of libraries in the digital
world; publishers are saying you cannot buy, cannot preserve and cannot lend.”

At the time of writing, the lawsuit is in discovery, with further statements to
be delivered in the spring.


AN OPPORTUNITY LOST

Over the years, the Internet Archive has been sustained by a combination of
funds from Kahle’s own pocket, fees charged to libraries for digitization
services, and contributions from members of the public.

However, keeping its services operational will become more and more expensive as
the library expands, unless technical advances cut the cost of data storage,
server hosting and the other technologies on which the non-profit relies.

Although Kahle says his personal wealth is sufficient to guarantee the longevity
of the Internet Archive (or at least its trove of data), he recently put out a
call for donations to help fight both the ongoing lawsuit and other obstacles to
the free flow of information.

“The internet community has not done enough to build reliable and responsible
organizations to support the digital world. And we could see the dangers from
the very beginning,” said Kahle, referring both to the crisis of misinformation
and the stranglehold of Big Tech.

“If we do not strike a good balance, we could end up with an information
environment where everything we read is monitored and vetted by a small group of
companies and governments. We will have lost the opportunity the internet has
given us.”

To highlight these issues, the Internet Archive recently launched the Wayforward
Machine, a satirical take on the Wayback Machine that promises to let users
“visit the future of the internet”.



A vision of the future of the internet, courtesy of the Wayforward Machine.
(Image credit: Internet Archive)

Plugging a URL into the Wayforward Machine generates a page plastered with an
endless stream of pop-ups, some of which demand payment or personal information,
while others simply note that access to information is denied. The message is
hardly subtle.

“We don’t hold the levers of power, but we run a library. Although a library
cannot solve all these problems, it’s a necessary component for a digital
ecosystem. We need libraries to be supported, used and defended. If we do not
defend our open institutions, they will be crushed,” said Kahle.

“We can have platforms and systems that are driven by altruism, not advertising
models. We can have a world with many winners, where people participate, learn
and find new communities.”

Asked whether he is optimistic about reaching this utopian ideal, Kahle nodded:
“But we need to really want it.”





Joel Khalili
News and Features Editor

Joel Khalili is the News and Features Editor at TechRadar Pro, covering
cybersecurity, data privacy, cloud, AI, blockchain, internet infrastructure, 5G,
data storage and computing. He's responsible for curating our news content, as
well as commissioning and producing features on the technologies that are
transforming the way the world does business.






TechRadar is part of Future US Inc, an international media group and leading
digital publisher.
