The Internet Archive contains over 100 Terabytes of compressed
data. This data is collected in collaboration with Alexa
Internet. Alexa sends its crawlers out into the web roughly
once every 2 months, retrieves copies of virtually everything
it encounters, and donates a copy of this data to the Internet
Archive. During periods of particular interest, such as a
presidential election or extraordinary breaking news, relevant
sites will be crawled more frequently, roughly every 2 to
8 hours.
The Internet
Archive began archiving data in 1996. The archive grows at
a rate of approximately 70 megabytes per second. A data pool
of this magnitude offers a myriad of research ideas worth
exploring and we encourage you to do so!
Archive Infrastructure
The Archive
data is stored on approximately 150 desktop computers, each
containing four 160 GB hard drives. These drives are mounted
on /0, /1, /2, /3. In general, drives /1, /2, and /3 are filled
to capacity with the archived files. Drive /0, however, is
only half used. The other half (~77GB) is reserved for temporary
space that can be used for data manipulation. The temporary space
is not always located on drive /0, but the alias /0/.final/tmp
always refers to the actual temporary space on the host.
Each computer
host has a name in the general form ia00###, where ### can
be in the range 100 - 177, 200 - 277, or 300 - 337. The digits
### refer to the physical location of the machine within the Archive
computer cluster. The computers are situated on rows of racks
in the San Francisco Mission District Facility. The first
digit of ### refers to the rack; the second refers
to the shelf; and the third refers to the machine on the shelf.
The entire listing of hosts is stored within the environment
variable $ARCS. Subset listings of machines are stored in
the environment variables $rack1, $rack2, and $rack3, which
contain the listings of the machines numbered 100 - 177, 200 - 277,
and 300 - 337, respectively.
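The naming scheme above can be illustrated with a short Python sketch. It is only an illustration of the rack/shelf/machine reading of the three digits; the function name parse_host is hypothetical and not part of any Archive tool, and the range check uses only the three ranges listed above.

```python
import re

# Suffix ranges given in the text: 100-177, 200-277, 300-337.
VALID_RANGES = [(100, 177), (200, 277), (300, 337)]

# A host name is "ia00" followed by exactly three digits.
HOST_PATTERN = re.compile(r"^ia00(\d)(\d)(\d)$")


def parse_host(name):
    """Return (rack, shelf, machine) for a name like 'ia00213'.

    Raises ValueError if the name does not have the form ia00### or if
    the three-digit suffix falls outside the documented ranges.
    """
    match = HOST_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not of the form ia00###: {name!r}")
    rack, shelf, machine = (int(digit) for digit in match.groups())
    number = rack * 100 + shelf * 10 + machine
    if not any(low <= number <= high for low, high in VALID_RANGES):
        raise ValueError(f"suffix {number} is outside the documented ranges")
    return rack, shelf, machine


print(parse_host("ia00213"))  # (2, 1, 3)
```

Here ia00213 is read as rack 2, shelf 1, machine 3.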
Research.archive.org houses the personal files of the users
on the system. Each user has access to the directory /home/<login>
for file storage. Since research.archive.org is NFS mounted
on all of the hosts, a user's home directory is always accessible
from any remote host in the cluster as if the home directory
were physically stored on each individual host. A change made to
a home-directory file from one host is immediately visible from
every other host, because each host mounts the very same (and
only) research.archive.org host.
Individual
hosts can be accessed using the remote shell (rsh) UNIX command.
The hosts in the cluster have an auto-authenticating script,
so the secure shell (ssh) command is unnecessary. Access to
the hosts is limited depending on the type of user account
that is held. User accounts directly on research.archive.org
have access to all of the machines located in $rack1.
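As a rough sketch of how a command might be run across the accessible hosts, the Python below builds and runs rsh invocations for the hosts listed in $rack1. It assumes the variable holds a whitespace-separated list of host names, which the text does not state, and the helper names (hosts_from_env, rsh_argv, run_on_hosts) are hypothetical.

```python
import os
import subprocess


def hosts_from_env(var="rack1"):
    """Split the host listing held in an environment variable.

    Assumption: the listing is whitespace-separated.
    Returns an empty list if the variable is not set.
    """
    return os.environ.get(var, "").split()


def rsh_argv(host, command):
    """Build the argument list that runs `command` on `host` via rsh."""
    return ["rsh", host, command]


def run_on_hosts(hosts, command):
    """Run `command` on each host in turn; return {host: standard output}."""
    results = {}
    for host in hosts:
        completed = subprocess.run(
            rsh_argv(host, command),
            capture_output=True,
            text=True,
            check=False,  # a failing host should not stop the loop
        )
        results[host] = completed.stdout
    return results
```

For example, run_on_hosts(hosts_from_env("rack1"), "ls /0/.final/tmp") would list the temporary space on every machine in the first rack, one host at a time.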

How the Data is Stored
All of the archived web data is stored in ARC and DAT files. The
ARC files contain the actual archived documents (html, gif, jpeg,
ps, etc.), each preceded by some header information about the document.
These archived files are individually compressed and individually
accessible. There are a number of AV
data mining tools provided for this purpose.
Each ARC file has a corresponding DAT file. The DAT files contain
meta-information about each document: outward links that the
document contains, the document file format, the document size, etc.
ARC and DAT files are indexed with CDX files. Each host provides
an index, complete.cdx, located
in /0/tmp/. This index may be joined against path_index.txt,
located in the same directory, for the full path of the ARC
file containing the archived document.
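The join described above might look roughly like the following Python sketch. The layout of path_index.txt is not given here, so the code assumes each line holds an ARC file name followed by its full path, separated by whitespace; that layout, the function names, and the arc_field parameter are all assumptions to be checked against the real files.

```python
def load_path_index(lines):
    """Map ARC file name -> full path.

    Assumption: each line is "<arc file name> <full path>", separated by
    whitespace. Lines with fewer than two fields are ignored.
    """
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            index[parts[0]] = parts[1]
    return index


def join_with_paths(cdx_lines, path_index, arc_field):
    """Yield (fields, full_path) for each CDX data record.

    `cdx_lines` are data records only (the legend line already removed).
    `arc_field` is the zero-based position of the ARC file name within a
    record; read it off the legend rather than hard-coding it.
    Records whose ARC file is missing from the index get a path of None.
    """
    for line in cdx_lines:
        fields = line.split()
        if len(fields) <= arc_field:
            continue  # too short to contain the ARC file name
        yield fields, path_index.get(fields[arc_field])
```

Loading the path index into a dictionary first keeps the join to a single pass over the much larger CDX file.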
In addition to the indices located on each host, the archive
also contains an archive-wide index split across 6 remote
hosts. These are aliased as index1 - index6. The CDX file
on each of these hosts is located in /0/wayback.cdx.gz and
is formatted slightly differently from the other CDX files
located on each remote host. Refer to the legend on the first
line of any CDX file for information on how to interpret the
data.
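Because the column layout differs between CDX files, one option is to let the legend line drive the parsing. The Python sketch below assumes the legend is a line whose first token is CDX, followed by one code per column, and that fields are whitespace-separated. Check these assumptions against a real file; the function names are hypothetical.

```python
import gzip


def open_cdx(path):
    """Open a CDX file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def parse_legend(legend_line):
    """Return the list of column codes named on the legend line.

    Assumption: the line looks like "CDX <code> <code> ...".
    """
    tokens = legend_line.split()
    if not tokens or tokens[0] != "CDX":
        raise ValueError("first line does not look like a CDX legend")
    return tokens[1:]


def read_cdx(lines):
    """Yield one dict per record, keyed by the codes from the legend.

    Records whose field count does not match the legend are skipped.
    """
    iterator = iter(lines)
    try:
        codes = parse_legend(next(iterator))
    except StopIteration:
        return  # empty input: nothing to yield
    for line in iterator:
        values = line.split()
        if len(values) == len(codes):
            yield dict(zip(codes, values))
```

The same reader can then be pointed at either kind of index, for example read_cdx(open_cdx("/0/wayback.cdx.gz")) on one of the index hosts or read_cdx(open_cdx("/0/tmp/complete.cdx")) on an ordinary host.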