Using git For Large Scale Digital Archiving: An Outline

Here are some notes on how one might re-architect Internet Archive infrastructure to meet some additional goals:

  • easy to set up and replicate
  • provide versioning and transactions
  • handle more media types well
  • better ingest/locate/read apis
  • better search

The current architecture looks like this:
[Diagram: iaarch.png]

The diagram is simplified a lot. There are currently about 1800 nodes in the cluster, most of which are storage nodes (low-power 1U nodes with four 1TB hard drives). There are about 300 deriver nodes, which are used for crunching things like PDFs and H.264 video. There are 5 www frontends, hidden behind a couple of load balancers, and the database server has at least one read-only secondary.

What I like about the current infrastructure:

  • Easy to add more storage. Some other archival solutions do not scale well, since they insist all hard drives be connected to the same machine. This starts to break down at the petabox scale.
  • Easy to add more bandwidth. Currently IA is pushing 5+ Gbps of outbound bandwidth. Every storage node runs an Apache server, which lessens load on the homenode; that load is a problem with other archival systems.
  • Database hits are not required to locate an item on the cluster. When an item is requested through the Locator service, a multicast is sent, and machines that have the item will respond. This lessens load on the DB server, which is important when getting thousands of web requests per second.
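
To make the Locator idea concrete, here is a minimal in-process sketch. It does not use real network multicast and it is not IA's actual Locator code; the class name `StorageNode`, the function `locate`, and the hostnames are all made up for illustration. The point is only that the answer comes from the nodes themselves, with no database lookup.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical stand-in for one storage node and the items on its disks."""
    hostname: str
    items: set = field(default_factory=set)

    def on_locate(self, item_id):
        # A real node would answer a multicast packet; here we just return
        # our hostname if we hold the item, or None to stay silent.
        return self.hostname if item_id in self.items else None


def locate(item_id, nodes):
    """Ask every node (simulated multicast) and collect the ones that answer."""
    replies = (node.on_locate(item_id) for node in nodes)
    return [host for host in replies if host is not None]


# Hypothetical three-node cluster; item-a lives on a primary and a secondary.
cluster = [
    StorageNode("ia-node-001", {"item-a", "item-b"}),
    StorageNode("ia-node-002", {"item-a"}),
    StorageNode("ia-node-003", {"item-c"}),
]
```

Calling `locate("item-a", cluster)` returns both hosts that hold the item, so a frontend could redirect the request to either one.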

What I find interesting about the current infrastructure:
  • RAID is not used. Items are backed up on to a secondary machine when added to the archive.
  • This is mostly due to “RAID is hard to get right” and cost
  • This means there are two machines (and two apaches) ready to serve the same content.
  • One machine can be taken down for repair while the content is still online.
  • I would like to see use of either RAID or maybe RAID-Z

An idea on how to re-architect things using git as a storage backend to provide versioning and transactions:
  • git is the version control system used for the Linux kernel.
  • git is a totally new way to operate on data. Read this if you are a non-believer.
  • We could keep the infrastructure mostly the same as IA, but store items as git repositories. This would not be a large architecture change.
  • git would become a supported access protocol, in addition to http, ftp, and rsync. Backups could be as simple as a git pull. We could git clone the entire cluster.
  • We would get versioning!
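
As a rough sketch of "backups could be as simple as a git pull", the helper below picks between `git clone` (first backup) and `git pull --ff-only` (refresh). The function names and the idea of driving git from Python are my assumptions for illustration, not part of any existing IA tooling.

```python
import subprocess
from pathlib import Path


def backup_command(source_url, dest):
    """Return the git argv that creates or refreshes a backup copy in dest."""
    dest = Path(dest)
    if (dest / ".git").exists():
        # Existing backup: fetch and fast-forward to the primary's history.
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    # First run: take a full copy, including all history.
    return ["git", "clone", source_url, str(dest)]


def run_backup(source_url, dest):
    """Run the backup; raises CalledProcessError if git fails."""
    subprocess.run(backup_command(source_url, dest), check=True)
```

Using `--ff-only` means the backup copy never invents merge commits of its own; if the histories have diverged, the pull fails loudly instead.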

Changes needed to repo.git to make it useful in an archive cluster:
  • Change reguser.cgi to tie into the existing user database (talk to dbserver)
  • Change regprog.cgi to work in a cluster environment. Repositories are initialized in /{0-4}/items/id/id.git on a primary node (talk to catalog/homenode)
  • Use a post-commit hook to queue backup and derive tasks (talk to catalog). For pushes to a server-side repository, post-receive is the hook that actually fires.
  • Change gitweb to show custom view of movie, audio, texts (book), and photo collections. Software collections would show standard gitweb view.
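
Two of the changes above can be sketched in a few lines: mapping an item id to its repository path, and having a hook queue backup and derive tasks. Everything named here (`repo_path`, `queue_tasks`, the JSON task format) is hypothetical; the real catalog interface is not described in these notes, so the queue is just any writable file-like object.

```python
import json
from pathlib import PurePosixPath


def repo_path(item_id, disk):
    """Map an item id to /{0-4}/items/id/id.git on a primary node."""
    if disk not in range(5):
        raise ValueError("disk must be 0-4")
    return PurePosixPath("/", str(disk), "items", item_id, item_id + ".git")


def queue_tasks(item_id, queue):
    """Append one backup and one derive task for item_id to the queue.

    The queue is any object with a write() method. A real hook would talk
    to the catalog instead; that interface is an unknown here.
    """
    for task in ("backup", "derive"):
        queue.write(json.dumps({"task": task, "item": item_id}) + "\n")
```

A hook script would call `queue_tasks` with the item id of the repository it runs in, and the catalog would pick the tasks up from there.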

I don’t think this would take too long to implement, but I’m lacking co-conspirators these days. Maybe when shag makes it to SF we will have to knock something out :)
