Precomputed Caching – Web Development


One of the other big architecture pieces we've added to help us scale was this notion of a precomputed cache. We found ourselves running the queries that generate Reddit's hot page over and over and over again. We might cache the result for a minute, but once that minute expired we had to recalculate it. We had a kind of job that would run, compute the page, and put the stored value in memcached. That worked okay, but then we had to do it for all of our users' pages too. Every user had their own listings of things they'd submitted and liked and their top things, and every Reddit had a new page and a hot page and a bunch of different sorts. So we started precomputing everything.
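As a rough sketch of that compute-and-store job in Python (assuming the python-memcached client; the query helper and key format are hypothetical stand-ins, not Reddit's actual code):

```python
import memcache

# Hypothetical memcached client pointed at the cache tier.
mc = memcache.Client(['127.0.0.1:11211'])

def run_hot_query(subreddit):
    """Hypothetical stand-in for the expensive ranking SQL that
    builds a hot listing from the link database."""
    return []  # imagine a list of link ids, ordered by hotness

def precompute_hot_page(subreddit):
    # Compute the listing once, then store the finished result so
    # page renders can read memcached instead of re-running the query.
    mc.set('hot:%s' % subreddit, run_hot_query(subreddit))
```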
The way we did that is we had this whole other database stack: replicas of the link database, basically more link databases. They could lag a little bit; it wasn't a big deal. Every time a vote came in, we'd put it in this queue (a queue is just a list of things to be done). We had one machine that basically managed that huge list of things, and a couple of other machines that we called the precompute servers.
What these servers would do is take jobs off the queue: this link has been voted on. Actually, what the apps would do is, when a link was voted on, add a number of jobs to the queue. The jobs might be: recompute that Reddit's front page, recompute this user's liked page, recompute this user's top page, recompute that Reddit's top page. There are all sorts of different listings that are affected by a particular vote.
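A minimal sketch of that fan-out, with the queue modeled as an in-process list (the real queue lived on its own machine, and the job names here are made up for illustration):

```python
from collections import deque

# "A queue is just a list of things to be done" -- modeled here as an
# in-process deque; Reddit's was managed by a dedicated machine.
precompute_queue = deque()

def on_vote(subreddit, voter):
    # One vote dirties several precomputed listings, so the app
    # enqueues one recompute job per affected listing.
    precompute_queue.append(('hot_page', subreddit))
    precompute_queue.append(('top_page', subreddit))
    precompute_queue.append(('liked_page', voter))
    precompute_queue.append(('user_top_page', voter))
```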
These precompute servers would pull those jobs off and run the corresponding queries against the database. Just mercilessly, as fast as they could, they would take a job off the queue and run the query against these databases. These databases ran really, really hot but handled no real-time requests; no request from the Internet ever touched these precompute machines. Only the precompute servers would touch the precompute databases. When a job was done running, we would take the results and store them in memcached.
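Continuing the hypothetical names from the sketches above (mc and precompute_queue), a precompute server's loop might look roughly like this, with run_listing_query standing in for the real SQL against the replicas:

```python
import time

def run_listing_query(job_type, key):
    """Hypothetical query run against the precompute replicas,
    never against the databases serving real traffic."""
    return []

def precompute_worker():
    # Take a job off the queue, run its query, cache the result,
    # and repeat as fast as possible.
    while True:
        if not precompute_queue:
            time.sleep(0.1)  # queue drained; back off briefly
            continue
        job_type, key = precompute_queue.popleft()
        mc.set('%s:%s' % (job_type, key), run_listing_query(job_type, key))
```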
That way, almost every page you looked at on Reddit would be fetched out of memcached. There were very few things you could do on Reddit that would actually directly manipulate a database. Once we got to that point of scaling, things got a lot easier. The databases are really just kind of last-resort primary sources of data; any data you can access on Reddit in real time is actually served out of memcache.
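In sketch form, the read path is just a cache lookup; the fallback branch here is a guess at what "last resort" could look like, not a confirmed detail:

```python
def get_listing(job_type, key):
    # Real-time requests read the finished listing straight out of
    # memcached; the databases are only a last-resort source.
    listing = mc.get('%s:%s' % (job_type, key))
    if listing is None:
        # Rare cache miss: fall back to a query and repopulate.
        listing = run_listing_query(job_type, key)
        mc.set('%s:%s' % (job_type, key), listing)
    return listing
```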
Every single listing on the whole site is precomputed for Reddit and stored in memcached. This is the reason why you now can't go back beyond about a thousand links on any particular listing. It used to be that you could go to Reddit's front page and hit next, next, next, next, next and go all the way back to the beginning of time, which would just really, really trounce our databases, do a lot of damage, slow the site down, and so on. You can't do that anymore. We only store the top thousand for each sort, which is one of the limitations of doing this precompute thing, but on the upside the cycle is very, very fast. There are very few legitimate reasons to go all the way back to the beginning of time on Reddit anyway.
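The truncation itself is nearly one line; sketching it with the same hypothetical names:

```python
MAX_LISTING = 1000  # only the top ~thousand links per sort are kept

def store_listing(job_type, key, links):
    # Capping the stored listing is what keeps the precompute cycle
    # fast, and it is why you can't page back past ~1000 links.
    mc.set('%s:%s' % (job_type, key), links[:MAX_LISTING])
```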
This worked out really nicely, and the site to this day still has this general structure, although a lot of the technologies have changed. That's what we're going to talk to Neil about.
