Google engineer Ben Treynor has posted an explanation of Tuesday's Gmail outage, which lasted nearly two hours and made email addicts very nervous.
"We took a small fraction of Gmail's servers offline to perform routine upgrades," he explains on the Gmail blog. "We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers -- servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers."
The problem was fixed once Google brought more request routers online and spread the traffic among them. Google says it is tweaking its architecture so that the problem doesn't happen again.
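The chain reaction Treynor describes can be illustrated with a toy model. The sketch below is not Google's code: the function name, the even split of traffic, and all of the numbers are assumptions chosen only to show how a pool running close to capacity can tip over when a few machines are removed, and why adding machines restores service.

```python
def simulate_cascade(num_routers, capacity_per_router, total_load):
    """Return how many routers are still serving once the pool settles.

    Toy model (hypothetical): traffic is split evenly across healthy
    routers. If each router's share exceeds its capacity, one router
    drops out and the remaining routers must absorb its traffic.
    """
    healthy = num_routers
    while healthy > 0:
        share = total_load / healthy
        if share <= capacity_per_router:
            return healthy  # every remaining router can cope
        healthy -= 1  # an overloaded router stops accepting traffic
    return 0  # the whole pool has collapsed


# Made-up figures: each router handles 100 units, total traffic is 9,500.
print(simulate_cascade(100, 100, 9500))  # full pool
print(simulate_cascade(94, 100, 9500))   # six routers offline for upgrades
print(simulate_cascade(110, 100, 9500))  # extra routers brought online
```

In this model the pool either holds or collapses entirely: once the per-router share passes capacity, every router that drops out raises the share for the rest, so nothing stops the slide short of adding capacity.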