At the last count, Facebook had more than 800 million users and more than 2,000 employees. How does a company with this many eyeballs watching manage to operate like an Agile startup? What’s its secret, and can we emulate it?
I was practicing Agile development methodologies a decade before they came to popular attention. As someone who has engineered enterprise software, managed teams of engineers, and been a product manager, here’s my take...
Requirements? We don’t require no Stinkin’ Requirements!
A traditional software development lifecycle starts with some sort of requirement specification. This is usually the job of individuals called PMs: product managers (or program managers, in Microsoft parlance). PMs document what’s needed and work with the development team to get it implemented. They prioritize the various requirements, and should be sufficiently technically minded to understand engineering constraints.
However, in the Facebook culture, PMs don’t own the specification; engineers do. The PMs kick off the process, but the specification is a living document, with changes usually made by engineering staff. The Facebook culture encourages PMs to take a back seat, for example, in cross-team meetings.
Engineers also have a big say in setting the priorities, a task traditionally performed by PMs. Centralized resourcing decisions are rare. Instead, engineers essentially choose the projects on which they’d like to work next; PMs lobby groups of engineers to try to get them excited about an idea.
The Facebook approach treats engineers as responsible adults, not peons at the will of the PMs. This tends to avoid the “them and us” sentiment that can be so corrosive in other organizations.
The downside of the traditional approach can be a lack of ownership when a problem occurs. Software engineers are typically highly linear-thinking individuals, so there’s always the possibility that an error or ambiguity in the specification means the end result isn’t what’s actually required, or at least that time is wasted discovering the problem later than it could have been.
Product managers in some organizations are simply too full of their own importance. Cultures that allow PMs to “throw a spec over the wall” suffer bigger delays than those where there’s a collaborative approach to building the product. Facebook’s culture tends to avoid these pitfalls.
PMs also need to be reasonably technical; engineers quickly lose respect for PMs who ask for six impossible things before breakfast. Facebook’s inclusive approach selects for the more technical PMs.
However, Facebook suffers from a lack of attention to chronic, glaring problems — e.g., spam, API bugs, troublesome administration of fan pages, and the often-broken e-mail system. There are also frequent, widespread complaints that the company often changes the website’s user interface for no good reason. I tend to think that a culture of engineers choosing to implement their favorite features is at the heart of this. It sounds to me like this process needs more adult supervision.
Agile Engineering; Uncommon Quality Process
Facebook developers use an Agile development methodology, with heavy reliance on automated testing and mandatory code reviews (also known as walkthroughs).
Unlike environments such as Microsoft, there are no teams dedicated to quality assurance (QA). Engineers are personally and publicly responsible for their own quality. Statistics relating to any bug and its root causes are visible to all, including the analysis of which engineer “caused” the bug.
In other words, metrics that might normally be visible only to management are visible to all Facebook engineers. This is one essential characteristic of a “self-managed team,” in the Agile-development sense of the phrase. While introducing bugs isn’t a firing offense per se, peer pressure tends to keep the number of bugs lower than it might otherwise be.
Ultimately, persistent poor performers are ruthlessly weeded out from the company. The euphemistic cause often used is, “You’re not a good culture fit.”
The Facebook culture strengthens bonds between team members with events such as the All-Night Hackathon, during which “Facebook engineers create working prototypes of projects that they always wanted to build but couldn’t ever pursue during their regular hours,” as well as the Hackamonth, which seeks to increase individual engineers’ mobility between teams. These are in contrast to Google’s way of mixing it up: the famous 20% time.
Facebook engineers continue to be responsible for bugs in code they’ve written, even after moving on to other projects. However, some of the more menial bug-fixing duties are given to new employees, as part of their six-week initiation into the Facebook codebase (the so-called Engineering Boot camp).
Many code reviews happen as a matter of course, using an automated workflow; module owners are notified when another engineer checks in a change to their code, which prompts them to review the change immediately.
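To make the idea concrete, here’s a minimal sketch of how a check-in might be routed to module owners. It isn’t Facebook’s actual tooling, which I haven’t seen; the ownership table, the path patterns, and the function name reviewers_for are all my own assumptions for illustration.

```python
from fnmatch import fnmatch

# Hypothetical ownership table: path pattern -> owners to notify.
OWNERS = {
    "feed/*": ["alice"],
    "photos/*": ["bob", "carol"],
    "*.sql": ["dave"],
}

def reviewers_for(changed_paths, author, owners=OWNERS):
    """Return the sorted owners to notify for a change, excluding its author."""
    notify = set()
    for path in changed_paths:
        for pattern, people in owners.items():
            if fnmatch(path, pattern):
                notify.update(people)
    # Authors don't review their own changes.
    notify.discard(author)
    return sorted(notify)

# Example: bob touches a feed module and a schema file under photos/.
print(reviewers_for(["feed/ranker.py", "photos/schema.sql"], author="bob"))
```

The point of the design is that nobody has to remember to ask for a review: the check-in itself triggers the notification.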
Automated testing plus code reviews is a powerful combination: one that far too many engineering departments ignore.
In my experience, the trick with automated testing is to artfully choose a suitable mix of unit-, whitebox-, and blackbox-testing. The code review process is also an art form, but extremely effective when done right. Making code reviews mandatory is a genius move.
However, the lack of dedicated QA staff is deeply worrying. Having worked in both types of organizations, my take is that the fresh perspective provided by QA people results in higher code quality. Engineers who believe they can get away without QA tend to have an exaggerated view of their own “superstar” skills. While I’m a great believer in hiring superstar engineers, they are, by their nature, rare.
I also love the positive use of peer pressure to reinforce desired behavior, but this idea is sheer poison to many pre-existing organizational cultures — organizational antibodies will seek it out, destroying it on sight.
With the user count approaching a billion, any errors introduced into the Facebook service become very visible, very quickly. Avoiding such hold-the-front-page, uh-oh moments is a priority for the company.
Facebook employs as many operations people as it does software engineers. Culturally, the operations staffers are well-respected and highly qualified.
The operations team is responsible for smoothly rolling out new releases of the Facebook software to its huge cloud server farm, totaling tens of thousands of Web, cache, Hadoop, MySQL, and ancillary servers, with 35 PB of storage. The team does that by following a carefully planned, staged rollout process, incorporating several “go/no-go” checkpoints along the way.
Typically, the Facebook software is updated every week; builds are released every Tuesday. After clearing a few days of internal alpha testing — what Microsoft used to call dogfooding, before the word was banned — builds are rolled out to a tiny subset of public servers.
In essence, Facebook enforces a beta test on a few hundred users every week. The operations team watches those servers for problems, before starting the rollout to the other tens of thousands of servers.
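A staged rollout with go/no-go checkpoints can be sketched roughly as follows. This is only an illustration under my own assumptions: the stage names, the server counts, and the deploy and healthy callbacks are hypothetical stand-ins, not Facebook’s real process.

```python
def staged_rollout(stages, deploy, healthy):
    """Deploy to each stage in order; stop at the first failed checkpoint.

    stages  -- list of (name, server_count) pairs, smallest audience first
    deploy  -- callback that pushes the build to one stage
    healthy -- callback returning True ("go") or False ("no-go") for a stage
    Returns (completed_stage_names, halted_stage_name_or_None).
    """
    completed = []
    for name, servers in stages:
        deploy(name, servers)
        if not healthy(name):
            # "No-go": later, larger stages never receive the build.
            return completed, name
        completed.append(name)
    return completed, None

# Hypothetical stages, loosely mirroring the sequence described above.
stages = [("internal", 10), ("public-subset", 100), ("everyone", 10000)]
pushed = []
done, halted = staged_rollout(
    stages,
    deploy=lambda name, servers: pushed.append(name),
    healthy=lambda name: name != "public-subset",  # simulate a bad beta
)
print(done, halted, pushed)
```

In this simulated run the build reaches the internal and public-subset stages, fails the second checkpoint, and never touches the remaining servers.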
The problems for which the operations team watches include not only logged errors, performance, memory utilization, and the like, but also a statistical analysis of user behavior. If these hard data show users interacting with the site in a different way than normal, this can be a sign that all is not well.
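I don’t know which statistics Facebook actually uses, but one simple way to flag “users behaving differently than normal” is to compare a behavioral metric on the newly updated servers against its historical baseline. The sketch below assumes a single hypothetical metric (say, comments posted per session) and a three-standard-deviation threshold; both are my own choices for illustration.

```python
from statistics import mean, stdev

def behavior_shift(baseline, observed, threshold=3.0):
    """Return True if `observed` lies more than `threshold` sample
    standard deviations away from the mean of `baseline`."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any difference counts as a shift.
        return observed != mu
    return abs(observed - mu) / sigma > threshold

# Hypothetical daily averages of the metric from previous weeks.
baseline = [4.0, 4.2, 3.9, 4.1, 4.0, 3.8, 4.0]

print(behavior_shift(baseline, 4.1))  # within normal variation
print(behavior_shift(baseline, 2.9))  # a sharp drop worth investigating
```

A real system would track many metrics at once and account for time-of-day and seasonal effects, but the principle is the same: a statistically unusual change in behavior is treated as a symptom, even when no error has been logged.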
Engineers who contributed changes to this week’s build are required to be in the office and on-call during the rollout. Again, the culture creates peer pressure by naming and shaming engineers who flout this rule.
A highly visible cloud service would do well to emulate the Facebook culture of valuing its operations staff.
Several Facebook engineers have said that they believe their use of dogfooding and the initial, quasi-beta rollout are adequate substitutes for a dedicated QA team. While they may have found that to be so with Facebook’s unique set of measures and constraints, I dare say it wouldn’t be acceptable in most enterprise environments.
In my experience, Facebook is doing a number of things right. However, other parts of this story leave much to be desired — or at least, won’t be helpful in other environments.
Acknowledgements: I’m indebted to Yee Lee, and the fine denizens of his blog, as well as those of Quora, Reddit, and Hacker News, where this topic has been discussed and dissected. Their ranks notably included many faceless Facebook employees who contributed data on processes used inside the company. Facebook engineers also frequently publish interesting internal information on their Notes page (albeit seemingly sanitized by Facebook PR).