richi

What Facebook Can Teach Us About Making Software

by Administrator on 03-11-2011 03:08 PM

At the last count, Facebook had more than 800 million users and more than 2,000 employees. How does a company with this many eyeballs watching manage to operate like an Agile startup? What’s its secret, and can we emulate it?

I was practicing Agile development methodologies a decade before they came to popular attention. As someone who has engineered enterprise software, managed teams of engineers, and been a product manager, here’s my take...

Requirements? We don’t require no Stinkin’ Requirements!

A traditional software development lifecycle starts with some sort of requirements specification. This is usually the job of PMs: product managers (or program managers, in Microsoft parlance). PMs document what’s needed and work with the development team to get it implemented. They prioritize the various requirements, and should be technically minded enough to understand engineering constraints.

However, in the Facebook culture, PMs don’t own the specification; engineers do. The PMs kick off the process, but the specification is a living document, with changes usually made by engineering staff. The Facebook culture encourages PMs to take a back seat, for example, in cross-team meetings.

Engineers also have a big say in setting priorities, a task traditionally performed by PMs. Centralized resourcing decisions are rare. Instead, engineers essentially choose the projects on which they’d like to work next; PMs lobby groups of engineers to try to get them excited about an idea.

Analysis

The Facebook approach treats engineers as responsible adults, not peons at the will of the PMs. This tends to avoid the “them and us” sentiment that can be so corrosive in other organizations.

The downside of the traditional approach can be a lack of ownership when a problem occurs. Software engineers are typically highly linear-thinking individuals, so there’s always the possibility that an error or ambiguity in the specification means the end result isn’t what’s actually required, or at least that time is wasted discovering the problem later than necessary.

Product managers in some organizations are simply too full of their own importance. Cultures that allow PMs to “throw a spec over the wall” suffer bigger delays than those where there’s a collaborative approach to building the product. Facebook’s culture tends to avoid these pitfalls.

PMs also need to be reasonably technical; engineers quickly lose respect for PMs who ask for six impossible things before breakfast. Facebook’s inclusive approach selects for the more technical PMs.

However, Facebook suffers from a lack of attention to chronic, glaring problems: spam, API bugs, troublesome administration of fan pages, and the often-broken e-mail system, for example. There are also frequent, widespread complaints that the company often changes the website’s user interface for no good reason. I tend to think that a culture of engineers choosing to implement their favorite features is at the heart of this. It sounds to me like this process needs more adult supervision.

Agile Engineering; Uncommon Quality Process

Facebook developers use an Agile development methodology, with heavy reliance on automated testing and mandatory code reviews (also known as walkthroughs).

Unlike environments such as Microsoft, there are no teams dedicated to quality assurance (QA). Engineers are personally and publicly responsible for their own quality. Statistics relating to any bug and its root causes are visible to all, including the analysis of which engineer “caused” the bug.

In other words, metrics that might normally be visible only to management are visible to all Facebook engineers. This is one essential characteristic of a “self-managed team,” in the Agile-development sense of the phrase. While introducing bugs isn’t a firing offense per se, peer pressure tends to keep the number of bugs lower than it might otherwise be.

Ultimately, persistent poor performers are ruthlessly weeded out from the company. The euphemistic cause often used is, “You’re not a good culture fit.”

The Facebook culture strengthens bonds between team members with events such as the All-Night Hackathon, during which “Facebook engineers create working prototypes of projects that they always wanted to build but couldn’t ever pursue during their regular hours,” as well as the Hackamonth, which seeks to increase individual engineers’ mobility between teams. These are in contrast to Google’s way of mixing it up: the famous 20% time.

Facebook engineers continue to be responsible for bugs in code they’ve written, even after moving on to other projects. However, some of the more menial bug-fixing duties are given to new employees, as part of their six-week initiation into the Facebook codebase (the so-called Engineering Boot camp).

Automated and semi-automated testing uses the PHPUnit, Watir, JSSpec, and JUnit frameworks, plus the Boost test libraries and HipHop runtime logging, along with some internally developed toolsets.

Many code reviews happen as a matter of course, using an automated workflow; module owners are notified when another engineer checks in a change to their code, which prompts them to review the change immediately.
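The notification step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Facebook’s actual tooling: the ownership table, the path patterns, and the `reviewers_for` function are all hypothetical names invented for this example.

```python
from fnmatch import fnmatch

# Hypothetical ownership table (an assumption for illustration):
# glob pattern for a module's files -> the engineer who owns that module.
OWNERS = {
    "feed/*": "alice",
    "photos/*": "bob",
    "lib/cache/*": "carol",
}


def reviewers_for(changed_files, author, owners=OWNERS):
    """Return the module owners to notify about a check-in.

    The author is excluded: engineers are not asked to review
    their own changes to their own modules.
    """
    notify = set()
    for path in changed_files:
        for pattern, owner in owners.items():
            if fnmatch(path, pattern) and owner != author:
                notify.add(owner)
    return sorted(notify)
```

For instance, if `alice` checks in changes to `feed/ranking.php` and `photos/upload.php`, only `bob` is notified, because she already owns the feed module.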

Analysis

Automated testing plus code reviews is a powerful combination: one that far too many engineering departments ignore.

In my experience, the trick with automated testing is to artfully choose a suitable mix of unit-, whitebox-, and blackbox-testing. The code review process is also an art form, but extremely effective when done right. Making code reviews mandatory is a genius move.

However, the lack of dedicated QA staff is deeply worrying. Having worked in both types of organizations, my take is that the fresh perspective provided by QA people results in higher code quality. Engineers who believe they can get away without QA tend to have an exaggerated view of their own “superstar” skills. While I’m a great believer in hiring superstar engineers, they are, by their nature, rare.

I also love the positive use of peer pressure to reinforce desired behavior, but this idea is sheer poison to many pre-existing organizational cultures — organizational antibodies will seek it out, destroying it on sight.

Cloudy Operations

With getting on for a billion users, any errors introduced into the Facebook service become very visible, very quickly. Avoiding such hold-the-front-page, uh-oh moments is a priority for the company.

Facebook employs as many operations people as it does software engineers. Culturally, the operations staffers are well respected and highly qualified.

The operations team is responsible for smoothly rolling out new releases of the Facebook software to its huge cloud server farm, totaling tens of thousands of Web, cache, Hadoop, MySQL, and ancillary servers, with 35 PB of storage. The team does that by following a carefully planned, staged rollout process, incorporating several “go/no-go” checkpoints along the way.

Typically, the Facebook software is updated every week; builds are released every Tuesday. After clearing a few days of internal alpha testing — what Microsoft used to call dogfooding, before the word was banned — builds are rolled out to a tiny subset of public servers.

In essence, Facebook enforces a beta test on a few hundred users every week. The operations team watches those servers for problems, before starting the rollout to the other tens of thousands of servers.
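A staged rollout with go/no-go checkpoints can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not a description of Facebook’s release tooling: the `staged_rollout` function, its `deploy` and `healthy` callbacks, and the stage names in the usage note are all hypothetical.

```python
def staged_rollout(stages, deploy, healthy):
    """Deploy stage by stage, stopping at the first failed checkpoint.

    stages  -- ordered list of (name, servers) tuples, smallest group first
    deploy  -- callable that pushes the build to a list of servers
    healthy -- callable returning True if those servers look fine afterwards

    Returns the names of the stages that passed their go/no-go check.
    """
    completed = []
    for name, servers in stages:
        deploy(servers)
        if not healthy(servers):
            # "No-go": halt before touching the next, larger group.
            break
        completed.append(name)
    return completed
```

With stages such as `internal`, `canary`, and `everyone`, a failed health check on the canary servers means the build never reaches the remaining tens of thousands of machines; the point of ordering the stages from smallest to largest is to limit how many users see a bad build.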

The problems for which the operations team watches don’t simply include logged errors, performance, memory utilization, and the like, but also a statistical analysis of user behavior. If these hard data show users interacting with the site in a different way than normal, this can be a sign that all is not well.
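One simple way to flag “different than normal” behavior is to compare a metric from the newly updated servers against its recent history. The Python sketch below uses a plain z-score test; the article does not say which statistics Facebook actually uses, so the technique, the `looks_abnormal` function, the threshold of 3, and the sample metric are all assumptions made for illustration.

```python
from statistics import mean, stdev


def looks_abnormal(baseline, observed, threshold=3.0):
    """Flag `observed` if it is far from the baseline's typical range.

    baseline  -- past values of a behavior metric (at least two samples),
                 e.g. average posts per user per hour on normal weeks
    observed  -- the same metric measured on the newly updated servers
    threshold -- number of standard deviations counted as abnormal
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # No historical variation at all: any difference is suspicious.
        return observed != mu
    return abs(observed - mu) / sigma > threshold
```

Given a baseline of `[4.1, 3.9, 4.0, 4.2, 3.8]`, an observed value of `4.1` sits well within normal variation, while `2.5` would be flagged for a closer look before the rollout continues.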

Engineers who contributed changes to this week’s build are required to be in the office and on-call during the rollout. Again, the culture creates peer pressure by naming and shaming engineers who flout this rule.

Analysis

A highly visible cloud service would do well to emulate the Facebook culture of valuing its operations staff.

Several Facebook engineers have said that they believe their use of dogfooding and the initial, quasi-beta rollout were adequate substitutes for a dedicated QA team. While they may have found that to be so with Facebook’s unique set of measures and constraints, I dare say it wouldn't be acceptable in most enterprise environments.

In Summary

In my experience, Facebook is doing a number of things right. However, other parts of this story leave much to be desired — or at least, won’t be helpful in other environments.

Acknowledgements: I’m indebted to Yee Lee, and the fine denizens of his blog, as well as those of Quora, Reddit, and Hacker News, where this topic has been discussed and dissected. Their ranks notably included many faceless Facebook employees who contributed data on processes used inside the company. Facebook engineers also frequently publish interesting internal information on their Notes page (albeit seemingly sanitized by Facebook PR).

Comments
by CustomerOpinion(anon) on 21-11-2011 03:15 PM

Now I understand why I find Facebook so frustrating to use. 

by cmeskee(anon) on 17-04-2012 09:48 AM

Pros:

No documentation, so the "code documents itself". Meaning, when a change is needed, no one really knows how the system currently works until a time-consuming and costly code review, followed by translation into plain English that users can understand, is completed.

Release candidate, compilable and testable builds are pushed every iteration. Meaning, even though the high-level requirements outlined for the sprint aren't yet complete, we're going to cut work off at a time-boxed build date regardless of development status and push the missed features to the next iteration. And then, when the full set of iterations outlined in the project plan is done, all of the features that users actually cared about didn't get built, but we sure built some cool stuff that no one asked for!

Not burdened by having to maintain a plan. There's a high-level plan, sure, but day-to-day individual developers get to let the creative juices flow because there's no requirements document and there are no pesky users asking annoying questions about where the features they asked for are. Plus, there's no real date set for when things have to be finished and no repercussions for not delivering the features requested by any date, so there's no real way to hold people accountable for completing expected work in an expected timeframe. It's fantastically liberating to be able to do whatever you want when you want with no annoying big brother looking over your shoulder making sure you're actually doing your job.

Developers own requirements.  As we all know, developers love to be in meetings all day, sitting in front of business users, interviewing them, asking the 5 "Whys" and making sure that what users really want is being built and then spending the time to make sure that needs are properly recorded so that they can be built according to user needs.  It's a perfect personality match for a stereotypical nerdy introvert that excels in math and linear thinking.

Cons.  See pros...

The HP Input Output site is sponsored by HP and features articles and content from HP and third-party contributors. Third-party articles and content, while paid for by HP, do not necessarily represent the views and opinions of HP. HP does not endorse this content and is not responsible for its accuracy, availability and quality.
