The sheer scale of Twitter is amazing: 400 million tweets a day works out to about 5,000 tweets a second. Each tweet of 140 characters (or about 200 bytes) has to be sent, recorded, and retransmitted to up to 20 million followers in less time than it took you to read this paragraph.
So, how does Twitter do it? With Linux and open-source software.
Speaking at LinuxCon, the Linux Foundation's annual North American technology conference, and at the Palmetto Open Source Software Conference, Chris Aniszczyk, Twitter's open-source manager and a leading Eclipse developer, offered a detailed explanation of how Twitter tweets.
“On the surface, Twitter is a simple real time service where the unit currency is 140 character messages called Tweets. However, if you look underneath the surface, there are over 400 million tweets being sent out a day at an average steady state of 5,000 Tweets a second,” Aniszczyk says. “At this scale, you have to deal with some interesting real time engineering problems.”
Twitter uses open source to solve these problems because, Aniszczyk says, it's a no-brainer. “Open-source software allows us to customize and tweak code to meet our fast-paced engineering needs as our service and community grows,” he says. “When we plan new engineering projects at Twitter, we always make sure to measure our requirements against the capabilities of open source offerings, and we prefer to consume open source software whenever it makes sense.” As a result, much of Twitter is built on open source software and it’s “a pervasive part of our culture,” Aniszczyk says. “There is a positive cycle of teaching and learning within open source communities that we benefit from.”
One effect of that cultural value is Twitter's philosophy of open-sourcing almost everything. “We take our software inspiration from Red Hat's development philosophy: 'default to open,'” Aniszczyk says. “The majority of open-source software exclusively developed by Twitter is licensed under the liberal terms of the Apache License, Version 2.0. The documentation is generally available under the Creative Commons Attribution 3.0 Unported License. In the end, you are free to use, modify, and distribute any documentation, source code or examples within our open source projects, as long as you adhere to the licensing conditions present within the projects.” Twitter's open-source software is kept on GitHub.
Twitter's own base programs are almost entirely open source, but not everything is open. As has been in the news lately, Twitter is restricting how independent software vendors can use its APIs.
Linux is a major component of the Twitter architecture. Aniszczyk explains, “Linux powers the majority of Twitter and serves as our technology backbone. We have tens of thousands of machines running all types of services that run a customized version of Linux.”
Twitter prefers Linux because, Aniszczyk says, it lets the company innovate faster given the flexibility to customize the operating system.
Specifically: “We use a few different versions to see what works best in production, but as of today, we are mainly on the 2.6.39 release,” Aniszczyk says. “We customize the kernel by adding some patches such as enhanced core dump functionality, UnionFS support and the ability to allow TCP congestion window to be set on a socket basis.”
At first, Twitter was written in Ruby on Rails, but it has moved on to Java-based programs. The move was not simply a matter of raw performance, of Java being “faster” than Ruby on Rails. As former Twitter architect Blaine Cook said at the time of the shift, “languages don't scale, architectures do.” For large networks and big data, what matters is less how quickly your code runs than how well the entire system runs once it scales from thousands of users and dozens of servers to millions of users and thousands of servers.
So how does Twitter scale? Let's start with a single tweet. You type in, “Go Mountaineers! Beat Pitt!” and the text is sent over the Internet to the Twitter website. There, your tweet is registered as a status update. It's given a unique ID by snowflake, a network service that generates unique ID numbers as quickly as possible.
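To make the ID step more concrete, here is a minimal Python sketch of a Snowflake-style generator. It is not Twitter's code: the class name `IdGenerator`, the bit widths, and the epoch constant are assumptions chosen for illustration, loosely following the publicly described idea of packing a millisecond timestamp, a worker number, and a per-millisecond sequence into one integer.

```python
import threading
import time

# Assumed layout for illustration: timestamp | worker | sequence.
WORKER_BITS = 10
SEQUENCE_BITS = 12
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1
CUSTOM_EPOCH_MS = 1288834974657  # illustrative custom epoch, in milliseconds


class IdGenerator:
    """Hands out roughly time-ordered IDs that are unique per worker."""

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        # The clock is injectable so the generator can be tested deterministically.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self._last_ms:
                # Same millisecond: bump the sequence counter.
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # Counter wrapped, so spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Because the timestamp occupies the high bits, IDs from one generator sort in the order they were issued, and no coordination between workers is needed as long as each has a distinct worker number.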
The tweet is then checked by t.co, Twitter's URL shortener and spam detector; the URL itself is checked by a program with the unlikely name SpiderDuck. Once past this stage, each tweet is stored in MySQL by Gizzard, a flexible sharding framework for creating eventually consistent distributed data stores. Twitter uses its own open MySQL fork for primary storage.
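The idea behind sharding can be sketched in a few lines of Python. This is not Gizzard's API; the `ShardedStore` class, the shard names, and the use of in-memory dictionaries in place of MySQL databases are all assumptions for illustration. The sketch hashes each key into a fixed number space, splits that space into contiguous ranges, and routes every key to the shard that owns its range.

```python
import bisect
import hashlib


class ShardedStore:
    """Routes keys to shards by hashing them into contiguous ranges."""

    HASH_SPACE = 2 ** 32

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self.shard_names = list(shard_names)
        count = len(self.shard_names)
        # Lower bound of the hash range owned by each shard, in ascending order.
        self.lower_bounds = [i * self.HASH_SPACE // count for i in range(count)]
        # Plain dictionaries stand in for the real databases behind each shard.
        self.shards = {name: {} for name in self.shard_names}

    def _hash(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")  # always below HASH_SPACE

    def shard_for(self, key):
        # Find the last range whose lower bound is <= the key's hash.
        index = bisect.bisect_right(self.lower_bounds, self._hash(key)) - 1
        return self.shard_names[index]

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)
```

A caller never needs to know which shard holds a tweet: `store.put(tweet_id, text)` and `store.get(tweet_id)` both go through the same routing step, which is what lets the number of machines grow without changing application code.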
Of course, MySQL, even with Gizzard, isn't fast enough for hyperactive social networks. To avoid waiting on disk storage, Twitter, like most social networks, uses Memcache. Memcache, which was created by LiveJournal, is an open-source, high-performance, distributed memory object caching system used to speed up dynamic web applications by alleviating database load.
Twitter uses its own open-source version of Memcache, Twemcache, in “hundreds of dedicated cache servers keeping over 20TB of data from over 30 services in-memory, including crucial data such as user information and Tweets,” according to Aniszczyk. “Collectively, these servers handle almost 2 trillion queries on any given day. That’s more than 23 million queries per second.”
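How a cache takes load off the database is easiest to see in a small example. The Python sketch below shows the common "cache-aside" pattern: look in the cache first, fall back to the database on a miss, and remember the answer for next time. It does not use a real Memcache or Twemcache client; `TtlCache`, `TweetStore`, and the dictionary standing in for the database are hypothetical names introduced only for this illustration.

```python
import time


class TtlCache:
    """A tiny in-process stand-in for a memory cache with expiring entries."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # entry is stale, so drop it
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self.ttl_seconds)

    def delete(self, key):
        self._entries.pop(key, None)


class TweetStore:
    """Reads through the cache and only touches the database on a miss."""

    def __init__(self, database, cache):
        self.database = database  # a dict standing in for the real database
        self.cache = cache
        self.database_reads = 0

    def get_tweet(self, tweet_id):
        cached = self.cache.get(tweet_id)
        if cached is not None:
            return cached
        self.database_reads += 1
        tweet = self.database.get(tweet_id)
        if tweet is not None:
            self.cache.set(tweet_id, tweet)
        return tweet

    def save_tweet(self, tweet_id, text):
        self.database[tweet_id] = text
        self.cache.delete(tweet_id)  # invalidate so readers see the new text
```

Repeated reads of a popular tweet cost one database query and then nothing but memory lookups until the entry expires or is invalidated, which is why a cache tier can absorb far more queries than the database behind it.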
If all has gone well, in about 20 milliseconds, your tweet is tagged, modified, and on its way to storage. Twitter then sends your Web browser an HTTP 200 message acknowledging your tweet.
The tweet has not, however, started its journey to your Twitter followers. Before that happens, the tweet's data goes to Bing and other search programs using the Firehose API.
Finally, your tweet is ready for fanout: the process of being sent out to your eager followers. To do this, the list of your followers is pulled from FlockDB. Gizzard also helps to sort out which tweets go to which people. If all goes well, your tweet appears in your followers' Twitter streams within a couple of seconds.
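Fanout is simple to describe and expensive to run, as a short Python sketch shows. This is only an illustration of the fanout-on-write idea, not Twitter's implementation: `FanoutService` is a hypothetical class, an in-memory dictionary stands in for the follower graph that FlockDB holds, and the `timeline_limit` cap is an assumed parameter.

```python
from collections import defaultdict, deque


class FanoutService:
    """Copies each new tweet ID into the timeline of every follower."""

    def __init__(self, timeline_limit=100):
        # author -> set of followers (a stand-in for the follower graph)
        self.followers = defaultdict(set)
        # user -> newest-first tweet IDs, capped at timeline_limit entries
        self.timelines = defaultdict(lambda: deque(maxlen=timeline_limit))

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def publish(self, author, tweet_id):
        recipients = self.followers.get(author, set())
        for follower in recipients:
            # appendleft keeps the newest tweet first; the oldest falls off the end.
            self.timelines[follower].appendleft(tweet_id)
        return len(recipients)  # number of timeline writes performed

    def timeline(self, user):
        return list(self.timelines.get(user, ()))
```

The return value of `publish` makes the cost visible: one tweet from an account with millions of followers turns into millions of timeline writes, which is why this stage takes seconds rather than milliseconds.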
Twitter still has its problems, but consider that it may have as many as half a billion users, all expecting their messages to reach all of their friends in near real time. Measured against that, its successes far outshine its failures. Twitter, by any reasonable measure, is a remarkable open-source programming success story.
If you'd like to be part of it, Aniszczyk points out: They’re hiring. “If this type of work interests you, might I remind you that we are looking for Linux developers to join the flock and would love to hear from you.”