Ex-Twitter tech lead says platform's infrastructure can sustain engineering layoffs


Twitter systems are safe from collapse in the immediate future due to years of infrastructure planning, according to a senior engineer who left the platform in August.

Matthew Tejo, a former Site Reliability Engineer (SRE) at Twitter, explained in a blog post that much of his career at the firm was spent automating systems where possible, and planning for disasters where it was not. He said the platform can continue to function provided there are no major changes to the systems in place.

The explanation of how Twitter's infrastructure was designed comes after members of the tech industry questioned whether Twitter would be able to run after new CEO Elon Musk fired large portions of engineering staff.

Tejo said that Twitter relies heavily on cache memory to handle traffic, keep response speeds low across the website, and massively reduce overall server costs.

These caches run on the Aurora framework, which itself sits on top of the open source Apache Mesos project. Aurora allocates applications to servers, while Mesos aggregates servers into a pool and removes any that fail.
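The split of responsibilities described here can be illustrated with a minimal Python sketch. It is a toy model only and does not use the real Aurora or Mesos APIs: the `ServerPool` class, the `allocate` function, and the server and app names are all hypothetical, and round-robin placement is an assumption made for simplicity.

```python
class ServerPool:
    """Toy stand-in for the aggregation role the article attributes to Mesos."""

    def __init__(self, servers):
        # Every server starts out healthy and available for work.
        self.healthy = set(servers)

    def mark_failed(self, server):
        # A failed server is dropped from the pool so nothing new lands on it.
        self.healthy.discard(server)


def allocate(pool, apps):
    """Toy stand-in for the allocation role attributed to Aurora.

    Assigns each app to a healthy server in round-robin order.
    """
    servers = sorted(pool.healthy)
    if not servers:
        raise RuntimeError("no healthy servers available")
    return {app: servers[i % len(servers)] for i, app in enumerate(apps)}


pool = ServerPool(["s1", "s2", "s3"])
pool.mark_failed("s2")  # simulate a hardware failure
assignment = allocate(pool, ["cache-a", "cache-b", "cache-c"])
print(assignment)
```

With one of three servers marked as failed, the three hypothetical cache applications are spread over the two that remain.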

As Mesos is not capable of detecting all hardware issues, Twitter relies on manual monitoring from its IT department to check for problems such as bad disks. If one is found, repair workers in the data centre are automatically sent to rectify the problem.

Twitter’s much-reduced workforce - believed to be just 20% of its peak following the most recent round of resignations - could prove problematic, as the same amount of work now has to be completed by fewer engineers.

However, Tejo also revealed that Twitter runs two data centres concurrently, each capable of running all the core services on the platform and so of absorbing a total failure of the other. This means that Twitter constantly has 200% capacity for use in worst-case scenarios, and is therefore highly unlikely to go down through a lack of server resources.
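The 200% figure follows from simple arithmetic, which the short Python sketch below spells out. The site names and the capacity units are hypothetical; the only input taken from the article is that each of the two data centres can carry the full load on its own.

```python
def survives_site_failure(site_capacity, peak_load):
    """Return True if losing any single site still leaves enough capacity."""
    total = sum(site_capacity.values())
    return all(total - cap >= peak_load for cap in site_capacity.values())


# Hypothetical units: 100 means "enough to run every core service at peak".
peak_load = 100
sites = {"dc-1": 100, "dc-2": 100}

headroom = sum(sites.values()) / peak_load
ok = survives_site_failure(sites, peak_load)
print(f"capacity is {headroom:.0%} of peak, survives one site loss: {ok}")
```

If either site were sized below the full peak load, the same check would fail, which is why cuts to this redundancy matter.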

Twitter also uses custom tools to ensure that servers are safely distributed from the moment they are allocated: “Those tools make sure the team doesn’t have too many physical servers on a rack and that everything is distributed in a way that won’t cause problems if there are failures,” said Tejo.
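A rack-aware placement rule of the kind Tejo describes might look something like the following Python sketch. This is not Twitter's tooling: the `spread_across_racks` function, the `max_per_rack` limit, and the server and rack names are all hypothetical, chosen only to show the idea of capping how many servers a team takes from one rack.

```python
from collections import Counter


def spread_across_racks(servers, count, max_per_rack):
    """Pick `count` servers so no rack contributes more than `max_per_rack`.

    `servers` is a list of (server_name, rack_name) pairs.
    """
    per_rack = Counter()
    chosen = []
    for name, rack in servers:
        if len(chosen) == count:
            break
        # Skip servers on racks that already hold their share.
        if per_rack[rack] < max_per_rack:
            chosen.append(name)
            per_rack[rack] += 1
    if len(chosen) < count:
        raise ValueError("not enough rack diversity to place all instances")
    return chosen


servers = [("s1", "rack-a"), ("s2", "rack-a"), ("s3", "rack-a"),
           ("s4", "rack-b"), ("s5", "rack-b"), ("s6", "rack-c")]
placement = spread_across_racks(servers, count=4, max_per_rack=2)
print(placement)
```

With a cap of two servers per rack, losing any one rack in this example takes out at most half of the four placed instances.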

Unknown problems with the infrastructure, or changes to it made in the wave of alterations brought in by new Twitter CEO Elon Musk, could still destabilise the platform. Reflecting on the amount of effort that has gone into making Twitter at least partially self-sustaining, Tejo nevertheless acknowledged that he is “sure there are some bugs lurking somewhere”.

In the immediate aftermath of Musk taking over, Reuters reported that Musk was seeking to make $1 billion in infrastructure cuts in the coming months. The source who spoke to Reuters indicated that $1.5 to $3 million a day in server and cloud services costs had been identified as unnecessary, suggesting that the extensive safety redundancies which Tejo helped establish might not be maintained.

“I don’t want to be using systems or services that are hurriedly assembled under extreme duress, standards will slip, data will get lost,” Jeff Watkins, CPTO at xDesign, told ITPro.

“Worse still is that the likely outcome will be a great brain drain. As a result, the remaining team will likely not be the A-team.

“So the user data impact could be bad, but Twitter isn’t just used by the tweeting users via the website and mobile applications, it also has some interesting side-effect usage through its APIs, like detecting downtime in systems (from tweets mentioning the particular company). Destabilising what has become almost a social shadow-IT could have some unexpected consequences on a global scale.”

It remains unclear whether Twitter's infrastructure will sustain the platform in the long term with fewer engineers working on it. Despite the excess server resources available, software bugs are inevitable, and experienced engineers are needed to fix them and keep services running smoothly.

Twitter has undergone a period of rapid changes since Elon Musk completed his acquisition of the platform on 27 October. In the weeks since, a number of senior figures at the company left their roles, half its employees were fired overnight amidst chaotic scenes of workers being locked out of their emails, and a large number of remaining workers responded to Musk’s demands of harsher work conditions with a resignation ‘revolt’.

On Monday, The Verge reported that Musk made huge cuts to Twitter staff benefits, slashing company allowances for childcare, home internet, and wellness. The same report stated that staff will now have to provide higher-ups with a full rundown of their completed work at the end of each week.

Rory Bathgate
Features and Multimedia Editor
