Adding "civil union" to relationship statuses broke Facebook

The Pragmatic Engineer • 2025-06-12 • 1:26 minutes • YouTube

🤖 AI-Generated Summary:

Why Test-Driven Development (TDD) Isn’t Always the Best Fit: Lessons from a Real-World Feature Launch

When it comes to software development, Test-Driven Development (TDD) is often hailed as a best practice for ensuring code quality and preventing bugs. However, real-world experiences sometimes reveal that TDD isn’t a one-size-fits-all solution. Here’s a candid reflection on a feature launch that challenges the conventional wisdom around TDD and highlights the complexities of building large-scale, rapidly evolving systems.

The Feature: Expanding Relationship Types

The first feature implemented was simple on paper: expanding the list of relationship statuses to include “civil union” and “domestic partnership” alongside existing options like “single,” “complicated,” and “married.” Development followed TDD, with tests written before the code.

The Reality: TDD as a Waste of Time

Despite the test-first approach, the rollout broke the notifications code. The relationship-status code and the notifications code were implicitly coupled, and the dependency was not visible from either side, so the tests written for the new statuses never exercised it. The failure surfaced only after launch, as a rise in the error rate.

Thankfully, a colleague noticed the issue, quickly developed and deployed a hotfix, and the problem was resolved. This incident underscored a crucial insight: the source of many bugs wasn’t complex algorithms or logic errors but rather configuration and subsystem relationships that tests couldn’t easily cover.
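The kind of implicit coupling described above can be sketched in a few lines. This is a hypothetical illustration, not Facebook's code: the names `RELATIONSHIP_STATUSES`, `NOTIFICATION_TEMPLATES`, `set_status`, and `render_notification` are all assumptions made up for the example, and the talk does not say what the real dependency looked like.

```python
# Hypothetical sketch: none of these names come from a real codebase.

RELATIONSHIP_STATUSES = ["single", "complicated", "married"]

# The notifications side keeps its own lookup table, written when only the
# original three statuses existed. Nothing ties it to RELATIONSHIP_STATUSES.
NOTIFICATION_TEMPLATES = {
    "single": "{name} is now single.",
    "complicated": "{name} says it's complicated.",
    "married": "{name} got married.",
}


def set_status(profile: dict, status: str) -> dict:
    """Profile-side code: validates against the status list only."""
    if status not in RELATIONSHIP_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return {**profile, "status": status}


def render_notification(profile: dict) -> str:
    """Notification-side code: assumes every status has a template."""
    return NOTIFICATION_TEMPLATES[profile["status"]].format(name=profile["name"])


# The feature change: extend the list, with a unit test written first.
RELATIONSHIP_STATUSES += ["civil union", "domestic partnership"]


def test_new_statuses_are_accepted() -> None:
    """The test-first unit test for the feature; it passes."""
    for status in ("civil union", "domestic partnership"):
        assert set_status({"name": "Alex"}, status)["status"] == status


def notification_breaks_for(status: str) -> bool:
    """Return True if rendering a notification for this status raises KeyError."""
    try:
        render_notification(set_status({"name": "Alex"}, status))
    except KeyError:
        return True
    return False
```

In this sketch the unit test for the new statuses passes, yet `notification_breaks_for("civil union")` returns `True`: the breakage lives in a second module that the feature's tests had no reason to touch.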

The Culture: “Nothing at Facebook Is Somebody Else’s Problem”

An important cultural element helped mitigate the risk. A popular poster at Facebook read “Nothing at Facebook is somebody else’s problem,” and people acted as if it were true. When the error rate rose, someone else noticed, fixed the problem, and rolled out a hotfix without passing the buck. Added together, these feedback loops kept the system relatively stable while it innovated and scaled rapidly.

The Takeaway: Test What Matters, Don’t Over-Test What Doesn’t

The experience highlights that while TDD can be valuable, it’s not always the best tool in environments where:

Bugs stem from implicit coupling and configurations rather than isolated code units.
Tests can’t easily capture the interplay between subsystems.
The system is evolving rapidly, requiring quick iterations and flexible responses to unforeseen issues.

In such contexts, focusing heavily on TDD can be inefficient and may create a false sense of security. Instead, investing in robust monitoring, quick feedback loops, cross-team collaboration, and a culture of shared responsibility can be more effective strategies for maintaining quality and stability.
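As a rough illustration of the monitoring side, the sketch below tracks the error rate over a sliding window of recent requests and flags when it crosses a threshold. The class name `ErrorRateMonitor`, the window size, and the threshold are assumptions chosen for the example; the talk only says that someone "saw the error rate go up" and does not describe the tooling.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the most recent `window` requests (hypothetical)."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        # A bounded deque drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        """Record one request outcome: True for success, False for an error."""
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window (0.0 if empty)."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        """True when the windowed error rate exceeds the threshold."""
        return self.error_rate() > self.threshold
```

A signal like this catches failures regardless of which subsystem caused them, which is why it complements tests that only cover the code a feature directly changes.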

Conclusion

TDD is a powerful technique but not a silver bullet. In large, fast-changing systems, the failures that matter often come from configuration and from the relationships between subsystems, which tests written ahead of the code tend not to exercise. In that kind of environment, with that risk profile, testing needs to be balanced against monitoring, fast fixes, and shared ownership.

Knowing when to apply TDD, and when to complement it with other strategies, helps teams build systems that work in practice, not just in theory.

📝 Transcript (37 entries):

The first feature I implemented and launched was adding to the relationship type. So you could say, "I'm single. It's complicated. I'm married." And I added civil union and domestic partnership to that list. I used TDD. It was a big waste of time. It rolled out.

The notifications code broke because there was implicit coupling between the two and you couldn't find it, but it was there. Somebody else saw the error rate go up, went and fixed it, rolled out a hotfix. I called them up. I'm like, "Oh, I'm so sorry that you had to do that." It's like, "Yeah, that's what happens."

There was a poster that was very popular there that said nothing at Facebook is somebody else's problem. And everybody acted like that was true. And because of that, if you add all those different feedback loops together, we had a relatively stable, rapidly innovating, and rapidly scaling system all at the same time.

The mistakes that actually caused problems, like the calculation of some string, was not a hairy computer science dynamic programming blah blah blah that could go wrong. What would go wrong is configuration stuff, the relationship between subsystems, stuff you couldn't write tests for. So writing tests for things that didn't break and didn't catch the actual errors, it just didn't make any sense in that kind of environment with that risk profile.

Yeah, it didn't make sense.
