Here is what Renaud Guerin just tweeted...
I'm a former Site Reliability Engineer at Facebook (albeit one from many moons ago).
Here's my modest take on #facebookdown :
Assuming the account of events below is true (it does sound plausible) we could still be looking at a few more hours of outage, here's why ⤵
1/n
So, what happened?
What happened ?
It appears Facebook has inadvertently cut itself off from the rest of the Internet. More accurately, it mistakenly removed every "road sign" worldwide pointing at its network.
The "why" will be a very interesting post mortem to read. How hard is it to fix ?
2/n
--
It depends on 2 factors :
1. Whether there's a *workable* backchannel for remote engineers to access not only the systems themselves, but crucially the comms tools and documentation they need.
2. How well rehearsed of a disaster recovery plan they had for this kind of issue.
3/n
--
You can reasonably assume they had some sort of emergency out of band remote access set up.
But do they still have access to all the fancy internal comms + incident management tools + documentation right now ?
This is less certain and could slow down remediation hugely.
4/n
--
How often did they run drills for a "network down" situation ? Did they have contact numbers in their phones ? Documentation printouts with them ?
Honestly, an outage like this one is so far fetched that it's unlikely they would have had 100% of bases covered.
5/n
--
WFH, of course, would have made this worse.
When the tools you rely on for your daily comms with colleagues are unavailable, it adds an extra burden to the already sky high cognitive load of troubleshooting a thorny and high stakes technical issue.
6/n
--
Needless to say this is an extremely stressful event for the engineers involved, but Site Reliability / Operations engineers are in it for the adrenaline. They will no doubt remember this day for the rest of their careers. Sparing a thought for my former colleagues !
7/7
--
Ok, if the below is confirmed and having also spoken to former colleagues, I'm clearly more pessimistic as to their preparedness for the issues I explained above 😓
In reference to...
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
-- --
But it gets worse: if this tweet from kayy is to be believed, Facebook has outright DISAPPEARED!
-- -- --