Mediafly’s DNS (the service that translates a hostname like www.mediafly.com to a machine-usable IP address like 184.73.166.104) is served by GoDaddy. On Monday, GoDaddy suffered a major outage which brought down millions of websites. As you might expect, this had the potential to severely disrupt our business. However, thanks to a couple of thoughtful decisions made years ago, our customers were barely impacted by the outage. The iOS and Android device apps continued to operate smoothly for our customers, and some customers actually had higher usage than the previous business day (despite the GoDaddy outage).
More importantly, our customers stuck with us throughout the day, and we believe they truly appreciated both our frequent updates and our attempts to minimize any impact.
Over the past several years, we have operated under a handful of guiding product and company principles that helped us thrive through this day of disruption:
- Offline is a First Class concept. Mediafly’s mobile apps assume that your connection is spotty at best, and do everything they can to bring the media you will use down to the device before you need it. Even with a broken Internet, users can continue to securely access their sales presentations, documents, and video.
- Fast and frequent customer communication is absolutely critical. When something goes wrong, you should be the first to let your customers know, and we pride ourselves on doing exactly that. Many of our larger customers work in the IT departments of their organizations, and they have customers of their own that are using our software. Being able to arm them with information before their users run into issues, and keeping that information fresh, is very valuable.
- Detailed customer communication is just as critical. Customers will work with you if they understand (A) the root cause, (B) the estimated time to resolution, and (C) any mitigating steps in the meantime. In this case, the answers were straightforward to deliver:
  - A. Root cause: DNS is down. We pointed our customers to the TechCrunch article to show how widespread the problem was.
  - B. Estimated time to resolution: While it’s impossible to confidently deliver an ETA in the case of a broken Internet, we can talk about our past experiences with this kind of issue. We’ve seen service recovery start to trickle in 4-12 hours after an outage begins, and complete within 8-24 hours.
  - C. Mitigating steps: Because Offline is a First Class concept, the mitigating steps are to ask customers to have their users switch to Airplane Mode, and to hold off on publishing new media or organizing existing media until services have been restored.
- The whole team must watch and react to service issues quickly. As an example, when we realized the extent of the problem, we quickly dove into our production systems and modified all of our config files to use IP addresses instead of hostnames. With this change, we took GoDaddy out of our internal loop and were able to continue using services internally. This allowed us to continue monitoring our services for actual errors and issues, rather than assuming everything was related to DNS.
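For readers curious what that last config change might look like in practice, here is a minimal sketch in Python. It is not the script we ran; the function name `pin_hosts`, the hostnames, and the IP addresses (taken from the reserved documentation range) are all hypothetical, and it assumes your configs are plain text and that you already know the IP address for each hostname.

```python
import re


def pin_hosts(config_text, host_to_ip):
    """Return config_text with each known hostname replaced by its IP address.

    host_to_ip is a hypothetical mapping you maintain yourself; nothing here
    performs a DNS lookup, which is the point during a DNS outage.
    Matching is case-sensitive.
    """
    if not host_to_ip:
        return config_text
    # Try longer names first so a longer hostname wins over a shorter one
    # that happens to be its prefix.
    names = sorted(host_to_ip, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.-])("                      # not preceded by another hostname character
        + "|".join(re.escape(n) for n in names)
        + r")(?![\w-]|\.\w)"                 # not followed by more of a hostname
    )
    return pattern.sub(lambda m: host_to_ip[m.group(1)], config_text)


# Hypothetical hostnames and documentation-range addresses, for illustration only.
HOSTS = {
    "api.example.com": "192.0.2.10",
    "db.example.com": "192.0.2.20",
}

sample = (
    "api_url = http://api.example.com/v1\n"
    "db_host = db.example.com\n"
    "other = notapi.example.com\n"
)
print(pin_hosts(sample, HOSTS))
```

In this sketch the first two lines are rewritten to use the IP addresses, while `notapi.example.com` is left alone because it is a different hostname. One caveat worth keeping in mind: services reached over HTTPS typically validate the certificate against the hostname, so pointing a client at a bare IP address may fail unless that is accounted for. Treat a change like this as a temporary measure and revert it once DNS is healthy again.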
We will be re-evaluating our decision to host our DNS at GoDaddy after this outage. We had been considering a change for some time, and this outage accelerated our decision.
I’d like to thank our customers for their help and understanding with this issue.