On Tuesday 31 October the server had a hardware failure: two defective banks of memory. After we replaced the memory the same issues persisted, so we replaced the whole server. Unfortunately the new server also had a memory error. We needed to take serious action and quickly moved to a different server, which we had to configure from scratch.
To prevent a hardware or data-center problem from stopping the channel.me service, we have prepared two servers in separate areas. This allows us to switch easily from server A to server B if one server fails or if there are problems in the area where it is located.
But it got even worse: on 9 November the power went down at our DNS (domain) provider. So our backup plan of two servers that we could easily switch between didn't work. Why? Even though our new servers were running, they were not reachable because of the power failure at our DNS provider. A DNS provider is like a phone book: it tells everybody where traffic for channel.me needs to be sent, server A or server B.
How can you ensure this will not happen again?
- We have replaced the current server (from 2013) with two newer servers.
- The servers will be in the same data center, AMS-01, but located in Hall3 and Hall4. This means they are in separate areas, protected from fire, power outages, and network problems.
- There will be automatic synchronization from the master to the fallback server. Server-to-server sync will be implemented next week. This means that we can quickly resume normal operation on the fallback server.
- When one server goes down we will receive an SMS, so we can react immediately and switch to the other (online) server within one hour.
- We will move to a DNS provider with more backup name servers.
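For technical readers, the monitoring and switch-over idea in the list above can be sketched roughly as follows. This is only an illustration under our own assumptions, not the actual channel.me setup: the class name `FailoverMonitor`, the server names, the `probe` callback, and the threshold of consecutive failures are all hypothetical. A server is treated as down only after several failed checks in a row, at which point an alert is recorded (in practice this would be the SMS) and traffic is directed to the next server in the list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FailoverMonitor:
    """Hypothetical sketch: decide which server should receive traffic."""
    servers: List[str]               # ordered by preference, master first
    probe: Callable[[str], bool]     # returns True when the server answers
    max_failures: int = 3            # consecutive failures before "down"
    failures: Dict[str, int] = field(default_factory=dict)
    alerts: List[str] = field(default_factory=list)

    def check_once(self) -> str:
        """Probe every server once and return the preferred healthy one."""
        for name in self.servers:
            if self.probe(name):
                self.failures[name] = 0
            else:
                self.failures[name] = self.failures.get(name, 0) + 1
                # Alert exactly once, when the threshold is first reached.
                if self.failures[name] == self.max_failures:
                    self.alerts.append(f"{name} is down")
        for name in self.servers:
            if self.failures.get(name, 0) < self.max_failures:
                return name
        raise RuntimeError("no healthy server available")
```

Calling `check_once()` on a schedule (for example every 5 minutes) keeps returning the master until it has failed `max_failures` checks in a row, and returns to the master as soon as it answers again. Requiring several failures avoids switching on a single dropped request.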
Was any data lost?
No, we had a good backup plan and could rebuild a new server from scratch.
Follow us on Twitter:
We will also start posting updates on Twitter:
https://twitter.com/Channel_me
Any other questions or feedback, let us know below:
Feedback
Regina
Very Nice, this explanation. Thnx! Posted 7 years, 1 month ago.
Jimmy
Thanks for following up on this. We like to see transparency and we like to see lessons learned from structural mishaps. Some comments:
- Who gets the SMS if a server goes down? Is it just one tech person or a number of staff (who know how to get hold of a tech person)?
- Our company uses a number of (partially overlapping) checks to ensure our website is behaving normally. If you use SMS, have you anything to fall back on that is independent of the machine that sends SMS?
- You don't mention 24/7 support. Can you reassure us that problems will be addressed even if they happen in the evening, at night or on weekends?
- You mention that you will be posting updates on Twitter. Will that be your preferred mode of keeping co-browsing clients up to speed? Will you be confirming when all items on the present action list have been done, for instance? Posted 7 years, 1 month ago.
Maas @ Channel.me
Thanks for your feedback. Both the tech and staff people get the SMS. We check our servers every 5 minutes to ensure that they work properly. These are independent monitors which check different parts of the service. The checks are run from 70 locations. Both the master and slave server will be checked with this monitoring service.
Your question regarding a fallback for the monitoring solution made us think. We are going to let the master and slave server check each other. When something happens, they will send us a notification via email. We may implement SMS warnings in the future.
We usually offer business-hours support, although in practice the period in which we can respond is usually a little longer. If you really need 24/7 support, please contact Justin. He can help you with this.
We run our servers on Erlang, which can handle some level of failure by itself. There are built-in supervisors everywhere in the software. These supervisors restart services on their own without causing downtime. FYI, if you are technical, here is a nice online chapter on this part of Erlang which explains how this works: http://learnyousomeerlang.com/building-applications-with-otp
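To give a rough feel for the supervisor idea without Erlang syntax, here is a minimal sketch in Python. It is not Erlang/OTP and not our production code: the `Supervisor` class and its `max_restarts` parameter are hypothetical, and real OTP supervisors limit restarts within a time window and manage whole trees of processes, which this toy version leaves out. It only shows the core pattern: when a task crashes, start it again, and give up if it keeps crashing.

```python
from typing import Callable, Dict


class Supervisor:
    """Toy restart loop, loosely inspired by Erlang/OTP supervisors."""

    def __init__(self, max_restarts: int = 3):
        self.max_restarts = max_restarts
        self.restarts: Dict[str, int] = {}   # restart count per task name

    def run(self, name: str, task: Callable[[], str]) -> str:
        """Run task, restarting it after a crash until it returns."""
        while True:
            try:
                return task()
            except Exception:
                count = self.restarts.get(name, 0) + 1
                self.restarts[name] = count
                # Give up once the task has crashed too many times.
                if count > self.max_restarts:
                    raise RuntimeError(
                        f"{name} exceeded {self.max_restarts} restarts"
                    )
```

A task that crashes twice and then succeeds is simply restarted twice and its result is returned. A task that never recovers ends in a `RuntimeError`, which is the point at which a human needs to be notified.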
At this moment we prefer to inform clients by email, but in case of a DNS failure this is a problem, as we noticed the hard way. We are planning to use Twitter to update our clients when important things are happening or when there are new roadmap items. We are open to other suggestions. Maybe LinkedIn? Posted 7 years, 1 month ago.