Through 2015, and particularly towards the end of the year, we were experiencing slowdowns and outages that were essentially caused by high load.
High load in cloud applications is a bit like the weather (in NZ anyway): you are never quite sure what is going to hit you or where it is going to come from. All load issues can be fixed by reprogramming or by server upgrades. The problem is that these fixes are not instant, but load is. You have to suffer poor performance while the application is reprogrammed ("optimised" in our language) or servers are added.
There is now a solution to this, and it is what we are working towards, but first a bit more on where load comes from. Load comes from increased use (more users) or from more data to process. As data builds up it gets slower to process; there is more to go through to get a result. Getting the balance of an account that is 1 year old is faster than getting the balance of an account that is 10 years old. We have both situations: more shops using Circle, and existing shops adding more and more data.
As I mentioned, slowness in getting results can be solved by optimisation or a server upgrade. In the account balance example you could limit the data you hold to, say, 3 years, or pre-calculate the balances so that they are instantly ready when needed. You could also add more RAM and CPU to the server.
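To make the pre-calculation idea concrete, here is a small sketch (names and numbers are invented for illustration, not our actual code): summing every transaction each time gets slower as the history grows, while a running balance that is updated on each write stays instant however old the account is.

```python
# Hypothetical illustration of the account balance example.
class Account:
    def __init__(self):
        self.transactions = []  # full history, grows year after year
        self._balance = 0       # pre-calculated, updated on every write

    def add_transaction(self, amount):
        self.transactions.append(amount)
        self._balance += amount  # pay the cost once, at write time

    def balance_by_scan(self):
        # cost grows with the size of the history
        return sum(self.transactions)

    def balance_precalculated(self):
        # constant time, however old the account is
        return self._balance

acct = Account()
for amount in [100, -40, 250]:
    acct.add_transaction(amount)

print(acct.balance_by_scan())        # 310
print(acct.balance_precalculated())  # 310
```

Both calls give the same answer; the difference is that the first re-reads the whole history and the second does not, which is exactly why pre-calculating removes the slowdown as data builds up.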
Optimisation treats the cause, so it is the best solution, but it takes time (often weeks). Adding CPU and RAM to a physical server is faster, but it requires downtime, and there are limits to how much RAM or how many CPUs can be added to a single server. It is better, and much less limited, to be able to add additional servers, but the application has to be written in such a way that it can be distributed over multiple servers, and even then getting and setting up an extra server is not instant; it takes days.
I mentioned that load can be unpredictable: a user increase can be sudden (SEO work has greatly increased web traffic, for example), or you can reach a tipping point where something that worked fine is suddenly very slow. Another thing that sometimes happens (and is currently the main problem) is that a slow part of the system is triggered, creating a log jam that slows everything down until it is done.
The ultimate solution to all of this is two pronged. We need to be able to almost instantly add servers to buy time while we optimise problem areas. Cloud server technology has evolved to the point where this is now possible, and we have selected Google servers to move to. The work to restructure the application to run across multiple servers has been underway since October last year: we have set up a test server on Google, done a lot of code optimisation, restructured major parts of the application so they can run on other servers, and added servers here. When all this is bedded in on the current servers we will be ready to make the move, and we want it to be this quarter.
On Google infrastructure we will be able to literally add a server at the click of a mouse. Even better, we think this can be automated so that when a certain load threshold is reached another server is added automatically.
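The automatic version is essentially a threshold rule. The sketch below is illustrative only: the threshold value and the idea of a "current load" number are placeholders, not our actual infrastructure or Google's API.

```python
# Illustrative sketch of threshold-based scaling (not real infrastructure code).
THRESHOLD = 0.8  # e.g. 80% average CPU across the current servers

def servers_needed(current_servers, load):
    """Return how many servers we should be running for this load."""
    if load > THRESHOLD:
        return current_servers + 1  # scale out to buy headroom
    return current_servers

print(servers_needed(2, 0.9))  # 3 - load spike, add a server
print(servers_needed(2, 0.5))  # 2 - normal load, no change
```

In practice a real autoscaler also scales back down when load drops and waits between changes so it does not flap, but the core decision is this simple comparison.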
Below are a couple of graphs showing optimisation progress since October, and a zoomed-in view showing a load spike/log jam situation on a particular day.
The main reason we are changing you to a CNAME is that you will not have to do anything when we change servers/IP addresses. We just need to update the IP addresses behind the two hostnames your CNAMEs point to, and your site will follow to the new servers.
You currently have a hardcoded IP address (an A record) in your DNS setup. If you stayed with this (and you can if you really want to) you would need to manually update the IP address to the address of the new servers at the same time as we move (and turn the current server off) to avoid your site going offline. Very few shops are going to be able to do this, which is why we are moving everyone to a CNAME now, ahead of the move.
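A toy model of how the indirection works may help (the hostnames and IP addresses below are made up for illustration): with a CNAME, your shop's name points at a hostname we control, so when we change the one A record on our side, every shop follows automatically.

```python
# Toy DNS model (not a real resolver) showing CNAME indirection.
records = {
    "shop1.example.com": ("CNAME", "shops.circle.example"),  # hypothetical names
    "shop2.example.com": ("CNAME", "shops.circle.example"),
    "shops.circle.example": ("A", "203.0.113.10"),           # current server
}

def resolve(name):
    rtype, value = records[name]
    if rtype == "CNAME":
        return resolve(value)  # follow the alias to the final A record
    return value

print(resolve("shop1.example.com"))  # 203.0.113.10

# Server move: we change one record on our side...
records["shops.circle.example"] = ("A", "198.51.100.20")

# ...and every shop pointing at it follows, with no change to their DNS.
print(resolve("shop1.example.com"))  # 198.51.100.20
print(resolve("shop2.example.com"))  # 198.51.100.20
```

With a hardcoded A record, each shop would instead hold the IP directly and would have to make that change itself, at exactly the right time.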