Wednesday, October 10, 2012

Client Side Load Balancing - it rocks

Server-side clustering and load balancing is bad mkay.

It's far too much of a shotgun approach to deliver real efficiency, and it brings its own problems (more load balancing layers, costs, SPOFs, etc.).


How about a simpler, stronger, more scalable approach?


1. Get your server list!


  • Option a) Anycast to the "closest" server to receive the local server list (hosting it on a CDN is probably an option)
  • Option b) The server list could even be shipped with the application
  • In both cases, the list ends up cached in the application, so everything is purely IP-based after the first load
(file name = australia.global.json) 
[["url1","url2","url3","whatever"],["url4","url5","url6","other"],["url7","url8"]]

 The list is of course a JSON string, grouped by location (0, 1, 2) in order of ping distance, trimmed to needs (e.g. listing only 4 servers for each secondary country instead of transferring a huge JSON file for the fun of it), pre-computed for each country and stored on the CDN.
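To make step 1 concrete, here is a minimal sketch of how a browser client could fetch and cache that list; the CDN URL and the localStorage key are placeholders of mine, not part of the design. The parsed result is what the snippets below call jsonArray.

// Sketch only: load the pre-computed list from the CDN once, then cache it locally.
var LIST_URL="https://cdn.example.com/australia.global.json"; //placeholder URL

function loadServerList(callback){
  var cached=localStorage.getItem("serverList");
  if(cached){callback(JSON.parse(cached));return;} //purely local after the first load
  var xhr=new XMLHttpRequest();
  xhr.open("GET",LIST_URL);
  xhr.onload=function(){
    localStorage.setItem("serverList",xhr.responseText); //cache for next time
    callback(JSON.parse(xhr.responseText));
  };
  xhr.send();
}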

2. Select server from list @ random

function rand(min,max){return Math.floor(Math.random()*(max-min+1))+min;} //uniform integer in [min,max] - java style random sucks
var servers=jsonArray[0];           //start with the closest country's group (index 0)
var index=rand(0,servers.length-1); //JS is readable. for everyone - including liberal arts ;)
var tries=1;
var url=servers[index];

3. If a request fails, the client picks another server and retries

servers.splice(index,1);            //drop the server that just failed
index=rand(0,servers.length-1);     //and pick another one at random
tries++;
url=servers[index];

4. If three different servers from the closest country (0) fail, the app moves on to the next one (1), etc.

If too many errors occur, the application contacts any of the servers it knows to ask for a new list. If even that fails, it falls back to asking the CDN.
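As a rough sketch of that escalation (the group/failure counters and the refreshServerList() helper are mine, purely for illustration; rand() is the helper from step 2):

// Sketch: 3 failed servers in the current group => move to the next group;
// every group exhausted => ask for a fresh list (any known server first, CDN last).
var group=0;
var failuresInGroup=0;

function nextServer(jsonArray){
  if(failuresInGroup>=3 && group+1<jsonArray.length){
    group++;             //three strikes: escalate to the next closest country
    failuresInGroup=0;
  }
  var candidates=jsonArray[group];
  if(!candidates || candidates.length===0){
    refreshServerList(); //hypothetical helper: any known server, then the CDN
    return null;
  }
  return candidates[rand(0,candidates.length-1)];
}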

5. When a server becomes overloaded, it starts dropping requests on purpose

And the application retries on other servers, thereby balancing out the inevitable imbalance of a purely probabilistic load balancing scheme.
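On the server side, deliberate load shedding can be as simple as counting in-flight requests and refusing new ones past a threshold. A Node.js sketch (the threshold and the handle() function are made up for illustration):

// Sketch (Node.js): shed load on purpose once too many requests are in flight.
var http=require("http");
var MAX_IN_FLIGHT=200; //arbitrary illustrative threshold

var inFlight=0;
http.createServer(function(req,res){
  if(inFlight>=MAX_IN_FLIGHT){
    res.writeHead(503);  //tell the client to go balance itself elsewhere
    res.end();
    return;
  }
  inFlight++;
  handle(req,function(body){ //handle() stands in for the real application logic
    inFlight--;
    res.writeHead(200,{"Content-Type":"application/json"});
    res.end(body);
  });
}).listen(8080);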


Why it's better:


Simpler, More Scalable, More Efficient, More Reliable, Better Availability - all by design:

  • fewer components
  • no central, overloaded ASIC
  • no server-side cluster management
  • no possible load balancer failure
  • potential live redundancy across N servers, i.e. the same API call sent to multiple servers (see the sketch below)
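That live-redundancy point is easy to sketch with modern promises: fire the same call at several servers and keep the first good answer. Promise.any only rejects once every server has failed; fetch and the URL layout are assumptions of mine:

// Sketch: send the same API call to N servers, keep the first successful response.
function redundantCall(servers,path,payload){
  return Promise.any(servers.map(function(base){
    return fetch(base+path,{method:"POST",body:JSON.stringify(payload)})
      .then(function(res){
        if(!res.ok){throw new Error(base+" answered "+res.status);}
        return res.json();
      });
  }));
}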

Because failover is handled in the application itself, as long as the user has an internet connection you can guarantee 100.00% uptime (as in 100.00% of the user's connection uptime, the theoretical maximum anyway), without any concern for possible browser stupidity, DNS caching anywhere from the browser to the global servers, DDoS, cache poisoning, an unresponsive server that looks alive to the load balancer, overloaded servers, etc.

It enables you to deliver REAL availability to your end users, because availability as seen by the user is the only thing this failover mechanism can actually monitor.
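Wired together, the client-side failover loop might look roughly like this sketch (callApi() is a stand-in for whatever transport the app uses; nextServer(), group and failuresInGroup come from the step-4 sketch above):

// Sketch: keep retrying against different servers until one of them answers,
// so the only availability that matters is the one the user actually experiences.
function callWithFailover(jsonArray,path,payload,callback){
  var url=nextServer(jsonArray);
  if(url===null){return;} //list refresh in progress, nothing left to try
  callApi(url+path,payload,function(err,result){
    if(err){
      failuresInGroup++;                                 //count it against this group
      var grp=jsonArray[group];
      grp.splice(grp.indexOf(url),1);                    //never retry the same server
      callWithFailover(jsonArray,path,payload,callback); //move on to another one
    }else{
      callback(result);
    }
  });
}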

Better still, with a properly designed application the user will at most need to log in a second time in case of failure - and if login/auth is implemented through an external service like Google or Facebook, the failure has no visible effect at all from the user's perspective.

Since DNS is not even used except in improbable cases (fetching a new server list from the CDN because every single server in the original list failed to reply - which usually means we got hit by an atom bomb, the servers are on the other side of the Great Firewall, OR we lost our internet connection, the latter of course being checked after a few non-responsive servers), we can safely say that the following have no effect:

  • DNS shutdown
  • DDoS on someone who shares DNS with you
  • DDoS on anyone's DNS, actually
  • DNS cache poisoning


And of course, it can be implemented in any web application, web site, desktop application, mobile app, ...

Why it doesn't work like that



This solution looks so much better than what is actually being done that I expect someone to hop in at any time and tell me why it cannot work.

So please go ahead, enlighten me and I'll come up with a much better solution then ;)
