M. Hassnain Arshad (082), Nida Usmani (168)

[Submitted to Dr. Ali Haider]

1. List a few companies that use Amazon Cloud EC2. Can you name a few hosted applications by these companies?

- BrowserMob: Provides "load testing" services (1) to software companies. It is used to model the use of software by simulating many users accessing the program simultaneously (2).
- Chaordic Systems: Provides "e-commerce personalization" facilities, so that users see the products most suitable to them and sellers earn more profit (2).
- Conduit: Provides a platform for publishers to build apps themselves using their own brands and content (2).
- Ensembl: Provides information on genomes free of cost (a free database) (2).
- ftopia: An online file-sharing application (2).

2. When did the mentioned problem occur? How long did the outage last?

- The mentioned problem began at 12:24 pm on December 24, 2012 (2).
- The outage took place on Christmas Eve, when people across the United States and beyond could not stream videos to watch movies. It lasted roughly a day and was fixed the next day, Tuesday.
- The company officially announced that the problem was resolved at 12:05 pm on December 25, 2012 (3). The outage is approximated to have lasted about 20 hours (4).
- By careful calculation, the outage lasted 23 hours and 41 minutes and, at its peak, had disabled 6.8% of the load balancers in the affected region (8).

3. How many US states and organizations were affected? Approximately how much was lost in revenue?

The people affected by this outage numbered in the millions, residing in both North and South America, including Canada, the United States, and Brazil. The affected US states were those served by the US East (North Virginia) region, which include Virginia and neighboring states. The organization most severely affected was Netflix, a US-based online streaming media service provider serving almost all Pacific countries (10).
Moreover, another company, Heroku, also saw its performance degraded by the malfunction discussed here (4). According to a financial report, Amazon suffered a loss of about 274 million US dollars in the quarter, whereas its net income had been 63 million US dollars in the same period of the previous year (11).
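
The "23 hours and 41 minutes" figure cited from (8) follows directly from the start and end timestamps reported above; a minimal Python check of the arithmetic (timestamps as given in the sources, treated as naive local times):

```python
from datetime import datetime

# Timestamps reported for the outage in the sources cited above
start = datetime(2012, 12, 24, 12, 24)  # 12:24 pm, December 24, 2012
end = datetime(2012, 12, 25, 12, 5)     # 12:05 pm, December 25, 2012

outage = end - start
hours, remainder = divmod(outage.total_seconds(), 3600)
minutes = remainder // 60
print(f"Outage lasted {int(hours)} hours and {int(minutes)} minutes")
# → Outage lasted 23 hours and 41 minutes
```

This also shows why the earlier "about 20 hours" estimate in (4) was only a rough approximation made before the final resolution time was known.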

4. What caused the problem? Find out the cause of this failure.

The problem originated in human error, as Amazon itself acknowledged (5). During maintenance on the East Coast Elastic Load Balancing system, a developer mistakenly and logically deleted some state data (6). This impaired each load balancer's ability to track the backend hosts to which its traffic was routed. High latency and error rates for the application programming interfaces (APIs) were observed (7). Initially, the engineers were unable to pinpoint the root cause of the outage for many hours, and the accidentally deleted data could not be restored in time (9).
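
The effect of that deletion can be sketched as follows. This is a hypothetical illustration, not Amazon's actual ELB implementation: each load balancer keeps state mapping it to its backend hosts, and once that state is wiped, every routed request fails:

```python
import random

# Hypothetical state data (illustrative names): each load balancer maps
# to the backend hosts it routes traffic to.
elb_state = {
    "lb-frontend": ["10.0.0.1", "10.0.0.2"],
    "lb-api": ["10.0.1.1", "10.0.1.2", "10.0.1.3"],
}

def route(lb_name):
    """Pick a backend host for an incoming request; fail if state is missing."""
    backends = elb_state.get(lb_name)
    if not backends:
        # This is the "high error rates" symptom the engineers observed
        raise LookupError(f"{lb_name}: no backend hosts known")
    return random.choice(backends)

print(route("lb-api"))   # state intact: a backend IP is returned

# The maintenance accident logically deleted the state data:
elb_state.clear()

try:
    route("lb-api")
except LookupError as e:
    print("error:", e)   # every routed request now fails
```

The load balancers themselves still existed; it was only the state tying them to their backends that was gone, which is why the failure was so hard to diagnose.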

    

Another source of confusion in locating the root cause was that users could still create new application programming interfaces and new load balancers, and these new load balancers could be managed, while the already-existing load balancers and APIs were not functioning. This arose because the Elastic Load Balancing control plane was attempting to make modifications to the already-existing load balancers. The team was puzzled that many API calls were succeeding (customers were able to create and manage new load balancers) while others were failing (customers could not manage existing ones). The outage also revealed that the ELB control plane lacked essential state data needed to carry out those changes, so the changes it made were not configured properly. The inappropriate configuration badly affected performance, and errors were spotted in customer applications using these modified load balancers (2).
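
The create-versus-modify asymmetry described above can be illustrated with a hypothetical control-plane sketch (function and resource names are invented for illustration): creating a load balancer writes fresh state and so succeeds, while modifying an existing one must read state that the accidental deletion removed:

```python
# Hypothetical control-plane state store; entries for pre-existing load
# balancers were wiped by the accidental deletion.
state_store = {}

def create_load_balancer(name, backends):
    # Creation writes brand-new state, so it does not depend on the lost data
    state_store[name] = {"backends": list(backends)}
    return "created"

def modify_load_balancer(name, backends):
    # Modification must read existing state -- exactly what was deleted
    record = state_store.get(name)
    if record is None:
        raise RuntimeError(f"{name}: required state data missing")
    record["backends"] = list(backends)
    return "modified"

print(create_load_balancer("lb-new", ["10.0.2.1"]))   # new LB: works

try:
    modify_load_balancer("lb-old", ["10.0.3.1"])      # pre-existing LB: fails
except RuntimeError as e:
    print("error:", e)
```

This mixture of succeeding and failing API calls is what made the symptom pattern so confusing to the responding engineers.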

5. Find another instance where a similar problem occurred with another organization.

A similar outage occurred in Microsoft's Windows Azure Storage service. The problem lasted from December 28, 2012 until December 30, 2012, and affected nearly 2% of the millions of Microsoft Azure users (14). It arose from a malfunction in one of the multiple storage node stacks, called "storage stamps" (12). Azure users were unable to get updates on the service's status because the storage stamp responsible for those updates was itself failing. To worsen matters, a transition to a new primary node mistakenly triggered an action against storage nodes that were not appropriately protected (13). Finally, after a strenuous two-day effort, the problem was fixed and affected customers were offered 100 percent compensation (12).

References:
(1) http://en.wikipedia.org/wiki/Load_testing#Software_load_testing
(2) http://aws.amazon.com/message/680587/
(3) http://aws.amazon.com/message/680587/
(4) http://www.datacenterknowledge.com/archives/2012/12/25/major-christmas-outage-for-amazons-cloud/
(5) http://www.digitaltrends.com/web/amazon-apologizes-for-netflix-outage-on-christmas-eve/
(6) http://news.cnet.com/8301-1023_3-57561454-93/amazon-apologizes-for-netflixs-christmas-eve-streaming-outage/
(7) http://aws.amazon.com/message/680587/
(8) http://gigaom.com/2012/12/31/amazon-blames-human-error-for-xmas-eve-outage-netflix-vows-better-resiliency/
(9) http://gigaom.com/2012/12/31/amazon-blames-human-error-for-xmas-eve-outage-netflix-vows-better-resiliency/

(10) http://en.wikipedia.org/wiki/Netflix
(11) http://paidcontent.org/2012/10/25/amazons-crummy-earnings-report-sends-shares-sliding-after-hours/
(12) http://rcpmag.com/articles/2013/01/17/microsoft-explains-dec-azure-outage.aspx
(13) http://www.neowin.net/news/microsoft-reveals-reasons-for-late-december-azure-outage
(14) http://www.zdnet.com/microsofts-december-azure-outage-what-went-wrong-7000010021/
