Saturday, 27 April 2013

Case Study - Cloud Computing Challenges


Challenges
The following is a summary of an interview with the ICT team that manages the datacenter of (****NAME HIDDEN****)corp., a company of 135 employees, 105 of whom are engineers working in software R&D. The ICT staff is in charge of maintaining the operations of about 1,000 machines, scattered across 9 different labs of around 50 m² each, plus a central computer room of 160 m². The datacenter is an unwieldy collection of computers with widely varying hardware configurations, ranging from recycled personal workstations and inefficient old servers (e.g. Ultra1 and Ultra2 boxes) stacked in racks to more recent medium-to-high-end multi-core servers connected to large storage systems. The central computer room primarily hosts the file and backup servers of the labs and is interconnected with the 9 labs through 10 Gb Ethernet links. The heterogeneity and age of the machines make the datacenter difficult to manage cost-effectively: maintaining such a diverse ICT environment is time consuming and requires a significant inventory of spare parts to replace faulty components.

Among the 1,000 or so machines, it is estimated that a small proportion cannot easily be externalized to the cloud, either because they are used intensively (for example, release engineering runs non-regression tests daily on amounts of data that would be too costly and impractical to migrate to the cloud) or because they require control over the hardware specification of the machine (such as a SPARC versus an x86 processor architecture) for performance benchmarks, product qualification and similar platform certification tasks. Conversely, it is estimated that the vast majority of the x86-based servers could be externalized to the cloud for general development and testing purposes, on the condition that the solution ensures both excellent access performance and security. Amazon's Virtual Private Cloud (as we will see below) and CohesiveFT were cited as offerings that can bridge the datacenter with public cloud services securely, so that data at rest and data in flight are not compromised. CloudSwitch was also cited as a well-regarded startup that pushes the concept of hybrid cloud computing further by offering a service that takes care of all the networking, isolation, management, security and storage concerns involved in moving VMware-based virtual machines to Amazon EC2.
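To make the bridging idea concrete, here is a minimal, purely illustrative sketch (not part of the interview) of how a corporate datacenter could be linked to an Amazon VPC over an IPsec VPN using the boto3 Python SDK. The region, CIDR block and the lab's public VPN gateway address are placeholder assumptions.

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Private address space reserved for the cloud-hosted dev/test machines.
    vpc = ec2.create_vpc(CidrBlock="10.20.0.0/16")
    vpc_id = vpc["Vpc"]["VpcId"]

    # The datacenter side of the tunnel: the lab's public-facing VPN device (placeholder IP).
    cgw = ec2.create_customer_gateway(Type="ipsec.1", PublicIp="203.0.113.10", BgpAsn=65000)

    # The AWS side of the tunnel, attached to the new VPC.
    vgw = ec2.create_vpn_gateway(Type="ipsec.1")
    ec2.attach_vpn_gateway(VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"], VpcId=vpc_id)

    # An IPsec VPN connection so that data in flight between the two sites stays encrypted.
    ec2.create_vpn_connection(
        Type="ipsec.1",
        CustomerGatewayId=cgw["CustomerGateway"]["CustomerGatewayId"],
        VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"],
    )

The on-premises VPN device still has to be configured with the tunnel parameters returned by AWS, and protecting data at rest (e.g. with encrypted volumes) is a separate concern; this sketch only covers the network bridge.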

The primary motivation for GEC to outsource parts of its development and test infrastructure to the cloud stems from the inefficient use of the lab resources. Secondly, the organization seeks to reduce its hardware procurement and maintenance costs as well as its electricity bills for cooling and running the systems, thereby supporting a greener IT posture by cutting the CO2 emissions caused by server and storage sprawl. Thirdly, it seeks to become more agile and productive by providing developers and testers with a unified and more effective computing environment.

Currently, GEC is facing several challenges that hamper its ability to reach these objectives:
There is significant contention between developers and testers for access to available machines of the right configuration type. QA teams are generally scrambling to get access to machines when an alpha release is delivered, and rely heavily on the engineering teams to free up boxes from development.
Developers tend to use lab resources very inefficiently because they step in and out of various engineering tasks at irregular intervals. Meanwhile, they retain control of the machines they hold for weeks and even months, reluctant to release them because doing so would mean reinstalling and reconfiguring the entire working environment every time they get a machine back from the pool. It is common to see developers keep a dozen or so machines up their sleeve while in fact only using one or two at a time. Similarly, QA has to reinstall the entire software stack every time it gets an available machine, because part of a real testing cycle is to ensure that the application mirrors a production environment, which means the machine needs a clean install with the appropriate versions of all the software components. Setting up a proper QA environment manually can take days, from the time an alpha version is released to QA to the time QA actually has a properly configured environment in which to begin testing.
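One common remedy, sketched below for illustration only (it is not described in the interview), is to configure a machine once, capture it as a reusable image, and start every QA cycle from a fresh copy of that image instead of reinstalling the stack by hand. The instance ID, image name and instance type in this boto3 sketch are placeholder assumptions.

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # One-off step: capture an already configured QA machine as a reusable image.
    image = ec2.create_image(
        InstanceId="i-0123456789abcdef0",      # placeholder: a fully configured instance
        Name="qa-baseline-alpha",              # placeholder image name
        Description="Clean QA stack for the current alpha release",
    )
    # In practice one would wait here until the image reaches the 'available' state.

    # Each test cycle: launch a fresh, identical environment from that image
    # in minutes, instead of spending days reinstalling the software stack.
    ec2.run_instances(
        ImageId=image["ImageId"],
        InstanceType="m3.xlarge",              # placeholder size
        MinCount=1,
        MaxCount=1,
    )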

Demand for resources is very spiky over the course of a year. The ICT staff does not keep exact track of this, but their “gut feeling” is that machines are idle most of the time, which puts the utilization rate below 10%. For example, QA activities are tied to the product life-cycle, which concentrates a sharp peak of load twice a year, for a couple of months at most, and therefore requires a large but short-lived allocation of resources to release the product on time. The worst thing that could happen is to delay the release of a product because of a lack of available resources.
To prevent such impediments, the policy so far has been to heavily over-provision the capacity of the datacenter. But in times of cost cutting, budgetary constraints and energy-efficiency pressures, this approach is no longer viable. Until now, the ICT staff has contained this precarious situation, while reducing the number of physical servers, through an aggressive server consolidation effort started a couple of years ago using VMware virtualization. For historical and customary reasons, however, not all servers are yet virtualized. In addition, the ICT staff now sees a surge in demand for larger server configurations, typically fast eight-core CPU machines with 16 GB of memory or more, which the current virtualization layout cannot easily satisfy because most of the underlying physical servers in the datacenter are small-to-medium machines.

In addition to the points outlined above, the interview uncovered some other relevant goals that the project should address:
·  Allow administrators to centrally manage the labs' infrastructure and configuration, including policies that determine which resources can be allocated and consumed by whom (i.e. groups of users or individuals).
·  Allow testers to safely reserve virtual machines without conflict and provision automated test configuration scenarios on a scheduled or on-demand basis with no manual intervention (a sketch of such an on-demand request follows this list).
·  Allow testers to reliably request and securely access test configurations 24/7 through a standard browser, so that R&D personnel in any location worldwide can deploy test configurations via a self-service portal without requiring access to the physical hosts.
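As a purely illustrative sketch of the last two goals, the back end of such a self-service portal could reduce each request to a single tagged, unattended launch call. The portal, scheduler and reservation logic are assumed rather than taken from the interview, and all identifiers below are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    def provision_test_configuration(requester, image_id, instance_type, count):
        """Launch a tagged test environment with no manual intervention.

        All arguments would come from the (hypothetical) self-service portal
        or from a scheduler for pre-booked test runs.
        """
        response = ec2.run_instances(
            ImageId=image_id,
            InstanceType=instance_type,
            MinCount=count,
            MaxCount=count,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [
                    {"Key": "owner", "Value": requester},        # who reserved it
                    {"Key": "purpose", "Value": "qa-test-run"},  # hook for policy/quota checks
                ],
            }],
        )
        return [i["InstanceId"] for i in response["Instances"]]

    # Example: a tester requests two mid-size machines through the portal.
    instance_ids = provision_test_configuration("alice", "ami-0abcdef1234567890", "m5.large", 2)

Tagging each instance with an owner and a purpose is what would let administrators enforce the allocation policies mentioned in the first goal, and reclaim or report on resources per user or group.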
