Challenges
The following is a summary of an interview with the ICT team that manages the datacenter of (****NAME HIDDEN****) corp., which employs 135 people, 105 of whom are engineers working in software R&D. The ICT staff is in charge of maintaining the operations of about 1,000 machines, scattered across 9 labs of around 50 m² each and a central computer room of 160 m². The datacenter is a horrendous collection of computers with various hardware configurations, ranging from recycled personal workstations and inefficient old servers (e.g. Ultra1 and Ultra2 boxes) stacked in racks to more recent medium-to-high-end multi-core servers interconnected with large storage systems. The central computer room is primarily used to host the file and backup servers of the labs. It is interconnected with the 9 labs through fast 10 Gb Ethernet links. The heterogeneity and aging of the machines populating the datacenter make it difficult to manage cost-effectively: maintaining such a diverse ICT environment is time-consuming and requires a significant inventory of spare parts to replace faulty components.
Among the 1,000 or so machines, it is estimated that a small proportion cannot easily be externalized to the cloud. These machines are either used extensively (for example, release engineering runs non-regression tests on a daily basis against large amounts of data that would be too costly and impractical to migrate to the cloud) or require control over the hardware specifications (such as a SPARC versus an x86 processor architecture) to conduct performance benchmarks, product qualification, and similar platform certification tasks. Conversely, it is estimated that the vast majority of the x86-based servers could be externalized to the cloud for general development and testing purposes, on the condition that the solution ensures both excellent access performance and security. Amazon, with its Virtual Private Cloud (discussed below), and CohesiveFT were cited as companies that can bridge the datacenter with public cloud services in a secure manner, ensuring that neither data at rest nor data in flight is compromised. CloudSwitch was also cited as a well-regarded startup that pushes the concept of hybrid cloud computing further by offering a service that takes care of all the networking, isolation, management, security and storage concerns involved in moving VMware-based virtual machines to Amazon EC2.
The primary motivation for GEC to outsource parts of its development and test infrastructure to the cloud is the current inefficient use of lab resources. Secondly, the organization is seeking to reduce its hardware procurement and maintenance costs, as well as its electricity bills for cooling and running the systems, thereby supporting a greener IT positioning through lower CO2 emissions from sprawling server and storage farms. Thirdly, it seeks to become more agile and productive by providing developers and testers with a unified and more effective computing environment.
Currently,
GEC is facing several challenges that hamper its ability to reach these
objectives:
There is considerable contention among lab users, particularly between developers and testers, for access to available machines of the right configuration type. QA teams generally scramble for machines when an alpha release is delivered, relying heavily on the engineering teams to free up boxes from development.
Developers tend to be very inefficient in their use of lab resources because they dip in and out unpredictably across various engineering tasks. Meanwhile, they retain control of the machines they own for weeks and even months: they are reluctant to release them, since doing so would mean reinstalling and reconfiguring the different pieces of their working environment every time they get a machine back from the pool. It is common to see developers keep a dozen or so machines up their sleeve while in fact only using one or two at a time. Similarly, QA has to reinstall the entire software stack every time they get an available machine, because part of a real testing cycle is to ensure that the application mirrors a production environment, which means the machine needs a clean install with the appropriate versions of all software components. It can take days to manually set up a proper QA environment, from the time the alpha version is released to QA to the time QA actually has an environment properly set up and can begin testing.
The demand for resources is very spiky over the course of a year. The ICT staff does not keep exact track of this, but their "gut feeling" is that, in general, machines sit idle most of the time, which puts the utilization rate below 10%. For example, QA activities are tied to the products' life-cycles, which concentrate load into a sharp peak twice a year for a couple of months at most, requiring the allocation of a large, yet short-lived, amount of resources to release the product on time. The worst thing that could happen is to delay the release of a product because of a lack of available resources.
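To make that figure concrete, the back-of-the-envelope calculation below is a minimal Python sketch using purely illustrative numbers; the interview only provided the "below 10%" gut feeling and the twice-a-year peak, not measurements, so the busy fractions are assumptions chosen for the example.

# Back-of-the-envelope utilization estimate (all figures below are illustrative
# assumptions, not measured data from the interview).

TOTAL_MACHINES = 1000          # approximate size of the datacenter
MONTHS = 12

PEAK_MONTHS = 4                # two QA peaks of roughly two months each
PEAK_BUSY_FRACTION = 0.25      # assumed share of machines busy during a peak
IDLE_BUSY_FRACTION = 0.02      # assumed share of machines busy the rest of the year

machine_months_used = TOTAL_MACHINES * (
    PEAK_MONTHS * PEAK_BUSY_FRACTION
    + (MONTHS - PEAK_MONTHS) * IDLE_BUSY_FRACTION
)
machine_months_available = TOTAL_MACHINES * MONTHS

utilization = machine_months_used / machine_months_available
print(f"Estimated average utilization: {utilization:.1%}")  # prints 9.7%

Under these assumptions the pool of roughly 1,000 machines does useful work less than one month in ten, which is consistent with the staff's estimate.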
To prevent such impediments, the policy so far has been to largely over-provision the capacity of the datacenter. But in times of cost cutting, budgetary constraints and energy-efficiency pressures, this approach is no longer viable. The ICT staff has so far been able to contain this precarious situation, while reducing the number of physical servers, through an aggressive server consolidation effort started a couple of years ago using VMware's virtualization technology. However, for historical and customary reasons, not all servers are yet virtualized. In addition, the ICT staff is now seeing a surge in demand for larger server configurations (typically fast eight-core machines with 16 GB of memory or more) that the current virtualization layout cannot easily satisfy, because most of the underlying physical servers in the datacenter are small-to-medium machines.
In
addition to the points outlined above, the interview uncovered some other
relevant goals that the project should address:
·
Allow administrators to centrally manage the labs' infrastructure and configuration, including policies that determine which resources can be allocated and consumed by whom (i.e. groups of users or individuals).
·
Allow testers to safely reserve virtual machines without conflict and to provision automated test configuration scenarios on a scheduled or on-demand basis with no manual intervention (a minimal sketch of such a policy-checked reservation follows this list).
·
Allow testers to reliably request and securely access test configurations 24/7 through a standard browser, enabling R&D personnel in any location worldwide to deploy test configurations via a self-service portal without requiring access to the physical hosts.
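None of this tooling exists yet at GEC. The snippet below is only a minimal Python sketch of what a centrally managed allocation policy combined with conflict-free reservations could look like; all group names, quotas and VM sizes are hypothetical and chosen purely for illustration.

from dataclasses import dataclass
from datetime import datetime

# Hypothetical allocation policy: how many VMs of each size a group may hold.
POLICIES = {
    "qa":  {"small": 40, "large": 10},
    "dev": {"small": 60, "large": 5},
}

@dataclass
class Reservation:
    user: str
    group: str
    size: str
    start: datetime
    end: datetime

class ReservationBook:
    """Records reservations and enforces the group quotas defined in POLICIES."""

    def __init__(self):
        self.reservations = []

    def _overlapping(self, group, size, start, end):
        # Count existing reservations of the same group/size whose time window
        # overlaps the requested one.
        return sum(1 for r in self.reservations
                   if r.group == group and r.size == size
                   and r.start < end and start < r.end)

    def reserve(self, user, group, size, start, end):
        quota = POLICIES.get(group, {}).get(size, 0)
        if self._overlapping(group, size, start, end) >= quota:
            raise RuntimeError(f"Quota exhausted for group '{group}', size '{size}'")
        booking = Reservation(user, group, size, start, end)
        self.reservations.append(booking)
        return booking

# Example: a tester books a large VM for a two-day test window.
book = ReservationBook()
book.reserve("alice", "qa", "large",
             datetime(2024, 6, 1, 8), datetime(2024, 6, 3, 18))

In a real deployment the same policy table would also drive the self-service portal and scheduler mentioned above; the point here is simply that quota and overlap checks happen centrally, so testers cannot collide with one another.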