FAA Cloud Services Playbook

low gear: play 1

Migrate Apps off "Stacked" Infrastructure

The Tri-Modal Approach to Cloud Services described the migration of low gear applications to the cloud as uploading virtual machine images from a source VMware ESXi environment to a destination FCS cloud domain. This “drag-and-drop” approach is clearly attractive, but it carries an implicit requirement: the virtual machine images migrating to the cloud must effectively serve as the deployment unit of a particular application.

In FAA hosting environments, this requirement is largely unmet, because individual applications run on top of common middleware platforms (such as .NET application servers or Oracle RAC clusters) that host components of multiple applications on top of a single virtual machine. More likely than not, these applications have different functional requirements, security controls, and system owners, leading to different outcomes during a Cloud Suitability Assessment. While it made sense to minimize the number of platforms when conducting manual systems administration in a traditional FAA data center, this model does not work well in cloud environments.

As a result, applications moving to the cloud must be migrated to virtual machines that provide middleware services for only one application. This prevents infrastructure components from crossing system boundaries and enables migration of existing VM images to the cloud. While highly discouraged, it may be permissible to migrate common infrastructure to the cloud if all systems that reside on that shared infrastructure are cloud suitable without code or configuration changes. However, this should be considered a temporary measure and only done alongside a near-term commitment to refactor the various applications onto logically isolated deployment units in the cloud.
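As a rough illustration of how "stacked" infrastructure might be spotted in a hosting inventory, the sketch below groups application components by virtual machine and reports any VM that hosts more than one application. The inventory format and all application and VM names are hypothetical; a real inventory would come from Operations.

```python
from collections import defaultdict

# Hypothetical inventory: (application, virtual machine) pairs showing where
# each application's components run. All names are illustrative only.
PLACEMENTS = [
    ("flight-plans", "vm-dotnet-01"),
    ("crew-portal", "vm-dotnet-01"),
    ("notices", "vm-web-07"),
    ("flight-plans", "vm-db-02"),
]

def find_stacked_vms(placements):
    """Return {vm: sorted application names} for VMs hosting more than one application."""
    apps_by_vm = defaultdict(set)
    for app, vm in placements:
        apps_by_vm[vm].add(app)
    return {vm: sorted(apps) for vm, apps in apps_by_vm.items() if len(apps) > 1}

def apps_ready_to_lift(placements):
    """Applications whose VMs each host a single application."""
    stacked = find_stacked_vms(placements)
    blocked = {app for apps in stacked.values() for app in apps}
    return sorted({app for app, _ in placements} - blocked)
```

Applications that share any VM with a neighbor are reported as blocked, which mirrors the rule that each migrated VM should provide middleware services for only one application.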

Key Questions

Is your application virtualized?
Is your application running on shared virtual machines that also host components of several applications?
Is there a well-defined deployment unit running on the shared hosting environment (e.g., a Java EAR or WAR file, or .NET CLR assemblies)?
Does the middleware licensing agreement permit migration to cloud?
Does the middleware licensing agreement permit refactoring in FAA data centers?

Checklist

Complete a detailed Cloud Suitability Assessment to determine if your application is a candidate for migration
Work with Operations to understand how the application is currently hosted
Work with Legal to understand if your middleware licenses can be migrated to the cloud
Refactor applications within FAA data centers to prevent accidental data leakage of neighbor applications.

low gear: play 2

Configure Apps to scale vertically by resizing VMs

Although NIST standards equate elasticity with the horizontal scaling of nodes, it is also possible to scale virtual machines vertically: shut down a virtual machine instance, resize it to a larger virtual machine size, and restart it. Modern operating systems and properly configured middleware detect the larger resources and begin using them without additional manual intervention.
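The stop, resize, restart sequence can be sketched as follows. This is a minimal in-memory model, not a call into any real cloud or Agility Platform API; the `VirtualMachine` class and the size names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical size catalog, ordered smallest to largest. Real size names come
# from the cloud provider or the FCS contract.
SIZES = ["small", "medium", "large", "xlarge"]

@dataclass
class VirtualMachine:
    """In-memory stand-in for a cloud VM (an assumption, not a provider API)."""
    name: str
    size: str
    running: bool = True

def resize_up(vm: VirtualMachine) -> bool:
    """Stop the VM, move it to the next larger size, and restart it.

    Returns False and leaves the VM untouched if it is already the largest size.
    """
    index = SIZES.index(vm.size)
    if index == len(SIZES) - 1:
        return False
    vm.running = False           # shut down: this is the downtime window
    vm.size = SIZES[index + 1]   # resize while the instance is stopped
    vm.running = True            # restart: the OS sees the new resources on boot
    return True
```

The function refuses to act at the top of the size catalog, which is the point where an application would need a different scaling strategy.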

This practice is familiar to VMware administrators and likely occurs with FAA applications today. Unlike the FAA’s VMware environment, however, cloud providers do not support features such as “Memory Hot Add” or “CPU Hot Plug,” which allow the hypervisor to increase virtual machine resources dynamically without a reboot. A desire for these sorts of services likely influenced the inclusion of the VM_VPUA_ADD and network I/O bandwidth supplemental features in the FCS contract, but the actual implementation of these services would require a reboot.

While this is disruptive and likely requires downtime, it gives existing applications whose architecture is not suited to horizontal scaling a quick and easy path to increased capacity. To the greatest degree permitted by the FCS contract and the Agility Platform, low gear applications should use this capability to resize their virtual machines in anticipation of known spikes in demand or in response to degraded user experience.

Key Questions

Does your application currently take advantage of “Memory Hot Add” or “CPU Hot Plug” features in the ESXi environment?
Does your application rely on static configuration files that must be modified before being able to use additional resources?
Does your application have usage characteristics that provide off-peak hours conducive to rebooting?

Checklist

Collect historical performance data for the application if available, including all configuration changes to give the underlying compute infrastructure additional resources, such as “Memory Hot Add” or “CPU Hot Plug.”
Develop Load Tests and UX Load Tests based on the typical usage characteristics of your application.
Analyze the resulting performance metrics and determine if the application is CPU bound, memory bound, I/O bound, or cache bound (http://stackoverflow.com/questions/868568/what-do-the-terms-cpu-bound-and-i-o-bound-mean).
Develop metrics and scaling criteria to address the performance limitations at scale and implement these criteria using an orchestration engine.
Test to ensure scaling action works as expected.
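One way to turn load-test metrics into a scaling criterion is sketched below: average each resource's utilization and report the one that most exceeds its threshold. The threshold values and the sample format are assumptions; real values should come from load-test results for the specific application.

```python
# Hypothetical utilization thresholds in percent. These are placeholders, not
# FAA or FCS guidance.
THRESHOLDS = {"cpu": 85.0, "memory": 90.0, "io": 80.0}

def classify_bottleneck(samples):
    """Return the resource whose average utilization most exceeds its threshold.

    `samples` is a list of dicts such as {"cpu": 91.0, "memory": 40.0, "io": 22.0}.
    Returns None when there are no samples or no resource reaches its threshold.
    """
    if not samples:
        return None
    worst, worst_ratio = None, 1.0
    for resource, limit in THRESHOLDS.items():
        average = sum(sample[resource] for sample in samples) / len(samples)
        ratio = average / limit
        if ratio >= worst_ratio:
            worst, worst_ratio = resource, ratio
    return worst
```

A result of "cpu" or "memory" would suggest a resize to a larger VM size, while None suggests that no scaling action is needed for the sampled period.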

low gear: play 3

Backup automatically via Snapshots

Less-critical applications should consider implementing a “backup and restore” model of disaster recovery by periodically backing up the disk volumes associated with an application’s VMs as snapshots in object storage. This approach offers basic disaster recovery for applications with a less demanding recovery point objective (RPO), that is, applications that can tolerate losing the data written since the most recent snapshot.

In the FCS environments, the creation of snapshots can be automated by creating an Agility Platform Event Policy with a periodic event time based on the target RPO. Recovery of low gear applications occurs by reverting volumes to an existing snapshot. Applications shifting to medium gear may consider alternative recovery processes that use orchestration scripts to redeploy a recovery environment from scratch, either to preserve the existing environment for problem determination or to cut over gracefully to a fresh environment when the existing infrastructure is facing a “sick but not dead” scenario.

Note: AWS has backend restrictions that limit an account to five simultaneous snapshot actions.
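When an application has more volumes than the concurrency limit allows, the snapshot job has to work in batches. The sketch below shows the batching step only. The limit of five is taken from the note above and treated as a configurable assumption, and the volume IDs are hypothetical.

```python
# Limit taken from the note above. Provider limits change over time, so treat
# this as a configurable assumption rather than a fixed fact.
MAX_CONCURRENT_SNAPSHOTS = 5

def snapshot_batches(volume_ids, limit=MAX_CONCURRENT_SNAPSHOTS):
    """Split volume IDs into consecutive batches no larger than `limit`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [volume_ids[i:i + limit] for i in range(0, len(volume_ids), limit)]
```

Volumes that land in different batches are snapshotted at different times, so an application that needs a consistent point-in-time view across all of its volumes should account for that gap.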

Key Questions

Is the application’s RPO able to support the “backup and restore” model of disaster recovery?
Does the total storage cost of snapshots (number of snapshots per day needed to achieve the RPO × average size of incremental updates per snapshot × data retention period) make snapshots uneconomical when compared to continuous replication to a hot backup? If so, can the data retention policy be satisfied by shifting older snapshots to lower-cost archival object storage?
Does the number of disk volumes exceed the cloud service provider’s restrictions on the number of simultaneous snapshot actions? If so, does the lack of synchronization between disk snapshots make this approach unworkable?
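The storage estimate in the second question can be worked through with a few lines of arithmetic. This sketch assumes evenly spaced snapshots and counts only incremental data, ignoring the initial full snapshot. The price per GB is a made-up placeholder, not an FCS or provider rate.

```python
import math

def snapshots_per_day(rpo_hours):
    """Fewest evenly spaced daily snapshots such that no gap exceeds the RPO."""
    if rpo_hours <= 0:
        raise ValueError("rpo_hours must be positive")
    return math.ceil(24 / rpo_hours)

def retained_storage_gb(rpo_hours, avg_incremental_gb, retention_days):
    """Snapshots per day x average incremental size x retention period in days."""
    return snapshots_per_day(rpo_hours) * avg_incremental_gb * retention_days

def monthly_storage_cost(retained_gb, price_per_gb_month):
    """Storage cost per month at a given (hypothetical) price per GB-month."""
    return retained_gb * price_per_gb_month
```

For example, a 4-hour RPO needs 6 snapshots per day. At 2 GB of incremental change per snapshot and 30 days of retention, that is 6 × 2 × 30 = 360 GB retained, which can then be priced and compared against the cost of a continuously replicated hot backup.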

Checklist

Review existing recovery objectives prior to migrating to the cloud.
Once in the cloud, configure an Agility Platform Event Policy with a periodic event time based on the target RPO that creates snapshots of all disk volumes.
Adjust the Event Policy and policy for associated object storage buckets to ensure that proper security measures are in place for this data in flight and at rest, including access controls and encryption.
If permissible, adjust the Agility Platform Event Policy to shift older snapshots to lower-cost archival object storage.
Periodically test recovery actions for this application and adjust policy and procedures based on lessons learned.
If the application eventually creates orchestration scripts capable of deployment, use these orchestration scripts during restore activities.

low gear: play 4

Replicate app to physically separated backup

When disaster recovery requirements exceed those provided by basic backup and recovery of snapshots, the components of a low gear application shall be replicated from a hot primary environment to a cold standby environment with mirrored VMs located in a separate availability zone, region, or data center. The FAA envisions outsourcing the planning and execution of this capability to our cloud integrator through the higher-level Disaster Recovery CLINs outlined in section 7.6 of the FCS Cloud Computing Services Description.

If possible, the management platform (Agility Platform or vCenter) shall be configured with automation that performs health checks and, when a failure is detected, powers up the backup environment and cuts over to it. Alternatively, the management platform shall notify a human operator for manual failover. This solution can be made highly available by keeping a warm standby running at all times.
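The decision logic behind such automation might look like the sketch below. Requiring several consecutive failed health checks before acting is an assumption added here to avoid failing over on a single missed check. The function and action names are hypothetical and do not correspond to Agility Platform or vCenter features.

```python
def monitor(check_results, failures_before_failover=3, automated=True):
    """Walk health-check results in order and return the action taken.

    `check_results` is an iterable of booleans (True means healthy). Action is
    taken only after `failures_before_failover` consecutive failures.
    """
    consecutive = 0
    for healthy in check_results:
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= failures_before_failover:
            # Either cut over automatically or hand the decision to a person.
            return "failover" if automated else "notify-operator"
    return "none"
```

With `automated=False` the same detection logic drives the manual path, where the platform notifies an operator instead of cutting over.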

Key Questions

Does the application currently use replication software such as EMC Avamar to replicate to a standby environment?

Checklist

Review existing recovery objectives and align to either DR-TF1 or DR-TF2.
In conjunction with migration planning, contact the cloud integrator to allow them to begin DR planning.
Order DR-TF1 or DR-TF2 along with other FCS CLINs.
Complete acceptance testing, including an initial DR test.
Work with the cloud integrator to perform an annual DR test.

low gear: play 5

Administer by Command Line and Automate by Scripts

To the greatest degree possible, manual systems administration shall occur using a command line interface such as the Bash shell or PowerShell, and all activities shall be logged and retained.
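For administrators who script in Python, one possible way to keep a record of command-line activity is a thin wrapper that logs each command and its exit code. This is a sketch only: the logger name is arbitrary, and a real deployment would send these records to whatever retained log store the FAA designates.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
logger = logging.getLogger("admin-audit")  # hypothetical logger name

def run_logged(args):
    """Run a command given as an argument list, logging the command and its exit code."""
    command_line = " ".join(args)
    logger.info("running: %s", command_line)
    completed = subprocess.run(args, capture_output=True, text=True)
    logger.info("exit code %d: %s", completed.returncode, command_line)
    return completed

# Example: run a harmless command through the wrapper.
result = run_logged([sys.executable, "-c", "print('hello')"])
```

Passing the command as a list rather than a single string avoids shell interpretation of the arguments.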

PowerShell Training

Most Windows admins use the GUI, and if they know a command line, it’s probably the DOS prompt with scripting via *.bat and *.cmd files. Over the past few years, Microsoft has spent considerable time building out a richer, full-featured shell environment called PowerShell, and mastering this environment should be a top priority for all FAA personnel who regularly work with the Microsoft stack.

PowerShell 4 Foundations at CBT Nuggets
Windows PowerShell v2-v3-v4 Ultimate Training at CBT Nuggets
Getting Started with PowerShell 3.0 Jump Start at Microsoft Virtual Academy
Advanced Tools & Scripting with PowerShell 3.0 Jump Start at Microsoft Virtual Academy
Getting Started with PowerShell Desired State Configuration (DSC) at Microsoft Virtual Academy
Advanced PowerShell Desired State Configuration (DSC) and Custom Resources at Microsoft Virtual Academy

Linux/UNIX Bash Training

Most UNIX and Linux systems administrators are already experienced with working in the Bash shell. However, many personnel can benefit from training that goes beyond core shell scripting to more robust scripting languages such as Python. Here is a range of materials for novice to experienced administrators.

Introduction to Linux at edX
Advanced Bash-Scripting Guide at Linux Documentation Project
Python for Unix and Linux System Administration at Linux Tone

Key Questions

Do developers or administrators currently use any scripts to simplify their day-to-day work?
Are FAA developers and administrators trained on working in a command line environment and creating automation scripts?

Checklist

Identify any existing scripts and assess for suitability to formalize as an automation asset under version control.
Work to ensure FAA and contractor personnel are trained in CLI and scripting techniques.
In conjunction with migration to the cloud, update Windows Server instances to the latest FAA-approved version of the Windows Management Framework.
Perform administration via command line and judiciously generate and place new automation assets under formal version control.

medium gear: play 6

Separate app tiers into separate VM pools and subnets

Excepting the most trivial of applications, application tiers shall be refactored into separate pools of one or more virtual machines running in separate subnets. This achieves greater functional decoupling of the application tiers and enables greater flexibility in implementing security controls between those tiers using Network ACLs or other security models provided by the deployment environment.

Key Questions

What are the deployment units for the various application tiers?
What are the protocols and ports used to communicate between tiers?

Checklist

Separate the tiers into separate VMs located in separate subnets.
Tighten down the security controls for each subnet, allowing through only the identified and documented protocols and ports.
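The "allow only documented protocols and ports" rule amounts to a default-deny lookup, sketched below. The tier names, protocols, and port numbers are hypothetical examples; the real list comes from answering the key questions above, and enforcement would happen in Network ACLs or the equivalent control in the deployment environment.

```python
# Hypothetical documented flows between tiers: (source, destination) mapped to
# the set of permitted (protocol, port) pairs.
ALLOWED_FLOWS = {
    ("web", "app"): {("tcp", 8443)},
    ("app", "db"): {("tcp", 1521)},
}

def is_allowed(source_tier, dest_tier, protocol, port):
    """Default deny: permit only flows that are explicitly documented."""
    return (protocol, port) in ALLOWED_FLOWS.get((source_tier, dest_tier), set())
```

Because the web tier has no entry for the database tier, any direct web-to-database traffic is rejected regardless of port.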

medium gear: play 7

Automate deployment of Operating Systems using Stem Cell Images

Avoid golden images and instead build virtual machines on base Stem Cell Images. These images eschew specialization, middleware, and server roles and provide only the minimal subset of operating system components and management agents (from either the FAA or external partners such as CSGov) needed for a pre-hardened base operating system targeting a particular deployment environment (on-premises FAA versus cloud).

In the case of FCS, these stem cell images are orderable in the Agility Designer tool as Agility Workloads. To ensure commonality between FCS and on-premises FAA environments, the FAA should formally define and maintain standard server images that apply across all environments and work with CSGov to have the images behind the FCS Operating System supplemental features implement these base standards.

Key Questions

Do you currently have one or more golden images for your application?
Do you have any waivers or exemptions for current FAA server standards?

Checklist

In preparation for this play, AIF must adopt standards and change control for server images similar to client workstations.
In preparation for this play, AIF must create standard “stem cell images” based on our core server standards and work with our cloud integrator to ensure functional commonality.
Assess the golden image state against the new FAA standard “stem cell image” and document the differences for inclusion in automation to be deployed by the automated configuration management system.
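The comparison in the last checklist item can be thought of as a diff between two package manifests. The sketch below assumes each image can be described as a mapping from package name to version, which is a simplification; the package names in the usage example are invented.

```python
def image_gap(golden, stem_cell):
    """Compare {package: version} maps and return what automation must add or change.

    "install" lists packages present only in the golden image. "change" lists
    packages whose version differs, as (stem cell version, golden version).
    """
    to_install = {pkg: ver for pkg, ver in golden.items() if pkg not in stem_cell}
    to_change = {
        pkg: (stem_cell[pkg], ver)
        for pkg, ver in golden.items()
        if pkg in stem_cell and stem_cell[pkg] != ver
    }
    return {"install": to_install, "change": to_change}
```

For instance, `image_gap({"openssl": "1.0.2", "tomcat": "8.0"}, {"openssl": "1.0.1"})` reports that the middleware package must be installed and that one base package differs in version.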

medium gear: play 8

Automate deployment of Platforms and Middleware stacks using an Automated Configuration Management tool

In concert with the base Stem Cell images discussed above, automate the installation and configuration of middleware using automation assets developed on a common automated configuration management platform that spans all FAA data centers and cloud environments. These automation assets shall deploy the standard configurations of software stacks, such as those outlined in AIT Business Plan Item 15C.119B1, Standard Configurations and Platforms. Examples of automated configuration management tools include Puppet, Chef, and Ansible.

Key Questions

Does the application currently have automation under version control involved in deployment or administration?

Checklist

Prior to play, AIT must create a standard for automated configuration management and work with our cloud integrator to use this solution.
Use the gap analysis results of comparing current golden image and new FAA-wide “stem cell” image standard to write a specification for the desired state needed to achieve functional parity with your pre-existing golden image.
Identify additional manual deployment steps and add to specification.
Implement automation to achieve this desired state using the standard automated configuration management tool.
Test to ensure automation works properly
Maintain automation under version control and manage as a software asset.
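The "desired state" idea behind these tools can be illustrated in plain Python. The sketch below is not the API of Puppet, Chef, or Ansible; it only shows the convergence pattern of comparing current settings with a specification and changing what differs. The setting names in the usage example are hypothetical.

```python
def converge(current, desired):
    """Bring `current` in line with `desired` and return the actions performed.

    Both arguments map a setting name to its value, and `current` is updated in
    place. A second run returns an empty list because nothing is left to change.
    """
    actions = []
    for key in sorted(desired):
        if current.get(key) != desired[key]:
            actions.append(f"set {key}={desired[key]}")
            current[key] = desired[key]
    return actions
```

Running `converge` twice against the same specification makes changes only the first time, which is why automation written this way is safe to re-run during testing.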

medium gear: play 9

Automate deployment of Applications using an Orchestration Engine.

A medium tier application most likely is a complex multi-tiered application with isolated tiers implemented on pools of virtual machines. In the previous two plays, the deployment of operating systems, platforms, and middleware was automated through the combination of “stem cell” images and automated configuration management assets. However, in order to fully automate the deployment of a complex multi-tiered application, an orchestration engine must deploy these components according to an application architecture. In the case of Agility Platform, this is performed by creating a top-level Agility Blueprint through the Agility Designer Tool using embedded blueprints, workloads, and packages. If architecturally appropriate, these Agility Blueprints would implement horizontal scaling through scaling plans that tie to CCSD 6.1.1 monthly-priced virtual machines with the operating system and elasticity supplemental features.
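Deploying components "according to an application architecture" largely means respecting dependencies between them. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to derive a deployment order. The component names and their dependencies are hypothetical, and this does not represent how Agility Blueprints are implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical architecture: each component maps to the components it depends on.
ARCHITECTURE = {
    "network": set(),
    "database": {"network"},
    "app-tier": {"network", "database"},
    "web-tier": {"network", "app-tier"},
}

def deployment_order(architecture):
    """Return components ordered so that every dependency is deployed first."""
    return list(TopologicalSorter(architecture).static_order())
```

If the architecture contains a circular dependency, `static_order` raises `graphlib.CycleError`, which is a useful early check on the template itself.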

Key Questions

Has the deployment of all operating systems, platforms, and middleware been successfully automated?
Does the orchestration engine fully align with the system security boundary or does the application integrate with legacy systems of record?

Checklist

Create a master orchestration template for the application that uses images and automation assets to implement the application’s architecture.
Place this master orchestration template under version control and work with configuration management and security to make this the primary unit of analysis for their respective processes.

medium gear: play 10

Develop Images, Automation, and Orchestration using agile techniques and SCM tools.

In an Infrastructure as Code world, traditional operations teams need to form agile infrastructure teams capable of building and managing a service catalog of automation assets using the same agile techniques and professional software configuration management (SCM) tools used by application developers. In a DevOps setting, these agile infrastructure teams work closely with agile application / digital services teams to smooth out the traditional division between development and operations. The assets produced by these agile infrastructure teams populate a service catalog and are consumed by application teams on an as-needed basis. The application teams are consumers of these infrastructure automation assets and stakeholders that make feature and function requests and potentially fork automation code and initiate pull requests. The agile infrastructure team works with configuration management and security to get each asset pre-approved and pre-authorized in order to minimize the authorization footprint for individual FAA applications.

Key Questions

Where should an agile infrastructure team reside within the FAA’s IT Shared Services organization?

Checklist

AIT must form agile infrastructure teams focused on delivering automation assets
Leverage existing SCM tools used for applications for the creation of automation assets
Create and execute a development lifecycle for automation assets

medium gear: play 11

Deploy Apps through dev, test, and prod using a master orchestration script

Applications combine stem cell images, automation assets, and nested orchestration scripts into a single master orchestration script that acts as the major deployment unit for applications as they move through dev, test, and prod throughout the software development lifecycle. These orchestration scripts shall programmatically define scale units and scaling criteria for deployment environments that provide elasticity.
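To make "programmatically define scale units and scaling criteria" concrete, the sketch below models a master template as data with a scaling rule per environment. Every field name, image name, and threshold is a hypothetical placeholder, and the structure is not the Agility Platform's format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalingRule:
    """Scale unit bounds and the CPU level that triggers adding one node."""
    min_nodes: int
    max_nodes: int
    scale_out_cpu_percent: float

# Hypothetical master template: one definition reused across dev, test, and prod.
MASTER_TEMPLATE = {
    "image": "stem-cell-linux",
    "automation": ["middleware", "application"],
    "scaling": {
        "dev": ScalingRule(1, 1, 100.0),
        "test": ScalingRule(1, 2, 80.0),
        "prod": ScalingRule(2, 8, 70.0),
    },
}

def desired_nodes(environment, current_nodes, cpu_percent):
    """Add one node when CPU reaches the threshold, clamped to the environment's bounds."""
    rule = MASTER_TEMPLATE["scaling"][environment]
    target = current_nodes + 1 if cpu_percent >= rule.scale_out_cpu_percent else current_nodes
    return max(rule.min_nodes, min(rule.max_nodes, target))
```

Keeping the image and automation lists identical across environments while varying only the scaling rule is what lets one template serve as the deployment unit from dev through prod.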

The master orchestration script acts as the deployment unit across the DevOps toolchain, and in the case of the FCS cloud, this process of deploying applications across this toolchain should be managed by the Agility Release Manager.

Checklist

Develop a standard FAA DevOps Toolchain.
Configure this FAA DevOps Toolchain in the Agility Release Manager.

medium gear: play 12

Push software updates through an automated configuration management system

Updates to all software installed and configured by automation assets shall not occur via manual systems administration, but shall occur across the FAA’s portfolio of applications by updating the automation asset and having an automated configuration management system push out the update. In situations where manual systems administration is required, this shall occur using a command line interface such as the Bash shell or PowerShell, and all activities shall be logged and retained.

Checklist

Identify updates or hotfixes that span multiple systems.
Develop an automated configuration management unit of work that can be pushed out to multiple systems, including a plan to roll back the change if needed.
Test the unit of work against a representative sample of systems.
Work with security and configuration change management to get blanket approval to push this change across multiple systems.
Validate successful deployment of the change and close out the change with the security and configuration change management teams.
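The push-with-rollback pattern in this checklist might be sketched as follows. The `apply` and `rollback` callables stand in for whatever the configuration management tool actually does on each system; they are assumptions for illustration, not part of any specific product.

```python
def push_update(systems, apply, rollback):
    """Apply an update to each system in order, rolling back on the first failure.

    `apply(system)` must return a truthy value on success. If any system fails,
    every system already updated in this run is rolled back, newest first, and
    the function returns False.
    """
    updated = []
    for system in systems:
        if apply(system):
            updated.append(system)
            continue
        for done in reversed(updated):
            rollback(done)
        return False
    return True
```

Running this first against a small representative sample, and only then against the full list, matches the checklist's advice to test before seeking blanket approval.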

high gear: play 13

Reuse U.S. Digital Services Assets and Default to Open

Members of the U.S. Digital Services community have spent considerable time developing a general framework for designing cloud-native digital services and they strongly encourage reuse of their assets. In fact, this site uses one of their templates. In accordance with OMB wishes, high gear applications shall be designed using the U.S. Digital Services Playbook and implemented using the U.S. Web Design Standards as a starting point.

Checklist

Search USDS and 18F assets when investigating the development of new capabilities.
Consider using 18F RFP / contract ghostwriting services when pursuing new capabilities.
With the exception of privileged or protected information, make FAA assets openly available and share with DOT and other agencies.

high gear: play 14

Design for cloud by default... even if you can't deploy there yet.

Cloud-native apps use highly distributed architectures that decompose functionality into load-balanced pools of loosely coupled stateless nodes that communicate over well-defined standard interfaces, such as message queues. This architecture enables horizontal scaling and designs for failure using automated health checks that look for node failure and automatically restore service.
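A toy version of a load-balanced pool with health checks is sketched below. It is a simplified in-process model with hypothetical node names, meant only to show why stateless nodes make failure handling simple: any healthy node can take the next request.

```python
import itertools

class NodePool:
    """Round-robin pool of stateless nodes that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.healthy = {node: True for node in nodes}
        self._cycle = itertools.cycle(list(nodes))

    def mark(self, node, healthy):
        """Record the result of a health check for one node."""
        self.healthy[node] = healthy

    def next_node(self):
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self.healthy)):
            node = next(self._cycle)
            if self.healthy[node]:
                return node
        raise RuntimeError("no healthy nodes available")
```

Because no node holds session state, removing a failed node from rotation requires no data migration, and marking it healthy again restores it to the pool.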

Even in situations where latency or coupling with legacy systems of record might seem to make a particular app unsuitable for cloud in the near term, development should target these architectural principles anyway. Federal cloud is rapidly evolving to include support for FISMA high and DHS-compliant TIC cloud services, so there is a high likelihood that new applications that are not initially cloud suitable will become suitable for cloud services during their lifespan. Until that becomes reality, the FAA operations team should partner with our vendors to judiciously bring cloudy innovations back into FAA data centers. In a world where Node.js and MongoDB apps run on Ubuntu on mainframe and VMware offers drop-in modules for OpenStack and Docker containers, there is little excuse not to design with cloud in mind.

Key Questions

Does your application rely on legacy systems of record such as IBM mainframe, high-end UNIX systems, or non-virtualized NAS systems?
If an aspect of the application acts as an inhibitor to moving to the cloud, can pieces of your application move to the cloud?
If unable to move to the cloud, is your current hosting provider able to establish production-ready equivalent services by your need-by date?

Checklist

Complete a detailed cloud suitability assessment
Identify major cloud inhibitors
Work with the FCS program office to compensate for those inhibitors
Work with your current hosting provider to provide API-driven provisioning, container hosting, and other core cloud features

high gear: play 15

Maximize use of vendor-agnostic services capable of deployment to alternative CSPs

The FAA Cloud offers a hybrid cloud brokerage model that allows applications to migrate between AWS, Azure, and private clouds via the Agility Platform. That may not seem like a killer feature to developers, but it most certainly does to operations, enterprise architects, and the IT executive team. As such, new high gear applications should maximize use of the vendor-neutral subset of cloud capabilities offered by the FCS contract via the Agility Store cloud service catalog. All applications seeking specialty high-value proprietary cloud services should contact the FCS program office early in their requirements gathering process to allow for a cost-benefit analysis weighing functionality against lock-in, portfolio management, and integration work.

For example, a system owner may consider the Watson Visual Recognition service to be a perfect fit for a new Google Glass safety inspector app. The FCS program office may not be able to provide this particular service, but it will work with our strategic cloud partner to bring an architectural alternative into the cloud service catalog to fulfill your requirements, such as Google’s open-source TensorFlow technology provisioned on top of virtual machine contract line items. Alternatively, the FCS program management team may approve the use of a specialty proprietary cloud service contingent on the implementation of compensating controls, such as the abstraction of APIs into modularized connectors that enable development of a drop-in replacement for the service should the need arise.
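The "modularized connector" idea can be sketched as an interface that application code depends on, with interchangeable implementations behind it. Both connector classes below are stubs with invented return values. They do not call any real vendor service, and the class and function names are hypothetical.

```python
from typing import Protocol

class ImageClassifier(Protocol):
    """Interface the application codes against, independent of any vendor."""
    def classify(self, image_bytes: bytes) -> str: ...

class ProprietaryConnector:
    """Would wrap a vendor's hosted API. Stubbed here so the sketch runs offline."""
    def classify(self, image_bytes: bytes) -> str:
        return "proprietary:unknown"

class SelfHostedConnector:
    """Drop-in replacement, such as a model run on the program's own virtual machines."""
    def classify(self, image_bytes: bytes) -> str:
        return "self-hosted:unknown"

def inspect(image_bytes: bytes, classifier: ImageClassifier) -> str:
    """Application code that knows only the ImageClassifier interface."""
    return f"label={classifier.classify(image_bytes)}"
```

Swapping the proprietary service for an alternative then means passing a different connector to `inspect`, with no change to the application logic itself.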

Key Questions

Does my application require non-commodity specialty cloud services?
Are there special APIs or SaaS capabilities that offer significant potential value but require FCS program office approval?
Does my need-by date provide me ample time to work through the FCS processes to onboard new cloud services?

Checklist

Map functional requirements against services in the Agility Store
Work with the FCS program office to perform functional gap analysis and identify new services worth adding to the FCS contract.
