Patch Management on Microsoft Azure

This blog post is intended to provide detailed guidance on setting up a Patch Management process on the Microsoft Azure Cloud.

For all Cloud IaaS deployments, having a Patch Management process is essential. It is as important as the patch management process at your on-premises datacenter.

Why do we need Patch Management?

There are many compelling reasons, like:

  • Plugging security vulnerabilities in the OS or installed Application Software
  • Proactive protection against newer threats and malware
  • Fixing existing platform/software bugs
  • Performance and stability improvements
  • Addressing known issues
  • Meeting compliance requirements (like SOX)
  • And many more…

In this post, we will look at Patch Management for Cloud IaaS deployments, specifically on Microsoft Azure, and for Windows Server-based Azure VMs. We will not specifically cover Linux-based Azure VMs here, but the same base guidance applies to them equally.

What we discuss here would equally apply to any other Cloud IaaS platform, like AWS or GCP. We will, however, occasionally reference the traditional on-premises patch management process wherever required.

Fundamentally, the Cloud IaaS model is a virtualized abstraction of physical Infrastructure. It is built on underlying clusters of physical host servers of various capacities/capabilities. The responsibility of patching these underlying physical host servers rests with the Cloud provider. In the case of Azure, Microsoft holds this responsibility.

However, for VMs provisioned on the Cloud IaaS layer, VM maintenance is the sole responsibility of the customers. This model of shared responsibility is the same across all Cloud providers, like Azure, AWS, GCP, etc.

Now let’s focus on Patch Management from a Microsoft Azure perspective.

Microsoft does regularly update the VM Images it publishes in the Azure Marketplace with the latest patches. These Images are thoroughly tested for stability before being published in the Marketplace. However, Microsoft does not make the update frequency/schedule for these VM Images public. Hence, whenever you create a new VM from an Azure Marketplace image, you would be lucky to get one that has just been updated with the latest patches, saving you from applying any additional updates (a rare chance). In most cases, depending on how recently Microsoft updated the Image, you will have to download a larger or smaller delta of the applicable patches.
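
As a quick check, you can list the published versions of a Marketplace image to see how recently it was refreshed. Below is a minimal sketch using the AzureRM PowerShell module; the region, publisher, offer, and SKU values are only illustrative.

```powershell
# List published versions of a Marketplace image (the version string usually encodes the build date).
# Assumes the AzureRM module is installed and you have logged in with Login-AzureRmAccount.
$location  = "westeurope"                 # illustrative region
$publisher = "MicrosoftWindowsServer"
$offer     = "WindowsServer"
$sku       = "2016-Datacenter"

Get-AzureRmVMImage -Location $location -PublisherName $publisher -Offer $offer -Skus $sku |
    Sort-Object Version -Descending |
    Select-Object Version, Skus, Offer |
    Format-Table -AutoSize
```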

Given that both the Windows Server OS and the Azure platform are Microsoft products, it would have been ideal if Microsoft had a native automated patch management service in Azure.

[Update Start: 28th Jan 2018]

However, Microsoft does not currently have any full-featured standard Azure-native service offering for patching/update management. At best, what they offer is a revised “Update Management” solution (still in preview as of the date of this blog post version) through Azure Automation, which is linked with another Microsoft service called Microsoft Operations Management Suite (OMS). This new Update Management solution collects update-related data from all the VMs (Windows/Linux) deployed in Azure and/or on-premises (Hybrid setup) through Microsoft OMS agents installed on those VMs and pushes that data to OMS. Thereafter, you can use OMS to monitor the update status of the monitored VMs, see which ones are missing any updates, and centrally push the installation of those missing updates. However, the OMS-based Update Management solution currently misses many critical features/capabilities essential for a good update/patch management solution and is a no-go option for any production IaaS deployments.

[Update End: 28th Jan 2018]

Microsoft still expects customers to either do the patch management manually themselves (using native tools like WSUS, MBSA, PowerShell, etc.) or use commercial patch management systems. This strategy does not make things any easier for customers. However, it does indirectly benefit an ecosystem of ISVs who build such products to be sold commercially.

You can see my feedback to Microsoft around this concern at the Official Azure User Voice forum here: Azure User Voice. Once I get a response to this feedback, I will update this post with the response.

Organizations considering either migrating their existing on-premises workloads to Azure or building net-new Cloud infrastructure will necessarily need to consider having a Cloud patch management process.

  • Orgs that already have an existing, mature on-premises patch management process may assume that all they need to do is follow the same process on Azure. While that is true to some extent, they will still need to revisit their existing process and fine-tune it for the Azure IaaS model
  • Orgs that do not already have an existing or mature patch management process can follow the guidance in this post to help them establish one for their Azure IaaS environment.

Let’s look at the following step-wise approach an Organization should consider for establishing a patch management process on Azure (or any Cloud IaaS, for that matter):
  1. Prepare Patch Inventory
  2. Perform VM Baselining
  3. Discover Patch Notification & Repository Channels
  4. Setup Patch Management System
  5. Patch Testing & Authorization
  6. Patch Monitoring

Stage 1 – Prepare Patch Inventory

You should first create a Patch Inventory, which should capture the following information for your IaaS deployment (a minimal example record is sketched after the list):

  • Identify and list all patches, past and present, for each VM Server OS version – you can start with the patches applied to the VM Baseline you prepare (see Stage 2 below)
  • All patches that failed during testing and were eventually never applied in production – when, and for what reason(s)?
  • All patches that failed during testing but were later fixed and applied in production – why/how/when?
  • Details of any patch-related support incidents raised with Microsoft PSS or an external support provider
  • Authorization status for each patch – this will come after the Patch Testing stage
  • Production impact of applying each patch – this will come from the Patch Testing stage
  • Justification for applying the patch in production – this will come from the Patch Testing stage
  • Approvals for applying the patch in production – this will come after the Patch Testing stage
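
For illustration, here is a minimal sketch of what one patch inventory record could look like when maintained as a simple CSV file via PowerShell. The field names and file path are assumptions for this example, not a prescribed schema.

```powershell
# Append a hypothetical patch inventory record to a CSV file (schema is illustrative only).
$inventoryPath = "C:\PatchManagement\PatchInventory.csv"    # assumed location

$record = [PSCustomObject]@{
    PatchId             = "KB4056890"              # example KB number
    TargetOSVersion     = "Windows Server 2016"
    TestStatus          = "Passed"                 # Passed / Failed / Deferred
    TestDate            = "2018-01-20"
    AuthorizationStatus = "Authorized"             # Authorized / Rejected / Deferred
    ProductionImpact    = "Requires reboot"
    Justification       = "Critical security update"
    ApprovedBy          = "Change Advisory Board"
    AppliedInProduction = $true
    AppliedDate         = "2018-01-27"
}

# -Append requires PowerShell 3.0 or later.
$record | Export-Csv -Path $inventoryPath -NoTypeInformation -Append
```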

Additionally, you should also prepare another related inventory for the production VMs in your environment, which should capture the following information (a sketch for collecting some of this data follows the list):

  • List of all Azure production VMs deployed in the concerned Azure IaaS solution
  • For each production Azure VM:
    • Configuration information, like Server OS/version and installed Software/versions
    • Role, function, business and security criticality
    • Access/ownership information
    • All patches applied to the VM in chronological order – from VM provisioning to the current date
    • For each patch successfully applied – testing date and outcome status
    • Any patch rollbacks performed – why/how/when?
    • All rollbacks performed due to issues arising from failed/rogue patches – why/how/when?
    • Known security issues, and newly discovered ones
    • Change tracking/history for any changes to security levels
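
Below is a minimal sketch for seeding such a VM inventory. It assumes the AzureRM module for VM metadata, that PowerShell Remoting (WinRM) is reachable on each VM for the installed-patch list, and that the output path is your own choice.

```powershell
# Seed a basic VM inventory: Azure VM metadata plus installed hotfixes per VM.
# Assumes Login-AzureRmAccount has been run and WinRM access to each VM is available.
$vmInventoryPath = "C:\PatchManagement\VMInventory.csv"    # assumed location

$vms = Get-AzureRmVM    # all VMs in the current subscription

$inventory = foreach ($vm in $vms) {
    # Installed hotfixes, queried remotely (will be empty if the VM is unreachable).
    $hotfixes = Invoke-Command -ComputerName $vm.Name -ScriptBlock { Get-HotFix } -ErrorAction SilentlyContinue

    [PSCustomObject]@{
        VMName           = $vm.Name
        ResourceGroup    = $vm.ResourceGroupName
        OSType           = $vm.StorageProfile.OsDisk.OsType
        ImageSku         = $vm.StorageProfile.ImageReference.Sku
        InstalledPatches = ($hotfixes.HotFixID -join ";")
        CollectedOn      = (Get-Date -Format "yyyy-MM-dd")
    }
}

$inventory | Export-Csv -Path $vmInventoryPath -NoTypeInformation
```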

These inventory items should be updated regularly at a predefined frequency, which will depend on the patching cycle you want to follow. Inputs for this inventory will also come from later stages in the patch management process, such as the Patch Testing stage.

The above-listed inventory data points are not exhaustive, but they should give you a fair idea of the level of inventory you must have before embarking on incorporating a patch management process on Azure.

Stage 2 – Perform VM Baselining

Baselining VMs refers to building an initial stable configuration of the VMs, established at a specific point in time. This means that the VM Server OS, the Application Software installed within it, and any initial configurations done on either of these are thoroughly tested, found stable, and standardized for use as a base VM configuration. Baselining VMs enables us to reliably restore them from any future state to a previously stable state, and helps to probe and rectify any potential problems with a later version. It also helps to minimize the number of patches/updates we need to deploy on the VMs and gives us the ability to monitor compliance at a granular level.

For baselining Azure VMs, you should consider the following high-level process (a small sketch for capturing the patch portion of a baseline follows the list):

  • Group the Azure VMs in your Azure IaaS deployment into different Asset categories
  • Prepare and maintain standard VM baselines for each category, which should have similar Azure VM Server OS/version, Application Software/version, and patches
  • You could either have a single VM baseline for all Asset categories in your deployment or have different VM baselines for each Asset category
  • Whether you need a single or multiple VM baselines primarily depends on the differences between VM and Application Software configuration across different Asset categories, and how certain patches affect different baselines differently
  • Prioritize distribution of patches to Azure VMs on the basis of Asset categories
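
One simple way to record the patch portion of a baseline is to export the installed hotfixes of a tested reference VM for each Asset category. A minimal sketch; the category name and output path are assumptions.

```powershell
# Capture the installed-hotfix baseline of a reference VM for one Asset category.
# Run on (or remotely against) the reference VM; names and paths are illustrative.
$assetCategory = "WebFrontEnd"
$baselinePath  = "C:\PatchManagement\Baselines\$assetCategory-baseline.csv"

Get-HotFix |
    Select-Object @{Name = "AssetCategory"; Expression = { $assetCategory }},
                  HotFixID, Description, InstalledOn |
    Export-Csv -Path $baselinePath -NoTypeInformation
```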

Stage 3 – Discover Patch Notifications and Repository Channels

Next, you will need to discover and set up channels for getting regular notifications about new patches for the VM Server OS/version and the Application Software/versions installed within the VMs.

You will also need a repository source/mechanism to download these patches to an Update Server (where they will first be tested against the VM Asset categories), preferably through an automated mechanism.

For the Windows Server OS running on Azure VMs, and any other Microsoft Application Software installed within them, you can get regular notifications through the Microsoft Security Bulletin Service from the Microsoft Security Response Center (MSRC). You can then automatically trigger the download of these patches/updates through existing native services/tools (like WSUS, MBSA, etc.).
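
If you use WSUS as the repository on your Update Server, the UpdateServices PowerShell module that ships with the WSUS role can show what has synchronized and what still awaits approval. A minimal sketch, assuming it runs directly on the WSUS server:

```powershell
# Run on the WSUS server; the UpdateServices module is installed with the WSUS role.
Import-Module UpdateServices

# List critical updates that clients need but that have not yet been approved for distribution.
Get-WsusUpdate -Classification Critical -Approval Unapproved -Status FailedOrNeeded |
    Select-Object UpdateId, @{Name = "Title"; Expression = { $_.Update.Title }} |
    Format-Table -AutoSize
```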

However, for non-Microsoft Application Software installed in the Azure VMs, this will vary greatly and will depend on the existing update notification channels of those software vendors (whether they exist, what frequency they operate on, and in what form), as well as on the available download mechanisms.

Stage 4 – Setup Patch Management System

After you have discovered and set up patch notification and repository channels, the next step is to set up a patch management system.

Before you move forward on selecting a patch management system, you should:

  • Determine one or more locations (a.k.a. Update Servers) where the patches will be downloaded for further distribution. For an Azure IaaS environment, you could have these Update Server(s) located on Azure itself, on-premises (in case of a Hybrid setup), or at both places. You will need to decide carefully where these Update Server(s) should be for your specific scenario; this will depend heavily on your current Infrastructure architecture. Some common scenarios are described below:
  • Cloud-only scenario: If your entire Infrastructure is on Azure, you will obviously have the Update Server(s) on Azure itself. If your deployment is spread across Azure regions/subscriptions, a better idea would be to have an Update Server for each region/subscription combination
  • Hybrid Cloud scenario: If you have a majority of your servers (>50%) on-premises but relatively fewer servers on Azure (>10%), or vice versa, you should consider having an Update Server both on-premises and on Azure. If you have a very minimal number of servers (<10%) at either location, compared to the other location having the majority of the servers (>=90%), you are better off having an Update Server only at the location with the majority of the servers and distributing the patches/updates to the location with minimal servers
  • Remember this – if you have Update Server(s) on-premises and will be pushing patches/updates to Azure, or vice versa, you will generate considerable traffic between the two boundaries, with reliability, latency, and cost implications
  • Ensure that you maintain a patch inventory for production (based on stable criteria) and for pre-production environments, as given in Stage 1 above – this will simplify the overall patch management process

There are a number of tools/solutions available for patch management, a few from Microsoft and several from commercial vendors. Some of these tools/solutions support only the Windows Server OS, and others also support Linux Server OS. You could use any of these tools/solutions for your Azure IaaS environment. However, your choice will depend on factors like implementation effort, time, cost of deployment, licensing, support options, etc.

A few such popular tools/solutions are listed below:

  1. Microsoft Baseline Security Analyzer and WSUS – Free
  2. System Center Configuration Manager (SCCM) – Paid
  3. Microsoft OMS – Paid
  4. SolarWinds Patch Manager – Paid
  5. Shavlik Protect + Empower, and Shavlik Patch – Paid
  6. LANDesk Patch Manager – Paid
  7. GFI LanGuard – Paid
  8. PDQ Deploy Pro – Paid

Some of these tools offer limited support for a few stages detailed in this post, but none of them supports the whole defined process end-to-end.

Stage 5 – Patch Testing & Authorization

You need to establish a mandatory Patch Testing process as part of the overall Patch Management process. Let us look at why.

Imagine a scenario where you apply a new patch to one or more VMs in your Azure IaaS environment. You then discover that suddenly one or many things have stopped working. Maybe you are unable to RDP into the VMs, an installed application starts misbehaving, or a host of other problems surface. These are some of the many common issues that frequently occur when you don’t test patches before applying them to production VMs.

Testing patches before applying them to production Azure VMs is a mandatory step you will need to follow rigorously. Not doing so may have very serious implications for your deployment.

  • You should NEVER consider applying any patches directly to the Azure VMs in your Production environments. It is a BIG RISK, whichever way you look at it.
  • You should first test patches on Azure VMs in a test (Pre-Production/Staging) Infrastructure environment on Azure, with configuration/roles equivalent to those of the Production environment Azure VMs. You might ask why the test environment needs Azure VMs with the exact same configuration/roles: this is so that you don’t get unpredictable outcomes from applying patches to different VM configurations.

However, misses do happen in real life, and a few untested patches may very well make their way to production Azure VMs. Also, if the testing process is not thorough, problematic patches can easily escape undetected to production, causing issues.

When untested patches make their way to the production environment, they may fail and also break the current configuration/operations of the VMs. Your patch management process should have the ability to roll back and restore those Azure VMs to an earlier restore point. Not being able to do so can seriously compromise the intended functioning of the concerned VMs.

For VM rollbacks to be possible, you need to already be performing regular backups of your Azure VMs. A couple of options for taking backups are an Azure Recovery Services vault and System Center DPM.
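
For example, an on-demand backup of a VM can be triggered before a patching window using the AzureRM Recovery Services cmdlets. Below is a minimal sketch; the vault, resource group, and VM names are assumptions, and the VM is assumed to already be protected by the vault.

```powershell
# Trigger an ad-hoc backup of an already-protected Azure VM before patching begins.
$vault = Get-AzureRmRecoveryServicesVault -ResourceGroupName "rg-backup" -Name "patch-mgmt-vault"
Set-AzureRmRecoveryServicesVaultContext -Vault $vault

$container = Get-AzureRmRecoveryServicesBackupContainer -ContainerType AzureVM -Status Registered -FriendlyName "prod-web-vm01"
$item      = Get-AzureRmRecoveryServicesBackupItem -Container $container -WorkloadType AzureVM

# Starts the backup job; track its progress with Get-AzureRmRecoveryServicesBackupJob.
Backup-AzureRmRecoveryServicesBackupItem -Item $item
```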

All patch testing activity should be recorded in a separate testing repository and should be referenced/recorded against the existing Patch Inventory from Stage 1.

Depending on whether a patch passed or failed during testing, an authorization status should be assigned to it in the Patch Inventory. This authorization status will determine whether a patch is ready to be applied to the target VMs (or VM Asset categories), needs to be deferred for future testing, or is rejected.

After successful authorization of each patch, you also need to assess and record the impact it will have when applied to an individual VM or a VM Asset category in your deployment. Possible impacts include forced downtime, dependencies on other patches/components, ordering requirements, etc.

As a final step, each patch will need to undergo an approval process, based on the justification you give for why it is important to apply it to the production servers. This information will also be captured in the Patch Inventory.
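
As an illustration of recording these outcomes, the hypothetical inventory CSV sketched in Stage 1 could be updated along these lines (the field names remain assumptions):

```powershell
# Update the authorization/approval fields of an existing inventory record (illustrative schema).
$inventoryPath = "C:\PatchManagement\PatchInventory.csv"
$records = Import-Csv -Path $inventoryPath

foreach ($record in $records) {
    if ($record.PatchId -eq "KB4056890" -and $record.TestStatus -eq "Passed") {
        $record.AuthorizationStatus = "Authorized"
        $record.ProductionImpact    = "Requires reboot"
        $record.Justification       = "Critical security update"
        $record.ApprovedBy          = "Change Advisory Board"
    }
}

$records | Export-Csv -Path $inventoryPath -NoTypeInformation
```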

Stage 6 – Patch Monitoring

Once you have the patch testing process set up, you will then need to set up a patch monitoring process. Here you will need to regularly probe all your Azure VMs to identify the following:

  • Missing Updates
  • Installed Updates
  • Failed Updates
  • Incomplete Updates

Once you are able to get the above information, you will need to compare it against the list of authorized/approved patches in the Patch Inventory. This way, you will be able to find out which patches need to be applied/reapplied, where, when, and in what order. Thereafter, you can schedule their manual/automated deployment accordingly.
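
One way to probe an individual Windows VM for missing and installed updates is the Windows Update Agent COM API, which needs no additional tooling. A minimal sketch, intended to be run on (or remoted into) each VM:

```powershell
# Query the local Windows Update Agent for missing and installed updates.
$session  = New-Object -ComObject "Microsoft.Update.Session"
$searcher = $session.CreateUpdateSearcher()

# Updates that are applicable to this VM but not yet installed.
$missing   = $searcher.Search("IsInstalled=0 and IsHidden=0").Updates
# Updates already installed on this VM.
$installed = $searcher.Search("IsInstalled=1").Updates

"Missing updates:   $($missing.Count)"
$missing | ForEach-Object { " - $($_.Title)" }
"Installed updates: $($installed.Count)"
```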

Additionally, you should consider performing the following activities on a schedule as a part of patch monitoring:

  • Perform regular audits of installed vs. authorized updates for your Azure VMs
  • Regularly track your patch inventory, and update the installation status/progress for all patches on the Azure VMs in your deployment.

Conclusion

The intent of this post was to give you a good understanding of how to plan for incorporating patch management into your Azure deployments.

I hope you enjoyed reading this post. I would really appreciate any feedback/thoughts/comments/questions you may have, which you can share through the comments below or by direct mail.

Update 4/11/2016:

A reader asked me an interesting question today after reading this post.

His question was:

“Why don’t we enable auto-update on all Cloud/Azure VMs and let them update themselves whenever needed? Windows already has this Auto Update mechanism, and the same can be scheduled similarly on Linux too. If any updates fail, we can always restore from the backups, can’t we?”

My response:

“We should never allow auto-updates on Windows or Linux servers in production, whether on-premises or in the Cloud. If we do, we expose our production deployment to a huge risk, as an update-related failure may occur at any time, rendering our production environment unusable. This practice of disallowing auto-updates is mandatorily followed by most Orgs across the world, for both their on-premises and cloud deployments. You could perhaps enable auto-updates in a dev/test environment, because the impact there is minimal.

Furthermore, all good Infrastructure deployments in the Cloud or on-premises will either never give VMs direct access to the Internet, or will only give restricted access secured behind proxies/bastions/WAFs. So enabling auto-updates over the Internet would not be an option anyway.”
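
For reference, the automatic-update behaviour on Windows Server is normally controlled through Group Policy; the equivalent registry values can also be set directly on an individual VM. A minimal sketch that configures a “notify only” behaviour (the values shown are the standard Windows Update policy settings):

```powershell
# Configure Windows Update to notify only (no automatic download/install) on this server.
# These are the standard Windows Update policy registry values; normally set via Group Policy.
$auKey = "HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"

if (-not (Test-Path $auKey)) {
    New-Item -Path $auKey -Force | Out-Null
}

# NoAutoUpdate = 0 keeps Automatic Updates enabled; AUOptions = 2 means "notify before download".
Set-ItemProperty -Path $auKey -Name "NoAutoUpdate" -Value 0 -Type DWord
Set-ItemProperty -Path $auKey -Name "AUOptions"    -Value 2 -Type DWord
```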
