The changing landscape of performance monitoring
July 15, 2019
by Nancy Gohring, Mike Fratto
While application performance monitoring (APM) and infrastructure monitoring in cloud environments are fairly well understood, network monitoring in the cloud is still evolving. In containerized deployments, we see monitoring vendors across sectors vying for a foothold as organizations gain experience and come to understand the depth of visibility they require.
The 451 Take
Applications deployed in dynamic environments like containers, microservices, serverless and cloud require new approaches to monitoring. We're observing some overlap in the value delivered to users by vendors from historically distinct monitoring categories. In particular, we're seeing vendors employ techniques drawn from network performance monitoring (NPM), infrastructure monitoring and application performance monitoring that, while collecting and analyzing different data sets, deliver insight that solves some of the same problems for users. We anticipate continued efforts by both NPM and infrastructure monitoring vendors to deliver application insight in ways that might steal market share from traditional APM vendors by highlighting application interconnection and location-independent data. However, some of the specific capabilities for cloud and microservices environments are truly differentiating, and specific product features will matter more to new customers adding net-new network and application data collection to their environments. The new capabilities should strengthen existing customers' commitment to incumbent vendors as enterprises migrate applications – and monitoring – to cloud-native architectures. One of the most significant roadblocks, particularly for NPM vendors, is convincing application architects and developers who aren't their typical customers to consider using their products.
In order to stay competitive and retain customers, businesses want to move faster – quickly fixing problems, enhancing capabilities and adding features. IT recognizes this demand: in our 2018 Digital Pulse Vendor Evaluations survey, 24% of IT professionals identified 'responding faster to business needs in the next 12 months' as the most important goal for their organizations.
To support this demand, the adoption of continuous integration and deployment processes, as well as technologies like microservices, containers and cloud, continues unabated (see Figure 1).
Figure 1: Q: Which cloud-native technologies are important for your organization? (N=494)
451 Research DevOps Q1 2019
Additionally, we continue to see businesses shifting to the cloud, with company-owned datacenters slipping under 50% as the primary IT environment by 2020 among the decision-makers we surveyed (Figure 2). While new technologies like cloud, containers, microservices and serverless help IT organizations move faster, they also bring new monitoring requirements on top of existing application and network performance monitoring. Placing a physical or virtual network packet broker in line with the traffic is insufficient to monitor applications that can scale up, down, in and out at a moment's notice. Containerized environments often host multiple application components within a single server or interconnect multiple servers using encrypted overlays that make network traffic opaque to data collection. To address the demands of modern application environments, APM and NPM vendors are approaching data collection in new ways.
Figure 2: Primary IT environment locations
451 Research Digital Pulse, Budgets & Outlooks, 2019
Containers and microservices
The adoption of microservices and containers presents a new set of challenges that traditional monitoring tools don't address. A containerized application, for instance, may have many more containers than a traditional application might have VMs, and those containers could be dynamic, with new containers spinning up and down regularly. Containers supporting an individual application may run in public clouds, private clouds and traditional datacenters. Operations teams require insight into the relationship and communication between containers, as well as an understanding of the performance of the infrastructure supporting those containers. They also need to understand historical performance of a container that may no longer be running. Add Kubernetes and service meshes into the mix, and new challenges emerge in terms of dynamism and insight into which containers are supporting which applications.
For example, when new containers are added, the monitoring must begin immediately. That means identifying when a new workload is launched, identifying which application the workload belongs to, determining what data collection will take place, discovering the destinations to send the collected data and then preparing the destination to receive the collected data. When the container is stopped, a similar process must occur, and while the container is active, performance metrics must be collected and distributed. Complicating matters further, deployment patterns like service meshes are emerging, where applications in containers export functions outside of their pods, but inside the cluster. As such, there's a great deal of networking going on that is generally out of sight to the rest of the network.
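The lifecycle above – detect a new workload, identify its application, wire up collection, and tear it down on exit – can be sketched as a small event-driven registry. This is a hypothetical illustration, not any vendor's API; the names (`MonitorRegistry`, `on_event`) and the destination-naming scheme are invented for the example.

```python
# Hypothetical sketch of the monitoring lifecycle described above: a
# registry that reacts to container start/stop events by wiring up and
# tearing down data collection. All names are illustrative.

from dataclasses import dataclass, field

@dataclass
class MonitorRegistry:
    # maps container id -> (app label, collection destination)
    active: dict = field(default_factory=dict)

    def on_event(self, event: str, container_id: str, labels: dict):
        if event == "start":
            app = labels.get("app", "unknown")        # which application is this?
            dest = f"collector.{app}.internal:4317"   # discover the destination
            self.active[container_id] = (app, dest)   # begin collection
        elif event == "stop":
            # stop shipping metrics; historical data lives elsewhere
            self.active.pop(container_id, None)

reg = MonitorRegistry()
reg.on_event("start", "c1", {"app": "checkout"})
reg.on_event("start", "c2", {"app": "checkout"})
reg.on_event("stop", "c1", {})
```

In a real deployment, the events would come from the orchestrator (for example, a Kubernetes watch on pod state) rather than direct calls.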
APM, infrastructure monitoring and network performance monitoring take different approaches to delivering insight into microservices and container environments.
APM tools employ a host- or container-based agent to collect events and metrics, and increasingly to collect traces. APM tools commonly relate application performance to infrastructure performance. For instance, they correlate an application error with the performance of the container, host or (in a Kubernetes environment) pod that is running the application. Such correlation is important for troubleshooting.
Distributed tracing pieces together a transaction that may travel across microservices, containers, clouds and datacenters, allowing users to pinpoint slow spans and then troubleshoot. It is increasingly seen as a crucial approach to gaining visibility in containerized environments. However, there are challenges to deploying and benefiting from distributed tracing, particularly around data collection. Collecting and shipping every trace is unfeasible due to the volume of data required to do so. Saving a sample of traces comes with its own tradeoffs. The result is that, while many organizations recognize that distributed tracing is one of the most valuable approaches to monitoring in containerized environments, they're interested in exploring alternatives that might be simpler and less costly, and might suffice for some applications, even if they don't offer the same granular insight.
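The sampling tradeoff mentioned above can be made concrete with a minimal head-based probabilistic sampler, a common pattern (not specific to any vendor here): hash the trace ID into a bucket so every span of a given trace gets the same keep/drop verdict, and keep only a fixed fraction.

```python
# Minimal sketch of head-based probabilistic trace sampling: keep a fixed
# fraction of traces rather than shipping every one. Hashing the trace ID
# makes the decision deterministic, so all spans of a trace agree.

import hashlib

def keep_trace(trace_id: str, sample_rate: float = 0.1) -> bool:
    # hash the trace ID into [0, 1) and compare against the sample rate
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")   # roughly 1,000 at a 10% rate
```

The cost savings are real, but so is the loss: a rare failing transaction has only a 10% chance of being captured, which is exactly why some organizations look at tail-based sampling or coarser alternatives.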
Infrastructure monitoring tools may also use an agent to collect performance information about infrastructure in a containerized environment, surfacing metrics and graphs about CPU, memory and disk usage, for instance. Infrastructure monitoring may be able to group containers by service so that an operations professional can use the tool to discover a performance problem with a particular application, and then dig into the infrastructure behind that application – including to the container level – in order to identify the cause.
Some infrastructure monitoring tools offer some insight into communications between containers. Sysdig, for instance, is able to understand TCP/IP requests between containers, hosts and nodes. That enables Sysdig to show users when there is a slowdown in communications between elements that might indicate a network communications problem. Sysdig takes a different approach from many other vendors in that its instrumentation sits at the kernel level, enabling deep insight but opening the door to risks that some organizations prefer to avoid.
As Kubernetes grows in popularity as a container orchestration tool, Kubernetes monitoring is increasingly on offer from infrastructure monitoring vendors. They monitor Kubernetes itself, in addition to surfacing insight about the containers that Kubernetes manages. The open source Prometheus project has emerged as a key tool for collecting metrics from Kubernetes deployments.
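As an illustration of the Prometheus pattern, metrics collection from Kubernetes is typically driven by service discovery rather than static target lists. The fragment below shows a common (community-conventional, not vendor-specific) scrape configuration that discovers pods via the Kubernetes API and keeps only those that opt in through an annotation:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod            # discover every pod via the Kubernetes API
    relabel_configs:
      # keep only pods that opt in via the conventional annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Because targets are discovered continuously, new containers are picked up for scraping as soon as they start – the same lifecycle problem described earlier, solved at the metrics layer.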
Network monitoring usually involves packet capture, and vendors are taking two approaches to instrumenting data collection for virtualized and containerized workloads. The first approach is an agent installed on each VM and container image that is activated when the workload starts, ensuring that IT will always have monitoring in place. The agent sends packet data to a collection destination, which then distributes the packet data to consuming destinations. This is the model that vendors such as ExtraHop, Gigamon and Nubeva use. The benefit of this model is that processing – like filtering, slicing, de-duplication and other packet manipulations – is performed on a separate centralized server or service, reducing resource overhead on the workload container or VM. The downside is that a centralized service becomes a potential performance chokepoint and a single point of failure in data collection.
Collection agents can also perform all the processing on the workload instance, and then forward the packets directly to the destination analytics service. This model, used by Ixia's CloudLens and Nubeva, avoids dependence on a centralized processing engine and the performance chokepoint that comes with it, but requires a negotiation protocol to distribute topology changes and determine which peer will receive the packets.
There are a few approaches to packet collection specifically for container environments: a plug-in to the container networking environment, an agent on the container workload, and sidecars. The container networking interface is a framework for network plug-ins in the container environment and is managed by the container management system. It is very similar to the virtual network tap used in hypervisor environments: traffic passing through the container network is captured on the virtual network, processed and forwarded to its next destination. Alternatively, packets can be collected by an agent installed in the container image and launched with the workload, or by a sidecar, a container launched alongside the workload. Some container services, like the Envoy proxy, also have built-in packet-capture capabilities that will compete with or replace other packet-capture approaches.
Sidecars are likely to be the preferred deployment method: a sidecar runs alongside the workload container and requires no support from the underlying infrastructure, keeping the workload application free of unnecessary software and simplifying modifications. Furthermore, sidecars can be easily removed or modified without impacting the workload container, enabling management independent of the workload.
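A sidecar deployment of this kind can be expressed directly in a Kubernetes pod specification. The snippet below is a hypothetical sketch (the image names, pod name and endpoint are illustrative); because containers in a pod share a network namespace, the capture sidecar can observe the workload's traffic without modifying the workload image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout
spec:
  containers:
    - name: app
      image: example/checkout:1.0        # the workload itself, unmodified
    - name: packet-capture               # sidecar: shares the pod's network
      image: example/capture-agent:1.0   # hypothetical capture-agent image
      env:
        - name: COLLECTOR_ENDPOINT       # where captured data is shipped
          value: collector.monitoring.svc:4317
```

Removing or upgrading the capture agent then means editing only the sidecar entry, leaving the workload container untouched.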
The challenge in public cloud services is gaining access to the data streams to perform collection. While APM and infrastructure monitoring approaches in the cloud are fairly well established, network monitoring continues to be problematic for enterprises.
Infrastructure monitoring and APM are well established in the cloud. Cloud providers offer users access to metrics about infrastructure performance, although some customers require more granular data than the providers make available. Deploying an agent from an infrastructure monitoring tool allows users to collect additional data about infrastructure performance and marry it with data from noncloud environments with which cloud workloads may interact. APM operates similarly in the cloud and in noncloud environments, with users instrumenting their code to collect performance metrics.
In the cloud, businesses have embraced APM and infrastructure monitoring, but they are still grasping for the best approach to network monitoring. We anticipate further experimentation from vendors that seek to protect their installed bases while developing new approaches that respond to demands related to cloud visibility.
Unlike on-premises environments, where virtual ports can be configured to collect all packets like a virtual tap, cloud networking doesn't grant that type of access. It's a gap that cloud services are just starting to address; Microsoft's Azure VTAP, for example, is currently in preview, and we expect competitors to follow suit. Capabilities like AWS's VPC Flow Logs and Azure's VTAP are rather limited in their use and capabilities, and incur additional charges when used.
The preferred method for network data collection is an agent approach, with one agent per VM. Unlike containers, VMs tend to be long-lived, so installing a lightweight agent to collect data circumvents limitations in the cloud service and ensures monitoring will always take place. As in the container environment, some products require a central collection point while others send data directly to the destination analytics software. One benefit of a central collection VM in Microsoft Azure is that it can serve as the collector for VTAP, a very simple service that doesn't offer filtering or delivery to multiple destinations. The centralized collectors are essentially packet brokers: they can accept captured traffic from the VTAP, process and filter it, and then send it to one or more destinations. We suspect that if VTAP becomes popular, users will demand more features from the service, reducing the need for an intervening packet broker.
Current best practices are to collect and process network performance and security data in the cloud service to reduce the costs of shipping data to another location. Even if the data is reduced to metadata, the costs will be far lower than sending raw packets to a remote location.
Enterprise IT will continue to wrestle with new challenges in collecting, processing and analyzing performance data in container, microservices and cloud environments: how to collect performance data in different environments; which tools to use for data collection, and whether there is value in using the same tool everywhere; where processing and storage will take place, depending on where the workloads reside; and how IT will ensure that data collection continues as workloads change over time. IT will have to treat monitoring as a practice that covers all the application deployment environments used by the business and is adaptable to new ones, and that means working closely with application architects and developers to make network data collection and monitoring a foundational element of any new or migrated application.
Adding support for cloud and container environments is an incremental addition for APM and NPM vendors. While there are several ways that data collection is implemented, the result is generally the same for all vendors. Some methods may suit an enterprise's particular needs better than others, but that will matter more for new sales and will be weak motivation to replace or add monitoring vendors to the customer mix. Nubeva, a startup in network data collection, has a unique cloud-native approach, but it lacks physical and on-premises collection capabilities. Its recently announced TLS decryption, including support for TLS 1.3, is a significant differentiator for the company if its efficacy and performance claims are accurate, and makes it a good acquisition target for NPM, APM and security monitoring vendors looking for their own cloud-native packet capture.