Elastic APM – the key to effective DevOps


Software architecture has changed quite dramatically over recent years. A piece of modern software is now the sum of many parts: the pieces of the puzzle may be many small and sometimes sporadically distributed processes. The driving force behind this change has been more efficient development: faster and more granular development iterations, which mean faster fixes and faster improvements. However, it’s often forgotten that this speed depends on the speed and accuracy of application monitoring.

The sooner we know a problem… 

Most DevOps efforts focus on automating elements of the application development lifecycle, with the aim of speeding up the development or testing process. The most important element of the develop – deploy – monitor – report cycle is arguably the monitoring tool. Its ability to quickly and accurately detect and report a performance issue is vital. Even if we have an entirely frictionless testing and deployment process and an awesome team of developers, the fix will still be slow if there are delays in the feedback loop.

Close the DevOps feedback loop

Surprisingly often, DevOps processes end with a release to production. Responsibility for reporting problems is then passed to customer-facing teams, who handle customer issues and manually create bug tickets. APM has an essential role in providing the same feedback from the production environment as from development and test environments. This allows us to proactively spot performance issues before they reach customers. Arguably, APM is most valuable in production.

Elastic APM solves the microservice monitoring problem

A criticism of microservice architecture has long been that it is very difficult to monitor the application as a whole, as each service has its own logging mechanism. So, is it enough to simply aggregate logs into one UI for analysis? Well, that’s certainly an important step, but how can we sew together a myriad of events from many different processes into one coherent event chain? Elastic APM instrumentation helps fix this issue. Through ECS (Elastic Common Schema) and APM Distributed Tracing, event log formats are normalised and homogenised on arrival at the Elastic cluster. APM also pieces together all events relating to a single HTTP request as it threads its way through your code, out to other services, and back again as an HTTP response. Elastic APM does this out of the box for supported stacks, with no further configuration needed, so the root cause of slow response times can be analysed quickly.
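To make the idea of "sewing events together" concrete, here is a minimal sketch in plain Python. It is not Elastic's implementation: it only assumes that each event carries the tracing fields `trace.id`, `span.id` and `parent.id`, and it uses made-up services, names and timings (including a hypothetical `duration_ms` field) to show how flat events from different processes can be rebuilt into one chain.

```python
from collections import defaultdict

# Hypothetical flattened events, in arbitrary arrival order.
# trace.id / span.id / parent.id are the tracing fields assumed here;
# services, names and duration_ms values are invented for illustration.
events = [
    {"trace.id": "t1", "span.id": "c", "parent.id": "b",
     "service.name": "inventory", "name": "SELECT stock", "duration_ms": 180},
    {"trace.id": "t1", "span.id": "a", "parent.id": None,
     "service.name": "frontend", "name": "GET /checkout", "duration_ms": 250},
    {"trace.id": "t1", "span.id": "b", "parent.id": "a",
     "service.name": "inventory", "name": "GET /stock", "duration_ms": 200},
    {"trace.id": "t2", "span.id": "x", "parent.id": None,
     "service.name": "frontend", "name": "GET /health", "duration_ms": 3},
]


def build_trace(events, trace_id):
    """Return the events of one trace as indented lines, parents before children."""
    children = defaultdict(list)
    for event in events:
        if event["trace.id"] == trace_id:
            children[event["parent.id"]].append(event)

    lines = []

    def walk(parent_id, depth):
        # Visit every event whose parent is parent_id, then its own children.
        for event in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(
                f"{indent}{event['service.name']}: {event['name']} "
                f"({event['duration_ms']} ms)"
            )
            walk(event["span.id"], depth + 1)

    walk(None, 0)  # the root event has no parent
    return lines


trace_lines = build_trace(events, "t1")
print("\n".join(trace_lines))
```

Read top to bottom, the indented output shows which service called which, and where most of the time went.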

The bits between the bits

What if the problem is not your software? The logs look fine, but everything just feels sluggish. An inherent aspect of APM Distributed Tracing is Latency Tracking. This aims to pin down troublesome, transient slowness by analysing the latencies between microservices. In doing so you gain invaluable performance data that helps you analyse the root cause. It may be a pod network, subnet, host rack location, or any number of network appliances sitting in between, but it isn’t an application problem.

This level of accuracy in pinpointing the problem potentially saves you money in wasted development time trying to fix a non-existent application problem.
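One simple way to reason about "the bits between the bits" is to compare how long the caller waited for a request with how long the callee spent handling it. The sketch below is an illustration under that assumption, not a description of how Elastic computes latency; the field names, the measurements and the 100 ms threshold are all hypothetical.

```python
def network_overhead_ms(client_span_ms, server_transaction_ms):
    """Time spent between two services rather than inside the callee.

    client_span_ms: how long the caller waited for the outgoing request.
    server_transaction_ms: how long the callee spent handling it.
    """
    return max(client_span_ms - server_transaction_ms, 0.0)


# Hypothetical measurements for calls between services (milliseconds).
calls = [
    {"caller": "frontend", "callee": "inventory",
     "client_span_ms": 210.0, "server_transaction_ms": 200.0},
    {"caller": "frontend", "callee": "payments",
     "client_span_ms": 950.0, "server_transaction_ms": 120.0},
]

THRESHOLD_MS = 100.0  # assumed cut-off for "suspiciously slow in transit"

suspects = []
for call in calls:
    overhead = network_overhead_ms(
        call["client_span_ms"], call["server_transaction_ms"]
    )
    if overhead > THRESHOLD_MS:
        suspects.append((call["caller"], call["callee"], overhead))

print(suspects)
```

In this made-up data the payments service itself is fast, yet the call to it is slow, which points at something sitting between the two services rather than at the application code.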

Prediction is the new Detection with APM Machine Learning

Machine learning and APM are perhaps a match made in heaven. Unsupervised learning, perhaps the simplest form of ML, relies on two things in order to be effective. Firstly, the algorithms require vast amounts of data in order to build a meaningful model. Secondly, it demands that we are open-minded about what the model finds significant.

In most cases APM generates a lot of data, and the subtle patterns that take place prior to a performance bottleneck may be very hard to define. We may suspect, but we often have no idea where or what the problem is. This makes building a supervised learning model difficult.

By utilising Elastic’s built-in machine learning capabilities, we can use the ML models to potentially even spot an issue before it happens. Perhaps peak loads are not predictable or linked to specific events. ML could potentially spot the pattern of events or performance leading up to an application peak load. With some ninja scripting, more infrastructure resources could be assigned based on the model’s alerts.
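As a rough illustration of learning "normal" from the data itself, without labelled examples, the sketch below flags values that deviate strongly from the recent past using a rolling z-score. This is a deliberately simple stand-in and not the algorithm behind Elastic's machine learning features; the function name, window size, threshold and response times are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: no meaningful z-score
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, one value per minute.
response_times_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101]
anomalous_minutes = find_anomalies(response_times_ms)
print(anomalous_minutes)
```

A real model would also have to cope with seasonality and trends, which is exactly where a purpose-built ML feature earns its keep over a hand-rolled threshold like this one.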

So there’s a problem, what then?

If there is an internal application performance issue that is not helped simply by providing more infrastructure power, then it’s time to inform the development team immediately. With Kibana alerts we can easily send out intelligent alerts to any endpoint offering a REST API, or simply via email.

By analysing APM data we will know whether we are dealing with an infrastructure problem or a software problem, whether it is a backend issue or a database issue, and even which backend microservices are to blame. Getting the alert to the right people means a quicker fix.
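Getting the alert to the right people can be as simple as a lookup once the problem has been classified. The following sketch assumes a hypothetical alert payload with `category` and `service` fields and a made-up routing table; it does not reflect the format Kibana actually sends, only the routing idea.

```python
# Hypothetical routing table: which team hears about which kind of problem.
ROUTES = {
    "infrastructure": "ops-team",
    "database": "dba-team",
    "backend": "backend-team",
}


def route_alert(alert):
    """Pick a recipient for an alert based on where the slowness was found."""
    category = alert.get("category")
    team = ROUTES.get(category, "on-call")  # unknown categories go to on-call
    service = alert.get("service", "unknown service")
    return {
        "to": team,
        "message": f"{category or 'unclassified'} issue in {service}",
    }


# Example payload; the field names are assumptions, not a Kibana format.
alert = {"category": "database", "service": "inventory"}
notification = route_alert(alert)
print(notification)
```

In practice a small service like this could sit behind the REST endpoint that receives the alerts and forward each one to the team's own channel.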

Written by: Alex Hutchinson, Certified Elastic Engineer and Software Architect

