How to keep a close eye on your Atlassian deployments
No one responsible for managing enterprise IT wants to be told that an application has gone down – least of all second-hand from end users. Monitoring is a crucial, but all too often overlooked, tool that can spare operations teams the embarrassment of hearing about an outage from users and, more importantly, can reduce how often outages and errors occur in the first place.
In this post, we’ll be looking at how monitoring can be applied to your on-premises Atlassian deployments to prevent incidents and limit downtime, as well as how it can be used to optimise performance as your applications grow in size and capabilities.
Before we get into what you should be monitoring, let’s talk monitoring tools. Generally, most monitoring tools in their simplest form can be broken down into three components: a database that can store time series data known as metrics, a front end that can visualise these metrics and a method of ingesting data from various sources.
Many tools will have an additional vital component – a means of alerting you when a metric exceeds a defined threshold (more on how to set thresholds for alerting later). A typical setup would include a central application with a database for storing metrics, a web UI for visualisations and the capability to send out alerts by email, with a number of data forwarders or agents sending data from applications or infrastructure components back to a central hub.
There are a great many monitoring tools available on the market today, and the choice of which tool to use will largely come down to what fits best with your ops team – there is no magic bullet. There are a number of popular commercial products such as AppDynamics, New Relic, Dynatrace and Datadog, all of which have excellent features but can come with a hefty price tag.
If you’re hosting your Atlassian applications in a public cloud like Amazon Web Services or Microsoft Azure, you may instead be interested in a monitoring solution integrated into the cloud provider, such as AWS CloudWatch or Azure Monitor.
Alternatively, if you’d prefer more of a ‘DIY’ approach, there are plenty of open source, highly customisable tools available. We’ve had success using Prometheus with Grafana internally, but other options could include Nagios or the Elastic Stack (ELK). A full comparison of these tools deserves its own blog post, so we won’t attempt one here.
A good place to start (and what I’d hope is common knowledge to any operations professional) is infrastructure monitoring.
Monitoring infrastructure for Atlassian applications is very similar to monitoring most other infrastructure. You’ll want to set up monitoring for common metrics such as CPU, disk space and memory utilisation.
Monitoring CPU can be a great performance indicator for your applications. Often when applications like Jira Software run into problems, you will see CPU utilisation shoot up while the web application grinds to a halt. Alerting on CPU can be effective in giving your ops team a head-start on resolving the issue.
Disk space is another critical metric to stay on top of. Running out of disk space is a sure-fire way of crippling your Jira or Confluence instances. Atlassian applications use the file system for storage of any attachments you have. In larger instances, the size of attachments can grow rapidly and easily chew through what you may have thought was a sufficient disk space allocation. Monitoring disk usage will allow your ops team to provision more disk space or clean up unused files before the application runs out, preventing a costly and embarrassing incident. One further consideration is if you’ve set up your application home and install directories on separate disks – make sure you’re monitoring both!
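If you’d like a feel for how simple a disk space check can be, here’s a minimal sketch using only the Python standard library. The directory paths and the 85% threshold are purely illustrative assumptions – substitute your own home and install directories and whatever threshold suits your baseline.

```python
import shutil

# Hypothetical paths – replace with your real home and install directories.
WATCHED_DIRS = [
    "/var/atlassian/application-data/jira",  # application home
    "/opt/atlassian/jira",                   # install directory
]
THRESHOLD_PCT = 85  # illustrative: alert once a disk is more than 85% full


def disk_usage_pct(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disks(paths, threshold=THRESHOLD_PCT):
    """Return the paths whose filesystem usage exceeds the threshold."""
    return [p for p in paths if disk_usage_pct(p) > threshold]
```

A real deployment would run something like this on a schedule from your monitoring agent and route the result into its alerting pipeline, but the core measurement is no more complicated than the above.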
Memory utilisation is a slightly more complex metric to look at. All of the core Atlassian applications are written in Java and therefore run in a Java Virtual Machine (JVM). Most of the memory allocation of the application will be within the JVM (we discuss how to monitor this below), which has defined limits on its memory usage. You’re less likely to detect incidents by monitoring OS memory, as an out of memory error will most often occur within the JVM’s allocated space. Monitoring OS memory utilisation is still very useful, however, as it can help you spot hard-to-diagnose issues such as a native memory leak.
Uptime. It’s possibly the single most talked-about metric in the age of SaaS and always-online services, so it’s certainly something you’ll want to be monitoring for your Atlassian applications. Availability monitoring is simply a test of ‘is my site online?’ (which you’d hope is answered with a yes) and, if so, how long it takes to receive a response. Receiving alerts to notify your Atlassian admins the moment your site goes offline is important to getting the site back up ASAP. On top of that, it’s very useful for spotting outages that occur outside of working hours, such as over the weekend, when users might not notice them themselves. No one wants to come into work on Monday to find their precious Jira site offline. You can quickly set up your own availability monitoring using a tool like Uptime Robot or Pingdom.
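Under the hood, a hosted checker like the ones above is doing little more than a timed HTTP request. A minimal sketch, using only the Python standard library (the URL you point it at is up to you):

```python
import time
from urllib.request import urlopen
from urllib.error import URLError


def check_availability(url: str, timeout: float = 10.0):
    """Return (is_up, response_seconds) for a simple HTTP GET of `url`."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (URLError, OSError):
        return False, time.monotonic() - start
    return ok, time.monotonic() - start
```

Run on a schedule from a host *outside* your own network, this gives you both of the numbers availability monitoring cares about: whether the site answered, and how long it took to do so.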
All the application and infrastructure monitoring in the world won’t help you if your corporate network goes down. It might not even need to be the whole network. A firewall change gone wrong can easily break the connection between your Confluence app server and its database or between your Jira application and an external integration. Monitoring the connections between your applications, their databases and various integrations will help you to spot these issues as soon as they arise.
With Data Center deployments, you’ll also want to consider monitoring the connection between each node in your cluster and from each node to the shared home directory. Jira Data Center, for example, will run into all sorts of issues if the latency between each node grows too high.
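A basic connectivity-and-latency probe between two hosts can be as simple as timing a TCP connect. The sketch below is a generic check, not anything Atlassian-specific; the hostname, port and 50 ms threshold in the commented example are illustrative assumptions only.

```python
import socket
import time


def tcp_latency(host: str, port: int, timeout: float = 5.0):
    """Time a TCP connect to host:port; return seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


# Hypothetical usage: check the link from this node to the database server.
# latency = tcp_latency("db.internal.example.com", 5432)
# if latency is None or latency > 0.05:  # 50 ms is an illustrative threshold
#     print("Alert: database link is down or slow")
```

The same probe, pointed at each cluster node and at the host serving the shared home directory, will surface a broken firewall rule or creeping inter-node latency long before users do.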
As we mentioned earlier, the Atlassian applications are written in Java and therefore run in a JVM. Tuning a JVM is something of an art and can only be done well with plenty of monitoring data to guide the changes you make. There are a number of options as to how to monitor the JVM. One of the most common is to expose data through JMX endpoints, then collect this data using an agent from the monitoring tool you’ve chosen. Alternatively, there are tools available that will forward data directly from your Atlassian application to a monitoring tool without needing to worry about JMX.
The key JVM metrics to be interested in are related to memory. You’ll want to be sure you’re collecting data on heap utilisation and garbage collection. As your applications grow in scale, it’s a good idea to periodically review the trend data your JVM monitoring has collected to determine if you need to retune your JVM configuration. Not doing so can result in degraded performance, inevitably dissatisfied users and, worst of all, angry emails in your (or your boss’s) inbox!
There’s one last critical part of monitoring we haven’t yet covered. At this point you may well be sitting in your operations command centre, happily looking at all of your metrics firmly in the green on your wallboards, when you spot a ticket raised in your IT service desk. Users are reporting that it’s taking them upwards of 10 seconds to load an issue in Jira and even longer to create one! How can this be when all of your monitoring looks good?
Monitoring from the user’s perspective is critical for spotting issues such as the one just described. To do this, many monitoring tools allow you to configure sample transactions. The tool will periodically execute a given task, for example creating an issue, and record the duration of the action. This can help you to spot more complex problems such as failures in Jira’s index.
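The mechanics of a sample transaction are just ‘run the task, time it, compare against a threshold’. A minimal sketch – the commented Jira REST call and the 5-second threshold are hypothetical examples, not prescriptions:

```python
import time


def timed_transaction(fn, *args, **kwargs):
    """Run a sample transaction and return (result, duration_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical sample transaction: load an issue through Jira's REST API.
# def load_issue():
#     from urllib.request import urlopen
#     return urlopen("https://jira.example.com/rest/api/2/issue/TEST-1").status
#
# _, duration = timed_transaction(load_issue)
# if duration > 5.0:  # illustrative threshold
#     print("Alert: loading an issue took", duration, "seconds")
```

Commercial tools wrap this loop in scheduling, browser automation and alert routing, but the measurement they record is exactly this duration, taken from where the user sits rather than from inside the server.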
The monitoring we’ve described so far can help you detect incidents the minute they occur. To do this, however, you will need to know what a ‘normal’ day looks like – a process called baselining. You might be tempted to set the alerting threshold on your CPU metric to 90% and call it a day. However, if your application never usually reaches more than 50% CPU utilisation, then this threshold won’t be all that useful. The best approach is to treat alert thresholds as dynamic values that are updated as applications scale up.
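One common way to turn a baseline into a threshold is to take the mean of your ‘normal day’ samples and add a few standard deviations – values outside that band are, statistically, unusual for *your* instance. A minimal sketch (three standard deviations is a conventional starting point, not a rule):

```python
import statistics


def baseline_threshold(samples, sigmas: float = 3.0) -> float:
    """Derive an alert threshold from baseline samples of a metric:
    the mean plus `sigmas` standard deviations."""
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)


# Example: a week of hourly CPU utilisation readings (illustrative values)
cpu_baseline = [40.0, 45.0, 50.0, 48.0, 42.0]
threshold = baseline_threshold(cpu_baseline)  # well above the observed peak
```

Recomputing this periodically over a rolling window is what makes the threshold ‘dynamic’: as the application grows and its normal load shifts, the alert level shifts with it instead of going stale at a hand-picked 90%.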
A quick note on alerting
We’ve talked a lot about alerting in this blog article and it wouldn’t be right not to mention Atlassian’s recently acquired incident management tool Opsgenie. We use Opsgenie internally to ensure critical alerts from our production monitoring tools are routed to the right person at the right time.
Go forth and monitor your Atlassian Applications!
Hopefully this post has armed you with the knowledge you need to successfully monitor your Atlassian applications.
Alternatively, we here at AC can take care of it for you. Get in touch today to hear how the Atlassian experts in our Support team can help out with monitoring your Atlassian applications and improving their performance.