Saturday, March 7, 2020

Monitoring your application metrics and setting up alarms


In this post, let’s explore how to enable monitoring and create alerts for cloud-based services and applications. Cloud monitoring is a broad category that covers web and cloud applications, infrastructure, networks, platforms, and microservices. This post focuses on service/API-level and JVM metrics. I am using the following components to demonstrate this:
  • Dropwizard metrics for pushing your application metrics
  • Oracle Telemetry server (a time-series database) to collect the application and JVM metrics and to create alarms and notifications
  • Grafana dashboards for visualizing and analyzing the metrics

In case the terms time-series database, metrics, Grafana, Prometheus, and telemetry are Greek to you and you have still managed to read this far, it's safe to assume that you are either my friend/stalker or really serious about enhancing your knowledge of monitoring and the related tools and systems. So, let’s first try to understand the major components here and then go into the details.

The vast number of metrics coming from different cloud services can be overwhelming (API gateway metrics, database metrics, compute instance metrics, application-level metrics, JVM-level metrics, etc.), so you need to determine which metrics are the most important to track and find the tools that will report them. Very simple examples are the number of logins per minute, the number of new accounts created, JVM memory usage, and so on (you get the idea). An important point to note is that these metrics change over time, so a traditional database is a poor fit for them. We also may or may not care about past data and may be interested only in the data for the current time window. I will not go into the details of sliding windows, timers, histograms, and other low-level topics here, and will stick to the main subject.



This picture (pardon my handwriting and dirty whiteboard) shows the overall architecture of cloud monitoring.

Time-series DB/Telemetry server: At the center of this is a time-series database that collects the metrics emitted by the cloud services. The services could be deployed on a cloud hosting platform or could be Spring-based applications. Each service records its metrics in a metrics registry and reports them through a reporting mechanism.

I have listed the popular time-series databases.

Metrics registry: This is the Dropwizard Metrics library, which has become the de facto standard for metrics in Java-based applications. The library makes it very easy to measure different metrics within your application. It supports five metric types: Gauges, Counters, Histograms, Meters, and Timers. You can also use numerous modules to instrument common libraries like Jetty, Logback, Log4j, Apache HttpClient, Ehcache, JDBI, and Jersey.
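To make the idea of a registry more concrete, here is a minimal, dependency-free sketch of what a counter, a gauge, and a timer record. It is not the Dropwizard API: the class MiniRegistrySketch, its methods, and the metric names are all made up for illustration, and a real service would use the library's own registry and reporters instead.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical, hand-rolled registry; NOT the Dropwizard Metrics API.
final class MiniRegistrySketch {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final Map<String, Supplier<Long>> gauges = new ConcurrentHashMap<>();

    // Counter: a value that is incremented each time an event happens.
    LongAdder counter(String name) {
        return counters.computeIfAbsent(name, k -> new LongAdder());
    }

    // Gauge: a value that is read on demand, for example JVM memory in use.
    void gauge(String name, Supplier<Long> reader) {
        gauges.put(name, reader);
    }

    // Timer (simplified): records how many calls ran and their total duration.
    <T> T time(String name, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            counter(name + ".count").increment();
            counter(name + ".totalNanos").add(System.nanoTime() - start);
        }
    }

    // "Reporting": take a point-in-time snapshot that could be pushed to a time-series database.
    Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        counters.forEach((name, value) -> out.put(name, value.sum()));
        gauges.forEach((name, reader) -> out.put(name, reader.get()));
        return out;
    }

    public static void main(String[] args) {
        MiniRegistrySketch registry = new MiniRegistrySketch();
        registry.gauge("jvm.memory.used", () -> {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        });
        registry.counter("logins").increment();
        registry.time("createUser", () -> "created");
        System.out.println(registry.snapshot());
    }
}
```

A reporter would typically call something like snapshot() on a fixed schedule and push the result to the telemetry server; the scheduling and the transport are left out of this sketch.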

Metrics:

Metrics viewer and analyzer -


Monitoring IDCS Metrics


IDCS metrics are time-series data that can be viewed and analyzed on the Grafana dashboard https://grafana.oci.oraclecorp.com and the OCI DevOps portal https://devops.oci.oraclecorp.com
Each service metric is associated with the following tags:
  • project
  • fleet
  • hostname
  • availabilityDomain
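As a rough illustration of how these tags are used, the sketch below models a single metric data point that carries the four tags and filters a list of points by one tag. The class TaggedMetricSketch, the metric name, and all tag values are hypothetical; the actual storage format of the telemetry server is not described in this post.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a tagged time-series data point.
final class TaggedMetricSketch {
    static final class DataPoint {
        final String metric;
        final long epochSeconds;
        final double value;
        final Map<String, String> tags;

        DataPoint(String metric, long epochSeconds, double value, Map<String, String> tags) {
            this.metric = metric;
            this.epochSeconds = epochSeconds;
            this.value = value;
            this.tags = tags;
        }
    }

    // Builds the tag set listed above; the values passed in are up to the caller.
    static Map<String, String> tags(String project, String fleet, String hostname, String availabilityDomain) {
        Map<String, String> t = new HashMap<>();
        t.put("project", project);
        t.put("fleet", fleet);
        t.put("hostname", hostname);
        t.put("availabilityDomain", availabilityDomain);
        return t;
    }

    // Keeps only the points whose tag `key` equals `value`.
    static List<DataPoint> withTag(List<DataPoint> points, String key, String value) {
        List<DataPoint> out = new ArrayList<>();
        for (DataPoint p : points) {
            if (value.equals(p.tags.get(key))) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<DataPoint> points = new ArrayList<>();
        points.add(new DataPoint("createUser.latencyMs", 1583539200L, 120.0,
                tags("demo-project", "demo-fleet", "host-1", "ad-1")));
        points.add(new DataPoint("createUser.latencyMs", 1583539200L, 340.0,
                tags("demo-project", "demo-fleet", "host-2", "ad-2")));
        System.out.println(withTag(points, "availabilityDomain", "ad-1").size());
    }
}
```

Filtering or grouping by tags like this is what lets a dashboard show one line per host or per availability domain for the same metric.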

1) Viewing metrics on the Grafana dashboard

The Grafana dashboard at https://grafana.oci.oraclecorp.com can be used to view IDCS metrics.
The following sample dashboards show metrics for the cloud service in Grafana.
Sample dashboard 1:
Sample dashboard 2:

2) Creating an alarm and a PagerDuty alert for your metrics using the Oracle Cloud portal

Log in to the OCI portal https://devops.oci.oraclecorp.com/

Search for your metrics



View Metrics

Here are sample steps to create an alarm on the "create users" metric. If the metric value exceeds the threshold (for example, the response time is higher than expected), the alarm triggers a Jira ticket and creates a PagerDuty incident.

Turn a metric into a notification in three easy steps
  1. Create an alarm that applies to your metric. An alarm converts your metric, which is a series of numbers, into a series of booleans that say whether the value of the metric is acceptable (OK) or not acceptable (ALARM).
  2. Create a rule that applies to your alarm. A rule applies to one or more alarms and defines the condition under which a notification should be sent. The simplest rules apply to only one alarm, but rules can also reference multiple alarms, and we encourage you to set those up as well.
  3. Create a notification that refers to your rule and specifies who to notify and how. When the rule is triggered, delivery of this notification is attempted.
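The three steps above can be sketched in a few lines of plain Java. This is only an illustration of the concept, not how the Oracle portal implements it: the class AlarmPipelineSketch, the "N consecutive ALARM points" rule condition, the threshold, and the recipient and channel names are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the alarm -> rule -> notification chain.
final class AlarmPipelineSketch {
    enum State { OK, ALARM }

    // Step 1: an alarm turns a series of numbers into a series of OK/ALARM states.
    static List<State> evaluateAlarm(List<Double> values, double threshold) {
        List<State> states = new ArrayList<>();
        for (double v : values) {
            states.add(v > threshold ? State.ALARM : State.OK);
        }
        return states;
    }

    // Step 2: a rule defines when to notify. Assumed condition: the alarm
    // stayed in ALARM for at least `consecutive` data points in a row.
    static boolean ruleTriggered(List<State> states, int consecutive) {
        int run = 0;
        for (State s : states) {
            run = (s == State.ALARM) ? run + 1 : 0;
            if (run >= consecutive) {
                return true;
            }
        }
        return false;
    }

    // Step 3: a notification says who is notified and how.
    static String notification(String ruleName, String recipient, String channel) {
        return "Rule '" + ruleName + "' triggered: notify " + recipient + " via " + channel;
    }

    public static void main(String[] args) {
        // Made-up response times in milliseconds for a "create users" metric.
        List<Double> responseTimesMs = Arrays.asList(120.0, 480.0, 510.0, 530.0);
        List<State> states = evaluateAlarm(responseTimesMs, 500.0);
        if (ruleTriggered(states, 2)) {
            System.out.println(notification("create-users-latency", "on-call team", "PagerDuty"));
        }
    }
}
```

Keeping the alarm (what is unacceptable) separate from the rule (when to notify) is what makes it possible for one rule to reference several alarms.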

1) Create Alarm 

Search for the metrics you want to create the Alarm for:

2) Create Rule

Specify the name of the alarm created above.
   

3) Create Notification 

Once the notification is created, its configuration is displayed.

PagerDuty Integration

A PagerDuty incident is automatically:
  • Created when a:
    • New Jira ticket is created with severity 1/2
    • Jira ticket severity changes from a severity 3/4/5 to 1/2
    • Resolved/closed Jira ticket with severity 1/2 is reopened
    • Jira ticket component/item changes, and the new component/item has a different PagerDuty service key (create a new incident in the new service key, and then resolve the old incident in the previous service key)
    • Jira ticket is moved to a different Jira project (this also implies that the component/item has changed)
    • Jira ticket status changes from Pending/In Progress to Open
  • Acknowledged when a Jira ticket status changes from Open to Pending/In Progress
  • Resolved when a Jira ticket:
    • Is resolved
    • Severity changes from severity 1/2 to 3/4/5
    • Component/item changes, and the new component/item has a different PagerDuty service key (create a new incident in the new service key, and then resolve the old incident in the previous service key)
    • Moved to a different Jira project (this also implies that the component/item changed)
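The rules above amount to a small decision function that looks at a ticket before and after a change. Here is a simplified sketch of that idea. The class PagerDutySyncSketch, its types, and the order in which the checks run are assumptions; project moves are not modeled separately (they are treated as a service key change), and the real integration may resolve overlapping rules differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: maps a Jira ticket change to PagerDuty incident actions.
final class PagerDutySyncSketch {
    enum Status { OPEN, PENDING, IN_PROGRESS, RESOLVED }
    enum Action { CREATE, ACKNOWLEDGE, RESOLVE }

    static final class Ticket {
        final int severity;
        final Status status;
        final String serviceKey;

        Ticket(int severity, Status status, String serviceKey) {
            this.severity = severity;
            this.status = status;
            this.serviceKey = serviceKey;
        }
    }

    // Only severity 1/2 tickets page someone.
    static boolean pages(Ticket t) {
        return t.severity == 1 || t.severity == 2;
    }

    static boolean inWork(Ticket t) {
        return t.status == Status.PENDING || t.status == Status.IN_PROGRESS;
    }

    // `before` is null for a newly created ticket.
    static List<Action> actions(Ticket before, Ticket after) {
        List<Action> out = new ArrayList<>();
        if (before == null) {
            if (pages(after)) out.add(Action.CREATE);            // new severity 1/2 ticket
        } else if (pages(before) && !pages(after)) {
            out.add(Action.RESOLVE);                             // severity 1/2 -> 3/4/5
        } else if (!pages(before) && pages(after)) {
            out.add(Action.CREATE);                              // severity 3/4/5 -> 1/2
        } else if (!pages(after)) {
            // severity 3/4/5 before and after: nothing to do
        } else if (!before.serviceKey.equals(after.serviceKey)) {
            out.add(Action.CREATE);                              // new incident under the new key
            out.add(Action.RESOLVE);                             // then resolve the old one
        } else if (before.status != Status.RESOLVED && after.status == Status.RESOLVED) {
            out.add(Action.RESOLVE);                             // ticket resolved
        } else if (before.status == Status.RESOLVED && after.status != Status.RESOLVED) {
            out.add(Action.CREATE);                              // ticket reopened
        } else if (before.status == Status.OPEN && inWork(after)) {
            out.add(Action.ACKNOWLEDGE);                         // Open -> Pending/In Progress
        } else if (inWork(before) && after.status == Status.OPEN) {
            out.add(Action.CREATE);                              // Pending/In Progress -> Open
        }
        return out;
    }

    public static void main(String[] args) {
        Ticket open = new Ticket(1, Status.OPEN, "service-a");
        Ticket working = new Ticket(1, Status.IN_PROGRESS, "service-a");
        System.out.println(actions(null, open));
        System.out.println(actions(open, working));
    }
}
```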
