Wednesday, August 5, 2020

ELK Integration

In this blog, I will give a quick summary of ELK. Before going further, let me answer why you need a log aggregation solution: with cloud-native applications running multiple microservices, each with its own lifecycle and logging mechanism, debugging a production issue is a nightmare without a centralized log aggregation solution.

ELK consists of three open source products:

E - Elasticsearch, a distributed NoSQL search and analytics store that indexes the logs
L - Logstash, a log pipeline tool that ingests logs from different sources and sends them over to Elasticsearch
K - Kibana, a visualization tool to view and search the aggregated logs



From an application developer's perspective it is pretty straightforward to use: all you need are a few configuration changes to point your application's logging configuration to the Elasticsearch/Logstash endpoint and port.
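
For example, if your (Java) application logs through Logback, the logstash-logback-encoder library can ship JSON log events straight to a Logstash TCP input. Below is a minimal sketch of the idea; the library, the host logstash.example.com, and the port 5044 are assumptions for illustration, not details from this post, and most projects would express the same thing in logback.xml instead.

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import net.logstash.logback.appender.LogstashTcpSocketAppender;
import net.logstash.logback.encoder.LogstashEncoder;
import org.slf4j.LoggerFactory;

public class LogstashSetup {

    // Programmatic equivalent of a logback.xml appender definition:
    // encode each log event as JSON and send it to a Logstash TCP input.
    public static void configure() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        LogstashEncoder encoder = new LogstashEncoder();   // one JSON document per event
        encoder.setContext(context);
        encoder.start();

        LogstashTcpSocketAppender appender = new LogstashTcpSocketAppender();
        appender.setContext(context);
        appender.setEncoder(encoder);
        appender.addDestination("logstash.example.com:5044"); // placeholder endpoint and port
        appender.start();

        // Attach to the root logger so every log line is shipped to the pipeline.
        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.addAppender(appender);
    }
}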

Sunday, March 8, 2020

Building high availability systems

High availability is one of the fundamental concepts, and a flawed design or an infrastructure that doesn't support it can hurt you badly. A highly available infrastructure (uptime target 99.995%) has the following traits:
  • Hardware redundancy
  • Software and application redundancy
  • Data redundancy
  • Single points of failure eliminated
Here are some best practices that can help in building systems with high availability:
1. Automatically detect outages - Your system should be smart enough to detect possible outages by monitoring metrics and the health of the application, VMs, and nodes (see the health-check sketch after this list).

2. Eliminate single points of failure - IT infrastructures must have redundant backup components ready to replace any failed component.

3. Invest in a capable failover VM (enough RAM, CPU, and storage) so that the failover instance can handle real-time traffic for a short duration.

4. Ensure the failover instance is not co-located on the same network as your primary instance.

5. Ensure the primary (active) and secondary (passive) instances stay in sync - for example, if the primary instance interacts with other systems via configuration, the same configuration should be replicated to the secondary; the state of both should remain identical even after upgrades.

6. Invest in a highly available and reliable Domain Name System (DNS) service that supports health checks and monitoring, automatic DNS failover, and DNS load balancing.
Ref: http://blog.cloudharmony.com/2012/08/comparison-and-analysis-of-managed-dns.html

7. Invest in a reliable load balancer that can detect unhealthy targets, stop sending traffic to them, and spread the load across the remaining healthy targets. Again, avoid a single point of failure: by implementing redundancy for the load balancer itself, you can eliminate it as a single point of failure.
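
To make point 1 concrete, here is a minimal health-check sketch using the Dropwizard Metrics health-check module; the Database interface and its ping() call are hypothetical placeholders for whatever dependency your service needs to verify, not a real API. A monitoring agent or load balancer can poll such checks to detect an outage and kick off failover.

import com.codahale.metrics.health.HealthCheck;
import com.codahale.metrics.health.HealthCheckRegistry;

public class DatabaseHealthCheck extends HealthCheck {

    /** Placeholder dependency; substitute your real data source or downstream client. */
    public interface Database {
        boolean ping();
    }

    private final Database database;

    public DatabaseHealthCheck(Database database) {
        this.database = database;
    }

    @Override
    protected Result check() {
        // Healthy only if the dependency answers; the monitoring system polls this.
        if (database.ping()) {
            return Result.healthy();
        }
        return Result.unhealthy("Database did not respond to ping");
    }

    // Register the check so an HTTP endpoint or agent can run all checks in one place.
    public static HealthCheckRegistry register(Database database) {
        HealthCheckRegistry registry = new HealthCheckRegistry();
        registry.register("database", new DatabaseHealthCheck(database));
        return registry;
    }
}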


Saturday, March 7, 2020

Monitoring your application metrics and setup Alarms



In this blog let’s explore how we can enable monitoring and create alerts for cloud-based services and applications. Cloud monitoring is a broad category that covers web and cloud applications, infrastructure, networks, platforms, and microservices. This post targets monitoring at the service/API level along with JVM metrics. I am using the components below to demonstrate this:
  • Dropwizard metrics for pushing your application metrics
  • Oracle Telemetry server (a time-series database) to collect the application and JVM metrics and to create alarms and notifications
  • Grafana dashboards for visualizing and analyzing the metrics

In case the terms time-series database, metrics, Grafana, Prometheus, and telemetry are Greek to you and you have still managed to read this far, it's implicit that either you are my friend/stalker, or you are really serious about enhancing your knowledge of monitoring and the related tools and systems. So let’s first try to understand the major components here and then go into the details.

There is a vast number of metrics coming from different cloud services, which can be overwhelming (API gateway metrics, database metrics, compute instance metrics, application-level metrics, JVM-level metrics, etc.), and you need to determine which metrics are the most important to track and find the tools that will report those metrics. Simple examples could be the number of logins per minute, the number of new accounts created, JVM memory usage, and so on (you get the idea). An important point to note is that these metrics change with time, so traditional databases won’t work well here; we also may or may not care about past data and may only be interested in the current time window. Let me not go into the details of sliding windows, timers, histograms, and other low-level concepts here and stick to the topic.



This picture (pardon my handwriting and dirty whiteboard) shows the overall architecture of cloud monitoring.

Time-series DB / Telemetry server: At the center of this is a time-series database that collects the metrics emitted by the cloud services. The services could be deployed on a cloud hosting platform or could be Spring-based apps. They push their metrics into a metrics registry and report them to the telemetry server via a reporting mechanism.
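
As a sketch of that reporting mechanism, and assuming a backend that speaks the Graphite plaintext protocol (the host, port, and prefix below are placeholders; a proprietary telemetry server will typically ship its own reporter), Dropwizard's GraphiteReporter can push a registry's metrics at a fixed interval:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class ReporterSetup {

    public static GraphiteReporter startReporter(MetricRegistry registry) {
        // Placeholder endpoint for the time-series/telemetry backend.
        Graphite graphite = new Graphite(new InetSocketAddress("telemetry.example.com", 2003));

        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("myapp.instance1")      // namespace so each host's metrics stay distinct
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);

        reporter.start(1, TimeUnit.MINUTES);          // push the whole registry every minute
        return reporter;
    }
}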

Some of the popular time-series databases are Prometheus, InfluxDB, Graphite, and OpenTSDB.

Metrics registry: The Dropwizard Metrics library has established itself as the de facto standard for metrics in Java-based applications. The library makes it very easy to measure different metrics within your application. It supports five metric types: gauges, counters, histograms, meters, and timers. You can also use numerous modules to instrument common libraries like Jetty, Logback, Log4j, Apache HttpClient, Ehcache, JDBI, and Jersey.
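
A minimal sketch of how these metric types are used in application code; the registry and metric names (auth.logins, accounts.created, and so on) are illustrative, not taken from this post:

import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class LoginMetrics {

    private final MetricRegistry registry = new MetricRegistry();

    // Meter: tracks the rate of logins (e.g. logins per minute).
    private final Meter logins = registry.meter(MetricRegistry.name("auth", "logins"));
    // Counter: running count of new accounts created.
    private final Counter newAccounts = registry.counter(MetricRegistry.name("accounts", "created"));
    // Timer: latency histogram plus throughput of the login call.
    private final Timer loginTimer = registry.timer(MetricRegistry.name("auth", "login-latency"));

    public LoginMetrics() {
        // Gauge: a value sampled on demand, e.g. current JVM heap usage.
        registry.register(MetricRegistry.name("jvm", "heap-used"),
                (Gauge<Long>) () -> Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
    }

    public void onLogin() {
        logins.mark();
        Timer.Context ctx = loginTimer.time();
        try {
            // ... the actual login work would go here ...
        } finally {
            ctx.stop();
        }
    }

    public void onAccountCreated() {
        newAccounts.inc();
    }

    public MetricRegistry getRegistry() {
        return registry;   // hand this to the reporter shown earlier
    }
}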

Metrics viewer and analyzer: Grafana dashboards are used to visualize and analyze the metrics collected by the telemetry server.


Monitoring IDCS Metrics


IDCS metrics are time-series data that can be viewed and analyzed on the Grafana dashboard https://grafana.oci.oraclecorp.com and the OCI DevOps portal https://devops.oci.oraclecorp.com
Each service's metrics are associated with the following tags:
  • project
  • fleet
  • hostname
  • availabilityDomain

1) Viewing metrics on the Grafana dashboard

The Grafana dashboard https://grafana.oci.oraclecorp.com can be used to view IDCS metrics.
The following sample dashboards show metrics for the cloud service in Grafana.
Sample Dashboard 1:
Sample Dashboard 2:

2) Creating an alarm and PagerDuty alert for your metrics using the OCI DevOps portal

Log in to the OCI DevOps portal https://devops.oci.oraclecorp.com/

Search for your metrics



View Metrics

Here are sample steps to create an alarm on the "create users" metric. If the metric data is greater than the threshold (e.g., the response time is higher than expected), this will trigger a Jira ticket and create a PagerDuty incident.

Turn a metric into a notification in three easy steps:
  1. Create an Alarm that applies to your metric. An alarm converts your metric, which is a series of numbers, into a series of booleans that say whether the value of the metric is acceptable (OK) or not acceptable (ALARM); a small sketch of this threshold evaluation follows this list.
  2. Create a Rule that applies to your alarm. A rule defines the condition, across one or more alarms, under which a notification should be sent. Although the simplest rules apply to only one alarm, you can (and are encouraged to) set up rules that reference multiple alarms.
  3. Create a Notification that refers to your rule and specifies whom to notify and how. When the rule is triggered, delivery of this notification is attempted.
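
To make step 1 concrete, here is a small illustrative sketch (my own simplification, not the portal's actual implementation) of how an alarm turns a series of metric samples into OK/ALARM states using a fixed threshold:

import java.util.List;
import java.util.stream.Collectors;

public class ThresholdAlarm {

    /** ALARM when a sample breaches the threshold, OK otherwise. */
    public enum State { OK, ALARM }

    private final double threshold;

    public ThresholdAlarm(double threshold) {
        this.threshold = threshold;
    }

    // A series of numbers (e.g. response times in ms) becomes a series of states.
    public List<State> evaluate(List<Double> samples) {
        return samples.stream()
                .map(value -> value > threshold ? State.ALARM : State.OK)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ThresholdAlarm alarm = new ThresholdAlarm(500.0); // e.g. 500 ms response-time threshold
        System.out.println(alarm.evaluate(List.of(120.0, 480.0, 950.0, 300.0)));
        // Prints: [OK, OK, ALARM, OK]
    }
}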

1) Create Alarm 

Search for the metrics you want to create the Alarm for:

2) Create Rule

Specify the alarm name created above.
   

3) Create Notification 

Once the notification is created, the notification configuration will be displayed.

PagerDuty Integration

A PagerDuty incident is automatically:
  • Created when a:
    • New Jira ticket is created with severity 1/2
    • Jira ticket severity changes from a severity 3/4/5 to 1/2
    • Resolved/closed Jira ticket with severity 1/2 is reopened
    • Jira ticket component/item changes, and the new component/item has a different PagerDuty service key (create a new incident in the new service key, and then resolve the old incident in the previous service key)
    • Jira ticket is moved to a different Jira project (this also implies that the component/item has changed)
    • Jira ticket status changes from Pending/In Progress to Open
  • Acknowledged when a Jira ticket status changes from Open to Pending/In Progress
  • Resolved when a Jira ticket:
    • Is resolved
    • Severity changes from severity 1/2 to 3/4/5
    • Component/item changes, and the new component/item has a different PagerDuty service key (create a new incident in the new service key, and then resolve the old incident in the previous service key)
    • Is moved to a different Jira project (this also implies that the component/item changed)