Wednesday, August 5, 2020

ELK Integration

In this blog, I will give a quick summary of ELK. Before going further, let me answer why you need a log aggregation solution: with cloud-native applications running multiple microservices, each with its own lifecycle and logging mechanism, debugging a production issue is a nightmare without a centralized log aggregation solution.

ELK consists of three open source products:

E - Elasticsearch, a distributed NoSQL search and analytics store that indexes the logs
L - Logstash, a log pipeline tool that ingests logs from different sources and sends them over to Elasticsearch
K - Kibana, a visualization tool to view and search the aggregated logs



From an application developer's perspective it is pretty straightforward to use: all you need are a few configuration changes to point your application's logging configuration to the Elasticsearch/Logstash endpoint and port.
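
For example, if your (Java) application logs through Logback, the logstash-logback-encoder library can ship JSON log events straight to a Logstash TCP input. Below is a minimal sketch of the idea; the library, the host logstash.example.com, and the port 5044 are assumptions for illustration, not details from this post, and most projects would express the same thing in logback.xml instead.

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import net.logstash.logback.appender.LogstashTcpSocketAppender;
import net.logstash.logback.encoder.LogstashEncoder;
import org.slf4j.LoggerFactory;

public class LogstashSetup {

    // Programmatic equivalent of a logback.xml appender definition:
    // encode each log event as JSON and send it to a Logstash TCP input.
    public static void configure() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        LogstashEncoder encoder = new LogstashEncoder();   // one JSON document per event
        encoder.setContext(context);
        encoder.start();

        LogstashTcpSocketAppender appender = new LogstashTcpSocketAppender();
        appender.setContext(context);
        appender.setEncoder(encoder);
        appender.addDestination("logstash.example.com:5044"); // placeholder endpoint and port
        appender.start();

        // Attach to the root logger so every log line is shipped to the pipeline.
        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.addAppender(appender);
    }
}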

Sunday, March 8, 2020

Building high availability systems

High availability is one of the fundamental concepts, and a flawed design or an infrastructure that doesn't support it can hurt you badly. A highly available infrastructure (uptime target 99.995%) has the following traits:
  • Hardware redundancy
  • Software and application redundancy
  • Data redundancy
  • Single points of failure eliminated
Here are some best practices that can help in building systems with high availability:
1. Automatically detect outages - Your system should be smart enough to detect possible outages by monitoring metrics and the health of the application, VMs, and nodes (see the health-check sketch after this list).

2. Eliminate single points of failure - IT infrastructures must have redundant backup components ready to replace any failed component.

3. Invest in a capable failover VM (enough RAM, CPU, and storage) so that the failover instance can handle real-time traffic for a short duration.

4. Ensure the failover instance is not co-located on the same network as your primary instance.

5. Ensure the primary (active) and secondary (passive) instances stay in sync - for example, if the primary instance interacts with other systems via configuration, the same configuration should be replicated to the secondary; the state of both should remain identical even after upgrades.

6. Invest in a highly available and reliable Domain Name System (DNS) service that supports health checks and monitoring, automatic DNS failover, and DNS load balancing.
Ref: http://blog.cloudharmony.com/2012/08/comparison-and-analysis-of-managed-dns.html

7. Invest in a reliable load balancer that can detect unhealthy targets, stop sending traffic to them, and spread the load across the remaining healthy targets. Again, avoid a single point of failure: by implementing redundancy for the load balancer itself, you can eliminate it as a single point of failure.
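
To make point 1 concrete, here is a minimal health-check sketch using the Dropwizard Metrics health-check module; the Database interface and its ping() call are hypothetical placeholders for whatever dependency your service needs to verify, not a real API. A monitoring agent or load balancer can poll such checks to detect an outage and kick off failover.

import com.codahale.metrics.health.HealthCheck;
import com.codahale.metrics.health.HealthCheckRegistry;

public class DatabaseHealthCheck extends HealthCheck {

    /** Placeholder dependency; substitute your real data source or downstream client. */
    public interface Database {
        boolean ping();
    }

    private final Database database;

    public DatabaseHealthCheck(Database database) {
        this.database = database;
    }

    @Override
    protected Result check() {
        // Healthy only if the dependency answers; the monitoring system polls this.
        if (database.ping()) {
            return Result.healthy();
        }
        return Result.unhealthy("Database did not respond to ping");
    }

    // Register the check so an HTTP endpoint or agent can run all checks in one place.
    public static HealthCheckRegistry register(Database database) {
        HealthCheckRegistry registry = new HealthCheckRegistry();
        registry.register("database", new DatabaseHealthCheck(database));
        return registry;
    }
}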


Saturday, March 7, 2020

Monitoring your application metrics and setup Alarms



In this blog let’s explore how we can enable monitoring and create alerts for cloud-based services and applications. Cloud monitoring is a broad category that covers web and cloud applications, infrastructure, networks, platforms, and microservices. This post targets monitoring at the service/API level along with JVM metrics. I am using the components below to demonstrate this:
  • Dropwizard metrics for pushing your application metrics
  • Oracle Telemetry server (a time-series database) to collect the application and JVM metrics and to create alarms and notifications
  • Grafana dashboards for visualizing and analyzing the metrics

In case the terms time-series database, metrics, Grafana, Prometheus, and telemetry are Greek to you and you have still managed to read this far, it's implicit that either you are my friend/stalker, or you are really serious about enhancing your knowledge of monitoring and the related tools and systems. So let’s first try to understand the major components here and then go into the details.

There is a vast number of metrics coming from different cloud services, which can be overwhelming (API gateway metrics, database metrics, compute instance metrics, application-level metrics, JVM-level metrics, etc.), and you need to determine which metrics are the most important to track and find the tools that will report those metrics. Simple examples could be the number of logins per minute, the number of new accounts created, JVM memory usage, and so on (you get the idea). An important point to note is that these metrics change with time, so traditional databases won’t work well here; we also may or may not care about past data and may only be interested in the current time window. Let me not go into the details of sliding windows, timers, histograms, and other low-level concepts here and stick to the topic.



This picture (pardon my handwriting and dirty whiteboard) shows the overall architecture of cloud monitoring.

Time-series DB / Telemetry server: At the center of this is a time-series database that collects the metrics emitted by the cloud services. The services could be deployed on a cloud hosting platform or could be Spring-based apps. They push their metrics into a metrics registry and report them to the telemetry server via a reporting mechanism.
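
As a sketch of that reporting mechanism, and assuming a backend that speaks the Graphite plaintext protocol (the host, port, and prefix below are placeholders; a proprietary telemetry server will typically ship its own reporter), Dropwizard's GraphiteReporter can push a registry's metrics at a fixed interval:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class ReporterSetup {

    public static GraphiteReporter startReporter(MetricRegistry registry) {
        // Placeholder endpoint for the time-series/telemetry backend.
        Graphite graphite = new Graphite(new InetSocketAddress("telemetry.example.com", 2003));

        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("myapp.instance1")      // namespace so each host's metrics stay distinct
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);

        reporter.start(1, TimeUnit.MINUTES);          // push the whole registry every minute
        return reporter;
    }
}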

Some of the popular time-series databases are Prometheus, InfluxDB, Graphite, and OpenTSDB.

Metrics registry: The Dropwizard Metrics library has established itself as the de facto standard for metrics in Java-based applications. The library makes it very easy to measure different metrics within your application. It supports five metric types: gauges, counters, histograms, meters, and timers. You can also use numerous modules to instrument common libraries like Jetty, Logback, Log4j, Apache HttpClient, Ehcache, JDBI, and Jersey.
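
A minimal sketch of how these metric types are used in application code; the registry and metric names (auth.logins, accounts.created, and so on) are illustrative, not taken from this post:

import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class LoginMetrics {

    private final MetricRegistry registry = new MetricRegistry();

    // Meter: tracks the rate of logins (e.g. logins per minute).
    private final Meter logins = registry.meter(MetricRegistry.name("auth", "logins"));
    // Counter: running count of new accounts created.
    private final Counter newAccounts = registry.counter(MetricRegistry.name("accounts", "created"));
    // Timer: latency histogram plus throughput of the login call.
    private final Timer loginTimer = registry.timer(MetricRegistry.name("auth", "login-latency"));

    public LoginMetrics() {
        // Gauge: a value sampled on demand, e.g. current JVM heap usage.
        registry.register(MetricRegistry.name("jvm", "heap-used"),
                (Gauge<Long>) () -> Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
    }

    public void onLogin() {
        logins.mark();
        Timer.Context ctx = loginTimer.time();
        try {
            // ... the actual login work would go here ...
        } finally {
            ctx.stop();
        }
    }

    public void onAccountCreated() {
        newAccounts.inc();
    }

    public MetricRegistry getRegistry() {
        return registry;   // hand this to the reporter shown earlier
    }
}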

Metrics viewer and analyzer: Grafana dashboards are used to visualize and analyze the metrics collected by the telemetry server.


Monitoring IDCS Metrics


IDCS metrics are time-series data that can be viewed and analyzed on the Grafana dashboard https://grafana.oci.oraclecorp.com and the OCI DevOps portal https://devops.oci.oraclecorp.com
Each service's metrics are associated with the following tags:
  • project
  • fleet
  • hostname
  • availabilityDomain

1) Viewing metrics on the Grafana dashboard

The Grafana dashboard https://grafana.oci.oraclecorp.com can be used to view IDCS metrics.
The following sample dashboards show metrics for the cloud service in Grafana.
Sample Dashboard 1:
Sample Dashboard 2:

2) Creating an alarm and PagerDuty alert for your metrics using the OCI DevOps portal

Log in to the OCI DevOps portal https://devops.oci.oraclecorp.com/

Search for your metrics



View Metrics

Here are sample steps to create an alarm on the "create users" metric. If the metric data is greater than the threshold (e.g., the response time is higher than expected), this will trigger a Jira ticket and create a PagerDuty incident.

Turn a metric into a notification in three easy steps:
  1. Create an Alarm that applies to your metric. An alarm converts your metric, which is a series of numbers, into a series of booleans that say whether the value of the metric is acceptable (OK) or not acceptable (ALARM); a small sketch of this threshold evaluation follows this list.
  2. Create a Rule that applies to your alarm. A rule defines the condition, across one or more alarms, under which a notification should be sent. Although the simplest rules apply to only one alarm, you can (and are encouraged to) set up rules that reference multiple alarms.
  3. Create a Notification that refers to your rule and specifies whom to notify and how. When the rule is triggered, delivery of this notification is attempted.
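
To make step 1 concrete, here is a small illustrative sketch (my own simplification, not the portal's actual implementation) of how an alarm turns a series of metric samples into OK/ALARM states using a fixed threshold:

import java.util.List;
import java.util.stream.Collectors;

public class ThresholdAlarm {

    /** ALARM when a sample breaches the threshold, OK otherwise. */
    public enum State { OK, ALARM }

    private final double threshold;

    public ThresholdAlarm(double threshold) {
        this.threshold = threshold;
    }

    // A series of numbers (e.g. response times in ms) becomes a series of states.
    public List<State> evaluate(List<Double> samples) {
        return samples.stream()
                .map(value -> value > threshold ? State.ALARM : State.OK)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ThresholdAlarm alarm = new ThresholdAlarm(500.0); // e.g. 500 ms response-time threshold
        System.out.println(alarm.evaluate(List.of(120.0, 480.0, 950.0, 300.0)));
        // Prints: [OK, OK, ALARM, OK]
    }
}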

1) Create Alarm 

Search for the metrics you want to create the Alarm for:

2) Create Rule

Specify the alarm name created above.
   

3) Create Notification 

Once the notification is created, the notification configuration will be displayed.

PagerDuty Integration

A PagerDuty incident is automatically:
  • Created when a:
    • New Jira ticket is created with severity 1/2
    • Jira ticket severity changes from a severity 3/4/5 to 1/2
    • Resolved/closed Jira ticket with severity 1/2 is reopened
    • Jira ticket component/item changes, and the new component/item has a different PagerDuty service key (create a new incident in the new service key, and then resolve the old incident in the previous service key)
    • Jira ticket is moved to a different Jira project (this also implies that the component/item has changed)
    • Jira ticket status changes from Pending/In Progress to Open
  • Acknowledged when a Jira ticket status changes from Open to Pending/In Progress
  • Resolved when a Jira ticket:
    • Is resolved
    • Severity changes from severity 1/2 to 3/4/5
    • Component/item changes, and the new component/item has a different PagerDuty service key (create a new incident in the new service key, and then resolve the old incident in the previous service key)
    • Is moved to a different Jira project (this also implies that the component/item changed)