Monitoring and Alerting

Overview

When something goes wrong (perhaps a REST endpoint you rely on goes down, or an integration starts taking five minutes when you expect it to take five seconds) your incident response team should be alerted right away. It should never be the case that customers are calling you to inform you that your integrations are down. Your team should be able to proactively, or at least quickly reactively, respond to issues with your integrations.

With properly configured monitoring and alerting you can put your mind at ease - no news is good news!

Prismatic alert monitors are configurable. You can choose from a variety of alert triggers, including elevated log levels, long execution times, and failed executions. You can notify your integration team via email and SMS, or via Slack, PagerDuty, OpsGenie, or any other notification system you choose by using webhooks.

Terminology

  • An alert group is a set of users to notify (by email or SMS) and webhooks to invoke when an instance does something noteworthy or unexpected, like failing to run to completion.
  • An alert trigger is a noteworthy or unexpected event that causes an alert monitor to fire. An alert trigger may fire if an instance takes longer than expected or logs error or warning messages unexpectedly. You can also trigger on positive things, like successful instance runs or when you set an instance to enabled. A full list of alert triggers is below.
  • An alert monitor is a combination of an alert group and some alert triggers, and is configured for an instance. You add an alert monitor to an instance, specify when the monitor should be triggered, and which alert groups should be notified in the event of a trigger.
  • An alert event is created when an alert trigger causes an alert monitor to fire. For example, one alert event might notify the DevOps team at 07:30 AM that an instance failed to run. If the instance is scheduled to run every 15 minutes, another event would be created 15 minutes later if the issue hadn't been resolved.
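
The relationship between these four terms can be sketched as a small data model. This is only a conceptual illustration: the class names, fields, and the fire and recipients helpers are assumptions made for the example, not Prismatic's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AlertGroup:
    name: str
    users: list[str] = field(default_factory=list)     # notified by email or SMS
    webhooks: list[str] = field(default_factory=list)  # alert webhooks to invoke


@dataclass
class AlertMonitor:
    name: str
    instance: str              # the instance the monitor is configured for
    triggers: list[str]        # e.g. "Execution Failed"
    groups: list[AlertGroup]   # who is notified when a trigger fires


@dataclass
class AlertEvent:
    monitor: AlertMonitor
    trigger: str
    created_at: datetime


def fire(monitor, trigger, now):
    """Return an AlertEvent if the monitor watches this trigger, else None."""
    if trigger not in monitor.triggers:
        return None
    return AlertEvent(monitor, trigger, now)


def recipients(event):
    """Collect users across the monitor's alert groups, without duplicates."""
    seen = []
    for group in event.monitor.groups:
        for user in group.users:
            if user not in seen:
                seen.append(user)
    return seen
```

In this model an instance that fails every 15 minutes would produce a new AlertEvent on each failed run, which matches the 07:30 AM example above.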

Alert Triggers

Many events can trigger an alert monitor:

  • Execution Completed: This will trigger an alert upon a successful run of an instance. You could use this to notify customers when an instance runs to completion.
  • Execution Duration Matched or Exceeded: Does your integration normally take 5 seconds? Do you want to be alerted if it takes longer than 10 seconds? Specify the maximum number of seconds you expect an instance to take, after which you'd like to be notified.
  • Execution Failed: This will trigger an alert upon a failed run of an instance.
  • Execution Overdue: Do you expect your integration to run every X minutes? This will trigger an alert if X minutes pass without the instance running.
  • Execution Started: This will trigger an alert upon a start of an instance.
  • Instance Disabled: This will trigger if an instance is disabled.
  • Instance Enabled: This will trigger if an instance is enabled. You might want to use this to notify project managers when an instance is ready for a customer.
  • Instance Removed: This will trigger if an instance is deleted.
  • Log Level Matched or Exceeded: Are fatal, error, or warn log lines expected in standard execution of your integration? Presumably not. Specify a log level (fatal, error, or warn), and if log lines are written that match or exceed that log level, an alert is triggered.

For More Information: Log Levels
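
The "matched or exceeded" and "overdue" triggers are threshold comparisons. The sketch below shows one way to reason about them. The function names are hypothetical, the list of log levels below warn is an assumption, and the exact boundary behavior of Prismatic's own checks is not specified here.

```python
# Assumed severity ordering, lowest to highest. The text above names fatal,
# error, and warn; the lower levels are included only for illustration.
LOG_LEVELS = ["debug", "info", "warn", "error", "fatal"]


def log_level_matched_or_exceeded(line_level, threshold):
    """True when a log line is at least as severe as the configured level."""
    return LOG_LEVELS.index(line_level) >= LOG_LEVELS.index(threshold)


def duration_matched_or_exceeded(duration_seconds, max_seconds):
    """True when an execution ran for at least the configured maximum."""
    return duration_seconds >= max_seconds


def execution_overdue(minutes_since_last_run, expected_interval_minutes):
    """True when the expected interval has elapsed without a new execution."""
    return minutes_since_last_run >= expected_interval_minutes
```

With a threshold of warn, a line logged at error would trigger the alert, while a line logged at info would not.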

Alert Webhooks

In addition to email and SMS notifications, you can configure alert monitors to invoke a webhook URL with a payload of your choice. An alert webhook could be used to send alert info to the PagerDuty or OpsGenie APIs, your own DevOps alert endpoint, or any other alerting service with an HTTP-based API.

Creating Alert Webhooks

To create an alert webhook, open the Settings page and select the Alert Webhooks tab. Click the +Alert Webhook button, then enter a name, URL, and payload template for your alert webhook.

Alert webhooks are meant to be general enough that they can be used by multiple alert monitors, and their payload templates help with that. Within the Payload Template section you can enter certain keywords, which are replaced when an alert monitor fires with information about the alert monitor, instance, trigger, and monitor URL.

  • $SUBJECT - The string literal "Prismatic.io Alert"
  • $NAME - The name of the Alert Monitor that was triggered
  • $INSTANCE - The title of the Instance that triggered the Alert Monitor
  • $TRIGGER - The name of the trigger on the Alert Monitor that was triggered
  • $URL - The URL that will navigate to the specific Alert Monitor that was triggered
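
To preview what a payload template might look like once the keywords are filled in, you can render it locally. The sketch below uses Python's string.Template, whose $NAME placeholder syntax happens to match the keywords above. The render_payload helper is hypothetical, and whether Prismatic escapes values the same way is an assumption; treat this as a local preview tool rather than a description of Prismatic's substitution logic.

```python
import json
from string import Template


def render_payload(template_text, values):
    """Replace $KEYWORD placeholders and check that the result is valid JSON."""
    # JSON-escape each value so quotes or backslashes in a name cannot break
    # the payload. json.dumps adds surrounding quotes, which are stripped.
    escaped = {key: json.dumps(value)[1:-1] for key, value in values.items()}
    rendered = Template(template_text).safe_substitute(escaped)
    json.loads(rendered)  # raises ValueError if the result is not valid JSON
    return rendered


# Sample values, made up for illustration.
sample_values = {
    "SUBJECT": "Prismatic.io Alert",
    "NAME": "Failure Monitor",
    "INSTANCE": 'Acme "Prod" Sync',
    "TRIGGER": "Execution Failed",
    "URL": "https://example.com/monitor/1",
}

template = '{"text": "$NAME triggered - $INSTANCE failed to run. See $URL"}'
print(render_payload(template, sample_values))
```

Rendering a template against sample values before saving it is a quick way to catch a missing quote or brace.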

After creating the alert webhook, you can optionally add HTTP headers under the Headers tab. Headers are frequently used for passing an authorization token to a webhook.
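
If the alert webhook points at your own endpoint, the receiving side can check the authorization header before accepting the alert. The following is a minimal sketch using only Python's standard library. The bearer-token scheme, the EXPECTED_TOKEN value, the port, and the "summary" field are all assumptions for illustration; your header and payload template determine what actually arrives.

```python
import hmac
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "replace-with-a-shared-secret"  # hypothetical value


def handle_alert(headers, body):
    """Return (status_code, message) for one incoming alert webhook request."""
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {EXPECTED_TOKEN}"
    # compare_digest avoids leaking token contents through timing differences.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401, "unauthorized"
    try:
        alert = json.loads(body)
    except ValueError:
        return 400, "payload is not valid JSON"
    if not isinstance(alert, dict):
        return 400, "payload is not a JSON object"
    return 200, f"received alert: {alert.get('summary', 'no summary')}"


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, message = handle_alert(self.headers, self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(message.encode())


def serve(port=8080):
    """Run the receiver until interrupted. Not called automatically."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Keeping the decision logic in handle_alert, separate from the HTTP handler, makes it possible to test the checks without starting a server.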

Editing Existing Alert Webhooks

To modify an existing alert webhook, click Settings on the left-hand sidebar and then select the Alert Webhooks tab. Click into an existing alert webhook. In this screen, you can modify the payload template in the Template tab, HTTP headers in the Headers tab, or webhook name and URL in the Settings tab.

Deleting Alert Webhooks

To delete an alert webhook, open the Settings page from the left-hand sidebar. Click the Alert Webhooks tab and select an alert webhook. Within the alert webhook's Settings tab, click Delete Alert Webhook. Confirm deletion by clicking REMOVE ALERT WEBHOOK.

Sending Incidents to PagerDuty with Alert Webhooks

Many operations teams prefer to use an incident response service like PagerDuty to track production issues. Alert webhooks can be configured to generate PagerDuty incidents by invoking PagerDuty's API.

To send alerts to PagerDuty, point an alert webhook at https://events.pagerduty.com/v2/enqueue and then configure a payload template that contains PagerDuty API's required fields:

{
  "routing_key": "YOUR-PAGERDUTY-KEY",
  "event_action": "trigger",
  "links": [{ "href": "$URL", "text": "Link to Prismatic alert monitor" }],
  "payload": {
    "summary": "$NAME triggered - $INSTANCE failed to run.",
    "severity": "error",
    "source": "$SUBJECT"
  }
}

Additional fields listed in PagerDuty's docs can be added to the payload template to include more information in the PagerDuty incident. No special headers are required for this alert webhook since the PagerDuty key is passed in as part of the payload. When an alert monitor using this alert webhook fires, an incident is created in PagerDuty.
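
Before saving a PagerDuty payload template, it can help to check a rendered copy for the fields a trigger event needs. The sketch below is a hypothetical local check based on the fields shown in the template above. The list of allowed severity values reflects the PagerDuty Events API v2 as commonly documented, so confirm it against PagerDuty's docs before relying on it.

```python
import json

REQUIRED_TOP_LEVEL = ("routing_key", "event_action", "payload")
REQUIRED_PAYLOAD = ("summary", "source", "severity")
ALLOWED_SEVERITIES = ("critical", "error", "warning", "info")


def missing_pagerduty_fields(rendered_json):
    """Return a list of problems found in a rendered trigger event."""
    event = json.loads(rendered_json)
    problems = [key for key in REQUIRED_TOP_LEVEL if key not in event]
    payload = event.get("payload", {})
    problems += [f"payload.{key}" for key in REQUIRED_PAYLOAD if key not in payload]
    if "severity" in payload and payload["severity"] not in ALLOWED_SEVERITIES:
        problems.append(
            "payload.severity is not one of " + ", ".join(ALLOWED_SEVERITIES)
        )
    return problems
```

An empty list means the rendered payload has every field this check looks for; it does not guarantee that PagerDuty will accept the event.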

Sending Notifications to Slack with Alert Webhooks

Many operations teams use Slack to notify themselves of production issues. Prismatic alert webhooks can be configured to send messages to a Slack channel.

To send alerts as messages to Slack, first generate a new Slack webhook:

  1. Navigate to https://api.slack.com/apps
  2. Click Create New App, adding an app to your workspace.
  3. Under Add features and functionality select Incoming Webhooks
  4. Activate Incoming Webhooks and then Add New Webhook to Workspace
  5. Take note of the Webhook URL. It should be of the form https://hooks.slack.com/services/foo/bar/baz

Use the Slack webhook URL that you generated in a Prismatic alert webhook, and configure the payload template to read similar to this:

{
  "text": "$NAME triggered - $INSTANCE failed to run. See $URL"
}

No special headers are required for this alert webhook. When an alert monitor that uses the alert webhook next fires, a message will be sent to your Slack channel.
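
To confirm that a Slack webhook URL works before wiring it into an alert webhook, you can post a test message yourself. This is a minimal sketch using Python's standard library. The function names are hypothetical, and the URL you pass in should be the one you noted in step 5.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build a POST request carrying a Slack incoming-webhook message."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(webhook_url, text):
    """Send the message and return the HTTP status code (makes a network call)."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send_test_message with your webhook URL and a short string should post that string to the channel you selected when creating the webhook.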

Alert Groups

You will likely want to alert the same group of people if integration X fails and if integration Y fails. To do that, you can create an alert group that can be assigned to multiple alert monitors. That way, if you hire a new DevOps engineer, you can quickly add them to the DevOps alert group and they'll automatically be added to each alert monitor the DevOps group is attached to.

Note that you can add both organization team members and customer users to alert groups. If you wish to notify customers when alerts trigger, for the sake of reusability we recommend creating an alert group per customer, and alert group(s) for your team. You can then attach your team's alert group(s) to all alert monitors, and your customer's alert group to the monitors only for their instances.

Creating Alert Groups

Click Settings on the left-hand sidebar, and select the Alert Groups tab. Click the + Alert Group button on the upper-right and give your alert group a name (e.g. "DevOps Alert Group"). From there, you can enumerate users to be notified and webhooks to be invoked upon an alert being triggered.

Editing Existing Alert Groups

To modify an existing alert group, return to the screen you saw when you created it: click Settings on the left-hand sidebar, select the Alert Groups tab, and click into an existing alert group. Within this screen, you can modify the name of the group and the list of users and webhooks associated with the group.

Deleting Alert Groups

To delete an alert group, click the Settings link on the left-hand sidebar. Then, click the Alert Groups tab and select an alert group. Scroll to the bottom of the alert group's page and click Delete Alert Group. Click REMOVE ALERT GROUP to confirm deletion.

Alert Monitors

An alert monitor is a combination of an alert group (users and webhooks) and an alert trigger that is configured for an instance. When you add an alert monitor to an instance, you specify when the monitor should be triggered, and which alert group(s) should be notified in the event of a trigger firing.

Creating an Alert Monitor

After selecting an instance from a customer's Instances tab or the Instances link on the left-hand sidebar, click the instance's Monitors tab. Click the + Monitor button on the top-right of the screen. Specify a name for the monitor and select a trigger.

After creating the alert monitor you will find yourself in the monitor's Settings tab. Within this tab, you can add additional triggers to your alert monitor within the Triggers card. You can also choose the groups or users to notify and webhooks to trigger when an alert trigger fires.

Editing Existing Alert Monitors

To modify an existing alert monitor, click Instances on the left-hand sidebar and then select an instance. Under the instance's Monitors tab, select a monitor. This will bring you to the same screen you saw when you created the monitor, where you can modify the monitor name and who is notified under the Settings tab.

Clearing a Triggered Alert Monitor

If multiple team members are notified by an alert event, it's important for the team to know whether the event has been addressed. By marking an alert monitor as "cleared", your team member acknowledges the event and indicates that they are working to resolve the issue.

From the instances list, select the instance with one or more triggered monitors, and then enter the Monitors tab. Select one or more triggered monitors. Click the icon to clear your selected events.

Deleting an Alert Monitor

Click Customers from the left-hand sidebar and select a customer. Under the customer's Instances tab, select an instance and then click Monitors. Click into an alert monitor and open the Settings tab. Scroll to the bottom of the page. Click Delete Monitor and confirm deletion by clicking REMOVE MONITOR.

Alert Events

An alert event is created when an alert monitor is triggered. When an event is created, any users in the monitor's associated alert groups receive a notification (email or SMS) with a link to the event. Your team members can indicate that the issue has been acknowledged and is being addressed by clearing the alert event.

Viewing Alert Events

The easiest way to view an alert event is to click the link that is sent in the alert event email/SMS.

Alternatively, click the Instances link on the left-hand sidebar to see a list of all instances. Each instance has an indicator in the upper-right that shows whether any alert monitors have been triggered but not yet cleared.

If you click an instance with triggered monitors and then select the Monitors tab, you can view currently triggered monitors, denoted with red on the left-hand side.

Clicking a triggered monitor will bring you to the monitor's Events tab, where you can view details about the triggered event.

From there, clicking a specific alert event will bring up logs from just before and after the event on the bottom of the page.
