
Alerting

When something goes wrong (perhaps a REST endpoint you rely on goes down, or an integration starts taking five minutes when you expect it to take five seconds), your incident response team should be alerted right away. Customers should never be the ones to tell you that your integrations are down. Your team should be able to respond to issues with your integrations proactively, or at least quickly once they occur.

With properly configured monitoring and alerting, you can put your mind at ease: no news is good news!

Prismatic alert monitors are configurable. You can choose from a variety of alert triggers, such as elevated log levels, long execution times, and failed executions. You can notify your integration team via email and SMS, or via Slack, PagerDuty, OpsGenie, or any other notification system you choose by using webhooks.

Terminology

  • An alert group is a set of users to notify (by email or SMS) and webhooks to invoke when an instance does something noteworthy or unexpected, like failing to run to completion.
  • An alert trigger is a noteworthy or unexpected event that causes an alert monitor to fire. An alert trigger may fire if an instance takes longer than expected or unexpectedly logs error or warning messages. You can also trigger on positive events, like successful instance runs or an instance being enabled. A full list of alert triggers is available here.
  • An alert monitor is a combination of an alert group and some alert triggers, and is configured for an instance. You add an alert monitor to an instance, specify when the monitor should be triggered, and which alert groups should be notified in the event of a trigger.
  • An alert event is created when an alert trigger causes an alert monitor to fire. For example, one alert event might notify the DevOps team at 7:30 AM that an instance failed to run. If the instance is scheduled to run every 15 minutes, another event would be created 15 minutes later if the issue had not been resolved by then.
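
To make the relationship between these four terms concrete, here is a minimal Python sketch of how they fit together. It is an illustration only, not Prismatic's API or data model: every class, field, and trigger name below (AlertGroup, AlertMonitor, AlertEvent, TriggerType, handle) is hypothetical, as are the email address and webhook URL. It replays the example above, in which an instance scheduled every 15 minutes fails twice in a row.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class TriggerType(Enum):
    # Hypothetical trigger names, loosely based on the examples in the text.
    EXECUTION_FAILED = "execution_failed"
    EXECUTION_OVERDUE = "execution_overdue"
    LOG_LEVEL_MATCHED = "log_level_matched"


@dataclass
class AlertGroup:
    """Who to tell: users (email or SMS) and webhooks to invoke."""
    name: str
    users: list[str] = field(default_factory=list)
    webhooks: list[str] = field(default_factory=list)


@dataclass
class AlertEvent:
    """A record that a monitor fired at a particular time."""
    monitor_name: str
    trigger: TriggerType
    created_at: datetime
    notified: list[str]


@dataclass
class AlertMonitor:
    """Ties triggers and alert groups to a single instance."""
    name: str
    instance: str
    triggers: set[TriggerType]
    groups: list[AlertGroup]

    def handle(self, trigger: TriggerType, at: datetime) -> AlertEvent | None:
        # Only triggers this monitor was configured with produce an event.
        if trigger not in self.triggers:
            return None
        recipients = [r for g in self.groups for r in (*g.users, *g.webhooks)]
        return AlertEvent(self.name, trigger, at, recipients)


# Assumed example values; none of these come from a real configuration.
devops = AlertGroup(
    name="DevOps",
    users=["devops@example.com"],
    webhooks=["https://example.com/hooks/alerts"],
)
monitor = AlertMonitor(
    name="Failed runs",
    instance="Nightly sync",
    triggers={TriggerType.EXECUTION_FAILED},
    groups=[devops],
)

# The instance runs every 15 minutes and fails on two consecutive runs.
first_run = datetime(2024, 1, 1, 7, 30)
run_times = [first_run, first_run + timedelta(minutes=15)]
events = [monitor.handle(TriggerType.EXECUTION_FAILED, t) for t in run_times]

for event in events:
    print(event.created_at.strftime("%H:%M"), event.trigger.value, event.notified)
```

In this sketch each failed run yields its own alert event, and a trigger the monitor was not configured with yields none, which mirrors the definitions above: the monitor decides when to fire, and the alert group decides who hears about it.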