Webinar

PagerDuty Pulse: Latest Releases, Features & Capabilities

You asked, we listened. Join us for our quarterly release notes webinar in which we recap the latest and greatest from PagerDuty. In this special edition of our upcoming webinar, you’ll learn about some of the most important capabilities we released in 2017, so you can get the most out of your PagerDuty environment in 2018 and beyond.

In this webinar, we demo capabilities from some of our most exciting releases this past year:

Event Intelligence & Automation: 

Learn how you can centralize, manage, and automate event behavior at scale to extract signal from the noise.

  • Automated alert grouping
  • Similar incidents
  • Event routing

Major Incident Management:

Leverage the best-in-class, end-to-end incident lifecycle for effective, automated response to the most critical incidents.

  • Incident priority
  • Response plays & stakeholder engagement
  • Postmortems
  • ITSM integrations

Platform Extensibility:

Count on best-in-class API support to automate tasks and customize your ideal workflow.

  • Events API v2
  • New API endpoints
  • ChatOps integrations

…And much more!

Watch as we demonstrate new workflows and show you how to automate the end-to-end incident lifecycle to rapidly detect and prevent issues and gain back time for innovation.

Webinar

PagerDuty Pulse | 2017 Q1 Release Notes

Learn more about PagerDuty’s newest innovations that enable you to deliver better software and customer experiences.

In this webinar, we demonstrate the following capabilities:

  • Stakeholder Engagement – With the new Stakeholder user license, you can easily engage business stakeholders during critical incidents. Stakeholder Engagement simplifies cross-functional communication tasks and centralizes critical real-time information.
  • ServiceNow Express – PagerDuty partners with ServiceNow to deliver end-to-end incident management through bi-directional incident workflows, simplified on-call scheduling and response orchestration, and is proud to be one of the first certified integration partners for ServiceNow Express.
  • New API Endpoints – Create incidents via mobile, leverage new API endpoints, including the enhanced Events API to normalize and slice and dice event data, and customize end-to-end incident response workflows.
  • On-Call Timeline – See exactly when you’re on or off-call with a simple timeline.

Webinar

PagerDuty Pulse | 2017 Q3 Release Notes

You asked, we listened. Join us for our quarterly release notes webinar in which we recap the latest and greatest from PagerDuty. In this webinar, you’ll learn about our newest innovations that enable you to integrate machine and human intelligence to effectively manage your digital operations.

In this webinar, we demonstrate capabilities that deliver:

  • Intelligent, Real-Time Decisions — With machine learning and rules-based automation, intelligently surface the exact systems and people context you need, right when you need it.
  • Automated Precision Response — Go from signal to action in seconds and automatically get the exact people on the issue, eliminating manual work and saving time when it matters most.
  • Business-Wide Orchestration — Recover from disruptions and deliver amazing service to customers by mobilizing the right people across the entire business quickly with response best practices.

We demonstrate new workflows and show you how to automate the end-to-end incident lifecycle to rapidly mobilize people and resolve any issue.

Webinar

Better Incident Management with ChatOps

ChatOps (conversation-driven development) is changing the way development and operations teams work, helping increase productivity by an average of 32% and team transparency by 80.4%. By bringing your tools into your conversations, you can automate tasks, develop, and fix issues more effectively by learning and working in a single environment.

Join us as we share use cases and examples of how to leverage ChatOps for better collaboration with more flexibility and speed than ever before. We share and demo our ChatOps extensions to popular chat tools, including HipChat, Slack, Flowdock, Cisco Spark, and more!

You’ll discover how to:

  • Streamline ChatOps incident management
  • Leverage message buttons to increase speed during response
  • Easily fix issues in context without toggling between tools
  • Automate ops-related tasks with bots and slash commands
  • Enforce user permissions and analytics

Case Study

Modernizing IT Processes and Improving the Incident Management Lifecycle

IBM Smarter Workforce needed the right solution in place to help modernize their traditional IT methodologies in order to increase the team’s productivity and tackle incident management at scale. Moreover, IBM Smarter Workforce needed a solution to help them overcome the manual processes around managing on-call schedules and incident escalation.

Read the case study to learn how PagerDuty enabled IBM Smarter Workforce to modernize their incident management lifecycle and overcome the challenges of digital transformation.

“Having PagerDuty in place is a huge win for our IT operations teams.”
— Peter Kosmalski, Manager of Hosting Operations Support, IBM

RefCard

PagerDuty for Cloud Migration

PagerDuty capabilities for Atlassian:

  • Create work items
  • Prioritize service issues
  • Automate support workflows

RefCard

PagerDuty for Atlassian

PagerDuty capabilities for Atlassian:

  • Create work items
  • Prioritize service issues
  • Automate support workflows

RefCard

PagerDuty App for Splunk

With the PagerDuty App for Splunk:

  • Focus on what matters
  • Understand service impact
  • Turn data into action

Webinar

Accelerate Digital Transformation with Integrated Machine & Human Data

With the proliferation of data and tooling, it’s becoming harder to make sense of and take action on business-impacting events, yet it is more critical than ever to do so with revenue and brand impact on the line. Teams need to integrate machine data with human intelligence, alongside response best practices to drive action when it matters most. In this webinar, you’ll learn best practices leveraged by thousands of the most mature operations teams in reducing alert fatigue, improving productivity, and driving an empowered culture.

Join us as we discuss:

  • The demand for intelligence to drive digital transformation
  • How to break down silos across data stores so you can correlate data
  • How teams are coping with chaos and complexity
  • 6 best practices leveraged by top ops teams to undergo digital transformation

Presented by

  • Nancy Gohring, Senior Analyst, 451 Research
  • Ancy Dow, Senior Product Marketing Manager at PagerDuty

Podcast

The Secure Developer: Keeping PagerDuty Secure

In this Podcast:

  • A look at some of the security tools we’ve used and continue to use, how we implemented them, and what goes into the decision of whether we use a new tool or not
  • How security is changing and becoming more of an operational problem — how we’re approaching this changing terrain and what our plans for the future are
  • Insight into how we work with other teams across the company to get the job done in a secure way
  • Beta testing security solutions with our teams to get a solution that worked well for everyone.
  • A peek at how we run our internal security training program

This podcast brought to you by The Secure Developer, a Heavybit podcast series.

Case Study

Cloudflare Reduces Mean-Time-to-Action to Seconds With PagerDuty

Cloudflare’s Site Reliability Engineering (SRE) team sought an incident resolution solution that would help Cloudflare increase the stability of its operations while delivering a flawless experience for every customer.

Read the case study to learn how Cloudflare leverages PagerDuty to help increase the stability of its infrastructure in order to provide more reliable service to customers.

“Mean-time-to-action has dropped from minutes to seconds.”

— Michael Daly, Engineering Manager at Cloudflare

RefCard

PagerDuty for Customer Support Teams

Learn how customer support teams can leverage PagerDuty to:

  • Empower on-call support staff: Centralize customer data from any channel in PagerDuty, and deliver the ideal response every time
  • Optimize people mobilization and processes: Engage stakeholders across the business to orchestrate the ideal resolution in real-time
  • Support a growing user base: Eliminate inefficiencies and rightsize team resources to do more with less

RefCard

How to Measure the ROI of DevOps

In this solutions brief:

  • The importance of DevOps and the impact it has on your organization
  • How to tie ROI to business goals
  • DevOps metrics that can be used in determining ROI

Report

PagerDuty Harnesses Machine Learning

In this report:

  • What sets PagerDuty apart from the competition
  • Unique developments that are underway in machine learning and event management to power new benefits for operations teams
  • Expansion in unique use cases like customer support and security, in addition to the traditional ITOps and DevOps use cases

Ebook

Delighting Customers Through Incident Management

In this ebook, you’ll learn:

  • Why incident management is so crucial to effective customer support
  • Strategies for leveraging an incident management solution to streamline and optimize the customer experience
  • How customer support teams can coordinate support operations effectively across large teams and geographies
  • How you can solve problems before they create major service disruptions that send customers packing

Ebook

Digital Transformation in Financial Services

In this ebook:

  • A look at the current state of financial services
  • Challenges faced by organizations in FinServ
  • Using incident management to accelerate digital transformation and achieve agility and stability

RefCard

Cherwell Integration

With the Cherwell integration:

  • Centralize and correlate all your system alerts from Cherwell into PagerDuty
  • Connect Cherwell to PagerDuty to extend your incident management workflow based on impacted services
  • Easily embed remediation instructions, recruit additional responders, and extend collaboration workflows to all incidents to drive down resolution times.

Report

State of Digital Operations Report: UK

Download the State of Digital Operations Report: United Kingdom to see where your organization stands and how you can turn the perception of preparedness into reality.

  • A clear disconnect between organisations’ abilities to keep up with the rise of digital services in order to resolve consumer-impacting incidents.
  • Insights from 300+ IT personnel in development and operations, from DevOps to management roles, and 300+ consumers in the UK
  • A gap between consumer expectations and organisations’ ability to respond to poorly performing services: IT teams take at least 5X longer than consumers are willing to wait for a service that isn’t performing

Ebook

How to Choose a Reliable Modern Incident Management Solution

In this guide, we’ll help you determine:

  • Whether your provider’s service will be up when you need it most
  • How your provider guarantees delivery of incident notifications
  • If your provider can scale with your organization
  • Whether your provider’s product will always be secure

RefCard

Digital Operations Management Platform

In this datasheet, learn about new capabilities that enable:

  • Intelligent context. Surface the exact machine data and human intelligence right when you need it
  • Automated precision response. Go from signal to response in seconds, with a single tap or automatically
  • Business-wide orchestration. Deliver the best possible service on all fronts by rapidly mobilizing people across departments

Webinar Series

PagerDuty 101

Every new user should have the opportunity to learn PagerDuty best practices to ensure that they’re set up for success. We recognize that having a live training — a place to ask questions as they come up — to assist you as you get started with PagerDuty can be very helpful.

Sign up for PagerDuty 101, our new configuration and responder training, run by our very own Customer Success team. These 60-minute trainings will be hosted twice a month, every month.

Come prepared with questions for our LIVE Q&A!

Who should attend?

  • Incident responders
  • Admins/managers
  • Account owners
  • New Users
  • Customers
  • Folks in a trial

What’s covered?

  • Invite users to your account
  • Create schedules and overrides
  • Set up escalation policies
  • Configure services and integrations
  • Utilize extensions and API access keys
  • Add teams
  • Respond to incidents
  • Manage user profiles

    *Have a question? Contact our Support team and we’ll have someone reach out to you right away.

    Ebook

    Impact of Downtime on Retailers

    In this ebook, learn:

    • The total cost of downtime—and it’s more than just monetary
    • Lessons from Black Friday
    • How incident management helps retailers survive peak workloads
    • How to overcome common retail challenges by following incident response best practices

    Podcast

    A Developer Story

    What’s in this podcast:

    • A look at PagerDuty’s strategy for the developer ecosystem
    • What event intelligence is and why it is relevant now
    • What organizations are doing differently when they utilize event intelligence
    • Application developers who are increasingly aware of their infrastructure
    • How developers are getting closer to being customer-facing
    • The philosophy behind how PagerDuty structures its platform

    Webinar

    PagerDuty Pulse | 2017 Q2 Release Notes

    Learn more about PagerDuty’s newest innovations that enable you to deliver better software and customer experiences.

    In this webinar, we demonstrate the following capabilities:
    • Alert Filtering & Search — Filter, sort, and search for specific alerts using normalized fields such as severity, summary text strings, services, source, component, etc. Minimize cognitive load as you investigate your alerts.
    • New Atlassian Integrations — Create JIRA tickets via PagerDuty incidents with the new JIRA Software integration and eliminate tool toggling when resolving incidents with our best-in-class HipChat extension.
    • Custom Incident Actions — Leverage rich in-app extensibility to create custom actions that are directly accessible within PagerDuty incidents.
    • Major Incident Sync to ITSM Tools — Seamlessly integrate with any ticketing system, have full control over what PagerDuty information syncs with your tools, and classify incidents using your organization’s priority scheme.
    • Postmortems — Get better at resolving and preventing incidents by streamlining the post-mortem process with automated timeline creation, collaborative editing, actionable insights, and more.

    Report

    Australian State of Digital Operations

    In this report, you’ll learn how:

    • Digital incidents have a direct impact on the business, with nearly one-third of respondents reporting that one hour of IT downtime costs their companies between $500,000 and more than $10 million AUD
    • 56.2 percent of respondents noted that their organisations are still experiencing customer-impacting incidents at least once a week
    • Incident management reigns supreme among those who feel prepared to effectively support digital offerings, along with continuous integration, agile development, and ChatOps

    Whitepaper

    Organizing and Optimizing ITSM Toolsets

    In this paper:

    • Actionable strategies that have been leveraged by thousands of organizations that have retrofitted customer-centric and application-centric operational models onto their existing ITSM environments
    • How to empower your people to deliver, fix, and improve IT
    • Top operations challenges faced by organizations as a result of the rise in digital services
    • 6 best practices to follow to optimize ITSM

    Ebook

    4 Methods to Prevent Downtime

    Learn how to implement these 4 best practices to prevent downtime:

    • Inject failure for success
    • Follow continuous integration practices to avoid outages
    • Never deal with the same incident twice
    • Test third-party services end-to-end

    Ebook

    4 Steps to Prepare for an Outage

    Prepare your teams and systems for an inevitable outage by learning how to:

    • Define your business critical metrics
    • Set up your optimal monitoring and alerting
    • Define your severity levels for better prioritization
    • Create a game plan

    Webinar

    DevOps at Scale: How Datadog Is Using AWS and PagerDuty to Keep Pace With Growth and Improve Incident Resolution

    Join us to learn:

    • How Datadog is using the AWS-PagerDuty integration to improve incident response times, manage and prioritize increasing alert volumes, and reduce alert fatigue for its on-call engineers
    • Best practices for analyzing application and service health across every layer of your IT environment
    • How PagerDuty enables collaboration with development teams to reduce resolution times

    *This webinar is part of the APN Partner Webinar Series.

    Article

    Best Practices in Outage Communication

    When a customer outage occurs, its impact is felt across the organization. While the technical response is underway, stakeholders from public relations, customer support, legal, and executives must also all be engaged and kept informed. But as teams become more global and distributed, coordinating streamlined internal and external communications and response only gets harder.

    You need a well-defined plan and processes in place to ensure effective messaging during an outage. This minimizes time wasted when every minute counts, and maximizes transparency and order in the face of stressful, major outages.

    What are some of the main challenges around outage communication?

    Today, outage communication is often manual as well as ad hoc. Unfortunately, this creates several challenges.

    • Managing updates across several siloed channels places additional burdens on the IT Team when they need it least, as they work to put out fires. This risks increasing the time it takes to achieve resolution.
    • Business and internal stakeholders also find themselves frustrated because they don’t know where to go for the latest, relevant updates. They especially don’t want to be hearing about major issues from the customer instead of the team.

    Traditional outage messaging is often done via email distribution lists, conferencing, and chat in multiple, non-consolidated streams. But if the process isn’t managed well, it can be hugely costly with respect to losses incurred from service degradation and impaired productivity. There is a dire need for standardized processes around incident communication, and for centralized information that gets everyone across the business on the same page.

    What are the best practices for communicating an outage?

    Here are a few best practices that will enable you to simplify your outage communication plan:
    1. Establish a single source of truth
       During an outage, 100% of your attention needs to be focused on solving the issue at hand. This leaves no time to waste, let alone on toggling between 4 or 5 tools, to execute mission-critical tasks like collaborating, logging status, and making sure people outside the team also know what’s going on.
    2. Have predefined lists of stakeholders to automatically notify
       This is where doing some pre-planning makes a world of difference in reducing chaos in a war room situation. Don’t exhaust mental energy during an incident trying to remember names of people you need to contact (Mary from the Infrastructure team? John from Support? What’s the name of that Director of Compliance again!?) and figuring out how to get in touch with them. There are great tools out there, like PagerDuty, that enable you to predefine groups of stakeholders that must know about various types of issues. When an incident strikes, automatically notifying all the right individuals with their preferred contact methods can be as easy as pushing a button.
    3. Streamline postmortems to improve future response
       For the most part, systems of record are not where people do the bulk of communicating during the incident response. You’re much more likely to find that information dispersed across multiple places, like ChatOps tools. But to make sure system and process failures aren’t repeated, there needs to be a way to piece together everything that happened chronologically, and prioritize learnings and action items with a post-mortem. Streamlining the post-mortem with templates and easy timeline building is key to learning faster.
    4. Practice, Practice, Practice!
       The best way to get good at responding to and communicating an outage is to regularly practice failure testing. While it’s crucial to do so in a way that doesn’t impact customers, test and try out different things to try to expose potential vulnerabilities. The ensuing response is an important opportunity to get more efficient at getting on top of unplanned issues, and at resolving issues fast while remembering to keep the right people engaged.
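
To make the idea of predefined stakeholder lists concrete, here is a minimal sketch in Python. It is not PagerDuty's API: the group names, people, and contact methods are hypothetical placeholders, and the functions only build a notification plan rather than sending anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stakeholder:
    name: str
    contact_method: str  # hypothetical values such as "email", "sms", "push"


# Hypothetical groups, defined ahead of time and keyed by incident type.
STAKEHOLDER_GROUPS = {
    "customer_outage": [
        Stakeholder("Support lead", "sms"),
        Stakeholder("Director of Compliance", "email"),
    ],
    "infrastructure": [
        Stakeholder("Infrastructure lead", "push"),
    ],
}


def stakeholders_for(incident_type):
    """Return the predefined stakeholders for an incident type (empty if none)."""
    return list(STAKEHOLDER_GROUPS.get(incident_type, []))


def notification_plan(incident_type, summary):
    """Build one (name, contact_method, message) tuple per stakeholder."""
    message = f"[{incident_type}] {summary}"
    return [
        (person.name, person.contact_method, message)
        for person in stakeholders_for(incident_type)
    ]
```

Because the lookup happens in data prepared before the incident, nobody has to recall names or contact details while the outage is in progress.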

    How do you communicate with the incident response team?

    Teams must effectively coordinate incident response across subject matter experts and front-line responders. It’s important to have an efficient way to sound the alarm.
    Get the right people involved
    Appoint an Incident Commander who is the point person for getting all the right people from respective teams on the line, tracking the incident, and coordinating the response. For more information on the role and best practices of being an Incident Commander, check out this webinar.
    Pick your communication channel
    You want to minimize the number of channels that you’re using to communicate with the response team, as tool toggling wastes time. The right channel depends not only on the severity and scope of the incident, but also on your team culture and work location. The main thing that matters here is making it easy to get the right people immediately engaged.
    Document everything
    ChatOps tools are a fantastic resource for the incident response team. Having a simultaneous discussion in a chat client provides actionable, searchable, time-stamped data of who is doing what, and on what services. Even better, you can automate certain tasks and bring important information (like monitoring graphs) into a shared view, which helps drive down resolution times.

    How do you communicate with business stakeholders?

    IT outage management isn’t confined to IT. Because outages potentially affect the entire business and bottom line, organizations should also have a plan for how teams like Support, Legal, Marketing, Sales, etc. are kept in the loop. Have an idea of what to share, set up a place where colleagues can easily get information, and determine who will get updates and how often.
    Decide what to share
    To keep things streamlined, the response team should only share key, high-level updates: How severe is the outage? What is its likely duration? What’s being done, and when can the team expect the next update?
    Automate when you can
    A solution like PagerDuty’s Stakeholder Engagement enables you to automatically notify individuals or groups of stakeholders via preferred contact methods. No more need to try and remember names of people to look up and contact during an outage. Stakeholders can also subscribe to incident status pages to check up on progress.
    Coordinate follow-up
    If colleagues have further questions, they shouldn’t distract individual members of the response team that are heads-down on the incident. To strike a balance between keeping things moving and providing additional context as needed, funnel questions and asks through the Incident Commander.
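
One way to keep stakeholder updates short and consistent is to fix their shape in advance. The sketch below is only an illustration of the questions listed under "Decide what to share"; the field names and the example values are assumptions, not a PagerDuty format.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    severity: str                # how severe is the outage?
    expected_duration: str       # what is its likely duration?
    current_action: str          # what is being done?
    next_update_in_minutes: int  # when can the team expect the next update?

    def render(self):
        """Format the update as four short lines for any channel."""
        return (
            f"Severity: {self.severity}\n"
            f"Expected duration: {self.expected_duration}\n"
            f"What's being done: {self.current_action}\n"
            f"Next update in: {self.next_update_in_minutes} minutes"
        )
```

For example, `StatusUpdate("SEV-1", "about 1 hour", "Rolling back the last deploy", 30).render()` produces a four-line message that the Incident Commander can post as-is.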

    How do you communicate externally?

    According to Inc. magazine, it’s 30 times cheaper to keep an existing customer than it is to get a new one. Being proactive in communicating an outage to users helps you control the story about your outage, and makes it clear that your company makes transparent communication a priority.
    Be transparent with public updates
    Let end users know that you are aware of the issue and at work on a solution. The outage notification can take many forms: a maintenance page on your website, social media post or update to your status page, or perhaps just an internal communication to your customer support team.
    Craft your message
    Provide updates at regular intervals, and give customers short, practical information about how the issue affects them.
    Enable your support team
    A representative from support should always be immediately notified when a major outage takes place. This helps the support team stay on top of communicating the right messaging, updating your status page and support channels in real time, and reaching out to customers both during and after the issue.

    How does PagerDuty support better outage communication?

    PagerDuty supports better outage communication by enabling you to automate the best practice response. With PagerDuty’s Stakeholder Engagement, you can automatically engage the right stakeholders with real time updates via their preferred communication channels, and orchestrate the right business-wide response to customer-impacting issues.

    How to become great at outage communication

    Try out PagerDuty incident resolution, automate stakeholder communications, streamline and learn from postmortems, and more, all according to best practice. Get started with a free 14-day trial. Be sure to check out our ebook, Best Practices in Outage Communication, if you’d like to dive deeper into the best practices mentioned above. Our Incident Commander training is a great resource for building up Incident Commanders who can drive clarity in both internal and external communications during a response.

    Article

    What Is Continuous Deployment?

    Continuous Deployment is the process of automatically deploying software updates to the production environment, once code is checked into the software repository by development. This relies upon multiple layers of automation to perform software testing and deploy to production.

    Why would an organization want to deploy automatically?

    If all tests have been passed and every code update is release ready, pushing updates into production automatically maximizes the release velocity with which value is provided to users, and minimizes the gaps of time during which applications are not incrementally improving. Adopting continuous deployment also provides more rapid feedback from users on new functionality, allowing organizations to make better informed decisions.

    Continuous Deployment vs. Continuous Delivery

    Continuous Deployment is related to the concept of Continuous Delivery, but takes it one step further: while Continuous Delivery creates a situation where software is always ready to be deployed into production, Continuous Deployment automates that final deployment step. Both Continuous Delivery and Continuous Deployment are strongly associated with DevOps best practices.  In order to reach full DevOps maturity and maximize the speed and quality of software delivery with automation, you need to have the components of a full Continuous Delivery pipeline in place.  However, full DevOps practices don’t necessarily require using Continuous Deployment.
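
The difference between the two can be sketched as a single flag on the last pipeline step. This is a simplified illustration, not how any particular CI/CD tool works; `run_pipeline`, `checks`, `deploy`, and `auto_deploy` are hypothetical names.

```python
def run_pipeline(checks, deploy, auto_deploy):
    """Run every check in order; deploy automatically only if auto_deploy is True.

    Returns "failed", "release-ready", or "deployed".
    """
    for check in checks:
        if not check():
            return "failed"  # a failing check stops the pipeline
    if not auto_deploy:
        # Continuous Delivery: the build is ready, a person triggers the release.
        return "release-ready"
    # Continuous Deployment: the final step is automated as well.
    deploy()
    return "deployed"
```

With `auto_deploy=False` the pipeline ends with a releasable build and `deploy` is never called; with `auto_deploy=True` the same passing checks lead straight to production.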

    Continuous Deployment vs. Continuous Integration

    Continuous Deployment is an extension of Continuous Integration (CI), and relies on CI to validate that checked-in code is of high enough quality to be production-ready.  CI performs automated testing of code after every check-in, ensuring that the entire software codebase works properly with the newly committed code.  These unit, build, and integration tests are what validate that the update is safe for deployment, and can proceed through the additional phases of automation that culminate in deployment to production. Obviously, CI by itself doesn’t get you all the way to Continuous Deployment.  Frequently, CI tools like Jenkins or TeamCity will be used together with Application Release Automation tools like Automic, Serena, or XebiaLabs that handle the actual deploy automation.

    Can Incident Management Support a Continuous Deployment Process?

    Because Continuous Deployment is a fully automated process extended from code check-in to pushing code to the production environment, it’s vital to have triggers for manual intervention should anything go wrong. Connecting an incident management system like PagerDuty to continuous deployment tools provides visibility into any errors that may take place at any point in the pipeline.  It also allows you to make sure the proper people are notified with the right context to rapidly solve the issue: software development if code errors are found, infrastructure managers if the staging environment or production system is configured improperly, and so on.
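
As a rough sketch of such a trigger, the Python below builds a PagerDuty Events API v2 "trigger" event for a failed pipeline stage and routes it by stage. The endpoint and the `routing_key`, `event_action`, `dedup_key`, and `payload` fields follow the Events API v2 format; the stage names, the routing-key placeholders, and the stage-to-team mapping are assumptions made up for illustration.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Hypothetical mapping from pipeline stage to an integration routing key.
# Test failures go to development; environment problems go to infrastructure.
ROUTING_KEYS = {
    "test": "DEV_TEAM_ROUTING_KEY",
    "staging": "INFRA_TEAM_ROUTING_KEY",
    "production": "INFRA_TEAM_ROUTING_KEY",
}


def build_trigger_event(stage, error, pipeline_id):
    """Build an Events API v2 trigger event for a failed pipeline stage."""
    return {
        "routing_key": ROUTING_KEYS[stage],
        "event_action": "trigger",
        # Repeated failures of the same stage update one incident.
        "dedup_key": f"{pipeline_id}:{stage}",
        "payload": {
            "summary": f"Deployment pipeline failed at {stage}: {error}",
            "source": pipeline_id,
            "severity": "critical" if stage == "production" else "error",
            "component": stage,
        },
    }


def send_event(event):
    """POST the event to the Events API and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A deployment tool would call `send_event(build_trigger_event(...))` from its failure hook, with real routing keys in place of the placeholders. Keeping the payload construction separate from the network call makes the routing logic easy to test on its own.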

    How to Get More Out of Continuous Deployment

    PagerDuty facilitates better continuous deployment by enabling you to deploy the most innovative solutions for your customers, with complete confidence. By hooking up your deployment data to PagerDuty, you can streamline operational responsibilities with the right tools and information to rapidly triage, resolve, and learn from issues when they arise. Try it out now for yourself with a free 14-day trial.

    Article

    What is Continuous Delivery?

    Continuous delivery, also known as CD, refers to the ability to automate software deployment so that it’s always ready to be released into the production environment at any time. In practice, this involves dividing software releases into small chunks and performing build and unit tests continuously on all code, which results in increased release velocity.

    Why implement continuous delivery practices, and what’s the advantage?

    By deploying code and application updates more frequently, each individual update is lower risk, and can lead to zero-downtime deployments that are undetectable to end users. When software updates or bug fixes can be released incrementally as they’re produced, without waiting for the result of large waterfall-style releases, it allows organizations to deliver value to end users much more rapidly. The software performance and overall quality end up being much higher, as quality assurance and performance testing can be carried out continuously throughout the development process.

    Continuous delivery relies heavily on automation, removing repetitive manual tasks and replacing them with tools that perform them much more rapidly, and conforming to a standard set of rules that eliminate errors. By connecting multiple layers of automation, software organizations can build a deployment pipeline that continuously delivers software packages from development to operations.

    Best practices for successful continuous delivery

    Companies implementing continuous delivery strategies can seek to optimize processes across multiple points in the software delivery lifecycle. Some key areas that are typically focused on include:
    • Continuous integration: This refers to the practice of build automation by performing integration and other testing immediately upon check-in of code into a version control repository. Older practices might involve a daily (or even less frequent) test run to make sure all code submitted by developers in that time period worked properly with the rest of the code base. Continuous integration, on the other hand, is always ensuring that every code update works well with the rest of the code base, and that every check-in results in releasable code. Some popular tools for continuous integration include Jenkins, Bamboo, and TeamCity.
    • Configuration automation: To ensure that software delivery functions smoothly and efficiently, all environments used for development, testing, and production deployment should be functionally identical. This has led to the notion of “configuration as code” — the use of software practices to automatically configure and deliver the same environment for use in different phases of the software delivery lifecycle. Just as code can be stored in version control repositories, so can environment configurations. Popular tools for configuration automation include Chef, Puppet, and Ansible.
    • Application release automation: Once the code is tested and the environments are configured properly, there’s value in automating the process of packaging and deploying an application or update in a standard way across different environments (including managing libraries and dependencies). Popular tools in this space include Automic and Octopus Deploy.
    • Monitoring and analytics: While most organizations are familiar with the value of monitoring tools and likely already use some combination of them for their IT operations, creating a feedback loop back into the software development process is vital for ensuring high-quality software releases. Keeping a close eye on software performance to identify problem areas – ideally, before they impact customers – allows you to be ultra-responsive to user concerns, and release rolling fixes immediately rather than wait for the next big release. There are many tools out there for different types of monitoring, but some popular ones include New Relic, AppDynamics, and Nagios.

    How does continuous delivery relate to DevOps?

    DevOps and continuous delivery are different but closely related concepts. DevOps is a mindset and a process by which development and operations align their incentives and practices to create a high-performing IT organization, all the way from coding to deployment of code into production. Continuous delivery refers to specific techniques for delivering software more rapidly and efficiently in smaller chunks – by adopting DevOps best practices and team structures, it’s easier to get organizational buy-in for building a continuous delivery process. And without building a continuous delivery process, you can’t fully implement DevOps.

    Continuous delivery vs continuous deployment

    By implementing continuous delivery, you ensure that your code is always deployable to production at any time. However, ensuring deployability is different from actually deploying it. Continuous deployment takes the pipeline that is set up by continuous delivery, and automatically deploys changes into the production environment once they’ve passed the suite of unit, build, and security tests automatically triggered by a code check-in.
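
The distinction can be illustrated with a minimal sketch. The `Stage` type, the `run_pipeline` function, and the `auto_deploy` flag are hypothetical names chosen for illustration, not part of any particular CI/CD tool. The same pipeline serves both practices; the only difference is whether a passing build is shipped automatically.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def run_pipeline(stages: List[Stage], auto_deploy: bool) -> str:
    """Run stages in order and stop at the first failure."""
    for stage in stages:
        if not stage.run():
            return f"failed: {stage.name}"
    # Continuous delivery stops here with a deployable build;
    # continuous deployment goes one step further and ships it.
    return "deployed" if auto_deploy else "ready to deploy"


# Hypothetical stages that always pass, standing in for real test suites.
stages = [
    Stage("build", lambda: True),
    Stage("unit tests", lambda: True),
    Stage("security tests", lambda: True),
]
```

With `auto_deploy=False` the pipeline models continuous delivery (a human decides when to release); with `auto_deploy=True` it models continuous deployment.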

    Continuous delivery vs continuous integration

    As mentioned above, continuous integration is one of the components of a full continuous delivery process. In order to be able to deliver software code continuously, it must be of deployable quality at all times – that means it must be continuously integrated into the full codebase. Though the name just refers to integration, continuous integration processes should involve a full set of test suites, including unit, build, and security tests, so that any problems are identified immediately upon code check-in.

    How does incident management support Continuous Delivery?

    The ultimate goal of a continuous delivery process is to delight the end user by providing value to them as rapidly as possible. Any interruption in their experience must be identified and resolved as rapidly as possible – ideally before any customer is impacted. In an organization that follows DevOps and continuous delivery best practices, agile development teams are responsible for the code they’ve written even in production: “you code it, you ship it, you own it.” When an incident occurs, there must be a clear and well defined response and escalation process, and everyone involved must have access to all the pertinent diagnostic data. Then once a solution has been found, continuous delivery processes are vital in testing and deploying a fix as rapidly and safely as possible. After an incident takes place, an important aspect of continuous delivery is continuous improvement: understanding what went wrong and how to do better next time. This is true both for the application problem itself, as well as for the process used in handling the incident. Typically this involves an incident post-mortem in which timelines are put together detailing every step of what went wrong and how it was handled, leading to concrete steps the team can take to improve.

    How to get more out of Continuous Delivery

    PagerDuty facilitates better continuous delivery: it makes unplanned work easier to deal with, so you can ship code with confidence. By making it easier to adopt best practices, the end-to-end incident resolution lifecycle also enables you to learn from issues and constantly improve the resiliency of your systems and processes. Try it out now for yourself with a free 14-day trial.

    Article

    On-Call Rotations and Schedules

    Just as doctors go on-call to support emergency patient needs around the clock, IT organizations task dedicated groups of engineers with going on-call to fix issues for software services as they arise. These engineers are put on an on-call rotation, a method of rotating scheduled shift work across everyone on the team that is responsible for maintaining software availability.

    During their shift, should something break, the on-call engineer will get paged (via a smartphone push notification, phone call, text, email, or possibly even a BlackBerry or pager that gets passed around if it’s an older organization). The on-call engineer is responsible for immediately taking action on the page and must fix the issue quickly or escalate it if they can’t fix it. As they must be available to troubleshoot at any point during their shift, rotating on-call responsibilities among multiple individuals or teams is important for overcoming alert fatigue and protecting work-life balance. The practice of having an on-call rotation is typically an organization’s first step towards committing to reliability for customers and users. On-call engineers are the first line of defense in ensuring customer-impacting outages are quickly noticed and resolved by someone on the team. That is why setting up such a process is critical for having 24x7x365 coverage in managing issues as they arise. And by tying a timeout threshold to each tier of an escalation policy (e.g. the incident must be acknowledged or resolved within 30 minutes before it’s auto-escalated to the next line of defense), organizations can guarantee that when something breaks, someone will be on it fast. They can better meet their SLAs, instead of collectively falling asleep at the wheel during a customer-impacting issue because the right information wasn’t quickly routed to the right person.
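
As a rough illustration of timeout-based escalation, the sketch below works out who should be paged once an incident has gone unacknowledged for a given number of minutes. The `Tier` type, the `current_responder` function, and the three-tier policy are hypothetical and only model the idea; they are not how any specific product implements escalation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier:
    responder: str
    timeout_minutes: int  # time allowed before escalating to the next tier


def current_responder(tiers: List[Tier], minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be paged after the incident has gone unacknowledged."""
    elapsed = 0
    for tier in tiers:
        elapsed += tier.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier.responder
    return None  # every tier timed out without an acknowledgement


# Hypothetical policy using the 30-minute threshold from the text.
policy = [
    Tier("primary on-call", 30),
    Tier("secondary on-call", 30),
    Tier("engineering manager", 30),
]
```

In this model, an incident left unacknowledged for 30 minutes moves from the primary to the secondary responder, and so on down the policy.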

    Creating an effective on-call schedule

    Some organizations manually use wiki pages or spreadsheets to manage on-call rotation schedules. However, changes often don’t propagate in real-time, and getting the right people on issues can quickly become challenging if contact information is outdated, or time zone math is incorrect, among other things. At the same time, organizations are also finding that every minute of downtime can cost thousands of dollars and irreversible damage to brand reputation. Fumbling through a wiki page or static spreadsheet to find and notify the right on-call engineer is quickly becoming a very costly method of managing on-call rotation information.

    On-call rotation best practices to keep in mind

    Here are a few steps that you can take in effectively creating and managing on-call rotations that meet the needs of your team:
    Consider software for automation
    On-call scheduling software can be a great investment for your team. It saves time and minimizes manual overhead by automatically routing notifications via engineers’ preferred contact methods based on predefined schedules. This removes several steps in getting the right information to the right expert when every minute counts.
    Set up teams
    Define the teams of individuals that have on-call responsibilities for every service. Be sure to set up both service and server-level monitoring and dashboards for teams to understand system performance and health. Whenever an issue arises, it should route to the on-call engineer on the appropriate team that manages that service. The on-call engineer should also be able to immediately recruit other teammates as needed to help collaborate on issue resolution with a collaboration tool, such as conferencing or chat.
    Define escalation policies

    Determine who should be in the respective lines of defense and what actions must take place when an incident occurs. For instance, the first tier of defense might be the software engineer who wrote the code, while the second tier consists of someone from the operations team who better understands the underlying network and hardware infrastructure — or vice versa.

    Establish time limits
    If you have an availability SLA with your customers or end users, it is critical to define time limits. This way, if the first responder doesn’t take action within the timeframe, the issue automatically gets escalated and won’t be missed.
    Enable easy overrides
    Make sure there’s an easy way for people to edit the schedule to accommodate shift swaps as needed should an unexpected event come up such as an appointment or PTO.
    24x7 coverage
    Lay out shifts to see if there are any gaps and ensure complete coverage that correctly takes time zones into account.
    Transparency and communication
    Everyone should be notified and kept in the loop of changes to the schedule, so no one is caught off guard or unknowingly has a weekend ruined because of a last minute change that wasn’t communicated.
    Be aware of on-call hours
    To the point of transparency and communication, help people know in advance when they’ll be on call and when they’ll be off, so they never miss a shift and can plan activities accordingly. This can be easily done with an on-call timeline.
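
The coverage check described under "24x7 coverage" can be sketched in a few lines. This is a minimal example, assuming shifts are given as timezone-aware start and end times; the `find_gaps` function, the three regional shifts, and the fixed UTC offsets are all hypothetical and ignore daylight saving time.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Shift = Tuple[datetime, datetime]  # (start, end), both timezone-aware


def find_gaps(shifts: List[Shift], window_start: datetime, window_end: datetime) -> List[Shift]:
    """Return the intervals inside the window that no shift covers."""
    gaps: List[Shift] = []
    cursor = window_start  # everything before the cursor is known to be covered
    for start, end in sorted(shifts):
        if cursor >= window_end:
            break
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Fixed UTC offsets keep the example simple; they ignore daylight saving time.
UTC = timezone.utc
SINGAPORE = timezone(timedelta(hours=8))
NEW_YORK = timezone(timedelta(hours=-5))

shifts = [
    (datetime(2024, 1, 15, 8, tzinfo=SINGAPORE), datetime(2024, 1, 15, 17, tzinfo=SINGAPORE)),
    (datetime(2024, 1, 15, 9, tzinfo=UTC), datetime(2024, 1, 15, 17, tzinfo=UTC)),
    (datetime(2024, 1, 15, 13, tzinfo=NEW_YORK), datetime(2024, 1, 15, 19, tzinfo=NEW_YORK)),
]
gaps = find_gaps(shifts, datetime(2024, 1, 15, tzinfo=UTC), datetime(2024, 1, 16, tzinfo=UTC))
```

Here each team works a reasonable local day, yet the comparison in UTC reveals a one-hour hole between 17:00 and 18:00 UTC, which is exactly the kind of time zone mistake that is easy to miss in a spreadsheet.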

    Benefits of an effective on-call rotation

    There are several benefits that make establishing an effective on-call rotation a highly worthwhile investment:
    • Improved team transparency and accountability in handling issues
    • Better service reliability by quickly acting on and resolving alerts
    • Happier customers, who can contact on-call staff for urgent issues at any time or rest assured that issues will always be fixed quickly
    • Less wasted time in getting on-call staff on issues
    Collectively, all of this leads to shorter service disruptions, less loss of revenue and customers, and better brand reputation.

    Who goes on-call?

    Traditionally, on-call rotation responsibilities have been delegated to sysadmins or operations engineers (including HelpDesk and the NOC). Development teams would primarily be responsible for designing, building, and shipping new services and functionality. They would then “throw code over the wall” to operations teams, who would debug, run, operate and maintain the code. However, this siloed process created some significant challenges in accountability, cross-functional alignment, scalability, and reliability. Developers felt less ownership of impacting the customer experience, and when they didn’t have experience handling production workloads, they were more likely to deliver non-performant code that didn’t fully scale or had high operational load. Operations engineers would often take longer to fix broken code that was written by someone else and sometimes ended up having to escalate to the developer anyway. As a result, while most operations in enterprises to date have largely been centralized, many organizations are beginning to distribute operational responsibilities to improve the performance of services and applications, instead of operating monolithic systems. Increasingly developers are going on-call for their own code, which closes the feedback loop by encouraging collaboration between development and operations to proactively build more resilient, production-ready services. New roles have also spun up, such as DevOps Engineer and Site Reliability Engineer. These roles often focus on faster and safer releases, improving reliability via automation, and improving the software lifecycle by building internal tools that automate the manual, human labor typically involved in operations (triaging, change management, monitoring, etc.). 
    As more groups within an organization take on operational responsibilities, as opposed to the NOC triaging all issues and trying to route them to the right people, cross-functional teams typically can focus on higher-value customer experience metrics and collectively work together to improve them.

    What on-call rotation schedules does PagerDuty support?

    PagerDuty can support any kind of custom on-call rotation type, including on-call after-hours support, follow-the-sun, daily, weekly, or split shift rotations. We enable you to create multiple scheduling layers (a group of people who rotate on-call responsibilities through the same shift) within a single schedule. Below, we’ve highlighted some common configurations and on-call schedule templates from our Support Knowledge Base.
    • Getting Started - Learn the basics of how to create an on-call schedule, including how to add users, define rotation frequencies and time-of-day restrictions, and more.
    • Complex irregular schedules - This schedule is set up for teams that rotate shifts that are on for one week, and then off for a few.
    • Complex schedule for 2 users on a 2-day rotation with separate weekends - This example shows a complex schedule for two users that are on a two-day rotation. However, on Saturday and Sunday, the on-call user is on call for 24 hours.
    • Complex schedule with restrictions - PagerDuty enables you to build complex schedules where users trade off the early morning, morning, evening, weekend, and other shifts for varying numbers of hours respectively. Click the link for an example.
    • Complex split shift rotation - This example shows you how to create a time-restricted rotation where each shift is split among multiple users.
    • Creating primary and secondary on-call schedules - Creating primary and secondary on-call schedules creates multiple lines of defense if the primary on-call engineer misses a notification. You can add multiple schedules as progressive levels of an escalation policy to ensure a backup user will respond to an incident.  
    • Follow-the-sun schedule - The follow-the-sun schedule is used by teams that may work internationally in different time zones, and ensures full 24/7 coverage.
    • Inverse schedules on an escalation policy - If you have two or more users that rotate primary and secondary on-call shifts, then you will want to create two on-call schedules and add each of those schedules to a separate level of an escalation policy.
    • Schedule users on-call every other week - You can create multiple layers within your schedule to accommodate multiple users that hand off every other week (for example, 2 on-call engineers who cover weekdays and 2 who cover weekends, who rotate weekly).
    • Expert that is always on-call - You can create an additional layer to always route certain types of issues to specific experts (for instance, a DBA, Network Architect, etc.)
    And much more! Contact support@pagerduty.com if you have any questions. We’re more than happy to help you with any custom schedule management needs and set up ideal on-call rotations for developers, NOC teams, support teams, security teams, and more.
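
To make the idea of a scheduling layer concrete, here is a minimal sketch of a single round-robin layer in which a group of people rotate through the same shift. The `on_call` function and the example users are hypothetical; this is an illustration of the concept, not PagerDuty's scheduling logic.

```python
from datetime import datetime, timedelta
from typing import List


def on_call(users: List[str], rotation_start: datetime, shift_length: timedelta, at: datetime) -> str:
    """Return the user on call at `at` for a simple round-robin layer."""
    if at < rotation_start:
        raise ValueError("time is before the rotation starts")
    # Integer division of two timedeltas gives the number of completed shifts.
    shifts_elapsed = (at - rotation_start) // shift_length
    return users[shifts_elapsed % len(users)]


# Hypothetical weekly rotation with hand-off on Mondays at 09:00.
team = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, 9, 0)
weekly = timedelta(weeks=1)
```

Changing `shift_length` to `timedelta(days=1)` turns the same layer into a daily rotation; more elaborate schedules stack several such layers.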

    How to get the most out of on-call scheduling

    PagerDuty streamlines on-call rotation management for any kind of rotation type or team. Our on-call scheduling capability includes simplified editing, SSO integration, automated escalations, and much more. Try it out now for yourself with a free 14-day trial. We hope these resources enable you to formalize your on-call rotation process to make it as easy as possible for your team to respond to issues.

    Article

    User Roles and Permissions

    Nearly every organization has security and compliance requirements around data access. IT Operations teams must adhere to a standardized process to correctly roll out tools and resources to all users with respect to on-boarding, configuration, permissions, and security — especially as organizations scale. It’s table stakes to implement a permissions model that ensures users only have access to the data required to perform their respective roles.

    Locking down user access to different enterprise resources is foundational for upholding security standards and ensuring organized and purposeful resource provisioning. That’s why enforcing user privileges with permissions has long been an IT governance best practice,  ensuring that appropriate levels of data access are granted based on user roles. Permissions models typically follow the principle of least privilege. This is a standard concept in security practices that requires that any module (i.e. a user, process, etc.) only be granted the minimum level of privileges required to execute its intended function. Abstracting data so that users don’t see details that are nonessential for their roles also improves productivity, as users can focus solely on the information that matters. Organizations also bolster this with the separation of duties concept, which disseminates critical responsibilities to more than one person or department to manage risk, error, and fraud. Most permissions models build on a predefined access control list, a set of data that defines which access rights each user has to specific system objects. When a request is received, the tool will check the access control list for its associated security attribute to ensure the requester has the permissions required to access the resource. With any tooling investment, organizations seek solutions where teams are using the same tooling in the same capacity. When teams start employing their own solutions (Shadow IT), locking down access and enforcing behaviors to meet organizational requirements becomes very difficult and can quickly complicate things such as billing. In the optimal scenario, a central admin has the ability to control the level of access users have through a single instance of the tool, and the tool meets the organization’s policies and guidelines as related to information security and data protection. 
    Admins are thus able to track activity across all of the different teams, while teams can only manage the specific objects they need access to.
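
The access control list check described above can be sketched as follows. This is a simplified model under stated assumptions: the `is_allowed` function, the user names, and the `schedule:payments` resource identifier are hypothetical. The point it shows is the deny-by-default behavior that follows from the principle of least privilege.

```python
from typing import Dict, Set, Tuple

# Maps (user, resource) to the set of actions that user may perform on it.
AccessControlList = Dict[Tuple[str, str], Set[str]]


def is_allowed(acl: AccessControlList, user: str, resource: str, action: str) -> bool:
    """Deny by default: allow only actions explicitly listed for this user and resource."""
    return action in acl.get((user, resource), set())


# Hypothetical entries: alice can edit the schedule, bob can only read it.
acl: AccessControlList = {
    ("alice", "schedule:payments"): {"read", "write"},
    ("bob", "schedule:payments"): {"read"},
}
```

Any user or resource that has no entry gets no access at all, so a missing entry can never widen privileges by accident.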

    When choosing a solution that upholds permissioning standards, tool and process owners typically look for the following:

    • The solution should fit different types of teams
    • The solution should align with existing security policies
    • The solution should integrate with existing infrastructure and user repositories defined in their identity management solution (e.g. AD / LDAP), so that access across applications aligns with the organizational hierarchy and structure, the definition of user groups, and the distribution of access privileges
    • The solution should associate outcomes, usage, and other metrics with specific users and teams to show the business improvements and optimizations made

    Why do user permissions matter in incident management?

    Just as with any other tooling investment, user permissions matter a lot in any incident management solution. They’re critical to protecting data, especially if there is any sensitive information around any incident that must only be accessible by a specific individual or team. They help teams adhere to compliance requirements around data access and they improve user and team productivity when responding to issues, as people aren’t trying to parse through a ton of information that’s irrelevant to them. When it comes to enterprise permissions, the model must be highly scalable. This way, administrators can map access to groups of users, instead of having to manually configure access for each individual user.

    Here are the key reasons why it’s important to ensure incident management roles and permissions are aligned to your organizational requirements:

    • Enhance security across all objects. Only admins and account owners should have the centralized ability to lock down and secure read/write access to specific objects within the incident management solution (e.g. schedules, escalation policies, incidents that are aligned to specific IT services, etc.). This ensures users can only view and perform actions on objects they are supposed to.
    • Increase user productivity. Teams gain productivity as individual users access only those resources specific to their role. As users no longer need to toggle through every single object in order to find the relevant ones to interact with, they can focus their attention on data points that actually matter to their role.
    • Protect data and meet compliance requirements. Organizations of any scale are better positioned to meet compliance requirements when designated admins or account owners can exercise security principles.
    • Granularly define access. Admins should manage access to various objects within the incident management solution across independent teams. This is done through the creation of custom user roles, or if a more scalable solution is required, then access permissions should be defined for groups of users requiring the same set of functionality. Access to specific objects can be layered on top of a base role to maximize granularity,  enabling admins to manage and secure resources without over or under-provisioning user privileges.
    • Enforce API access. Many incident management solutions pull in third-party monitoring data and can extend functionality via an API. Being able to customize the incident response workflow to meet specific needs and shave off time during response is valuable to most teams that operate IT services. As such, it’s important to choose an incident management solution where permissions are enforced automatically across the entire platform, including the API. API keys retrieved through the platform should grant a user only the level of access to objects that is defined in their existing role.

    Who needs to consider permissions?

    Permissions should be a priority for companies of all sizes. Your organization may be a large enterprise with a central operations team and independent siloed teams. Or, it may be a mid-sized organization with growing infrastructure complexity and distributed ownership over operational responsibilities. Regardless of your organization’s size and mode of operations, permissions matter to any stakeholder who is responsible for tooling investments and for centrally managing implementations. They also must be considered by individuals who are both administrators and users of the tools, such as NOC, CentralOps, and DevOps managers. These individuals need to manage visibility and access to objects across independent, siloed teams, and do so in a nimble way. This way, their teams can only interact with what they need and aren’t stuck submitting tickets to HelpDesk every time they need access to various objects.

    What permissions does PagerDuty include?

    PagerDuty’s Custom Permissions enable capabilities around powerful security and access control. There are two different models within PagerDuty that maximize flexibility in how permissions can be granted and modified. First, admins and account owners can create custom roles for specific users, ensuring users are only granted the permissions they need. The second option enables admins and account owners to enforce permissions and visibility control at the team level as well, to improve efficiency and scalability when dealing with large groups of users. Organizations can exceed tight compliance requirements and exercise full control and management over user access and level of interaction with individual objects. There are three fixed roles in PagerDuty that cannot be granted any additional access on top of their existing privileges:
    • Account Owner — Full access to create, update, and delete objects, including a user’s permissions. This access cannot be restricted. Can also access the Billing page.
    • Global Admin — Full access to create, update, and delete objects, including a user’s permissions. This access cannot be restricted.
    • Stakeholder — Can view objects, but cannot make any modifications. Cannot be given Additional Permissions.
    Any user that is not an Account Owner, Global Admin, or Stakeholder is assigned one of the following base (flexible) roles, on top of which they are granted the level of access they need to specific objects. A single user can have multiple roles to define the level of access they get to different objects in PagerDuty. For instance, an individual can have Manager access for objects required for a team they manage, but have Responder or Observer access to other services, escalation policies, etc.
    • Restricted Access — Users can't see anything in the account until they're added to a Team and assigned a Team role.
    • Manager — Full access to create, update, and delete objects and all of their configuration.
    • Responder — Can take action on incidents, create overrides, and set maintenance windows.
    • Team Responder — For objects belonging to their teams, able to take action on incidents, create overrides, and set maintenance windows.
    • Observer — Can view objects, but cannot make any modifications.
    » Learn more about permissions in PagerDuty at our Support Knowledge Base here.
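
The layering of a base role with team-specific roles can be modeled roughly as below. This is a sketch under stated assumptions, not PagerDuty's implementation: the action names, the `ROLE_ACTIONS` table, and the `can` function are hypothetical, and the sketch assumes that a team-level role, when present, takes the place of the base role for that team's objects. The product's actual precedence rules may differ.

```python
from typing import Dict, Optional, Set

# Simplified, hypothetical action sets loosely following the role descriptions above.
ROLE_ACTIONS: Dict[str, Set[str]] = {
    "restricted_access": set(),
    "observer": {"view"},
    "responder": {"view", "act_on_incident", "create_override", "set_maintenance_window"},
    "manager": {"view", "act_on_incident", "create_override", "set_maintenance_window",
                "create", "update", "delete"},
}


def can(base_role: str, team_roles: Dict[str, str], object_team: Optional[str], action: str) -> bool:
    """Check an action against the team role if one exists, else the base role."""
    role = team_roles.get(object_team, base_role) if object_team else base_role
    return action in ROLE_ACTIONS[role]


# Hypothetical user: Observer by default, Manager on the payments team's objects.
team_roles = {"payments": "manager"}
```

This mirrors the example in the text: the same person can manage objects for one team while only observing everything else.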

    Learn more about team and user-based access controls in incident management

    PagerDuty enables teams to streamline permissioning in incident management with scalable team-based permissions and visibility control, highly granular user permissions, simple user association between PagerDuty and other tools such as ChatOps, and more. Try it out now for yourself with a free 14-day trial. We hope these resources enable you to optimize permissions in incident management so users and teams can administer independently while operating and taking action on issues effectively.  

    RefCard

    PagerDuty for Retail

    PagerDuty helps retailers:

    • Deliver better customer experiences
    • Modernize your operations
    • Understand and respond fast to customer sentiment
    • Drive fast resolution and prevent future issues

    Webinar

    On-Call Scheduling and You

    Easy scheduling and automated escalations to accelerate response times

    On-call schedules continue to be one of the most frequently asked about topics from our customers. In this session, we will address the distinction between on-call schedules and escalation policies — as well as how they work together. We’ll also review best practices on shift restrictions, hand-off times, and how to sync up rotations. Lastly, we’ll go over best practices for how to set up common on-call rotation schedules.

    In this webinar:

    • The difference between on-call schedules and escalation policies
    • Best practices for how to set up on-call schedules
    • Common on-call rotation schedules

    *Part of the “Succeed at PagerDuty” Webinar Series

    Webinar

    Approaching Service Groups

    Structure your microservices, applications, custom tools, or infrastructure components as they actually exist in your environment.

    Join us as we show you how to approach setting up your services in PagerDuty. This important capability enables you to establish workflows around business services instead of siloed tools. We look at some example use cases so you can walk away with best practices on setting up and successfully using services.

    In this webinar, we discuss:

    • How to approach service groups
    • Best practices for using service groups
    • Overcoming the challenges of siloed tools
    • Establishing workflows

    RefCard

    Solutions for Financial Services

    PagerDuty helps FinServ organizations:

    • Drive faster resolution and protect data with best practice
    • Uplevel workforce productivity
    • Continuously improve operations and system resiliency

    Webinar

    Introduction to Being an Incident Responder

    You’ll learn:

    • What is incident response?
    • The roles involved in incident response
    • How to incorporate learnings from previous incident responses
    • Skills for success

    Webinar

    Introduction to Being an Incident Commander

    Join us to learn:

    • What is an Incident Commander?
    • The role and responsibilities of an Incident Commander
    • Incident call procedures and terminology
    • Incident commander skills for success

    Article

    What is an Incident Post-Mortem?

    A post-mortem (or postmortem) is a process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place.

    As your systems scale and become more complex, failure is inevitable, assessment and remediation become more involved and time-consuming, and repeating the same mistakes becomes increasingly painful. Not having data when you need it is expensive. The good news is, most organizations do have some kind of a post-mortem process in place to assess what happened once a service has been restored. Arguably, any resolution of an issue isn’t truly complete until a team has fully documented and reflected on it. However, conducting a post-mortem can be a highly time-consuming task: teams often spend hours on each post-mortem trying to piece together the chronology of events from different sources of information. Streamlining the post-mortem process is key to helping your team get the most from their post-mortem time investment: spending less time conducting the post-mortem, while extracting more effective learnings, is a faster path to increased operational maturity. In fact, the true value of post-mortems comes from helping institutionalize a positive culture around frequent and iterative improvement. Organizations may refer to the post-mortem process in slightly different ways:
    • Learning Review
    • After-Action Review
    • Incident Review
    • Incident Report
    • Post-Incident Review
    • Root Cause Analysis (or RCA)

    Streamline the post-mortem process

    The specifics around conducting post-mortems vary from organization to organization. Regardless of the process, the primary purpose of post-mortems should be learning, whether it’s about the systems being managed, the process being followed, or how the organization executes during a crisis. Additional goals, including identification and implementation of system or process improvements, may be realized depending on the process followed. In general, an effective post-mortem report tells a story. Incident post-mortem reports should include the following:
    • A high-level summary of what happened
      Which services and customers were affected? How long and severe was the issue? Who was involved in the response? How did we ultimately fix the problem?
    • A root cause analysis
      What were the origins of failure? Why do we think this happened?
    • Steps taken to diagnose, assess, and resolve
      What actions were taken? Which were effective? Which were detrimental?
    • A timeline of significant activity
      Centralize key activities from chat conversations, incident details, and more.
    • Learnings and next steps
      What went well? What didn’t go well? How do we prevent this issue from happening again?
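
    As a minimal sketch of the "timeline of significant activity" item above, the following merges entries from several sources into one chronological list. The `TimelineEntry` type, the source labels, and the sample entries are all hypothetical and made up for illustration; they are not part of any PagerDuty product or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    source: str  # hypothetical label, e.g. "chat" or "incident_log"
    note: str


def build_timeline(*sources: list[TimelineEntry]) -> list[TimelineEntry]:
    """Merge entries from several sources into one chronological list."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry.timestamp)


# Made-up sample data, deliberately out of order.
chat = [
    TimelineEntry(datetime(2018, 1, 5, 14, 7), "chat", "Rollback proposed"),
    TimelineEntry(datetime(2018, 1, 5, 14, 2), "chat", "Checkout errors reported"),
]
incident_log = [
    TimelineEntry(datetime(2018, 1, 5, 14, 4), "incident_log", "Incident acknowledged"),
]

timeline = build_timeline(chat, incident_log)
for entry in timeline:
    print(entry.timestamp.isoformat(), entry.source, entry.note)
```

    Sorting by timestamp after merging keeps the sketch independent of how many sources feed the timeline, which is usually the tedious part of piecing a chronology together by hand.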

    Why do post-mortems?

    During incident response, the team is 100% focused on restoring service. They cannot, and should not, waste time and mental energy thinking about how to do something more optimally or performing a deep dive to figure out the root cause of an outage. That’s why post-mortems are essential, providing a peacetime opportunity to reflect once the issue is no longer impacting users’ experiences. The post-mortem process drives focus, instills a culture of learning, and identifies opportunities for improvement that otherwise would be completely lost. By forcing the team to explicitly dedicate time towards discussing and documenting lessons learned, while the incident is still fresh in their minds, the team is able to prioritize their focus on the right thing at the right time. The team does not sacrifice its ability to respond quickly in the midst of the fire, nor does it lose the opportunity to collaboratively understand how to improve its infrastructure and processes across every step of the response. Post-mortems matter because learning together establishes the right culture around failing forward, with iterative and continuous improvement.

    The blameless post-mortem

    A blameless post-mortem is critical for understanding failures by trying to understand how a mistake was made, instead of who made the mistake. “You ignore the ‘this person did that’ part,” explains PagerDuty Engineering Manager Arup Chakrabarti. “What matters most is the customer impact, and that’s what you focus on.” This is a crucial tool leveraged by many leading organizations such as Etsy, a pioneer for blameless post-mortems, for ensuring post-mortems have the right tone, empowering engineers to give truly objective accounts of what happened by eliminating the fear of punishment. Some make the argument that the blameless post-mortem might not seem possible because humans are hardwired for blame. They advocate “blame-aware” post-mortems in which teams acknowledge the instinct to blame, but focus their attention onto actionable takeaways instead. Whichever terminology resonates with your team, the key point is that post-mortem discussions should be safe spaces in which teams can be completely honest and oriented around improving for the future instead of blaming others for the past.

    Best practices and more

    PagerDuty offers a completely free post-mortem handbook that shares industry best practices and includes a post-mortem template. Use it to help you formalize your own post-mortem process to make it as easy as possible for your team to respond to issues. Even better, post-mortems are now part of the PagerDuty platform: sign up for a free 14-day trial and streamline the entire post-mortem process with automated timeline building, collaborative editing, actionable insights, and more.

    Article

    What is Incident Response?
    Incident response (IR) is a process used by ITOps, DevOps, and dev teams to address and manage any sort of major incident that may arise. The main goal of IT incident response is to organize an approach that limits damage and reduces recovery time and costs — and prevents it from happening again. Incident response generally includes an outline of processes that need to be executed upon in the event of an IT incident. An incident response process is something you hope to never need, but when you do, it’s critical that it encompasses all the steps necessary for the response to go smoothly and seamlessly. Normally the knowledge of how to handle incidents within your company or organization is built up over time and gets better with each incident. Many times, the knowledge of how to conduct thorough incident response is lost when a team member leaves, making it ever more crucial to have a documented process. Nailing your incident response and learning how to deal with major incidents in a way which leads to the fastest possible recovery time is vital to the success of any team. Generally, your incident response documentation will outline not only how to prepare for an incident, but what to do during and after an incident. It is intended to be used by on-call practitioners and those involved in an operational incident response process.

    Steps for successful incident response

    For successful incident response, you must not only have a holistic view into the health of your IT infrastructure, you also have to prepare your team to know how to respond and what roles they must take on. That preparation lets you orchestrate the right response, resolve incidents faster, and reduce your mean time to resolution (MTTR). By implementing different monitoring tools to appropriately monitor disparate and new systems, you can gain full-stack visibility into the health of your IT infrastructure. There needs to be a way to normalize, de-dupe, correlate, and gain actionable insights from all this data, so all the events generated by these monitoring tools must be centralized in a single hub, from which they can be triaged and routed to the right on-call engineer. Before all else, it’s crucial for your team to have established guidelines for what to do when a major incident occurs: incident response documentation that outlines a process for going on-call, what to do when an incident arises, how to communicate with teams, and what post-mortem process to follow after an incident. If you need help getting started with establishing your own incident response process, check out PagerDuty’s incident response documentation for guidance. All this sets the stage for being able to streamline the incident response process when an incident does occur. When a major incident does occur, be sure you:
    • Assess
      When a major incident does occur, assess the situation and call in the right stakeholders as needed. Collaborate with subject matter experts if need be, otherwise work with your incident commander, deputy, and customer liaison to assess the damage.
    • Resolve
      Once a plan of attack has been formulated, incident resolution begins. Determine what needs to be shared with the public, employees, and customers.
    • Learn
      Learn is arguably the most important step in the incident response process. It’s in the aftermath that your team is able to look and see what went well or what didn’t go so well, and what you can do to prevent things from happening again. Incident post-mortems are a great way for teams to continuously learn and serve as a way to iteratively improve your infrastructure and incident response process. Check out our incident post-mortem template and handbook to get started.

    Modern incident response lifecycle

    Organizations are investing in many monitoring solutions to get visibility into their IT infrastructure so they can better deliver on rising customer demands. Making sense of the event data and taking action by automating the incident response lifecycle for your environment, from assess to resolve to learn, is critical. Knowing what to do when a major incident does occur is vital to the success of your team and your organization. Learn more about incident response and the incident response lifecycle, which encompasses everything from assess, triage, and resolve to learning and prevention, to support developers as they move towards owning their code in production.
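
    To make "making sense of the event data" concrete, here is a minimal sketch of normalizing events from different monitoring tools, de-duplicating them, and routing each one to an on-call engineer. The field names, the `ON_CALL` mapping, the fallback queue name, and the sample events are hypothetical; this illustrates the idea only and is not how PagerDuty implements it.

```python
from __future__ import annotations

# Hypothetical mapping from service name to the engineer currently on call.
ON_CALL = {"checkout": "alice", "search": "bob"}


def normalize(raw: dict) -> dict:
    """Map tool-specific field names onto one common event shape."""
    return {
        "service": (raw.get("service") or raw.get("app") or "unknown").lower(),
        "summary": (raw.get("summary") or raw.get("message") or "").strip(),
    }


def dedupe(events: list[dict]) -> list[dict]:
    """Keep the first event seen for each (service, summary) pair."""
    seen = set()
    unique = []
    for event in events:
        key = (event["service"], event["summary"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def route(event: dict) -> str:
    """Pick the on-call engineer for an event, with a fallback queue."""
    return ON_CALL.get(event["service"], "triage-queue")


# Made-up events, as two different tools might report the same problem.
raw_events = [
    {"app": "Checkout", "message": "High error rate"},
    {"service": "checkout", "summary": "High error rate "},
    {"service": "search", "summary": "Slow queries"},
]

events = dedupe([normalize(raw) for raw in raw_events])
assignments = [(event["summary"], route(event)) for event in events]
print(assignments)
```

    Normalizing before de-duplicating matters here: the first two sample events only collapse into one because their field names, casing, and whitespace have already been brought to a common shape.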

    Ebook

    5 Ways to Empower Developers

    Discover five ways to make going on-call seamless so you can empower your teams to spend more time on coding and delivering higher-quality services instead of fixing issues.

    In this ebook, we discuss:

    • How the developer’s role is changing
    • Top three benefits of developers owning their code in production
    • Five ways to empower your team to do their best work

    Webinar

    Incident Resolution Lifecycle for Modern Ops

    You’ll discover how to:

    • Optimize your ITSM toolsets by integrating people, data, and processes
    • Maximize cross-functional transparency and consistency
    • Prioritize incidents with well-defined rules
    • Automate troubleshooting and remediation
    • Ensure continuous learning and improvements across your team

    Report

    State of Digital Operations 2017

    In this report:

    • How organizations are solving for digital disruption challenges with DevOps and monitoring
    • What is the business impact of digital downtime
    • Why 55.9% of teams see increased complexity resulting in more cognitive load as a top operations challenge
    • Much more!

    RefCard

    What is PagerDuty?

    Discover how PagerDuty enables you to:

    • Meet critical application and performance demands for IT and business users
    • Centralize, triage, and resolve alerts across systems and applications to maximize uptime
    • Reduce resolution times by >50% and save millions of dollars by orchestrating business and technical response

    RefCard

    PagerDuty for Developers

    PagerDuty helps developers:

    • Fully own their services
    • Customize and optimize workflows
    • Reduce alert volume

    Ebook

    9 Steps to Owning Your Code

    Discover 9 steps to owning your code, including:

    • Understanding your services and your customers’ experiences
    • Recruiting the right people quickly
    • Working your way and with your favorite tools
    • Building more resilient services and much more!

    Infographic

    The Cost of Digital Downtime

    Discover just how much impact downtime can have throughout a digital consumer’s day across different industries and experiences, including:

    • Mobile Apps
    • Financial
    • Enterprise Applications
    • Retail
    • Transportation

    Ebook

    The On-Call Survival Guide

    In this On-Call Survival Guide:

    • What is “On-Call”?
    • On-call responsibilities (and not responsibilities)
    • On-call etiquette to follow
    • Recommendations for starting your own on-call rotations

    This guide provides a look at how PagerDuty prepares its team members for going on-call. It’s our hope that you’ll use this as a starting point to formalize your own processes.

    Webinar

    A Developer’s Guide to Managing Your Code

    We share strategies to help you:

    • Minimize the number of interruptions and the amount of redundant work
    • Respond to incidents your way
    • Reduce noise so you can focus on what matters
    • Automate your incident resolution workflows
    • Leverage data to identify patterns to prevent future issues and write more production-ready code

    Learn how PagerDuty helps you spend your time building rather than fixing and significantly improve your work-life balance.

    RefCard

    Operational Intelligence Across Your AWS Environment

    Discover how you can:

    • Accelerate migration to AWS
    • Gain full-stack monitoring visibility into your AWS environments
    • Configure real-time response workflows and orchestration

    Ebook

    The PagerDuty Post-Mortem Template

    In this template:

    • A single place for all your post-mortem details, including overview, root cause, impact, timeline, action items, messaging and more!
    • Streamline your post-mortems
    • Easy-to-fill-in PDF format

    Ebook

    The PagerDuty Post-Mortem Handbook

    Download this guide and learn:

    • The importance of responder roles
    • What goes into creating a post-mortem page
    • How to conduct a post-mortem meeting

    This eBook also includes a free post-mortem template!

    RefCard

    Dynatrace Integration: Full-Stack Monitoring with AI Power

    What you get with the certified Dynatrace integration:

    • Easy point-and-click setup
    • Forward Dynatrace events to PagerDuty via our Events API
    • Enable Dynatrace to trigger and resolve incidents in PagerDuty with seamless sync
    • Deliver Dynatrace notifications to the correct resource via PagerDuty
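
    For readers curious what an event sent through the Events API looks like, here is a minimal sketch that builds trigger and resolve event bodies. It assumes the Events API v2 JSON shape (`routing_key`, `event_action`, `dedup_key`, and a `payload` with `summary`, `source`, and `severity`); the helper names, the placeholder key, and the sample values are hypothetical, nothing is sent over the network, and the details should be checked against the current API reference.

```python
import json

# Severity values assumed from the Events API v2 reference.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity, dedup_key):
    """Build the JSON body for a 'trigger' event (hypothetical helper)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def build_resolve_event(routing_key, dedup_key):
    """Build the JSON body that resolves a previously triggered event."""
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        "dedup_key": dedup_key,
    }


# Placeholder values; a real integration key comes from the service settings.
trigger = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="Response time degradation on checkout",
    source="monitoring-tool",
    severity="error",
    dedup_key="checkout-response-time",
)
resolve = build_resolve_event("YOUR_INTEGRATION_KEY", "checkout-response-time")

print(json.dumps(trigger, indent=2))
print(json.dumps(resolve, indent=2))
```

    Reusing the same `dedup_key` in both bodies is what ties the resolve event back to the incident opened by the trigger, which is how a monitoring tool can both open and close an incident.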

    Webinar

    How to Focus On What Matters With Triage and Noise Reduction

    In this webinar, discover how to:

    • Gain full control and programmatically manage your event data at scale
    • Suppress non-actionable alerts to reduce noise
    • Optimize resolution by consolidating related alert context into a single incident
    • Take advantage of bulk actions and eliminate manual tasks
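
    The two middle points above, suppressing non-actionable alerts and consolidating related alerts into a single incident, can be sketched as follows. The suppression rule (dropping `info` severity), the grouping key (service name), and the sample alerts are assumptions chosen for illustration; this is not PagerDuty's actual grouping logic.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical rule: alerts with these severities are treated as non-actionable.
SUPPRESSED_SEVERITIES = {"info"}


def suppress(alerts: list[dict]) -> list[dict]:
    """Drop alerts whose severity marks them as non-actionable."""
    return [a for a in alerts if a["severity"] not in SUPPRESSED_SEVERITIES]


def group_into_incidents(alerts: list[dict]) -> dict[str, list[str]]:
    """Consolidate alerts that share a service into one incident each."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["service"]].append(alert["summary"])
    return dict(incidents)


# Made-up sample alerts.
alerts = [
    {"service": "db", "severity": "critical", "summary": "Replica lag"},
    {"service": "db", "severity": "info", "summary": "Backup finished"},
    {"service": "db", "severity": "error", "summary": "Connection errors"},
    {"service": "web", "severity": "warning", "summary": "Latency up"},
]

incidents = group_into_incidents(suppress(alerts))
print(incidents)
```

    Four alerts become two incidents in this example: the informational alert is dropped, and the two remaining database alerts land in the same incident so a responder sees their context together.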

    Webinar

    Where is the Modern-Day Post-Mortem?

    Incident post-mortems are an invaluable part of the IT operations toolkit. Learning from past incidents to improve future detection and remediation efficiencies is crucial, as it prevents you from making the same mistakes over and over again. Yet, not every team performs a post-mortem. And while technology advances continue to propel the ops industry, the content and process of creating post-mortems has remained mostly stagnant.

    In an era of anomaly detection and auto-healing monitoring systems, where is the modern day post-mortem?

    In this webinar, we’ll review common barriers to teams creating and leveraging post-mortems, and explore the concept of a modern day post-mortem — one that is rooted in technology and helps improve team efficiencies during the most crucial moments.

    Webinar

    The Journey of Chaos Engineering

    You already know Netflix’s Chaos Engineering and the Simian Army – tools that enable engineers to test the reliability, security, resiliency and recoverability of the cloud services. You might have also heard of Google’s legendary Disaster Recovery Testing (DiRT) exercises that find vulnerabilities in critical systems and business processes by intentionally causing failures in them.

    Enter Twilio. Discover the story of how Twilio got started with their own version of chaos engineering. They’ll share lessons learned along with the impact it had on their engineering culture.

    This session is part of the PagerDuty Virtual Summit.

    Webinar

    Data-Driven Service Intelligence

    Modern IT environments require a completely new approach to monitoring—an approach that puts services at the core. They require an approach that is driven based on observed data, metrics, events, and logs, to tell the entire story of service health.

    Yet monitoring still largely looks the same as it did 15 years ago, where you configure a separate system of checks and thresholds, and attempt to separate signal from noise in the flood of resulting alerts.

    How do you implement a monitoring approach with data at the core? Register for this webinar and learn firsthand from organizations that have successfully done so.

    Webinar

    Large-Scale Agile/DevOps Transformation

    Digital transformation is something that every organization will go through, but it can be easier for some than others.

    Join Sean Reilley, VP for Agile and Talent at IBM, as he discusses what it takes to move from a legacy environment to a DevOps mindset and undergo a large-scale agile transformation.

    In this webinar, you’ll learn:

    • What to do if you find yourself stuck with a legacy environment
    • How to transform so your organization doesn’t get left behind
    • How to empower your team to adopt a DevOps mindset

    Webinar

    Reach On-Call Teams Faster with Live Call Routing

    Prevent incidents from becoming business-impacting by notifying the on-call team immediately. With Live Call Routing, anyone can reach your on-call teams in real-time to report incidents simply by calling a phone number. Teams can ensure incidents are received and resolved faster.

    Join us as we showcase configuration and real product examples of how PagerDuty’s Live Call Routing capability is enabling improved visibility and response times for customers.

    Discover how Live Call Routing provides:

    • Automatic call forwarding via schedules and escalations
    • Triggered incidents with just a phone call
    • Phone tree to reach specific teams
    • Global numbers

    Webinar

    Full-Stack Anomaly Detection and Response Orchestration

    Microservices architectures have unleashed unprecedented amounts of application data on organizations. More often than not, there’s no way to correlate data coming from siloed tools that look at only a single part of critical apps or infrastructure, making it difficult to understand the overall health of the digital business, diagnose the root cause when service disruptions occur, and coordinate a response in real time.

    With PagerDuty’s Operations Command Console, you can visualize the health of applications, services, and infrastructure while managing incident response workflows all in one place to easily mobilize, coordinate, and orchestrate both technical and business response to incidents.

    Discover how the Operations Command Console provides:

    • A single view for full-stack event intelligence and response workflows
    • Interactive and customizable applications for actionable insights
    • Shared context between infrastructure, service health, incidents, and response
    • Pattern and anomaly detection across all your data sources

    Webinar

    Streamline Critical Communications With Stakeholder Engagement

    When a major incident occurs, its impact is felt across the organization. While the technical response is underway, stakeholders from all areas of the business—including public relations, support, legal, executives, and more—must all be engaged and kept informed so they can immediately respond and minimize the overall business impact.

    With Stakeholder Engagement, you can streamline the process of identifying and notifying key business stakeholders and maintaining communication during a major IT incident.

    Discover how Stakeholder Engagement provides:

    • A single source of truth for critical, real-time updates
    • Improved alignment between IT and the business during incidents
    • Automation of communication tasks
    • Streamlined post-mortems
    • New licensing to support business user needs

    RefCard

    Dynamic Incident Response: Real-Time Collaboration at Scale

    Dynamic Incident Response enables you to:

    • Notify the right people
    • Mobilize teams immediately
    • Access contextual information
    • Gain platform agnostic conferencing support

    RefCard

    Slack Integration: Seamless Incident Management Workflows

    With the certified Slack bi-directional extension:

    • View and customize PagerDuty incident details, urgencies and other updates that send to Slack
    • Receive sufficient incident information for ITOps, Developer, and DevOps teams to immediately acknowledge or resolve incidents directly in Slack with message buttons
    • Easily tie Slack to multiple PagerDuty services in just a few clicks

    Note: To use Slack’s slash commands to easily trigger new incidents in PagerDuty, see the Slack to PagerDuty Integration Guide

    RefCard

    Live Call Routing: Reach On-Call Responders Immediately

    With Live Call Routing:

    • Real-Time Conversations
    • Automatic Escalations
    • Provision Global Numbers
    • Trigger Incidents with a Call
    • Reach Specific Teams

    Webinar

    Predictions and Trends in DevOps

    As cloud adoption becomes an operating standard and containers, microservices, and machine learning gain traction, the complexity in application and infrastructure environments increases exponentially. To deal with the rising complexity of critical services, DevOps teams globally will be challenged to deliver better reliability and availability of the software they create and operate.

    Join PagerDuty as we host a discussion with leading organizations as they share trends and their predictions on what DevOps teams will need to succeed.

    The DevOps expert panel will discuss:

    • Is DevOps finally mainstream?
    • Can enterprise organizations adopt DevOps practices?
    • Will central operations teams move closer to the application codebase?
    • Does security become a part of the DevOps operational model?

    Whitepaper

    Overcoming Alert Fatigue in a Modern Ops Environment

    In this whitepaper you’ll discover:

    • What is alert fatigue and what causes it?
    • The cost of alert fatigue to your business
    • The impact of alert fatigue on your team
    • 4 steps to help you overcome alert fatigue and build a better modern ops environment

    Webinar

    Oracle Delivers Better Customer Experience with PagerDuty

    Discover how Oracle is using PagerDuty to:

    • Respond to incidents in real-time
    • Accelerate Mean Time To Identification (MTTI)
    • Support better postmortems
    • Get full-stack visibility into the health of their applications, services, and infrastructure

    This webinar showcases real product examples of how the Operations Command Console from PagerDuty is enabling Oracle to deliver better software.

    Case Study

    Nelnet Increases Uptime, Boosts Employee Productivity, and Reduces Costs with PagerDuty

    Read the case study to learn why Nelnet turned to PagerDuty to help minimize the challenges around scheduling, alerting, and on-call escalations, helping them increase uptime, boost employee productivity, and save $650,000 annually.

    “When we were evaluating PagerDuty, we found there weren’t other organizations that had such a complete product offering, or feature set, and they weren’t as easy to use,”

    — Ryan Regnier, IT Manager, Nelnet

    Case Study

    SendGrid Enhances Employee Productivity and Reduces Downtime with PagerDuty

    Before making the move to PagerDuty, SendGrid used a different vendor as their alerting tool, but when faced with scalability challenges they realized they needed a full-scale incident management solution in place to support their high volume of incidents. SendGrid decided to make the move to a more reliable and scalable incident management platform.

    Read the case study to learn why SendGrid made the move from a simple alerting tool to a full-scale incident management platform.

    “We have confidence in PagerDuty and no longer have to worry about unnecessarily long outages and revenue loss. Everyone uses PagerDuty and knows the solution as an established provider”

    —Mary Moore-Simmons, Engineering Operations, SendGrid

    Case Study

    Signal Sciences Addresses Security Anomalies Quickly, Keeping Customer Data Safe With PagerDuty

    Looking to overcome challenges around security incident management, as well as alerting and resolution, Signal Sciences set out to find a better solution that would help them overcome manual processes and reduce overall maintenance and effort.

    Signal Sciences needed instant visibility into incident status, on-call lists, and escalation options, along with the ability to recruit additional responders from any team.

    Read the case study to learn why Signal Sciences chose PagerDuty to help them become more proactive and improve operational reliability and agility.

    “PagerDuty helps us stay on top of our security posture and resolve security incidents faster and more consistently.”


    — Zane Lackey,  Chief Security Officer at Signal Sciences

    Infographic

    The Roadmap to Modern IT Operations Checklist

    You’ll learn the steps you need to take to:

    • Transform IT into being a strategic unit for driving business value and success
    • Apply a full lifecycle approach to facilitating changes to people, processes, policies and products while minimizing cost to the enterprise
    • Enable the continuous delivery of business value to the enterprise and its customers
    • Continually improve IT services

    Webinar

    ROI is the True Measurement of DevOps Success

    Digital transformation creates the need for DevOps organizations to undergo rapid changes and encourage the adoption of new ways of thinking, developing, and operating. Today, businesses are rapidly increasing their focus on quickly identifying ways to drive greater results.

    These challenges require new practices and methodologies to drive agility and efficiency among coexisting members of development and IT operations teams. This transformation directly impacts members within an organization and raises new questions: What is the impact of introducing DevOps practices on the business overall? How can ROI be measured to justify establishing a DevOps organization? What is considered a valuable return on investment? How does this compare to peer groups, competition, and industries at large?

    In this webinar you’ll learn:

    • Why DevOps matters to organizations embracing digital transformation today
    • How to make the right decisions that drive better business value across the organization
    • What are the different methods for measuring ROI
    • The impact of embracing DevOps on ROI and to members of the organization
    • Why enterprises shouldn’t be afraid of DevOps

    Ebook

    What is Data-Driven DevOps?

    In this ebook:

    • Benefits of a data-driven culture
    • Why DevOps works better when it’s backed by metrics
    • Important metrics to track
    • How to build a data-driven culture around incident response

    Report

    A Modern Operations Solution for Incident Management

    In this report, find out:

    • What makes incident management such a unique and important investment
    • What key features should an incident management solution have
    • How should you evaluate an incident management solution
    • What sets PagerDuty apart from the competition

    Webinar

    Organizing and Optimizing ITSM Toolsets

    IT Service Management (ITSM) is an approach for designing, delivering, managing and improving the way IT is used within an organization. To make that approach a reality, a core requirement is having the right strategic toolset for your unique organizational needs. But what are the right tools to choose to help you deliver optimal services and keep your applications and critical infrastructure available? How do you organize all the information these tools are feeding your organization every day?

    Learn what it takes to:

    • Consolidate multiple ITSM services into one hub
    • Support a 2-speed IT infrastructure and multiple ITSM processes within a single organization
    • Evaluate integrations and flexibility of potential toolsets

    Report

    Transform IT to Reach New Levels of Efficiency

    Read this report to learn about:

    • Challenges faced by organizations without an Agile IT Ops model
    • How PagerDuty supports the business goals of a large, global media conglomerate
    • EMA’s perspective of the PagerDuty incident management platform

    Report

    Best Practices for Monitoring: Reduce Outages and Downtime

    This guide will help you:

    • Ascertain what to monitor
    • Motivate your team to respond quickly
    • Avoid common monitoring mistakes

    Ebook

    Best Practices for Monitoring: Reduce Outages and Downtime

    This guide will help you:

    • Determine what to monitor
    • Motivate your team to respond quickly
    • Avoid common monitoring mistakes

    Ebook

    Best Practices in Outage Communication

    Not so easy, is it?

    Luckily, there are tools you can use and best practices you can follow to streamline this process. Putting a process in place well in advance for all sides of communication during an outage is crucial.

    In this guide, you’ll find best practices for communicating with:

    • Incident response team stakeholders
    • Key business stakeholders
    • Customers
    • External parties

    Infographic

    5 Ways to Augment Your NOC

    Discover 5 ways to improve your NOC performance:

    • Increase signal:noise ratio
    • Streamline content information
    • Improve efficiency and resolution; fewer phone calls
    • Better reporting and metrics
    • Automated escalation and communications

    Report

    Revamp Incident Management to Safeguard Business Success

    In a study commissioned by PagerDuty, Forrester Research shares four key findings:

    • Technology innovation and reliability are critical for business success, but IT is failing to deliver on business expectations
    • Incident resolution is tactical and reactive today, harming the business
    • IT executives and practitioners have different opinions on incident resolution needs and opportunities
    • Rapid, contextual notifications can help IT teams resolve incidents faster, and predictive analytics can help teams prevent problems in the future

    Infographic

    What Is the Total Economic Impact™ of PagerDuty?

    With 2000 employees and 45 million customers, downtime meant lost revenue to a global entertainment services company. Three years after implementing PagerDuty, the company has seen a 448% return on its investment.

    Download the infographic below to see the key business benefits, cost savings, and metrics the company achieved using PagerDuty’s digital operations management platform.

    To learn more about the study, see the full Forrester report.

    Podcast

    The Cloudcast Podcast

    Bringing Advanced Analytics to DevOps

    In this podcast, join Dave Hayes, PagerDuty’s Product Director, for a discussion around operations best practices, managing postmortems, and what his team learned through “Failure Friday.”

    Featuring: Aaron Delp and Brian Gracely from The Cloudcast, with David Hayes from PagerDuty

    Description: Brian talks with Dave Hayes (Lead Product Management at @pagerduty) about how analytics and embedded intelligence are helping ease the pain of Operations teams trying to keep up with errors, alerts, troubleshooting and outages. They also discuss operations best practices, how to manage post-mortems, and their internal culture of improvements through “Failure Friday”. Music Credit: Nine Inch Nails (www.nin.com)

    Date: October 24, 2014

    Infographic

    Learn How Downtime Impacts Your Business

    How can PagerDuty help?

    Designed for engineers by engineers, PagerDuty increases the reliability of your operations by connecting people, systems, and data for visibility and actionable intelligence. Our platform makes it easy to manage events through the entire lifecycle, resulting in decreased resolution times and improved on-call quality of life.

    Check out our infographic to see how downtime can affect your business.

    Infographic

    Don’t Build Your Own Operations Performance Platform

    How can PagerDuty help?

    Designed for engineers by engineers, PagerDuty increases the reliability of your operations by connecting people, systems, and data for visibility and actionable intelligence. Our platform makes it easy to manage events through the entire lifecycle, resulting in decreased resolution times and improved on-call quality of life.