Partial 15 Minute Outage & Task Delays/Loss
Incident Report for Zapier
Resolved
Between 21:54 and 22:09 UTC on Sunday, December 18th (a 15-minute window approximately 6 hours before this post), we experienced a network partition in our RabbitMQ cluster that resulted in some lost or delayed tasks. The details are as follows:

Any non-instant (polling) Zaps with tasks triggered in that 15-minute window likely experienced up to a 6-hour delay as we recovered pending tasks from logs. No tasks were lost in this case. Most Zaps fall under this category.

Any instant Zaps with tasks triggered in that 15-minute window either experienced a delay of up to 60 minutes (and possibly a duplicate task) OR lost those tasks. If your webhooks were delivered to the newer, safer hooks.zapier.com endpoint, they were simply delayed; however, if they were delivered to the legacy zapier.com endpoint, they were lost. Any instant Zaps turned on within the last 3-6 months use the newer, safer hooks.zapier.com endpoint; instant Zaps turned on before then are likely still on the legacy zapier.com endpoint.

We've been working to move all instant Zaps over to the newer, safer hooks.zapier.com endpoint -- you can force a Zap over by turning it off and on again and (if applicable) updating any webhook URLs pointing to Zapier in your connected apps.

In the last 24 hours, approximately 1% of all tasks were delayed a significant amount of time (more than 15 minutes), and far less than 1% of webhook tasks were lost. To help ensure this doesn't happen again, we will be taking the following actions as a direct response:

1. Correct suboptimal configurations and assumptions in our RabbitMQ failover strategy.
2. Hasten the already in-progress migration from legacy zapier.com webhook endpoints to the newer, safer hooks.zapier.com endpoint.
3. Correct a bug that improperly signaled failure to partners when a webhook had been properly journaled to disk but not yet published to RabbitMQ (this risked duplicate tasks in a few cases).
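To illustrate the bug described in item 3, here is a minimal, hypothetical sketch of the intended receive path (all names here are illustrative, not Zapier's actual code): the webhook payload is journaled durably to disk first, and the endpoint signals success to the partner as long as that journal write succeeds, even if the RabbitMQ publish fails. Signaling failure after a successful journal write would prompt the partner to retry and risk a duplicate task.

```python
import json
import os
import tempfile


def receive_webhook(payload, journal_dir, publish_to_queue):
    """Journal the payload to disk, then attempt a best-effort queue publish.

    Returns True (success to the partner) whenever the payload is safely
    journaled, even if the RabbitMQ publish fails -- the journaled copy can
    be replayed later by a recovery worker.
    """
    # 1. Durably journal the payload before anything else.
    fd, path = tempfile.mkstemp(dir=journal_dir, suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())

    # 2. Best-effort publish; a failure here is recoverable from the journal,
    #    so it must NOT be reported to the partner as a failed delivery.
    try:
        publish_to_queue(payload)
    except Exception:
        pass  # a recovery worker replays journaled payloads later

    return True  # journaled => success, regardless of queue state
```

The key design point is that the partner-facing response reflects only the durable journal write, decoupling delivery acknowledgment from queue availability.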

We apologize for the interruption. We've been very proud of our reliability and transparency improvements throughout 2016, and we're going to do our best to ensure this never happens again.
Posted Dec 18, 2016 - 20:12 PST