App down
Incident Report for Wisembly
Postmortem

2015-06-15 App down incident

What happened

Today at 4:46:50 pm GMT+1, our application went down for 11 minutes.
Our team was alerted a few minutes later by automated probes and error logs.

We logged into our servers to investigate. We found that our MySQL process was consuming a large amount of RAM, threatening the overall system memory and preventing the Redis backup daemon from performing its regular backup snapshots.
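
For illustration, a minimal probe in the spirit of the alerts that fired could look like the sketch below. It is a generic Python example using the psutil and redis packages; the thresholds, process name, and connection details are assumptions, not our production configuration.

    import time
    import psutil   # third-party: pip install psutil
    import redis    # third-party: pip install redis

    # Assumed thresholds and connection details, for illustration only.
    MYSQL_PROCESS_NAME = "mysqld"
    RAM_ALERT_PERCENT = 90              # alert above 90% system RAM usage
    MAX_SNAPSHOT_AGE_SECONDS = 15 * 60  # alert if no RDB snapshot for 15 minutes

    def check_memory():
        vm = psutil.virtual_memory()
        mysql_rss = sum(
            p.info["memory_info"].rss
            for p in psutil.process_iter(["name", "memory_info"])
            if p.info["name"] == MYSQL_PROCESS_NAME and p.info["memory_info"]
        )
        if vm.percent > RAM_ALERT_PERCENT:
            print("ALERT: system RAM at %.0f%% (mysqld using %.1f GiB)"
                  % (vm.percent, mysql_rss / 2**30))

    def check_redis_snapshots():
        r = redis.Redis(host="localhost", port=6379)
        last_save = r.lastsave()  # datetime of the last successful RDB save
        age = time.time() - last_save.timestamp()
        if age > MAX_SNAPSHOT_AGE_SECONDS:
            print("ALERT: last Redis RDB snapshot is %.0f minutes old" % (age / 60))

    if __name__ == "__main__":
        check_memory()
        check_redis_snapshots()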

This affected some Redis-backed API endpoints, which in turn made our PHP-FPM processes go wild.
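
We did not capture the exact errors returned by Redis, but one plausible mechanism is its default stop-writes-on-bgsave-error setting: once background snapshots start failing, Redis rejects write commands with a MISCONF error, so any endpoint writing to Redis starts erroring or retrying and ties up PHP-FPM workers. The sketch below shows the idea in Python for readability (our API layer is PHP); the helper name and key layout are made up for the example.

    import redis  # third-party: pip install redis

    r = redis.Redis(host="localhost", port=6379)

    def cache_vote(session_id, payload):
        # Hypothetical helper, not taken from our code base.
        try:
            r.setex("vote:%s" % session_id, 300, payload)
        except redis.exceptions.ResponseError as exc:
            # With stop-writes-on-bgsave-error enabled (the Redis default),
            # failed RDB snapshots make Redis answer writes with "MISCONF ...".
            # Failing fast keeps the worker from hanging on retries.
            if "MISCONF" in str(exc):
                return False
            raise
        return True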

What we did

We restarted the MySQL process to clear its buffer cache in RAM and free up some room for other processes, especially the automated Redis backups. That was not sufficient, so we also had to restart PHP-FPM to cool things down a bit. Once done, all systems went green again.
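
For reference, the recovery itself boiled down to restarting those two services; a rough sketch (service names assumed for a 2015-era Debian-style host, not copied from our runbook):

    import subprocess

    # Assumed SysV-style service names; adjust to the actual hosts.
    for service in ["mysql", "php5-fpm"]:
        # Restart each service and fail loudly if the init script returns non-zero.
        subprocess.run(["service", service, "restart"], check=True)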

How to avoid that in the future

We reviewed our server configuration with our backend team and hosting provider, which clearly showed that the RAM allowed to MySQL was a bit on the edge and could lead to exactly what happened. We therefore reduced the allowed buffer size by 75%, leaving us enough room to cache everything we need while keeping enough space for all the other processes on the backend servers. We restarted MySQL once more so the new configuration takes effect, and we will closely monitor its performance in the near future.
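
To keep an eye on this, a small check along the lines of the sketch below can compare the configured buffer size against total system RAM. It assumes the setting in question is innodb_buffer_pool_size and uses the PyMySQL and psutil packages with made-up credentials and a made-up warning threshold; it is a sketch, not our actual monitoring.

    import psutil   # third-party: pip install psutil
    import pymysql  # third-party: pip install pymysql

    # Assumed credentials, for illustration only.
    conn = pymysql.connect(host="localhost", user="monitor", password="secret")
    with conn.cursor() as cursor:
        cursor.execute("SHOW VARIABLES LIKE 'innodb_buffer_pool_size'")
        _, value = cursor.fetchone()
        buffer_pool_bytes = int(value)
    conn.close()

    total_ram = psutil.virtual_memory().total
    share = buffer_pool_bytes / float(total_ram)
    print("InnoDB buffer pool: %.1f GiB (%.0f%% of %.1f GiB RAM)"
          % (buffer_pool_bytes / 2**30, share * 100, total_ram / 2**30))
    if share > 0.5:  # assumed threshold: warn past half of system RAM
        print("WARNING: buffer pool is allowed more than half of system RAM")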

Posted Jun 15, 2015 - 18:36 CEST

Resolved
All systems are up again. We'll shortly write a post-mortem explaining what happened, what we did, and how to prevent it from happening again.
Posted Jun 15, 2015 - 17:29 CEST
Identified
The application is up now; we restarted the crippled processes. We're investigating further to understand what happened and will keep you informed. Sorry for the trouble.
Posted Jun 15, 2015 - 16:49 CEST
Investigating
We currently have a major app outage. We are investigating and doing our best to bring things back up again.
Posted Jun 15, 2015 - 16:45 CEST