Today at 04:46:50 PM GMT+1, our application went down for 11 minutes.
Our team was alerted a few minutes later by automated probes and error logs.
We logged into our servers to investigate. We found that our MySQL process was consuming an excessive amount of RAM, putting overall system memory under pressure and preventing the REDIS backup daemon from performing its regular backup snapshots, since the snapshot process needs free memory to run alongside the live instance.
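To illustrate the failure mode, a minimal check along the lines of the sketch below could have flagged the shrinking headroom earlier. This is an illustration only, not part of our actual tooling: it assumes a Linux host, a local REDIS instance reachable through redis-cli, and a hypothetical 2 GiB safety margin.

```python
#!/usr/bin/env python3
"""Sketch: warn when the free RAM left for the Redis snapshot fork gets too small.
The paths, the use of redis-cli, and the margin are illustrative assumptions."""

import subprocess
import sys

SAFETY_MARGIN_BYTES = 2 * 1024**3  # hypothetical 2 GiB margin on top of the Redis dataset


def available_memory_bytes() -> int:
    """Read MemAvailable from /proc/meminfo (Linux only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024  # value is reported in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")


def redis_used_memory_bytes() -> int:
    """Ask the local Redis instance how large its dataset currently is."""
    info = subprocess.run(["redis-cli", "INFO", "memory"],
                          capture_output=True, text=True, check=True).stdout
    for line in info.splitlines():
        if line.startswith("used_memory:"):
            return int(line.split(":")[1].strip())
    raise RuntimeError("used_memory not found in redis-cli output")


if __name__ == "__main__":
    # Worst case, a write-heavy snapshot can need roughly the whole dataset again
    # (copy-on-write), so we want available RAM comfortably above used_memory.
    headroom = available_memory_bytes() - redis_used_memory_bytes()
    if headroom < SAFETY_MARGIN_BYTES:
        print(f"WARNING: only {headroom // 1024**2} MiB of headroom left for the snapshot fork")
        sys.exit(1)
    print(f"OK: {headroom // 1024**2} MiB of headroom")
```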
This affected several REDIS-backed API endpoints, which in turn caused our PHP-FPM processes to go wild.
We restarted the MySQL process to clear its buffer cache in RAM and free up room for the other processes, especially the REDIS automated backups. That alone was not sufficient: we also had to restart PHP-FPM to cool things down. Once that was done, the whole system went green again.
We reviewed our server configuration with our backend team and hosting provider, and the review clearly showed that the RAM allowance for MySQL left too little headroom and could lead to exactly what happened. We therefore reduced the allowed buffer size by 75%, which still gives us enough room to cache everything we need while leaving enough space for all the other processes on the backend servers. We restarted MySQL again so the new configuration takes effect, and we will be closely monitoring its performance in the near future.
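To make the sizing reasoning concrete, the sketch below walks through that kind of back-of-the-envelope arithmetic. The figures are illustrative assumptions rather than our actual server specs, and the setting involved is assumed to be MySQL's innodb_buffer_pool_size, since the exact parameter is not named above.

```python
#!/usr/bin/env python3
"""Sketch of the sizing arithmetic: reserve RAM for everything that is not MySQL,
then give the InnoDB buffer pool whatever is left. All numbers are hypothetical."""

GIB = 1024**3

total_ram           = 64 * GIB  # hypothetical machine size
redis_dataset       = 8 * GIB   # current Redis footprint
redis_bgsave_margin = 8 * GIB   # worst case: the snapshot fork duplicates the dataset under write load
php_fpm_pool        = 6 * GIB   # e.g. ~120 workers at ~50 MiB each
os_and_misc         = 4 * GIB   # kernel, page cache, cron jobs, monitoring agents

reserved = redis_dataset + redis_bgsave_margin + php_fpm_pool + os_and_misc
innodb_buffer_pool = total_ram - reserved

print(f"innodb_buffer_pool_size ~= {innodb_buffer_pool / GIB:.0f}G "
      f"({innodb_buffer_pool / total_ram:.0%} of total RAM)")
```

The point of the exercise is simply that the buffer pool gets sized from what remains after every other tenant of the box is accounted for, rather than the other way around.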