


“Sir, your application is continually getting throttled,” I repeated.

The highly skilled team that I was brought in to help with an outage was in disbelief. They had been using the same limits configuration in production for over two years. Yet, the Grafana chart was definitive: CPU throttling was causing the outage they were currently experiencing.

They asked what anyone in the same position should ask: How could this have happened? We followed the best practice of setting requests and limits to the same value. Is that wrong? Should we just take off limits? What values should we be setting?

I wanted to be able to address their questions as clearly as I could and show them what the problem was. This situation led me on a journey deep into the heart of Linux and Kubernetes performance. And today, I’d like to share part of that journey with you, in the hope you might avoid such a situation.
