How we prevented going down while appearing on Dragons’ Den

Paul van Eijden
4 min read · Apr 14, 2021

Imagine your startup appearing on Dragons’ Den. Millions of people watching. Now imagine a lot of those people going to your website. And not only that: they want to order your product as well…

When the founders of Viteezy told me that they would appear on Dragons’ Den, they also told me that almost all websites go down right after appearing on the popular TV show. Since I’m the CTO, everyone was looking at me :) So challenge accepted!

Our startup Viteezy featured on Dragons’ Den

The first thing we did was establish a baseline: how many concurrent users can our platform handle comfortably? We used a load testing tool called Locust, named after the swarming grasshopper that farmers fear. Of course, you don’t want to pollute your production environment with test data, so we made sure to have an exact copy of that environment, which we called staging.

For our infra we use Kubernetes, so making an exact copy is a matter of creating a separate namespace for staging and targeting the deployment at that namespace.
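In practice that’s little more than `kubectl create namespace staging` and deploying your existing manifests into it. As a sketch, here’s the same step with the official Kubernetes Python client (the namespace name is just a placeholder):

```python
# Minimal sketch: create a "staging" namespace with the official
# kubernetes Python client (equivalent to `kubectl create namespace staging`).
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig
v1 = client.CoreV1Api()

v1.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="staging"))
)
# Your CI/CD then deploys the same manifests with `--namespace staging`.
```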

If you use a hosted database (like RDS on AWS), make sure to point your staging environment at a separate staging database that is sized the same as your production database, so your baseline is genuine.

Writing a script for Locust is well documented, and we made sure it covered every API call on our platform. We didn’t load test our static HTML: if you use a proper CDN (like CloudFront or Cloudflare), it has that covered for you, so you shouldn’t worry about it. APIs perform business logic (like putting something in a cart) that writes to your database, and that is usually the part where platforms go down.
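To give an idea, a minimal locustfile could look like the sketch below. The endpoints and payloads are made up for illustration; yours will obviously be different.

```python
# locustfile.py -- minimal sketch; the endpoints and payloads below are
# invented for illustration, replace them with your platform's API calls.
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    # each simulated user waits 1-5 seconds between tasks
    wait_time = between(1, 5)

    @task(3)
    def browse_products(self):
        self.client.get("/api/products")

    @task(1)
    def add_to_cart(self):
        self.client.post("/api/cart", json={"productId": 42, "quantity": 1})
```

You run this with `locust -f locustfile.py --host https://staging.example.com` and dial the number of simulated users up from Locust’s web UI until things start to break.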

Once the baseline was established, we could benchmark every change against it. We deployed changes to staging while the load test was running, so we could immediately see whether they were in fact improvements!

The first thing we noticed was that our backends ran out of database connections after the load test had been running for a while. Our backend was scaled to 4 pods, each configured with a pool of 100 database connections, so 400 connections in total.

RDS caps the number of allowed connections based on the memory of the RDS instance: it divides the instance memory by 25165760, so with 8 GB of memory you get 341 max_connections. At some point our backends wanted to use all 400 of their database connections, which they couldn’t because of this limit.

You can increase the maximum number of allowed connections on RDS by buying a bigger instance with more memory, or by overriding max_connections in a custom RDS parameter group. You can also limit the backends to a smaller database connection pool.
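Which option fits depends on your stack. As an illustration of the last one (not our literal backend code), capping the pool with SQLAlchemy looks roughly like this; the point is to keep pool size × number of pods below the RDS limit.

```python
# Sketch: capping the connection pool per backend instance so that
# (pool size x number of pods) stays below the RDS max_connections limit.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://user:password@staging-db.example.com/viteezy",  # placeholder URL
    pool_size=50,        # 4 pods x 50 = 200 connections, well under the 341 cap
    max_overflow=10,     # a few extra connections allowed during short bursts
    pool_timeout=30,     # seconds to wait for a free connection before erroring
    pool_pre_ping=True,  # drop dead connections instead of handing them out
)
```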

We tried all variants until we saw an improvement. I’d recommend buying bigger hardware only as a last resort, because usually that’s not where the problem is.

It turned out we had implemented a health check that verified the database connection was up, but it didn’t close that connection properly. In production this wasn’t a problem, because after a minute the connection timed out and closed itself. Under serious load, however, we used up all 100 available connections within that minute.
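The fix boiled down to always handing the connection back. As a sketch (again SQLAlchemy-flavoured, not our literal code):

```python
# Sketch of the fixed health check: borrow a connection, run a trivial
# query, and always return the connection, even if the check fails.
from sqlalchemy import text

def database_is_healthy(engine) -> bool:
    try:
        # "with" guarantees the connection goes back to the pool,
        # which is exactly what our original health check forgot to do
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return True
    except Exception:
        return False
```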

Some problems only surface under load; that’s why load testing your application is important (even when you’re not featured on a TV show).

The next problem we noticed was deadlocks while writing transactions to our database. A deadlock can occur when two or more processes simultaneously try to write to the same database table and cannot acquire a lock within a certain timeout period. We fixed this by implementing a delayed retry mechanism around the offending transaction. Your problems, as well as your solutions, will most probably be different.
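A sketch of what such a delayed retry looks like (the names and numbers are illustrative; in real code you’d catch your driver’s specific deadlock error rather than a bare Exception):

```python
import random
import time

def run_with_retry(transaction, max_attempts=3, base_delay=0.2):
    """Run a transactional function, retrying with a small jittered
    delay when it fails, e.g. because of a deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except Exception:
            if attempt == max_attempts:
                raise
            # wait a bit (with jitter) so the competing writers spread out
            time.sleep(base_delay * attempt + random.uniform(0, 0.1))
```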

Before buying a bigger database instance, we checked whether it was mostly CPU or memory that was being used. For memory, aim for the entire database plus indexes to fit in memory; if that’s not possible, at least aim for all the indexes to fit. We ended up buying a setup 4x bigger than the load test required (during the load test we saw database CPU usage of around 60%). After one day we scaled it back down to the ‘normal’ size.
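How do you know whether everything fits? A quick sketch, assuming a MySQL-compatible RDS instance and the PyMySQL driver (on PostgreSQL you’d look at pg_total_relation_size instead):

```python
# Sketch: total data + index size for the current schema on a
# MySQL-compatible database, assuming the PyMySQL driver.
import pymysql

conn = pymysql.connect(host="staging-db.example.com",  # placeholder host
                       user="viteezy", password="***", database="viteezy")
with conn.cursor() as cur:
    cur.execute(
        "SELECT SUM(data_length + index_length) AS total_bytes, "
        "       SUM(index_length) AS index_bytes "
        "FROM information_schema.tables WHERE table_schema = DATABASE()"
    )
    total_bytes, index_bytes = cur.fetchone()
print(f"data + indexes: {total_bytes / 1024**3:.1f} GB, "
      f"indexes alone: {index_bytes / 1024**3:.1f} GB")
conn.close()
```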

The last thing we did was increase the number of Locust workers for the load test. We ended up using as much hardware for the load test as we currently use for our production setup.

Viteezy peaking at 16,000 transactions per minute

It’s 9:30 pm and the entire Viteezy team is watching TV intently. On another screen we have our monitoring system in view. Customer service is ready to handle the many expected chats. Our startup is the last one to pitch. Nailed it. And our platform? Nailed it too!

