Growth Experiments Prioritization

Guilherme Lacerda
Jan 24, 2022 · 6 min read

A few days ago I got a message from a PM colleague on Slack, right after I presented the results of our latest Experiment (an iteration on a home-screen widget for new users). He wanted to know what my team was doing right in terms of prioritization, since we were getting good results from our recent tests.

We started discussing how each of our teams comes up with new tests: our ceremonies, and how we adapt frameworks like the Opportunity Tree or RICE/ICE scores to each team’s reality. At some point, we got to how we estimate the impact of potential experiments. Did we do a good job on that? Was that a critical factor in reaching our OKR?

To give more context, I’m a Product Manager at Nubank, the largest digital bank in Latin America. We have a very fast growth pace — acquiring millions of customers per year — that is supported by a very robust Growth Tribe. My squad handles user onboarding and our mission is to keep improving user activation metrics.

We are a very experiment-focused squad: 90% of our work consists of iterating on existing flows and running A/B tests looking for optimizations. Nubank has very specialized teams and internal platforms for testing, and has been doing this for some time now. So it’s important to stress that we are at a mature stage in terms of experimentation.

If you are still trying to find a way to shift your team towards experimentation, or trying to assemble your first growth team, I highly recommend reading Sean Ellis’ Hacking Growth.

Getting back to my friend’s question: it led me to look at the assumptions we made when prioritizing last semester’s bets and to compare them to the actual results. I found it to be a great exercise that provided:

  1. a rough estimate of our experiments’ success rate;
  2. a way to identify our biases as a team;
  3. a great way to consolidate our learnings on invalidated hypotheses.

I’m going to share some of the interesting things I found out. An important remark: I don’t think the lessons I learned are replicable to other teams at all. They are very specific to the stage of our team and product, the type of hypotheses and solutions we usually work with, and so on. Only doing a similar exercise will give you insights that fit your product team.

I changed some of the data a bit, but nothing that changes the overall conclusions.

Our Experiments in hindsight

Fig. 1: Overview of our 10 Experiments in the semester.

We did 10 Experiments during the last cycle that varied in terms of effort and estimated impact. I added the actual results (measured by uplift in p.p.) in the fourth column so we can compare our expectations with reality. Here are some things we noticed:

1. Our failure rate

40% of our experiments (D, E, F, H) had null or negative results. Since we had a good outcome overall, I’m assuming that’s a good balance to consider in the future.

For our team’s scope, having a much higher failure rate could mean we are not doing well enough on hypothesis formulation. On the other hand, having a very low failure rate would probably mean we are not taking enough risk.

Keep in mind that this is my perception for a team that focuses on conversion optimization for existing flows, fast experimentation, and so on. Failing 40% of the time when building core features or strategic bets is not a good sign.

2. Our effort distribution

Looking at the “Effort” column, we noticed that our most complex tests paid off, given their return, and that the others didn’t cost us a lot of Engineering hours, allowing us to fail “fast” and move on to our next bets.

This point was very important for our performance over these 2 quarters: we were able to distribute our cost (i.e. Engineering time) effectively between the bets. We allocated a lot of time to the tests we had high confidence in and less time to the bolder, more unpredictable ones. Note that this is not the same as only doing a few complex tests, or as always following a low-to-high effort order when prioritizing initiatives.

Since you don’t know the results beforehand, there’s no way to make sure you won’t spend time on a null-result test. But you should take a portfolio approach to your bets, which is very different from a row of tasks/epics waiting in line, in a prioritized order, for you to work on.

When we use the RICE score, for example, we are reducing four dimensions (reach, impact, confidence, effort) to one (the final score) so all the candidates can be compared and lined up on a single axis. One thing we learned this year is that we shouldn’t do that for our experimentation bets. We should consider what we know from previous cycles (number of experiments per quarter, the possible range of impact, our success/failure rate) and pick the right bets for different mixes of confidence/impact/effort.
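To make that concrete, here is a minimal sketch (in Python, with made-up candidate bets rather than the experiments from Fig. 1) of how the standard RICE formula, (Reach × Impact × Confidence) / Effort, squeezes everything into one score and one ranked list:

```python
# A minimal sketch of how RICE collapses four dimensions into one number.
# The candidate bets below are made-up examples, not the experiments from Fig. 1.
from dataclasses import dataclass


@dataclass
class Bet:
    name: str
    reach: int         # users touched per quarter
    impact: float      # expected effect per user (0.25 = minimal ... 3 = massive)
    confidence: float  # 0.0 to 1.0
    effort: float      # engineering sprints (or person-weeks)

    def rice(self) -> float:
        # The standard RICE formula: everything squeezed into a single score
        return (self.reach * self.impact * self.confidence) / self.effort


bets = [
    Bet("home-screen widget iteration", reach=50_000, impact=1.0, confidence=0.8, effort=2),
    Bet("new onboarding step",          reach=80_000, impact=2.0, confidence=0.5, effort=6),
    Bet("notification copy tweak",      reach=50_000, impact=0.5, confidence=0.3, effort=1),
]

# ...which produces the single ranked "line" of candidates mentioned above
for bet in sorted(bets, key=lambda b: b.rice(), reverse=True):
    print(f"{bet.name:>30}: RICE = {bet.rice():,.0f}")
```

That single line is great for comparing roadmap items, but once everything sits on it, the information about which bets are cheap-and-risky versus expensive-and-safe is gone.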

For next semester, for example, we will try to pick up speed and run 6 in-app experiments per quarter. We’ll have:

  • 1 or 2 slots for more complex tests that demand 2 engineering sprints: we’ll leave these slots for bigger bets, somewhat validated through customer interviews (+confidence);
  • 4 or 5 slots for small bets: experiments where we just have hypotheses and we’ll try to invalidate them as fast as we can.

If everything goes well, we will concentrate our failures on the small bets, being able to learn fast and move on.
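In rough pseudo-plan form (a sketch with made-up candidates and illustrative thresholds, not our actual backlog), the quarter looks something like this:

```python
# A rough sketch of the slot plan above, with made-up candidates and
# illustrative thresholds (2 sprints, 0.7 confidence). Not a real policy.

SLOTS = {"complex": 2, "small": 4}  # 6 in-app experiments per quarter

candidates = [
    {"name": "bigger bet (interview-backed)", "effort_sprints": 2.0, "confidence": 0.8},
    {"name": "empty-state redesign",          "effort_sprints": 1.0, "confidence": 0.5},
    {"name": "copy tweak",                    "effort_sprints": 0.5, "confidence": 0.4},
    {"name": "push timing change",            "effort_sprints": 0.5, "confidence": 0.3},
]

# Complex slots go to high-confidence bets that justify ~2 sprints each;
# everything cheap and still unproven competes for the small slots.
complex_bets = [
    c for c in candidates if c["effort_sprints"] >= 2 and c["confidence"] >= 0.7
][: SLOTS["complex"]]

small_bets = [
    c for c in candidates if c["effort_sprints"] < 2
][: SLOTS["small"]]

print("complex slots:", [c["name"] for c in complex_bets])
print("small slots:  ", [c["name"] for c in small_bets])
```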

3. Our successful bets: Estimated Impact x Results

Looking at the 60% we got right, we can separate them into two groups: high (~4 p.p.) or low (1 to 2 p.p.) uplifts.

We also noticed that:

  • All experiments that delivered a high uplift were given a high estimated impact (A, I, J);
  • Not all of the experiments that we estimated as high impact were very successful (D, C);
  • For experiments where we had low expectations, we had mixed results. Some of them had null results (E, F, H), and others delivered significant impact (B, G).

I was a little bothered by the fact that, to reach a high uplift, an experiment had to have been rated as high potential by our team, yet at the same time we would sometimes bet a lot on experiments that didn’t perform well at all.

So I took a closer look at the estimates we made for these tests and discovered something interesting.

At the beginning of the year, to prioritize our hypothesis formulation, we spent a lot of time understanding users’ problems and stack-ranking them. We ended up with 3 prioritized problem groups that summarized our strategy for the next tests.

This exercise was very important for our team, not only for our design work but for strategic actions we took throughout the year. However, I think it weighed more than it should have when we were guessing the impact of our tests.

Breaking impact down by the size of the problem and the potential of the solution.

Experiments C and D were focused on a big problem that always comes up during customer interviews but that our team can’t solve in depth. Experiments A, I, and J, on the other hand, were well-executed solutions for smaller problems that customers had when onboarding.

At least for our team and at least concerning last year: a good solution to a small problem outperformed a weak solution to a huge problem every time.
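One way to make that explicit next time we estimate impact (a sketch of the idea, not how we actually scored last year’s bets) is to estimate the two factors separately and only multiply them at the end:

```python
# A hypothetical way to split the impact estimate into the two factors above,
# instead of guessing a single number. All values here are illustrative.

def estimated_uplift_pp(problem_size_pp: float, solution_potential: float) -> float:
    """
    problem_size_pp: how many percentage points of activation the problem
                     accounts for (how big the problem is).
    solution_potential: 0.0-1.0, how much of that problem this particular
                        solution can realistically capture.
    """
    return problem_size_pp * solution_potential

# A weak solution to a huge problem vs. a strong solution to a small one:
print(estimated_uplift_pp(problem_size_pp=10.0, solution_potential=0.1))  # 1.0 p.p.
print(estimated_uplift_pp(problem_size_pp=4.0, solution_potential=0.9))   # 3.6 p.p.
```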

Key Takeaways

After doing this exercise I had the following takeaways for the next cycles:

  • Now I can estimate a failure rate for our next Experiments (in my case, around 40%) and even how the results are usually distributed (from low to high uplifts);
  • Using a roughly 1:4 proportion between big bets and small bets seems to have paid off. We only committed a lot of Engineering time to bets where we had evidence to support our confidence;
  • In our case, strong solutions to small problems beat weak solutions to huge problems;
  • Last, but not least, if you are not able to roll out quickly between tests, take some time to analyze the performance of all the experiments combined vs. a control flow (the flow as it was 6 months before, for example). You can be sure that the uplifts won’t just add up; they will have some effect on each other (most likely negative, but sometimes a positive compound effect). See the toy example after this list.
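For intuition, here is a toy example (made-up numbers, and a deliberately simple model in which each change fixes a share of the remaining drop-off) of why the combined result is usually not the plain sum of the individual uplifts:

```python
# A toy illustration (made-up numbers) of why individual uplifts don't simply add up.
baseline = 0.50               # activation rate before any of the experiments
uplifts_pp = [4.0, 2.0, 1.0]  # each winner's uplift, measured against that baseline

naive_total = baseline * 100 + sum(uplifts_pp)  # 57.0% if the uplifts just added up

# If each change instead removes a share of the *remaining* drop-off, later
# experiments act on a smaller pool and the combined rate comes out lower
# (and real interactions between changes can shrink, or amplify, it further):
rate = baseline
for uplift in uplifts_pp:
    share_fixed = (uplift / 100) / (1 - baseline)  # share of drop-off removed at baseline
    rate += (1 - rate) * share_fixed

print(f"naive sum: {naive_total:.1f}%")  # 57.0%
print(f"combined:  {rate * 100:.1f}%")   # ~56.7%
```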

Hope you enjoyed the read. If you had a different experience with your product, I would love to hear about it!


Guilherme Lacerda

Interested mostly in tech, business, and literature. Senior PM @ Nubank