How to Experiment

With the advent of Lean UX — a kind of science of design — the ability to design and conduct an experiment should now be an important part of every designer’s skill set. But what is an experiment? How do you design an experiment? And how can you trust the results?


In an earlier article, I noted that new products and services have about a 1-in-10 chance of surviving the first 6 months after launch. In view of the time, cost and effort that go into creating new product offerings, not to mention staking a brand's reputation on a successful launch, those are pretty poor odds. So anything that can improve decision-making during development and reduce the risk of failure has got to be a good thing.

With so much at stake, it's essential that decisions about which idea, which feature, which design and which technology to pursue are based on empirical evidence rather than on guesswork, personal opinion or fiat. Many companies rely heavily on polling the (literal) voice of the customer. But as Susan Fiske, Professor of Psychology at Princeton, points out: "The plural of anecdote is not data, and the plural of opinion is not facts."

Thankfully, there is a sure-fire way to take the guesswork out of decision-making in design. It's a way that not only provides unequivocal answers to our design and business questions, but also gives us a measurable degree of confidence in our decisions.

We can conduct an experiment.

In common parlance we use the term 'experiment' or 'experimenting' to refer to any sort of trial or attempt at something: "I'm experimenting with a new Italian recipe"; "My guitar lessons are a bit of an experiment"; "Let's take the usual way to the airport, we don't have time to experiment with a short cut." Those are not really experiments, and it's not what we're talking about here. The phrase 'conduct an experiment' is not a euphemism for 'try something and see what happens' or 'show it to a few people and get their opinions'. An experiment is a specific method with a specific procedure and a set of rules. It is a well-proven scientific procedure that is used to test a hypothesis. It's a powerful method that should be in every designer's toolbox. And with the popularity of Lean UX — a kind of science of design — decision-makers should at least have a comfortable understanding of experimental basics.

Let's take a look at what's involved.

What is an experiment?

Described in its simplest terms, an experiment is an investigation into the causal relationship between two things. It is conducted by making a change in one thing and then looking to see if that changes the other thing.

Note the word ‘causal’. That’s the real key to an experiment. If you’re not interested in what’s causing an effect (higher or lower sales, faster or slower sign-ups, better or worse user experience) then you don’t really need an experiment. But most of the time we do want to know the causal agent so that we can make informed design or marketing decisions, understand why our product or service works the way it does, or why a competitor version seems to work better, and what specific feature or element is making the difference.

The thing we change in an experiment to see if there’s a causal effect is called the independent variable. An example of an independent variable is where we place the call to action button on a screen. For example, we could have two versions of a web page in which the main call to action button is either on the left or the right of the page. 

The thing we measure in an experiment is called the dependent variable. For example, the dependent variable could be the number of test participants who click on the call to action button, or it could be the time taken to click on the button.

Another defining characteristic of an experiment is the control of bias. Bias happens when something other than the independent variable affects the dependent variable. For example, if the independent variable is the location on the page of our call to action, then it is important to keep constant other features of the call to action, such as the button size, the colour, the label, even the font. Otherwise we can’t be sure whether any change we measure is really due to location and not due to one of these other variables.

Other types of research methods commonly used in product development — such as field research, observational studies and formative usability testing — are not true experiments. With these methods it is usually impossible to control all the sources of bias, so we can’t be sure about causation. We can, however, think of them as ‘quasi-experiments’ because they do collect empirical data, and the data collected can lead to new hypotheses that can then be tested in an experiment. 

In short, the hallmarks of an experiment are manipulating an independent variable, measuring the effect of the independent variable on the dependent variable, and controlling bias. 

What to keep in mind when running an experiment

When running an experiment, you need to:

  • Create a hypothesis, 
  • Assign your participants and
  • Measure user behavior.

Let’s look at each of these in turn.

Create your hypothesis

Experimental design begins with a hypothesis — a guess about causation. This is what we’re going to test in our experiment. But your hypothesis must not only be testable, it must also be falsifiable. This is important because we can’t actually prove a hypothesis to be true: we can only show that it is false. As philosopher Karl Popper famously put it: 

“No number of white swans can prove the theory that all swans are white. But the sighting of just one black one can disprove it.”

That’s essentially what our experiment will seek to do. What is really being tested is an ‘anti-hypothesis’: an assertion that the independent variable will have no effect on the dependent variable. If the data lead us to reject the anti-hypothesis, we can take that as support for our original hypothesis. (In experimental design, the ‘anti-hypothesis’ is known as the ‘null hypothesis’.)

Assign your participants

Next we must assign our test participants to the test itself. For example, will participants see both types of screen: one with the call to action button on the left and one with the call to action button on the right? Or will participants get to see just one kind of screen?

If each test participant sees both screens you have a ‘within-subjects’ design (data from each condition are compared within the performance of a particular test participant). In this design we must be careful that exposure to one condition doesn’t influence the test participant’s performance in another condition as this can give rise to misleading results. For example, if participants see an interface with the call to action button on the left, their performance might be enhanced, or impaired, if they have already worked with an interface where the call to action button is on the right. This is because of order effects such as practice, interference or fatigue. 

The obvious way to avoid these effects is to assign different participants to different levels of the independent variable. For example, one group of participants sees the call to action button on the right and another group sees it on the left. This is called a ‘between-subjects’ design.
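One way to form the two groups is to assign participants at random, so that the groups are unlikely to differ in any systematic way. The sketch below is only an illustration: the participant IDs, the condition names and the helper assign_between_subjects are made up for this example and are not taken from any particular tool.

```python
import random

def assign_between_subjects(participants, conditions, seed=None):
    """Randomly split participants into near-equal groups, one per condition.

    Hypothetical helper: shuffles the participant list, then deals
    participants out to the conditions in turn.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, participant in enumerate(shuffled):
        condition = conditions[index % len(conditions)]
        groups[condition].append(participant)
    return groups

# Example: 8 made-up participant IDs, two levels of the independent variable.
participants = [f"P{n:02d}" for n in range(1, 9)]
groups = assign_between_subjects(participants, ["button_left", "button_right"], seed=1)
for condition, members in groups.items():
    print(condition, members)
```

Fixing the seed makes the allocation repeatable, which can be handy if you need to show later how the groups were formed.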

Finally, here’s an important point that often gets overlooked. For some behavioural measures performance differs considerably between participants. However, a within-subjects experiment doesn’t have to deal with the differences between participants. In a within-subjects design the participants act as their own controls. This makes it a more efficient test. The consequence is that you are more likely to detect small effects by using a within-subjects design, and you will likely need fewer test participants than you would if you used a between-subjects design.
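The efficiency point can be seen with a small, entirely made-up data set. In the sketch below the same twelve task times are analysed twice: once as paired scores from six participants (within-subjects) and once as if they came from two separate groups of six (between-subjects). The numbers and function names are assumptions for illustration, and the t statistic is used here simply as one common way of expressing an effect relative to its noise.

```python
import math
import statistics

# Made-up task times in seconds for six participants (illustration only).
time_left = [12.1, 18.4, 9.8, 22.5, 15.0, 11.2]
time_right = [11.0, 17.1, 9.1, 21.0, 14.2, 10.1]

def paired_t(a, b):
    """t statistic for a within-subjects comparison (same participants)."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

def welch_t(a, b):
    """t statistic treating the two lists as separate groups (between-subjects)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

t_within = paired_t(time_left, time_right)
t_between = welch_t(time_left, time_right)
print(f"within-subjects t = {t_within:.2f}")
print(f"between-subjects t = {t_between:.2f}")
```

The average difference between the two conditions is identical in both analyses. The paired statistic comes out far larger because the big differences between fast and slow participants cancel out when each person is compared only with themselves.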

Measure user behavior

Behavioural measures typically fall into four main categories: 

  • Latency refers to the time that elapses from the presentation of a stimulus to the start of a particular behaviour. For example, how long it takes for a test participant to make a response after being presented with alternative buying options.
  • Frequency refers to the number of occurrences of a specific behaviour per unit of time. For example, how many times in a 10-minute task the test participant hesitates.
  • Duration refers to the length of time that a specific behaviour lasts. For example, the time it takes to complete a task, or the time taken to find an item on a web site.
  • Intensity refers to the ‘amplitude’ of a behaviour or action. This often requires splitting a behaviour into discrete components and then measuring how many times a component occurs per task or per unit of time. For example, this ‘local rate’ of behaviour can indicate the extent to which a behaviour is being hurried.
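As a rough sketch, the first three measures can be computed from a timestamped log of what a participant did. The log format and event names below are hypothetical, invented for this example. Intensity is left out because it depends on how you choose to split a behaviour into components.

```python
# Hypothetical event log for one participant: (seconds since start, event name).
events = [
    (0.0, "options_shown"),
    (2.4, "first_click"),
    (5.0, "hesitation"),
    (9.5, "hesitation"),
    (14.0, "hesitation"),
    (30.0, "task_complete"),
]

def time_of(log, name):
    """Return the timestamp of the first event with the given name."""
    return next(t for t, event in log if event == name)

# Latency: from presentation of the stimulus to the start of the response.
latency = time_of(events, "first_click") - time_of(events, "options_shown")

# Duration: how long the whole task lasted.
duration = time_of(events, "task_complete") - time_of(events, "options_shown")

# Frequency: occurrences of a behaviour per unit of time (here, per minute).
hesitations = sum(1 for _, event in events if event == "hesitation")
frequency_per_minute = hesitations / (duration / 60)

print(f"latency: {latency:.1f} s")
print(f"duration: {duration:.1f} s")
print(f"hesitations per minute: {frequency_per_minute:.1f}")
```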

Wherever possible, collect quantitative data in the form of ratio or interval data, as this will allow you to carry out powerful statistical analyses. Note that it is important to design the experiment with a specific statistical analysis in mind to ensure that assumptions required for the analysis are met by the experiment.

Thinking about statistics

As you’ll have noticed, it’s not really possible to design an experiment without also thinking about the kind of analysis that you’ll need to carry out on your data. One question that always surfaces during presentations of customer research (to be honest, it’s usually the only question ever asked about stats) is: “Are the results significant?” Having gone to the trouble of designing and conducting a good experiment, it’s important to understand what this question is really asking. 

The word ‘significance’ in this context does not carry its usual meaning. It has nothing to do with whether the result you got is important. You can have a ‘significant’ result that is too small to be of any real importance — though in some areas of research a small effect can be very important. You can also have a large effect that is not ‘significant’, often due to noisy data or low test power. 

What the significance question is really asking is, “Did we get the result by chance?” Or putting it another way, “How confident can we be that the changes we have measured in our dependent variable really were caused by manipulating our independent variable?”
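One way to make "by chance" concrete is a permutation test, a technique the article itself does not name. It works like this: if button position made no difference, the left and right labels would be interchangeable, so we can shuffle them many times and count how often the shuffled data show a difference at least as large as the one we observed. The click data and the function name below are made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one arises when the condition labels are shuffled at random."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        difference = abs(
            sum(shuffled_a) / len(shuffled_a) - sum(shuffled_b) / len(shuffled_b)
        )
        # Small tolerance so floating-point rounding does not hide exact ties.
        if difference >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Made-up data: 1 = participant clicked the call to action, 0 = did not.
clicks_left = [1] * 18 + [0] * 22    # 18 of 40 clicked
clicks_right = [1] * 28 + [0] * 12   # 28 of 40 clicked
p_value = permutation_p_value(clicks_left, clicks_right)
print(f"estimated p-value: {p_value:.3f}")
```

The number printed is an estimate of the probability of seeing a difference this large if button position had no effect at all.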

The value ‘95%’ gets tossed around a lot when discussing experimental results. It corresponds to the significance level of 0.05 (5%) that is set in most behavioural research studies. By setting this level we are saying that, given the nature of the research topic, we are prepared to accept a 5% risk of declaring an effect when there is none. If the probability of getting a result at least as large as ours by chance alone turns out to be less than 0.05, we reject the null hypothesis and treat the result as real, that is, as caused by the changes in our independent variable. This is what people mean, loosely, when they say they are 95% confident in a result. Furthermore, and this is the whole point of doing an experiment in the first place, we can then generalize the result to our target population, provided our test participants are a representative sample of it.

But there’s actually a better question to ask than, “Are the results significant?”

The 95% significance level is not a ‘magic’ value. It is just a convention. It is widely used in academia, but in some fields, medical research for example, stricter levels of 99% or even 99.9% are sometimes required before accepting that a result is not due to chance. But in customer and product development research, you may not want to discard your results too quickly just because they don’t quite achieve 95% significance. What if they only achieve 90% significance? If you are trying to make a decision about whether to go in direction A or direction B and you can be 90% confident in your decision, well, that’s pretty good. It means that if there were really no difference between A and B, only about 10 in every 100 experiments run with different random samples would show a difference as large as yours. You might want to go with it.

So, rather than asking whether the results are significant and risk throwing the baby out with the bathwater if the answer is “no”, instead ask, “What is the probability of this result happening by chance?”

Then you can decide for yourself whether the results are significant enough for the kind of decision you have to make.
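As a minimal sketch of that approach, the code below computes the probability directly for the call to action example and then compares it with several thresholds, rather than with 0.05 alone. It uses a two-proportion z-test, which is one common choice for comparing click rates and an assumption on my part, since the article does not prescribe a test. The counts and the function name are invented for illustration.

```python
import math

def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value from a two-proportion z-test (normal approximation)."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up counts: clicks on the call to action for each button position.
p_value = two_proportion_p_value(clicks_a=22, n_a=50, clicks_b=31, n_b=50)
print(f"p = {p_value:.3f}")

# The threshold is a judgement call that depends on the decision at stake.
for alpha in (0.01, 0.05, 0.10):
    verdict = "below" if p_value < alpha else "not below"
    print(f"alpha = {alpha:.2f}: p is {verdict} the threshold")
```

With these made-up counts the p-value falls between 0.05 and 0.10: not ‘significant’ by the usual convention, but possibly good enough for a low-risk choice between two layouts.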

 

Philip Hodgson, April 2014