
Case study

Benchmarking the user experience

Comparing two Learn Path concepts at Fretello using SUM, the single usability metric.

Álvaro Velasco
Work: Fretello, early 2019 · Documented 2019 · Refreshed 2026
Role: Senior Product Designer
Duration: Early 2019 (pre-WWDC19)
Team: UX team at Fretello, plus engineering for the MVP build

Owned: Test plan, prototypes (ProtoPie), task scripts, data analysis, the call on the combined design

Why use this method

Two designs, both look great, neither one obviously wins. That's the moment to stop arguing about screenshots and start measuring.

SUM (the single usability metric) is a standardised score that combines the three things you actually care about in a usability test: did people finish the task (effectiveness), how long did it take them (efficiency), and how did they feel about it (satisfaction). One number per variant. Argue about the number, not about which mockup looks more polished.

The reason I reach for this over an A/B test is independence from engineering. You don't need a build, a feature flag, or a rollout cohort. A clickable prototype, a task script, and a stopwatch are enough to start measuring.

Diagram of how SUM is calculated. For each variant, raw scores for completion, time on task, and satisfaction are standardised, then averaged into per-task SUM scores (Task1 SUM, Task2 SUM), then averaged again into a single Variant SUM Score.
How the raw numbers roll up. Standardise inside each task, average across tasks, end with one number per variant.
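
A minimal sketch of that roll-up in Python, for anyone who wants to see the arithmetic. The case study doesn't spell out the standardisation rules, so this assumes one common approach: completion as a plain proportion, and satisfaction and time turned into a 0 to 1 score by comparing the mean against a target through a normal curve. The 4.0 satisfaction benchmark, the 30-second time limit, the sample data, and every function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def share_meeting_target(gap, spread):
    """Turn a gap to a target (raw units) into a 0..1 score via a normal curve."""
    if spread == 0:
        # No variation between participants: the mean either meets the target or not.
        return 1.0 if gap >= 0 else 0.0
    return NormalDist().cdf(gap / spread)


def completion_score(completed):
    # Effectiveness: share of participants who finished the task.
    return sum(completed) / len(completed)


def satisfaction_score(ratings, benchmark=4.0):
    # Satisfaction: higher ratings are better, so the gap is mean minus benchmark.
    return share_meeting_target(mean(ratings) - benchmark, stdev(ratings))


def time_score(seconds, time_limit):
    # Efficiency: lower times are better, so the gap is limit minus mean.
    return share_meeting_target(time_limit - mean(seconds), stdev(seconds))


def task_sum(completed, ratings, seconds, time_limit):
    # Per-task SUM: plain average of the three standardised scores.
    return mean([
        completion_score(completed),
        satisfaction_score(ratings),
        time_score(seconds, time_limit),
    ])


def variant_sum(task_scores):
    # Variant SUM: plain average of the per-task SUM scores.
    return mean(task_scores)


# Hypothetical data for one task and four participants.
example = task_sum(
    completed=[True, True, True, False],
    ratings=[3, 4, 4, 5],
    seconds=[20, 30, 30, 40],
    time_limit=30,
)
print(round(example, 3))  # 0.583
```

Each score needs at least two participants, because the standard deviation is undefined for a single observation.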

The competing concepts

At Fretello, the Learn Path workshop (the one I describe in the Fretello case study) produced two prototypes. Both clickable in ProtoPie. Same content underneath, different mental models on top.

Side-by-side phone mockups of the two Learn screens. Left: The Path, a ladder of circular module thumbnails (Read Tabs, Gear, Posture, Rhythm) with progress rings. Right: The Birdseye, a vertical list of beginner courses (First Steps, Beginner Basics, Scales Made Easy) with thumbnails, durations, and a featured Christmas course callout.
The two prototypes. The Path on the left, The Birdseye on the right.

Usability analysis

We wrote a task script that mapped to the requirements: completing a module, navigating to the next module, finding a featured course, picking up where you left off, jumping to a specific lesson. Three rounds of testing wrapped inside two sprints. Each task captured completion, time on task, and self-reported satisfaction; we standardised the raw scores and averaged them into a SUM per variant.

The Path won. Average SUM across tasks: 91.5%, margin of error 11%. But the Birdseye wasn't done. On a few specific tasks (browsing for a featured course, finding a lesson they hadn't done before) the Birdseye was actually winning.
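
The write-up doesn't say how that margin of error was derived. One common way to get a figure of this kind, sketched here as an assumption, is a confidence interval around the mean of the per-task SUM scores. The scores below are placeholders, not Fretello's data, and the normal approximation is a simplification: with only a handful of tasks, a t-distribution would give a wider margin.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_margin(scores, confidence=0.95):
    """Return (mean, margin of error) for a list of per-task SUM scores."""
    # Two-sided critical value from the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(scores) / sqrt(len(scores))
    return mean(scores), margin


# Hypothetical per-task SUM scores for one variant.
average, margin = mean_with_margin([0.8, 0.9, 1.0])
print(f"{average:.1%} ± {margin:.1%}")
```

The point of reporting the margin next to the average is that two variants whose intervals overlap heavily have not really been separated by the test.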

Two side-by-side bar charts comparing per-task SUM scores. Left: 'The Path' in blue, with most bars between 80% and 100%. Right: 'The Birdseye' in red, with bars between 70% and 100% and noticeably higher results on 'specialcourse' and 'featured' tasks.
Per-task SUM scores. The Path wins on average; the Birdseye wins on a few tasks that pull it through.
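
A small sketch of how that pattern can be pulled out of per-task scores instead of being read off a chart. The task names and numbers are hypothetical stand-ins, and `tasks_won_by` is an illustrative helper, not something from the original analysis.

```python
def tasks_won_by(challenger, leader):
    """List tasks where the challenger's SUM beats the leader's, largest gap first."""
    gaps = {
        task: challenger[task] - leader[task]
        for task in challenger
        if task in leader
    }
    winners = [task for task, gap in gaps.items() if gap > 0]
    return sorted(winners, key=lambda task: gaps[task], reverse=True)


# Hypothetical per-task SUM scores for the two variants.
path = {"complete_module": 0.95, "next_module": 0.92, "featured": 0.80, "specific_lesson": 0.84}
birdseye = {"complete_module": 0.85, "next_module": 0.78, "featured": 0.93, "specific_lesson": 0.90}

print(tasks_won_by(birdseye, path))  # ['featured', 'specific_lesson']
```

The averages tell you who wins overall; a list like this tells you what the losing variant is still worth borrowing from.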

Combining the concepts

After arguing about it inside the UX team, we ran a third prototype that took the Path's spine and bolted on the Birdseye's catalogue affordance. Two ways into the same content: the guided ladder for beginners, the catalogue for explorers.

The combined prototype scored above 75% SUM on every main task. No more trough on the Birdseye-favoured tasks. We worked with engineering to spec it, shipped the MVP into production, and the numbers landed:

The lesson, obvious in hindsight: when two prototypes both score well, the right answer often isn't to pick one. It's to ask what each one is doing well and whether you can have both without confusing the user.

The combined design's per-task SUM scores rendered as a green bar chart, all bars at or above 0.75; alongside, a phone mockup showing the shipped Learn screen with two tabs ('Path' selected, 'Courses' available), then the ladder of modules underneath.
Combined design: above 75% on every task, plus the shipped Learn screen with Path and Courses as parallel routes.
The production result. Beginners get a guided ladder; explorers can still see the whole catalogue.

Why I keep coming back to this

Two reasons.

SUM is the cheapest objective evidence you can get. If your team is split between two designs, put the prototypes in front of users and let them break the tie. The conversation after the test is fundamentally different from the one before it: not “I think A is better,” but “users finished task 3 a fifth faster on B.”

It pairs cleanly with hypothesis-driven design. Your hypothesis names what you expect to be true; SUM tells you whether it was. The metrics you measure (completion, time, satisfaction) are the same shape as the “We'll know this is true when…” line.

