Power meter output variation between rides

Of course, power meter output for the same input power should not vary between rides. But how would anyone check without having access to a calibrated pedaling robot?

One-on-one comparisons between different power meters are always difficult to interpret. They might perfectly match (which they usually won’t) and you’d still not know whether they are right. More often, they will differ and you won’t know which one is off by how much, just that at least one of them must be off.

Using scatter plots as described in an earlier post

variation between rides - scatter plot sample - total

we can use linear approximation to get a single value describing the difference between two power meters. We can also do the same for left and right pedals each, although, keep in mind that spider-based meters like the P2M (or hub-based meters) will only provide an estimate.

variation between rides - scatter plot sample - left

variation between rides - scatter plot sample - right

The slope of the linear approximation is not a 100% statistically correct way of comparison here; there are issues like whether zero points should be left out, whether the approximation should go through the origin of the axes, whether it should be linear at all, etc. But it provides a way to look at the overall trend.

I collected the data over several rides, some of which were taken on the same day, each ride starting with a zero-offset calibration of both meters and got this:

variation between rides - vector vs pioneer

The total instant power of both meters differs on average by 3.5%, which could be within the manufacturers’ claims. The fact that it may differ by up to 6.4% is less so, especially considering that this data was taken in a climate-controlled indoor environment without large temperature changes. Both meters measure left power and right power independently: the left meters match almost perfectly, with a difference of only 0.4% on average, although the s.d. is a bit larger; the right meters agree less and differ by 6.3% on average and 14.1% at maximum (note though that after both rides differing by more than 10%, the second ride agreed much more, which might show some kind of convergence, though the left side is showing the opposite trend…). Although both meters report different values for instant and average power, their comparison result is similar.

variation between rides - vector vs p2m

Comparison of Vector2 and P2M shows that, for total power, the variance between rides is slightly less than in the previous comparison, although the difference is larger: 7.3% on average, with the Vector2 always reading at least 4.5% higher, which is not nice. We also see that the estimated left/right balance is pretty stable over rides, although that doesn’t have to mean anything. And while the left Vector pedal matches almost perfectly, the right Vector pedal is again reading a higher power value than the power meter it is compared with.


Even calibrating immediately before a ride will not eliminate bias between power meters. A difference of up to about 5% doesn’t seem unusual, at least for the compared pairs.

The right Vector pedal used in this comparison reads too high compared with both the Pioneer and the P2M. Note that the Vector had been calibrated using a known weight just days before, and the right Vector was scaled to about 98% of its original output value. I may need to talk with their support about this issue.


How precise are Power meters?

In statistics, precision refers to repeatability, meaning that something is precise if all measured values are close to each other, while accuracy is exactness, meaning that the average of all measured values is close to the correct value. So, a power meter can be accurate and precise (the ideal case), precise but not accurate (useful for day to day training, but not comparable with other meters), accurate but not precise (average values are correct, but a single sampled value can be far off), or neither accurate nor precise.

Obviously, an accurate and precise power meter would be nice to have, but it’s also pretty well known by now that power meter values can dance a lot, and many riders display 3 or 5 or even 10 second averages on their cycling computers.

So, how precise are power meters?

Let’s look at this ride, comparing Pioneer and Garmin power meters. As usual, this is just a single one-to-one comparison ride, so I am not saying this data is representative for power meters in general or these two power meter models in particular; it’s just what I got. And if you asked me which one is correct, I’d of course say: neither one.

long ride comparison - average - complete ride

If we zoom in, we get something like this:

long ride comparison - instant - two intervals

Large changes in power obviously correspond, but small changes look pretty chaotic. Averaging over, say, 30 seconds eliminates all small changes and we get this:

long ride comparison - average - two intervals

So, both meters follow each other pretty well, and the Pioneer is a bit slower to respond to large changes, but then tries to correct itself with a steeper slope. (Overshoot is pretty well controlled, although we see one at the last decline.)

But, this isn’t really satisfying: If I buy a kitchen scale for which the manufacturer is stating an accuracy of plus minus 1g for any weight below 100g, I expect 95% of all measurements to fall within the stated error range. Aren’t we making it too easy for power meter manufacturers when we let them get away with a simple accuracy statement (that’s also difficult to check) without any promises about precision?

So, what can be done with data that looks like this:

long ride comparison - instant - 1500 to 1600

One thing to try is a scatter plot where every data point is visualized by a single plotted dot, here with Vector on the x-axis and Pioneer on the y-axis.

long ride comparison - instant - scatter plot 2

The dots on the left on the y-axis show that the Pioneer has more zero values, which might be caused by measurement or transmission errors. The linear approximation y=1.0223x shows that on average the Vector output is about 2% higher than the Pioneer. Of course, one has to be careful with these plots because different time delays and a non-symmetric ride profile could bias this data. On the other hand, if the relative delay time difference is constant, one could simply shift the data set and try out several delay time combinations to find the most likely delay time.
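The fit through the origin and the idea of shifting one data set against the other can be sketched in a few lines of Python. This is only an illustration of the approach, not the spreadsheet I actually used: the function names are made up, both series are assumed to be equally long lists of 1 Hz samples, and "most likely delay" is taken to mean the shift with the smallest mean squared residual around the fitted line, which is one possible criterion among several.

```python
def slope_through_origin(x, y):
    """Least-squares slope a of y = a*x (line forced through the origin)."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x contains only zeros")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def shifted_pairs(x, y, lag):
    """Pair x[i] with y[i + lag]; a positive lag means y lags behind x."""
    if lag >= 0:
        return x[:len(x) - lag], y[lag:]
    return x[-lag:], y[:len(y) + lag]


def best_lag(x, y, max_lag=5):
    """Try all shifts in [-max_lag, max_lag] samples.

    Returns (lag, mean squared residual, slope) for the shift whose
    points scatter least around the fitted line (assumed criterion).
    """
    best = None
    for lag in range(-max_lag, max_lag + 1):
        xs, ys = shifted_pairs(x, y, lag)
        a = slope_through_origin(xs, ys)
        mse = sum((yi - a * xi) ** 2 for xi, yi in zip(xs, ys)) / len(xs)
        if best is None or mse < best[1]:
            best = (lag, mse, a)
    return best
```

Calling `best_lag(meter_a, meter_b)` on two recorded power columns gives the shift and the slope at that shift; whether zero values should be dropped first is the same open question as for the plain scatter plot.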

We also see that the blue dots don’t line up neatly on the linear approximation line, but create a blue belt about 20W in width… that’s a lot. So, let’s look at the width of this distribution in more detail; just keep in mind that this is not the error histogram of a single power meter compared with a correct value, but the relative difference of two power meters, containing the error of both (including the possibility that they cancel each other out sometimes).

long ride comparison - instant - histogram of error between meters in W

I am actually surprised that the peak comes at zero difference between the meters and a lot of values fall between plus minus 5W from zero. If we re-format this into a cumulative histogram, we get:

long ride comparison - instant - cumulative histogram of absolute error between meters in W

About 50% of the values fall within a plus minus 5W range, standard deviation is about 7W, 95% (or 2 s.d.) is about 18W and 99.7% (or 3 s.d.) is about 40W. So, from a statistical point of view, it’s pretty much nonsense to display instant power on your power meter.
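For anyone who wants to get the same kind of numbers from their own paired data, here is a minimal Python sketch. It is not the tool I used; the function name and the dictionary layout are invented for illustration. It reads the coverage levels directly off the sorted absolute differences instead of multiplying the s.d., because the difference between two meters does not have to be normally distributed.

```python
import math
import statistics


def difference_summary(a, b, levels=(0.5, 0.95, 0.997)):
    """Summarise the sample-by-sample difference a - b in watts.

    For each level, reports the smallest absolute difference that
    covers at least that fraction of all samples.
    """
    diffs = [ai - bi for ai, bi in zip(a, b)]
    abs_sorted = sorted(abs(d) for d in diffs)
    n = len(abs_sorted)
    summary = {
        "mean": statistics.fmean(diffs),   # average offset between the meters
        "sd": statistics.pstdev(diffs),    # spread of the difference
    }
    for level in levels:
        k = max(0, math.ceil(level * n) - 1)  # index of the covering sample
        summary[level] = abs_sorted[k]
    return summary
```

The result for the level 0.95, for example, is the "plus minus" range in watts that contains 95% of the recorded differences.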

I really think we need to have power meter manufacturers state how precise their meters are or have some independent organization check them with a calibrated pedaling robot.

Calibrating Power Meters with known weights

It’s often said that for day-to-day training, the important property of a power meter is precision (i.e. repeatability of measured values) and not absolute accuracy (i.e. correct value). I do agree but … what if you had several bikes with power meters fitted to each (okay, that’s actually another reason to swap a hub- or pedal-based power meter!) or wanted to review your long term performance changes 10 years from now?

Some power meters allow you to check their absolute measurements after doing a zero-reset (example: Pioneer displays force in [N]), some even allow you to specify a scaling parameter after checking (example: Vector displays torque in [Nm] and allows you to store a scaling factor in the pedal to correct its output). Some, like the older P2Ms, unfortunately don’t do any of this.

Garmin has a manual on the internet for the recommended procedure. Although they mention the difficulty of measuring a heavy weight of over 10 kgs to the required precision, in their example they are using a large weight, and hanging that from a pedal requires hanging the bicycle high up in the air while attaching the weight … nothing I’d be keen to try.

One alternative could be to just use a calibrated weight that’s used for checking scales, which looks like this:

power meter calibration weight 2

This one here is a 10 kg weight (which I admit is a bit on the light side, even for a lightweight cyclist with a not too high maximum power number; 10 kg is equivalent to between 150 and 200 W at cadence 100 for ideal completely round pedaling or probably about 50 W at cadence 70 for typical not-round pedaling) accurate to plus minus 1.6g (guaranteed for one year by the manufacturer), which is far better than the accuracy needed for this procedure. A 20 kg weight would only measure about 25% more in height/width/depth each and still be compact enough to measure both tangential force (as seen in the picture with a horizontal crank) and radial force (with the crank in upward position) with the wheels on the floor.

Together with the metal hardware like shackles to mount the weight to the pedal, measured on an extra-precise kitchen scale, the total weight was 10184.5g plus minus 2.2g or 0.02% accuracy. With power being linear to force and torque, that’s more than accurate enough. (Sorry for the blurry smartphone picture.)

power meter calibration weight - small hardware on kitchen scale

My results for Vector2: Expected 16.485Nm (for crank length 165mm), measured 16.81Nm on right and 16.44Nm on left.

My results for Pioneer: Expected 99.91N, measured tangential 102N / radial -102N on right and 98N / -102N on left.

In both cases, that’s about 2%, which means that without any other error, the final power values could be within 2% error.
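The expected values above are just weight times gravity (times crank length for torque). A small Python sketch of that arithmetic, assuming g = 9.81 m/s² (the value that reproduces the figures quoted above; local gravity differs slightly) and with function names made up for this example:

```python
G = 9.81  # m/s^2, assumed value; reproduces the expected figures quoted above


def expected_force_n(mass_g):
    """Force in newtons exerted by a hanging mass given in grams."""
    return mass_g / 1000.0 * G


def expected_torque_nm(mass_g, crank_mm):
    """Torque in Nm with the crank horizontal and the mass hanging from the pedal."""
    return expected_force_n(mass_g) * crank_mm / 1000.0


def deviation_percent(measured, expected):
    """Relative deviation of a displayed value from the expected one, in percent."""
    return (measured / expected - 1.0) * 100.0


if __name__ == "__main__":
    force = expected_force_n(10184.5)
    torque = expected_torque_nm(10184.5, 165)
    print(f"expected force  {force:.2f} N")
    print(f"expected torque {torque:.3f} Nm")
    for label, value in (("Vector2 right", 16.81), ("Vector2 left", 16.44)):
        print(f"{label}: {deviation_percent(value, torque):+.1f} %")
    for label, value in (("Pioneer right", 102.0), ("Pioneer left", 98.0)):
        print(f"{label}: {deviation_percent(value, force):+.1f} %")
```

The same functions can be reused with a heavier or a second weight by changing the mass in grams.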


The big caveat here is of course that even a slight cadence error of 1 rpm adds noticeably to this, so a final power error as low as 2% is actually unlikely.

I should probably use not just a heavier weight, but actually several different weights.

DC Rainmaker reported that the upcoming Watteam Powerbeat will use a plastic bag that fills with an exact amount of water to act like an accurate weight. If that works, that’d be nice, although hanging something like 10 kg = 10 liters doesn’t seem very practical.

Somewhat related: Wahoo used to sell, and probably now rents, a weight for calibrating the power meter inside their KICKR trainer, whereas Tacx claims their new really direct drive trainer is calibration-free. (The KICKR was only half direct between chain and trainer, but still had a belt driving a flywheel, while the new Tacx doesn’t really have a flywheel and is completely electronic, producing a virtual feeling of inertia by electronic control.) The principle behind the Tacx, I guess, is that calibration is not necessary if you can control or measure electric current very accurately. A more simplistic view could be: for a rotation sensor you’d either have a magnet switch or some self-calibration using accelerometers and gravitational force, so why should a power meter have to be calibrated at all? Isn’t that just poor engineering?

Power meter accuracy specifications

As far as I know, Verve Cycling is the only power meter manufacturer that publishes somewhat trustworthy accuracy specifications:

  • Power range: 0–3000 Watts
  • Cadence range: 10–200 rpm
  • Accuracy of cadence: ±1 rpm
  • Accuracy torque: ±0.2 Nm accuracy for measurements below 20 Nm, and ±1% of actual readings for measurements above 20 Nm (ask for our Accuracy Certification)
  • Power: Can be calculated from any cadence value within the range at any torque
  • Power update rate: Every rotation

Surprisingly, even SRM only gives one single number, although added with a blunt statement:

Accuracy  ±1% (Scientifically Proven)

No proper scientist would state an error number like ±1% without specifying for what range of conditions that number is valid. So much about science.

Assuming cadence as a function of power:

accuracy - cadence

the specifications of Verve Cycling give this error:

accuracy - power error in percent with cadence error


which translates into W like this:

accuracy - power error in Watt

In other words: Verve Cycling’s InfoCrank has an accuracy of about 2% above 110W (like most other power meters) and about 5% or 2W at 50W. Well, that doesn’t seem really significant to me, but still: hiding (including just forgetting to mention) such a fact doesn’t seem right either, and I strongly feel that since most power meters are engineered using somewhat similar principles, a lot of manufacturers have some clarifications to make.

(added new section from here on)

The interesting point here might actually be: How did I get from a 1% torque error to a 2% power error, if power and torque are related linearly?

accuracy - power error in percent without cadence error

Now we see: without cadence error, 1% torque error of course results in 1% power error, but a cadence error of just 1 rpm (or 1% at cadence 100!) will add another 1% to the power error. So, if you have a cadence sensor with an absolute 1 rpm error range, you’d want to pedal quickly to get more accurate power figures… well, that’s just a joke, but measuring rotation accurately is a really important factor here, and since a wheel rotates more often than the crank (= pedals), a hub-based meter should be easier to engineer for high accuracy.

(I hope I didn’t make any calculation errors and would be happy to be corrected.)
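To make the calculation easier to check, here is a hedged Python sketch of how the torque and cadence specifications quoted above combine. It assumes power = torque × angular velocity, takes the worst case where both errors point in the same direction, and leaves cadence as an input, because the cadence-versus-power curve I assumed is only shown in the graph. The function names are made up for this example.

```python
import math


def torque_error_nm(torque_nm):
    """Spec quoted above: ±0.2 Nm below 20 Nm, ±1% of the reading above."""
    return 0.2 if torque_nm < 20 else 0.01 * torque_nm


def worst_case_power_error(power_w, cadence_rpm, cadence_err_rpm=1.0):
    """Return (relative error, error in watts) for a given power and cadence.

    Worst case: torque and cadence errors both push the reading the same way.
    """
    omega = 2 * math.pi * cadence_rpm / 60.0   # angular velocity in rad/s
    torque = power_w / omega                   # torque needed for this power
    rel_torque = torque_error_nm(torque) / torque
    rel_cadence = cadence_err_rpm / cadence_rpm
    rel_power = (1 + rel_torque) * (1 + rel_cadence) - 1
    return rel_power, rel_power * power_w
```

Setting `cadence_err_rpm=0.0` gives the case without cadence error; with 1% torque error and 1 rpm error at cadence 100 the result is just over 2%, since the two relative errors add (plus a tiny product term).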

Comparing responsiveness of Power Meters

One issue I see with power meters – when you go beyond just using it for day-to-day training and start to compare the data with that from other power meters or over a longer time period – is that although most manufacturers give some number about the accuracy of their devices, usually in the 2-3% range, they really give you just that single number.

Given that it is difficult to engineer – sorry, I won’t explain this deeper at this moment – a power meter that is accurate at very low power, at very high power, for quick changes, and over a long time period, that single number is not at all useful. Neither for comparing different power meters when shopping nor as a guideline about how much you can trust your data.

I will write at some other time about other accuracy issues; in this post I will briefly compare Pioneer’s second generation (crank-based), Garmin’s Vector2 and PowerTap’s P1 (both pedal-based) with regard to how they respond to changes in power, or, in other words, their delay time from measurement to output. Although I do have a background in engineering and science, these are just simple tests of single devices bought through common sales channels, so I don’t claim that this data is in any way representative: it’s just what I got when I rode for some time. (Note: Unfortunately I lack a hub-based power meter, which would really be nice to have for such a comparison. I still need to check how useful the power data from the KICKR are: I know that their power data isn’t as accurate as I’d like, but they could still be helpful if their sampling rate is high enough.)

First up, Pioneer in ANT+ mode versus Vector2, recorded using North Pole Engineering’s WASP unit (note: this allows me to record synchronized at a 1 sec resolution without relying on any specific head unit) on a KICKR (note: I used TrainerRoad to design a ride including constant sections at different power levels, ramp-up and downs, as well as 15 sec spurts at different power levels. The KICKR was controlled from TrainerRoad with PowerMeter feedback from the Pioneer in automatic mode). These are one-to-one comparisons, so there is no way to know which if any of them is right. In most cases, both have delays and both data have some error.

When looking at the whole ride, the power numbers seem to match more or less:

20150808 Vector2 vs Pioneer all

If one starts looking at the details, it seems that the Pioneer is slower to respond to ramp-ups than the Vector2. Interestingly, this holds only for the up-ramp and not for the down side:

20150808 Vector2 vs Pioneer ramp

When looking at 15 second sprints, the delay seems negligible but the maximum power numbers are lower for the Pioneer:

20150808 Vector2 vs Pioneer sprint

Next, comparison of Pioneer vs PowerTap P1. Again, no significant difference on a larger scale:

20150807 P1 vs Pioneer all

Again, we see that Pioneer has some delay on the up-ramps. The P1 might even be a bit faster to respond than the Vector2, but it also seems to have a bit more spikes.

20150807 P1 vs Pioneer ramp

Here, the first three sprints show that the P1 is more responsive than the Pioneer. Again, there is also a difference in maximum power values.

20150807 P1 vs Pioneer sprint

A natural question now might be: What happens if someone does extremely short power bursts? It seems you can get away with 1 to 2 seconds of very brief bursts while the Pioneer is undecided whether that’s a burst or a noise spike. Although not noticeable from the data alone: it seems that the Vector2 is slow to get down to zero and often didn’t go completely down either. So, whereas the up-ramp of the Vector2 is more trustworthy than that of the Pioneer (which shows smaller than real power values because of its delay), the Vector2 may show inflated power values because of the delayed down step.

20150810 Vector2 vs Pioneer bursts

(Section starting here added on August 11th)

Actually things are not that simple, for two reasons.

First, the ANT+ power meter protocol has to get power meters of fundamentally different designs, like hub-based (where power is calculated from torque and wheel rpm) and crank-, spider- or pedal-based (where power is calculated from torque, cadence and, in the case of pedal-based meters, crank length), as well as head units of different levels of sophistication (just displaying instant power, or being able to do calculations and recordings) under one roof. So the standard actually includes different ways of communication. For example, in one such communication protocol, there is a data field for instant power, meant for simple displays, as well as accumulated power, from which you can either calculate average power (as the difference between current and last accumulated power; that is, I believe, what the WASP does to calculate the average power data field) or correct accumulated statistics like TSS.
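As a rough illustration of the accumulated-power idea, the following Python sketch derives an average power from two consecutive messages. It assumes each message carries an event counter and an accumulated power value that wrap around at 256 and 65536 respectively; these widths are my assumption from memory of the ANT+ bike power profile and should be checked against the official document. I also don’t know whether the WASP does exactly this.

```python
def average_power(prev, curr, count_mod=256, power_mod=65536):
    """Average power in watts between two messages.

    prev and curr are (event_count, accumulated_power_w) tuples.
    The modulo values are assumed counter widths (see note above).
    Returns None if no new event arrived between the two messages.
    """
    d_events = (curr[0] - prev[0]) % count_mod   # handles counter rollover
    d_power = (curr[1] - prev[1]) % power_mod    # handles accumulator rollover
    if d_events == 0:
        return None
    return d_power / d_events
```

Because the accumulated value keeps counting even when a message is lost in transmission, the average calculated this way stays correct across dropped messages, which is not true for a recorded stream of instant power values.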

Second, current crank-, spider- and pedal-based meters all rely on a torque sensor and a basic physics formula that requires cadence to calculate power. (That’s also the issue with oval rings, which kill the assumption of a constant cadence.) For these kinds of short bursts, even using accelerometers instead of a simple magnet that only triggers once for every crank rotation, it may be difficult to sense cadence accurately. (If they could, they’d all be able to provide correct data for oval rings, too.) So, these short bursts are likely outside of the unpublished working conditions of these power meters. (Even Verve Cycling, which gives more information about working conditions than the other manufacturers, doesn’t say how responsive their cadence data would be in such a condition.)

So, with this knowledge and including all relevant data, the above graph looks like this:

20150810 Vector2 vs Pioneer bursts with average power filled with cadence

Although not visible from the data alone (trust me with this!), I had stopped pedaling between the bursts, so, cadence data from both units are messed up, meaning that all the power data doesn’t look trustworthy to me (unless they are internally calculating with some other cadence data that they don’t send over ANT+). On the good side, Vector’s average power data seems to avoid the effect of sticking to high power values even after the burst has ended that I observed with their instant power data and on the Garmin Edge display. (Actually, they might have designed instant power this way so that you don’t miss data when briefly looking at your computer during a ride.)

Now, when we compare the Pioneer in its proprietary pedaling-monitor mode, we get this:

20150811 Vector2 vs Pioneer in pedaling monitor mode

In pedaling-monitor mode, the Pioneer’s data cannot be recorded with the WASP, so I had to export from Cyclo-Sphere, convert the .fit file to .csv using GoldenCheetah, and manually align the two data sets as well as possible (note: a perfect alignment is not possible with devices that are not synchronized).

Now, the interesting thing here is that the Pioneer’s cadence data from Cyclo-Sphere looks much better than the one I got via ANT+, probably also contributing to power data that is closer to data from the Vector, although still lower, and there is not much of a delay compared with Vector.

Since I’d gotten myself already knee-deep into this, I also briefly swapped pedals and compared Pioneer in ANT+ mode with PowerTap’s P1.

20150811 P1 vs Pioneer corrected colors

A few interesting observations: The P1 does not distinguish between instant and average power, which I personally like (deep down I’m an honest person, or so I imagine, hence I love my equipment to be honest too). The length of the bursts in the P1 data seems correct too, although I don’t have any data to back that up. On the other hand, the Pioneer seems to distinguish between cadence data available via Cyclo-Sphere and the “instant” cadence data from their ANT+ stream, possibly making their Cyclo-Sphere power data more plausible than the power data from their ANT+ stream.

Follow-up (August 15th):

Here is a set of Vector2 vs Power2Max comparison data. Slightly different setup, with the data set taken on a roller and not the KICKR.

20150815 bursts p2m vs vector

Observations: Generally there is quite some difference between Vector and P2M. Between 21s and 60s I did some single crank rotations, which are much better picked up by the P2M. For two and more crank rotations, there is more agreement between the two in both start timing and power value, but the Vector seems to take longer to notice stops. The P2M does not distinguish between instant and average power, while the Vector again has some larger differences between them.

Conclusion (revised on August 11th):

For normal riding, all three power meters seem pretty much good enough to me.

If anyone wants correct data for very short bursts, there is a fundamental limitation here: a crank- or pedal-based power meter depends on how exactly it can measure cadence during such a brief burst, and that remains difficult even when using accelerometers or mounting the cadence magnet to utilize polarization change for higher accuracy when sensing crank position.

So, my recommendation would be, to either try a hub-based power meter (although I admit I’ve never used one before and have no idea how they’d perform under such conditions) or go with the P1 (which seems to provide honest data, an impression that also somewhat aligns with their claim of using a large number of sensors (8) and enough computation power).

Personally, I highly value the realtime pedaling analysis data that the Pioneer power meter gives when combined with their head unit, which can be helpful for understanding and changing pedaling technique (whereas I personally found the advanced metrics of the Vector and Garmin’s visualization on the newer Edge units less useful, but that might be just me). Therefore, as a total package, I’d still think that the Pioneer will have the most impact on someone’s cycling performance, although only in combination with their head unit and if you ride regularly indoors and are concerned about pedaling technique. (Yes, choices are never easy.) And if you really need data from bursts using Pioneer, maybe look at Cyclo-Sphere data and not their ANT+ stream.

A common way to reduce noise is to use something called a Kalman filter or to do at least some simple averaging; both necessarily delay the data output. It seems the Pioneer has been engineered more towards reducing erroneous spikes than the Vector2 and the P1, or it’s simply looking at a longer time window given that it was fundamentally designed as a pedaling monitor and averages less over crank rotation.
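The delay caused by averaging is easy to demonstrate. The following Python sketch uses a plain trailing average (not a Kalman filter, and certainly not whatever the manufacturers actually implement) on a step from 100W to 300W and counts how many samples pass before the output reaches a given level. All names are invented for the example.

```python
def trailing_mean(samples, window):
    """Average of the last `window` samples (fewer at the very start)."""
    out = []
    running = 0.0
    for i, s in enumerate(samples):
        running += s
        if i >= window:
            running -= samples[i - window]   # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


def samples_to_reach(filtered, start, target):
    """Number of samples after index `start` until `filtered` reaches `target`."""
    for i in range(start, len(filtered)):
        if filtered[i] >= target:
            return i - start
    return None


if __name__ == "__main__":
    step = [100] * 10 + [300] * 10           # power jumps at sample 10
    for window in (1, 3, 5):
        smoothed = trailing_mean(step, window)
        delay = samples_to_reach(smoothed, 10, 300)
        print(f"window {window}: full power shown after {delay} samples")
```

With 1 Hz data, a 5-sample window shows the full power only 4 seconds after the step, and a 1 to 2 second burst never reaches its true value in the averaged output at all.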

Other thoughts:

The general consensus in cycling data collection to use 1 second sampling seems old, considering how much communication bandwidth and memory capacity is nowadays available and that most power meters are actually sending at a higher rate. A higher rate could simplify simultaneous correct recording and undelayed display under all conditions including quick bursts.

But even at 1Hz, one should expect “instant power” to be instant possibly including spikes and “average power” to give correct data when accumulated over time. Power meter manufacturers should make clear what their specific conditions for accuracy are.

If I was to design a power meter from ground up, I’d possibly integrate a high resolution optical rotational encoder in the bottom bracket that together with accelerometers would enable giving exact rotational position and velocity, solving both oval ring issues as well as accuracy under bursts.

The comparison was also restricted by the WASP iOS app being able to record only at 1Hz and not all the data that the power meters are sending, which would have allowed for more exact assessments of delay time. I was not able to check yet whether the WASP’s ANT+ to WIFI bridge functionality filters down to 1Hz (I need to sign their NDA first!) or whether it is a restriction of their iOS app. Nor was I able to find a PC or Mac application for simply recording all ANT+ traffic.

The processing involved here is possibly somewhat comparable to high ISO noise reduction in digital cameras which reduces noise patterns but also image details. It might be good if power meter manufacturers made these noise reduction levels user-configurable as in higher-end digital cameras, empowering the user to choose the processing that is best for their usage.