Note: we have updated our testing analysis (Aug 2017) and will be applying this retrospectively to past reviews over time. The results between the old style and new tests may not be directly comparable.
Because nobody is reviewing running GPS watches for trail runners in a comprehensive, candid, and objective way in NZ trail conditions. And conditions are everything with respect to GPS trail performance.
The point of our testing is to replicate how trail and ultra runners actually run in the real world. We turn the watch on, get a satellite fix and start running. We run in the sunshine, run in the dark, in the rain, in mud, in the bush, across open fields. But unlike you we also test the sh!t out of it (to paraphrase Mark Watney/ Neil deGrasse Tyson).
If not already obvious we are reviewing for running the trails, not tri, not road, not cross-fit. This means we prioritise some things more highly than others, such as – accuracy under canopy, endurance, and reliability (all day/all conditions). In other words some features and style are readily conceded if it means better core trail performance.
Each review comprises six sections – a summary, field testing, functional tests, a trail ready feature checklist, a general usability assessment, and long term usage verdicts. These are assessed against a baseline of results from other models reviewed to produce the summary and trail-runner type match.
The summary tries to fit the model to the trail runner. It gives an overall synopsis, best runner fit, long term outlook, and the good and bad of the watch.
Field tests are the real-world testing of the GPS watches. A number of standardised tests have been designed to provide meaningful, objective data to assess positional accuracy, distance, elevation, and battery life in NZ trail conditions. These tests are undertaken and repeated in various terrain and satellite availability conditions involving +100km of running per unit. There’s no substitute for doing the tests in situ, and obsessively repeating them till outcomes are either predictable, or knowingly unpredictable.
The functional tests put key features to work on common tasks like navigation, race pacing, and data exchange.
The feature checklist is made up of the core features we use in training and racing and expect in a trail ready watch. These are categorised as – trail standard, ultra running, and nice to have features.
The general usability assessment is about general design and functionality.
Long term usability verdicts are the consensus views of the MEC test lab members who either have +500km of running experience with the unit, or could no longer cope with the issues encountered and disposed of the unit.
The reviews are written in an ongoing journal type manner as time permits and new firmware/web service updates come on board. Not surprisingly these tests take some time, with a couple hundred kilometres needed for the field tests and +500km for the long term verdicts (so don’t hassle us, we have spouses, kids, jobs, and brewing to attend to).
The increasing dependence of features on paired web services, and a modern development cycle of a hardware release followed by multiple firmware updates, also limit the value of a static review. Updates to the review will be clearly identifiable, so as to be clear on any past issues encountered and resolved.
The Maungakiekie Endurance Club (MEC) is made up of a bunch of people who share a love of trail running and other endurance sports. Experience ranges widely from seasoned ultra-runners to those new to trail. The MEC test lab draws on this collective to provide the long term usability verdicts and match assessments.
What (can you expect)
These reviews are generally undertaken using a sample of one GPS unit per model. Intra-model variability is an unknown, and anecdotal reports without objective testing make interpretation difficult. Unless otherwise stated all GPS units are sourced via standard retail channels.
A good field test outcome here means the model in question is capable of performing, but good performance for you is not guaranteed. Conversely a bad outcome may mean either the model is a poor design, or that performance varies between units of that model (both of which are poor results for consumers).
The field tests were undertaken in a specific set of semi-controlled conditions to enable valid comparisons between models. While the results are generalizable to similar conditions (given a large enough sample), we have also made a reasoned assessment on likely performance in tougher/more fun conditions. Your mileage will definitely vary.
Two standard courses with multiple lap points were surveyed with respect to distance and absolute XYZ positioning. The courses were constructed using post-processed differential-GPS data validated and aligned against high resolution aerial photography. Elevation was determined using 0.5m LIDAR (laser measured elevation) data. These courses were confirmed in distance by running each course with a large diameter calibrated measuring wheel on multiple occasions.
Each course has various laps representing easy and more taxing GPS conditions (tree cover, terrain steepness, track bendiness). Neither course has conditions approaching the full difficulty of typical NZ bush trails. But if performance suffers in these conditions then the unit has little hope of accuracy in the real thing.
Lastly, the units are taken out onto known but un-surveyed courses representing typically difficult NZ GPS conditions, including gorges, steep terrain, and heavy tree cover. Absolute accuracy is not measured on these outings, but relative performance is assessed on repeated laps and in the context of the other results.
On the surveyed courses, distances and positional errors are measured (how far off the measured course each recorded track is). On the more challenging un-surveyed course positional error relative to each consecutive lap is assessed as well as the variation across GPS units.
Each surveyed course is run over multiple laps over multiple days with at least two GPS units in play (to check against truly anomalous GPS conditions). All activities are logged in terms of potential satellite reception (Geometric Dilution of Precision) to ensure results between models are comparable. Additionally some activities are pre-planned to check performance under intentionally poor GPS coverage conditions.
Sampling over multiple days is critical in testing accuracy as within session samples are not independent. If you’ve used a fitness GPS for a while, you’ll also know that they all have bad days. Testing needs to be sufficient to capture these occurrences.
Positional accuracy is measured using the GDAL libraries in R from raw GPX track points. It is measured as the distance to the closest surveyed course line, so will be rather more generous than absolute positional accuracy. Data is grabbed from the watch directly via the PC and converted to a GPX for the positional tests only.
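The actual workflow runs in R with GDAL, but the core offset calculation is simple enough to sketch. Here's an illustrative Python stand-in (not our production script), assuming trackpoints and course vertices have already been projected to planar metres (e.g. NZTM):

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the segment a-b, all in planar metres."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to its endpoints
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def track_offsets(trackpoints, course_line):
    """Minimum distance from each GPX trackpoint to the surveyed course line."""
    return [min(point_segment_dist(p, a, b)
                for a, b in zip(course_line, course_line[1:]))
            for p in trackpoints]
```

Being a closest-point measure, a trackpoint that drifts onto a different part of the course still scores well – hence the "more generous than absolute accuracy" caveat above.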
Distance and elevation are as recorded on the watch, since these are filtered/smoothed according to watch smarts and give different results from the GPX. Filtering and smoothing can potentially improve pace and distance figures as the rubbish GPS data is discarded, but can only do so much with dodgy data. Course distances are assessed by comparing reported lap distances (as shown on the watch) against the surveyed course distance. Elevation is assessed by comparing watch reported climb against surveyed course climb.
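Stripped back, the per-lap comparison reduces to a couple of lines. A minimal sketch (illustrative only, not our analysis scripts):

```python
def error_pct(watch_value, surveyed_value):
    """Signed error as a percentage; positive means the watch reads long/high."""
    return 100.0 * (watch_value - surveyed_value) / surveyed_value

def within_10pct(watch_climb, surveyed_climb):
    """True if watch reported climb is within 10% of the surveyed course climb."""
    return abs(watch_climb - surveyed_climb) <= 0.10 * surveyed_climb
```

The same signed-error figure is used for both distance (km) and climb (m); the later stats build on it.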
As well as the overall accuracy we classify each lap into a difficulty category depending on the GPS conditions. These categories are assessed using a GIS to identify percentage under tree cover, how bendy the track section is, and whether we are running in a gully or up against some steep terrain. Lap categories are ‘easy’, ‘mixed’, and ‘difficult’. We can then assess watch performance under each of these conditions independently so we can better understand watch performance WRT trail running. If you rarely venture into difficult conditions you can ignore those results.
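The measurement itself happens in a GIS, but the bucketing logic is straightforward. A hedged sketch (the thresholds below are invented for illustration – ours are tuned to the actual courses):

```python
def classify_lap(tree_cover_pct, sinuosity, in_gully_or_steep):
    """Bucket a lap by GPS difficulty. Thresholds are illustrative only."""
    hard_signals = sum([
        tree_cover_pct > 60,     # mostly under canopy
        sinuosity > 1.3,         # bendy track relative to straight-line length
        bool(in_gully_or_steep), # terrain blocking part of the sky
    ])
    if hard_signals >= 2:
        return "difficult"
    if hard_signals == 1:
        return "mixed"
    return "easy"
```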
Once we’ve collected +100km of data on the surveyed courses we start running the stats (hopefully a lot more than that if we can hang on to the watch long enough). These include –
- A long term average of distance accuracy to indicate if the watch records long or short on average (this is adjusted or balanced to ensure all watches spend equal time in the various GPS conditions)
- A long term average of elevation climbed accuracy that indicates if the watch under or over estimates total climbing
- An absolute average of elevation climbed accuracy, ignoring if it’s long or short – which stops long and short recordings from cancelling each other out
- How often the watch comes within 10% of the true course climb
And by GPS conditions (easy, mixed, difficult, and total) –
- An absolute average of distance accuracy which ignores if the watch is recording short or long, it just measures how wrong – this stops watches that record equally long and short from cancelling each other out
- How often the watch records at least 99% accurate laps – which is a means of picking out which watches are consistently reliable on a run by run basis
- How much the distance as you see it on the watch differs from the raw GPX (satellite recorded) distance
- Positionally how far off the recorded track is from the surveyed track – measured as the minimum distance from the GPX trackpoint to the survey line (this will be a lot more generous than the true GPS positional error)
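To make the signed vs absolute distinction concrete, here's an illustrative aggregation over laps (hypothetical data shape, not our actual scripts):

```python
from statistics import mean

def accuracy_stats(laps):
    """laps: list of (condition, watch_km, surveyed_km) tuples.
    Returns signed mean, absolute mean, and 99%-accurate lap rate per condition."""
    out = {}
    for cond in {c for c, _, _ in laps} | {"total"}:
        errs = [100.0 * (w - s) / s for c, w, s in laps
                if cond == "total" or c == cond]
        out[cond] = {
            "signed_mean_pct": mean(errs),                # long/short bias
            "abs_mean_pct": mean(abs(e) for e in errs),   # how wrong, full stop
            "pct_laps_99": 100.0 * sum(abs(e) <= 1 for e in errs) / len(errs),
        }
    return out
```

Note how a watch that records +1% and -1% on alternating laps scores a signed mean of zero but an absolute mean of 1% – exactly the cancelling-out problem the absolute average exists to avoid.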
Along with spitting out these figures we also calculate some stats to check if accuracy differences are likely to be real (i.e. statistically significant). In case you were wondering we use the Wilson score interval method for the 99% success scores and bootstrapped confidence intervals for the absolute distance accuracy.
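For the curious, those two interval methods look roughly like this. A stdlib-only Python sketch (our analysis runs in R; the z value here is the common 95% choice, not necessarily what a given review used):

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a success proportion (z=1.96 ~ 95% confidence)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def bootstrap_ci(values, reps=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    sims = sorted(sum(random.choices(values, k=len(values))) / len(values)
                  for _ in range(reps))
    return sims[int(reps * alpha / 2)], sims[int(reps * (1 - alpha / 2))]
```

The Wilson interval behaves sensibly at small sample sizes and near 0% or 100% success rates (which a 99%-accurate-laps score often is), and the bootstrap makes no normality assumption about the lap error distribution – both fit data of this shape better than the naive normal approximation.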
Finally we throw together all the factors from our courses and measurements to build regression models identifying which factors affect watch distance accuracy. By doing this we can see what impact each factor has on accuracy, so you can assess how the watch might perform for you. The factors we examine include –
- Satellite availability via a generalised GDoP score (averaged for the duration of the run)
- Track sinuosity (bendiness)
- Percentage of track under tree cover
- Watch firmware
- Average running speed
- Median trackpoint offset distance
- Raw GPX distance accuracy
As well as quantifiable results, the form of the GPX tracks are also described. This provides some insight to how the watch GPS performance copes under the conditions. Tracks are described in terms of smoothing (corner cutting), random scatter, and shadowing (tracks running parallel to actual position). Trackpoint density clouds also provide a visual summary of GPS performance.
The battery run down test is pretty much as it sounds. With the watch on its most accurate setting and an HR sensor paired, we run them to exhaustion in real trail conditions. If opportunities allow the watch is also tested with battery saving settings on.
Windows desktop software used for the testing includes QGIS, PostGIS, R, SportTracks, GPS Track Editor, and Bipolar. Relevant desktop apps by the manufacturer may also be used (eg. desktop Basecamp for Garmin).
Just cos there is a vague reference in the brochure doesn’t mean it does what you think it’s going to do. We put the watch to a number of set tasks involving data exchange, navigation, and pacing.
Data exchange is moving the data off the watch to native and non-native services and applications, both via mobile and desktop apps. Currently this includes Strava and SportTracks. The native web service is also checked to see if data can be extracted individually/in bulk, or imported to the service from other GPS models (in full fidelity).
Navigation includes a set task of creating a trail route (that auto-follows trails) and waypoints against a topo map. The course should have waypoint alert functionality cooked in, as on-watch breadcrumbs by themselves don’t help a whole lot when things get tough (sometimes we need all the assistance we can get). Additional stuff like a course elevation profile and elevation on waypoints, waypoint autolapping, distance to waypoint and ETA that respect the planned course are also checked. A basic ‘get me back to the start’ navigation function is checked. Also waypoints need to be able to be effectively collected and managed (imported and exported) from the watch or webservice.
Race pacing is simply a check of what function the watch has to run against a set race pace or time, or alternatively against a past recorded event (with changing pace).
Watches are assessed against a set of standard trail relevant features we consider core or pretty useful.
For us, a design point critical to lifelong athletes is adherence to open standards in both sensors and data formats. For one thing you tend to collect sensors. More importantly you amass data, and having this locked up in a proprietary format or service doesn't make for a good long term strategy. The ability to manage our data independently of a manufacturer's software and services is fundamental.
General Trail Running Feature Set
- GPS accuracy under canopy
- Consistent GPS performance
- Rapid GPS Acquisition
- Cadence option (on wrist or foot pod)
- Battery 8hr (with HRM)
- Barometer (for altitude)
- Basic breadcrumb with waypoint navigation
- Vibration alerts
- Trail legible display
- Open data access
- Sensor Standards Compliant
Standard Ultra Feature Set (as per trail running plus)
- Battery 14hr+ with HRM and high accuracy recording
- Battery 24hr+ with HRM and down-sampling
- Electronic compass
Nice to Have Features
- Mobile uploads (Android/Apple)
- HRV (R-R) recording with recovery estimate/test
- Footpod GPS override option
- Basic interval workout ability
- Pacing function
- Position/waypoint autolapping
- Feed/drink or run/walk timing reminders
- Everyday watch with:
  - Activity tracking
  - Mobile notifications