Objectives Consumer-grade smart devices are now commonly used by the public to measure waking activity and sleep. However, the ability of these devices to accurately measure sleep in clinical populations warrants more examination. The aim of the present study was to assess the accuracy of three consumer-grade sleep monitors compared with gold standard polysomnography (PSG).
Design A prospective cohort study was performed.
Setting Adults undergoing PSG for investigation of a suspected sleep disorder.
Participants 54 sleep-clinic patients were assessed using three consumer-grade sleep monitors (Jawbone UP3, ResMed S+ and Beddit) in addition to PSG.
Outcomes Jawbone UP3, ResMed S+ and Beddit were compared with gold standard in-laboratory PSG on four major sleep parameters—total sleep time (TST), sleep onset latency (SOL), wake after sleep onset (WASO) and sleep efficiency (SE).
Results The accelerometer Jawbone UP3 was found to overestimate TST by 28 min (limits of agreement, LOA=−100.23 to 157.37), with reasonable agreement compared with gold standard for TST, WASO and SE. The Doppler radar ResMed S+ device underestimated TST by 34 min (LOA=−257.06 to 188.34) and had poor absolute agreement compared with PSG for TST, SOL and SE. The mattress device, Beddit, underestimated TST by 53 min on average (LOA=−238.79 to 132) and had poor reliability compared with PSG for all measures except TST. High device synchronisation failure occurred, with 20% of recordings incomplete due to Bluetooth drop out and recording loss.
Conclusion Poor to moderate agreement was found between PSG and each of the tested devices; however, Jawbone UP3 had relatively better absolute agreement with PSG than the other devices. The consumer-grade devices assessed do not have strong enough agreement with gold standard measurement to replace clinical evaluation and PSG sleep testing. The models tested here have been superseded, and newer models may have increased accuracy and thus be potentially powerful patient engagement tools for long-term sleep measurement.
- sleep medicine
- information technology
- respiratory medicine (see thoracic medicine)
Data availability statement
Data are available upon reasonable request. The dataset will be available upon emailed request to the corresponding author.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
Consumer grade devices were compared with gold standard in clinic patients.
More than one device was included for comparison.
This study includes measures of sleep parameters that clinicians frequently need to review in daily practice, such as total sleep time and sleep efficiency.
High device failure was found in this study, confirming that consumer grade devices cannot be used to replace high fidelity diagnostic measurement.
This sample had patients with sleep apnoea, insomnia or hypersomnia as their final sleep diagnosis.
Poor sleep quality and duration have been shown to be independent risk factors for overall mortality and for many chronic diseases.1 The gold standard test for the measurement of sleep and diagnosis of sleep disorders is attended polysomnography (PSG). However, this is an involved and costly test that requires complex equipment, dedicated space and trained staff, and it does not lend itself well to multi-night monitoring.
Sales of consumer sleep monitors and wearable consumer-grade smart devices have dramatically increased in recent years, with 33 million units estimated to have been sold in the USA in 20152 and the estimated value of the wearable industry in the USA expected to grow to US$8.5 billion in 2020.3 4 Consumer-grade devices fall into three major categories: (i) wrist-based devices (eg, Jawbone, FitBit); (ii) bedside devices (eg, ResMed S+, Touch-Free Life Care) and (iii) mattress-based devices (eg, Beddit, EarlySense Mattress, Emfit Bed Sensor). Each category of device uses unique proprietary algorithms for inferring wake/sleep, body position and measures of sleep quality.
The Jawbone UP (the precursor to the UP3 used in this study) has been compared with PSG in adolescents and concluded to have good agreement for total sleep time (TST), sleep efficiency (SE) and wake after sleep onset (WASO); however, the tendency to underestimate TST and SE increased with age.5 In a study of adult women, the FitBit Charge HR overestimated TST by 27 min, and was found to have significantly different sleep onset latency (SOL) and WASO compared with PSG.5 Similarly, in adolescents the Jawbone UP tended to overestimate TST and SOL, while underestimating WASO. The researchers also found greater discrepancies in nights when participants had more disrupted sleep (ie, lower TST and greater SOL and WASO).5 In patients with suspected central disorders of hypersomnolence, the Jawbone UP3 was found to significantly overestimate TST by an average of 39.6 min compared with PSG and was not able to discriminate stages of sleep adequately.6 Interestingly, the Jawbone UP3 performed similarly to actigraphy in this study. Another clinical study found that the FitBit Flex overestimated TST more in a group of insomnia patients compared with good sleepers (32.9 min vs 6.5 min).7 Taken together, these two studies suggest that consumer-grade sleep devices are less accurate at measuring TST in a clinical sleep disorder population than they are for good sleepers.
The Beddit mattress-based device has been found in 10 healthy controls to have poor agreement with PSG for TST (overestimated by 43.5 min), WASO and SE.8 SOL was the only measure to have agreement, but had a wide variance.8 The sensor technology used in the ResMed S+ device has been shown to have moderate accuracy in measuring TST and SE in healthy volunteers compared with PSG and high specificity.9 10 Furthermore, its utility in measuring sleep disordered breathing has been investigated and it was found to have reasonable accuracy in detecting moderate obstructive sleep apnoea, with a sensitivity of 89% and a specificity of 92%.11
Patients are increasingly attending sleep clinics with downloads from consumer-grade devices for discussion with primary care physicians and sleep specialists. These commonly encountered situations in the sleep clinic raise the questions: how reliable are consumer-grade devices, and which type of technology is most comparable to gold standard? This study aims to answer these questions with an in-laboratory comparison of PSG with three consumer devices (Jawbone UP3, Beddit and ResMed S+) in a sleep clinic population. It was hypothesised that these devices would have similar accuracy in detecting TST, SOL, WASO and SE.
Fifty-four adult patients were consecutively recruited through a private sleep disorders centre in Melbourne, Australia from June 2015 to February 2016. Inclusion criteria were age >18 years and any patient who required overnight PSG as standard investigation following sleep physician review to either confirm or exclude sleep disordered breathing. All patients attending the laboratory for a polysomnogram were screened for inclusion. Exclusion criteria were age <18 years, positive airway pressure titration study, pregnancy and cognitive impairment. Figure 1 demonstrates the Consolidated Standards of Reporting Trials statement.
All assessments took place at an attended sleep laboratory in Melbourne, Australia. Sleep laboratory staff were trained to set up the three devices in addition to regular overnight PSG monitoring; lights out time was noted for synchronisation across all devices. The primary outcome measure was TST and secondary outcomes were sleep onset latency (SOL, min), SE (%) as TST/(TST+total wake time) and WASO (min). Other measures from the consumer-grade devices, such as time spent in light, deep or rapid eye movement sleep, were not compared in this analysis.
PSG was measured using a standard six-channel electroencephalography, submental electromyography and electrooculography, ECG, airflow (thermistor and nasal cannula), respiratory effort, oximetry, snoring (dB sound meter), body position, pulse rate, leg electromyography and digital video, recorded according to American Academy of Sleep Medicine standards.12 The following standard sleep parameters were recorded via PSG: TST, SOL (min), total wake time (TWT, min), SE (%) as TST/(TST+TWT) and WASO (min). Participants were classified as having obstructive sleep apnoea if the apnoea hypopnoea index was >5 events/hour. A single registered polysomnographic technologist scoring the PSG was blinded to the download of consumer-grade devices and raw data were scored using Compumedics amplifiers and Profusion software V.3 (Compumedics, Abbotsford, Victoria, Australia).
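The sleep efficiency definition above, SE (%) as TST/(TST+TWT), can be written as a one-line calculation. The following is a minimal sketch only; the function name and the example values are hypothetical and are not study data.

```python
def sleep_efficiency(tst_min: float, twt_min: float) -> float:
    """Sleep efficiency (%) = TST / (TST + TWT) * 100, with both inputs in minutes."""
    recording_min = tst_min + twt_min
    if recording_min <= 0:
        raise ValueError("TST + TWT must be positive")
    return 100.0 * tst_min / recording_min


# Hypothetical night (illustrative values only, not taken from the study):
# 360 min asleep and 90 min awake over the recording.
print(sleep_efficiency(360, 90))  # 80.0
```

Because SE is a ratio of TST to total recording time, any device error in TST or in wake time carries directly into SE.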
Participants were fitted with the Jawbone UP3 on the non-dominant wrist shortly before lights out time. Data were collected via a dedicated iPod Touch, synced to the Jawbone app V.18.104.22.168 This consumer-grade actigraphy device has a three-axis accelerometer and heart rate monitor, which together measure TST, SOL, WASO and SE; these were exported by a technician the following morning after the PSG was complete.
The ResMed S+ is a non-contact radio-frequency sensor that continuously measures the biomotion due to breathing and body movement in bed. The sensor operates in a license-free band at 5.8 GHz, emits an average power of less than 1 mW and is capable of sensing movement and breathing over a distance ranging from 0.3 to 1.5 m. The device was positioned by the bedside and synced shortly before lights out time to a dedicated iPod with the ResMed S+ app V.22.214.171.124 Measurements from the ResMed S+ were TST, SOL, WASO and SE, which were exported by a technician the following morning after the PSG was complete.
The primary sensor in the Beddit is a piezoelectric 70 cm band that was attached to the mattress prior to patients getting into bed. The device detects micro-movements of the chest wall from heartbeats and respiration and uses ballistocardiography to infer sleep stage and time. Ballistocardiography is a non-invasive measurement of cardiac output and respiration by converting mechanical motion (eg, movement generated by a heartbeat) to a digital signal. Measurements from the Beddit were taken each night using the device synced to a dedicated iPod running the Beddit app V.1.15 Output from the app included TST, SOL, WASO, SE and heart rate, which were exported by a technician the following morning after the PSG was complete.
Each of the three non-invasive devices was compared with PSG as the gold standard on an intention to treat basis. The primary and secondary outcomes were compared on total measurements over the night, not by an epoch-by-epoch method. Summary statistics of the study population are presented. Normally distributed continuous variables are presented as mean and SD, and non-normally distributed variables as median and IQR. Normality was assessed using the Shapiro-Wilk test. Frequencies and proportions are presented for categorical variables. The extent of agreement and reliability between gold standard and each of the selected test devices was assessed using intraclass correlation coefficients (ICCs) with a two-way random-effects model. Agreement was considered moderate, good and excellent if the ICC values were between 0.5 and 0.75, 0.75 and 0.9 and >0.9, respectively.16
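As an illustration of the ICC approach described above, the sketch below computes a two-way random-effects ICC from ANOVA mean squares and applies the stated interpretation bands. It rests on assumptions: the text does not say which ICC form was used, so the single-measurement, absolute-agreement form (often written ICC(2,1)) is taken here; values below 0.5 are labelled 'poor' and band boundaries are treated as inclusive at the lower end. Function names and the toy data are hypothetical, and this is not the authors' R code.

```python
def icc_two_way_random(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` holds one row per participant and one column per method,
    e.g. [psg_tst, device_tst] in minutes.
    """
    n = len(ratings)      # number of participants
    k = len(ratings[0])   # number of methods being compared
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for participants, methods, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def classify_icc(icc):
    """Interpretation bands from the text; 'poor' below 0.5 is an assumption."""
    if icc > 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"


# Toy data (not study data): the device reads exactly 1 unit above the reference.
toy = [[1, 2], [2, 3], [3, 4]]
value = icc_two_way_random(toy)
print(round(value, 3), classify_icc(value))
```

Under the absolute-agreement form, a constant offset between methods lowers the ICC even when the two methods rank participants identically, which is why a systematic over- or underestimate can reduce agreement.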
Additionally, Bland-Altman plots17 were used to visualise the agreement between gold standard PSG and each of the selected devices. The average of the two measurements was plotted on the x-axis and the difference between the two on the y-axis. The mean of the differences provided an estimate of the average bias between the methods. The upper and lower limits of agreement (LOA) were calculated as the mean difference (gold standard minus selected method)±2 SD. The LOA estimate the interval within which a given proportion of differences between the measurements is likely to lie, and were used to determine whether the methods can be used interchangeably. Cohen’s d is reported for the magnitude of the effect size. In the case of non-normally distributed data, effect size ‘r’ was calculated by dividing the Z statistic by the square root of the sample size (N). Interpretation of r is 0.10 to <0.3 (small effect), 0.30 to <0.5 (moderate effect) and ≥0.5 (large effect).18 Data were analysed using R (V.4.0.4) (https://www.r-project.org/) (R Core Team, 2017).
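The Bland-Altman quantities and the effect size r described above can be sketched in a few lines. This is a hedged illustration, not the authors' analysis code: the function names and toy values are hypothetical, the difference is taken as gold standard minus device as stated in the text (reversing the order only flips the signs), the sample SD is assumed, and the 'negligible' label for r below 0.10 is an assumption because the text gives no band for it.

```python
import math
import statistics


def bland_altman(reference, device, k_sd=2.0):
    """Return per-pair means and differences, the bias, and the limits of agreement.

    Differences are reference minus device; LOA = bias +/- k_sd * SD of differences.
    """
    diffs = [r - d for r, d in zip(reference, device)]
    means = [(r + d) / 2 for r, d in zip(reference, device)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return {
        "means": means,   # x-axis of the Bland-Altman plot
        "diffs": diffs,   # y-axis of the Bland-Altman plot
        "bias": bias,
        "loa": (bias - k_sd * sd, bias + k_sd * sd),
    }


def effect_size_r(z, n):
    """Effect size r = |Z| / sqrt(N) for non-normally distributed data."""
    return abs(z) / math.sqrt(n)


def classify_r(r):
    """Bands from the text; 'negligible' below 0.10 is an assumed label."""
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "moderate"
    if r >= 0.1:
        return "small"
    return "negligible"


# Toy values (not study data), in minutes.
result = bland_altman([10, 20, 30], [12, 18, 33])
print(result["bias"], result["loa"])
print(classify_r(effect_size_r(2.0, 16)))
```

Wide limits of agreement around a small bias indicate that individual nights can differ substantially between methods even when the average difference is close to zero.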
Patient and public involvement
Patients at our sleep disorders centre sparked the interest to assess the accuracy of consumer-grade sleep monitors. Our clinicians were often asked about the accuracy of home sleep monitors. To answer this question our team invited the patients to be involved in evaluating three commonly available consumer-grade smart devices. Participants were not paid for their involvement but did provide written consent. The findings of this research suggest that consumer-grade sleep monitors can give insights into trends in sleep but are not accurate enough to replace laboratory measurement.
Fifty-four adult patients (57% females) with a mean age of 48.09 (±SD 18.05) years participated in this study. Table 1 presents demographics of the study population. The final sleep diagnosis was obstructive sleep apnoea in 33 (61%), insomnia in 9 (17%) and central hypersomnolence disorder in 12 (22%) participants. The mean PSG detected TST was 371 min (SD ±69), SOL 16 min (SD ±15), WASO 63 min (SD ±56) and SE 82% (SD ±13%). The absolute values of the measurements for each device are summarised in table 2. The results of the Bland-Altman analyses and intraclass correlation are summarised in table 3 and displayed in figures 2–4.
On average Jawbone UP3 overestimated TST by 28.57 min (LOA=−100.23 to 157.37). On inspection of the Bland-Altman plot (figure 2A), the cluster of points surrounded the mean tightly between 300 and 400 min and there was greater variability with TST below 300 min and above 400 min. The magnitude of effect size was small (d=0.44). A moderate degree of reliability for recording TST was found between PSG and Jawbone UP3 with an ICC of 0.6 (95% CI 0.34 to 0.77; p<0.001).
The Bland-Altman plot (figure 2B) suggests that the mean difference in SOL between the two methods was very small: on average Jawbone UP3 measured SOL 0.14 min (LOA=−39.95 to 40.23) more than the gold standard. The cluster of points surrounded the mean tightly on the left, with greater variability for values over 20 min. The magnitude of difference was small (r=0.13). The reliability between the two methods was poor to moderate (ICC=0.29; 95% CI −0.04 to 0.57; p=0.04).
Jawbone UP3 overestimated WASO only slightly, by 1.7 min (LOA=−102.32 to 105.71, d=0.03) compared with PSG. Greater variability was seen for measurements over 50 min (as shown in figure 2C), indicating better estimation of WASO by Jawbone UP3 at lower values. The agreement between Jawbone UP3 and PSG for WASO was poor to moderate (ICC=0.55; 95% CI 0.29 to 0.73; p<0.001).
The mean difference in SE between the two methods indicated that on average Jawbone UP3 measures SE 0.51% (LOA: −18.96 to 19.99) less than the gold standard. This bias seems to be due to measurements less than 85%, with better estimation of SE by Jawbone UP3 at higher SE, as seen in figure 2D. The magnitude of difference was small (d=0.05). The ICC for agreement between Jawbone UP3 and PSG regarding SE was 0.66 (95% CI 0.41 to 0.81; p<0.001), indicating poor to good reliability between the two measures based on the 95% CI.
As shown in figure 3A, on average ResMed S+ underestimated TST by 34 min (LOA=−257 to 188 min). The mean difference between ResMed S+ measured and PSG measured TST lay below zero, suggesting a bias. The points remained in the same general pattern for all x-axis values, except for a few outliers at lower mean values. The magnitude of difference was moderate (r=0.4). The ICC was 0.36 (95% CI 0.02 to 0.63; p=0.02), indicating poor to moderate reliability.
Conversely, ResMed S+ overestimated SOL by 35.6 min (LOA=−57.68 to 128.89) and the effect size was large (r=0.8). The cluster of points goes from below the mean at short SOL to above the mean with increasing SOL, showing proportional error and suggesting increasing overestimation of SOL by ResMed S+ at longer SOL durations, as shown in figure 3B. Poor agreement for SOL was seen between the two methods (ICC=−0.01; 95% CI −0.21 to 0.26; p=0.51).
Similarly, ResMed S+ recorded WASO 27 min more than PSG (LOA=−73.53 to 127.91) and a large effect was found (r=0.52). Visual inspection of the Bland-Altman plot (figure 3C) suggested that ResMed S+ increasingly overestimated WASO with increasing time. Reliability between methods was poor to good (ICC=0.61; 95% CI 0.28 to 0.8, p<0.01).
Visual inspection of the Bland-Altman plot figure 3D suggests that on average ResMed S+ underestimated SE by 16% (LOA=−54.06 to 22.31). The effect size was large (r=0.8) and an ICC value of 0.28 (95% CI −0.06 to 0.58; p=0.06) was found. Moreover, the mean difference was not constant, with greater variability at lower values (particularly below 80%), showing proportional bias.
The Beddit and PSG had the least agreement for all outcomes except TST compared with other devices. TST was underestimated by 53 min (LOA=−238.79 to 132). As demonstrated in figure 4A, the cluster of points shifted from below mean to above mean with increasing TST, showing a proportional error depending on the duration of sleep. The magnitude of difference was large (r=0.55) and reliability poor to moderate (ICC=0.40; 95% CI 0.09 to 0.63; p=0.01).
SOL was overestimated by 45 min (LOA=−74.09 to 163.33) by the Beddit compared with PSG. The points clustered tightly above the mean and moved from above to below the mean from left to right (figure 4B), showing error proportional to the duration of SOL. The effect size was large (r=0.78) and reliability poor (ICC=0.004; 95% CI −0.173 to 0.22; p=0.48).
Beddit slightly underestimated SE by 1.35% (LOA=−38.81 to 36.11). As shown in figure 4C, variability of points was constant around the mean at values below 80%. This suggests that at higher values, Beddit estimated SE more closely to the PSG gold standard. The effect size was small (r=0.13) and agreement was poor (ICC 0.26; 95% CI −0.04 to 0.51; p=0.06).
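Because the limits of agreement were defined as the mean difference ±2 SD, the bias and the SD of the differences can be recovered from any reported pair of limits: the bias is the midpoint and the SD is a quarter of the width. The sketch below applies this to the TST limits reported for the three devices. The function name is hypothetical and the ±2 SD multiplier is taken from the Methods. The recovered midpoints come out at roughly 28.6, −34.4 and −53.4 min, in line with the reported TST biases of 28.57, 34 and 53 min.

```python
def bias_and_sd_from_loa(lower, upper, k_sd=2.0):
    """Recover the bias (midpoint) and SD of differences from limits of agreement."""
    bias = (lower + upper) / 2
    sd = (upper - lower) / (2 * k_sd)
    return bias, sd


# TST limits of agreement (min) as reported in this article.
reported_tst_loa = {
    "Jawbone UP3": (-100.23, 157.37),
    "ResMed S+": (-257.06, 188.34),
    "Beddit": (-238.79, 132.0),
}

for device, (lower, upper) in reported_tst_loa.items():
    bias, sd = bias_and_sd_from_loa(lower, upper)
    print(f"{device}: bias {bias:.2f} min, SD of differences {sd:.2f} min")
```

The same midpoint check is a quick way to spot a sign or transcription error in a reported interval, since the reported bias should sit at the centre of its limits.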
Consumer-grade recording failure
Consumer-grade devices were set-up by Sleep Scientist staff each night at the time of the standard PSG set-up. Despite this, device or recording failure resulting in inability to record sufficient data, on the single night of recording, in the consumer-grade devices was common. Failure to synchronise with the dedicated Bluetooth device was the most common reason for device failure. The ResMed S+ failed to synchronise the most, with 25/54 nights (46%) resulting in recording failure. The Jawbone and Beddit had similar rates of synchronisation failure (12/54, 22%), however, not usually in the same room or on the same patient. Comparisons were made on an intention to treat analysis, even where large differences in TST were seen.
The agreement of three consumer-grade smart devices with gold standard attended PSG has been compared simultaneously in an adult sleep clinic cohort. For each of the devices, there were components of sleep measurement with poor to moderate agreement with the gold standard. This study found that the primary outcome measure of TST was overestimated by Jawbone UP3, whereas both ResMed S+ and Beddit underestimated it. The Jawbone UP3 also overestimated SOL and WASO; however, the magnitude of difference was very small. Generally Jawbone UP3 had better agreement across all outcomes; however, for SE agreement was better between ResMed S+ and PSG. The Beddit had the least agreement with PSG, all components having poor agreement when compared with gold standard PSG.
Wearable devices, particularly wrist-worn accelerometers, have now been widely compared with PSG. Similar to the results of this study, accelerometers have been shown to overestimate TST by around 20–30 min, particularly in sleep disordered populations compared with healthy controls.5 7 19 Previous investigations into consumer-grade accelerometers in clinical populations found TST overestimated by 32.9 min7 in a population of 33 insomnia patients and by 39 min in 43 hypersomnolence patients.6 In our study, SOL had a large CI, with bias found with measurements over 15 min, consistent with findings of a recent systematic review and meta-analysis.20
The Beddit device and mattress devices in general are among the least studied consumer-grade devices. Tuominen et al8 found in 10 healthy controls that the Beddit overestimated TST by 43 min, whereas our data suggest a significant underestimation (PSG TST 371 min vs Beddit TST 321 min) with a larger sample size (n=42). Tuominen et al8 were also able to access WASO data, which were not available with the model of Beddit tested in this study, and found that the Beddit underestimated WASO by 32 min. Non-wearable devices have a potential growing market as non-intrusive home monitors of sleep, as they can be applied in a ‘set and forget’ method. Thus, further refinement and evaluation of bed-based devices would be desirable.
Chinoy et al10 recently compared PSG to ResMed S+ and to SleepScore Max with a population of 19 young ‘healthy normal’ individuals. The ResMed S+ was found to have underestimated TST by only 0.3 min (95% CI −70.7 to 70.2) and the SleepScore Max to have overestimated TST by 7.5 min (95% CI −60.7 to 75.7). A likely explanation for the difference between these findings and the present study is the difference in population: ‘healthy normal’ participants versus a sleep clinic population. There is growing literature that consumer-grade devices have lower accuracy in clinical populations compared with control populations.21 Notably, Chinoy et al10 found 2/19 nights (10.5%) using the ResMed S+ were impacted by device synchronisation issues, requiring device re-synchronisation.
The high device synchronisation failure rate also observed in our study is concerning, despite the set-up being performed by sleep laboratory scientific staff. There is no way to calibrate these consumer-grade devices over time and it is difficult to monitor device connectivity to the Bluetooth device until the next morning. The high failure rate further confirms the role of these consumer devices is not to replace that of a diagnostic sleep study.
The main strength of this study was the sample size and that it was conducted in a clinical adult sleep population with a range of suspected sleep disorders. This makes the findings more translatable to clinicians managing patients with sleep disorders. Further, assessing a number of different devices is a novel approach. The weaknesses of the study include a high device recording failure rate, predominantly with Bluetooth synchronisation failure. Epoch-by-epoch analysis was not performed. Further, sales of devices tested in this study have since been discontinued. Beddit was acquired by Apple Inc in May 2017 and an updated device, the Beddit 3.5, was relaunched with reportedly improved integration with mobile phone health kits.22 The ResMed S+ was discontinued and a similar device was subsequently launched in 2017 under the SleepScore Labs name, which is similarly Apple iOS and Android integrated.23 Jawbone, however, has gone into liquidation with no subsequent models leading on from the UP3 device.24
This study indicates that the wrist worn Jawbone UP3 had the best agreement in measuring sleep compared with gold standard and can provide useful information about commonly measured parameters of sleep quality. For sleep medicine clinicians, the translation of these findings is that when our patients present with longitudinal measurements of sleep from their consumer-grade devices, we can be reassured that wrist worn devices have reasonable accuracy and can be harnessed as an engagement tool for behavioural sleep interventions. This message is consistent with the American Academy of Sleep Medicine’s position statement about the use of consumer-grade sleep devices, which states that these devices cannot be used for clinical diagnosis; however, they allow for meaningful discussions with patients about sleep and encourage active participation in sleep-related healthcare.25
Given the large body of literature linking sleep quality to mortality and many chronic diseases, patient-collected longitudinal sleep data provides a powerful insight into a patient’s overall health. This study adds to the data of consumer grade wearable sleep monitors, showing they can provide some reliable information compared with gold standard PSG, however do not replace clinical evaluation and gold-standard PSG sleep testing. In reviewing sleep data collected by patients with consumer-grade devices, clinicians are encouraging measurement and quantification of sleep, which in turn will likely emphasise the importance of quality sleep in maintaining good health.
Patient consent for publication
The study was approved by the Human Research and Ethics Committee of St Vincent’s Hospital, Melbourne (LRR141/15).
The authors thank the sleep laboratory staff at St Vincent’s Private Hospital, East Melbourne for their set-up efforts; Telstra Corporation Ltd (Australia) for the provision of the Jawbone UP3; ResMed (San Diego) for the ResMed S+; and Beddit Ltd (Finland) for the supply of the test devices used. The authors acknowledge the statistical support received through the Metro South Health Biostatistics Service.
Contributors CME was involved in the protocol preparation, participant consent, data collection, analysis, manuscript preparation and is the manuscript guarantor. SFZ was involved in the data curation and analysis and manuscript preparation. HM was involved in data analysis and manuscript preparation. RJ was involved with participant consent, data collection and manuscript preparation. DC and JS were involved in protocol preparation, data analysis and manuscript preparation.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests The Telstra Corporation Ltd (Australia) provided the Jawbone UP3 test devices used in the study, ResMed (San Diego) provided the ResMed S+ and Beddit Ltd (Finland) provided the Beddit device.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.