Part 2: What can 24 Ecologically Valid Hours and 102,740 Heart Beats Tell Us About the Accuracy of the Apple Watch 3 and the Fitbit Charge 2?

Study Description

This blog post is Part 2 of an investigation into the accuracy of the Apple Watch 3 and Fitbit Charge 2 as compared to the gold-standard electrocardiogram (ECG) during an ecologically valid 24-hour continuous paradigm that approximates the real-world conditions under which these devices were devised to be used by consumers.

If you haven't read Part 1, which only looked at analyses from the overall 24-hour period, then you can find it here. What started out as a nerdy study on myself to determine whether the Apple Watch 3 and Fitbit Charge 2 were accurate has now turned into a pre-pre-print manuscript. My mentor, Dr. Nick Allen, has provided some edits as we are planning to submit this for publication.

Everything that follows was preregistered with Open Science Framework.

Pre-Pre-Print
;-)

Introduction

Wrist-worn smart watches and fitness monitors, or wearables, have been widely adopted by consumers and are gaining increased attention from researchers for their potential contribution to digital measurement of health in big data studies, as they are scalable, unobtrusive, and provide greater ecological validity. These devices contain a multitude of sensors, including an optical sensor that uses photoplethysmography (PPG), allowing these devices to collect heart rate (HR). Recently, a variety of studies have examined the accuracy of wearable heart rate sensors as compared to an electrocardiogram (ECG; Boudreaux et al., 2017; de Zambotti et al., 2016; Gillinov et al., 2017; Kroll, Boyd, & Maslove, 2016; Shcherbina et al., 2017; Wallen, Gomersall, Keating, Wisløff, & Coombes, 2016; Wang et al., 2017), Polar chest strap (Dooley, Golaszewski, & Bartholomew, 2017; Stahl, An, Dinkel, Noble, & Lee, 2016), or pulse oximeter (El-Amrawy & Nounou, 2015) across various controlled laboratory conditions, including sitting; treadmill protocols for walking and running; cycling; weight training; and sleeping. Yet to our knowledge, no studies have tested the accuracy of HR sensors on these devices as they were devised to be used by consumers: out in the real world during a 24-hour ecologically valid paradigm that approximates actual consumer device use conditions.

Previous research comparing wearables to the gold-standard ECG has shown that wearables underestimate HR as compared to reference methods (Boudreaux et al., 2017; de Zambotti et al., 2016; Dooley, Golaszewski, & Bartholomew, 2017; Kroll, Boyd, & Maslove, 2016; Stahl, An, Dinkel, Noble, & Lee, 2016; Wallen, Gomersall, Keating, Wisløff, & Coombes, 2016; Wang et al., 2017). Prior research has also shown that the Apple Watch has greater accuracy than Fitbit devices (Shcherbina et al., 2017; Wallen et al., 2016; Wang et al., 2017). Specifically, prior research has found that the Apple Watch has lower overall error (Boudreaux et al., 2017; Dooley, Golaszewski, & Bartholomew, 2017; Shcherbina et al., 2017), the lowest standard deviation of the mean difference (Wallen et al., 2016), and higher agreement with an ECG than Fitbit devices (Boudreaux et al., 2017; Wang et al., 2017), but that wearable accuracy depends on activity (Gillinov et al., 2017). In particular, research shows that wearable devices are more accurate during rest and low-intensity exercise than during higher-intensity exercise (Boudreaux et al., 2017), which may be due to less movement of the wearable around the wrist, although this is not found in all studies (Dooley, Golaszewski, & Bartholomew, 2017).

The Current Study
This study was pre-registered (hypotheses, methods) on Open Science Framework (see osf.io/6w2sh). The objective of this study was to determine the HR accuracy of two of the most popular wearables, the Apple Watch 3 and Fitbit Charge 2, as compared to the gold-standard ECG. We used an ambulatory ECG to allow for continuous recording in real-world settings. A single-subject design was used for this initial study on the ecological validity of wearables because it allowed all potential confounding variables to be held constant, except for the wearable devices, and because it limited research subject burden from the time-consuming protocol (e.g., time markers have to be made each time the subject starts and stops an activity).

The current study hypothesized that 1) the Apple Watch 3 would be more accurate at measuring HR than the Fitbit Charge 2 when compared to an ambulatory ECG across all conditions, 2) both wearables would underestimate HR across all conditions, and 3) that device measurement of HR would become increasingly inaccurate as activity intensity increased.

Methods

Participant.
This study included one subject (BN) who completed a 24-hour protocol (29-year-old Caucasian male; Body Mass Index = 21.1; Fitzpatrick skin tone measure = 2; right-hand dominant). The participant (1st author) conceptualized and initiated this study with the purpose of having the data published. Therefore, approval from the University of Oregon ethics committee was unnecessary and not obtained. The participant gave consent for collecting and using the data for study purposes.

Study Protocol.
Participant psychophysiology recordings began at 18:28 on Day 1 and briefly stopped at 17:10 on Day 2 prior to the run condition. Recording resumed at 17:37 for the run condition and stopped at 18:50 on Day 2. Age, gender, height, and weight were used to set up both wearable devices.

Conditions.
Five daily conditions were recorded throughout the 24-hour study using a digital notebook (Google Sheets) to record activity times, resulting in 84 start and stop marker times. These included sitting, which included any seated activity; walking; running (this occurred on a treadmill to allow for a stable ambulatory ECG to increase accuracy); daily activities, which included activities such as cleaning, vacuuming, and cooking; and sleeping.
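To make the marker-based labeling concrete, here is a minimal sketch of how start and stop marker times could be used to assign a condition to each recorded minute. It is written in Python for illustration rather than the R used for the analyses, and the function name, the half-open interval convention, and the example times are all assumptions, not the actual markers from this study.

```python
from datetime import datetime

def label_minute(minute, intervals):
    """Return the condition whose [start, stop) interval contains `minute`, else None."""
    for start, stop, condition in intervals:
        if start <= minute < stop:
            return condition
    return None

# Hypothetical marker log: (start, stop, condition). Not the study's real markers.
intervals = [
    (datetime(2018, 1, 1, 18, 30), datetime(2018, 1, 1, 19, 0), "sitting"),
    (datetime(2018, 1, 1, 19, 0), datetime(2018, 1, 1, 19, 20), "walking"),
    (datetime(2018, 1, 1, 23, 0), datetime(2018, 1, 2, 7, 0), "sleeping"),
]

print(label_minute(datetime(2018, 1, 1, 19, 5), intervals))
print(label_minute(datetime(2018, 1, 2, 3, 0), intervals))
print(label_minute(datetime(2018, 1, 1, 20, 0), intervals))
```

Minutes that fall outside every marked interval are returned as None here, so they can be excluded from the per-condition analyses while still counting toward the overall 24-hour comparison.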

Gold-Standard Reference Method.
Electrocardiography (ECG). ECG data were acquired using a standard 3-lead ambulatory ECG (Vrije Universiteit Ambulatory Monitoring System; de Geus, Willemsen, Klaver, & van Doornen, 1995; Willemsen, De Geus, Klaver, Van Doornen, & Carroll, 1996).

Wearable Devices.
Apple Watch 3. The Apple Watch Series 3 (2017 version, Apple Inc, California, USA, v. 4.2.3) 42mm was worn on the right wrist. According to Apple, the Apple Watch 3 samples HR approximately every 10 minutes, or continuously during workouts, using PPG with either green LED or infrared light and photodiode sensors. All data from the Apple Watch 3 were synced with the Apple Health app on the iPhone and then exported in XML format for analysis. The AppleHealthAnalysis GitHub repository (Datta, 2018) was used to convert the XML file to a data frame in RStudio to access per-minute data for analysis. When more than one heart rate measurement was collected within a minute, the average of these measurements was used, in line with prior wearable research (Shcherbina et al., 2017).
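The per-minute averaging step can be sketched as follows. This is an illustrative Python version, not the R code used in the study; the function name and the sample timestamps and values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def per_minute_mean(samples):
    """Average (timestamp, bpm) samples that fall within the same clock minute."""
    buckets = defaultdict(list)
    for timestamp, bpm in samples:
        # Truncate the timestamp to the start of its minute.
        buckets[timestamp.replace(second=0, microsecond=0)].append(bpm)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}

# Hypothetical samples: two readings in one minute, one reading ten minutes later.
samples = [
    (datetime(2018, 1, 1, 18, 28, 5), 62),
    (datetime(2018, 1, 1, 18, 28, 40), 66),
    (datetime(2018, 1, 1, 18, 38, 12), 70),
]

per_minute = per_minute_mean(samples)
for minute, bpm in per_minute.items():
    print(minute, bpm)
```

Each resulting per-minute value can then be paired with the ECG bpm value for the same minute.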

Fitbit Charge 2. The Fitbit Charge 2 (2017 version, Fitbit Inc, California, USA, v. 22.55.2) was worn on the left wrist. According to Fitbit, the Fitbit Charge 2 samples HR at varying rates depending on activity level using PPG. The fitbitr GitHub repository (Teramo, 2017) was used to interact with the Fitbit application programming interface (API) to access per-minute data for analysis.

Error.
In line with prior health sciences research on wearable HR accuracy (see Shcherbina et al., 2017) and pedometer step counting accuracy (Rosenberger, Buman, Haskell, McConnell, & Carstensen, 2016), we defined an error rate within ± 5% as acceptable.

Statistical Analysis.
All analyses were performed in R (version 3.4.3) using RStudio (version 1.1.383). Analyses were performed using the average beats per minute (bpm) separately for each wearable device. ECG data was used as the gold-standard for HR calculated as bpm.

Percent Error. The percent error relative to the ECG was calculated for heart rate in line with previous wearable research (see Shcherbina et al., 2017) for each wearable by using the following formula:

Percent Error = ((device measurement - gold standard)/ gold standard)*100
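As a small worked illustration of the formula and the ± 5% acceptability threshold, here is a sketch in Python (the study itself used R). The function names and the paired values are hypothetical, and treating exactly 5% as acceptable is an assumption.

```python
def percent_error(device_bpm, ecg_bpm):
    """Signed percent error of the device relative to the ECG reference."""
    return (device_bpm - ecg_bpm) / ecg_bpm * 100

def is_acceptable(error_pct, threshold=5.0):
    """True when the error lies within +/- threshold percent (boundary included by assumption)."""
    return abs(error_pct) <= threshold

# Hypothetical (device bpm, ECG bpm) pairs for matched minutes.
pairs = [(58, 60), (72, 80), (123, 120)]
errors = [percent_error(device, ecg) for device, ecg in pairs]
mean_error = sum(errors) / len(errors)

for (device, ecg), err in zip(pairs, errors):
    print(device, ecg, round(err, 2), is_acceptable(err))
print("mean percent error:", round(mean_error, 2))
```

A negative value means the device read lower than the ECG. Because signed errors can cancel, a mean percent error near zero does not rule out large individual errors in both directions.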

Bland-Altman Analysis. Bland-Altman analysis and 95% limits of agreement (LoA) were calculated using the BlandAltmanLeh R package (Lehnert, 2015) for the main analyses, rather than the concordance correlation coefficient, to determine agreement between devices, as this is the main method used for comparing medical instruments (Bland & Altman, 1999, 2015; for a systematic review see Zaki, Bulgiba, Ismail, & Ismail, 2012). Research indicates that different methods are unlikely to agree exactly; the importance therefore lies in how close pairs of observations are, as small differences between devices are unlikely to impact patient decisions (Martin Bland & Altman, 1986).
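For readers unfamiliar with the method, the core Bland-Altman quantities can be sketched in a few lines. This is a Python illustration under stated assumptions, not the BlandAltmanLeh code used for the analyses: it takes the difference as device minus reference, uses the sample standard deviation, and applies the conventional 1.96 multiplier for the 95% limits. The data are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(device, reference, multiplier=1.96):
    """Return (mean difference, lower LoA, upper LoA) for paired measurements."""
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)          # mean error (device minus reference)
    spread = stdev(diffs)       # sample standard deviation of the differences
    return bias, bias - multiplier * spread, bias + multiplier * spread

# Hypothetical paired per-minute bpm values.
device = [60, 72, 85, 90]
reference = [62, 75, 84, 95]

bias, lower, upper = bland_altman(device, reference)
print(round(bias, 2), round(lower, 2), round(upper, 2))
```

Because the limits are symmetric around the mean difference, the midpoint of the lower and upper LoA should equal the mean error, which gives a quick consistency check on reported values.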

Concordance Correlation Coefficient (CCC). Lastly, although not one of the pre-registered analyses, we also ran CCC analyses between the ECG and each wearable device separately across all conditions using the DescTools R package (Signorell, 2018) to assist in interpreting the Bland-Altman plots. In line with prior wearable research (see Wallen et al., 2016), the strength of agreement was interpreted as weak (ccc < 0.5), moderate (ccc = 0.5-0.7), or strong (ccc > 0.7).
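To show what the CCC measures and how the interpretation bands apply, here is a minimal Python sketch (not the DescTools implementation used in the study). It assumes Lin's formulation with 1/n variances and covariance; software packages may use a different denominator, which changes the result slightly for small samples. The function names and data are hypothetical.

```python
from statistics import mean

def concordance_cc(x, y):
    """Lin's concordance correlation coefficient using 1/n moments (an assumption)."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Penalizes both poor correlation and a systematic offset between methods.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)

def agreement_label(ccc):
    """Interpretation bands described in the text: weak, moderate, strong."""
    if ccc < 0.5:
        return "weak"
    if ccc <= 0.7:
        return "moderate"
    return "strong"

# Hypothetical data: the device reads a constant 2 bpm below the reference.
reference = [62, 72, 82]
device = [60, 70, 80]

ccc = concordance_cc(device, reference)
print(round(ccc, 4), agreement_label(ccc))
```

Unlike a Pearson correlation, which would be exactly 1 for this constant 2 bpm offset, the CCC drops below 1 because it also accounts for the shift away from the line of identity.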

Results

Descriptives. 
The ECG collected 102,740 individual heart beat recordings, which resulted in 1,424 beat per minute (bpm) observations after data cleaning and condensing to per-minute values. The Fitbit Charge 2 collected 1,446 bpm observations and the Apple Watch 3 collected 1,545 individual observations, which resulted in 394 bpm observations after averaging multiple observations within the same minute as described above and in line with prior wearable research. Overall, the study yielded 4,415 raw bpm observations, or 3,264 cleaned bpm observations (after averaging multiple Apple Watch 3 observations within a single minute), which is up to 84% more data within a single subject as compared to some prior studies that had 50 subjects (Wang et al., 2017). See Table 1 for number of observations, HR descriptives, mean error percentage, Bland-Altman Analyses, and CCC agreement for each condition.
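The totals above follow directly from the per-device counts; the short Python check below reproduces the arithmetic using only the numbers reported in this paragraph (the variable names are ours).

```python
# Observation counts as reported in the Descriptives paragraph.
ecg_minutes = 1424        # ECG bpm observations after cleaning
fitbit_minutes = 1446     # Fitbit Charge 2 bpm observations
apple_raw = 1545          # Apple Watch 3 individual observations
apple_minutes = 394       # Apple Watch 3 observations after per-minute averaging

raw_total = ecg_minutes + fitbit_minutes + apple_raw
cleaned_total = ecg_minutes + fitbit_minutes + apple_minutes

print(raw_total, cleaned_total)
```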

Percent Error.
Overall across the 24-hour recording, the Apple Watch 3 had a mean percent error of -2.25%, while the Fitbit Charge 2 had a mean percent error of -4.25%. During sitting conditions, the Apple Watch 3 had a mean percent error of -3.14%, while the Fitbit Charge 2 had a mean percent error of -6.29%. During walking conditions, the Apple Watch 3 had a mean percent error of 0.14%, while the Fitbit Charge 2 had a mean percent error of -6.50%. During the running condition, the Apple Watch 3 had a mean percent error of 1.50%, while the Fitbit Charge 2 had a mean percent error of -9.88%. During daily activity conditions, the Apple Watch 3 had a mean percent error of -9.38%, while the Fitbit Charge 2 had a mean percent error of -4.33%. Lastly, during the sleep condition, the Apple Watch 3 had a mean percent error of -1.36%, while the Fitbit Charge 2 had a mean percent error of -1.62%.

Bland-Altman Analysis and 95% Limits of Agreement. 
Overall, across the 24-hour recording (see Figure 1a and 1b) the Apple Watch 3 had a mean error of -1.80 bpm (Lower LoA-Upper LoA; -16.31 to 12.71 bpm), while the Fitbit Charge 2 had a mean error of -3.47 bpm (Lower LoA-Upper LoA; -15.54 to 8.62 bpm). Visual inspection of the Bland-Altman plots (see Figure 1a and 1b) revealed a tendency for the Apple Watch 3 to both over- and underestimate HR values when HR values were between 70-120 bpm, while the Fitbit Charge 2 had a tendency to underestimate HR values, particularly once HR values exceeded ~80 bpm.

Figure 1a. Apple Watch 3 Bland-Altman Plot Across 24-Hours

Figure 1b. Fitbit Charge 2 Bland-Altman Plot Across 24-Hours

During sitting conditions, the Apple Watch 3 had a mean error of -2.47 bpm (Lower LoA-Upper LoA; -16.94 to 12.01 bpm), while the Fitbit Charge 2 had a mean error of -4.69 bpm (Lower LoA-Upper LoA; -14.29 to 4.91 bpm). During walking conditions, the Apple Watch 3 had a mean error of 0.11 bpm (Lower LoA-Upper LoA; -14.18 to 14.41 bpm), while the Fitbit Charge 2 had a mean error of -6.85 bpm (Lower LoA-Upper LoA; -28.51 to 14.81 bpm).

During the running condition, the Apple Watch 3 had a mean error of 1.77 bpm (Lower LoA-Upper LoA; -9.78 to 13.33 bpm), while the Fitbit Charge 2 had a mean error of -14.73 bpm (Lower LoA-Upper LoA; -29.77 to 0.31 bpm). During daily activity conditions, the Apple Watch 3 had a mean error of -8.50 bpm (Lower LoA-Upper LoA; -33.78 to 16.78 bpm), while the Fitbit Charge 2 had a mean error of -3.73 bpm (Lower LoA-Upper LoA; -19.88 to 12.41 bpm).

Lastly, during the sleep condition, the Apple Watch 3 had a mean error of -0.95 bpm (Lower LoA-Upper LoA; -6.39 to 4.50 bpm), while the Fitbit Charge 2 had a mean error of -1.11 bpm (Lower LoA-Upper LoA; -7.28 to 5.17 bpm).

Concordance Correlation Coefficient (CCC).
Overall, across the 24-hour recording the Apple Watch 3 (ccc = .955, 95% CI [.945, .963]) and the Fitbit Charge 2 (ccc = .906, 95% CI [.896, .914]) had strong agreement with the reference method.

During sitting conditions, the Apple Watch 3 (ccc = .423, 95% CI [.321, .567]) had weak agreement and the Fitbit Charge 2 (ccc = .561, 95% CI [.515, .603]) had moderate agreement with the reference method. During walking conditions, the Apple Watch 3 (ccc = .871, 95% CI [.807, .915]) and the Fitbit Charge 2 (ccc = .740, 95% CI [.645, .812]) had strong agreement with the reference method. During the running condition, the Apple Watch 3 (ccc = .864, 95% CI [.731, .934]) had strong agreement with the reference method, while the Fitbit Charge 2 (ccc = .490, 95% CI [.268, .663]) had weak agreement with the reference method. During daily activity conditions, the Apple Watch 3 (ccc = .460, 95% CI [.204, .656]) had weak agreement with the reference method, while the Fitbit Charge 2 (ccc = .739, 95% CI [.676, .791]) had strong agreement with the reference method. Lastly, during the sleep condition, the Apple Watch 3 (ccc = .791, 95% CI [.715, .849]) and the Fitbit Charge 2 (ccc = .745, 95% CI [.707, .779]) had strong agreement with the reference method.

Discussion

This study provided the first continuous and ecologically valid assessment of the accuracy of the Apple Watch 3 and the Fitbit Charge 2 as they were devised to be used by consumers (i.e., during ecologically valid daily activities) during a 24-hour paradigm that approximated actual consumer device use conditions.

In line with previous controlled laboratory research (de Zambotti et al., 2016; El-Amrawy & Nounou, 2015; Gillinov et al., 2017; Shcherbina et al., 2017; Stahl, An, Dinkel, Noble, & Lee, 2016; Wallen, Gomersall, Keating, Wisløff, & Coombes, 2016; Wang et al., 2017), our findings indicated that both wearable devices provided acceptable accuracy overall across the 24-hour recording period. In addition, in line with previous research, both the Apple Watch 3 and the Fitbit Charge 2 slightly underestimated heart rate as compared to ECG and other reference methods (Boudreaux et al., 2017; de Zambotti et al., 2016; Dooley, Golaszewski, & Bartholomew, 2017; Kroll, Boyd, & Maslove, 2016; Stahl, An, Dinkel, Noble, & Lee, 2016; Wallen, Gomersall, Keating, Wisløff, & Coombes, 2016; Wang et al., 2017). Although this overall (across the 24-hour study) underreporting of HR is unlikely to be problematic in most contexts (< 4 bpm), there were a number of individual observations that were inaccurate by large margins, which would be problematic in some contexts (e.g., medical settings). This indicates that while overall summary statistics may be very accurate for research purposes, any single observation in real-time may have a large degree of error. In addition, we found it surprising that the Apple Watch 3 had such a high mean percent error rate (-9.38%) during daily activities as compared to the Fitbit Charge 2 mean percent error (-4.33%). This difference may be due to the fact that the Apple Watch 3 was worn on the dominant hand, which may have made more erratic movements than the Fitbit Charge 2 on the non-dominant hand during daily activities, making it more difficult for the PPG sensor to obtain an accurate HR measurement.

Overall, the Apple Watch 3 had acceptable error across the entire 24-hour period as well as during the sitting, walking, running, and sleeping conditions, while its error rate rose above the ± 5% threshold for daily activities (-9.38%). In addition, the Fitbit Charge 2 had acceptable error across the entire 24-hour period as well as during the sitting, daily activity, and sleeping conditions, but rose above the ± 5% threshold for walking (-6.50%) and running (-9.88%).

Strengths and Limitations.
The current study had a number of strengths. First, the time-intensive single-subject design allowed all potential confounding variables to be held constant except for the wearable devices. Second, the length of recording resulted in the collection of a total of 3,264 bpm observations, which is up to 84% more data within a single subject as compared to some prior study data across 50 subjects (Wang et al., 2017). Lastly, this study provided the first continuous and ecologically valid assessment of wearable HR accuracy in real-world conditions.

In addition to these strengths, there were also a number of limitations. First, the single-subject design limited the range of participant demographic factors, such as higher BMI, darker skin tone, and larger wrist circumference, which have been shown to positively correlate with HR error rates (Shcherbina et al., 2017). Future studies should attempt to replicate these results across multiple individuals with diverse BMI, wrist circumference, skin tone, fitness level, and stress level. In addition, the single-subject design combined with the Apple Watch 3 sampling rate of approximately every 10 minutes led to a small number of observations for some conditions. While continuous recording was not activated on the Apple Watch 3 in order to approximate real-world usage conditions, future studies should aim to recruit larger numbers of subjects in order to increase the observations for each condition and potentially activate continuous recording on this device. Similarly, while this study had the strength of providing the first continuous and ecologically valid assessment of wearable accuracy in real-world conditions, this was also a limitation, as it inherently couldn’t take place within more controlled laboratory settings that use a stationary ECG rather than an ambulatory ECG, which may introduce some additional error. Another limitation of this study is that while the overall error rate of both devices was low, there were some individual observations that were inaccurate by large margins. This indicates that while overall summary statistics for conditions may be very accurate, any single observation in real-time may have a large degree of error. Researchers should keep this in mind when using wearable devices in research settings, and this finding emphasizes the importance of data cleaning. Implementing these devices in research settings would likely benefit from automated outlier detection and deletion techniques.
Lastly, this study did not counterbalance wrist placement of the wearables to rule out potential influences of wrist circumference or musculature on the accuracy of HR readings. The subject was right handed, and therefore the lower accuracy of the Apple Watch 3 as compared to the Fitbit Charge 2 during the daily activities condition may have been due to inconsistent wrist motions that accompany many activities in this condition as prior research has indicated that the lack of smooth wrist movements introduces larger HR measurement error (Dooley, Golaszewski, & Bartholomew, 2017). Future studies should provide both between-subjects analyses and within-subjects analyses with devices on both wrists to assess the accuracy of wearables as hand dominance may influence accuracy.

Conclusions.
This study provided the first continuous and ecologically valid assessment of the accuracy of the Apple Watch 3 and the Fitbit Charge 2 HR measurements as they were devised to be used by consumers out in the real world during a 24-hour paradigm that approximated actual consumer device use conditions. Both the Apple Watch 3 and Fitbit Charge 2 had acceptable HR accuracy across the 24-hour period, with the Apple Watch 3 having acceptable HR error across the day as well as during the sitting, walking, running, and sleeping conditions, while the Fitbit Charge 2 had acceptable HR error across the entire day as well as during the sitting, daily activity, and sleeping conditions. In contrast, the Apple Watch 3 did not have acceptable accuracy during daily activities, while the Fitbit Charge 2 did not have acceptable accuracy during walking and running. Again, it is important to note that while overall statistics for most conditions were acceptable, there were a number of individual observations that varied widely from the gold-standard ECG, which indicates that any single measurement viewed in real-time cannot be interpreted as an accurate measurement.

Overall, wearable devices likely won’t be replacing the gold-standard ECG in a medical setting anytime soon, but both the Apple Watch 3 and the Fitbit Charge 2 are acceptable for research and clinical applications, particularly big data studies, as these devices had an overall acceptable error rate combined with being relatively cheap, unobtrusive, and scalable as compared to gold-standard medical equipment.

References

Bland, J. M., & Altman, D. G. (1999). Measuring agreement in method comparison studies. Statistical Methods in Medical Research. http://doi.org/10.1191/096228099673819272

Boudreaux, B. D., Hebert, E. P., Hollander, D. B., Williams, B. M., Cormier, C. L., Naquin, M. R., … Kraemer, R. R. (2017). Validity of Wearable Activity Monitors during Cycling and Resistance Exercise. Medicine & Science in Sports & Exercise, 1. http://doi.org/10.1249/MSS.0000000000001471

Datta, D. (2018). AppleHealthAnalysis. https://github.com/deepankardatta/AppleHealthAnalysis

de Geus, E. J. C., Willemsen, G. H. M., Klaver, C. H. A. M., & van Doornen, L. J. P. (1995). Ambulatory measurement of respiratory sinus arrhythmia and respiration rate. Biological Psychology, 41(3), 205–227. http://doi.org/10.1016/0301-0511(95)05137-6

de Zambotti, M., Baker, F. C., Willoughby, A. R., Godino, J. G., Wing, D., Patrick, K., & Colrain, I. M. (2016). Measures of sleep and cardiac functioning during sleep using a multi-sensory commercially-available wristband in adolescents. Physiology & Behavior, 158, 143–149. http://doi.org/10.1016/j.physbeh.2016.03.006

Dooley, E. E., Golaszewski, N. M., & Bartholomew, J. B. (2017). Estimating Accuracy at Exercise Intensities: A Comparative Study of Self-Monitoring Heart Rate and Physical Activity Wearable Devices. JMIR mHealth and uHealth, 5(3), e34. http://doi.org/10.2196/mhealth.7043

El-Amrawy, F., & Nounou, M. I. (2015). Are currently available wearable devices for activity tracking and heart rate monitoring accurate, precise, and medically beneficial? Healthcare Informatics Research, 21(4), 315–320. http://doi.org/10.4258/hir.2015.21.4.315

Gillinov, S., Etiwy, M., Wang, R., Blackburn, G., Phelan, D., Gillinov, A. M., … Desai, M. Y. (2017). Variable accuracy of wearable heart rate monitors during aerobic exercise. Medicine and Science in Sports and Exercise, 49(8), 1697–1703. http://doi.org/10.1249/MSS.0000000000001284

Bland, J. M., & Altman, D. G. (2015). Measuring agreement in method comparison studies. PubMed Commons, 2802(99), 10501650.

Lehnert, B. (2015). BlandAltmanLeh. Retrieved from https://cran.r-project.org/web/packages/BlandAltmanLeh/BlandAltmanLeh.pdf

Martin Bland, J., & Altman, D. (1986). Statistical Methods for Assessing Agreement Between Two Methods of Clinical Measurement. The Lancet, 327(8476), 307–310. http://doi.org/10.1016/S0140-6736(86)90837-8

Rosenberger, M. E., Buman, M. P., Haskell, W. L., McConnell, M. V., & Carstensen, L. L. (2016). Twenty-four Hours of Sleep, Sedentary Behavior, and Physical Activity with Nine Wearable Devices. Medicine and Science in Sports and Exercise, 48(3), 457–465. http://doi.org/10.1249/MSS.0000000000000778

Shcherbina, A., Mattsson, C., Waggott, D., Salisbury, H., Christle, J., Hastie, T., … Ashley, E. (2017). Accuracy in Wrist-Worn, Sensor-Based Measurements of Heart Rate and Energy Expenditure in a Diverse Cohort. Journal of Personalized Medicine, 7(2), 3. http://doi.org/10.3390/jpm7020003

Signorell, A. (2018). DescTools: Tools for Descriptive Statistics.

Stahl, S. E., An, H.-S., Dinkel, D. M., Noble, J. M., & Lee, J.-M. (2016). How accurate are the wrist-based heart rate monitors during walking and running activities? Are they accurate enough? BMJ Open Sport & Exercise Medicine, 2(1), e000106. http://doi.org/10.1136/bmjsem-2015-000106

Teramo, N. (2017). fitbitr. https://github.com/teramonagi/fitbitr

Wallen, M. P., Gomersall, S. R., Keating, S. E., Wisløff, U., & Coombes, J. S. (2016). Accuracy of heart rate watches: Implications for weight management. PLoS ONE, 11(5). http://doi.org/10.1371/journal.pone.0154420

Wang, R., Blackburn, G., Desai, M., Phelan, D., Gillinov, L., Houghtaling, P., & Gillinov, M. (2017). Accuracy of wrist-worn heart rate monitors. JAMA Cardiology, 2(1), 104–106. http://doi.org/10.1001/jamacardio.2016.3340

Willemsen, G. H. M., De Geus, E. J. C., Klaver, C. H. A. M., Van Doornen, L. J. P., & Carroll, D. (1996). Ambulatory monitoring of the impedance cardiogram. Psychophysiology, 33(2), 184–193. http://doi.org/10.1111/j.1469-8986.1996.tb02122.x

Zaki, R., Bulgiba, A., Ismail, R., & Ismail, N. A. (2012). Statistical methods used to test for agreement of medical instruments measuring continuous variables in method comparison studies: A systematic review. PLoS ONE. http://doi.org/10.1371/journal.pone.0037908