Article Text


Evidence assessing the diagnostic performance of medical smartphone apps: a systematic review and exploratory meta-analysis
  1. Rahel Buechi1,
  2. Livia Faes2,
  3. Lucas M Bachmann2,
  4. Michael A Thiel1,
  5. Nicolas S Bodmer2,
  6. Martin K Schmid1,
  7. Oliver Job1,
  8. Kenny R Lienhard3
  1. 1 Eye Clinic, Cantonal Hospital of Lucerne, Lucerne, Switzerland
  2. 2 Medignition Inc., Research Consultants, Zurich, Switzerland
  3. 3 Department of Information Systems, Faculty of Business and Economics, University of Lausanne, Lausanne, Switzerland
  1. Correspondence to Professor Lucas M Bachmann; bachmann{at}


Objective The number of mobile applications addressing health topics is increasing. Whether these apps underwent scientific evaluation is unclear. We comprehensively assessed papers investigating the diagnostic value of available diagnostic health applications using inbuilt smartphone sensors.

Methods Systematic review. MEDLINE and Scopus (both from inception until 15 December 2016) as well as Web of Science, Medical Informatics and Business Source Premier (by citation of reference) were searched. Checking the reference lists of review articles and of included articles complemented the electronic searches. We included all studies investigating a health application that used the inbuilt sensors of a smartphone for the diagnosis of disease. The methodological quality of the 11 studies used in an exploratory meta-analysis was assessed with the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool and the reporting quality with the ‘STAndards for the Reporting of Diagnostic accuracy studies’ (STARD) statement. Sensitivity and specificity of studies reporting two-by-two tables were calculated and summarised.

Results We screened 3296 references for eligibility. Eleven studies, most of them assessing melanoma screening apps, reported 17 two-by-two tables. Quality assessment revealed a high risk of bias in all studies. The included papers studied 1048 subjects (758 with the target conditions and 290 healthy volunteers). Overall, the summary estimate for sensitivity was 0.82 (95% CI 0.56 to 0.94) and for specificity 0.89 (95% CI 0.70 to 0.97).

Conclusions The diagnostic evidence of available health apps on Apple’s and Google’s app stores is scarce. Consumers and healthcare professionals should be aware of this when using or recommending them.

PROSPERO registration number 42016033049.

  • mobile health apps
  • evidence-based medicine
  • systematic review
  • diagnostic research

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial.


Strengths and limitations of this study

  • A comprehensive literature search was conducted to retrieve the published evidence, stringent inclusion criteria were applied and the methodological quality of the studies was assessed systematically.

  • The primary studies found had low methodological quality and a low level of reporting. All but one of the included studies used diagnostic case–control designs.

  • The summary estimates from the exploratory meta-analysis need to be interpreted very cautiously.

  • We were unable to test all but one of the apps assessed in this review because they were unavailable in the app stores, and we thus lack first-hand experience with them.


Introduction

Within recent years, the number, awareness and popularity of mobile health applications (apps) have increased substantially.1 2 Currently, over 165 000 apps covering a medical topic are available on the two largest mobile platforms, Android and iOS, 9% of them addressing topics of screening, diagnosis and monitoring of various illnesses.3 Also, the Medical Subject Heading (MeSH) term ‘Mobile Applications’, introduced in MEDLINE in 2014, currently indexes approximately 1000 records.4 However, while some authors have predicted that mobile health apps will be the game changer of the 21st century, others have pointed out that the scientific basis of mobile health apps remains thin.5 6

While information used for personal healthcare is traditionally captured via self-report surveys and doctor consultations, mobile devices with embedded sensors offer opportunities to entertain a continued exchange of information between patients and physicians. This dialogue is of particular importance for patients with chronic illnesses.

Three recent reviews focused on the efficacy, effectiveness and usability of mobile health apps in different clinical areas.7–9 They did not find reasonably sized randomised trials and called for a staged process in the scientific evaluation of mobile health apps. To date, rigorous evidence syntheses of diagnostic studies are missing. Given that most apps target a diagnostic problem, it would be helpful to gauge their scientific basis. In this comprehensive systematic review, we therefore summarised the currently available papers assessing the diagnostic properties of mobile health apps.


Methods

This review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses10 statement recommendations.

Data sources

Electronic searches were performed without any language restriction on MEDLINE (PubMed interface) and Scopus (both databases from inception until 15 December 2016), as well as on Web of Science, Medical Informatics and Business Source Premier (by citation of reference). The full search algorithm is provided in online supplementary appendix 1.

Supplementary data

Study selection

We applied the PICOS format as follows: we included all studies examining subjects in a clinical setting (P) and investigating a health app that used the inbuilt sensors of a smartphone (I) for the diagnosis of an illness. The minimum requirement for inclusion in the exploratory meta-analysis was the availability of original data and the possibility to construct a two-by-two table, that is, to calculate sensitivity and specificity (O). We accepted all reference tests (C) used in these studies to classify the presence or absence of disease. No selection on study design was made (S).
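The outcome criterion above, deriving sensitivity and specificity from a two-by-two table, can be sketched as follows. The function name and all counts are hypothetical, for illustration only; they are not taken from any included study.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Compute sensitivity and specificity from a two-by-two table.

    tp: app positive, disease present   fn: app negative, disease present
    fp: app positive, disease absent    tn: app negative, disease absent
    """
    sensitivity = tp / (tp + fn)  # proportion of diseased correctly identified
    specificity = tn / (tn + fp)  # proportion of healthy correctly identified
    return sensitivity, specificity

# Hypothetical counts for illustration only
sens, spec = diagnostic_accuracy(tp=41, fp=8, fn=9, tn=42)
print(round(sens, 2), round(spec, 2))  # 0.82 0.84
```

A study qualified for the meta-analysis exactly when all four cells of such a table could be reconstructed from its reported data.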

We excluded all studies examining apps that provide psychological assessments, questionnaires or mobile alternatives to paper-based tests. We further excluded apps using external sensors, such as clip-on lenses, for the diagnostic assessment, as well as studies where the app was used only as a transmitter of data.

Data extraction and quality assessment

The methodological quality of all 11 studies11–21 providing 2×2 table data that were summarised in the meta-analysis was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool. Reporting quality was assessed using the ‘STAndards for the Reporting of Diagnostic accuracy studies’ (STARD) statement.22 23 Quality assessment involved scrutinising the methods of data collection (prospective, retrospective) and patient selection (consecutive enrolment, convenience sample), as well as the descriptions of the test (the type of test and analysis performed by the app) and the reference standard (method to rule in or rule out the illness).

Two reviewers independently assessed papers and extracted data using a standardised form. Discrepancies were resolved by discussion between the two reviewers, by correspondence with study authors or by arbitration by a third reviewer. This was necessary in five cases.

Apps of included studies were searched in Apple’s App Store and on Google Play.

Data synthesis and analysis

Data to fill the two-by-two table were extracted from each study, and sensitivity and specificity were calculated. Sensitivity and specificity were pooled using the ‘metandi’ routine implemented in Stata V.14.2. Metandi fits a two-level mixed logistic regression model, with independent binomial distributions for the true positives and true negatives within each study and a bivariate normal model for the logit transforms of sensitivity and specificity between studies. For pooling, at least four studies on the same target condition had to be available.24 Therefore, no separate analysis was possible for health apps on Parkinson’s disease, falling in patients with chronic stroke or atrial fibrillation.
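As a rough illustration of why pooling is done on the logit scale, the sketch below performs simple inverse-variance pooling of logit-transformed proportions. This is a deliberately simplified stand-in for metandi's bivariate mixed model: it ignores random effects and the between-study correlation of sensitivity and specificity, and all input values are hypothetical.

```python
import math

def pool_logit(props, ns):
    """Inverse-variance pooling of proportions on the logit scale.

    A much-simplified stand-in for Stata's 'metandi' (no random effects,
    no between-study correlation). Applies a 0.5 continuity correction
    so that proportions of 0 or 1 remain finite on the logit scale.
    """
    logits, weights = [], []
    for p, n in zip(props, ns):
        x = p * n + 0.5                    # successes, continuity-corrected
        nn = n + 1.0
        logit = math.log(x / (nn - x))     # logit transform
        var = 1.0 / x + 1.0 / (nn - x)     # approximate variance of the logit
        logits.append(logit)
        weights.append(1.0 / var)
    pooled = sum(l * w for l, w in zip(logits, weights)) / sum(weights)
    return 1.0 / (1.0 + math.exp(-pooled))  # back-transform to a proportion

# Hypothetical per-study sensitivities and sample sizes (illustration only)
print(round(pool_logit([0.70, 0.85, 0.90], [40, 60, 30]), 2))  # ≈ 0.80
```

The logit scale keeps pooled estimates inside (0, 1) and yields the asymmetric confidence intervals reported in the Results.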

All analyses were done using Stata V.14.2 statistics software package.


Results

Study selection

Electronic searches retrieved 4010 records. After excluding duplicates, 3296 records remained and were screened based on title and abstract. Subsequently, 3209 studies were excluded because they did not fulfil the eligibility criteria. The large majority of records were excluded because they did not contain original data but expressed personal opinion about the possible role of medical smartphone apps. Eighty-seven articles were finally retrieved and read in full text to be considered for inclusion. Out of these, 30 studies provided some clinical data.2 11–21 25–42 Details on these studies are available in the online supplementary appendix 2. Eleven studies reporting 17 two-by-two tables were considered in this review.11–21 Details of these studies are available in table 1. The study selection process is outlined in figure 1.

Supplementary data

Table 1

Characteristics of included studies

Figure 1

Flow chart according to the PRISMA statement. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-analyses.

Study characteristics

The 30 papers providing some clinical data assessed 35 diagnostic health apps for various clinical conditions. These included: screening for melanoma (n=8),12 15–19 27 28 Parkinson’s disease monitoring (n=6),11 21 29 34 35 42 tremor in Parkinson’s disease, in multiple sclerosis or of essential tremor (n=4),13 26 30 39 atrial fibrillation (n=3),14 31 32 rheumatoid arthritis (n=3),33 36 41 wet age-related macular degeneration and diabetic retinopathy (n=3),2 37 38 multiple sclerosis (n=1),25 cataract (n=1)40 and falling in patients with stroke (n=1).20 The studies altogether involved 1048 subjects: 758 subjects with the target condition and 290 healthy volunteers or controls. One paper reported on approximately 3000 skin lesions of an unknown number of patients.28 The complete data abstraction of these studies is available in online supplementary appendix 2.

Eleven studies11–21 that investigated 13 diagnostic health apps allowing the construction of 17 two-by-two tables qualified for the meta-analysis. Twelve tables reported on the diagnosis of melanoma, three on Parkinson’s disease, one assessed falling in patients with chronic stroke and another assessed atrial fibrillation.

Methodological quality

A summary of the methodological quality is shown in table 2.

Table 2

Summary of methodological quality assessed with the QUADAS-2 tool22

Ten studies had a diagnostic case–control design and one was a prospective cohort study.14 Only in one paper, patients were sampled in a consecutive manner.15

A high risk of bias was found in all studies. Most high-risk ratings were assigned in the domains of ‘Patient Selection’, ‘Index Test’ and ‘Flow and Timing’, whereas the fewest high-risk ratings were found within the domain of the ‘Reference Standard’. Hence, several sources of bias were identified that may have affected study estimates. A methodological criterion that was frequently inadequately addressed was ‘interpretation of the reference standard without knowledge of the index test’, and vice versa.


Usability

Only four studies assessed the usability of the investigated diagnostic health app.2 28 36 37 None used a validated instrument. Questions on usability covered, for example, reasons for non-adherence, simplicity of use, difficulties and comprehensibility.

Exploratory analyses of diagnostic accuracy

The summary estimate for sensitivity was 0.82 (95% CI 0.56 to 0.94) and the pooled specificity was 0.89 (95% CI 0.70 to 0.97). In a subgroup analysis of 12 reports, the pooled sensitivity of studies assessing melanoma was 0.73 (95% CI 0.36 to 0.93) and the pooled specificity was 0.84 (95% CI 0.54 to 0.96). No pooling was possible for Parkinson’s disease, falling in patients with chronic stroke or atrial fibrillation due to the limited number of studies.
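The asymmetry of these intervals around the point estimate is typical for proportions near 0 or 1. As a simple illustration, not the interval method of the bivariate model used in this review, a Wilson score interval for a single hypothetical study's sensitivity can be computed as follows; the counts are invented.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a proportion (e.g. one study's sensitivity).

    Shown only to illustrate why such intervals are asymmetric; the review
    itself derives intervals from the bivariate mixed model.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom            # shrunken point estimate
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

lo, hi = wilson_ci(41, 50)  # hypothetical study: sensitivity 41/50 = 0.82
print(round(lo, 2), round(hi, 2))
```

Note that the interval is wider below the point estimate than above it, mirroring the shape of the pooled intervals reported above.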

Only one of the apps assessed in this review was available on Apple’s or Google’s app stores.12 A summary of test performance characteristics is shown in table 3 and the hierarchical summary receiver operating characteristic curve is seen in figure 2.

Table 3

Test performance characteristics

Figure 2

Hierarchical summary receiver operating characteristic curve (HSROC).


Discussion

Main findings

This systematic review of studies assessing the performance of diagnostic health apps using smartphone sensors showed that scientific evidence is scarce. Available studies were small and had low methodological quality. Only one-third of available reports assessed parameters of diagnostic accuracy. Only one app included in the meta-analysis is currently available on app stores. The large majority of health apps available in the stores have not undergone a solid scientific enquiry prior to dissemination.

Results in light of existing literature

To the best of our knowledge, this is the first systematic review assembling the evidence on diagnostic mobile health apps in a broader context. We are aware of one recent paper by Donker and coworkers, who systematically summarised the efficacy of mental health apps for mobile devices.43 In line with our findings, Donker and colleagues call for further research into evidence-based mental health apps and for a discussion about the regulation of this industry. Other reviews examining the efficacy and effectiveness of mobile health apps support our findings.7–9 For example, Bakker and colleagues called for randomised controlled trials to validate mental mobile health apps in clinical care.8 Likewise, Majeed-Ariss and coauthors, who systematically investigated mobile health apps in chronically ill adolescents, pointed to the need for scientific evaluation involving healthcare providers’ input at all developmental stages.7

Strengths and limitations

We conducted a comprehensive literature search to retrieve the published evidence, applied stringent inclusion criteria and assessed the methodological quality of the studies systematically. We applied an overinclusive definition of diagnosis because, for example, symptom monitoring might contribute to the diagnostic work-up of a patient. Of the papers qualifying for inclusion in this review, only about 25% investigated the diagnostic accuracy of the app. We believe that a broader concept of diagnosis in this particular context was useful to capture the relevant literature. Our study has several limitations. First, the primary studies were found to have low methodological quality and a low level of reporting. All but one of the included studies used diagnostic case–control designs. While this design might be helpful in the early evaluation of diagnostic tests, it usually leads to higher test performance characteristics than could be expected in clinical practice. From that viewpoint, the summary estimates from the exploratory meta-analysis need to be interpreted very cautiously. The searches performed in the electronic databases had low specificity, leading to a large number of irrelevant records. Correspondingly, the ‘number needed to read’ was very high.44 Although the records were assessed in duplicate by two experienced systematic reviewers, we cannot fully rule out that we missed potentially relevant articles. Finally, we were unable to test all but one of the apps12 assessed in this review because they were no longer available, and we thus lack first-hand experience with them.

Implications for research

Led by the consumer electronics industry, the production of mobile health apps has gained in importance and popularity within recent years. Unfortunately, the scientific work-up of the clinical usefulness of these apps is lagging behind. While many studies have highlighted the potential and possible clinical usefulness of health apps, research conducted according to the well-established standards of design, sampling and analysis is missing. The regulation applied in the USA, the EU and other countries does not go far enough. Ensuring that medical health apps meet criteria on technical concerns is only one important element of regulation. From the consumer’s or patient’s perspective, a trustworthy source showing the amount and level of scientific data underpinning the claims made in the app descriptions would be very useful. In our view, it is very important that technical, clinical and methodological experts jointly form an interdisciplinary development team. While the IT experts take care of the technical developments, data safety and compliance with regulatory requirements, clinical experts certify that the app addresses the right medical context, and researchers finally impose appropriate scientific methods to validly quantify the clinical yield. We believe that developers of a (diagnostic) mobile health app should adopt the same hierarchical framework that has been proposed for imaging tests in the seminal paper by Fryback and Thornbury.45


Conclusions

In this comprehensive systematic review, we found a lack of scientific evidence quantifying the diagnostic value of health apps in the medical literature. The information about the diagnostic accuracy of currently available health apps on Apple’s and Google’s app stores is almost absent. Consumers and healthcare professionals should be aware of this when using or recommending them.




  • RB and LF contributed equally.

  • Contributors RB, LF, LMB, KRL, NSB and MAT obtained and appraised data. LMB and MKS wrote the paper with considerable input from OJ, MAT, RB and KRL. All coauthors provided intellectual input and approved the final manuscript. LMB was responsible for the design and the statistical analysis of the study and is the study guarantor.

  • Funding The work presented in this paper was funded by Medignition Inc, a privately owned company in Switzerland providing health technology assessments for the public and private sectors, via an unrestricted research grant.

  • Competing interests LMB holds shares of Medignition.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement The dataset containing all abstracted data of included studies is available from the Dryad repository: doi:10.5061/dryad.900f8.
