Abstract
Objectives To provide an overview of the methodological considerations for conducting commercial smartphone health app reviews (mHealth reviews), with the aim of systematising the process and supporting high-quality evaluations of mHealth apps.
Design Synthesis of our research team’s experiences of conducting and publishing various reviews of mHealth apps available on app stores, combined with hand-searching the top medical informatics journals (eg, The Lancet Digital Health, npj Digital Medicine, Journal of Biomedical Informatics and the Journal of the American Medical Informatics Association) over the last five years (2018–2022) to identify other app reviews that could inform the discussion of this method and the supporting framework for developing a research (review) question and determining the eligibility criteria.
Results We present seven steps to support rigour in conducting reviews of health apps available on the app market: (1) writing a research question or aims, (2) conducting scoping searches and developing the protocol, (3) determining the eligibility criteria using the TECH framework, (4) conducting the final search and screening of health apps, (5) data extraction, (6) quality, functionality and other assessments and (7) analysis and synthesis of findings. We introduce the novel TECH approach to developing review questions and the eligibility criteria, which considers the Target user, Evaluation focus, Connectedness and the Health domain. Patient and public involvement and engagement opportunities are acknowledged, including co-developing the protocol and undertaking quality or usability assessments.
Conclusion Commercial mHealth app reviews can provide important insights into the health app market, including the availability of apps and their quality and functionality. We have outlined seven key steps for conducting rigorous health app reviews in addition to the TECH acronym, which can support researchers in writing research questions and determining the eligibility criteria. Future work will include a collaborative effort to develop reporting guidelines and a quality appraisal tool to ensure transparency and quality in systematic app reviews.
- telemedicine
- systematic review
- health informatics
- information technology
- statistics & research methods
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
We leveraged the expertise of an experienced research team, who have conducted various health app and systematic reviews, to provide interim recommendations for good practice based on seven key steps, supporting standardised health app review methods and the development of best practice in the mHealth field.
Relevant literature from key medical informatics journals was screened to present a robust analysis and comparison between systematic reviews and app reviews in health.
We propose TECH, a novel framework for constructing review questions and refining eligibility criteria in health app reviews.
This work focuses on mobile apps and does not include the emerging fields of virtual reality, augmented reality or mixed-reality apps.
The methods presented are informed by existing app reviews, which have primarily focused on client-facing apps (eg, for patients, the public or healthcare providers) rather than for the health system or data services, which are also key target users.
Introduction
With the rise in the use of smartphones and other mobile technologies, there has been an increase in the availability of health applications (mHealth apps) designed to be used by individuals for various health issues. Health apps can also support health and care professionals in their daily clinical practice by providing decision support, access to clinical guidelines and education and training.1 By 2018, over 325 000 health apps were available,2 covering many health conditions and targeted behaviours. For example, mHealth apps can help to support self-management of conditions like diabetes,3 facilitate remote monitoring of patients with chronic conditions4 or support patients with general behaviour change such as increasing/monitoring physical activity5 or dietary change.6 Some health apps also support public health initiatives, such as promoting healthy lifestyles and encouraging the uptake of screening and vaccination programmes.7 8 The World Health Organization (WHO) has released mHealth apps to educate people about road safety, sun protection measures and COVID-19.9 However, concerns have been raised about the quality of advice and support such health apps provide.10 Bates et al also highlight concerns with the accuracy of apps designed to support medical diagnosis and potential gaps in the quality of apps regarding safety and privacy.2
There have been attempts to provide frameworks for evaluating the quality of mHealth apps. A systematic review identified 45 frameworks for evaluating mHealth apps, which varied according to the target users (eg, developers, patients), specific conditions (eg, diabetes, mental health, cancer, pain) and the elements of evaluation identified in the core domains of the Health Technology Assessment framework (such as safety and effectiveness).11 Other frameworks have promoted a more holistic approach by encompassing privacy and security, the evidence base, ease of use and data integration12 or ethical principles related to using health apps in health psychology.13 Reviews have also identified existing methods for assessing mHealth app quality,14 as well as guidelines for reporting evaluations of specific types of technology, such as sensors15 and, more broadly, mHealth interventions.16 This includes the development of the Consolidated Standards of Reporting Trials of Electronic and Mobile HEalth Applications and onLine TeleHealth (CONSORT-EHEALTH) checklist, an extension of the original CONSORT checklist for reporting randomised controlled trials (RCTs).17 The checklist focuses on improving the reporting of evidence on the use and effectiveness of mHealth applications in research studies.
The existing initiatives reflect a focus on the quality of research activity, whereby mHealth interventions are evaluated in research studies, and the results of those studies are reported.18 The process of systematically reviewing published studies then provides an overview of the evidence base for the use and effectiveness of different types of mHealth interventions for different patient populations and various purposes. However, guidance is missing on how to systematically review commercially developed mHealth apps (ie, the software products available to download from app stores), which are often not derived from research or subject to evaluation in research studies.
The process of searching, screening, extracting and analysing data, and critically appraising mHealth apps available via commercial platforms can differ from traditional approaches to reviewing published research studies of health apps. A key difference between a systematic literature review and a systematic health app review is the items evaluated. In a systematic review, reviewers attempt to identify evidence from research studies in peer-reviewed journals or grey literature to evaluate the effectiveness (RCT evidence) or other characteristics that influence the effectiveness, uptake and engagement with digital health interventions. As we have outlined, several existing systematic reviews of mHealth applications do this. In a systematic health app review, the focus is on providing a transparent and replicable evaluation of the functionality, quality and purpose of mHealth apps for particular user groups or health conditions. A systematic health app review is informed by the principles and processes of more traditional systematic reviews, in terms of approaches to searching, the use of inclusion/exclusion criteria and explicit assessment measures of quality. However, how these are operationalised is methodologically different and is the focus of this paper. By building on our research team’s experiences of conducting and publishing various reviews of commercially available mHealth apps, we provide an overview of the methodological considerations, aiming to systematise the process and support high-quality reviews of mHealth apps. In doing so, we outline a seven-step process for conducting systematic health app reviews.
Methods
In this paper, we use examples from our previous work as case studies, supported by work from other authors, to develop a new framework for conducting a review of commercially available health apps. We combine our experience (see table 1) with the results of a hand search of the top medical informatics journals (ie, The Lancet Digital Health, npj Digital Medicine, Journal of Biomedical Informatics and the Journal of the American Medical Informatics Association) over the last five years (2018–2022) to identify other reviews of commercial health apps that contribute to the discussion of this methodological approach. Based on this, we propose methods for writing the research question and aim, determining the eligibility criteria and carrying out the review, and we highlight and discuss the methodological issues raised at each stage.
The reviews we draw on cover a range of apps and illustrate a number of the decisions and challenges involved in conducting such reviews. Two of our reviews informed wider research studies: a review of apps used to support hand hygiene, which provided the focus for a subsequent research evaluation,19 and a review of patient-facing genetics apps, which informed the design and development of a genetic counselling app.20 Our other app reviews have provided evidence alongside a more traditional systematic review of the published research literature or as an independent review to guide clinicians/patients about mHealth apps they could use to support patient care.
As the app review process follows many of the same steps as a systematic literature review, we also drew on existing guidance to formulate the seven steps. This included the work of Khan et al,21 who outline five steps for conducting systematic literature reviews: (1) framing the question, (2) identifying relevant publications, (3) assessing the quality of studies, (4) summarising the evidence and (5) interpreting the findings. Xiao and Watson22 describe similar steps but add stages for developing and validating the review protocol, screening for inclusion, extracting data and reporting the findings.
Results
Through discussion within the research team, drawing on our experiences of conducting app reviews and cross-checking with app reviews by other author teams, we have outlined seven steps to support rigour in conducting reviews of health apps available on the app market. The steps are: (1) writing a research question or aim; (2) conducting scoping searches and developing the protocol; (3) determining the eligibility criteria using the TECH framework; (4) conducting the final search and screening of health apps; (5) data extraction; (6) quality, functionality and other assessments and (7) analysis and synthesis of findings. Each step is discussed in turn.
Step 1: writing a research question (or aims)
The focus of an app review will influence the development of the research questions or aims and the underpinning approach to evaluating health apps. If the purpose is to produce a standalone review to support future research and innovation in a specific health domain, understanding existing gaps can help formulate a more general research question. However, if the review is the starting point of a programme of research that aims to design, develop and evaluate a new health app with a population of patients, carers, health professionals or the public, then the research questions may be more focused, examining aspects of apps such as their quality, functionality or availability. Formulating an answerable review question is essential for systematic literature reviews. While formulating a review question helps guide all stages of a systematic literature review (eg, searching, screening, extracting and synthesising), not every question format applies to systematic app reviews. For example, the PICO format is appropriate for systematic literature reviews looking at the effectiveness of interventions in a target population. However, in systematic health app reviews, reviewers can only access the results of effectiveness or evaluation studies (if conducted) if they are published. A bespoke alternative is required, just as SPIDER is used for systematic qualitative reviews.
Therefore, we propose the acronym ‘TECH’, which represents (1) Target user, (2) Evaluation focus, (3) Connectedness and (4) Health domain, as a mechanism to develop a focused research question to guide a health app review. TECH was designed through discussion within the research team and by mapping key concepts against existing frameworks (eg, SPIDER and PICO). Figure 1 presents the acronym, the questions researchers may consider (and the similarity to other acronyms) and a worked example from one of our reviews, which aimed to identify commercially available atrial fibrillation self-management apps and analyse and synthesise their characteristics, functions, privacy/security, incorporated behaviour change techniques, quality and usability.23 TECH is designed to capture the nuances of health domains, supporting the development of clear, answerable research questions. It is important to note that connectedness refers to connections with other devices or applications and with existing human-driven digital or other services, such as a health app for booking appointments with therapists.
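To illustrate how the four TECH elements can anchor a review question, the following minimal sketch (in Python, purely for illustration; the class is our own construct and the field contents paraphrase the atrial fibrillation worked example) records the elements in a structured form that can later drive the eligibility criteria and search strategy.

```python
from dataclasses import dataclass

@dataclass
class TECHQuestion:
    """Structured TECH specification for a health app review question."""
    target_user: str        # T: who the app is intended for
    evaluation_focus: list  # E: what the review will assess
    connectedness: str      # C: links to devices, apps or human-driven services
    health_domain: str      # H: condition or health topic of interest

# Worked example, paraphrasing our atrial fibrillation review (figure 1)
af_review = TECHQuestion(
    target_user="adults self-managing atrial fibrillation",
    evaluation_focus=["characteristics", "functions", "privacy/security",
                      "behaviour change techniques", "quality", "usability"],
    connectedness="stand-alone, or linked to wearables or clinical services",
    health_domain="atrial fibrillation self-management",
)
```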
Step 2: conducting scoping searches and developing the protocol
A preliminary (scoping) search of the health app market via the Apple, Google and Microsoft app stores is an essential first step to help determine whether the number of commercial health apps available is feasible to review. It is worth noting that the language used in descriptions of commercial mHealth apps can vary widely and differ from the scientific language used in published research studies. Hence, a broad search using a range of terminology should be employed initially to avoid missing relevant health apps. We recommend that researchers use basic keywords focused on the health domain/topic (see figure 1) as the search function within app stores is limited. For example, for our hand hygiene app review we only used two keywords: hand hygiene and hand washing.19 In our cancer app review,24 we used more keywords, but all were related to the health domain and only one focused on the target user (patients): cancer, cancer patient, cancer treatment, cancer management and cancer side effects.
If too few health apps are returned, the scope of the topic may be broadened or more keywords added; if too many are returned, the scope and the language used will likely need to be narrowed. This means that the research question and the eligibility criteria may need to be refined iteratively, with multiple scoping searches performed until a reasonable number of apps is identified. The number of potentially includable apps can be estimated by reading each app’s name and description and judging its relevance to the topic.
To give an indication of how many apps it is reasonable to review, we previously identified 236,25 405,24 555,23 668,19 754,20 and 3938,26 health apps from initial searches, before screening or deduplication took place. One of our reviews identified 7561 apps before screening27 due to the topic (exercise), for which many apps exist. Following the initial screening of app titles and app store descriptions, this number was substantially reduced, and only 13 apps were included in the review. However, each research team should decide what number is appropriate by considering resources (eg, time, budget and the number of reviewers available) and the topic of interest.
Scoping searches can also help to identify important considerations for refining the inclusion and exclusion criteria. For example, in our review of hand hygiene apps,19 we initially intended to target healthcare providers as the population of interest. However, the scoping search highlighted that few apps specified their intended users. We, therefore, removed healthcare providers as the intended audience from the inclusion criteria and replaced this with adults more generally.
It is important to note that the scoping search should be conducted when the team is almost ready to begin the final searches, as health apps can appear in and disappear from app stores quickly. This means that the numbers determined from the scoping searches will likely differ from the number of apps identified in the final search. Longer periods between the scoping and final searches will result in more substantial differences in the number of apps available.
In addition to scoping searches in the app stores, we recommend conducting initial searches in bibliographic databases (eg, MEDLINE and SCOPUS) and protocol registration databases to identify whether similar app reviews have been published or are underway. We also strongly recommend that researchers prospectively develop a protocol to guide their methods. Unlike systematic reviews, which should usually be registered on PROSPERO (https://www.crd.york.ac.uk/prospero/) or OSF (https://osf.io/), systematic health app reviews have not been subject to any formal requirement for protocol publication. However, we recommend that future protocols for commercial health app reviews be published (in advance of the searches) on OSF to reduce the likelihood that the review is unnecessarily duplicated and to ensure greater research transparency. This is becoming accepted practice for other reviews (eg, scoping reviews) for which PROSPERO registration is currently not possible.28
Step 3: determining the eligibility criteria using the TECH framework
Inclusion and exclusion criteria should be carefully defined using the information obtained in the scoping searches. Frameworks such as PICO or SPIDER may help develop eligibility criteria for a review so that characteristics such as the population/sample, type of intervention or phenomenon of interest and outcomes related to mHealth apps are considered.29 However, we propose using a more focused framework for health app reviews to support the nuances of undertaking this type of systematic review, as the aim is not to examine the effectiveness of apps or to synthesise the findings of qualitative studies on health apps. As described above, the ‘TECH’ acronym considers the Target user, Evaluation focus, Connectedness and Health domain (see figure 1 for the worked example). Well-thought-out eligibility criteria using the TECH framework may support the development of an appropriate search strategy considering four aspects of commercial health apps, which can lead to a systematic and unbiased selection of appropriate apps. Additionally, reviewers should consider searching the literature to identify published and relevant app reviews to refine the eligibility criteria further.
A health app may be characterised by its intended use and target audience. Therefore, we suggest separating the eligibility criteria into two components: app characteristics (evaluation focus, connectedness, health domain) and target audience characteristics (eg, age, gender, race/ethnicity, geographic location). App characteristics are likely to include the type of health intervention or health prevention method, such as self-management, that can be captured under the ‘Evaluation focus’, along with the target disease, problem or focus of the health app, which falls under the ‘Health domain’. It may be helpful to state whether the health condition is specific or if it covers a broad category of health-related issues. Some health apps can link to other software applications, connect to hardware devices (eg, wearable technologies) or rely on additional external devices (eg, virtual reality headsets or smartwatches) to function correctly, which should be captured under the ‘Connectedness’ criteria. The target audience characteristics will likely include the population type, including patients, healthcare professionals, healthcare students, carers, the public, or particular organisations. Audience characteristics may also include age ranges or if the app is aimed at children or adults; whether the app is aimed at individuals or groups; whether it is assumed that users will pay to access the app (or content within it) and whether the app is available in certain languages or locations.
Step 4: conducting the final search and screening of health apps
Once the scoping searches are complete and the search terms have been collated, the final search can be run on the main app stores (ie, Apple, Google Play and Microsoft) to identify potentially relevant health apps. Several third-party app stores or repositories are also available for Android-based apps (eg, Amazon Appstore, F-Droid and Samsung Galaxy Apps), some of which are open source, making the download process easier. The number of app stores that can be searched will depend on the time and resources available to the review team. Other approaches include using a proprietary software database that enables searching for mobile health apps across the iOS and Google Play app stores30 or publicly available online rating frameworks for health apps that use expert reviewers,31 such as the Organisation for the Review and Care of Health Apps (ORCHA, https://orchahealth.com/), PsyberGuide by One Mind (https://onemindpsyberguide.org/) and MindTools. These approaches could be combined with independent searching and evaluation of commercial mHealth apps to ensure an exhaustive assessment is conducted.
In contrast to established bibliographic research databases such as MEDLINE or PubMed, which enable complex searches (eg, use of Boolean operators and filtering options), the search function within app stores is limited. Basic filters may exclude paid apps or restrict results to child-friendly apps. Some stores (eg, the Google Play store) also enable users to identify family-friendly apps and to specify the type of device being searched for (ie, phone, tablet, TV, Chromebook, watch or car). The Apple App Store also has basic filters for price (any or free), category (including health and fitness) and sorting (relevance, popularity, ratings or release date). Other factors used by app stores (eg, app ratings or the use of adverts) may also affect search results. To overcome this, searching across multiple app stores is advisable.
The search on an app store results in a list of available apps and their descriptions. Unlike research literature databases, there are no options to export the results in a useful format (eg, RIS) for uploading to specialised screening and data management software (eg, Covidence32 or Rayyan33). Hence, the research team must review each app on the results page of each app store to determine whether it meets the inclusion criteria for the review. As in systematic literature reviews, using two or more researchers to reach a consensus on eligibility enhances rigour and study quality. Screening should ideally be conducted on the same day to avoid differences in search results across app stores, which can vary day-to-day and between countries. A workaround is to log the app name, version number and link to the webpage on the app store hosting each app, capturing the search results so they can be used consistently by the review team (a minimal logging sketch follows).
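Because app store results cannot be exported, a lightweight log kept by the team substitutes for the export file of a bibliographic database. The following minimal sketch (Python standard library only; all column names, app details and the file name are our own illustrative choices) writes one row per app surfaced by a search.

```python
import csv
from datetime import date

# Illustrative search log: one row per app surfaced by a store search,
# capturing the details needed to keep screening consistent across reviewers.
FIELDS = ["search_date", "store", "search_term", "app_name",
          "developer", "version", "store_url", "decision", "reason"]

def log_result(writer, store, term, name, developer, version, url):
    writer.writerow({"search_date": date.today().isoformat(),
                     "store": store, "search_term": term, "app_name": name,
                     "developer": developer, "version": version,
                     "store_url": url, "decision": "", "reason": ""})

with open("app_search_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    log_result(writer, "Google Play", "hand hygiene", "Example Hand Wash App",
               "Example Dev Ltd", "2.1.0", "https://play.google.com/...")
```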
The second screening stage involves downloading all apps deemed to have met the inclusion criteria, which requires the review team to have access to at least one Android and one iOS smartphone. Strategies for this can include having separate researchers download apps on different devices or sharing one device between reviewers. Some health apps also require user accounts to be set up and verified before allowing access to their full functionality, which may be needed to assess eligibility. In one of our reviews, we approached app developers for full app access and found that they were happy to allow this.27 Additional researchers can be consulted to resolve differences during the second screening phase. Finally, a modified Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram34 can provide a transparent overview of the search and screening process. This requires clearly stating the number of duplicates across the searches in addition to how many apps were excluded at each screening stage, with reasons given at the second stage.
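The counts needed for such an adapted flow diagram fall out of the screening log directly. The following minimal sketch (all numbers and exclusion reasons invented for illustration) shows the arithmetic for a two-stage screen.

```python
# Illustrative tallies for an adapted PRISMA-style flow diagram (numbers invented).
identified = {"Google Play": 412, "Apple App Store": 298, "Microsoft Store": 44}
total_identified = sum(identified.values())      # apps identified across stores
duplicates = 187                                 # same app found in >1 search/store
screened = total_identified - duplicates         # titles/descriptions screened
excluded_on_description = 489                    # stage 1 exclusions
downloaded = screened - excluded_on_description  # apps assessed in full
excluded_after_download = {"not health-focused": 21, "requires paid account": 12,
                           "wrong target user": 9}
included = downloaded - sum(excluded_after_download.values())
print(f"Identified: {total_identified}; screened: {screened}; "
      f"assessed in full: {downloaded}; included: {included}")
```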
Step 5: data extraction
Like systematic literature reviews, the data extraction process in app reviews requires identifying relevant information from the eligible apps. Data are extracted into a predefined data extraction (coding) sheet while using the app. The time needed to extract the information depends on the type of app, the number of data extraction items and the focus of the review. For example, some apps will take longer to review because they require more comprehensive information to be extracted, ask users to register personal profiles or send push notifications at specific times of the day (eg, behaviour change apps).
A range of data extraction items may be used. Henson et al12 developed a five-level framework for evaluating health apps, concluding that background information, privacy and security, evidence, ease of use and data integration are vital components to consider. Across our previous work, we have categorised the items as descriptive information, technical information and content (see table 2). We have included additional items regarding gamification principles and tactics, as used in an app review by Rajani et al,35 and levels of personalisation, security and privacy, as proposed by Parmar et al.30 Other approaches used in one of our reviews23 include extracting information based on the Online Trust Alliance Best Practices Privacy Recommendations.36 This considers four elements: (1) basic notice/disclosure, (2) key compliance policies, (3) protected privacy and protected sharing criteria and (4) miscellaneous privacy elements. Lastly, our osteoporosis app review considered ratings from ORCHA, which objectively reviews health apps, giving scores for three domains (data privacy, professional assurance and usability/accessibility) and an overall score (%). Scores below 65% imply the presence of some issues, and scores below 45% indicate considerable issues.
We encourage researchers to develop their own data extraction items relevant to their topics of interest. For example, in our hand hygiene app review, as in other app reviews,37 38 we developed criteria to assess the comprehensiveness of the content by identifying themes across reputable sources and guidelines on hand hygiene. Other examples include extracting information about health app security and privacy, including HIPAA or COPPA compliance, whether a medical disclaimer or an encrypted data disclaimer was provided, and which user verification strategies were used during login.30
We also note that sometimes the information sought is not readily available or transparently reported within apps. In this case, researchers should note where information is missing, using abbreviations such as N/R (not reported) or N/A (not available). Missing information can itself be an interesting finding and an opportunity for apps to be improved. For example, omitting information about data sharing may be concerning for health apps that collect and record personal medical information. A sketch of a simple coding sheet follows.
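The following minimal sketch (assuming the pandas library; all field names are our own examples, loosely echoing the descriptive/technical/content grouping in table 2) seeds each field with an explicit default so that missing information is recorded rather than left blank.

```python
import pandas as pd  # assumes pandas is available

# Illustrative extraction (coding) sheet; fields are examples only.
# "N/R" = not reported within the app; "N/A" = not available/applicable.
extraction_template = {
    # Descriptive information
    "app_name": "", "developer": "", "store": "", "version": "", "cost": "",
    # Technical information
    "platform": "", "requires_account": "", "connectivity": "",
    "privacy_policy": "N/R", "data_sharing_statement": "N/R",
    # Content
    "main_features": "", "behaviour_change_techniques": "",
    "evidence_cited": "N/R",
}
coding_sheet = pd.DataFrame([extraction_template])  # one row per included app
```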
Readability metrics can also help determine how appropriate the language used in each app is. Researchers can determine readability by copying a paragraph into a Microsoft Word document and using the two Flesch-Kincaid metrics built into the word processing software.39 40 The Flesch Reading Ease score ranges from 0 to 100, with higher scores indicating that the material is easier to read.39 The Flesch-Kincaid Grade Level gives a score that corresponds to the equivalent grade level of education in the USA.40 For example, a score of 12 indicates that a twelfth grader (aged 17 or 18) in the USA should be able to understand the content.
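Where many app descriptions need scoring, the same two metrics can be computed programmatically. A minimal sketch using the third-party textstat package follows (an assumption on our part; any implementation of the published Flesch formulas would serve equally well, and the sample text is invented).

```python
# Requires the third-party 'textstat' package (pip install textstat), which
# implements the standard Flesch formulas. The published equations are:
#   Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
#   Grade Level  = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
import textstat

sample = ("Wash your hands with soap and water for at least twenty seconds. "
          "Dry them well with a clean towel.")
print("Flesch Reading Ease:", textstat.flesch_reading_ease(sample))
print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(sample))
```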
Step 6: quality, functionality and other assessments
Quality
Evaluating the quality of apps requires a different approach to using critical appraisal tools or risk of bias measures commonly used in systematic literature reviews. Quality can be assessed using the Mobile App Rating Scale (MARS), which comprises 19 items across four objective scales (engagement, functionality, aesthetics and information quality) and an additional 4-item subjective quality scale.41 Each item is rated on a 5-point Likert scale: (1) inadequate, (2) poor, (3) acceptable, (4) good and (5) excellent. MARS has been translated into several languages, including French, Spanish, German and Italian42–45 and is suitable for assessing mobile apps for health conditions due to its reliability, validity and objectivity.46 In all but two of the app reviews we have conducted to date, we have excluded the subjective quality scale to ensure that assessments are as objective as possible. Nevertheless, Stoyanov et al41 reported that the objective measures correlated well with the subjective measures. A step-by-step training video on the use of the MARS is available on YouTube.47
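MARS scoring is straightforward to reproduce: each subscale is averaged, and the objective subscale means are then averaged for an overall app quality score. A minimal sketch with invented ratings follows; the item counts per subscale reflect the published scale, but review teams should verify scoring against the original MARS instructions.

```python
from statistics import mean

# Illustrative MARS ratings for one app (1-5 per item); item counts follow the
# published scale: engagement 5, functionality 4, aesthetics 3, information 7.
ratings = {
    "engagement":    [4, 3, 4, 3, 3],
    "functionality": [5, 4, 4, 4],
    "aesthetics":    [4, 4, 3],
    "information":   [3, 4, 3, 3, 4, 3, 4],
}
subscale_means = {scale: round(mean(items), 2) for scale, items in ratings.items()}
overall = round(mean(subscale_means.values()), 2)  # app quality mean score
print(subscale_means, "overall:", overall)
```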
The MARS question ‘Has the app been trialled/tested?’ can be answered by searching for literature on evaluation (eg, usability, satisfaction or effectiveness) and using more traditional methods, such as risk-of-bias assessment, to evaluate the quality of the evidence.48 However, some review teams may wish to take this a step further if the review aims to recommend evidence-based apps to their target population. For example, in our review on strength and balance exercises for older adults,27 we also visited the app/developer websites and contacted the developers directly for information on any evaluations of the effectiveness of the apps in preventing falls. Due to the absence of evaluations, we compared the interventions promoted by the apps with those used in known ‘gold standard’ strength and balance programmes to determine whether they had an evidence base.
Researchers may also build on the MARS items to assess the quality of each app in more detail. In one of our reviews, we added predetermined criteria to further evaluate the current state of development of apps for pain assessment and to provide future directions to developers.26 For example, for the item about customisation, we looked at whether the app provides a settings tool allowing users to change the interface to suit them best. For the interactivity item, we also extracted data on the manikin regarding dimension (2-dimensional or 3-dimensional), orientation (left/right) and gender (male, female or neutral). It is important to note that directly amending the MARS may affect the validity of the tool; however, adding separate, relevant items can explore these dimensions in more detail.
Functionality
We recommend using the IMS Institute for Healthcare Informatics functionality score to assess the functionality of apps.49 This records the availability of 11 different functions within an app, each rated 1 if present and 0 otherwise (see table 3). It complements the MARS functionality score, which measures the quality of performance, ease of use, navigation and design of an app using rating scales. The IMS functionality score comprises seven main criteria (inform, instruct, record, display, guide, remind or alert and communicate) and four subcategories under the ‘record’ item (collect, share, evaluate, intervene). An overall functionality score, between 0 and 11 for the full scale, is calculated by summing the scores across the individual items (see the sketch below). The IMS scale may be tailored to ensure relevance for a specific review; for instance, in our review of hand hygiene apps, we omitted the ‘evaluate data’ criterion because it was irrelevant to the topic.19
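Because the IMS items are binary, the overall score is a simple sum. A minimal sketch with illustrative values:

```python
# Illustrative IMS functionality scoring: 11 binary items (1 = present),
# seven main criteria plus four 'record' subcategories, summed for a 0-11 total.
ims_items = {
    "inform": 1, "instruct": 1, "record": 1, "display": 1,
    "guide": 0, "remind/alert": 1, "communicate": 0,
    # 'record' subcategories
    "collect": 1, "share": 0, "evaluate": 1, "intervene": 0,
}
functionality_score = sum(ims_items.values())  # here: 7 out of 11
```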
Written results from the IMS scale are generally supported with visual representations, such as radar graphs/charts, which map variables onto axes radiating from a central point (see figure 2). Each axis can represent a different IMS item, with values plotted onto each axis. These data visualisations can help clinicians, patients, developers and other stakeholders quickly see which health apps rank high or low across a range of evaluation metrics, informing decisions about which, if any, to use.
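Such a radar chart can be produced in a few lines. A minimal matplotlib sketch, with values matching the illustrative scoring example above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal radar chart of binary IMS functionality items for one app
# (item names and values illustrative, matching the scoring sketch above).
labels = ["inform", "instruct", "record", "display", "guide",
          "remind/alert", "communicate", "collect", "share",
          "evaluate", "intervene"]
values = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0]

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values_closed = values + values[:1]  # repeat first point to close the polygon
angles_closed = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles_closed, values_closed, linewidth=1.5)
ax.fill(angles_closed, values_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels, fontsize=8)
ax.set_yticks([0, 1])
plt.savefig("ims_radar.png", dpi=300, bbox_inches="tight")
```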
Other assessments
Other ways to evaluate health apps include examining the user reviews of each app on the app stores. Plante et al50 took this approach when reviewing blood pressure-measuring smartphone apps, downloading the ratings and reviews from the iTunes store and developing a series of narrative themes linked to high- and low-rated apps. Themes associated with high user ratings included accuracy, login functionality, convenience and successful measurement. In contrast, lower-rated apps were associated with inaccuracy, inability to produce a successful reading and refunds requested by users. A qualitative approach to understanding the experiences of a range of users can help inform the final evaluation of the included apps.
Other rating tools include THESIS, developed by evaluating over 200 mHealth apps with a panel of experts.51 THESIS encompasses six domains: (1) transparency, (2) health content, (3) technical content, (4) security/privacy, (5) usability and (6) subjective rating, which considers other factors such as software stability, interoperability, bandwidth and application size. Additionally, health apps have been evaluated for their security features (eg, app signing security, encryption schemes, malware presence, permissions and secure communication adoption) along with users’ subjective perceptions of app security.52 However, this approach requires technical expertise to undertake static and dynamic analysis techniques, which may be beyond the scope of some research teams.
Other reviews have also evaluated criteria such as ethical values and medical claims. Examples include assessing beneficence, non-maleficence, autonomy, justice and legal obligation in COVID-19 mobile phone apps, following the Systems Wide Analysis of mobile health-related Technologies in the NHS Digital Assessment Questionnaire,53 and examining the medical claims of mental health apps in terms of scientific language, technical expertise and lived experience perspectives.10
Step 7: analysis and synthesis of findings
Data synthesis may be performed by generating descriptive statistics (sums, averages, standard deviations and percentages) on relevant items or by combining these with forms of qualitative synthesis. Previously, we identified the highest-scoring apps in terms of functionality and quality, presenting these with a written description of their main features. Inter-rater reliability can be calculated for the binary IMS Institute for Healthcare Informatics functionality scores using Cohen’s kappa statistic,54 and an intraclass correlation coefficient (ICC) can be used to calculate inter-rater reliability for the ordinal MARS scores.55 The ICC is the most commonly used statistic for assessing inter-rater reliability for ordinal variables. We have typically used an absolute-agreement, 2-way mixed-effects, average-measures model,56 which assumes that the raters are fixed and that systematic differences between raters are relevant (a brief sketch of both calculations follows).
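Both statistics are available in common Python libraries. A minimal sketch with invented ratings, assuming scikit-learn and pingouin are installed; note that pingouin reports several ICC variants, and the row matching the model chosen for the review should be selected and checked against McGraw and Wong's definitions.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn installed
import pingouin as pg                          # assumes pingouin installed

# Cohen's kappa for two raters' binary IMS item judgements (values invented)
rater1 = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
rater2 = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
print("Cohen's kappa:", round(cohen_kappa_score(rater1, rater2), 3))

# ICC for two raters' MARS scores across several apps (long format, invented)
scores = pd.DataFrame({
    "app":   ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
    "rater": ["R1", "R2"] * 6,
    "mars":  [3.8, 4.0, 2.9, 3.1, 4.4, 4.2, 3.5, 3.6, 3.0, 2.8, 4.1, 4.3],
})
icc = pg.intraclass_corr(data=scores, targets="app", raters="rater",
                         ratings="mars")
# pingouin labels the average-measures variants ICC2k/ICC3k; pick the row
# matching the review's chosen model (eg, absolute agreement vs consistency).
print(icc[["Type", "ICC", "CI95%"]])
```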
Patient and public involvement and engagement
None of the commercial health app reviews generated by our research team actively included patients or the public, for pragmatic reasons including time, resources and funding constraints. Additionally, we did not identify any health app review recently published in the top medical informatics journals that took this approach. However, patient and public involvement and engagement (PPIE) is viewed favourably by many funders, researchers and policymakers, as it can add value to health research and facilitate its dissemination and impact. As outlined above, all stages of a commercial app review could benefit from the perspectives of patients, carers and members of the public on the health apps being evaluated. As with traditional systematic reviews, they could assist with reviewing the protocol, searching, screening, extracting and analysing data57 and contribute to co-production by undertaking quality, usability or other assessments and participating in various dissemination activities. PPIE could provide another valuable dimension to the process and enrich the results of a commercial health app review.
Key differences between systematic literature reviews and systematic health app reviews
Here, we summarise the main differences between a traditional approach to systematic reviewing literature versus undertaking a commercial health app review, as outlined in this methodological discussion (see table 4).
Discussion
This methods paper outlines the 7-step process for conducting systematic reviews of commercial health apps. Through comparison with systematic literature reviews, we explore the complexities of each stage of an app review and provide suggestions on how to formulate a research question, develop and run scoping searches, register the protocol, determine the eligibility criteria, conduct the final search and screening, extract data, perform quality assessments and synthesise the findings. We also propose that the novel TECH framework be adopted to allow a standardised specification to be developed and applied in health app reviews, similar to using PICO, SPICE or SPIDER in traditional systematic reviews.29 Additionally, we highlight the potential for PPIE activities within health app reviews.
Although health app reviews share core features with systematic reviews, three key differences warrant discussion. First, commercial health apps, and reviews of them, are more transitory, reflecting rapid changes in an mHealth landscape dominated by private industry providers. For example, the geographical and price specificity of app sources is not replicable in the way that evidence sources for systematic reviews are, and apps can appear, change and disappear quickly, affecting the replicability of search results. Authors must therefore report granular details of their searches and publish their review within a reasonable time of the search date, for example by using preprint servers while peer review in a scientific journal takes place. Furthermore, the resources available to the review team are more critical than in systematic reviewing, so it is essential to transparently document scoping searches and iterative adjustments to the review scope and inclusion criteria.58
Second, the critical appraisal process for health app reviews requires a radically different approach to the quality assessments used in literature reviews, with multiple approaches and tools being used to explore app functionality and quality. This proliferation of assessment approaches means there is no ‘gold standard’ equivalent to the Cochrane Risk of Bias tool for RCTs in systematic reviews.48 59 There is also no equivalent to the wider consideration of evidence certainty provided by Grading of Recommendations, Assessment, Development and Evaluation (GRADE) in systematic reviews.60 Researchers must select an approach that best fits the review aims. Additionally, they may need to consider national standards within some countries, such as the UK National Institute for Health Research guidance on evaluating digital health technologies,61 which may require specific approaches for the review context.
The third significant difference between systematic literature and health app reviews is the extent to which guidance, guidelines and infrastructure support them. The methodological and reporting guidance for health app reviews is in its infancy. In contrast, an extensive body of literature outlines methods for different types of systematic reviews and, more recently, scoping and rapid reviews.62–64 There are also clear reporting guidelines for systematic reviews, which have been expanded to scoping reviews but are lacking for health app reviews.34 65 While we are undertaking research to develop methods and reporting guidance for app reviews, systematic review guidance should be referred to and adapted as necessary as an interim measure. Ultimately, there may also be a need for a tool allowing critical appraisal of these types of mHealth reviews, paralleling those that exist for systematic reviews, such as AMSTAR 2 (ref 66) or ROBIS.67
We propose that this outline will guide the conduct of good-quality app reviews that can inform healthcare practitioners, patients, carers, health service managers, educators and policymakers. We also recommend the prospective registration of an app review protocol on OSF, a suitable alternative to PROSPERO, where systematic review protocols are held,68 and the use of preprint servers to make app reviews openly available online, allowing rapid dissemination of findings ahead of journal publication.
Implications
This outline will help ensure that others can easily replicate the methods and that future app reviews are conducted in a standardised and rigorous manner. However, while outlining the methods, we noted gaps in the conduct and reporting of commercial health app reviews that need addressing. Hence, we are developing reporting guidelines for systematic health app reviews and plan to subsequently develop a quality appraisal tool. Similar to the 27-item PRISMA guideline for systematic reviews34 and the 22-item PRISMA-ScR guideline for scoping reviews,65 our guideline will consist of a structured list of items that should be included when reporting commercial health app reviews.
For those conducting commercial health app reviews, there is an opportunity to include stakeholders to strengthen the quality and impact of the findings. This is particularly beneficial if the intended target audience experiences barriers when using health apps, as clear recommendations from an app review can help improve the design and function of future versions of an app. Researchers should also be aware of the context in which the review is being conducted. Namely, companies owning the apps may use the review for business development and promotion opportunities or contest the quality scores. However, this highlights an opportunity for further stakeholder engagement: researchers could collaborate or consult with developers so that product development aligns with how an app’s quality is assessed in research. This has the potential to promote accessibility and quality as aspects of development that might not otherwise be considered. While industry developers focus on creating a commercially viable product, understanding this review process can help them refine their development process and create a better app than initially proposed. Ultimately, it is important to be aware of any conflicts of interest between researchers, who conduct reviews in systematic and robust ways, and industry, which may wish to promote its work and benefit financially from the review findings. As with systematic reviews, collaborations that have the potential to generate such conflicts of interest should be fully and transparently reported, and review methods that minimise their potential impact should be implemented.
Strengths and limitations
This methodological discussion has numerous strengths, including an experienced research team who have conducted a range of health app reviews and various types of traditional literature and systematic reviews. This was supplemented by identifying and including relevant app reviews from the top medical informatics journals and by a robust analysis and comparison between traditional systematic reviews and commercial health app reviews. However, a systematic search of health app reviews was not undertaken (this will form part of our work in developing reporting guidance), nor did we focus on the emerging field of extended reality (ie, virtual, augmented and mixed reality) and its corresponding apps, many of which are health-related and available in other app stores (eg, Steam and Oculus/Meta). Additionally, our app reviews focused on apps for clients (eg, patients or the public) and healthcare providers rather than for the health system or data services, which are also target users of digital interventions.69 It is likely that the recommendations in this discussion about evaluating commercially available mHealth apps will also apply to other health apps, including extended reality apps, and could support researchers working in these fields.
Conclusion
Reviews of commercial apps can provide insights into the availability of apps for a specific health topic, including their quality and functionality. We have proposed a 7-step method in an effort to standardise the process of conducting mHealth reviews. At each step, we have discussed the methods in contrast to systematic literature reviews, given that the process should similarly be systematic and robust. We have also introduced the novel TECH acronym, which will assist researchers with writing research questions for app reviews and determining the eligibility criteria. Through ongoing collaboration, we will continue to advocate for transparency and quality in app reviews by working on reporting guidelines and a quality appraisal tool.
References
Footnotes
Twitter @shivoconnor
Contributors NG and DD conceptualised the paper. NG, DD, GN, LM, CE-T, DJ, AV, SMA and SO made intellectual contributions and contributed to writing the paper. All authors approved the final manuscript.
Funding NG, DD, GN, CE-T and SMA are funded by the National Institute for Health and Care Research Applied Research Collaboration Greater Manchester (NIHR ARC GM). LM is supported by a National Institute for Health and Care Research Senior Investigator Award to Professor Chris Todd (Reference NIHR200299). The views expressed in this publication are those of the authors and not necessarily those of the National Institute for Health and Care Research or the Department of Health and Social Care. Grant numbers: N/A.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.