Original research
Biomarker discovery studies for patient stratification using machine learning analysis of omics data: a scoping review
  1. Enrico Glaab1,
  2. Armin Rauschenberger1,
  3. Rita Banzi2,
  4. Chiara Gerardi2,
  5. Paula Garcia3,
  6. Jacques Demotes3
  7. the PERMIT Group
  1. 1Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
  2. 2Center for Health Regulatory Policies, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Milano, Italy
  3. 3European Clinical Research Infrastructure Network, ECRIN, Paris, France
  1. Correspondence to Dr Enrico Glaab; enrico.glaab{at}


Objective To review biomarker discovery studies using omics data for patient stratification which led to clinically validated FDA-cleared tests or laboratory developed tests, in order to identify common characteristics and derive recommendations for future biomarker projects.

Design Scoping review.

Methods We searched PubMed, EMBASE and Web of Science to obtain a comprehensive list of articles from the biomedical literature published between January 2000 and July 2021, describing clinically validated biomarker signatures for patient stratification, derived using statistical learning approaches. All documents were screened to retain only peer-reviewed research articles, review articles or opinion articles, covering supervised and unsupervised machine learning applications for omics-based patient stratification. Two reviewers independently confirmed the eligibility. Disagreements were solved by consensus. We focused the final analysis on omics-based biomarkers which achieved the highest level of validation, that is, clinical approval of the developed molecular signature as a laboratory developed test or an FDA-approved test.

Results Overall, 352 articles fulfilled the eligibility criteria. The analysis of validated biomarker signatures identified multiple common methodological and practical features that may explain the successful test development and guide future biomarker projects. These include study design choices to ensure sufficient statistical power for model building and external testing, suitable combinations of non-targeted and targeted measurement technologies, the integration of prior biological knowledge, strict filtering and inclusion/exclusion criteria, and the adequacy of statistical and machine learning methods for discovery and validation.

Conclusions While most clinically validated biomarker models derived from omics data have been developed for personalised oncology, first applications for non-cancer diseases show the potential of multivariate omics biomarker design for other complex disorders. Distinctive characteristics of prior success stories, such as early filtering and robust discovery approaches, continuous improvements in assay design and experimental measurement technology, and rigorous multicohort validation approaches, enable the derivation of specific recommendations for future studies.

  • biomarkers
  • scoping review
  • omics
  • machine learning
  • stratification

Data availability statement

The study protocol was published on the online platform Zenodo.19 Copies of searches and data extraction sheets will be made publicly available on Zenodo as part of the database collection for all scoping reviews conducted in the PERMIT project.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Strengths and limitations of this study

  • This scoping review provides an overview of biomarker discovery studies using machine learning analysis of omics data which have led to clinically validated diagnostic and prognostic tools.

  • The review discusses shared characteristics of successful biomarker studies as a guidance for study design, discovery and validation method choices for future projects.

  • Data extraction and analysis methods focus on deriving recommendations to optimise the design of prospective studies and improve analysis workflows for retrospective studies.

  • The review applied minimum eligibility criteria for sample size and statistical validation, but did not assess the quality of the included studies.


Personalised medicine is a rapidly developing area in healthcare research and practice, which aims at providing more effective and safer therapies tailored to the individual patient, by exploiting subject-specific molecular, clinical and environmental data sources (box 1).

Box 1

What is personalised medicine?

According to the European Council Conclusion on personalised medicine for patients, personalised medicine is ‘a medical model using characterisation of individuals’ phenotypes and genotypes (eg, molecular profiling, medical imaging, lifestyle data) for tailoring the right therapeutic strategy for the right person at the right time, and/or to determine the predisposition to disease and/or to deliver timely and targeted prevention’.116

In the context of the PERMIT project, we applied the following common operational definition of personalised medicine research: a set of comprehensive methods (methodology, statistics, validation, technology) to be applied in the different phases of the development of a personalised approach to treatment, diagnosis, prognosis or risk prediction. Ideally, robust and reproducible methods should cover all the steps between the generation of the hypothesis (eg, a given stratum of patients could better respond to a treatment), its validation and preclinical development, and up to the definition of its value in a clinical setting.19

A central tool in personalised medicine and the focus of this study is the machine learning (ML) analysis of omics profiling data to derive molecular biomarker signatures for disease-based or drug-based patient stratification.1 The major goals for ML-based omics biomarker development are to develop more reliable and robust tests for drug response prediction, early diagnosis, differential diagnosis or prognosis of the future clinical disease course.2 Omics-derived biomarker signatures may help to guide treatment decisions, and to focus therapies on the right populations to prevent overtreatment, increase success rates and reduce costs.3 As a research and information tool, they may enable a better monitoring of disease progression and treatment success, and guide new drug development and discovery.4 In contrast to classical single-molecule biomarker approaches, omics signatures have the potential to provide more sensitive, specific and robust predictions of disease-associated outcomes.5

However, while biomarker discovery projects using omics data have already led to the successful development of clinically validated diagnostic and prognostic tests,6–15 many biomarker studies are discontinued after early development stages or fail in later clinical validation stages. Dedicated statistical and ML methodologies for omics biomarker discovery and validation have been published, as well as recommendations for study design, implementation and reporting.16 17 The distinctive features and approaches which characterise prior successes in translating omics research findings into clinically validated tests have, however, not yet been investigated in detail. In order to guide future projects on suitable method choices, there is a need for dedicated studies on the key determinants of previous translational successes in ML-based omics biomarker development.

As part of an EU project on ‘Personalised Medicine Trials’ (PERMIT18), funded within the H2020 framework, we have therefore investigated the current methodological practices for personalised medicine, covering ML approaches for omics-based patient stratification as a major focus area. While a broader series of questions was established and examined for the overall scoping review,19 for this manuscript, we focused our analysis on biomarker discovery studies that have led to successful, clinically validated FDA-cleared tests or laboratory developed tests (LDTs), to determine their shared and distinctive characteristics compared with studies with no clinical translation. In particular, we aimed to address the following specific research questions:

  • Which omics-derived biomarker discovery studies have led to clinically validated tests for patient stratification (LDTs or FDA-cleared tests)?

  • What are the key characteristics shared by successful omics biomarker studies and distinguishing them from previously published biomarker studies which have not yet led to clinically validated tests?

  • Which types of model building and validation methods have been used to develop clinically validated biomarker signatures, and what are the lessons learnt and recommended workflows?

  • Which recommendations and guidelines have been proposed to address common challenges in biomarker development using omics data?

These questions lend themselves to a scoping review, because omics-derived biomarker development is still an evolving field, and a preliminary assessment of the potential scope and size of the available biomedical literature on these topics is required as a first step for further follow-up research. Therefore, the objective of this study was to address the above questions by retrieving and examining the current literature on biomarker discovery and validation studies using omics data and ML approaches. While the focus on articles describing discovery and validation approaches covers relevant aspects for clinical translation, we point out that other translational and regulatory aspects, such as the assessment of the clinical efficacy of biomarker-associated treatment decisions, the assessment of cost-effectiveness and research ethics, are not addressed in the present review, but have been discussed in previous dedicated articles.20–24 Our scoping review also does not aim at providing a quantitative benchmark evaluation of different ML approaches, but relevant studies have previously been presented for supervised ML,25 unsupervised clustering26 and survival prediction27 on multiple omics data types.


We conducted a scoping review following the methodological framework suggested by the Joanna Briggs Institute.28 This framework consists of six stages: (1) identifying the research questions, (2) identifying relevant studies, (3) study selection, (4) charting the data, (5) collating, summarising and reporting results and (6) consultation.

The scoping review approach was considered most suitable to respond to the broad scope and the evolving nature of the field. Compared with systematic reviews that aim to answer specific questions, scoping reviews present a general overview of the evidence pertaining to a topic and are useful to examine emerging trends, to clarify key concepts and identify gaps.29 30 Before conducting the review, a study protocol was published on the online platform Zenodo.19 Due to the iterative nature of scoping reviews, deviations from the protocol are expected and are duly reported where they occurred. We used the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews checklist to report our results31 (online supplemental file 1).

Study identification

Relevant studies and documents were identified, balancing feasibility with breadth and comprehensiveness of searches. We searched PubMed, EMBASE and Web of Science (last search date: 27 July 2021) for articles describing supervised or unsupervised ML analyses for biomarker discovery or personalised medicine, including both discovery and validation methods. The relevance of the search methodology was ensured by using a strict multistage filtering, considering only articles including at least one relevant search term per category from four categories of keywords (‘Personalised medicine/Biomarkers’, ‘Omics’, ‘Machine Learning’ and ‘Validation’, covering both synonyms for these terms and closely related keywords, see figure 1, illustrating the keyword-based search strategy, and online supplemental file 2 for the detailed search queries), and subsequently postfiltering the retrieved articles manually to exclude studies not involving omics-based biomarker research or lacking a description of ML and validation analyses (see sections on Eligibility criteria and Study selection). To cover only relevant scientific content, the scope was limited to journal publications and meeting abstracts from international conferences and workshops, and no other grey literature was included. We restricted inclusion to reports published from January 2000 to July 2021 (covering also ‘online first’ articles with official publication date in the future) in English, French, Spanish, Italian or German. Since, to the best of our knowledge, the first clinically validated FDA-cleared omics-derived biomarker signature was published in 2002,32 only a few preliminary discovery studies were expected to have taken place significantly earlier than 2002, and we, therefore, did not extend the search further backwards in time than January 2000.

Figure 1

Keyword-based search strategy for the scoping review. Four categories of keywords were defined to retrieve relevant articles from the biomedical literature on machine learning analyses of omics data for personalised medicine, which include a validation study (highlighted by the coloured boxes in the centre). For each category relevant keywords were determined, including controlled vocabulary terms from the Medical Subject Headings (MeSH) thesaurus by the US National Library of Medicine (upper and lower boxes with frames coloured according to the corresponding category). As indicated by the keyword ‘and’ in the centre, a conjunctive search was conducted, that is, every retrieved article had to contain at least one keyword from each category. This strategy was adapted for searching the other databases.
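The conjunctive logic of the search can be illustrated with a minimal Python sketch. The keyword lists below are abbreviated, hypothetical stand-ins for the full queries in online supplemental file 2, and simple substring matching is an assumption made for illustration; the actual searches used the query syntax of each database.

```python
# Hypothetical, abbreviated keyword lists; the real queries are far more extensive.
CATEGORIES = {
    "personalised_medicine": ["personalised medicine", "precision medicine", "biomarker"],
    "omics": ["omics", "microarray", "sequencing"],
    "machine_learning": ["machine learning", "statistical learning", "classifier"],
    "validation": ["validation", "independent cohort"],
}

def matches_all_categories(text, categories=CATEGORIES):
    """Return True if the text contains at least one keyword from every category."""
    lowered = text.lower()
    return all(
        any(keyword in lowered for keyword in keywords)
        for keywords in categories.values()
    )

abstracts = [
    "Machine learning on transcriptomics data yields a biomarker with external validation.",
    "A proteomics biomarker panel for sepsis.",  # no ML or validation keyword
]
selected = [a for a in abstracts if matches_all_categories(a)]
print(len(selected))
```

Only the first example abstract is retained, because the second lacks a keyword from the ‘Machine Learning’ and ‘Validation’ categories.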

Eligibility criteria

We included peer-reviewed methodology articles, review articles and opinion articles on supervised and unsupervised ML methods for omics disease prediction and stratification and associated statistical cross-validation (CV) and multicohort validation methods (addressing accuracy, robustness and clinical relevance). Only approaches tested on real-world biomedical omics data were reviewed, while studies relying purely on simulated data were excluded. We also excluded papers on biomarker methods without a demonstrated biomedical application, and those with insufficient sample size (ie, removing studies covering fewer than 50 samples per group for the main conditions studied, unless a dedicated power calculation was presented) or statistical validation (ie, lack of clear descriptions of CV or external testing methodology, performance metrics and test statistics). These exclusion criteria were not specified in the generic review protocol, but they were agreed among the authors prior to the screening process.
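As an illustration only, the sample size and validation criteria can be expressed as a simple rule. The record structure and field names in this Python sketch are hypothetical; in the review itself, eligibility was judged manually by two reviewers.

```python
from dataclasses import dataclass

MIN_SAMPLES_PER_GROUP = 50  # threshold stated in the eligibility criteria

@dataclass
class Study:
    # Hypothetical record structure, not the actual data extraction form.
    name: str
    group_sizes: dict            # samples per main condition, eg {"case": 60, "control": 55}
    has_power_calculation: bool  # a dedicated power calculation was presented
    describes_validation: bool   # CV or external testing, metrics and test statistics reported

def is_eligible(study):
    """Apply the sample size rule (or its power calculation exemption) and the validation rule."""
    enough_samples = all(n >= MIN_SAMPLES_PER_GROUP for n in study.group_sizes.values())
    return (enough_samples or study.has_power_calculation) and study.describes_validation

studies = [
    Study("A", {"case": 60, "control": 55}, False, True),
    Study("B", {"case": 30, "control": 80}, False, True),   # one group below 50
    Study("C", {"case": 30, "control": 30}, True, True),    # small, but power calculation given
    Study("D", {"case": 200, "control": 200}, False, False) # validation not described
]
print([s.name for s in studies if is_eligible(s)])
```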

To cover both data from original research papers and prior systematic reviews, we extracted information from three main article types: (1) applied research papers, (2) methodology articles with demonstrated applications and (3) review articles on methods, applications and validation approaches.

Apart from these inclusion and exclusion criteria, for the final result presentation, the statistical investigations covered all selected articles, whereas the detailed discussion of study characteristics focused on the studies that led to clinically validated biomarker signatures tested on multiple cohorts with large sample sizes (ie, studies using a power calculation to demonstrate the adequacy of the chosen sample sizes, or covering hundreds or thousands of samples per studied subject group).

Study selection

We exported the references retrieved from the searches into the online tool Rayyan.33 Duplicates were removed automatically using the reference manager Endnote V.X9 (Clarivate Analytics, Philadelphia, USA) and manually by the reviewers. One reviewer loaded the retrieved records into the online screening tool Rayyan,33 and two reviewers confirmed the eligibility independently by covering both the screening for all records and the full-text review for the articles preselected by the screening. Disagreements were solved by consensus.

Charting the data and synthesis of results

We designed a data extraction form using Excel (online supplemental file 3). General study characteristics extracted covered author names, title, citation, type of publication (eg, journal article, meeting abstract), study population and sample size (if applicable), methodology/study design and outcome measures (if applicable). Specific items associated with the topic of the scoping review included the study type (eg, case–control study, differential diagnosis study, prognostic study, review—methods, review—applications, review—validation); the article type (journal or conference article), the generic ML domain (eg, supervised/unsupervised); and the name of specific approaches for outcome prediction and for validation. Moreover, to capture key findings related to the review questions, relevant sentences were extracted from each reviewed article, and if needed, complemented by a brief explanatory remark, and by writing out abbreviations used in the original text.

The reviewers piloted the data extraction form using five records from the retrieved article collection. Two reviewers (EG, AR) working independently extracted the data from the included articles. In the case of disagreements, consensus was obtained by discussion.

In the final full-text review stage, the preselected articles were grouped by topic, categorising articles into applied versus methodological studies, supervised versus unsupervised analyses and assigning algorithm type identifiers to each article (review articles and papers on validation methodologies were considered as separate categories without a specific algorithm type assignment). The full-text review and categorisation of articles into different publication types was done through independent manual inspection by the two reviewers.

While the information on sample sizes and validation methods was documented as part of the data extraction (online supplemental file 4, a spreadsheet version has been made available on the online platform Zenodo34), it was not within the remit of this scoping review to assess the methodological quality of individual studies included in the analysis.

Consultation exercise

The members of the PERMIT consortium, associated partners, and the PERMIT project Scientific Advisory Board discussed the preliminary findings of the scoping review in a 2-hour online workshop.

Patient and public involvement

The European Patients' Forum is a member of the PERMIT project. Although not directly involved in the conduct of the scoping review, they received the draft review protocol for collecting comments and feedback.


Study selection and general characteristics of reports

We retrieved 1563 abstracts from the literature search. After the removal of duplicates, we screened the remaining 1475 abstracts for eligibility. A total of 619 records were excluded, while 856 abstracts were retained for the full-text assessment. Finally, we included 352 articles that passed all filtering criteria in the data extraction and analysis (see flow chart in figure 2 and online supplemental file 4, providing the reference for each selected article, as well as information on the study type and methodology, the outcome measures, the validation type, and representative sentences from each article on the main study results and key findings; a spreadsheet version of this table has been made available on the online platform Zenodo34).

Figure 2

Study selection flow diagram. Flow diagram of the procedure for the scoping review article identification, screening, eligibility assessment and final inclusion, according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) scheme.31 Reasons for excluding full-text articles were not mutually exclusive.

The full-text article review revealed that many studies did not meet the pre-defined inclusion criteria: 371 articles (43%) were removed because of an insufficient sample size, and 105 further articles (12%) were excluded because they provided insufficient details on the validation results or methodology (see figure 2). This shows that the challenges of recruiting an adequate number of participants per study group or conducting sufficient omics profiling experiments for robust model building and validation are not met in a large proportion of omics biomarker studies. Moreover, many studies lack adequate documentation for the study design and validation.
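The reported counts and percentages can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers given in the text; the variable names are ours, and the number of duplicates is inferred as the difference between retrieved and screened abstracts.

```python
# Counts reported in the study selection paragraph and figure 2.
retrieved = 1563
screened = 1475
excluded_at_screening = 619
included = 352
too_small = 371               # insufficient sample size
poor_validation_detail = 105  # insufficient validation details

duplicates_removed = retrieved - screened              # inferred, not stated explicitly
full_text_assessed = screened - excluded_at_screening
excluded_at_full_text = full_text_assessed - included

def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)

print(duplicates_removed, full_text_assessed, excluded_at_full_text)  # 88 856 504
print(percent(too_small, full_text_assessed))               # 43
print(percent(poor_validation_detail, full_text_assessed))  # 12
```

The percentages in the text are therefore relative to the 856 articles assessed in full text. Because the exclusion reasons were not mutually exclusive, the two reasons shown here cover at most 476 of the 504 articles excluded at this stage, so some articles must have been excluded for other reasons.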

For the selected articles that cover primary research on omics biomarker studies, the majority (78%) rely entirely on an internal validation involving data from only a single cohort, whereas studies that use an external validation on an independent cohort are still underrepresented (only 12% of articles describe both an internal CV and an external cohort validation, and an additional 10% include an external validation, but do not report internal CV results). However, when comparing the numbers of published studies over different periods of time during the past 20 years, the relative proportion of studies including an external validation has increased in recent years (see figure 3), suggesting a growing recognition of the importance of independent, multicohort validation.

Figure 3

Validation methods used in omics biomarker studies. Stacked bar chart of the number of articles retrieved in the scoping review for different categories of validation methods used in the underlying biomarker studies (covering time periods from 2000 to 2021). The majority of studies use only internal cohort validation approaches, such as CV, training/test set split validation, resampling/bootstrapping-based validation, out-of-bag validation (for tree-based classifiers), and combinations of CV and test set validation within the same cohort. Studies with an external validation on an independent patient cohort (with or without an additional internal CV) are still underrepresented, even in more recent time periods. All filtered full-text articles derived from the scoping review except for review articles were included in the analysis.
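To make the distinction between the internal validation schemes concrete, the following Python sketch generates sample index splits for k-fold CV and for bootstrap resampling with out-of-bag testing. It is a minimal illustration with function names of our choosing, not code from any reviewed study. Both schemes reuse samples from a single cohort, whereas an external validation evaluates the final model on an independent cohort that played no part in model building.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds covering all samples
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_split(n_samples, seed=0):
    """Draw a bootstrap training sample; samples never drawn form the out-of-bag test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]  # sampling with replacement
    drawn = set(train)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return train, out_of_bag

for train, test in k_fold_indices(n_samples=10, k=5):
    print(sorted(test))
print(bootstrap_split(n_samples=10))
```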

Next, we investigated the countries of origin for the selected articles, showing that the USA contributes the largest proportion of validated biomarker studies (28%), followed by China (18%), Canada (5%), Germany (4%) and the UK and India (both 3%; see also figure 4, providing a map visualisation of the country statistics). These country representations show limited correlation with population sizes and may largely reflect the worldwide variation in relative biomedical research productivity reviewed in a previous study.35 Since the most prolific countries in the development of molecular diagnostics have already set up policies and regulations for omics-based and ML-based in vitro diagnostics and medical devices (eg, see the life cycle regulation of artificial intelligence based and ML-based software devices in the USA36), they may also provide a role model for countries still in the process of establishing similar regulatory frameworks.

Figure 4

Map representation of country statistics for the selected articles. The number of articles originating from different countries among the studies selected in the full-text review is visualised on a world map representation using a colour gradient from blue (1 article) to red (98 articles=maximum contribution by a single country; using a logarithmic colour gradient scale to highlight differences over a broad value range).
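A logarithmic colour scale of this kind can be sketched as follows. The exact mapping used for the figure is not described, so the natural-log normalisation below is an assumption; it places 1 article at the blue end (0.0) and 98 articles at the red end (1.0).

```python
import math

def log_colour_position(count, max_count=98):
    """Map an article count in [1, max_count] to a position in [0, 1] on a log scale."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return math.log(count) / math.log(max_count)

for count in (1, 10, 98):
    print(count, round(log_colour_position(count), 2))
```

On this scale a country with 10 articles already sits near the midpoint of the gradient, which is why differences among countries with few articles remain visible.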

When inspecting the representation of study design types in the filtered article collection, the great majority of documents described diagnostic studies (67%), prognostic and survival prediction studies were covered in 8% of articles, and studies examining therapy or drug response in 7% (see figure 5). Apart from this, 13% of articles were reviews on methodologies and applications in the field, and 5% of articles described other rare study types (eg, tissue-of-origin prediction studies or combinations of different study types).

Figure 5

Representation of study types among the selected articles. The percentage of articles describing case–control studies, therapy/drug response studies, differential diagnosis studies, prognostic and survival prediction studies, as well as review studies and other study types is represented as a pie chart.

Since a detailed discussion of all filtered articles is not within the scope of the present review, in the following, we focus on reviewing representative omics biomarker studies which achieved the highest validation level, that is, clinical approval of the developed molecular signature as an LDT or FDA-approved test (see the overview of studies in table 1 and the FDA website37). We investigate the shared features of these successful studies, examine how they address common shortcomings and missing features of other reviewed studies, and summarise the lessons learnt.

Table 1

Examples of clinically approved omics-derived diagnostic or prognostic test designs applied to personalised medicine (synonyms for the same test are separated by the ‘/’-symbol)

Success stories in OMICs-based biomarker signature development

Cancer approved omics-derived diagnostic tests (nine studies)

The first and most well-known omics-derived molecular test to receive FDA clearance was MammaPrint, a prognostic signature using the RNA expression activity of 70 genes to estimate the risk for distant tumour metastasis and recurrence in early-stage breast cancer patients.6 32 38–41 This test was developed at the Netherlands Cancer Institute, using DNA microarray analysis to investigate primary breast tumours of 117 patients. Supervised ML was applied to the resulting data to identify a highly predictive gene signature for a short interval to distant metastases in lymph node negative patients.32

A distinctive feature of the development approach behind this signature in comparison to other reviewed studies was the multistage filtering and CV strategy used in the initial discovery study, which may explain the repeated confirmation of the signature in later validation studies.6 38–41 From 25 000 genes represented on the DNA microarrays, only those significantly regulated in more than 3 tumours out of 78 sporadic lymph-node negative patients were preselected, and further filtered by retaining only the genes with a minimum absolute correlation with the disease outcome of 0.3. The resulting list of 231 genes, rank-ordered by absolute correlation, was investigated by sequentially adding the next top five genes from the list to a candidate ML classifier and evaluating its performance by leave-one-out CV. This procedure was repeated as long as the estimated accuracy of the classifier improved, providing a final candidate signature of 70 genes. The final signature was validated on multiple independent test sets, including a set of 19 external samples in the original study and several additional validations on independent cohorts in follow-up studies.6 38–41
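The correlation filter and stepwise selection described above can be sketched in Python on synthetic data. This is an illustration under stated assumptions, not the original implementation: a nearest-centroid classifier with Euclidean distance stands in for the candidate ML classifier, the data are simulated, the initial per-tumour regulation filter is omitted, and all function names are ours.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def loocv_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to `genes`."""
    correct = 0
    for held_out in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[i] for i in range(len(X)) if i != held_out and y[i] == label]
            centroids[label] = [statistics.fmean(r[g] for r in rows) for g in genes]
        sample = [X[held_out][g] for g in genes]
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(sample, centroids[lab])),
        )
        correct += predicted == y[held_out]
    return correct / len(X)

def select_signature(X, y, min_abs_corr=0.3, step=5):
    """Filter genes by |correlation|, rank them, then add `step` genes at a time
    for as long as the leave-one-out accuracy keeps improving."""
    n_genes = len(X[0])
    corr = {g: abs(pearson([row[g] for row in X], y)) for g in range(n_genes)}
    ranked = sorted((g for g in corr if corr[g] >= min_abs_corr), key=corr.get, reverse=True)
    best_genes, best_acc = [], 0.0
    for end in range(step, len(ranked) + step, step):
        candidate = ranked[:end]
        acc = loocv_accuracy(X, y, candidate)
        if acc <= best_acc:
            break  # stop once adding genes no longer improves the estimate
        best_genes, best_acc = candidate, acc
    return best_genes, best_acc

# Synthetic example: 40 samples, 60 genes, the first 10 genes carry signal.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(1.5 * label if g < 10 else 0.0, 1.0) for g in range(60)] for label in y]
signature, accuracy = select_signature(X, y)
print(len(signature), accuracy)
```

Note that in this sketch the correlation filter uses all samples before the leave-one-out loop, which tends to make the accuracy estimate optimistic; this is one reason why validation on independent test sets, as performed for the signature, is essential.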

The MammaPrint signature provided the role model for the subsequent development of a similar prognostic test for colon cancer, ColoPrint.42–47 This test aims at detecting the approximately 20% of patients with stage II colon cancer expected to experience a relapse and develop distant metastases. It uses an 18-gene expression signature, developed by analysing DNA microarray data in a similar manner to the MammaPrint approach. The diagnostic approach has been commercialised as an LDT to assist physicians in selecting treatment options for colon cancer patients. Similar to MammaPrint, the signature development was characterised by extensive discovery and validation studies, which involved multiple statistical reproducibility, stability and precision analyses for independent, large-scale patient cohorts.48

Another widely used cancer-related LDT, which received clearance by the U.S. Food and Drug Administration (FDA) in 2013, is the Prosigna Breast Cancer Prognostic Gene Signature Assay, previously called PAM50 test.49–53 This assay assesses mRNA expression for a signature of 58 genes (50 target genes + 8 endogenous control genes) to predict the risk of distant recurrence for hormone-receptor-positive breast cancer between 5 and 10 years after diagnosis (prerequisites are that the patients have been treated with hormonal therapy and surgery, and are stage I or stage II lymph-node negative, or in stage II with one to three positive nodes). The test development started with a microarray discovery study and involved a multistage filtering, using consecutive applications of statistical tests and CV to propose a subset of candidate gene markers.54 The authors compared the reproducibility of classification scores obtained with these markers for three centroid-based prediction methods to ensure the robustness of the methodology. By further developing the approach into a more sensitive PCR-based test, and later into an assay using the NanoString nCounter Dx Analysis System, the predictive performance was improved in a stepwise fashion. The original discovery study was characterised by significantly larger sample sizes than the majority of reviewed biomarker studies, with a training set of 189 samples, test sets of 761 patients evaluated for prognosis, and 133 patients evaluated for prediction of pathologic complete response to treatment with taxane and anthracycline. These study design features in combination with multistage filtering and validation approaches, and improved measurement technology during the course of the study, may explain the successful progression of the PAM50 test to FDA clearance. The test has only three genes in common with the MammaPrint approach (KNTC2, MELK, ORC6L), which may be explained by the different technical and analytical approaches used, but a previous comparative evaluation concluded that the tests provide broadly equivalent risk information for females with oestrogen receptor (ER)-positive breast cancers.55

Among the LDTs for breast cancer prognosis, Oncotype DX is a further test commonly used in clinical practice.8 56–59 The underlying gene signature consists of 16 cancer-associated genes and five reference genes, and is therefore often also referred to as the ‘21-gene assay’. Its main application is to predict risk of recurrence in oestrogen-receptor positive tumours. The relevance of this prognostic tool for treatment selection may be explained by the strong association of the provided recurrence score with the probability of positive treatment response to chemotherapy.60 Oncotype DX was developed using a consecutive refinement procedure, starting with the reverse transcription–polymerase chain reaction (RT-PCR) assessment of 250 candidate genes across 447 patients from three distinct studies to identify the 21-gene signature after multiple filtering steps. A recurrence score algorithm built using the signature as input was clinically validated on 668 independent patients.61 The selection of the 16 cancer-related genes included in the assay involved scoring the performance of the candidate features in all three studies and the consistency of the primer/probe performance in the assay.62 Thus, particular strengths of the development process for this LDT include the consideration of both technical robustness and statistical robustness of the assay across distinct cohorts. The Oncotype DX signature shares one gene with MammaPrint (SCUBE2), and nine genes with the Prosigna PAM50 test (BIRC5, CCNB1, MYBL2, MMP11, GRB7, ESR1, PGR, BCL2, BAG1). However, an independent clinical validation of Oncotype DX and the PAM50 signature for estimating the likelihood of distant recurrence in ER-positive, node-negative, post-menopausal breast cancer patients treated with endocrine therapy suggested that the PAM50 signature provided more prognostic information than Oncotype DX.63

While the first validated omics biomarker signatures were developed for breast cancer, similar diagnostic and prognostic tools have followed for other cancer types. One of these is the Decipher Prostate Cancer Test,9 64–68 which differs from other omics-derived diagnostic tools by being provided together with a software platform and database, the Decipher Genomic Resource Information Database (GRID), that captures 1.4 million expression markers per patient to facilitate personalised care. The test itself uses 22 preselected RNAs to predict clinical metastasis and cancer-specific mortality for patients who have undergone radical prostatectomy. An initial discovery study by the Mayo Clinic (Rochester, Minnesota, USA) investigated a cohort of 545 such patients, split into a training cohort (n=359) and a validation cohort (n=186). Similar to other LDTs, the discovery started with genome-wide profiling and used both statistical and ML analyses for filtering. First, t-tests were applied (reduction from 1.4 million to 18 902 differentially expressed RNAs), then regularised logistic regression (reduction to 43 candidate markers), and finally a random forest-based feature selection (reduction to the final set of 22 RNAs). Apart from testing the signature in the validation cohort, further external validations were performed in subsequent studies.9 64–68 Overall, distinctive strengths of the approach used include the improved interpretability of the test results through supporting analyses on the GRID platform, and the robustness of the discovery and validation approach, involving large sample sizes and several complementary statistical and ML assessments.
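
The first stage of such a multistep filtering procedure can be illustrated with a minimal sketch. It ranks features by the absolute Welch t-statistic between two outcome groups and keeps the top-ranked ones. The toy data, the function names (`welch_t`, `t_filter`) and the cut-off `keep` are hypothetical; the exact test variant and thresholds used in the discovery study are not given here, multiple testing correction is not shown, and the subsequent regularised regression and random forest stages are omitted.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def t_filter(expr, labels, keep):
    """Return the names of the `keep` features with the largest |t|.

    expr   : dict mapping feature name -> list of values, one per sample
    labels : list of 0/1 outcome labels, aligned with the samples
    """
    scores = {}
    for name, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)[:keep]

# Toy data (hypothetical): 3 features measured in 6 samples.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "rna_a": [5.1, 4.9, 5.3, 2.0, 2.2, 1.9],  # clearly separated groups
    "rna_b": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],  # no group difference
    "rna_c": [4.0, 3.5, 4.2, 3.0, 3.4, 2.9],  # moderate difference
}
selected = t_filter(expr, labels, keep=2)
```

In a genome-wide setting, a univariate filter of this kind mainly reduces the dimensionality before multivariate methods are applied, which is why the later stages matter for the final signature.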

While most diagnostic tests in oncology have been designed for specific cancer types, a dedicated LDT has also been developed for cancers of unknown or uncertain diagnosis. The Cancer Type ID test by bioTheranostics distinguishes between 50 different tumour types using a 92-gene RT-PCR expression measurement signature.15 69–71 This signature was derived from analyses of a microarray data collection covering 446 frozen tumour samples and 112 formalin-fixed, paraffin-embedded (FFPE) samples of both primary and metastatic tumours. Modelling steps involved k-nearest neighbour clustering and classification, and a genetic algorithm to explore the search space of possible feature subset selections. After successful CV (84% accuracy) and external validation (82% accuracy on 112 independent FFPE samples), the microarray-based signature was further developed to use more sensitive RT-PCR measurements. Testing the new approach on an independent validation set provided an increased accuracy (87%). Distinctive characteristics of the development process that may have contributed to the positive validation include the efficient and extensive exploration of the search space of possible gene subset selections via a genetic algorithm, the large sample sizes used for discovery and validation, and the transfer of the assay from microarrays to the more sensitive RT-PCR platform.

The first omics-derived biomarker signatures addressed only the most frequent cancer types, but more recent applications in oncology focus on the diagnosis of less common malignancies, such as thyroid cancer. Typically, deciding whether a thyroid nodule is benign or cancerous is possible via a fine needle aspiration (FNA) biopsy, without requiring more complex measurements or analyses. However, while direct FNA-based diagnosis is feasible in most cases, indeterminate results can occur.72 To help prevent unnecessary surgeries for the corresponding patients, a molecular signature and LDT known as the Afirma Gene Expression Classifier (GEC) has been developed to discriminate benign from cancerous thyroid nodules.72–77 The original discovery study behind the GEC signature used mRNA expression analysis in 315 thyroid nodules, covering 178 retrospective surgical tissues and 137 prospectively collected FNA samples. Two ML classifiers were trained separately on surgical tissues and FNAs, assessing the test set performance on 48 independent, prospective FNA samples (50% of which had indeterminate cytopathology). Discriminative features were selected using a linear modelling approach implemented in the software Limma, and a linear support vector machine was applied for model building and performance estimation via 30-fold CV. The successful CV results were confirmed on multiple distinct cohorts.72 75–78 While the internal validation used in the initial study cannot address cohort-specific biases, the combined use of established feature selection and modelling approaches, and the subsequent external validation across multiple cohorts with large sample sizes may account for the successful translation of this signature.

Most omics-based diagnostic tests identified in our study rely purely on gene expression profiling data. However, more recently, first multiomics signatures for diagnostic purposes have been developed. One of the first LDTs that integrated information from both RNA and DNA sequencing was the FoundationOne Heme assay.14 79–81 This assay aims to detect haematologic malignancies, sarcomas, paediatric malignancies or solid tumours (including among others leukaemias, myelodysplastic syndromes, myeloproliferative neoplasms, lymphomas, multiple myeloma, Ewing sarcoma, leiomyosarcoma and paediatric tumours). The test identifies four types of genomic alterations (base substitutions, insertions and deletions, copy number alterations, rearrangements) and reports microsatellite instability and tumour mutational burden to facilitate clinical decision making. This approach was originally developed and evaluated using reference samples of pooled cell lines in order to model the main characteristics that determine the test accuracy, including mutant allele frequency, indel length and amplitude of copy change.79 A first validation using 249 independent FFPE cancer samples, which had already been characterised by established assays, confirmed the accuracy of the test. External validation studies on independent cohorts corroborated the utility of the test for further diagnostic applications.14 82 The study results highlight the potential of integrating diverse biological data sources in order to obtain more robust and reliable predictions, a strategy that may be promising in particular for complex disorders that involve very heterogeneous phenotypes.

A common limitation of genomic profiling approaches for diagnostic testing is that most analyses have to be performed in centralised specialty laboratories, which limits a wider use and results in long waiting times. To address this shortcoming, the Elio Tissue Complete assay, an in vitro diagnostic test cleared in 2020 by the FDA for assessing somatic mutations and tumour mutation burden (TMB) in solid tumours, has been developed as an integrated DNA-to-report approach to enable a decentralised evaluation in all diagnostic labs with next generation sequencing (NGS) technology.83 The analytical performance of the test was assessed by comparing it with the FoundationOne test (see above) using a concordance analysis on 147 tumour specimens. It provided a positive percent agreement (PPA) above 95% for single nucleotide variants (SNVs) and insertions/deletions, and 80%–83% PPA for copy number alterations and gene translocations.83 The test has recently also been applied to investigate the response to immune checkpoint inhibitors (ICIs) in metastatic renal cell carcinoma, using a retrospective evaluation of SNVs, TMB, microsatellite status and genomic status of antigen presentation genes.84 While no correlation between treatment response and TMB was observed, one-third of patients with progressive disease following ICI therapy displayed loss of heterozygosity of major histocompatibility complex class I genes vs 6% of disease control patients, suggesting that loss of antigen presentation may restrict ICI response.84 In summary, the Elio Tissue Complete assay provides an example of how integrating NGS analyses with bioinformatics in a combined DNA-to-report approach could help to broaden the access to genomic diagnostics for both clinical and research applications.
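
As a brief illustration of the concordance metric mentioned above, the following sketch computes a positive percent agreement as the share of variants reported by a comparator assay that are also reported by the test under evaluation. The variant calls and the function name `positive_percent_agreement` are hypothetical and do not reproduce the data or the software of the cited study.

```python
def positive_percent_agreement(reference_calls, test_calls):
    """PPA = variants detected by both assays / variants detected by the comparator."""
    reference = set(reference_calls)
    if not reference:
        raise ValueError("comparator assay reported no variants")
    shared = reference & set(test_calls)
    return 100.0 * len(shared) / len(reference)

# Hypothetical variant calls, written as gene:change strings.
reference = ["KRAS:G12D", "TP53:R175H", "BRAF:V600E", "EGFR:L858R"]
test = ["KRAS:G12D", "TP53:R175H", "BRAF:V600E", "PIK3CA:H1047R"]
ppa = positive_percent_agreement(reference, test)  # 3 of 4 comparator calls matched
```

Because the comparator assay is not an error-free gold standard, agreement statistics of this kind describe concordance between two assays rather than sensitivity in the strict sense.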

Non-cancer approved omics-derived diagnostic tests (four studies)

While most clinically approved omics-derived diagnostic tests have been developed in the field of oncology, one of the first LDTs that received FDA clearance for a non-cancer disease was the AlloMap Heart test.13 85–87 It uses a gene expression signature of 11 target genes and 9 control genes in peripheral blood from heart transplant recipients to estimate the risk of acute cellular cardiac allograft rejection. The development process involved statistical analyses of leucocyte microarray profiling data from 285 samples, and subsequent RT-PCR validation and bioinformatics postprocessing.13 Prior knowledge from database and literature mining was included in the analysis by mapping the data to known alloimmune pathways. This allowed the researchers to narrow the selection down to 252 candidate marker genes. An RT-PCR validation on 145 samples confirmed 68 of these candidate genes, which distinguished rejection samples from quiescent samples according to a t-test (p<0.01). Six genes were eliminated due to significant variation in gene expression with sample processing time. Next, the investigators averaged correlated gene expression levels to create robust meta-level features, called ‘metagenes’, and added 20 of these features as new variables. A linear discriminant analysis was applied, providing a prediction model using four individual genes and three metagenes, which aggregate information from 11 original genes. Finally, bootstrap validation procedures and external test set validations were performed to confirm the accuracy of this signature. Overall, distinctive aspects of the development approach for the AlloMap signature include the knowledge-based gene discovery, a comprehensive RT-PCR validation of candidate genes, and the robust bootstrap and external validation analyses.
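
The idea of averaging correlated genes into ‘metagenes’ can be sketched as follows. This is an illustrative assumption of how such features might be built, not the procedure of the cited study: genes are grouped greedily around a seed gene using a hypothetical Pearson correlation threshold of 0.7, and each group is averaged sample by sample. The gene names and values are invented, and genes with constant expression would need to be removed beforehand to avoid a division by zero.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def build_metagenes(expr, threshold=0.7):
    """Greedily group genes correlated with a seed gene and average each group."""
    remaining = list(expr)
    metagenes = {}
    while remaining:
        seed = remaining.pop(0)
        group = [seed] + [g for g in remaining
                          if pearson(expr[seed], expr[g]) >= threshold]
        remaining = [g for g in remaining if g not in group]
        n_samples = len(expr[seed])
        # Average the expression of all group members for every sample.
        metagenes["+".join(group)] = [
            mean(expr[g][i] for g in group) for i in range(n_samples)
        ]
    return metagenes

# Hypothetical expression values for three genes in four samples.
expr = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [1.2, 1.9, 3.1, 4.2],  # closely tracks gene1
    "gene3": [4.0, 1.0, 3.5, 1.5],  # unrelated pattern
}
metagenes = build_metagenes(expr)
```

Averaging correlated measurements reduces the influence of noise in any single gene, which is one plausible reason why such aggregated features can be more stable inputs for a classifier such as linear discriminant analysis.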

The first clinically validated LDT for a cardiovascular indication derived from omics data was the Corus coronary artery disease (CAD) test, developed to identify CAD in stable non-diabetic patients.11 88–91 In contrast to most other omics-based tests, Corus CAD is not a pure molecular signature test, but takes the clinical covariates gender and age into account. The initial discovery study used a retrospective microarray analysis of blood samples from 195 diabetic and non-diabetic patients from the Duke University CATHGEN registry. After ranking the studied genes by the statistical significance of group differences and prior biological knowledge on their disease relevance, 88 genes were selected for RT-PCR validation. Because diabetes status as a clinical covariate was significantly associated with the observed gene expression alterations, and the identified CAD-associated genes did not overlap between diabetic and non-diabetic patients, the authors decided to limit follow-up work to non-diabetic patients. In a prospective clinical trial, microarray profiling was conducted on blood samples from 198 patients, and top-ranked genes were further validated using RT-PCR for 640 blood samples. After multiple filtering steps, taking into account statistical significance in t-tests, biological relevance, gene correlation clustering and cell-type analyses, a final signature of 23 genes was derived, composed of 20 CAD-associated genes and 3 reference genes.92 To maximise the predictive performance, the final prediction algorithm was optimised to adjust for differences associated with age and gender. Compared with most other reviewed studies, the Corus CAD approach stands out by taking clinical covariates into account in the final prediction model, including an intermediate critical review and adjustment of the inclusion criteria (limiting the focus to non-diabetic patients), and integrating complementary filtering and validation analyses on large sample sizes.

For inflammatory diseases, a first omics-derived signature recently received approval for measuring rheumatoid arthritis (RA) inflammatory disease activity, the Vectra DA multibiomarker test.93–97 It uses blood serum samples and multispot 96-well immunoassay plates to assess serum concentrations of 12 protein biomarkers associated with the pathobiology of RA. The original Vectra DA score, which combines these measurements into a composite score between 1 and 100, was assessed via multivariate regression and displayed a high predictive power in estimating a standard RA score, the Disease Activity Score in 28 joints using the C-reactive protein level (DAS28-CRP), in both seropositive (area under the receiver operating characteristic curve (AUC): 0.77, p<0.001) and seronegative (AUC: 0.70, p<0.001) patients.97 This score was later adjusted for age, gender and adiposity (based on leptin concentration), and validated in two cohorts against DAS28-CRP as a prognostic test for radiographic progression during the next year. The results showed that the new adjusted score was the most accurate independent predictor of progression, with the rate of progression increasing from <2% in the low (1–29) adjusted score category to 16% in the high (45–100) category.95 Overall, the Vectra DA approach illustrates the utility of omics-based biomarker signatures for prognostic applications in inflammatory disorders, and further highlights the benefit of integrating omics signatures with information from clinical covariates.
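
To make the AUC values quoted above easier to interpret, the following sketch computes the AUC as the probability that a randomly chosen case from the positive group receives a higher score than a randomly chosen case from the negative group, with ties counted as one half. It assumes that the reference measure has been dichotomised into two groups; the threshold used in the cited study is not given here, and the scores below are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical composite scores (1-100 scale) for patients with high versus
# low disease activity according to the reference measure.
high_activity = [62, 48, 71, 39]
low_activity = [25, 41, 33, 48]
value = auc(high_activity, low_activity)  # 13.5 of 16 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values such as 0.70 to 0.77 indicate a moderate ability to discriminate between the two groups.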

For neurodegenerative disorders, clinically approved diagnostic and prognostic omics-derived tests are still lacking. However, the Helix Genetic Health Risk App for Late-onset Alzheimer’s Disease (AD) was recently cleared by the FDA for over-the-counter use. It detects clinically relevant variants in genomic DNA isolated from the saliva of individuals ≥18 years in order to report and interpret genetic health risks, and evaluates the information of variants with established genome-wide significant associations with AD. When tested on 99 human saliva samples, the accuracy was 100% with a lower 95% CI bound of 96.3%.98 The approach uses a whole exome sequencing (WES) constituent device, the Helix Laboratory Platform,99–101 as a qualitative in vitro diagnostics approach covering measurements for approximately 20 000 genes. The Helix Laboratory Platform has received FDA clearance through a new regulatory approval pathway established by the FDA for WES devices (Regulation 21 CFR 866.6000). Due to the generic applicability of the WES profiling assay used by this platform, called Exome+, the assay has also been applied to find statistically significant gene-based associations for several other phenotypes in large-scale cohort studies99 and to identify carriers of autosomal dominant diseases by population-based genetic screening.101 Thus, the Helix Laboratory Platform provides a first example of a new approval pathway for omics-based diagnostic tests, in which a clinically approved genomic testing device is no longer linked to a single diagnostic application or a specific disease type. Instead, the market authorisation for diagnostic tests is obtained separately from the device, and is facilitated and accelerated by the prior approval of the constituent measurement device. For the future development of omics-derived biomarker signatures, this may allow researchers to focus on demonstrating the clinical utility of a new signature, while the analytical validity of the underlying testing device has already been established.
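
The lower confidence bound reported above for the accuracy on 99 saliva samples is consistent with an exact (Clopper-Pearson) binomial interval, although the method actually used is not stated in this article. As a minimal worked example under that assumption: when all n results are correct, the lower limit p of the two-sided interval solves p^n = α/2. The function name is hypothetical.

```python
def exact_lower_bound_all_correct(n, confidence=0.95):
    """Lower limit of the two-sided Clopper-Pearson interval for n of n correct.

    With n successes in n trials, the lower limit p solves p**n = alpha / 2.
    """
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)

lower = exact_lower_bound_all_correct(99)
print(f"{100 * lower:.1f}%")  # prints 96.3%
```

The example also shows why an observed accuracy of 100% does not imply a perfect test: with 99 samples, true accuracies down to roughly 96% remain compatible with the data.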


Statement of principal findings

The scoping review of articles on patient stratification using omics data revealed common limitations in the study design for many published biomarker development projects, such as insufficient and imbalanced sample sizes per study group and inadequate validation methods, but also identified multiple studies that have led to validated diagnostic and prognostic tests. These success stories were investigated in more detail to identify common characteristics in the study design, discovery and validation methods, which may have supported the clinical translation of the initial findings. Figure 6 outlines key shared aspects that are possible determinants of the study success and could help to guide future biomarker investigations. In particular, they cover the following main features:

  1. A sample size selection, study group and replicate design that provides adequate statistical power for the ML analyses.

  2. The application of robust statistical filtering and evaluation schemes (including multiple layers of statistical and ML-based feature selection, combined statistical and biological filters, robust validation schemes that involve multiple CV, bootstrapping and external validation analyses, using multiple suitable and complementary performance metrics, and providing information on the statistical variation and confidence intervals for the performance estimates, see figure 7 for an overview of recommended generic steps for robust model building and evaluation).

  3. Clarity of the study scope and goals (involving clear inclusion and exclusion criteria, primary and secondary outcomes, and decision processes to make necessary adjustments due to new knowledge gained during the project, such as the adjusted inclusion criteria in the Corus CAD study and the progression from non-targeted microarray technology to higher-sensitivity RT-PCR in the case of the Prosigna test and the Cancer Type ID test).

  4. Completeness and reproducibility of the study documentation (covering details on used instruments, parameters and settings, reproducible methods descriptions and information on data provenance).

  5. Interpretability and biological plausibility of the created predictive models (including explainable and justifiable predictions, human-interpretable model descriptions, and biologically plausible models that agree with the current mechanistic understanding of the studied disorder).

  6. Integration of prior biological knowledge into the predictive feature selection, model building and validation procedures (eg, using public data on disease-associated molecular pathways and networks; complementary clinical and real-world data, and relevant multiomics data).
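
As an illustration of the second point, the following sketch reports a performance estimate together with a measure of its statistical variation, using a percentile bootstrap interval for classification accuracy. The predictions, the function name `bootstrap_accuracy_ci` and the settings (2000 resamples, fixed random seed) are hypothetical choices for this example and are not taken from any of the reviewed studies.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, confidence=0.95, seed=0):
    """Accuracy with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    estimates = []
    for _ in range(n_boot):
        # Resample the per-patient results with replacement.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    tail = (1.0 - confidence) / 2.0
    lower = estimates[int(tail * n_boot)]
    upper = estimates[int((1.0 - tail) * n_boot) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical external-validation predictions for 20 patients (17 correct).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
accuracy, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
```

With only 20 patients the interval is wide, which illustrates how closely the first point (adequate sample size) and the second point (robust evaluation) are linked.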

Figure 6

Characteristics of successful omics-based studies. Six main categories of design and implementation aspects that characterise successful omics-based biomarker development studies were identified (starting from the centre left in the figure and proceeding clockwise): (1) adequacy of the study design and sample size selection; (2) rigour and robustness of the statistical evaluation; (3) clarity of scope and goals; (4) completeness and reproducibility of the study documentation; (5) interpretability and biological plausibility of the created predictive models; (6) integration of prior biological knowledge into the model building and validation procedures.

Figure 7

Recommended generic workflow for biomarker development using machine learning analysis of omics data. The machine learning analysis of omics data for biomarker discovery and validation should ideally involve dedicated quality control and preprocessing analyses, a dimension reduction using unsupervised feature selection (eg, a variance filtering) or data transformation approaches (eg, using a principal components analysis), a cross-validation on the discovery cohort, and an external validation on a distinct validation cohort.
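
The core of the workflow in figure 7 can be sketched in a few lines. The example below is a simplified illustration under several assumptions: quality control and preprocessing are omitted, a variance filter stands in for the dimension reduction step, a nearest-centroid classifier stands in for the ML model, and both cohorts are simulated. All function names and parameter values are hypothetical.

```python
import random
from statistics import mean, pvariance

def variance_filter(X, keep):
    """Indices of the `keep` features with the highest variance (unsupervised)."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: variances[j], reverse=True)[:keep]

def fit_centroids(X, y, features):
    """Per-class mean profile over the selected features."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    return centroids

def predict(centroids, features, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((row[j] - cj) ** 2 for j, cj in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(X_train, y_train, X_test, y_test, keep):
    features = variance_filter(X_train, keep)  # uses training samples only
    centroids = fit_centroids(X_train, y_train, features)
    hits = sum(predict(centroids, features, r) == t for r, t in zip(X_test, y_test))
    return hits / len(y_test)

def cross_validate(X, y, k, keep, seed=0):
    """k-fold CV; filtering is repeated in every fold so that held-out
    samples never influence model building."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        scores.append(accuracy([X[i] for i in train], [y[i] for i in train],
                               [X[i] for i in fold], [y[i] for i in fold], keep))
    return mean(scores)

def simulate(n, seed):
    """Simulated cohort: feature 0 carries the signal, features 1-4 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(4.0 * label, 1.0)] + [rng.gauss(0.0, 0.3) for _ in range(4)])
        y.append(label)
    return X, y

X_disc, y_disc = simulate(60, seed=1)  # discovery cohort
X_val, y_val = simulate(40, seed=2)    # distinct validation cohort
cv_acc = cross_validate(X_disc, y_disc, k=5, keep=2)
ext_acc = accuracy(X_disc, y_disc, X_val, y_val, keep=2)
```

Keeping every data-dependent step inside the cross-validation loop matters most for supervised feature selection, where filtering on the full data set before CV would leak outcome information into the performance estimate.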

Strengths and limitations

The majority of methodological recommendations derived from the study relate to the early planning and study design for biomarker discovery projects, involving considerations associated with the choice of the study group, sampling and blocking design, the measurement technology, and the input and output variables.16 17 These recommendations are therefore mainly applicable to prospective studies. For retrospective biomarker investigations of already collected data, the suggestions derived from the review are limited to guidance on improving analysis workflows, for example, for filtering and evaluation analyses, the integration of prior knowledge from multiomics data and public annotation databases, and the choice of robust and interpretable modelling approaches for the generation of biologically plausible and reproducible prediction models. While the focus of the review on studies that have already led to validated biomarker models and that fulfil minimum requirements for sample size and statistical model assessment helps to ensure the quality of the selected articles, no further quality evaluation was performed. The reader should also note the generic limitations of ML methods, which can affect all biomarker studies. These include the need for (1) a representative coverage of the relevant outcomes in the training and validation groups; (2) a sufficiently comprehensive and sensitive coverage of informative predictor variables for the outcomes of interest, which may not be achievable for omics data from tissues and body fluids with limited disease relevance or measurement sensitivity; and (3) sufficient data quality with respect to systematic biases and noise. Moreover, for multiomics biomarker analyses, in addition to adequate pre-processing and ML approaches, suitable strategies and methods for the integration of diverse omics data are also needed. These multiomics data integration strategies were not within the scope of the present review, but have been reviewed in previous publications.102–104 Finally, more recent methodological developments in the ML and CV analysis of omics data, such as meta-learning105 and bolstered CV,106 have only limited coverage among the articles that passed the eligibility criteria, and will therefore require further dedicated study in the future.

Discussing important differences in results

Previous reviews of ML approaches using omics data for patient stratification have focused on domain-specific analyses for specific types of diseases, or specific types of ML methodologies.107–115 By contrast, this scoping review focuses on disease-agnostic workflows with generic applicability across complex human disorders involving multifactorial molecular alterations. The coverage of statistical and ML approaches for stratification does not aim to provide a detailed discussion of specific algorithms, statistical methods or scoring metrics, but rather at identifying key determinants of success for generic analysis and validation workflows in biomedical stratification studies. Therefore, the results describe general workflow characteristics that distinguish omics biomarker studies with clinical translation from other studies, and cover associated disease-agnostic recommendations for future studies, whereas method recommendations specific to particular disease types or ML analysis types are covered elsewhere in domain-specific reviews.107–115

Meaning of the study: implications for clinicians and policy-makers

The previous clinical translation successes in omics-based biomarker development reviewed in this study, which have mostly been achieved in the field of oncology, highlight the potential for developing similar biomarker signatures for further disease indications. In contrast to conventional statistical biomarker discovery approaches, which focus on identifying single-molecule markers, systems-level analysis of omics data using multivariate ML approaches can identify multifactorial signatures which are robust against noise in individual gene or protein measurements, and more biologically insightful by reflecting disease-associated cellular process alterations in a more comprehensive fashion.

This scoping review has identified common characteristics of omics studies which have led to clinically validated diagnostic and prognostic tests. Thus, the conclusions drawn on recommended practices for sample size selection, biological data filtering and ML, and the implementation of adequate validation schemes may help to guide clinical researchers on study design choices and the selection of analysis methodologies. Additionally, the scoping review results can help to raise awareness of common pitfalls, such as issues associated with batch effects, biases, confounding factors, lack of statistical power and multiple hypothesis testing, and thus contribute to preventing these failure causes in biomarker development. For policy-makers and funding bodies, findings on the distinctive characteristics of studies with successful clinical biomarker translation, for example, concerning the specific requirements for robust CV and external result validation methods, may provide relevant information for the design of public and private funding schemes for biomedical research. Risks in funded research projects may be addressed upfront through appropriate guidelines and regulations for the study design and validation (eg, recommendations on power calculations and specific validation and documentation requirements). Finally, the scoping review results can guide clinicians involved in biomarker discovery on how to make better use of available public knowledge and data sources, for example, cellular pathway and molecular interaction databases, that may allow them to exploit prior knowledge effectively, and create more robust and interpretable biomarker models.

Unanswered questions and future research

Since the recommendations and guidelines identified from the reviewed articles are mostly derived from established biomarker discovery and validation approaches, new methodologies and upcoming trends could only be covered to a limited extent and may lead to changed recommendations in the future. In particular, some of the more recently introduced ML concepts (eg, transfer learning, distance metric learning, semisupervised learning, structured ML, meta learning, multiview learning and generative models), data processing techniques (eg, new dimension reduction approaches, outlier removal methods, data augmentation techniques) and model validation methods (eg, bootstrapping or bolstered CV, uncertainty quantification) are still underrepresented among the eligible studies reviewed, and may provide suitable topics for follow-up research.

Overall, while the currently available literature on validated stratification biomarkers already provides ample information on common pitfalls and established practices, the development of widely accepted standard guidelines on methodologies for omics biomarker discovery will require further knowledge exchange and deliberation among stakeholders in the field. In particular, integration of domain-specific expertise in discussions involving clinicians, experimental and data scientists, and regulatory and legal experts is required as a follow-up effort to derive comprehensive methodological guidelines for future biomarker development.

Data availability statement

The study protocol was published on the online platform Zenodo.19 Copies of searches and data extraction sheets will be made publicly available on Zenodo as part of the database collection for all scoping reviews conducted in the PERMIT project.

Ethics statements

Patient consent for publication

Ethics approval

This study was based entirely on a scoping review of relevant published literature and did not require an ethics approval.


The authors thank Vanna Pistotti for her assistance with the development and execution of the search strategy.




  • Correction notice This article has been corrected since it was first published. The author byline section has been updated.

  • Collaborators PERMIT group: Antonio L. Andreu, Florence Bietrix, Florie Brion Bouvier, Montserrat Carmona Rodriguez, Maria del Mar Polo-de Santos, Maddalena Fratelli, Rainer Girgenrath, Alexander Grundmann, Josep Maria Haro, Frank Hulstaert, Iñaki Imaz-Iglesia, Setefilla Luengo Matos, Emmet McCormack, Albert Sanchez Niubo, Emanuela Oldoni, Raphael Porcher, Vibeke Fosse, Luis M. Sánchez-Gómez, Lorena San Miguel, Cecilia Superchi, Teresa Torres, Anna Monistrol Mula.

  • Contributors Study conception and design: EG and AR. Methodology: CG and RB. Data collection and analysis: EG and AR. Original draft preparation: EG. Review and editing: AR, EG, PG, CG, JD and RB. Project supervision: PG. Funding acquisition: JD. Responsible for the overall content as guarantor: EG. All authors have read and revised the manuscript and approved the final version. The members of the PERMIT group were involved in the preparation or revision of the joint protocol of the four scoping reviews of the PERMIT series, attended the joint workshop (consultation exercise) and are coauthors of the other scoping reviews of the PERMIT series.

  • Competing interests None declared

  • Provenance and peer review Not commissioned; externally peer reviewed.
