Key Points

Capturing disease activity in multiple sclerosis (MS) trials is a challenge and traditional outcome measures all have clear limitations.

Newer measures are being developed and increasingly used in trials.

Multidimensional outcome measures are promising because they have the potential to capture the full extent of disease activity by assessing various functional domains relevant for MS.

1 Background

Multiple sclerosis (MS) has a female predominance and typically develops at a young age, with a peak incidence between 20 and 40 years [1]. Clinically, it is characterized by a large variability of symptoms arising from focal inflammation of the central nervous system that may occur at various points in time. Symptoms generally last for several days to weeks, but occasionally persist for many months, with subsequent full or partial recovery. These periods are referred to as relapses. Radiologically, MS is characterized by typical white matter lesions that are best visualized with magnetic resonance imaging (MRI). The occurrence of clinical relapses or new white matter lesions on MRI is used to estimate disease activity.

Demonstrating dissemination in time and space, clinical or radiological, is the core feature of the diagnostic criteria [2].

The occurrence of relapses is the dominant clinical picture in the vast majority of patients during the earlier disease stages and is defined as relapsing-remitting MS (RRMS). If a patient has experienced only a single episode with clinical symptoms, it is referred to as a clinically isolated syndrome (CIS). Relapses eventually subside and the disease course often evolves to a slow worsening of symptoms, leading to disability accrual (i.e. disease progression). When disease progression occurs independently of relapses, this is referred to as secondary-progressive MS (SPMS). Approximately 15% of patients have slowly progressive disease from onset without evident relapses and are categorized as primary-progressive MS (PPMS).

The first effective immunomodulatory treatments were the injectables interferon-β and glatiramer acetate, which were introduced in the 1990s [3]. After a decade, the more potent natalizumab (in 2004) and the first oral drug fingolimod (in 2010) were introduced. More recently approved treatments include teriflunomide, dimethyl fumarate, alemtuzumab and daclizumab. Ocrelizumab and cladribine are expected to be approved in the near future. In the phase III trials of these treatments, the outcome measures used to evaluate efficacy were relapse rate, disability worsening and MRI (formation of new T2 hyperintense lesions [T2HL] or gadolinium-enhancing T1 lesions [GdT1L]). These measures have been generally accepted as measures of (short-term) treatment effects.

Clearly, treatment options in MS are rapidly expanding and are applied in patients with different clinical phenotypes. It is therefore important to have clear, comprehensive and universally accepted outcome measures. For this purpose, an outcome measure has to be valid, reliable and responsive. In practical terms this means it must measure what it intends to measure, it should be free of measurement errors and able to detect true change of performance (due to disease activity or progression) [4]. Furthermore, it needs to capture clinically relevant changes and ideally has predictive value.

Unfortunately, standardized definitions of outcome measures in MS research are lacking, for which there are several explanations. First, the clinical disease expression and course are highly variable, which hampers defining a uniform concept of disability in MS [5,6,7]. There is wide variation between patients concerning relapse frequency (including seasonal variation [8]) and accrual of (relapse-related) disability. Also, patients may present with virtually all neurological symptoms that exhibit an age-dependent distribution (Table 1) [7]. Moreover, the extent to which symptoms contribute to overall disability is variable. This may be more dependent on the location of the lesion than on the size or activity. For example, a severe persisting hemiparesis may have a greater impact on disability than a mild sensory deficit, while both may result from pathologically comparable lesions. In fact, lesions may occur subclinically without causing disability worsening [9]. Another difficulty is that disability often accumulates slowly. Consequently, long-term follow-up is needed to assess treatment effect, which makes trials time-consuming and expensive. Lastly, disability is influenced by confounding factors that may not be directly related to disease activity (e.g. fatigue, mood disturbances, deconditioning, spasticity and side effects of medication) [10].

Table 1 Distribution of patients (%) by presenting clinical symptoms and age of onset [7]

With all these difficulties in mind, we aim to provide a comprehensive, non-systematic overview of clinical and paraclinical outcome measures that are used in clinical research of MS (summarized in Table 2). We elaborate on traditional and newer measures such as brain atrophy, optical coherence tomography (OCT), biomarkers in body fluids and the concept of ‘no evidence of disease activity’ (NEDA). We highlight the most important advantages, limitations and caveats of these measures.

Table 2 Primary, secondary and exploratory outcome measures in phase III trials for MS

2 Clinical Outcome Measures

Outcome measures can be generic or disease-specific, physician- or patient-based, direct or indirect, and may cover all or specific aspects of MS. Various clinical outcome measures are available, assessing different disease characteristics. Which characteristics are important largely depends on the aim of the study. Here, we first describe the traditional measures Expanded Disability Status Scale (EDSS) and relapses. Subsequently, the more recently developed Multiple Sclerosis Functional Composite (MSFC) will be discussed. Finally, we elaborate on patient-reported outcome measures (PROMs) as these patient-based measures are increasingly being used in MS trials.

2.1 The Expanded Disability Status Scale

The EDSS intends to capture disability of MS patients based on neurological examination by describing symptoms and signs in eight functional systems (FS). Furthermore, it encompasses ambulatory function and the ability to carry out activities of daily living (ADL). An overall score can be given on an ordinal scale ranging from 0 (normal neurological examination) to 10 (death due to MS). Scores from 0 to 4.0 are determined by FS scores, which means that in this range the EDSS is essentially a measure of impairment. Scores of 4.0 and higher mainly address disability. Ambulatory function and the use of walking aids heavily determine the range of 4.0–7.0, and scores between 7.0 and 9.5 are largely determined by the ability to carry out ADL. A schematic representation of the EDSS is given in Fig. 1.

Fig. 1

Schematic representation of Expanded Disability Status Scale (EDSS) depicting the factors that determine overall score; the graph shows the distribution of patients over the EDSS [7]. MS multiple sclerosis

In clinical trials of MS, the EDSS is the most widely used outcome measure to determine disability worsening and define relapse-related change in neurological function. Furthermore, it is used as an inclusion criterion and to characterize study populations. The value of the EDSS as a surrogate outcome measure for future disability is limited [11,12,13,14,15].

2.1.1 Limitations and Caveats

Despite general acceptance of the EDSS, there are many limitations and caveats (summarized in Table 3) [16]. First of all, the EDSS has high intra- and inter-rater variability [10, 11, 17,18,19]. This can be explained by the subjective nature of the neurological examination itself on which the EDSS is largely based, particularly in the lower EDSS range. Also, complex and ambiguous scoring rules for the FS probably explain some of the variability.

Table 3 Limitations, caveats and improvements for clinical outcome measures

Non-linearity of the EDSS is another limitation (visualized in Fig. 1). Patients spend the shortest time at the middle scores, which results in a bimodal distribution with peaks at 1.0–3.0 and 6.0–7.0 [7, 20]. This means that the rate of progression as assessed by the EDSS varies depending on the baseline score. Furthermore, responsiveness of the EDSS is limited [16, 21]. Scores higher than 4.0 are less influenced by changes in FS scores. For example, development of a paresis in a patient with an EDSS of 6.0 will not result in a higher EDSS. By contrast, the same event would have changed the EDSS in a patient with a baseline EDSS of 4.0.

The non-linearity and limited responsiveness should both be accounted for when interpreting changes over time [22]. Nevertheless, EDSS change is often presented without accounting for the baseline score. As a result, statistically significant change may erroneously be presented as clinically relevant and vice versa. An increasingly used clinically meaningful change is a change of 1.0 or more if EDSS at baseline was 0 to 5.5, and 0.5 or more for higher baseline EDSS scores. This is more driven by reproducibility data than by clinical relevance data.
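The baseline-dependent threshold described above can be written as a small rule. The following is a minimal sketch only: the function name is hypothetical, it covers worsening (an increase in score) rather than improvement, and it is not a validated scoring tool.

```python
def is_meaningful_edss_worsening(baseline: float, follow_up: float) -> bool:
    """Apply the baseline-dependent threshold described in the text.

    Baseline EDSS 0 to 5.5: an increase of 1.0 or more counts as worsening.
    Baseline EDSS above 5.5: an increase of 0.5 or more counts as worsening.
    """
    for score in (baseline, follow_up):
        if not 0.0 <= score <= 10.0:
            raise ValueError("EDSS scores range from 0 to 10")
    required_increase = 1.0 if baseline <= 5.5 else 0.5
    return (follow_up - baseline) >= required_increase


# The same half-point step is judged differently depending on baseline.
print(is_meaningful_edss_worsening(2.0, 2.5))  # below the 1.0 threshold
print(is_meaningful_edss_worsening(6.0, 6.5))  # meets the 0.5 threshold
```

Making the baseline an explicit argument keeps the dependence on the starting score visible, which is the point the paragraph above makes about interpreting EDSS change.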

Because the EDSS is an ordinal scale, non-parametric statistics should be used in statistical analysis. This implies that significant differences between groups can be calculated, but the magnitude of differences cannot. In line with this, results should not be presented with means and standard deviation, but with median values and interquartile ranges. Also, a caveat of numeric values is that they might give the false impression of being precise.
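As an illustration of the reporting recommendation, the sketch below summarizes a set of EDSS scores with the median and interquartile range instead of the mean and standard deviation. The scores are invented for the example, and the choice of the "inclusive" quantile method is an assumption; other quantile conventions give slightly different quartiles.

```python
from statistics import median, quantiles

# Hypothetical EDSS scores for one study arm (illustrative values only).
edss_scores = [1.0, 1.5, 2.0, 2.0, 3.0, 3.5, 4.0, 6.0, 6.5]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(edss_scores, n=4, method="inclusive")

print(f"median EDSS = {median(edss_scores)}, IQR = {q1} to {q3}")
```

For comparisons between groups, a rank-based test such as the Mann-Whitney U test would follow the same logic; it is not shown here because it requires a statistics package beyond the standard library.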

Another limitation is that clinical phenotypes are unevenly distributed across the EDSS. Because ambulatory dysfunction is one of the main characteristics in patients with progressive disease (SPMS and PPMS), these patients represent a larger proportion in the range of 4.0–7.5 [23, 24].

Lastly, several domains are not (sufficiently) assessed. Examples are cognitive function, mood, energy level and quality of life. Symptoms in these domains are frequently observed in MS patients and they may influence FS scores, ambulation and ADL function.

2.1.2 Suggested Improvements

During the International Conference on Disability Outcomes in MS (ICDOMS) that was held in 2011, several refinements for the EDSS were suggested to improve performance [25]. Firstly, a standardized script for questioning patients (which is necessary for some FS scores) might improve reliability and decrease the risk of unblinding in clinical trials (an example of the Neurostatus form may be found on http://www.neurostatus.net/). Secondly, simplification of scoring rules might reduce intra- and inter-rater variability. Thirdly, long-term disability worsening should be assessed with confirmation of EDSS worsening at 6 rather than 3 months. The main reason for this is that relapses may improve beyond 3 months, and thus EDSS worsening may be temporary [26]. Fourthly, streamlining of the EDSS might be achieved by finding the components of FS that contribute most to confirmed worsening of disability and omitting the other less informative components. Lastly, modification of the EDSS to improve linearity of measurement will facilitate statistical analysis and clinical understanding.

Whatever its limitations, the EDSS will probably continue to be the main disability measure for the near future because of the vast experience with it and the possibility of making historical comparisons. Until we have better alternatives, clinical assessment can be improved by using the EDSS in conjunction with other measures.

2.2 Relapses

The other traditional outcome measure is assessment of relapses. By consensus, a relapse has been defined as new or worsening neurological symptoms that are objectified on neurological examination in the absence of fever and last for more than 24 h, and have been preceded by a period of clinical stability of at least 30 days, with no other explanation than MS [27, 28].
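The consensus definition combines several conditions that must all hold. A minimal sketch of that logic is given below; the class and field names are hypothetical, and in practice each condition requires clinical judgement that a boolean flag cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One episode of new or worsening neurological symptoms (hypothetical record)."""
    duration_hours: float
    stable_days_before_onset: float
    objectified_on_examination: bool
    fever_present: bool
    other_explanation_than_ms: bool


def meets_relapse_definition(episode: Episode) -> bool:
    """Check the conditions of the consensus definition quoted in the text."""
    return (
        episode.objectified_on_examination
        and not episode.fever_present
        and episode.duration_hours > 24
        and episode.stable_days_before_onset >= 30
        and not episode.other_explanation_than_ms
    )


episode = Episode(
    duration_hours=72,
    stable_days_before_onset=120,
    objectified_on_examination=True,
    fever_present=False,
    other_explanation_than_ms=False,
)
print(meets_relapse_definition(episode))
```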

The relationship between number of relapses and disability worsening is not completely clear, although conclusions may be drawn from natural history studies. Various of these studies showed that relapses early in the course of MS were associated with long-term disability and increased risk of conversion to SPMS, which probably relates to faster disability worsening [29,30,31,32]. However, superimposed relapses in the progressive phase did not lead to faster disability worsening [33].

Treatment effects on relapses are confined to the change in annualized relapse rate or time to second relapse (i.e. conversion to clinically definite MS) [34]. Treatment effect on relapses gives a fair reflection of short-term efficacy.
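For orientation, a crude annualized relapse rate can be computed as the total number of relapses divided by the total follow-up time in years. The sketch below assumes this pooled definition and uses invented numbers; trials commonly estimate the rate with regression models instead, so this is an illustration rather than a trial analysis. The function names are hypothetical.

```python
def annualized_relapse_rate(relapse_counts, follow_up_days):
    """Crude pooled rate: total relapses divided by total person-years."""
    if len(relapse_counts) != len(follow_up_days):
        raise ValueError("one follow-up duration is needed per patient")
    person_years = sum(follow_up_days) / 365.25
    if person_years <= 0:
        raise ValueError("total follow-up must be positive")
    return sum(relapse_counts) / person_years


def relative_reduction(rate_treated, rate_control):
    """Relative reduction of the treated rate compared with the control rate."""
    return 1 - rate_treated / rate_control


# Three hypothetical patients followed for 1, 1 and 2 years.
arr = annualized_relapse_rate([1, 0, 2], [365.25, 365.25, 730.5])
print(arr)
```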

2.2.1 Limitations and Caveats

There are several caveats when using relapses as an outcome measure (summarized in Table 3). First of all, identification of a relapse is subjective. Ensuring perfect blinding for treatment is therefore essential. To limit subjectivity, a second assessment can be performed to objectify the relapse. The problem with this approach is that symptoms or signs may already have recovered, and recall bias of the patient and observer bias from the examiner may influence the second assessment [35].

Another caveat is that identification of a relapse largely depends on a patient reporting new symptoms. When a patient only reports new symptoms on scheduled visits and not spontaneously, the established relapse rate will be lower than in reality. In fact, increasing the number of visits in a trial period may increase the relapse rate [36].

An interesting phenomenon is that relapse rate is often remarkably high prior to inclusion into trials. Various explanations may be given for this [37, 38]. First of all, relapses in the preceding period of a trial are usually determined retrospectively and patients may over-report the exact number to qualify for inclusion. Secondly, the inclusion criterion of relapse rate is often high, meaning that only patients with very active disease are included. As a consequence, it can be expected that the relapse rate of these patients will decrease towards a disease average during the trial (i.e. regression to the mean). Thirdly, patients participating in a trial may do better merely because of a placebo effect or better comprehensive care during the trial. Lastly, during the natural course of MS the relapse rate will eventually decrease, independent of treatment [39]. These factors may obscure the interpretation of absolute relapse rate reduction in treatment trials.
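Regression to the mean can be illustrated with a small simulation in which no treatment is given at all. The sketch below assumes that each patient has a fixed underlying relapse rate, that yearly relapse counts follow a Poisson distribution, and that only patients with at least two relapses in the pre-trial year are enrolled. All parameter values are arbitrary assumptions chosen for illustration.

```python
import math
import random


def poisson(rng, rate):
    """Draw one Poisson-distributed count (Knuth's multiplication method)."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate(n_patients=20000, min_relapses=2, seed=1):
    """Return mean relapse counts (pre-trial year, trial year) of enrolled patients."""
    rng = random.Random(seed)
    pre_trial, on_trial = [], []
    for _ in range(n_patients):
        # Patient-specific true yearly rate; gamma distribution with mean 0.8.
        true_rate = rng.gammavariate(2.0, 0.4)
        relapses_before = poisson(rng, true_rate)
        if relapses_before >= min_relapses:  # inclusion criterion
            pre_trial.append(relapses_before)
            # Same underlying rate during the trial: no treatment effect.
            on_trial.append(poisson(rng, true_rate))
    return sum(pre_trial) / len(pre_trial), sum(on_trial) / len(on_trial)


pre_mean, trial_mean = simulate()
print(f"pre-trial: {pre_mean:.2f} relapses/year, on trial: {trial_mean:.2f}")
```

Because enrolment selects patients who happened to have many relapses in the preceding year, the average count falls during the trial even though nothing about the simulated disease has changed.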

2.3 The Multiple Sclerosis Functional Composite

Because of the limitations of the EDSS and assessment of relapses, the MSFC was developed to improve clinical assessment [40, 41]. It was introduced in the early 1990s, a time when the first effective treatments were introduced. In contrast with the EDSS, the MSFC covers three functional domains: ambulatory, hand and cognitive function (a schematic summary is given in Fig. 2). The results of the tests that assess these domains are depicted in an interval scale (seconds or number of correct responses) and can be converted to a Z score that is based on values of a reference population [42]. An overall score can be calculated by averaging the Z score of the subtests.
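The conversion to Z scores and the averaging step can be sketched as follows. The reference means and standard deviations are invented placeholders, not values from any published reference population, and the sketch simply flips the sign for timed tests so that a higher Z score always means better performance; published scoring conventions may transform timed results differently.

```python
from statistics import mean

# Hypothetical reference values as (mean, standard deviation) per subtest.
REFERENCE = {
    "t25w_seconds": (6.0, 2.0),
    "nhpt_seconds": (22.0, 5.0),
    "pasat_correct": (45.0, 12.0),
}

# For timed tests a larger raw value means worse performance.
HIGHER_IS_WORSE = {"t25w_seconds", "nhpt_seconds"}


def z_score(test, value):
    """Standardize a raw result against the reference population."""
    ref_mean, ref_sd = REFERENCE[test]
    z = (value - ref_mean) / ref_sd
    return -z if test in HIGHER_IS_WORSE else z


def msfc_composite(results):
    """Average the subtest Z scores into one overall score."""
    return mean(z_score(test, value) for test, value in results.items())


patient = {"t25w_seconds": 8.0, "nhpt_seconds": 27.0, "pasat_correct": 33.0}
print(msfc_composite(patient))
```

The sketch also makes visible why the choice of reference population matters: changing the entries in `REFERENCE` changes every patient's Z score without any change in the raw test results.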

Fig. 2

Schematic representation of the Multiple Sclerosis Functional Composite (MSFC) with candidate components

The MSFC has been extensively evaluated. The overall score of the MSFC correlated strongly with the EDSS [43], and subtest scores correlated moderately [40]. Also, change of MSFC correlated with EDSS change and relapse rate [40, 44, 45]. Furthermore, it was predictive of conversion from RRMS to SPMS [44]. Concerning the relation with MRI abnormalities, MSFC correlated with white matter lesion load and various atrophy measures [46,47,48]. Lastly, correlations with several PROMs [43, 49,50,51], employment status [52] and driving performance [53] were found.

2.3.1 The Original Components

Ambulatory function is tested with the timed 25-foot walk test (T25W, explained in Table 4). The T25W is a reliable test for patients with more severe gait impairment, because it primarily assesses walking speed. Assessing walking speed seems clinically relevant, because it relates to the capacity to perform outdoor activities important in daily life [54]. For patients with mild gait impairment, the T25W may not be sensitive enough to detect abnormalities and because of that has a floor/ceiling effect [55]. For these patients, it may be more appropriate to assess walking endurance with longer walking distances; for example, with a 6-minute walking test [56].

Table 4 Description of components of the Multiple Sclerosis Functional Composite (MSFC)

Hand function is tested with the nine-hole peg test (9HPT, explained in Table 4). A change of 9HPT correlated with long-term disability [57].

The paced auditory serial addition task (PASAT, explained in Table 4) was originally included to cover the cognitive domain [58]. It measures processing speed and working memory, both of which are frequently affected functions in MS patients [59]. The test has moderate reliability and sensitivity for detection of cognitive impairment, and has limited responsiveness to change [60]. Furthermore, it requires a certain mathematical ability and has a clear ceiling effect [49, 61]. Finally, it is often disliked by patients because the time limit induces stress.

2.3.2 Candidate Components

A candidate cognitive test that may replace the criticized PASAT is the symbol digit modalities test (SDMT, explained in Table 4) [62, 63]. It measures information processing speed. The advantages of the SDMT are that it is easily administered, better tolerated by patients (probably because there is no time pressure) [64] and more robust and reliable than the PASAT [65, 66]. Moreover, the SDMT correlated more strongly with white matter abnormalities than PASAT [67, 68]. It also correlated with worsening of cognitive impairment [69, 70] and MRI abnormalities (atrophy measures in particular) [71, 72]. A limitation is that a patient has to have an intact visual system, which may be impaired in MS patients. Although there is a ceiling effect, it is less pronounced than for the PASAT. All points considered, the SDMT is probably a good replacement for the PASAT.

When the MSFC was developed, no data on suitable tests to assess visual function were available. In the past decade, various visual outcome measures for MS research have been studied [73]. Of these, the low-contrast letter acuity test (LCLA, explained in Table 4) may be a good candidate to add to the MSFC [74]. Results correlated with clinical phenotypes, MRI abnormalities and PROMs for visual impairment and quality of life (which supports clinical relevance) [75, 76]. Moreover, some clinical trials showed treatment effect on the LCLA in the active group compared with placebo [77].

2.3.3 Limitations and Caveats

There are several limitations and caveats of the MSFC (summarized in Table 3). A frequently raised objection to the MSFC is that the overall score lacks a clear dimension, which hinders interpretation and makes it difficult for users to become familiar with the score. In other words, it is difficult to form a ‘mental picture’ of it [78]. This difficulty may be addressed by keeping the elements of the MSFC score separated instead of combining them into a single score. Nonetheless, comparison of subtest results between studies remains impossible due to the Z scores that obscure the meaning of crude scores.

Another problem is that results from the reference group strongly influence the Z scores of patients [79]. With that, assessing changes in time is problematic, because the overall score is influenced by variability between time points of both the reference and patient group. Consequently, it is impossible to determine if change is a result of statistical variance or true progression of disability [38].

A potential solution to some of the statistical caveats of Z scores might be to determine the minimal clinically relevant change [21, 80]. This means that change should be confirmed on a subsequent time point, preferably at 6 months (because of possible disability improvement after a relapse). This approach has been tested in a clinical trial dataset [45]. Sensitivity of worsening was found to be similar between MSFC and EDSS, and it correlated with other clinical and MRI outcome measures. However, the downside of this approach is that it will hamper sensitivity to change, which is of particular importance in patients with severe disability.

Despite its disadvantages, the MSFC is an appealing alternative for the EDSS. It can be performed within 20 minutes, covers three domains, has good intra- and inter-rater reliability and it results in a score on a continuous scale. The MSFC has been used as the primary outcome in a treatment trial in SPMS [49]. While MSFC progression was slowed, treatment effects were not observed with the EDSS. If the components are applied in a sensible way, the MSFC may be used as the primary endpoint in future clinical trials.

2.4 Patient-Reported Outcome Measures

A PROM is defined as “any report of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or anyone else” [81].

A PROM may provide valuable insight into the patient perspective of a treatment or matter of interest. For example, treatment success for a patient might be more influenced by adverse events than a physician perceives or deduces from other outcome measures. Furthermore, it may detect clinically meaningful changes and leave out changes with no clinical relevance. A PROM can assess perceived efficacy, side effects, depression and anxiety, fatigue, mobility, quality of life, ability to carry out ADL, sexual dysfunction and symptoms specific for MS. A list of PROMs that are being used in MS research is presented in Table 5 [82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105].

Table 5 Patient-reported outcome measures that are used in MS research

PROMs that assess the ability to carry out ADL may be of particular value. They are able to demonstrate clinical relevance of MS-specific outcome measures. For example, one study found a correlation between the EDSS and a 42-item ADL scale that was mostly driven by impairment of mobility [106]. Another advantage is that measuring ADL activity allows comparison between studies of MS as well as other diseases. Currently, no MS-specific ADL measures are available. Nevertheless, PROMs that were developed for stroke patients (Rankin scale [107, 108] and Barthel index [109]) were used in some MS trials [110, 111].

There are several limitations of PROMs (summarized in Table 3). Among these are their unblinded nature and potential expectation bias. Also, questionnaires assessing quality of life are prone to being influenced by more than just disability. Other factors that are commonly seen in MS patients contribute as well (e.g. fatigue, depression, anxiety and physical comorbidities) [112]. Also, the individual questions should be weighted appropriately. Summing up all the subscores assumes equal importance, which is generally not the case. Lastly, PROMs are prone to response shift over time [113]. Response shift occurs when a patient answers an item differently from their previous responses due to a change of internal standards, values or conceptualization of the target domain (e.g. quality of life).

Typically, PROMs are fixed in length and all patients have to fill in the complete questionnaire. The number of questions that have to be answered can be reduced with computer adaptive testing [114]. It leads the patient through an iterative process in which the answer to a question determines what question is presented next. For example, if a patient is fully dependent on a wheelchair, a question about climbing stairs is irrelevant. With these methods, patients’ tolerability for a questionnaire may be improved.
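The iterative principle can be sketched with a very small branching questionnaire. The items, their wording and the branching rules below are invented for illustration, and real computer adaptive testing systems typically select items with statistical models rather than a fixed branching table; the sketch only shows how an answer determines the next question.

```python
# Hypothetical item bank: item id -> (question text, next item per answer).
# A next item of None ends the questionnaire.
QUESTIONS = {
    "wheelchair": (
        "Are you fully dependent on a wheelchair?",
        {True: "transfers", False: "stairs"},
    ),
    "stairs": (
        "Can you climb a flight of stairs without help?",
        {True: "walk_500m", False: "transfers"},
    ),
    "walk_500m": ("Can you walk 500 m without rest?", {True: None, False: None}),
    "transfers": (
        "Can you move from bed to chair without help?",
        {True: None, False: None},
    ),
}


def administer(answers, start="wheelchair"):
    """Return the ids of the items actually presented, in order."""
    asked = []
    current = start
    while current is not None:
        asked.append(current)
        _question_text, next_item_by_answer = QUESTIONS[current]
        current = next_item_by_answer[answers[current]]
    return asked


# A wheelchair-dependent patient is never asked about climbing stairs.
print(administer({"wheelchair": True, "transfers": False}))
```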

3 Paraclinical Outcome Measures

Numerous paraclinical outcome measures are available and could be used as an adjunct to clinical measures to obtain information on treatment efficacy. Some are potentially valuable (e.g. cerebrospinal fluid [CSF], visual evoked potentials) while others are less suitable (e.g. brainstem auditory evoked potentials) [115]. Here, we briefly discuss the value of white matter pathology as detected on MRI. Subsequently, we will elaborate on newer outcome measures, such as brain atrophy, persisting black holes (PBH), OCT and biomarkers in body fluids.

3.1 Magnetic Resonance Imaging

3.1.1 White Matter Pathology

MRI is sensitive to detect, characterize and quantify lesions in the white matter. It plays a fundamental role in the McDonald diagnostic criteria for MS to demonstrate dissemination in time and space in addition to clinical signs [2]. Radiological dissemination in space is defined as having at least one lesion in at least two typical (for MS) areas in the central nervous system. Dissemination in time is determined when at least one new lesion is demonstrated on a follow-up MRI, or if one asymptomatic gadolinium-enhancing and one non-enhancing lesion are demonstrated on the initial MRI.
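The two radiological definitions given above can be expressed as simple rules. The sketch below encodes only what is stated in this paragraph; the function names and the area labels used in the example are illustrative assumptions, and the sketch is not a diagnostic tool.

```python
def dissemination_in_space(lesions_per_typical_area):
    """At least one lesion in at least two typical areas.

    lesions_per_typical_area maps an area label to its lesion count.
    """
    areas_with_lesions = sum(
        1 for count in lesions_per_typical_area.values() if count >= 1
    )
    return areas_with_lesions >= 2


def dissemination_in_time(
    new_lesions_on_follow_up,
    asymptomatic_enhancing_on_initial,
    non_enhancing_on_initial,
):
    """New lesion on follow-up MRI, or enhancing plus non-enhancing lesions on the initial MRI."""
    new_on_follow_up = new_lesions_on_follow_up >= 1
    mixed_on_initial = (
        asymptomatic_enhancing_on_initial >= 1 and non_enhancing_on_initial >= 1
    )
    return new_on_follow_up or mixed_on_initial


# Example with illustrative area labels and lesion counts.
initial_scan = {"periventricular": 3, "infratentorial": 0, "spinal_cord": 1}
print(dissemination_in_space(initial_scan))
print(dissemination_in_time(0, 1, 4))
```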

The MAGNIMS workgroup recently proposed a revision of these criteria allowing even earlier diagnosis with MRI [116]. The value of MRI as a diagnostic tool is principally the high sensitivity to detect (past) disease activity. Formation of new T2HL and GdT1L may occur subclinically and are thus more frequently seen than clinical relapses [9, 117]. The moderate correlation of T2HL load with relapse rate [26, 118] and disability [119, 120] is possibly related to this phenomenon. Nevertheless, white matter pathology has predictive value for the clinical disease course. For example, patients with a CIS and a high T2HL load at baseline had an increased risk of reaching an EDSS of 3.0 [121]. Also, the presence of two or more GdT1L in patients treated with interferon-β predicted EDSS worsening at 15 years [122].

Because of the high sensitivity for detecting disease activity, MRI has been widely accepted as a secondary endpoint in clinical trials. Moreover, demonstrating efficacy on MRI lesions is crucial in the development of immunomodulatory treatments. Treatment effects on MRI could also act as a surrogate endpoint for clinical disease activity. A study supported this by showing that treatment effect on MRI activity explained >80% of the variance of treatment effect on relapse rate [123]. Other studies confirmed this by showing the related MRI effects on relapse rate and accumulation of disability worsening (up to 16 years) [124,125,126].

These classical MRI parameters largely depict (past) neuroinflammation in MS. However, the neurodegenerative aspect of MS is being increasingly studied with MRI. One reason for this is that with the current therapy we are now able to suppress neuroinflammation effectively, but the ultimate goal of therapy is prevention of neuronal tissue loss or, in the long run, to stimulate neuronal repair. Another reason is that neuropathological and MRI techniques have improved our insight into the underlying neurodegenerative processes of MS [127]. Consequently, measures that reflect these processes are more frequently used as secondary outcome measures. The most widely used neurodegenerative MRI measures are atrophy and PBH.

3.1.2 Atrophy

Brain volume loss in MS patients occurs considerably faster than in healthy people: 0.5–1.0% versus 0.1–0.3% brain volume loss per year [128, 129]. Atrophy may be found throughout the disease course, even in the early phases [130]. Remarkably, the atrophy rate of gray matter structures accelerates in patients with SPMS to 14-fold that of healthy persons [131]. Virtually all gray matter structures are affected, although variation exists between clinical phenotypes [132].

Brain volume can be visualized in various ways. The somewhat older measures assess loss of brain volume indirectly by measuring corpus callosum size [133], bicaudate ratio [72] and ventricular volumes [72, 133]. Also, whole brain volume can be measured directly with conventional MRI [72, 128]. Nowadays, segmentation of the brain into white and gray matter compartments or specific gray matter structures is possible and several automated methods reduced processing time [134,135,136].

The relationship between atrophy measures and clinical signs has been extensively investigated. Whole brain and gray matter atrophy correlated strongly with disability and cognitive impairment, both cross-sectionally and longitudinally [132]. These correlations existed throughout the disease course and clinical phenotypes. Atrophy of gray matter structures may even be more closely related to clinical signs than white matter lesions or whole brain atrophy [137]. Atrophy of several structures correlated remarkably strongly with certain clinical symptoms. For example, cerebellar gray matter atrophy correlated strongly with cerebellar symptoms and hand function [138], upper cervical cord area with ambulatory dysfunction [139], and hippocampal atrophy with memory deficits [140]. Thalamic volume showed a remarkably firm correlation with cognitive impairment [141]. Also, various atrophy measures showed predictive value for future disability and cognitive impairment [137, 142,143,144].

Furthermore, spinal cord volumes can be assessed, for which the upper cervical cord area is often used. Several studies showed a correlation between spinal cord volume loss and clinical disability [144,145,146]. It has also been correlated with long-term disability [147].

An extensive summary of clinical trials that used brain atrophy as a secondary endpoint may be found elsewhere [148, 149]. Noteworthy is a recent meta-analysis that showed that 75% of the variance of treatment effect on disability was explained by whole brain atrophy and T2HL [150]. Another meta-analysis found evidence that whole brain atrophy in patients that received immunomodulatory treatment was lower than in the placebo group [151].

Although volumetric measurements are appealing outcome measures, there are some caveats and limitations. Firstly, atrophy accumulates very slowly, which generally means that longer follow-up is needed to detect significant changes. Clearly, this applies particularly to treatment effects on smaller structures, such as thalamic volume. Secondly, the short-term effect of immunosuppression on brain tissue may cause a decrease in brain volume due to resolution of inflammation. This volume loss is not a sign of neurodegeneration, because there is no loss of neuronal tissue. This is often referred to as ‘pseudo-atrophy’. Importantly, this effect may last up to 1 year after initiation of treatment [152, 153]. Thirdly, various physiological variations in the content of the intra- and extra-cellular compartments affect volumetric measurements [154]. Lastly, factors that are not MS-specific (such as dehydration, alcohol use, smoking, genetic variation, comorbidities and age) may influence brain volume [154].

3.1.3 Persisting Black Holes

Another MRI marker for neurodegeneration is formation of PBH. These lesions are often defined as non-enhancing T2HL with persisting signal intensity between that of the gray matter and the CSF on T1-weighted scans [155]. Approximately 30–40% of active T2HL will eventually evolve into PBH within 6–12 months [156]. The underlying neuropathology of PBH is severe and irreversible tissue damage [156]. Accumulation of PBH is associated with accrual of disability [157, 158]. Furthermore, the PBH load correlated with disability worsening over 10 years [159]. Some clinical trials found significant effects of treatment on the formation of PBH [160,161,162,163].

Several more advanced MRI techniques are potentially valuable outcome measures, although they need further research to clarify the exact relevance. Examples are functional MRI for analysis of functional connectivity [164], diffusion tensor imaging to examine brain tissue integrity [165] and magnetization transfer ratio MRI as a marker for brain myelin content [166, 167].

3.2 Optical Coherence Tomography

The retina can be visualized non-invasively, safely and quickly with OCT. This technique uses the reflection of near infra-red light from the retina. Different layers of the retina can be distinguished on high-resolution images. OCT has proven valuable in quantifying pathology in these layers, although the exact pathophysiological processes underlying these findings are largely unclear [168, 169].

Most findings of the research with OCT in MS point to neurodegenerative changes such as axonal loss and neuronal soma shrinkage [170]. Therefore, OCT is a good candidate outcome measure to assess treatment effect on neurodegeneration, which makes it an attractive tool in progressive MS trials. For this purpose, the retinal nerve fibre layer (RNFL) is of particular interest. The thickness of this layer may be decreased following optic neuritis [171, 172], but also decreases more slowly in patients without prior optic neuritis [171, 173]. The latter may indicate ongoing neurodegeneration. Furthermore, RNFL thickness correlated with cerebral atrophy measures [174, 175] and with axonal loss in the anterior visual pathway [176, 177].

Clinically, thinning of the RNFL correlated with worse performance on the LCLA (explained in Table 4) [171, 178], and a reduced visual quality of life [179]. Correlations of RNFL thickness with EDSS were less consistent [180, 181]. In a recent large multicenter study of patients without prior optic neuritis, persons with an RNFL thickness in the lowest tertile at baseline had double the risk of disability worsening in 2 years compared with the other tertiles [182]. The risk further increased with a longer follow-up. The clinical relevance of other OCT measures, such as macular volume [183] and retinal ganglion-cell/inner plexiform layer thickness [184, 185], is less clear.

The advantage of OCT over MRI is that it is technically easier and widely accessible. When using a predefined scanning protocol, it has good reliability [186]. Nevertheless, further research is needed before OCT can be implemented as an outcome measure. In particular, longitudinal data on the various layers are needed.

3.3 Biomarkers in Body Fluids

Both MRI and OCT allow detection of neuroinflammation and neurodegeneration at various time points, but have limited sensitivity to detect ongoing processes. Biomarkers in body fluids, such as CSF and blood, might be more useful for this purpose. Although it is beyond the scope of this review to discuss this topic thoroughly (it was recently reviewed elsewhere [187]), a few biomarkers are worth mentioning.

There are several potentially valuable CSF biomarkers that might give a real-time reflection of ongoing neurodegeneration. A biomarker that reflects axonal injury is neurofilament. This protein is a major component of the axonal cytoskeleton and is released following neuronal damage [188]. Neurofilament levels in CSF are generally raised in MS patients, particularly during an acute relapse [189, 190]. Furthermore, increased levels were associated with worse EDSS [190], faster disability worsening over 15 years [191], gadolinium-enhancing lesion load [192] and atrophy (of the brain and spinal cord) over 15 years [193]. Neurofilament levels were also responsive to treatment with fingolimod [194] and natalizumab [195], and therefore might serve as a biomarker for treatment effect.

Other proteins of the axonal cytoskeleton that can be measured in CSF are actin [196, 197] and tubulin [197, 198]. Other CSF markers that indicate ongoing disease activity are sphingolipids (a component of the myelin sheath) [199], glial fibrillary acidic protein (GFAP) [200], S100B [200] and chitinase 3-like proteins [201].

Compared with CSF, blood is generally less well studied for biomarkers, but clearly has the advantage that it is much easier to obtain. As in CSF, neurofilament in the blood might act as a biomarker for neurodegeneration. Neurofilament levels predicted recovery of spinal cord lesions [202], and higher concentrations were associated with faster conversion to definite MS and more cerebral lesions [203]. Another biomarker that is used to determine bioactivity of interferon-β is myxovirus-resistance protein A (MxA). It also seems to be indicative of recent and future disease activity [204, 205]. Lastly, various small noncoding microRNAs are potentially valuable for predicting disease course and treatment response [187].

The exact value of these biomarkers as outcome measures will have to be determined. If clinically meaningful, they will probably be used in combination with other measures. They may be particularly useful to assess treatment effects in trials with progressive MS, because identification of progression or neurodegenerative changes remains very challenging.

4 No Evidence of Disease Activity

The concept of a ‘disease-activity-free status’ as the ultimate treatment goal has been used in other medical conditions, including cancer and inflammatory diseases such as rheumatoid arthritis. It implies the absence of measurable disease activity. This concept has been translated to NEDA and is used in more recent MS trials as a secondary outcome measure [206, 207]. It is essentially a multidimensional measure that typically covers (confirmed) EDSS progression, relapse rate and formation of MRI lesions (T2HL or GdT1L). However, any parameter related to disease activity may be added.
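The composite nature of NEDA can be made concrete with a minimal sketch. It assumes a hypothetical per-patient record (the class `FollowUp` and all field names are illustrative and not taken from any trial protocol) and treats NEDA as the absence of activity on every included component. Optional extra components, such as brain volume loss, are passed as additional flags.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FollowUp:
    """Hypothetical summary of one patient over one assessment window."""
    relapses: int                     # number of clinical relapses
    confirmed_edss_progression: bool  # confirmed EDSS progression
    new_t2_lesions: int               # new T2 hyperintense lesions (T2HL)
    gd_enhancing_lesions: int         # gadolinium-enhancing T1 lesions (GdT1L)
    # Optional additional activity flags, e.g. {"brain_volume_loss": True}
    extra_activity: Dict[str, bool] = field(default_factory=dict)


def has_neda(fu: FollowUp) -> bool:
    """Return True only if none of the included components shows activity."""
    core_activity = (
        fu.relapses > 0
        or fu.confirmed_edss_progression
        or fu.new_t2_lesions > 0
        or fu.gd_enhancing_lesions > 0
    )
    return not core_activity and not any(fu.extra_activity.values())


# Illustrative patients (invented values, not study data)
stable = FollowUp(0, False, 0, 0)
mri_active = FollowUp(0, False, 2, 0)
atrophy_only = FollowUp(0, False, 0, 0, {"brain_volume_loss": True})

print(has_neda(stable), has_neda(mri_active), has_neda(atrophy_only))
```

Because every added component can only move a patient from NEDA to non-NEDA, a patient who is stable on the three core components may still fail a broader definition, as the third example shows.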

A recent study in a cohort of RRMS patients found that NEDA at 2 years had a positive predictive value for absence of disability progression at 7 years of 78% [207]. Furthermore, the predictive value of NEDA was greater than each of the individual components. Other studies also showed that combinations of clinical and MRI parameters had better predictive value for disability progression than individual measures [125, 150, 208,209,210]. For example, a recent meta-analysis found that treatment effect on T2HL and brain volume combined explained 75% of the variance of disability progression in 2 years, and this was significantly higher than predictive values of the MRI measures individually [150].
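For readers less familiar with the term, the positive predictive value used here is the proportion of patients with NEDA at the early time point who indeed show no disability progression at the later time point. The sketch below illustrates the arithmetic only; the counts are invented for illustration and are not the data of the cited study.

```python
def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """PPV = true positives / (true positives + false positives)."""
    if true_pos < 0 or false_pos < 0:
        raise ValueError("counts must be non-negative")
    total = true_pos + false_pos
    if total == 0:
        raise ValueError("no positive predictions, so PPV is undefined")
    return true_pos / total


# Invented example: 50 patients with NEDA at 2 years, of whom
# 39 show no disability progression at 7 years and 11 do progress.
ppv = positive_predictive_value(true_pos=39, false_pos=11)
print(f"PPV = {ppv:.0%}")
```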

In clinical practice, NEDA-like models are used to identify responders and non-responders to treatment. Examples are the Modified Rio Score [211] and the Canadian Treatment Optimization Recommendation Model [35]. Such tools need to have good long-term predictive power for disability, before a treatment decision can be based on them.

When using NEDA as an outcome measure to assess treatment efficacy, it is important to consider the timing of assessment, because a treatment needs enough time to become effective. This can be illustrated by the finding that 70% of patients had NEDA 2 years after initiating treatment with natalizumab when the baseline assessment was performed after 1 year, compared with 51% when the baseline was at initiation of therapy [212]. For alemtuzumab, timing is different, because the true treatment effect starts after the second infusion cycle, 1 year after the initial course [213]. This issue has implications when determining if NEDA can be a valid outcome measure for disability in the long run.

Although NEDA seems an appealing outcome measure in some ways, it is not yet clear which (functional) domains are important to include and when or how frequently these should be assessed. It should, for example, reflect what is important in daily life for patients. Therefore, including a PROM seems indispensable. Also, markers for neurodegeneration should be included when prevention of tissue loss is considered to be the ultimate treatment goal. For this reason, brain volume is increasingly added to NEDA (referred to as NEDA-4) [214]. However, adding more assessments likely reduces the number of patients fulfilling NEDA, and may set the bar too high, resulting in the rejection of highly active, but not perfect, interventions.

Taken together, NEDA will continue to evolve as evidence accumulates about which outcome measures are valuable. Standardization of timing and functional subdomains is imperative for comparison between studies.

5 Future Perspectives

The number and quality of outcome measures are increasing, and with that the assessment of treatment efficacy will improve over the coming years. Until new measures are validated and generally accepted, the traditional outcome measures of EDSS and relapse rate will remain primary endpoints in clinical trials. However, it is very unlikely that these measures are sufficient to fully assess treatment efficacy. Eventually, measures that more explicitly capture multiple dimensions (e.g. MSFC and NEDA) will probably become the new standard. They are particularly useful to detect infrequent events (e.g. relapses) or small changes (e.g. brain atrophy and disability worsening) under treatment, which is increasingly important with highly effective therapy. The same applies to treatment of progressive disease (SPMS and PPMS), in which small and gradual treatment effects can be expected. Moreover, multidimensional measures might decrease the duration and size of clinical trials. The caveats of multidimensional measures that have to be taken into account are summarized in Table 6 [25].

Table 6 Limitations and caveats of multidimensional measures

In addition to improvement of existing outcome measures, innovative techniques such as electronic devices and mobile device applications are potentially valuable. They allow, for instance, multiple or continuous assessment which might give a more adequate picture of a patient’s ability or disability and the impact of the disease on daily living.

Several electronic devices are under development to assess disability. An example of this is the Assess MS system that uses an infra-red camera to register movements of upper and lower limbs, trunk and ambulation for automatic quantification of these movements. Results from a pilot study in MS are promising and these preliminary results are currently being validated with a new high-resolution camera [215]. Another device that has been developed is the Glove analyzer system, which is able to record data from finger movements to assess hand and arm function [216]. Also, accelerometers are potentially useful tools to measure mobility automatically [217]. Apart from other attractive aspects, electronic devices are free of intra-rater variability.

Mobile device applications are increasingly being used in the medical field and are also potentially useful in assessing outcomes in MS trials. Applications can be easily distributed and are accessible to everyone with a smartphone. They can be used in several ways; for example, for assessing a PROM on a regular basis, up to several times per day. Also, applications may be connected online with investigators to provide real-time access to, or feedback on, a patient’s status. This may decrease the number of visits needed or could help decide whether or not face-to-face contact with a patient is needed. In past years, healthcare ‘hackathons’ (a portmanteau of ‘hacker’ and ‘marathon’) were organized to stimulate development and integration of medical devices and mobile device applications [218, 219]. However, many of these applications need rigorous scientific validation before they may be considered as outcome measures in clinical trials.

6 Conclusions

To conclude, assessing outcome in clinical trials in MS is not straightforward and therefore remains a challenging field. Although much has been achieved in the past decades, ‘old habits die hard’ and traditional measures will probably remain the standard in the near future. Once more advanced measures have proven their value, they still need to earn general acceptance by healthcare providers and especially regulatory agencies. In the end, only multidimensional measures will allow full coverage of disease activity and progression of MS and are thus best suited to assessing treatment efficacy in MS trials.