Framework to construct and interpret latent class trajectory modelling

Objectives Latent class trajectory modelling (LCTM) is a relatively new methodology in epidemiology to describe life-course exposures, which simplifies heterogeneous populations into homogeneous patterns or classes. However, for a given dataset, it is possible to derive scores of different models based on number of classes, model structure and trajectory property. Here, we rationalise a systematic framework to derive a ‘core’ favoured model.
Methods We developed an eight-step framework: step 1: a scoping model; step 2: refining the number of classes; step 3: refining model structure (from fixed-effects through to a flexible random-effect specification); step 4: model adequacy assessment; step 5: graphical presentations; step 6: use of additional discrimination tools (‘degree of separation’; Elsensohn’s envelope of residual plots); step 7: clinical characterisation and plausibility; and step 8: sensitivity analysis. We illustrated these steps using data from the NIH-AARP cohort of repeated determinations of body mass index (BMI) at baseline (mean age: 62.5 years), and BMI derived by weight recall at ages 18, 35 and 50 years.
Results From 288 993 participants, we derived a five-class model for each gender (men: 177 455; women: 111 538). From seven model structures, the favoured model was a proportional random quadratic structure (model F). Favourable properties were also noted for the unrestricted random quadratic structure (model G). However, class proportions varied considerably by model structure: concordance between models F and G was moderate (Cohen κ: men, 0.57; women, 0.65) but poor with other models. Model adequacy assessments, evaluations using discrimination tools, clinical plausibility and sensitivity analyses supported our model selection.
Conclusion We propose a framework to construct and select a ‘core’ LCTM, which will facilitate generalisability of results in future studies.


INTRODUCTION
In many epidemiological studies, a risk factor is measured at a single point in time and related to the subsequent development of disease, under the assumption that a single 'one-off' measure approximates that exposure over a long time. Thus, baseline measurement of body mass index (BMI) is associated with subsequent development of common diseases such as cardiovascular disease (1), diabetes (2) and several cancers (3), and with all-cause mortality (4). This approach is crude, and many investigators seek alternative methods that might better capture long-term risk factor exposure, termed life-course analysis.
There are widely-used examples that capture cumulative exposure, such as pack-years for smoking and lung cancer, but the assumption that incidence rate is proportional to total lifetime dose is questionable (5). Many other life-course models simply extract features for use in standard regression approaches; for example, a weight change over time. A more sophisticated approach, which takes account of within-individual correlations, is mixed-effect modelling, but this is difficult to interpret for public health implementation. An extension of this approach is the use of latent classes, also termed growth mixture models.
Latent class trajectory modelling (LCTM) simplifies heterogeneous populations into more homogeneous clusters or classes. From these, one can potentially include random effects to allow for individual variation within these classes. These models have a long history in the criminology (6) and psychology (7) literatures, and are now increasingly reported in the human epidemiology literature. Of relevance to this paper, LCTM has been used in association studies of repeated BMI measures with all-cause mortality (8), cancer incidence (multiple cancer types (9); gastro-oesophageal (10); prostate (11)) and cancer mortality (11). The LCTM has two general advantages compared with using 'one-off' exposure determinations: first, it better informs aetiological associations by deeply phenotyping certain 'at risk' subpopulations; and second, LCTM offers a public health strategy to identify early divergent adverse trajectories as potential intervention targets.
Some researchers additionally argue that LCTM is well-equipped for future forecasting and new-patient generalisations in prediction models, as it handles data following a different predictable pattern from that learnt by the model (12).
However, LCTM is a complex form of modelling and requires several different structure assumptions (13). Implementation of different specifications within LCTMs might yield different findings. Although firmly acknowledged in the GRoLTS-Checklist: Guidelines for Reporting on Latent Trajectory Studies (14), structure-related assumptions have not been systematically evaluated. Thus, different observed patterns of latent classes between studies might reflect different modelling assumptions rather than true differences between populations. Here, we propose a framework to build and interpret LCTM, using an example of repeatedly determined body mass index (BMI) across adulthood in the National Institutes of Health (NIH)-AARP Diet and Health Study cohort.

Cohort
The NIH-AARP Diet and Health Study is a US cohort of men and women recruited from 1995 (15). A baseline medical and lifestyle questionnaire, including self-reported BMI, was returned by 566,398 participants (aged 50 to 71 years; mean age: 62.5 years). An additional risk factor questionnaire, including recalled weights at different ages, was mailed in 1996 and completed by a sub-cohort of 327,860; 288,993 (177,455 men and 111,538 women) provided weight at all four time points (ages 18, 35 and 50 years, and baseline) and height, and these are the data included in the present analysis. We excluded participants with extreme BMI values (<15 or >70 kg/m²) recorded at any time point. Means and standard deviations for derived recalled BMI distributions are representative of BMI distributions for historical period-equivalent US populations (16).

Latent class trajectory modelling
We developed an eight-step framework (Table 1) modelling BMI as a function of age. Latent classes were used to identify subgroups of participants with distinct trajectories (detailed mathematical equations are given in the supplemental material) (17). We used maximum likelihood approaches to fit the model with the 'hlme' function from the 'lcmm' library (18) in the R software environment (version 3.2.1) (19) and cross-checked results using the 'PROC TRAJ' function in 'SAS traj' (SAS Institute, Inc., Cary, NC) (20) (Table S1).
Step 1: We initially constructed a scoping model, provisionally selecting the plausible number of classes based on available literature; in the context of BMI trajectories, we used K = 5 classes as reported elsewhere (9,11). To determine the initial working model structure, we followed the rationale of Verbeke and Molenberghs (21) and examined the shape of standardised residual plots for each of the five classes in a model with no random effects. If the residual profile could be approximated by a flat line, a straight line or a curve, then a random intercept, slope or quadratic term, respectively, should be considered. Preliminary plots suggested preference for a quadratic random-effects model (Figure S1).
Step 2: We refined the preliminary working model from step 1 to determine the optimal number of classes, testing K = 1 to 7. We built models for both genders, as BMI patterns of lifetime changes differ for men and women (16). The number of classes chosen was based on the lowest Bayesian Information Criterion (BIC).
Step 3: We further refined the model, using the preferred K derived in step 2, testing for the optimal model structure. We tested seven models (detailed in Table S2), ranging from a simple fixed-effects model (model A) through a rudimentary method that allows the residual variances to vary between classes (model B) to a suite of five random-effects models with different variance structures (models C-G). We selected an optimal model structure using the lowest BIC value and referred to the outcome of steps 2 and 3 as the preferred model.
Step 4: We then performed a number of model adequacy assessments. First, we calculated the posterior probability for each participant of being assigned to each trajectory class, and assigned the individual to the trajectory with the highest probability. An average of these maximum posterior probabilities of assignment (APPA) above 70%, in all classes, is regarded as acceptable (6). We further assessed model adequacy using odds of correct classification (OCC), mismatch scores and entropy, $E_K$ (detailed in Table S3). These diagnostic tools assist in model selection (6,22); if these are strongly violated, one might go back to steps 2 and 3 and consider a different model with a higher BIC value.
Step 5: We used three graphical presentation approaches. The conventional approach is to plot mean trajectories with time encompassing each class. Alternatives include the use of mean trajectory plots with 95% predictive intervals for each class, which displays the estimated random variation within each class; or to plot individual level 'spaghetti plots' with time (for example, a random sample of participants), which allows the reader to observe the patterns of changes within classes.
Step 6: We assessed model discrimination, including degrees of separation, $DoS_K$ (23,24), and Elsensohn's envelope of residuals (25). To describe the separation of latent trajectory curves, a multivariate Mahalanobis distance was used. Peugh and Fan (24) argue that it is reasonable to speculate that identification of heterogeneous latent trajectories is facilitated by large statistical separation distance among the subpopulations. Thus, larger values of $DoS_K$ indicate that the mean trajectories are well separated, while $DoS_K$ equal to zero is the special case when all mean trajectories are identical. If the $DoS_K$ value is small, then one might consider a model with fewer classes.
To check structure assumptions in fixed-effects latent class models, Elsensohn et al. (25) suggest plotting the local standard deviations of the residuals with time for each class; boundaries that are not parallel suggest that across-class variability may not be fully accounted for.
Step 7: We assessed for clinical characterisation and plausibility using four approaches: i) assessing the clinical meaningfulness of the trajectory patterns, aiming to include classes with at least 1% capture of the population; ii) assessing the clinical plausibility of the trajectory classes; iii) tabulation of characteristics by latent classes and description of patterns compared with conventional categorisations; and iv) concordance of class characteristics with those for other well-established variables, arguing that low concordance indicates that the new latent classes offer information above and beyond that from the conventional categorisation.
Step 8: We conducted sensitivity analyses, in this example, with individuals with at least two and three BMI values, as LCTMs are flexible enough to deal with different observation times between participants.
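The adequacy measures in step 4 can be illustrated with a minimal Python sketch. It assumes a matrix of posterior probabilities (one row per participant, one column per class); the function names and the toy probabilities are hypothetical and are not taken from the authors' LCTMtools code. As a simplification, the class proportion used in the OCC is approximated by the mean posterior probability, whereas in practice the proportion estimated by the fitted model would be used.

```python
def assign_classes(post):
    """Index of the most probable class for each individual."""
    return [max(range(len(row)), key=row.__getitem__) for row in post]


def appa(post):
    """Average posterior probability of assignment (APPA), per class."""
    assigned = assign_classes(post)
    result = []
    for k in range(len(post[0])):
        probs = [row[k] for row, a in zip(post, assigned) if a == k]
        # A class with no assigned individuals has no defined APPA.
        result.append(sum(probs) / len(probs) if probs else float("nan"))
    return result


def occ(post):
    """Odds of correct classification: [APPA/(1-APPA)] / [pi/(1-pi)]."""
    n = len(post)
    # Assumption: class proportion approximated by the mean posterior.
    pi = [sum(row[k] for row in post) / n for k in range(len(post[0]))]
    return [(a / (1 - a)) / (p / (1 - p)) for a, p in zip(appa(post), pi)]


# Toy posterior probabilities for four individuals and two classes
# (illustrative values only, not study data).
toy_post = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
appa_values = appa(toy_post)
occ_values = occ(toy_post)
acceptable = all(value > 0.70 for value in appa_values)
```

Here `acceptable` applies the 70% APPA threshold cited in step 4 to every class.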

Statistical algorithms
All R and SAS code used to implement these tools is available from the authors and can be downloaded from www.github.com/hlennon/LCTMtools.

Number of classes
From the preliminary working model of a quadratic random-effects model, model F (proportional covariance structure), we derived BICs for up to seven classes: three of the class models failed to converge in men and women. Table 2 reports that the lowest BIC was obtained with five classes in men and women, confirming our initial working model. The proportions by class in men were 68.1%, 25.0%, 3.8%, 2.7% and 0.4%; and in women, were 32.6%, 41.1%, 21.1%, 3.5% and 1.7%. For model G (our second choice model structure), the lowest BICs were noted for five classes in men and women, and without problems of failure to converge (Table S2).
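The BIC-based choice of the number of classes can be sketched as follows. This is a minimal illustration, not the authors' code: the log-likelihoods and parameter counts are invented, the helper names are hypothetical, and the sample size in the penalty term is assumed to be the number of individuals. Models that fail to converge are represented by `None` and skipped.

```python
import math


def bic(log_likelihood, n_params, n_subjects):
    """BIC = -2 * logLik + p * log(N); lower values are preferred."""
    return -2.0 * log_likelihood + n_params * math.log(n_subjects)


def select_by_bic(fits, n_subjects):
    """fits maps K -> (log_likelihood, n_params), or None if not converged."""
    scores = {}
    for k, fit in fits.items():
        if fit is None:
            continue  # skip models that failed to converge
        log_likelihood, n_params = fit
        scores[k] = bic(log_likelihood, n_params, n_subjects)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical fits for illustration only (not the study's results).
example_fits = {
    1: (-5200.0, 7),
    2: (-5050.0, 11),
    3: (-4980.0, 15),
    4: None,  # a model that failed to converge
    5: (-4900.0, 23),
    6: (-4895.0, 27),
}
best_k, bic_by_k = select_by_bic(example_fits, n_subjects=1000)
```

With these invented values the six-class model has a slightly higher log-likelihood than the five-class model, but its extra parameters are penalised enough that five classes gives the lowest BIC.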

Assessment for model structures
With the number of classes now selected as five, we tested the seven model structures, A to G. Table 3 reports that the lowest BIC was for model F in men and women, justifying the selection of model F in the preliminary working phase. The class sizes varied between models, with Class I ranging from 41% to 68% in men and from 32% to 95% in women. There was moderately good concordance (unweighted and weighted) between the unstructured variance model G and model F in men (κ: 0.57) and women (κ: 0.65) (Tables S4 and S5), but poorer concordance between the final preferred models and fixed-effects models in men, indicating that the models are informing different patterns.

Graphical presentation
We plotted the mean trajectories for models A, B, C, D, F and G in men and women (Figure 1), illustrating the increased complexity from model A to G. As alternatives, we plotted separately mean trajectories with 95% predictive intervals for each class, in model F (Figure S2), which displays the estimated random variation within each of the classes with time, noting that variation was greater with the more 'complex' classes (classes IV and V compared with classes I, II and III). Spaghetti plots of individual level data illustrated that the timing and size of BMI increases characterise the classes: for example, sharp increases in BMI in early adulthood in Class III but later in adulthood for Class IV (Figure S3).

Additional tools of suitability of fit
The $DoS_K$ values ranged from 0.10 to 0.36 and 0 to 0.34, in men and women, respectively (Table 3). The covariances were high and in the positive direction, and therefore models with non-parallel mean trajectories lead to higher separation.
We plotted the local standard deviation of the residuals with time and found that these were broadly homogeneous, i.e., there were few parallel boundaries (Figure 2). The local residuals for the rapidly obese groups in both genders are the exceptions to parallel lines, which might reflect comorbidities in this group and smaller numbers.

Clinical assessment
Having established the preferred model, model F with five classes, in both genders, we assigned descriptive labels to each respective class as follows (Table 4): stable normal weight; normal weight to overweight; normal weight to obese; overweight to obese; and rapid early obesity. We noted that the proportion in the rapid early obesity class (Class V) was less than 1% in men. However, overall, the proportion for Class V for men and women combined was nearly 1%, and we retained this class in our preferred model as we judged it to be clinically meaningful, as follows. In men and women, there was a rapid increase in obesity from early to middle adulthood, then apparent severe weight loss. We rationalised that this was clinically plausible, as it could be explained either by intentional (e.g. bariatric surgery) or non-intentional weight loss (e.g. reverse causality from development of disease).

Sensitivity analyses
We tested the preferred model using a larger sample of individuals with at least three measures, and found no material differences between these models, in men and women, and the main model (Figure S4).

Main findings
We propose a standardised framework for model selection in the context of latent class trajectory modelling; within it, we argue for random- rather than fixed-effects models. We show that different model structures result in very different classes and hence phenotypes, and we propose that pre-specified criteria for model selection and reporting of a 'core' model can facilitate generalisability. Compared with 'one-off' covariate categorisation, latent classes offer additional phenotypic information, and offer a disease prevention opportunity through identifying classes of exposure prior to disease diagnosis.

Context of other literature
A previous criticism of LCTM has been that the uncertainty in class membership is not discussed. However, entropy-based measures, which have been developed to quantify this uncertainty (19), can be used with LCTMs. This issue can be further overcome if these tools are made publicly available to non-specialists in an easy-to-use form.
We tested a variety of assumptions and demonstrated that small changes in random-effects structures can have dramatic effects on the shapes of the trajectories derived; therefore, it is vital that analysts conducting an LCTM study are aware of the differences between fixed-effects and random-effects models and are transparent when reporting their model selection procedures. For example, Viristen et al. (27) specified a random slope but did not explain the reason for this choice. Since the publication of the GRoLTS Guidelines (Guidelines on Reporting of Latent Trajectory Studies) earlier this year, we hope this will become common practice for future LCTM studies.

Strengths and weaknesses
The study has strengths. First, the considered and strategic workflow to optimise identification and application of latent classes provides for a more robust and transparent application of these models in epidemiology. Second, the results presented are based on modelling data from a large well-characterised US cohort, therefore allowing the derivation of numerically meaningful subpopulations (i.e. classes) with distinct phenotypes. We have demonstrated that these subpopulations are markedly different from those obtained by considering only one measure at baseline and are more likely linked to what the clinician sees in practice, as the BMI history has been taken into account. Third, we constructed several gender-specific models with differing numbers of classes and a range of distributional assumptions; we extensively explored different model selection and adequacy tools and described extensions to other tools, to supplement the assessment of suitable model fit. Fourth, we have made the code freely and publicly available to aid interpretation of these measures and to allow LCTM analysts to consider the model specification more carefully.
There are several study weaknesses. First, directly capturing records of other common disease risk factors, such as smoking, over time (i.e., in a time-varying manner) is difficult with LCTMs, as we are currently considering trajectories of one risk factor at a time. Second, the two-stage approach requires a methodological approach to ensure that model selection is not unconsciously 'cherry-picked'. Although specialist knowledge of the data aids model building, a clear understanding of the difference between fixed- and random-effects models is required by the analyst. Third, the kappa value is not an optimal criterion for LCTM comparisons. It is typically used for inter-rater agreement where the labels are the same across the classes (e.g., BMI categories), as opposed to this unsupervised learning approach where the classes are defined by the algorithm. To draw a fair comparison between models and to match the class labels across the models, we computed κ for all 120 possible combinations and selected the optimal κ to report. Fourth, while the multiple diagnostic tests we have implemented can be useful, it is likely that some measures favour one model, while other measures favour another. Therefore, it is important that the analyst is aware that model selection of any LCTM must be guided by model interpretation, as well as likelihood-based model-fit criteria (17). Nagin (6) summarises this well by saying "In the end, the objective of the model selection is not the maximisation of some statistic of model fit. Rather it is to summarise the distinctive features of the data in as parsimonious a fashion as possible".
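The label-matching step described for κ can be sketched in Python. With five classes there are 5! = 120 ways to relabel one model's classes, which matches the 120 combinations mentioned above. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses unweighted Cohen's κ, the function names are hypothetical, and the toy assignments use three classes for brevity.

```python
from itertools import permutations


def cohen_kappa(a, b, n_classes):
    """Unweighted Cohen's kappa for two label vectors coded 0..n_classes-1."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(k) / n) * (b.count(k) / n) for k in range(n_classes))
    # Undefined (division by zero) if both vectors use a single identical label.
    return (observed - expected) / (1 - expected)


def best_kappa(a, b, n_classes):
    """Try every relabelling of b and keep the one with the highest kappa."""
    best = None
    for perm in permutations(range(n_classes)):
        relabelled = [perm[y] for y in b]
        kappa = cohen_kappa(a, relabelled, n_classes)
        if best is None or kappa > best[0]:
            best = (kappa, perm)
    return best


# Toy class assignments from two hypothetical models: the same partition
# of six individuals, but with the class labels numbered differently.
model_1 = [0, 0, 1, 1, 2, 2]
model_2 = [1, 1, 2, 2, 0, 0]
optimal_kappa, optimal_perm = best_kappa(model_1, model_2, n_classes=3)
n_relabellings_five_classes = len(list(permutations(range(5))))
```

Without relabelling, the two toy vectors never agree and κ is negative; after the optimal relabelling they agree perfectly, which is why the search over permutations is needed before comparing unsupervised classifications.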

Clinical implications and future research
Variations of model A (fixed-effects) have been commonly used in the clinical literature (8-11), which assume no within-class variability when deriving latent classes. Interpretation in this setting is that variation from the mean trajectory is random, i.e., the correlation between measurements for the same individual is explained by latent class membership. In the context of any repeated measures in the general population, this assumption might not be valid (13). Saunders (28) argued in support of full random-effects models (i.e. models F and G), calling upon Moffitt's theory from criminology, which recognises that "there are distinct developmental clusters of trajectories of anti-social behaviour that are the result of divergent aetiologies"; in other words, it is unlikely that latent classes start from a similar baseline.
A larger number of classes can become difficult to interpret, and the use of five BMI trajectory classes is supported by other studies in the literature (8-11). Determining the number of latent classes seems to be the focus of many trajectory studies and method papers. Whilst

Role of the funding source
The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report.

Conflict of Interest
AGR has received lecture honoraria from Merck Serono and Janssen-Cilag, and independent research funding from Novo Nordisk. All other authors have no conflicts of interest to declare.

MORE DETAILED DESCRIPTION OF MODEL STRUCTURE
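For $N$ individuals, the latent class trajectory model we consider is given by:

$$y_{it} \mid (c_i = k) = X_{it}^\top \beta_k + Z_{it}^\top b_i + \varepsilon_{it} \qquad \text{(Equation 1)}$$

where $y_{it}$ is the BMI of individual $i$ at time $t$, $c_i$ is the latent class of individual $i$, $\beta_k$ are the fixed effects for class $k$, and $X_{it}$ and $Z_{it}$ are design vectors of functions of age. The random effect $b_i$ is class-specific and follows a multivariate Normal distribution with zero mean and a 3×3 variance-covariance matrix $B$. For the residual error term $\varepsilon_{it}$, the usual assumptions hold: $\varepsilon_{it}$ is normally distributed with zero mean and variance $\sigma^2$. The probability of an individual belonging to class $k$ is described by a multinomial distribution, i.e.,

$$\pi_k = P(c_i = k) = \frac{\exp(\xi_k)}{\sum_{l=1}^{K} \exp(\xi_l)}$$

such that $\xi_k$ are parameters to be estimated in the model. To select the number of latent classes, we assume a working model (Equation 1) for the random effect structure, and the criterion used to select the number of classes was the lowest Bayesian Information Criterion (BIC).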
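As an illustration of the working model, the quadratic random-effects specification can be written out term by term. The symbols below are our own notational assumptions (the original equation did not survive extraction): $y_{it}$ is the BMI of individual $i$ at time $t$, $c_i$ is the latent class, and both the fixed and random parts are assumed to be quadratic in age, which is consistent with the 3×3 covariance matrix described in the text.

```latex
\begin{aligned}
y_{it} \mid (c_i = k) &= (\beta_{0k} + b_{0i})
  + (\beta_{1k} + b_{1i})\,\mathrm{age}_{it}
  + (\beta_{2k} + b_{2i})\,\mathrm{age}_{it}^{2}
  + \varepsilon_{it}, \\
(b_{0i},\, b_{1i},\, b_{2i})^{\top} &\sim N(\mathbf{0},\, B), \qquad
\varepsilon_{it} \sim N(0,\, \sigma^{2}).
\end{aligned}
```

Under this reading, a fixed-effects model such as model A corresponds to setting all three random effects to zero, so that every individual in class $k$ shares the same mean curve.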

MORE DETAILED DESCRIPTION OF MODEL ADEQUACY ASSESSMENT
Common tools for model adequacy assessment are described below in Table S2. Here we give more details of two extensions of these tools: degree of separation, and Elsensohn's residuals (extended to random effects).

Degrees of separation
A model's ability to detect classes accurately is affected by the degree of separation between latent trajectory curves (8,9). To describe the separation of latent growth curves, we used the multivariate Mahalanobis distance ($D$), defined for classes $k$ and $l$ as:

$$D_{kl} = \sqrt{(\mu_k - \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l)}$$

where $\mu_k$ is a $T \times 1$ vector of mean values for class $k$, and $\Sigma^{-1}$ is the inverse of a $T \times T$ matrix of sample covariances at times $t = 1, \dots, T$. The larger the distance, the larger the separation between curves. Peugh and Fan (9) argue that it is reasonable to expect that it is easier to identify heterogeneous latent growth trajectories when the statistical separation distance among the subpopulations is large than when it is much smaller.
To give an overall measure of separation for each model, we propose a weighted sum of the entries of the multivariate Mahalanobis distance matrix, with weights being the estimated class proportions, $\pi_k$. This weighted sum is the degree of separation, $DoS_K$. Larger values of $DoS_K$ indicate that the mean trajectories are well separated, while $DoS_K$ is zero in the special case when all mean trajectories are identical. If the $DoS_K$ value is small, then you may wish to consider a model with fewer classes.
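The pairwise Mahalanobis distance between two class mean trajectories can be sketched in plain Python. This is a minimal illustration with hypothetical function names and invented numbers; it computes only the pairwise distance and does not attempt to reproduce the weighted overall measure, whose exact formula is not recoverable from the text. The linear system is solved by Gaussian elimination so that no external library is needed.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def mahalanobis(mean_a, mean_b, cov):
    """sqrt((mean_a - mean_b)' cov^{-1} (mean_a - mean_b))."""
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    weighted = solve(cov, diff)  # equals cov^{-1} @ diff
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


# Invented mean BMI at two time points for two classes, with an invented
# sample covariance matrix (illustration only).
toy_cov = [[4.0, 0.0], [0.0, 1.0]]
distance = mahalanobis([22.0, 25.0], [24.0, 25.0], toy_cov)
zero_distance = mahalanobis([22.0, 25.0], [22.0, 25.0], toy_cov)
```

Identical mean trajectories give a distance of zero, which corresponds to the special case described above in which the degree of separation is zero.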

Elsensohn's envelope of residuals
To check the model assumption in fixed-effect latent class models, Elsensohn et al. (6) suggest plotting the local standard deviations of the residuals to check the appropriateness of the model, with the assumption that the residuals are homogeneous over time. To check the appropriateness of each of our model assumptions, we extended their method to include random effects in the models. We compute the local standard deviations of the residuals using the following steps:
1) Compute the observed residuals $r_{itk}$ for each subject $i$, at time $t$, given the individual is in class $k$:

$$r_{itk} = y_{itk} - \hat{y}_{itk}$$

where $y_{itk}$ is the observed value for individual $i$ at time $t$ in class $k$ and $\hat{y}_{itk}$ is the fitted value of BMI from our fitted model, here the random effect model.
2) Compute the class- and time-specific weighted local variance of the residuals, $\mathrm{Var}(r_{itk})$, with weights being $p_{ik}$, the posterior probabilities of individual $i$ belonging to group $k$.
3) Plot the upper and lower boundaries for the local standard deviations of the residuals
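Steps 1 and 2 above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes all individuals are observed at the same time points, computes the weighted variance separately at each time point rather than over a smoothing window, and centres the residuals on their posterior-weighted mean. The function name and the toy values are hypothetical.

```python
import math


def weighted_residual_sd(observed, fitted_k, post_k):
    """Posterior-weighted SD of class-k residuals at each time point.

    observed[i][t] : observed value for individual i at time t
    fitted_k[i][t] : fitted value for individual i at time t under class k
    post_k[i]      : posterior probability that individual i is in class k
    """
    n_times = len(observed[0])
    total_weight = sum(post_k)
    sds = []
    for t in range(n_times):
        # Step 1: residuals given class k.
        resid = [obs[t] - fit[t] for obs, fit in zip(observed, fitted_k)]
        # Step 2: weighted variance with posterior probabilities as weights.
        mean = sum(w * r for w, r in zip(post_k, resid)) / total_weight
        var = sum(w * (r - mean) ** 2 for w, r in zip(post_k, resid)) / total_weight
        sds.append(math.sqrt(var))
    return sds


# Toy data: three individuals, two time points (illustrative values only).
toy_observed = [[22.0, 24.0], [26.0, 27.0], [24.0, 30.0]]
toy_fitted = [[23.0, 25.0], [25.0, 26.0], [24.0, 27.0]]
toy_post = [1.0, 1.0, 0.0]  # the third individual has zero weight in class k
local_sd = weighted_residual_sd(toy_observed, toy_fitted, toy_post)
```

Plotting the class mean trajectory plus and minus `local_sd` at each time point would give the envelope boundaries referred to in step 3.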

Mismatch
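Target value: close to 0 for each class.
Mismatch is the difference between the estimated class proportions and the class membership proportions once individuals have been assigned to a class, i.e.

$$\mathrm{Mismatch}_k = \pi_k - \frac{N_k}{N}$$

where $\pi_k$ is the estimated proportion in class $k$, $N_k$ is the number of individuals assigned to class $k$ and $N$ is the total number of individuals.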

Entropy
Target value: close to 0.
Entropy is a global measure of classification uncertainty, which takes into account all $N \times K$ posterior probabilities. The entropy of a model is defined as

$$E_K = -\sum_{i=1}^{N} \sum_{k=1}^{K} p_{ik} \log p_{ik}$$

where $p_{ik}$ is the posterior probability of individual $i$ belonging to class $k$. It takes values from [0, ∞), with higher values indicating a larger amount of uncertainty. Entropy values closest to 0 correspond to models with least classification uncertainty.
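The mismatch and entropy measures described above can be sketched as follows. This is a minimal illustration with hypothetical function names and invented posterior probabilities; terms with zero posterior probability are taken to contribute zero to the entropy, by the usual convention.

```python
import math


def mismatch(class_proportions, post):
    """Estimated class proportion minus assigned proportion, per class."""
    n = len(post)
    n_classes = len(class_proportions)
    counts = [0] * n_classes
    for row in post:
        # Assign each individual to the class with the highest posterior.
        counts[max(range(n_classes), key=row.__getitem__)] += 1
    return [pi - count / n for pi, count in zip(class_proportions, counts)]


def entropy(post):
    """Sum over individuals and classes of -p * log(p); 0 means no uncertainty."""
    return -sum(p * math.log(p) for row in post for p in row if p > 0)


# Toy posterior probabilities (illustrative values only).
toy_post = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
toy_mismatch = mismatch([0.55, 0.45], toy_post)
toy_entropy = entropy(toy_post)
certain_entropy = entropy([[1.0, 0.0], [0.0, 1.0]])
```

When every individual is assigned with certainty, as in the last line, the entropy is zero; values grow as the posterior probabilities move towards an even split between classes.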

Methods:
We developed an eight-step framework: step 1, a scoping model; step 2, refining the number of classes; step 3, refining model structure (from fixed-effects through to a flexible random-effect specification); step 4, model adequacy assessment; step 5, graphical presentations; step 6, use of additional discrimination tools ('degree of separation'; Elsensohn's envelope of residual plots); step 7, clinical characterisation and plausibility; and step 8, sensitivity analysis. We illustrated these steps using data from the NIH-AARP cohort of repeated determinations of body mass index (BMI) at baseline (mean age: 62.5 years), and BMI derived by weight recall at ages 18, 35 and 50 years.

Results
Latent class trajectory modelling (LCTM) simplifies heterogeneous populations into more homogeneous clusters or classes. From these, one can potentially include random effects to allow for individual variation within these classes. These models have a long history in the criminology (6) and psychology (7) literatures, and are now increasingly reported in the human epidemiology literature (for example, disentangling the heterogeneity of childhood asthma (8)). Of relevance to this paper, LCTM has been used in association studies of repeated BMI measures with the following endpoints: all-cause mortality (9); cancer incidence (multiple cancer types (10); gastro-oesophageal (11); prostate (12)); and cancer mortality (12). The LCTM has three general advantages compared with using 'one-off' exposure determinations. First, it better informs aetiological associations by deeply phenotyping certain 'at risk' subpopulations. Second, LCTM offers a public health strategy to identify early divergent adverse trajectories as potential intervention targets.
Some researchers additionally argue that LCTM is well-equipped for future forecasting and new-patient generalisations in prediction models, as it handles data following a different predictable pattern from that learnt by the model (13). Third, the trajectory approach allows a better understanding of the causes of between-individual variation in certain features (e.g., weight variation over age), by analysing the trajectory as an outcome rather than an exposure.
However, LCTM is a complex form of modelling and requires several different structure assumptions (14). Although firmly acknowledged in the GRoLTS-Checklist: Guidelines for Reporting on Latent Trajectory Studies (15), structure-related assumptions have not been systematically evaluated. For many exposures of interest, typically two to seven classes might be described and, as detailed later, at least seven model structures might be fitted, with and without linear curve properties, such that it is possible to derive more than eighty different models. Thus, reported differences between studies using latent class modelling might reflect different modelling assumptions rather than true differences between populations. To facilitate the generalisability of results in future studies, we propose here a framework to construct and select a 'core' LCTM, using an example of repeatedly determined body mass index (BMI) across adulthood in the National Institutes of Health (NIH)-AARP Diet and Health Study cohort. For exposure-disease outcome association analyses, current approaches generally use two stages: first, latent class trajectory modelling, followed by standard association modelling. The framework described here is limited to the first stage.

Cohort
The NIH-AARP Diet and Health Study is a US cohort recruited from 1995 (16 (17).

Latent class trajectory modelling
We developed an eight-step framework (Table 1) modelling BMI as a function of age. Latent classes were used to identify subgroups of participants with distinct trajectories (detailed mathematical equations in supplemental material p2) (18). We used maximum likelihood approaches to fit the model with the 'hlme' function from the 'lcmm' library (19) in the R software environment (version 3.2.1) and cross-checked results using the 'PROC TRAJ' function in 'SAS traj' (SAS Institute, Inc., Cary, NC) (20) (Table S1).
Step 1: We initially constructed a scoping model, provisionally selecting the plausible number of classes based on available literature; in the context of BMI trajectories, we used K = 5 classes as reported elsewhere (10,12). We built models for both genders, as BMI patterns of lifetime changes differ for men and women (21). To determine the initial working model structure, we followed the rationale of Verbeke and Molenberghs (22) and examined the shape of standardised residual plots for each of the five classes in a model with no random effects. If the residual profile could be approximated by a flat line, a straight line or a curve, then a random intercept, slope or quadratic term, respectively, was considered. Preliminary plots suggested preference for a quadratic random-effects model (Figure S1).
Step 2: We refined the preliminary working model from step 1 to determine the optimal number of classes, testing K = 1 to 7. The number of classes chosen was based on the lowest Bayesian Information Criterion (BIC).
Step 3: We further refined the model, using the favoured K derived in step 2, testing for the optimal model structure. We tested seven models (detailed in Table S2), ranging from a simple fixed-effects model (model A) through a rudimentary method that allows the residual variances to vary between classes (model B) to a suite of five random-effects models with different variance structures (models C-G).
Step 4: We then performed a number of model adequacy assessments. First, for each participant, we calculated the posterior probability of being assigned to each trajectory class, and assigned the individual to the class with the highest probability. An average of these maximum posterior probabilities of assignment (APPA) above 70%, in all classes, is regarded as acceptable (6). We further assessed model adequacy using odds of correct classification (OCC), mismatch scores and entropy, $E_K$ (detailed in Table S3). These diagnostic tools assist in model selection (6,23).
Step 5: We used three graphical presentation approaches. The conventional approach is to plot mean trajectories with time encompassing each class. Alternatives include the use of mean trajectory plots with 95% predictive intervals for each class, which displays the predicted random variation within each class; or to plot individual level 'spaghetti plots' with time (for example, a random sample of participants), which allows the reader to observe the patterns of changes within classes.
To check structure assumptions in fixed-effects latent class models, Elsensohn et al. suggest plotting the local standard deviations of the residuals with time for each class; boundaries that are not parallel suggest that across-class variability may not be fully accounted for.
Step 7: We assessed clinical characterisation and plausibility using four approaches: i) assessing the clinical meaningfulness of the trajectory patterns, aiming to include classes that capture at least 1% of the population; ii) assessing the clinical plausibility of the trajectory classes; iii) tabulation of characteristics by latent classes versus conventional categorisations; and iv) concordance of class membership with conventional BMI category membership using the kappa statistic (as LCTM is an unsupervised learning approach, we computed κ for all possible combinations and selected the optimal κ).
Step 8: We conducted sensitivity analyses, in this example, with individuals with at least two and three BMI values, as LCTMs are flexible enough to deal with different observation times between participants.
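
The step 4 adequacy measures can be illustrated with a minimal Python sketch that computes modal class assignments, APPA and OCC from a matrix of posterior probabilities. The function name `appa_and_occ`, the input layout, and the use of mean posterior probabilities as the class-proportion estimates are assumptions made for illustration and are not taken from the LCTMtools code. OCC is computed using Nagin's definition, the odds of correct classification given modal assignment divided by the odds expected from the class proportion alone.

```python
def appa_and_occ(post):
    """Return modal assignments, per-class APPA and per-class OCC.

    post: list of rows, one per individual; each row holds that
    individual's posterior probability for every class (rows sum to 1).
    """
    n = len(post)
    k = len(post[0])
    # Modal assignment: the class with the highest posterior probability.
    assigned = [max(range(k), key=lambda c: row[c]) for row in post]
    # Assumption: class proportions estimated by the mean posterior probability.
    pi_hat = [sum(row[c] for row in post) / n for c in range(k)]

    appa, occ = [], []
    for c in range(k):
        probs = [row[c] for row, a in zip(post, assigned) if a == c]
        if not probs:
            # No individual was assigned to this class.
            appa.append(None)
            occ.append(None)
            continue
        a_c = sum(probs) / len(probs)  # average posterior probability of assignment
        appa.append(a_c)
        if pi_hat[c] >= 1.0:
            # OCC is undefined when a single class holds everyone.
            occ.append(None)
            continue
        odds_correct = float("inf") if a_c >= 1.0 else a_c / (1.0 - a_c)
        odds_chance = pi_hat[c] / (1.0 - pi_hat[c])
        occ.append(odds_correct / odds_chance)
    return assigned, appa, occ
```

Each APPA value can then be compared against the 70% threshold described in step 4.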

Statistical algorithms
All R and SAS codes used to implement these tools are available via the authors and can be downloaded from www.github.com/hlennon/LCTMtools.

Number of classes
From the preliminary working model of a quadratic random-effects model, model F (proportional covariance structure), we derived BICs for up to seven classes: three of the class models failed to converge in men and women. Table 2 reports that the lowest BIC was obtained with five classes in men and women, confirming our initial working model. The proportions by class in men were 68.1%, 25.0%, 3.8%, 2.7% and 0.4%; in women, they were 32.6%, 41.1%, 21.1%, 3.5% and 1.7%. For model G (our second favoured model), the lowest BICs were also noted for five classes in men and women (Table S2).
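
The class-number selection rule used here (lowest BIC among models that converged) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `fits` dictionary layout and the choice of `n_obs` as the sample-size term are assumptions, and software packages differ in whether they use the number of individuals or the number of observations in the penalty.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: -2 logL + p log(n)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def select_n_classes(fits, n_obs):
    """Pick the number of classes with the lowest BIC.

    fits: dict mapping K -> (log_likelihood, n_parameters, converged).
    Models flagged as not converged are excluded from the comparison.
    """
    scores = {
        k: bic(log_lik, n_params, n_obs)
        for k, (log_lik, n_params, converged) in fits.items()
        if converged
    }
    if not scores:
        raise ValueError("no converged models to compare")
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

The function returns both the selected K and the full set of BIC values, so that the comparison across K can be tabulated alongside the selection.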

Assessment for model structures
With the number of classes now selected as five, we tested the seven model structures, A to G. There was moderately good concordance (unweighted and weighted) between the unstructured variance model (model G) and model F in men (κ: 0.57) and women (κ: 0.65) (Tables S4 and S5), but poorer concordance between the favoured models and the fixed-effects models in men.

Graphical presentation
We plotted the mean trajectories for models A, B, C, D, F and G in men and women (Figure 1), illustrating the increased complexity from model A to G. As alternatives, we plotted separately the mean trajectories with 95% predictive intervals for each class in model F (Figure S2), which display the predicted random variation within each class over time; variation was greater in the more 'complex' classes (classes IV and V compared with classes I, II and III). Spaghetti plots of individual-level data illustrated that the timing and size of BMI changes characterise the classes: for example, sharp increases in BMI occurred in early adulthood in Class III but later in adulthood in Class IV (Figure S3).

Additional tools of suitability of fit
The DoS_K values ranged from 0.10 to 0.36 and 0 to 0.34, in men and women, respectively (Table 3). The covariances were high and in the positive direction, and therefore models with non-parallel mean trajectories lead to higher separation.
We plotted the local standard deviation of the residuals with time and found that these were broadly homogeneous, i.e., there were few parallel boundaries (Figure 2). The local residuals for the rapidly obese groups in both genders are the exceptions to parallel lines, which might reflect comorbidities in this group and smaller numbers.
Having established the favoured model, model F with five classes in both genders, we assigned descriptive labels to each respective class as follows (Table 4): stable normal weight; normal weight to overweight; normal weight to obese; overweight to obese; and rapid early obesity. We noted that the proportion in the rapid early obesity class (Class V) was less than 1% in men. However, overall, the proportion for Class V for men and women combined was nearly 1%. Thus, we retained this class as we judged it to be clinically meaningful as follows.

Clinical assessment
In both genders, there were rapid increases in obesity from early to middle adulthood, then apparent severe weight reductions. We rationalised that this was clinically plausible, as it could be explained either by intentional (e.g. bariatric surgery) or non-intentional weight loss (e.g. reverse causality from development of disease).
Finally, we noted very poor concordance between the favoured model and conventional BMI categorisation in men (κ: 0.18) and women (κ: 0.52) (Table 5).
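
Because latent class labels are arbitrary, concordance with a reference categorisation depends on how classes are matched to categories. The Python sketch below illustrates one way to compute unweighted Cohen's κ over every relabelling and keep the highest value, in the spirit of computing κ "for all possible combinations". It assumes both classifications share the same label set; the function names are hypothetical and this is not the authors' code.

```python
from itertools import permutations


def cohen_kappa(a, b, labels):
    """Unweighted Cohen's kappa for two label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Kappa is undefined when chance agreement is already perfect.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


def best_kappa(latent, reference, labels):
    """Try every relabelling of `latent` and return (kappa, mapping) for the best one."""
    best = None
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        relabelled = [mapping[x] for x in latent]
        kappa = cohen_kappa(relabelled, reference, labels)
        if best is None or kappa > best[0]:
            best = (kappa, mapping)
    return best
```

With five classes this evaluates 120 relabellings, which is inexpensive; for many more classes an assignment algorithm would be a more practical choice.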

Sensitivity analyses
We tested the favoured model using a larger sample of individuals with at least three measures, and found no material differences between these models, in men and women, and the main model (Figure S4).

Main findings
We propose an eight-step framework for the construction and selection of models derived from latent class trajectory modelling. We evaluated a range of model structures from fixed- [...] phenotypes. We propose pre-specified criteria for model selection, and suggest that the reporting of a 'core' model will facilitate generalisability of results in future studies.

Context of other literature
To the best of our knowledge, this is the first study to systematically address structure-related assumptions in LCTMs, and their potential impact on clinically relevant endpoints (in this example, BMI trajectories). Anecdotally, there is a justifiable criticism regarding the use of LCTMs and an uncertainty about how class memberships are derived: a 'black box' effect. The framework proposed here encourages the opposite: a transparent stepwise approach to class and model structure selection. To enhance this process, for example, we have 'borrowed' tools developed to quantify uncertainty, such as entropy measures, and applied them to assist assessment of model adequacy. A further modification of discrimination measurement with variance estimation has been described by Shah and colleagues (27), and might have importance for class assignment where 'yes/no' treatment decisions are required.
Variations of model A (fixed-effects) have been reported in the clinical literature (9)(10)(11)(12), which assume no within-class variability when deriving latent classes. Interpretation in this setting is that variation from the mean trajectory is random, i.e., the correlation between measurements for the same individual is explained by latent class membership. In the context of any repeated measures in the general population, this assumption might not be valid (14). Saunders (28) argued in support of full random-effects models (i.e. models F and G), calling upon Moffitt's theory from criminology, which recognises that "there are distinct developmental clusters of trajectories of anti-social behaviour that are the result of divergent aetiologies" -in other words, it is unlikely that latent classes start from a similar baseline. The publication of the 16-item GRoLTS Guidelines in 2017 (15) heralded an important advance for the application of LCTM. Here, we add a framework for construction and interpretation.

Strengths and weaknesses
The study has several strengths. First, the considered and strategic workflow to optimise identification and application of latent classes provides for a more robust and transparent application of these models in epidemiology. Second, the results presented are based on modelling data from a large, well-characterised US cohort, therefore allowing the derivation of numerically meaningful subpopulations (i.e. classes) with distinct phenotypes. We uniquely used averaged kappa values to demonstrate that the LCTM-derived subpopulations are markedly different to those derived from 'one-off' BMI determinations. In turn, BMI trajectories are more likely to reflect the normal clinical practice of considering a 'weight history'.
Third, we extensively explored different model selection and adequacy tools, and described extensions to other tools, to supplement model interpretation. Fourth, to further support model interpretation, we embedded this project within a multi-disciplinary research team including data scientists, statisticians, clinicians and epidemiologists, an approach echoed elsewhere (29). Finally, we have made the statistical algorithms freely available.
There are several study weaknesses. First, LCTMs currently consider trajectories of only one risk factor at a time. Second, there were only four time points in the AARP data, such that it was not possible to assess weight cycling. Third, whilst we described multiple diagnostic tests, model selection was ultimately based on case-study-appropriate model interpretation (for example, model adequacy; discrimination; clinical plausibility; sensitivity analyses) as well as likelihood-based model-fit criteria (18). The trade-off between BIC efficiency and model adequacy is summarised by Nagin (6).
For future research, improving the construction, interpretation and reporting of LCTM (advocated here) is important, as the LCTM approach offers opportunities to identify, and intervene early in, sub-populations with adverse trajectories. This approach is analogous to the well-held public strategy of using childhood growth charts to identify, and intervene in, young children failing to thrive. Thus, in the example of BMI, remembering that 80% of obese adults were not obese in childhood (30), future LCTM studies might identify (new) individuals in their 20s or early 30s on adverse trajectories towards later adulthood obesity. This strategy is a new methodological paradigm, as the repeated measurement of a risk factor (here, BMI) becomes a clinically relevant endpoint rather than just an exposure.

Role of the funding source
The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report.

Conflict of Interest
AGR has received lecture honoraria from Merck Serono and Janssen-Cilag, and independent research funding from Novo Nordisk. All other authors have no conflicts of interest to declare.

Contributions
HL, MS, MC and AGR conceptualised the paper. HL, MS and SK designed the statistical approaches; HL performed the modelling. AC and ML facilitated data access and interpretation of the AARP data. All authors contributed to data interpretation; IB and AGR put modelling into clinical context.

Acknowledgements
We acknowledge the generous funding from Cancer Research UK National Awareness and Early Detection Initiative (NAEDI).

Data sharing statement
All statistical codes used in this paper are available via the authors and can be downloaded from www.github.com/hlennon/LCTMtools.

No random effects: the interpretation is that any deviation of an individual's trajectory from its mean class trajectory is due to random error only.
SAS traj; PROC TRAJ
Fixed effects, heteroscedastic (class-specific residual variances): the same interpretation as Model A, with random errors that can be larger or smaller in different classes.
R mmlcr; mmlcr
The interpretation is that individuals are allowed to vary in initial weight, but each class member is assumed to follow the same shape and magnitude of the mean trajectory.
SAS traj; PROC TRAJ

MORE DETAILED DESCRIPTION OF MODEL STRUCTURE
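For $N$ individuals, the latent class trajectory model we consider is given by:

$$BMI_{ijk} = \beta_{0k} + \beta_{1k} t_{ij} + \beta_{2k} t_{ij}^{2} + b_{0ik} + b_{1ik} t_{ij} + b_{2ik} t_{ij}^{2} + \varepsilon_{ij} \qquad (1)$$

where $BMI_{ijk}$ is the BMI of individual $i = 1, \dots, N$, at time $j = 1, \dots, T$, in class $k = 1, \dots, K$. The random effect $b_{ik} = (b_{0ik}, b_{1ik}, b_{2ik})^{\top}$ is class-specific and follows a multivariate Normal distribution with zero mean and a 3×3 variance-covariance matrix $B$. For the residual error term $\varepsilon_{ij}$, the usual assumption holds: $\varepsilon_{ij}$ is normally distributed with zero mean and variance $\sigma^{2}$. The probability of an individual belonging to class $k$ is described by a multinomial distribution, i.e.,

$$P(c_i = k) = \frac{\exp(\xi_k)}{\sum_{l=1}^{K} \exp(\xi_l)}$$

such that $\xi_k$ are parameters to be estimated in the model. To select the number of latent classes, we assume a working model (Equation 1) for the random effect structure, and the criterion used to select the number of classes was the lowest Bayesian Information Criterion (BIC).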

MORE DETAILED DESCRIPTION OF MODEL ADEQUACY ASSESSMENT
Common tools for model adequacy assessment are described below in Table S2. Here we give more details of the two extensions of these tools: degree of separation and Elsensohn's residuals (extended to random effects).

Degrees of separation
A model's ability to detect classes accurately is affected by the degree of separation between latent trajectory curves (8,9). To describe the separation of latent growth curves, we used the multivariate Mahalanobis distance ($D$), defined for classes $k$ and $l$ as:

$$D_{kl} = \sqrt{(\mu_k - \mu_l)^{\top} \Sigma^{-1} (\mu_k - \mu_l)}$$

where $\mu_k$ is a $T \times 1$ vector of mean values for class $k$, and $\Sigma^{-1}$ is the inverse of a $T \times T$ matrix of sample covariances at times $j = 1, \dots, T$. The larger the distance, the larger the separation between curves. Peugh and Fan (9) argue that it is reasonable to expect that heterogeneous latent growth trajectories are easier to identify when the statistical separation distance among the latent subpopulations is large than when it is much smaller.
To give an overall measure of separation for each model, we propose a weighted sum of the multivariate Mahalanobis distance matrix, with weights being the estimated class proportions, $\hat{\pi}_k$.
The resulting degree of separation is denoted $DoS_K$. Larger values of $DoS_K$ indicate that the mean trajectories are well separated, while $DoS_K$ is zero in the special case when all mean trajectories are identical. If the $DoS_K$ value is small, a model with fewer classes may be worth considering.
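
The pairwise Mahalanobis distance between class mean trajectories can be sketched in Python as below. The `mahalanobis` function follows the standard definition of the distance. The `weighted_separation` function is only an illustrative summary that weights each pairwise distance by the product of the two class proportions; this weighting is an assumption chosen because it is zero when all mean trajectories are identical, and it is not necessarily the weighting used for DoS_K. All function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mahalanobis(mu_k, mu_l, cov):
    """Distance between two class mean vectors given a T x T covariance matrix."""
    d = [p - q for p, q in zip(mu_k, mu_l)]
    z = solve(cov, d)  # z = cov^{-1} d, without forming the inverse explicitly
    return math.sqrt(sum(di * zi for di, zi in zip(d, z)))


def weighted_separation(means, props, cov):
    """Illustrative summary (assumed weighting): sum over class pairs of pi_k * pi_l * D_kl."""
    total = 0.0
    for k in range(len(means)):
        for l in range(k + 1, len(means)):
            total += props[k] * props[l] * mahalanobis(means[k], means[l], cov)
    return total
```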

Elsensohn's envelope of residuals
To check the model assumption in fixed effect latent class models, Elsensohn et al. (6) proposed an envelope of residuals, which we extend to random-effects models as follows:
1) Compute the residuals, $res_{ijk} = y_{ijk} - \hat{y}_{ijk}$, where $y_{ijk}$ is the observed value for individual $i$ at time $j$ in class $k$ and $\hat{y}_{ijk}$ is the fitted value of BMI from our fitted model, here the random effect model.
2) Compute the class- and time-specific weighted local variance of the residuals, with weights being $p_{ik}$, the posterior probabilities of individual $i$ belonging to group $k$.
3) Plot the upper and lower boundaries for the local standard deviations of the residuals.
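
Step 2 of this procedure can be illustrated with a short Python sketch that computes a posterior-weighted standard deviation of the residuals for each class and time point. It assumes measurements are taken at a common set of time points, so that "local" reduces to "at each time point", and it uses a weighted mean and weighted variance normalised by the sum of weights; both choices, along with the function name and input layout, are assumptions for illustration.

```python
import math


def residual_sd_envelope(resid, post):
    """Posterior-weighted standard deviation of residuals per class and time.

    resid[k][i][t]: residual of individual i at time t under class k
                    (None where the measurement is missing).
    post[i][k]:     posterior probability that individual i belongs to class k.
    Returns sd[k][t], or None where no weighted observations are available.
    """
    n_classes = len(resid)
    n_times = len(resid[0][0])
    sd = []
    for k in range(n_classes):
        row = []
        for t in range(n_times):
            pairs = [
                (post[i][k], r[t])
                for i, r in enumerate(resid[k])
                if r[t] is not None
            ]
            w_sum = sum(w for w, _ in pairs)
            if w_sum <= 0:
                row.append(None)
                continue
            mean = sum(w * r for w, r in pairs) / w_sum
            var = sum(w * (r - mean) ** 2 for w, r in pairs) / w_sum
            row.append(math.sqrt(var))
        sd.append(row)
    return sd
```

The envelope boundaries for class k can then be drawn at plus and minus `sd[k][t]` across the time points.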

Mismatch
Target: close to 0 for each class. The difference between the estimated class proportions and the class membership proportions once individuals have been assigned to a class, i.e.

$$\text{Mismatch}_k = \hat{\pi}_k - \frac{N_k}{N}$$

where $N_k$ is the number of individuals assigned to class $k$ and $N$ is the total number of individuals.

Entropy
Target: close to 0. Entropy is a global measure of classification uncertainty, which takes into account all $N \times K$ posterior probabilities. The entropy of a model is defined as

$$E = -\sum_{i=1}^{N} \sum_{k=1}^{K} p_{ik} \log p_{ik}$$

which takes values from [0, ∞), with higher values indicating a larger amount of uncertainty. Entropy values closest to 0 correspond to models with the least classification uncertainty.
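
The mismatch and entropy measures can be computed directly from the posterior probabilities, as in the following Python sketch. The function names and input layout are assumptions for illustration; terms with zero posterior probability are taken to contribute nothing to the entropy sum, following the usual convention that p log p tends to 0 as p tends to 0.

```python
import math


def mismatch(pi_hat, post):
    """Estimated class proportion minus the proportion assigned to each class.

    pi_hat: estimated class proportions from the fitted model.
    post:   post[i][k], posterior probability of individual i for class k.
    """
    n = len(post)
    k = len(pi_hat)
    counts = [0] * k
    for row in post:
        # Modal assignment: the class with the highest posterior probability.
        counts[max(range(k), key=lambda c: row[c])] += 1
    return [pi_hat[c] - counts[c] / n for c in range(k)]


def entropy(post):
    """Classification entropy: minus the sum of p * log(p) over all individuals and classes."""
    return -sum(p * math.log(p) for row in post for p in row if p > 0)
```

Entropy is 0 when every individual is assigned with certainty, and it grows as posterior probabilities become more evenly spread across classes.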


[...] (3), and all-cause mortality (4). This approach is crude, and many investigators seek alternative methods, termed life-course analysis, that might better capture long-term risk factor exposure.
There are widely-used examples that capture cumulative exposure, such as pack-years for smoking and lung cancer, but the assumption that incidence rate is proportional to total lifetime dose is questionable (5). Many other life-course models simply extract features for use in standard regression approaches; for example, a weight change over time. A more sophisticated approach, which takes account of within-individual correlations, is mixed-effect modelling, but this is difficult to interpret for public health implementation. An extension of this approach is the use of latent classes, also termed growth mixture models.
Latent class trajectory modelling (LCTM) simplifies heterogeneous populations into more homogeneous clusters or classes. From these, one can potentially include random effects to allow for individual variation within these classes. These models have a long history in the criminology (6) and psychology (7) literatures, and are now increasingly reported in the human epidemiology literature (for example, disentangling the heterogeneity of childhood asthma (8)). Of relevance to this paper, LCTM has been used in association studies of repeated BMI measures with the following endpoints: all-cause mortality (9); cancer incidence (multiple cancer types (10); gastro-oesophageal (11); prostate (12)); and cancer mortality (12). The LCTM has three general advantages compared with using 'one-off' exposure determinations. First, it better informs aetiological associations by deeply phenotyping certain 'at risk' subpopulations. Second, LCTM offers a public health strategy to identify early divergent adverse trajectories as potential intervention targets.
Some researchers additionally argue that LCTM is well-equipped for future forecasting and new-patient generalisations in prediction models, as it handles data following a different predictable pattern from that learnt by the model (13). Thirdly, the trajectory approach allows a better understanding of the causes of between-individual variation in certain features (e.g., weight variation over age), by analysing the trajectory as an outcome rather than exposure.
However, LCTM is a complex form of modelling and requires several different structure assumptions (14). Although firmly acknowledged in the GRoLTS-Checklist: Guidelines for Reporting on Latent Trajectory Studies (15), structure-related assumptions have not been systematically evaluated. For many exposures of interest, typically two to seven classes might be described and, as detailed later, at least seven model structures might be fitted, with and without linear curve properties, such that it is possible to derive more than eighty different models. Thus, reported differences between studies using latent class modelling might reflect different modelling assumptions rather than true differences between populations. To facilitate the generalisability of results in future studies, we propose here a framework to construct and select a 'core' LCTM, using an example of repeatedly determined body mass index (BMI) across adulthood in the National Institutes of Health (NIH)-AARP Diet and Health Study cohort. For exposure-disease outcome association analyses, current approaches generally use two stages: first, latent class trajectory modelling, followed by standard association modelling. The framework described here is limited to the first stage.

Cohort
The NIH-AARP Diet and Health Study is a US cohort recruited from 1995 (16 (17).

Latent class trajectory modelling
We developed an eight step framework (  (Table S1).

Patient and Public Involvement
No patients or members of the public were involved in this manuscript.



MORE DETAILED DESCRIPTION OF MODEL ADEQUACY ASSESSMENT
Common tools for model adequacy assessment checking are described below in Table S2. Here we give more details of the two extensions of these tools; degree of separation and Elsensohn's residuals (to random effects).

Degrees of separation
A model's ability to detect classes accurately is affected by the degree of separation between latent trajectory curves (8,9). To describe the separation of latent growth curves, we used the multivariate Mahalanobis distance with the multivariate Mahalanobis distance (D) units defined as:  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60   F  o  r  p  e  e  r  r  e  v  i  e  w  o  n  l  y   3 where ! ! is a !×1 vector of mean values for class !, and ! !! is the inverse of a ! ×! matrix of sample covariances of at times ! = 1, … , !. The larger the difference, the larger the separation between curves. Peugh and Fan (9) argue that it is reasonable to expect that it is easier to identify heterogeneous latent growth trajectories when the statistical separation distance among the subpopulations is larger than when the separation distance among the latent subpopulations is much smaller.
To give an overall measure of separation for each model, we propose a weighted sum of the entries of the multivariate Mahalanobis distance matrix, with weights being the estimated class proportions, $\hat{\pi}_k$.
The resulting degree of separation, $DoS_K$, is larger when the mean trajectories are well separated and is zero in the special case when all mean trajectories are identical. If the $DoS_K$ value is small, a model with fewer classes may be worth considering.
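As a hedged illustration of the idea, the sketch below computes the pairwise Mahalanobis distances between class mean trajectories and combines them into a single summary. The weighting used here, $\hat{\pi}_k \hat{\pi}_l$ summed over pairs $k < l$, is an assumption made for this example and may differ from the exact definition used in the authors' LCTMtools code; the function names are hypothetical.

```python
import numpy as np

def mahalanobis_matrix(mu, S):
    """Pairwise Mahalanobis distances between class mean trajectories.

    mu: (K, T) array of class mean values; S: (T, T) sample covariance.
    """
    mu = np.asarray(mu, dtype=float)
    S_inv = np.linalg.inv(np.asarray(S, dtype=float))
    K = mu.shape[0]
    D = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            d = mu[k] - mu[l]
            D[k, l] = np.sqrt(d @ S_inv @ d)
    return D

def degree_of_separation(mu, S, pi):
    """ASSUMED weighting: sum over pairs k < l of pi_k * pi_l * D_kl."""
    D = mahalanobis_matrix(mu, S)
    W = np.outer(pi, pi)
    upper = np.triu_indices(len(pi), k=1)   # each pair counted once
    return float((W[upper] * D[upper]).sum())
```

Under this weighting, identical mean trajectories give a value of exactly zero, and large, well-populated classes that lie far apart contribute most to the total.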

Mismatch
Criterion: close to 0 for each class. The difference between the estimated class proportions and the class membership proportions once individuals have been assigned to a class, i.e.
$$\text{Mismatch}_k = \hat{\pi}_k - \frac{N_k}{N}$$
where $N_k$ is the number of individuals assigned to class $k$ and $N$ is the total number.

Entropy
Criterion: close to 0. Entropy is a global measure of classification uncertainty, which takes into account all $N \times K$ posterior probabilities. The entropy of a model is defined as
$$E_K = -\sum_{i=1}^{N} \sum_{k=1}^{K} p_{ik} \log(p_{ik})$$
which takes values in $[0, \infty)$, with higher values indicating a larger amount of uncertainty. Entropy values closest to 0 correspond to models with the least classification uncertainty.
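Both quantities can be computed directly from the $N \times K$ matrix of posterior probabilities. The sketch below is a minimal illustration; the function names and the convention that $0 \log 0 = 0$ are choices made for this example rather than part of the authors' code.

```python
import numpy as np

def mismatch(post, pi_hat):
    """pi_hat_k - N_k / N, with each individual assigned to the modal class."""
    post = np.asarray(post, dtype=float)
    assigned = post.argmax(axis=1)
    counts = np.bincount(assigned, minlength=post.shape[1])
    return np.asarray(pi_hat, dtype=float) - counts / post.shape[0]

def entropy(post):
    """E = -sum_i sum_k p_ik log(p_ik), treating 0 * log(0) as 0."""
    post = np.asarray(post, dtype=float)
    safe = np.where(post > 0, post, 1.0)   # log(1) = 0 drops the zero entries
    return float(-(post * np.log(safe)).sum())
```

A model that assigns every individual to a class with probability 1 has entropy 0, whereas posterior probabilities spread evenly across classes push the value upwards.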


(3), and all-cause mortality (4). This approach is crude, and many investigators seek alternative methods, termed life-course analysis, that might better capture long-term risk factor exposure.

There are widely used examples that capture cumulative exposure, such as pack-years for smoking and lung cancer, but the assumption that incidence rate is proportional to total lifetime dose is questionable (5). Many other life-course models simply extract features for use in standard regression approaches, for example, weight change over time. A more sophisticated approach, which takes account of within-individual correlations, is mixed-effect modelling, but this is difficult to interpret for public health implementation. An extension of this approach is the use of latent classes, also termed growth mixture models.
Latent class trajectory modelling (LCTM) simplifies heterogeneous populations into more homogeneous clusters or classes. Within these classes, one can include random effects to allow for individual variation. These models have a long history in the criminology (6) and psychology (7) literatures and are now increasingly reported in the human epidemiology literature (for example, disentangling the heterogeneity of childhood asthma (8)). Of relevance to this paper, LCTM has been used in association studies of repeated BMI measures with the following endpoints: all-cause mortality (9); cancer incidence (multiple cancer types (10); gastro-oesophageal (11); prostate (12)); and cancer mortality (12). LCTM has three general advantages compared with using 'one-off' exposure determinations. First, it better informs aetiological associations by deeply phenotyping certain 'at risk' subpopulations. Second, it offers a public health strategy to identify early divergent adverse trajectories as potential intervention targets.
Third, the trajectory approach allows a better understanding of the causes of between-individual variation in certain features (e.g., weight variation over age), by analysing the trajectory as an outcome rather than an exposure. Some researchers additionally argue that LCTM is well equipped for forecasting and new-patient generalisation in prediction models, as it handles data following a different predictable pattern from that learnt by the model (13).
However, LCTM is a complex form of modelling and requires several different structure assumptions (14). Although firmly acknowledged in the GRoLTS-Checklist: Guidelines for Reporting on Latent Trajectory Studies (15), structure-related assumptions have not been systematically evaluated. For many exposures of interest, typically two to seven classes might be described and, as detailed later, at least seven model structures might be fitted, with and without linear curve properties, such that it is possible to derive more than eighty different models. Thus, reported differences between studies using latent class modelling might reflect different modelling assumptions rather than true differences between populations. To facilitate the generalisability of results in future studies, we propose a framework to construct and select a 'core' LCTM, using an example of repeatedly determined body mass index (BMI) across adulthood in the National Institutes of Health (NIH)-AARP Diet and Health Study cohort. For exposure-disease outcome association analyses, current approaches generally use two stages: first, latent class trajectory modelling, followed by standard association modelling. The framework described here is limited to the first stage.

Cohort
The NIH-AARP Diet and Health Study is a US cohort recruited from 1995 (16 (17).

Latent class trajectory modelling
We developed an eight step framework (  (Table S1).
Step 1: We initially constructed a scoping model, provisionally selecting a plausible number of classes based on the available literature. In the context of BMI trajectories, we used K = 5 classes, as reported elsewhere (10,12). We built models separately for each gender, as BMI patterns of lifetime change differ between men and women (21). To determine the initial working model structure, we followed the rationale of Verbeke and Molenberghs (22) and examined the shape of standardised residual plots for each of the five classes in a model with no random effects. If the residual profile could be approximated by a flat line, a straight line or a curve, then a random intercept, slope or quadratic term, respectively, was considered. Preliminary plots suggested a preference for a quadratic random-effects model (Figure S1).
Step 2: We refined the preliminary working model from step 1 to determine the optimal number of classes, testing K = 1 to 7. The number of classes chosen was based on the lowest Bayesian Information Criterion (BIC).
Step 3: We further refined the model, using the favoured K derived in step 2, testing for the optimal model structure. We tested seven models (detailed in Table S2), ranging from a simple fixed-effects model (model A), through a rudimentary method that allows the residual variances to vary between classes (model B), to a suite of five random-effects models with different variance structures (models C-G).
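The class-selection rule in step 2 can be sketched as follows, assuming that each candidate model reports a maximised log-likelihood, a parameter count and a convergence flag. The helper names and the example values are hypothetical; the choice of sample size in the BIC penalty (number of individuals versus number of observations) varies between software packages and should be checked against the package used.

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 log L + p log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def select_k(fits, n_obs):
    """Pick the number of classes with the lowest BIC among converged fits.

    fits: dict mapping K -> (loglik, n_params, converged).
    """
    scores = {
        k: bic(loglik, n_params, n_obs)
        for k, (loglik, n_params, converged) in fits.items()
        if converged
    }
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical fits: the 3-class model did not converge and is skipped.
example_fits = {1: (-1000.0, 10, True), 2: (-950.0, 14, True), 3: (-940.0, 18, False)}
best_k, bic_by_k = select_k(example_fits, n_obs=500)
```

Excluding non-converged fits before comparison mirrors the situation described in the Results, where some class models failed to converge.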
Step 4: We then performed a number of model adequacy assessments. First, for each participant, we calculated the posterior probability of being assigned to each trajectory class, and assigned the individual to the class with the highest probability. An average of these maximum posterior probabilities of assignment (APPA) above 70%, in all classes, is regarded as acceptable (6). We further assessed model adequacy using the odds of correct classification (OCC), mismatch scores and entropy, $E_k$ (detailed in Table S3). These diagnostic tools assist in model selection (6,23).
Step 5: We used three graphical presentation approaches. The conventional approach is to plot mean trajectories with time for each class. Alternatives include mean trajectory plots with 95% predictive intervals for each class, which display the predicted random variation within each class, or individual-level 'spaghetti plots'.
Step 6: We used additional discrimination tools: the degree of separation and Elsensohn's envelope of residuals. To check structure assumptions in fixed-effects latent class models, Elsensohn et al. suggest plotting the local standard deviations of the residuals; differing interval widths suggest that across-class variability may not be fully accounted for.
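The first two adequacy measures in step 4 can be illustrated with a short sketch. It assumes a posterior probability matrix with one row per individual and one column per class, and uses the standard definition of the odds of correct classification, $\mathrm{OCC}_k = \frac{\mathrm{APPA}_k / (1 - \mathrm{APPA}_k)}{\hat{\pi}_k / (1 - \hat{\pi}_k)}$. The function name is hypothetical.

```python
import numpy as np

def appa_and_occ(post, pi_hat):
    """Average posterior probability of assignment and odds of correct
    classification, per class.

    post: (N, K) posterior probabilities; pi_hat: (K,) estimated proportions.
    A class with no assigned individuals yields NaN for both measures.
    """
    post = np.asarray(post, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    assigned = post.argmax(axis=1)            # modal class assignment
    n_classes = post.shape[1]
    appa = np.array([post[assigned == k, k].mean() for k in range(n_classes)])
    occ = (appa / (1.0 - appa)) / (pi_hat / (1.0 - pi_hat))
    return appa, occ
```

Each class's APPA can then be compared against the 70% threshold described above.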
Step 7: We assessed clinical characterisation and plausibility using four approaches: i) assessing the clinical meaningfulness of the trajectory patterns, aiming to include classes capturing at least 1% of the population; ii) assessing the clinical plausibility of the trajectory classes; iii) tabulating characteristics by latent classes versus conventional categorisations; and iv) assessing concordance of class membership with conventional BMI category membership using the kappa statistic (as LCTM is an unsupervised learning approach, we computed κ for all possible combinations and selected the optimal κ).
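Because latent class labels are arbitrary, the concordance check in step 7 iv) requires trying each possible matching of class labels to categories. The sketch below shows one way to do this, assuming for simplicity that the number of latent classes equals the number of conventional categories and that both are coded 0, ..., K-1; the function names are hypothetical.

```python
from itertools import permutations

import numpy as np

def cohen_kappa(a, b, n_labels):
    """Unweighted Cohen's kappa for two label vectors coded 0..n_labels-1."""
    table = np.zeros((n_labels, n_labels))
    for x, y in zip(a, b):
        table[x, y] += 1
    table /= table.sum()
    p_obs = np.trace(table)                                   # observed agreement
    p_exp = (table.sum(axis=1) * table.sum(axis=0)).sum()     # chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

def best_kappa(classes, categories, n_labels):
    """Try every relabelling of the latent classes and keep the highest kappa."""
    best_value, best_perm = -np.inf, None
    for perm in permutations(range(n_labels)):
        relabelled = [perm[c] for c in classes]
        value = cohen_kappa(relabelled, categories, n_labels)
        if value > best_value:
            best_value, best_perm = value, perm
    return best_value, best_perm
```

With five classes there are only 120 relabellings, so an exhaustive search is inexpensive.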
Step 8: We conducted sensitivity analyses, in this example including individuals with at least two, and with at least three, BMI values, as LCTMs are flexible enough to deal with different observation times between participants.

Patient and Public Involvement
No patients or members of the public were involved in this manuscript.

Statistical algorithms
All R and SAS code used to implement these tools is available from the authors and can be downloaded from www.github.com/hlennon/LCTMtools.

Number of classes
From the preliminary working model of a quadratic random-effects model, model F (proportional covariance structure), we derived BICs for up to seven classes: three of the class models failed to converge in men and women.  (Table S2).

Graphical presentation
We plotted the mean trajectories for models A, B, C, D, F and G in men and women (Figure 1), illustrating the increased complexity from model A to G. As alternatives, we plotted separately mean trajectories with 95% predictive intervals for each class in model F (Figure S2), which displays the predicted random variation within each of the classes with time, noting that variation was greater with the more 'complex' classes (classes IV and V compared with classes I, II and III). Spaghetti plots of individual-level data illustrated that the timing and size of BMI changes characterise the classes: for example, sharp increases in BMI occurred in early adulthood in Class III but later in adulthood in Class IV (Figure S3).

Additional tools of suitability of fit
The $DoS_K$ values ranged from 0.10 to 0.36 and from 0 to 0.34 in men and women, respectively (Table 3). The covariances were high and positive; therefore, models with non-parallel mean trajectories lead to higher separation.

Clinical assessment
Having established the favoured model, model F with five classes in both genders, we assigned descriptive labels to each respective class as follows (Table 4): stable normal weight; normal weight to overweight; normal weight to obese; overweight to obese; and rapid early obesity. We noted that the proportion in the rapid early obesity (Class V) was less than 1% in men. However, overall, the proportion for Class V for men and women combined was nearly 1%. Thus, we retained this class as we judged it to be clinically meaningful as follows.
In both genders, there were rapid increases in obesity from early to middle adulthood, then apparent severe weight reductions. We rationalised that this was clinically plausible, as it could be explained either by intentional (e.g. bariatric surgery) or non-intentional weight loss (e.g. reverse causality from development of disease).
Finally, we noted very poor concordance between the favoured model and conventional BMI categorisation in men (κ: 0.18) and women (κ: 0.52) (Table 5).

Sensitivity analyses
We tested the favoured model using a larger sample of individuals with at least three measures, and found no material differences between these models, in men and women, and the main model (Figure S4).

We propose an eight-step framework for the construction and selection of models derived from latent class trajectory modelling. We evaluated a range of model structures, from fixed-effect models to a set of random-effects models, favouring the latter in this case study, as they include different variance structures and are more likely to reflect the natural history of changes with time in BMI distributions in different sub-populations. We showed that different model structures resulted in different classes with contrasting clinical phenotypes. We propose pre-specified criteria for model selection and suggest that the reporting of a 'core' model will facilitate generalisability of results in future studies.

Context of other literature
To the best of our knowledge, this is the first study to systematically address structure-related assumptions in LCTMs and their potential impact on clinically relevant endpoints, in this example, BMI trajectories. Anecdotally, there is a justifiable criticism regarding the use of LCTM models and an uncertainty about how class memberships are derived: a 'black box' effect. The framework proposed here encourages the opposite: a transparent stepwise approach to class and model structure selection. To enhance this process, we have, for example, 'borrowed' tools developed to quantify uncertainty, such as entropy measures, and applied them to assist assessment of model adequacy. A further modification of discrimination measurement with variance estimation has been described by Shah and colleagues (27), and might have importance for class assignment where 'yes/no' treatment decisions are required.

Conflict of Interest
R; mmlcr; mmlcr

C. Random intercept. Allows individuals to vary in initial weight, but each class member is assumed to follow the same shape and magnitude of the mean trajectory.

SAS; traj; PROC TRAJ

D. Random slope. Allows individuals to vary in initial weight and in the slope of the mean trajectory, but with the same curvature as the mean trajectory.

SAS; traj; PROC TRAJ

E. Random quadratic, with a common variance structure across classes. Gives the additional freedom of allowing individuals to vary within classes by initial weight, shape and magnitude; however, each class is assumed to have the same amount of variability.

R; lcmm; hlme/lcmm

F. Random quadratic, with a proportionality constraint to allow variance structures to vary across classes. Increases the flexibility of model E, as variance structures are allowed to differ up to a multiplicative factor, so that some classes can have larger or smaller within-class variances. This model can be thought of as a more parsimonious version of model G (reducing the number of variance-covariance parameters to be estimated from 6×K to 6+(K-1)).

R; lcmm; hlme/lcmm

G. Random quadratic, with a class-specific (unstructured) variance structure. The most flexible model, in which each class has its own separate random quadratic variance structure to describe its own within-class variability. Statistically, this permits the variance and covariance of the intercept, slope and quadratic term to vary freely across all classes.

SAS; traj; PROC TRAJ (1)

(1) The SAS traj package has been converted for Stata users as the traj command in Stata (College Station, TX, USA).
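The parameter counts quoted for the random quadratic structures follow from the fact that a symmetric 3×3 variance-covariance matrix has $3(3+1)/2 = 6$ unique elements. As a worked illustration for the five-class case used in this study ($K = 5$):

$$
\begin{aligned}
\text{Model E (common):} \quad & 6 \\
\text{Model F (proportional):} \quad & 6 + (K - 1) = 6 + 4 = 10 \\
\text{Model G (class-specific):} \quad & 6 \times K = 6 \times 5 = 30
\end{aligned}
$$

Model F therefore adds only one multiplicative factor per additional class, whereas model G adds a full set of six variance-covariance parameters per class.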


Elsensohn's envelope of residuals
To check the model assumptions in fixed-effect latent class models, Elsensohn et al. (6) suggest plotting the local standard deviations of the residuals to check the appropriateness of the model, with the assumption that the residuals are homogeneous over time. To check the appropriateness of each of our model assumptions, we extended their method to include random effects in the models. We compute the local standard deviations of the residuals using the following steps:
1) Compute the observed residuals $r_{itk}$ for each subject $i$ at time $t$, given the individual is in class $k$ (Equation 1):
$$r_{itk} = y_{itk} - \hat{y}_{itk}$$
where $y_{itk}$ is the observed value for individual $i$ at time $t$ in class $k$ and $\hat{y}_{itk}$ is the fitted value of BMI from our fitted model, here the random-effect model.
2) Compute the class- and time-specific weighted local variance of the residuals, $\mathrm{Var}(r_{itk})$, with weights being $p_{ik}$, the posterior probabilities of individual $i$ belonging to group $k$.
3) Compute the upper and lower boundaries for the local standard deviations of the residuals (Equation 2):
$$\mathrm{Bound}_{tk} = \mu_{tk} \pm \sqrt{\mathrm{Var}(r_{itk})}$$
where $\mu_{tk}$ is the mean value of class $k$ at time $t$.
4) Plot the boundaries $\mathrm{Bound}_{tk}$ onto the mean trajectory plots.
The shape of the local standard deviation of the residuals indicates the appropriateness of the model assumptions: non-parallel boundaries indicate heteroscedasticity of residuals, suggesting poor model fit, and differing interval widths suggest that across-group variability may not be fully accounted for. We believe this complements the other metrics well. If the boundaries suggest a poor fit, a more complex random-effect structure, for example a higher-order polynomial, may be more suitable.
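The four steps above can be sketched numerically as follows. This is an illustration under stated assumptions: the weighted local variance is taken here as the posterior-weighted variance of the residuals around their weighted mean, which is one reasonable reading of step 2 and may differ in detail from the authors' implementation. The function name and array layout are hypothetical.

```python
import numpy as np

def residual_envelope(y, fitted, post, class_means):
    """Lower and upper envelope boundaries per class and time.

    y:           (N, T) observed values
    fitted:      (N, T, K) fitted values for each individual under each class
    post:        (N, K) posterior class-membership probabilities (weights)
    class_means: (K, T) mean trajectory of each class
    """
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    post = np.asarray(post, dtype=float)
    class_means = np.asarray(class_means, dtype=float)

    resid = y[:, :, None] - fitted                  # step 1: r_itk, shape (N, T, K)
    w = post[:, None, :]                            # weights p_ik, shape (N, 1, K)
    w_sum = w.sum(axis=0)                           # shape (1, K)
    w_mean = (w * resid).sum(axis=0) / w_sum        # weighted mean residual (T, K)
    w_var = (w * (resid - w_mean) ** 2).sum(axis=0) / w_sum   # step 2, (T, K)
    sd = np.sqrt(w_var).T                           # local standard deviation (K, T)
    return class_means - sd, class_means + sd       # step 3: boundaries
```

The two returned arrays can then be drawn over the mean trajectory plot for each class (step 4); boundaries that fan out or narrow over time point to heteroscedastic residuals.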
