Objectives Diet and nutrition might play an important role in the aetiology of metabolic syndrome (MetS). Most studies that examine the effects of food intake on MetS have used conventional statistical analyses which usually investigate only a limited number of food items and are subject to sparse data bias. This study was undertaken with the goal of investigating the concurrent effect of numerous food items and related nutrients on the incidence of MetS using Bayesian multilevel modelling which can control for sparse data bias.
Design Prospective cohort study.
Setting This prospective study was a subcohort of the Tehran Lipid and Glucose Study. We analysed dietary intake as well as pertinent covariates for cohort members in the fourth (2008–2011) and fifth (2011–2014) follow-up examinations. We fitted a Bayesian multilevel model and compared the results with two logistic regression models: (1) a full model that included all variables and (2) a reduced model obtained through backward selection of dietary variables.
Participants 3616 healthy Iranian adults, aged ≥20 years.
Primary and secondary outcome measures Incident cases of MetS.
Results The Bayesian multilevel approach produced results that were more precise and biologically plausible compared with conventional logistic regression models. The OR and 95% confidence limits for the effects of the four foods comparing the Bayesian multilevel with the full conventional model were as follows: (1) noodle soup (1.20 (0.67 to 2.14) vs 1.91 (0.65 to 5.64)), (2) beans (0.96 (0.5 to 1.85) vs 0.55 (0.03 to 11.41)), (3) turnip (1.23 (0.68 to 2.23) vs 2.48 (0.82 to 7.52)) and (4) eggplant (1.01 (0.51 to 2.00) vs 109 396 (0.152×10⁻⁶ to 768×10¹²)). For most food items, the Bayesian multilevel analysis gave narrower confidence limits than both logistic regression models, and hence provided the highest precision.
Conclusions This study demonstrates that conventional regression methods do not perform well and might even be biased when assessing highly correlated exposures such as food items in dietary epidemiological studies. Despite the complexity of the Bayesian multilevel models and their inherent assumptions, this approach outperforms conventional statistical models in studies that examine multiple nutritional exposures that are highly correlated.
- metabolic syndrome (MetS)
- Tehran Lipid and Glucose Study (TLGS)
- generalized linear mixed model (GLIMMIX)
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
A prospective cohort study using three statistical models.
A Bayesian multilevel model was used to control for sparse data bias present in many nutritional studies that use non-Bayesian analyses.
Generation of precise effect estimates for all comparisons.
Food frequency questionnaires used in this study may be subject to measurement bias.
Metabolic syndrome (MetS) is the clustering of at least three of the five following medical conditions: central/abdominal obesity, hypertension, elevated blood sugar, elevated triglyceride levels and reduced high density lipoprotein (HDL) levels.1 MetS is associated with the risk of developing cardiovascular disease and diabetes.1 According to the WHO, approximately 20%–25% of the world’s adult population is affected by MetS.1 MetS is considered a multifactorial disease in which nutritional exposures and diet are major contributing factors. According to nutritional studies, a number of foods have been recommended for preventing MetS. These foods include legumes, whole grains, fruits, vegetables, nuts, fish, low-fat dairy products and moderate consumption of alcohol. Moreover, other dietary patterns and approaches to slow the incidence of hypertension, including a vegetarian diet, have been proposed.2 Thus far, the effects of different foods on MetS have been investigated in many epidemiological studies, but only with conventional statistical analyses such as multiple logistic regression (LR).3–8 In most of these studies, only a limited number of food items have been investigated. This approach excludes potential benefits of foods that might exist through their nutrient contents. Conversely, a conventional model that includes only measured nutrients assumes that all food effects are transmitted through the measured nutrients, and therefore erroneously assumes that there are no unmeasured indirect nutrient effects or interactions among the modelled nutrients.9 10 Simultaneous effects of numerous food items and related nutrients cannot be studied with conventional statistical models due to the potential for collinearity (strong correlation between two nutrient variables that may lead to loss of precision of effect sizes).
Another limitation of including all food items in a conventional statistical model is that the estimates may suffer from sparse data bias.11–13 In such circumstances, Bayesian multilevel models can be used to deal with the aforementioned problems by providing substantial improvement in the precision of effect sizes.14 Therefore, our study objective was to examine the simultaneous effects of different food items and related nutrients on the incidence of MetS in healthy adults, using (1) a Bayesian multilevel model, (2) a conventional full LR model and (3) a reduced LR model obtained through backward selection.
Materials and methods
This prospective study is part of the Tehran Lipid and Glucose Study (TLGS).15 The TLGS began in 1998 and was conducted on 15 005 persons aged 3 to 63 years from Tehran’s District 13. We used the data collected during the fourth (2008–2011) and fifth (2011–2014) follow-up examinations. Data related to dietary intake and other covariates were collected from the fourth phase, and incident MetS cases were identified from the fifth phase, which was considered the follow-up phase (figure 1).
We selected 3616 adults aged ≥20 years who were not affected by MetS at the fourth follow-up examination (2008) and who had dietary information (figure 1). Among this cohort, 590 cases of MetS met our inclusion criteria.
Subjects who were eligible for the study included adults aged ≥20 years who had been followed from the fourth to the fifth phase and who met the following criteria: no history of chronic diseases (diabetes, stroke, thyroid problems and cancer); no specific dietary regimen (such as a weight loss diet or the intake of fewer than 800 kcal or greater than 4000 kcal per day) and no previous diagnosis of MetS.
Measurement of outcome
MetS was defined according to the recently published consensus guidelines16 as having at least three of the following criteria: (1) abdominal obesity (waist circumference >90 cm in both genders, according to the third National Survey of Risk Factors of Non-Communicable Diseases (2007);17 this cut-off was derived from the International Diabetes Federation criteria, has shown a sensitivity and specificity of 65% and a positive predictive value of 74% for the diagnosis of MetS, and is based on data weighted for age, gender and residential status); (2) low serum HDL (levels lower than 40 mg/dL in men and 50 mg/dL in women or the consumption of HDL-elevating drugs); (3) hypertension (a systolic BP ≥130 mm Hg or a diastolic BP ≥85 mm Hg or the consumption of antihypertensive drugs); (4) hyperglycaemia (a fasting blood glucose ≥100 mg/dL or the consumption of hypoglycaemic drugs) and (5) hypertriglyceridaemia (a serum triglyceride level ≥150 mg/dL or the consumption of triglyceride-lowering drugs).
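The five criteria above can be expressed as a simple counting rule. The following is a minimal sketch of that rule, not the study's actual code; the `Exam` record and its field names are hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    """Hypothetical record of one participant's examination."""
    waist_cm: float
    sex: str  # "M" or "F"
    hdl_mg_dl: float
    systolic_mm_hg: float
    diastolic_mm_hg: float
    fasting_glucose_mg_dl: float
    triglycerides_mg_dl: float
    on_hdl_drug: bool = False
    on_bp_drug: bool = False
    on_glucose_drug: bool = False
    on_tg_drug: bool = False


def mets_components(e):
    """Return which of the five criteria are met, using the cut-offs in the text."""
    hdl_cutoff = 40 if e.sex == "M" else 50
    return {
        "abdominal_obesity": e.waist_cm > 90,
        "low_hdl": e.hdl_mg_dl < hdl_cutoff or e.on_hdl_drug,
        "hypertension": (e.systolic_mm_hg >= 130
                         or e.diastolic_mm_hg >= 85
                         or e.on_bp_drug),
        "hyperglycaemia": e.fasting_glucose_mg_dl >= 100 or e.on_glucose_drug,
        "hypertriglyceridaemia": e.triglycerides_mg_dl >= 150 or e.on_tg_drug,
    }


def has_mets(e):
    """MetS is present when at least three of the five criteria are met."""
    return sum(mets_components(e).values()) >= 3


# Illustrative values only: three criteria met (waist, HDL, glucose).
example = Exam(95, "F", 45, 120, 80, 105, 140)
print(has_mets(example))
```

Treating drug use as satisfying the corresponding criterion mirrors the wording of the definition, in which each laboratory or blood pressure threshold is paired with the relevant medication.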
Measurement of exposure
Nutritional data on the participants’ dietary intake were collected using a semiquantitative food frequency questionnaire (FFQ), which consists of 147 food items. Several nutritionists who had been trained in this field completed the questionnaires through face-to-face interviews. During the interview, the average size of each of the FFQ food items (which is equal to one food serving) was described to each participant, who was subsequently asked about the number of times each item was consumed in the previous year. The validity and reliability of the FFQ have been assessed through several studies in Iran and have been found to be acceptable.18 19 The consumption frequency of each food item in the previous year was assessed on a daily, weekly, monthly or yearly basis. Participants were asked to use food scales to report grams per day of consumption for each food item. The amount of intake of energy and nutrients was determined using a food composition table (see online supplementary appendix 1).
Ascertainment of measured variables
Other measured covariates included: weight, height, age, gender, marital status, history of hospitalisation in the previous 3 months, history of cancer, education (primary, intermediate, high school and high school graduate, academic education) and tobacco use (never smoked, previously smoked, currently smoking). Data were collected using a general information questionnaire administered by a licensed nutritionist. Finally, we used the Strengthening the Reporting of Observational Studies in Epidemiology checklist to ensure that all methodological aspects of the study were appropriately reported and accounted for.
We estimated the effects of food items and nutrients on MetS using both Bayesian multilevel and conventional analyses. The generalised linear mixed model procedure (PROC GLIMMIX) in SAS (V.9.4) was used for the Bayesian multilevel analysis. LR with two types of variable selection (stepwise backward selection and selection of all variables) was also applied, and the results were compared with the Bayesian multilevel analysis.
In the Bayesian multilevel approach (first analysis), we investigated the concurrent effects on MetS of 95 food items (listed in online supplementary appendix 1) and 12 nutrients (carbohydrates, protein, total fat, monounsaturated fatty acids, carotenoids, calcium, folate, magnesium, zinc, fibre, glucose and fructose), adjusted for nine covariates (age, gender, cancer history, hospitalisation status, educational status, body mass index, marital status, smoking history and calories).
In the first conventional analysis (second analysis, full model), 95 food items and nine covariates were forced into the model. Due to the high correlation between food items and nutrients resulting in the non-convergence of maximum likelihood estimates, the effects of nutrients were not investigated in the conventional analysis.
In the second conventional analysis (third analysis, reduced model), which used stepwise backward selection, the alpha level (level of statistical significance) for selection of food items was set at 0.2, and all nine confounders were forced into the model. Seventy-seven food items were removed at this stage, leaving only 18 food items.
In all three models, the following six food items were removed from the models due to high degree of collinearity between variables (Pearson correlation ≥0.4), retaining the food with a statistically stronger effect (specified in parentheses) in the final analysis: jam (sugar), plum (peach), lemon juice (lemon), apple juice (apple), orange juice (orange) and cooked vegetables (cooked carrots). Moreover, in all the models, 46 food items (data available on request) were excluded from our analyses because it seemed unlikely that they would have had considerable dietary effects on MetS. Thus 95 (147–(6+46)) food items were retained in the analysis.
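A collinearity screen of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the intake values are invented, the helper names `pearson` and `collinear_pairs` are hypothetical, and the step of choosing which food of a pair to retain (the one with the statistically stronger effect) is not implemented here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(intakes, threshold=0.4):
    """List pairs of foods whose Pearson correlation is at or above the threshold."""
    return [(a, b) for a, b in combinations(intakes, 2)
            if pearson(intakes[a], intakes[b]) >= threshold]


# Invented servings for five hypothetical participants.
intakes = {
    "lemon": [1, 2, 3, 4, 5],
    "lemon juice": [1, 2, 3, 5, 4],
    "rice": [3, 1, 4, 1, 5],
}
print(collinear_pairs(intakes))

# Bookkeeping from the text: 147 FFQ items minus 6 collinear and 46 excluded items.
n_retained = 147 - (6 + 46)
print(n_retained)
```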
To interpret the effects of foods on MetS more easily, each food item variable was transformed from ‘grams’ to specified servings using valid references based on daily servings.20
Data analysis was done with Stata V.11 (Stata) for the conventional analysis and SAS 9.2 for the Bayesian multilevel approach. The parameters of the LR and Bayesian multilevel models were estimated using maximum likelihood and shrinkage (penalised likelihood) methods, respectively. To compare the precision of estimates, we calculated the difference in confidence limits for ORs of foods in the logarithm scale (upper log-OR minus lower log-OR).
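The precision measure described above is the width of the confidence interval on the log-OR scale. A minimal sketch, using the noodle soup limits reported in the abstract as input (the function name `log_ci_width` is an assumption for illustration):

```python
from math import log


def log_ci_width(lower, upper):
    """Width of a confidence interval on the log-OR scale (upper log-OR minus lower log-OR)."""
    return log(upper) - log(lower)


# Noodle soup: Bayesian multilevel model vs full LR model.
width_bayes = log_ci_width(0.67, 2.14)
width_full = log_ci_width(0.65, 5.64)
print(round(width_bayes, 2), round(width_full, 2))
```

A smaller width indicates a more precise estimate, so comparing widths on the log scale avoids the asymmetry of ORs around 1.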
Structure of the Bayesian multilevel model
We can write the first-stage model (1) as:
$$\operatorname{logit}(p) = \alpha + X\beta + W\gamma$$
In this model, p is the risk of MetS, α is the intercept, X is the matrix of food item information, W is the matrix of other potential confounders with coefficient vector γ, and β (β1,…, β95) is the vector of LR coefficients corresponding to the 95 food items. The first-stage model is also the LR model for the conventional analysis.
Second stage (2):
$$\beta = Z\pi + \delta$$
where π is the vector of coefficients of second-stage covariates for nutrients that may contribute to dietary effects on MetS. These second-stage covariates (Z) include the nutrients carbohydrates, protein, total fat, monounsaturated fatty acids, carotenoids, calcium, folate, magnesium, zinc, fibre, glucose and fructose. The quantity δj is the residual effect of food item j, which is assumed to be an independent normal random variable with zero mean and SD τj. Following Witte et al,14 we specified a fixed value of τ to improve estimation convergence. Based on similar studies,14 21 we set the SD τj equal to 0.35 for all food items. This corresponds to having 95% certainty that the OR for the residual effects of foods (per serving of each food) lies between 0.5 and 2.0. The second stage can be interpreted as the prior distribution for the β coefficients in the Bayesian multilevel method. The second-stage model shrinks the ordinary estimates for food items towards each other when they have similar levels of nutrients.
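The correspondence between the prior SD and the stated OR range can be checked directly: under a normal prior with mean zero and SD τ, the central 95% interval for the residual OR is exp(±1.96τ). A short sketch (variable names are illustrative):

```python
from math import exp

tau = 0.35   # prior SD of the residual log-OR, as specified in the text
z = 1.96     # normal quantile for a central 95% interval

prior_lower = exp(-z * tau)
prior_upper = exp(z * tau)
prior_variance = tau ** 2

print(round(prior_lower, 2), round(prior_upper, 2), round(prior_variance, 4))
```

The limits come out at roughly 0.5 and 2.0, and the variance is 0.1225, the value used for the random coefficients in the combined model.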
Models 1 and 2 can be combined into a ‘mixed-effects’ model:
$$\operatorname{logit}(p) = \alpha + XZ\pi + X\delta + W\gamma$$
In this model, π and γ are treated as vectors of fixed coefficients, and δ is treated as a vector of random coefficients with mean zero and variance=0.1225. Hence, one interpretation is that the multilevel model includes XZ interactions, which allow the effects of X on MetS to be similar when there is a similar nutrient level in the food items.
For the estimation of the fixed and random effects in the Bayesian multilevel model, the mixed-model equations solution matrix (MMEQSOL) from the SAS GLIMMIX output was used. MMEQSOL contains fixed, random and covariate estimates and their respective estimated covariance matrices. In our study, the MMEQSOL was a 117×117 (95 foods+12 nutrients+9 covariates+1 intercept) matrix (online supplementary appendix 2).
Patient and public involvement
No patients were involved in the development and design of this prospective study.
The mean (SD) age of participants and median follow-up time were 40.6 (12.6) years and 24.6 months, respectively. The total incidence rate of MetS was 82.2 (95% CI: 75.8 to 89.1) per 10 000 person-years. The incidence rate of MetS was higher in men than in women (125.6 vs 65.3 per 10 000 person-years, p<0.001). In both genders, those affected by MetS were older (p<0.001). Also, the percentages of married individuals and those who had a previous history of a heart attack were higher among those with MetS than among those without MetS (p<0.001) (table 1).
The adjusted ORs and corresponding 95% confidence limits (95% CI) for food intakes and other covariates using the full LR model and the LR model with stepwise backward selection are reported in table 2. The results of the conventional analysis have been described in detail elsewhere.22
Full model (LR with all food variables in the model)
Based on this model, two food items were associated with MetS: bananas (OR=1.38, 95% CI: 1.05 to 1.83) and grapes (OR=1.14, 95% CI: 1.01 to 1.29). Two other food items that were weakly associated with MetS were beef (OR=1.71, 95% CI: 0.95 to 3.08) and chicken (OR=1.24, 95% CI: 0.99 to 1.56). On the other hand, there was weak evidence of an inverse association of lamb meat (OR=0.44, 95% CI: 0.17 to 1.12) with MetS.
LR using backward selection method
In this analysis, only 18 foods remained in the final model. Based on this reduced model, grapes (OR: 1.11, 95% CI: 1.01 to 1.29; p=0.03) and bananas (OR=1.37, 95% CI: 1.05 to 1.78; p=0.02) were associated with MetS risk. Also, there was weak evidence of the increase in MetS risk for the intake of rice (OR=1.11, 95% CI: 0.99 to 1.2; p=0.06), turnip (OR=2.41, 95% CI: 0.77 to 6.69; p=0.09) and seeds (OR=1.32, 95% CI: 0.99 to 1.77, p=0.053). On the other hand, lamb meat was inversely associated with MetS risk (OR: 0.40, 95% CI: 0.16 to 0.99; p=0.05).
Multi-level Bayesian analysis via the GLIMMIX
Based on this model, grapes (OR=1.14, 95% CI: 1.01 to 1.27; p=0.03) and bananas (OR=1.32, 95% CI: 1.01 to 1.74; p=0.05) were positively associated with MetS. There was also evidence that fructose was positively associated with the MetS risk (OR=1.84, 95% CI: 0.97 to 3.51; p=0.06) (table 2).
On comparing the three models, 15 (83.3%) of the common OR estimates were the smallest (toward the null) in the Bayesian multilevel model, which is not surprising given that the mean of the residual effects of foods (δj) was prespecified to zero, so the OR estimates underwent shrinkage toward the null. In the remaining three food items (16.7%), the OR estimates were similar between models (table 2).
Although diet may play a role in the aetiology of MetS, most previous studies have only looked at a limited number of food items, mainly because of limitations of conventional modelling approaches.8 9 On the other hand, multilevel models and shrinkage estimators are known to give lower prediction error and improve the precision and accuracy of the effect sizes.14 This study used novel Bayesian multilevel models to study the simultaneous effects of different food items and related nutrients on the incidence of MetS and compared them with conventional models. Bananas and grapes were the only items that were associated with MetS in all three models. However, on stratifying by history of diabetes, the effects were weaker in the non-diabetes group. Furthermore, because of the small sample size of the diabetic group (37 new cases of MetS among the 328 participants with diabetes, about 0.11 cases per participant), model fitting in this group failed.
The histogram of regression coefficients of dietary items indicates that the penalised likelihood estimates (from GLIMMIX) are much less dispersed than the maximum likelihood estimates in the conventional analyses (figure 2). Also, GLIMMIX has better goodness-of-fit properties than the conventional models, as the deviance information criteria for the backward selection method, full model and Bayesian multilevel model were 29057.6, 27679.9 and 18122.1, respectively.
The largest OR estimates were observed in the full model signalling sparse data bias. The OR estimates in the Bayesian multilevel model were more similar to the LR model with backward selection rather than to the full LR model. For 10 (55.6%) of 18 common ORs, the Bayesian multilevel model had the narrowest confidence limits and the highest precision. For seven (38.9%) of ORs, the backward model had the best precision whereas there was similar precision for only one (5.6%) of the ORs. Although in the backward method only 18 variables remained in the final model, the Bayesian multilevel model outperformed the backward method in terms of precision of the OR estimates.
In the 77 (95 – 18) remaining food items that were common to the Bayesian multilevel model and the full model, Bayesian multilevel modelling exhibited better precision (60 (78%) vs 15 (20%)). In two (3%) of the ORs, both models exhibited similar precision.
In the Bayesian multilevel model, the confidence limits for three extreme OR estimates in the full model were more precise and biologically plausible. Specifically, these limits were as follows: noodle soup ((0.67–2.14) in the Bayesian multilevel model vs (0.65–5.64) in the full model), beans ((0.5–1.85) vs (0.03–11.41)) and turnip ((0.68–2.23) vs (0.82–7.52)). In the full model, the estimate of the eggplant OR was strongly affected by sparse data bias11 12 (OR=109 396, 95% CI=0.152×10⁻⁶ to 768×10¹²), but this implausible and imprecise estimate was balanced in the Bayesian multilevel model (OR=1.01, 95% CI=0.51 to 2.00). This balancing of extreme estimates has been shown in previous studies.14 21
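The way a prior pulls in an extreme estimate such as the eggplant OR can be illustrated with an information-weighted (inverse-variance) average of the maximum likelihood estimate and the prior mean. This is a rough sketch under simplifying assumptions and not the authors' GLIMMIX fit: it uses a normal approximation, takes the prior mean as zero (ignoring the second-stage nutrient term), uses the prior SD of 0.35 from the Methods, and recovers the standard error from the width of the reported confidence limits. The function name `shrink_log_or` is hypothetical.

```python
from math import exp, log


def shrink_log_or(or_ml, ci_lower, ci_upper, prior_mean=0.0, tau=0.35, z=1.96):
    """Approximate posterior OR and 95% limits by inverse-variance weighting."""
    beta_ml = log(or_ml)
    # Standard error recovered from the CI width on the log scale (an approximation).
    se_ml = (log(ci_upper) - log(ci_lower)) / (2 * z)
    weight_ml = 1 / se_ml ** 2
    weight_prior = 1 / tau ** 2
    post_var = 1 / (weight_ml + weight_prior)
    post_mean = post_var * (weight_ml * beta_ml + weight_prior * prior_mean)
    half_width = z * post_var ** 0.5
    return exp(post_mean), exp(post_mean - half_width), exp(post_mean + half_width)


# Eggplant estimate from the full LR model.
or_post, lower_post, upper_post = shrink_log_or(109396, 0.152e-6, 768e12)
print(round(or_post, 2), round(lower_post, 2), round(upper_post, 2))
```

Because the maximum likelihood estimate carries almost no information (its standard error on the log scale is about 12.7), the prior dominates, and the result lands close to the Bayesian multilevel estimate reported for eggplant.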
The most significant limitation of the stepwise backward selection method was the need to delete some variables from the model, as the model then assumes (with full certainty) that these variables have no effect on the outcome. As such, the final selected model does not take into account the uncertainty in the selection procedure. The backward selection method excluded 77 variables from the final model. This manner of variable selection led to downward bias in the p values and standard errors for the remaining variables in the model.23
Various studies24 25 have shown the protective effects of vegetables and fruits on MetS. These foods might exert their protective effects potentially through antioxidants, fibre, potassium and other phytochemicals, which reduce the concentration of C reactive protein.26 However, due to the low statistical power of this study, LR models (which usually require a minimum of 10 events per predictor variable) were deemed underpowered to detect a statistically significant difference for the following fruits and vegetables: kiwifruit, watermelon, apple, cherry, plum, tangerine, dates, nectarine, lemon, tomato, celery, raw onion, cooked cabbage, lettuce and potato.
We observed a weak association between fructose intake and MetS. Some studies27–29 have shown that the consumption of foods and beverages that are high in fructose facilitates dyslipidaemia (increased triglycerides and low density lipoproteins and decreased HDL). As previously mentioned,1 hyperlipidaemia is considered one of the components of MetS, hence this finding is consistent with earlier studies.
Unlike our study, studies by Esmailzadeh et al30–32 have shown protective effects of whole grains on the incidence of MetS, although these studies only assessed a limited number of foods and their results might be subject to a number of biases.
One notable limitation of this study was the use of an FFQ to assess food intake. Several studies have shown that the FFQ has limitations in determining dietary patterns, since it encompasses a long list of foods consumed during the past year, which may increase the possibility of recall bias. Moreover, the FFQ underestimates the consumption of proteins and carbohydrates, allowing the possibility of measurement error.33–35 Our study had limited statistical power for some of the analyses. The general statistical rule of thumb for sample size calculations suggests that LR models require a minimum of 10 cases per covariate for optimal statistical power.36 As we estimated the effects of 104 variables (95 food intakes plus 9 confounders), we required 1040 cases to satisfy the criteria for adequate sample size. Unfortunately, we only had 590 new cases of MetS in this study. However, we partially made up for this limitation through the use of the Bayesian multilevel approach. Finally, as with many nutritional epidemiological studies, there might be other sources of bias including measurement error, model misspecification, unmeasured confounding and potential for time-varying confounding.37
In conclusion, Bayesian multilevel models present more precise and biologically plausible estimates of association than conventional frequentist models and are better able to control for sparse data bias. Despite the complexity of semi-Bayes models, this approach is highly recommended for nutritional studies that involve multiple, correlated and multilevel nutritional exposures.
We thank John Witte for his helpful comments on an earlier draft of this paper. We extend our gratitude to the authorities of the ‘Research Institute for Endocrine Sciences’ at Shahid Beheshti University of Medical Sciences for sharing their valuable data with us. Moreover, we are grateful to all the participants for giving us their valuable time.
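The events-per-variable rule of thumb can be worked through with the counts given in this study. A minimal sketch (the function name `required_events` is an assumption for illustration):

```python
def required_events(n_predictors, events_per_variable=10):
    """Number of outcome events needed under an events-per-variable rule of thumb."""
    return n_predictors * events_per_variable


n_predictors = 95 + 9      # food items plus confounders in the full model
observed_events = 590      # incident MetS cases

needed = required_events(n_predictors)
observed_epv = observed_events / n_predictors
print(needed, round(observed_epv, 1))
```

With 104 predictors, the rule calls for 1040 events, whereas the 590 observed cases give fewer than six events per variable, which is the setting in which sparse data bias becomes a concern.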
Patient consent for publication Obtained.
Contributors ZCh: conducted research. ZCh, NMo: provided essential reagents or provided essential materials. ZCh, SN, MAM, PM: performed the statistical analysis. ZCh, SN, MAM, ME, PM, LCMc: wrote the paper. ZCh, SN, MAM, PM, NMa, NMo, ME, LCMc: had primary responsibility for final content. ZCh: had responsibility for all parts of the manuscript.
Funding This article has been extracted from the first author’s PhD thesis (code: 912-112-8002) and was supported by Tehran University of Medical Sciences (TUMS).
Competing interests None declared.
Ethics approval Shahid Beheshti University of Medical Sciences (SUMS)/ Research Ethics Board.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Data available on request.