Cohort profile
Cohort profile: The Health, Food, Purchases and Lifestyle (SMIL) cohort – a Danish open cohort
  1. Kathrine Kold Sørensen (1),
  2. Mikkel Porsborg Andersen (1),
  3. Frederik Trier Møller (2),
  4. Caroline Eves (2),
  5. Thor Grønborg Junker (3),
  6. Bochra Zareini (1,4),
  7. Christian Torp-Pedersen (1,4)

  1. Department of Cardiology, Nordsjællands Hospital, Hillerod, Denmark
  2. Department of Infectious Disease Epidemiology and Prevention, Statens Serum Institut, Copenhagen, Denmark
  3. Department of Epidemiology Research, Statens Serum Institut, Copenhagen, Denmark
  4. Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark

Correspondence to Kathrine Kold Sørensen; kathrine.kold.soerensen{at}regionh.dk

Abstract

Purpose The Health, Food, Purchases and Lifestyle (SMIL) cohort is a prospective open Danish cohort that collects electronic consumer purchase data, which can be linked to Danish nationwide administrative health and social registries. This paper provides an overview of the cohort’s baseline characteristics and marginal differences in the monetary percentage spent on food groups by sex, age and hour of the day.

Participants As of 31 December 2022, the cohort included 11 214 users of a smartphone-based receipt collection application who consented to share their unique identification number for linkage to registries in Denmark. In 2022, the composition of the cohort was as follows: 62% were men while 24% were aged 45–55. The cohort had a median of 63 (IQR 26–116) unique shopping trips. The cohort included participants with a range of health statuses. Notably, 21% of participants had a history of cardiovascular disease and 8% had diabetes before donating receipts.

Findings to date The feasibility of translating consumer purchase data to operationalisable food groups and merging with registers has been demonstrated. We further demonstrated differences in marginal distributions which revealed disparities in the amount of money spent on various food groups by sex and age, as well as systematic variations by the hour of the day. For example, men under 30 spent 8.2% of their total reported expenditure on sugary drinks, while women under 30 spent 6.5%, men over 30 spent 4.3% and women over 30 spent 3.9%.

Future plans The SMIL cohort is characterised by its dynamic, continuously updated database, offering an opportunity to explore the relationship between diet and disease without the limitations of self-reported data. The cohort currently encompasses data from 2018 to 2022. We expect data collection to continue for many years and we are taking several initiatives to increase the size of the cohort.

  • NUTRITION & DIETETICS
  • EPIDEMIOLOGIC STUDIES
  • Observational Study

Data availability statement

Data may be obtained from a third party and are not publicly available. According to the Danish Data Protection Act and the General Data Protection Regulation, data that contains personally identifiable information cannot be made publicly available. The data used are solely available through the state organisation Statistics Denmark but can be accessed in collaboration with authorised Danish researchers. Requests for statistical code and inquiries regarding collaborations can be directed to the corresponding author who has full access.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

STRENGTHS AND LIMITATIONS OF THIS STUDY

  • Each member of the Health, Food, Purchases and Lifestyle (SMIL) cohort provides their unique identification number, which is given to all Danish residents, enabling linkage of their data to all Danish registries.

  • Consumer purchase data provide a novel longitudinal approach to the assessment of diet and have several advantages over traditional self-reported methods, including a low burden on the participant and an objective nature free of recall bias and deliberate misreporting.

  • Given the cohort’s size and age distribution, it is most relevant to investigate a wide range of prevalent lifestyle diseases, such as hypertension and diabetes.

  • Currently, it may mainly be relevant to investigate association, not causality, but as the cohort ages, it may be possible to investigate causal questions as well.

  • The cohort data, limited to supermarket purchases tracked through a smartphone-based application and representing household consumption, cannot assess specific dietary intake but is suitable for examining variations in dietary habits among groups or over time among individuals.

Introduction

In 2017, the Global Burden of Disease study attributed 11 million deaths to dietary risk factors, with high sodium intake, low intake of whole grains and low intake of fruits being the primary global culprits.1 Maintaining a healthy diet is crucial for promoting a long life without overweight/obesity and reducing the risk of non-communicable diseases such as type 2 diabetes mellitus, cardiovascular disease and certain cancers.2 Food-Frequency Questionnaires (FFQs) encompass an established and validated method that has been widely used for assessing dietary information.3 4 However, FFQs rely on self-reporting, which makes them subject to recall and social desirability bias. Further, it has been demonstrated that FFQs are prone to differential misclassification, which is especially concerning for valid inference.5 Consumer purchase data offer an alternative approach to traditional dietary assessment methods, providing several advantages, such as cost-efficiency, objectivity, longevity and minimal burden on both researchers and participants.6 7

Several studies have shown that large-scale consumer purchase data provide a diverse sample with good coverage and objectivity.8 Using consumer purchase data from loyalty cards, it has been demonstrated that nutrient diversity and caloric concentration are strong predictors of the prevalence of hypertension, diabetes and hypercholesterolaemia.9 Comparisons between consumer purchase data and FFQs have shown that consumer purchase data effectively estimated food consumption in adults, particularly in single-person households.10 A comparison of consumer purchase data with repeat 24-hour dietary recalls revealed that consumer purchase data can serve as a successful surrogate measure of individual nutrient intake, especially for energy obtained from fat.11 While there is reasonable agreement between purchased and consumed amounts for aggregated food groups, differences have been seen for some individual items.12 Accordingly, household food purchases can provide an unbiased and reasonably accurate estimate of overall diet quality but may be less valid for characterising the intake of specific nutrients.13 Consumer purchase data have been used for many years to address health-related inquiries. However, earlier data collection methods involved more researcher labour, such as recruiting study participants outside grocery stores14 or requesting the sharing of receipts via emails8 15 or as physical receipts, as well as participant involvement, such as manual registration.16 Furthermore, previous studies have been hampered by access to data from only one supermarket chain17 or, in combination with other challenges to data collection, have relied on self-reported demographic data.15 18 More current methods have automated data collection, greatly expanding its scope.

The Health, Food, Purchases and Lifestyle (SMIL) cohort is an open cohort of users of a smartphone-based receipt collection application who have agreed to share the entirety of their receipts as well as their unique identification number through a protected website. This cohort provides consumer purchase data collected prospectively and automatically from three of the five largest Danish retailers and allows for linkage on an individual level with the vast number of Danish registries through Statistics Denmark. This paper aims to outline the baseline characteristics of the SMIL cohort, describe the algorithm used to translate receipt data into operationalisable food groups, and discuss the strengths and limitations of this type of data, along with potential future applications, including demonstrating the monetary percentage spent on certain goods by demographic factors as well as the time of day.

Cohort description

Inclusion period

Data collection started in 2018 and the data have been updated up to and including 31 December 2022. The cohort data are updated annually.

Purchases

The consumer purchase data include the date and time of purchase, product price, amount of product, bar code number and product name. One key challenge in analysing consumer purchase data is the need to transition from product-specific names to broader food groups that can be effectively analysed and interpreted. Between 1 January 2018 and 31 December 2022, the raw data included 157 998 751 unique purchases with 615 915 unique product names. Analysis by product-specific names is currently constrained by the limited size of the cohort: numerous products are purchased so infrequently that zero-inflation (frequent zero-valued observations) challenges the robustness of statistical models. Therefore, analyses are currently performed at the broader food group level. This necessary shift, while mitigating zero-inflation problems, inherently reduces data granularity, resulting in some loss of detail and information. However, as the cohort expands in the future, we anticipate increasing the granularity of our analysis, thereby paving the way for more detailed hypotheses.

Notably, during March, May and June 2019, there were data delivery issues, resulting in the unavailability of data for these 3 months. Since then, there have been no further issues related to data delivery. We have also experienced issues with duplicate receipts, which arose from the batch deliveries of data necessitated by the design of the data collection solution. We identified and remedied these duplicates through an algorithm that cross-checked the manually calculated total order price against the reported total order price and then used this information in combination with counts of goods on the receipt concerned.

When originally received, the data included both edible and non-edible items. The categorisation of these items relies on the product names indicated on the receipts. These names are uniquely assigned by each grocer when the items are added to their inventory, resulting in a variety of naming conventions across different grocers.
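The duplicate-receipt check described in this section can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' implementation; the record layout (`shopper_id`, `timestamp`, `reported_total`, `items`, `price`) and the choice of fingerprint fields are assumptions made for this example. The sketch recomputes each receipt's total from its item lines, compares it with the reported total, and uses the item count as part of a fingerprint for recognising re-delivered receipts.

```python
def to_oere(amount):
    """Convert a DKK amount to whole øre so totals can be compared exactly."""
    return round(amount * 100)


def computed_total(receipt):
    """Total recomputed from the item lines (assumes 'price' is the line price)."""
    return to_oere(sum(item["price"] for item in receipt["items"]))


def fingerprint(receipt):
    """Hypothetical identity of a receipt: who, when, reported total, item count."""
    return (
        receipt["shopper_id"],
        receipt["timestamp"],
        to_oere(receipt["reported_total"]),
        len(receipt["items"]),
    )


def deduplicate(receipts):
    """Return (kept, flagged).

    Receipts whose item lines do not add up to the reported total are flagged
    for review; among consistent receipts, repeats of an already seen
    fingerprint are treated as re-deliveries and dropped.
    """
    seen = set()
    kept, flagged = [], []
    for receipt in receipts:
        if computed_total(receipt) != to_oere(receipt["reported_total"]):
            flagged.append(receipt)
            continue
        key = fingerprint(receipt)
        if key in seen:
            continue  # same receipt delivered again in a later batch
        seen.add(key)
        kept.append(receipt)
    return kept, flagged
```

Flagging inconsistent receipts rather than silently repairing them keeps the cleaning decisions visible, which matters because such decisions are difficult to repeat exactly.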

Product matching

The Food Institute at the Danish Technical University maintains a public food composition database called Frida.19 Within this database, there are 1190 identifying names that aim to include the majority of generic grocery types that are widely available in the Danish market, further detailed with 150 pieces of information, including energy, micronutrients and macronutrients, and basic food components in units per 100 g of edible food.19 The description and the code for the matching process, which links consumer purchase data to food groups in Frida, are available at https://github.com/ssi-dk/MyPurchaseCohort/tree/master. In brief, the process involves several steps, the first of which includes removing abbreviations and extracting relevant information such as weight and volume. The second step involves using a scoring algorithm to match each product name to an adapted version of Frida, which comprises columns of regular expressions related to brand, product type, taste and flavours, and fat content. We selected these specific variables/columns because the manual review indicated that the product names frequently include word combinations pertinent to these categories which can affect their predictive value based on the context.

The pipeline incorporates steps of manual checking for validation purposes which have been described more thoroughly elsewhere.20 The matching process is implemented by using R V.4.2.1.21
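To make the two matching steps concrete, the following is a minimal sketch in Python. The authors' pipeline is implemented in R and is available in the repository cited above; nothing here reproduces it. The abbreviation list, the three-row food table, the column weights and the function names are all hypothetical and chosen only to show the idea of cleaning a product name, extracting weight or volume, and scoring the name against columns of regular expressions.

```python
import re

# Hypothetical abbreviation list; real receipt abbreviations will differ.
ABBREVIATIONS = {"choc.": "chocolate", "org.": "organic"}

# Hypothetical miniature of an adapted food table: one regular expression
# per descriptive column, or None when the column does not apply.
FOOD_TABLE = [
    {"food": "whole milk", "product_type": r"\bmilk\b",
     "fat_content": r"\b3[.,]5\s*%|\bwhole\b", "flavour": None},
    {"food": "skimmed milk", "product_type": r"\bmilk\b",
     "fat_content": r"\b0[.,][1-5]\s*%|\bskimmed\b", "flavour": None},
    {"food": "chocolate milk", "product_type": r"\bmilk\b",
     "fat_content": None, "flavour": r"\b(choco|cocoa)\w*"},
]

# Assumed weights: a product-type hit counts more than a qualifier hit.
WEIGHTS = {"product_type": 2, "fat_content": 1, "flavour": 1}

QUANTITY = re.compile(r"(\d+(?:[.,]\d+)?)\s*(kg|g|ml|cl|l)\b")


def clean(name):
    """Step 1a: lower-case the name and expand known abbreviations."""
    text = name.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


def extract_quantity(name):
    """Step 1b: pull out the first weight or volume, e.g. (1.0, 'l')."""
    found = QUANTITY.search(name.lower())
    if found is None:
        return None
    return float(found.group(1).replace(",", ".")), found.group(2)


def match_product(name):
    """Step 2: score the cleaned name against every row; keep the best row."""
    cleaned = clean(name)
    best_food, best_score = None, 0
    for row in FOOD_TABLE:
        score = sum(
            weight
            for column, weight in WEIGHTS.items()
            if row[column] is not None and re.search(row[column], cleaned)
        )
        if score > best_score:
            best_food, best_score = row["food"], score
    return best_food, best_score
```

In this sketch a name with no hits returns `(None, 0)`, which would be a natural point to route the product to the manual checking steps.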

Possibilities of linkage

In Denmark, access to healthcare services is provided equally to all residents, covering primary as well as hospital care. All Danish residents are assigned a unique 10-digit personal identification (CPR) number, either at birth or immigration. This unique identifier allows for linkage to the nationwide administrative health and social registries.22 Supermarket receipts and CPR numbers of the SMIL cohort participants are collected through a protected project website, enabling the secure linkage to the Danish registries in Statistics Denmark, where the registry data are stored and organised. The national registers contain a wide range of information relevant to the investigation of health-related questions. Currently, the cohort has been enriched with information from the following six registers: (1) The Danish Civil Registration System, which contains information on date of birth, sex, emigration/immigration, family structure and vital status,23 (2) The Danish National Prescription Registry, which contains information on dosage, dates and anatomical therapeutic chemical (ATC) codes on all prescriptions dispensed from pharmacies since 1995,24 (3) The Danish National Patient Registry, which contains information on all outpatient contacts and hospital admissions since 1977, coded according to The International Classification of Disease,25 (4) The Income Statistics Registry, which contains income data on a yearly basis,26 (5) The Population Education Register, which contains information on the highest attained education,27 and (6) The Clinical Laboratory Information System Research Database, which contains information on individual-level biomarker results and date and time of sampling.22 This extensive range of data also allows for the enrichment of the cohort with information from the Danish registries on all household members, providing a more comprehensive understanding of the health status of the population exposed to the purchases.

Enrichment

Table 1 provides an overview of the characteristics of the cohort using counts and percentages. The age of the shopper was calculated for the particular calendar year. Education level was defined as the highest achieved education level in the year prior to the particular calendar year and classified into four groups based on the International Standard Classification of Education (ISCED).28 Group 1 included ISCED levels 0–2 (early childhood, primary and lower secondary education), group 2 included ISCED level 3 (general upper secondary education), group 3 included ISCED levels 5–6 (short-cycle tertiary, medium-length tertiary and bachelors-level education) and group 4 included ISCED levels 7–8 (masters or PhD level).28 Income was included as the 5-year mean equivalised income prior to the particular calendar year to ensure comparability between different households as it accounts for redistribution within the family.29 Diseases were identified through the Danish National Prescription Registry and the Danish National Patient Registry, and a full list of ATC and diagnosis codes used for definitions is available in online supplemental table S1. Prevalent diseases were defined as diseases recorded at any time point before the current calendar year. Medicine use was identified through the Danish National Prescription Registry and defined as any redemption of a prescription within the 180 days prior to 1 January of that particular calendar year (a list of ATC codes is available in online supplemental table S2). Blood chemistry tests were identified through the Clinical Laboratory Information System Research Database and defined as the number of unique measurements of the blood chemistry test between the date of the first and last available purchases.
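The covariate definitions above translate into a few simple rules, sketched here in Python under stated assumptions. The function names are hypothetical, and the treatment of the window boundaries (the start day counted, 1 January itself not counted) is an assumption, since the text does not specify it. ISCED level 4 is not assigned to any group in the description, so the sketch returns `None` for it rather than guessing.

```python
from datetime import date, timedelta


def education_group(isced_level):
    """Map an ISCED level (0-8) to the four education groups used in table 1."""
    if 0 <= isced_level <= 2:
        return 1  # early childhood, primary and lower secondary
    if isced_level == 3:
        return 2  # general upper secondary
    if 5 <= isced_level <= 6:
        return 3  # short-cycle tertiary, medium-length tertiary, bachelors
    if 7 <= isced_level <= 8:
        return 4  # masters or PhD
    return None  # e.g. level 4, which the text does not assign to a group


def medicine_user(redemption_dates, calendar_year, window_days=180):
    """True if any prescription was redeemed in the window before 1 January."""
    index_date = date(calendar_year, 1, 1)
    window_start = index_date - timedelta(days=window_days)
    return any(window_start <= d < index_date for d in redemption_dates)


def prevalent_disease(first_record_date, calendar_year):
    """True if the disease was first recorded before the calendar year began."""
    return first_record_date is not None and first_record_date < date(calendar_year, 1, 1)
```

For example, `medicine_user([date(2021, 12, 1)], 2022)` evaluates to `True`, whereas a single redemption on 1 June 2021 falls outside the 180-day window for 2022.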

Table 1

Baseline covariates of the SMIL cohort in 2018–2022

Patient and public involvement

Consumers actively consent to participate in the cohort. Furthermore, during the enrolment process, cohort members have the option to subscribe to an annual report detailing research findings and project updates. It is important to note that while the cohort members are not directly engaged in establishing the cohort, shaping research questions, or the design of studies, their participation in the project forms a crucial foundation.

Findings to date

Characteristics of the cohort

Table 1 displays the characteristics of the open cohort for each year; accordingly, there is an overlap of individuals across these years.

In summary, table 1 shows that the cohort increased from 4079 individuals in 2018 to 11 214 in 2022. In 2022, 62.4% were men and 24.4% were between 45 and 55 years of age. Notably, 20.7% of participants had prevalent cardiovascular disease, 13.8% had hypertension and 8.1% had diabetes. Furthermore, 10.4% of the cohort redeemed prescriptions for statins, 11.8% redeemed prescriptions for renin–angiotensin–system acting agents, and fewer redeemed prescriptions for the remaining types of pharmacotherapies. The median amount spent in 2022 was DKK7671 (IQR DKK2753–DKK15 341) (in euro: €572, IQR €205–€1143), and the median number of shopping trips was 63 (IQR 26–116). Coinciding with the period between the first and last purchase, 30.2% of participants had at least one creatinine measurement, 61.2% had at least one glycated hemoglobin (HbA1c) measurement and 31.0% had at least one cholesterol measurement (figure 1).

Figure 1

Percentage of the cohort with biomarker measurements coinciding with donating receipts, by type of biomarker, creatinine, HbA1c and LDL cholesterol, respectively, during 2018–2022. HbA1c, glycated hemoglobin; LDL, Low-Density Lipoprotein.

Purchases by sex, age and hour of the day

Through a combination of register data and the SMIL cohort, it is possible to explore various factors, such as the distribution of monetary percentage spent on food items by age group and sex. The choice of a monetary metric over alternatives like weight or kcal is primarily dictated by the type of data available from the receipts. A relative metric is relevant in this context because absolute spending varies widely between shoppers; expressing each food group as a share of total spending provides a more meaningful comparison that reflects proportional expenditure rather than sheer volume. We categorised purchases into 11 food groups and calculated the monetary percentage spent on each specific food group out of the total amount spent on all food groups, stratified by age group (below or above 30 years of age) and sex. These findings are presented in figure 2, revealing that while all food groups are purchased across age and sex categories, men tended to spend a higher monetary percentage on alcoholic beverages and a lower monetary percentage on fruits and vegetables. Additionally, the age group below age 30 years spent a higher monetary percentage on sugary drinks and a lower monetary percentage on red meat compared with the older age group, regardless of sex.

The availability of the exact timing of purchase in the dataset enables investigation of how monetary percentage is distributed throughout the day. To examine this, we calculated the monetary percentage spent on six predetermined food groups of interest by the hour and sex relative to the total amount spent on all food groups using all available receipts. The findings are displayed in figure 3, which illustrates that a higher percentage was spent on healthy food items earlier in the day and a lower percentage later in the day, regardless of sex.
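As a worked illustration of the metric, the monetary percentage for a food group within a stratum is the amount spent on that group divided by the amount spent on all food groups in the same stratum, multiplied by 100. The Python sketch below pools all purchases within each stratum before dividing; whether the published figures pool spending in this way or average per-person percentages is not stated, so this is an assumption, as are the field names (`sex`, `age_group`, `hour`, `food_group`, `amount`).

```python
from collections import defaultdict


def monetary_percentage(purchases, strata=("sex", "age_group")):
    """Percentage of total food spending that goes to each food group.

    Returns a dict keyed by (stratum, food_group), where stratum is a tuple of
    the values of the stratifying fields, e.g. ("men", "<30").
    """
    stratum_total = defaultdict(float)  # spending on all food groups
    group_total = defaultdict(float)    # spending on one food group
    for purchase in purchases:
        stratum = tuple(purchase[field] for field in strata)
        stratum_total[stratum] += purchase["amount"]
        group_total[(stratum, purchase["food_group"])] += purchase["amount"]
    return {
        (stratum, food_group): 100.0 * spent / stratum_total[stratum]
        for (stratum, food_group), spent in group_total.items()
        if stratum_total[stratum] > 0
    }
```

Passing `strata=("sex", "hour")` gives the hour-of-day breakdown with the same function, since only the stratifying fields change.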

Figure 2

Monetary percentage spent on specific food groups by sex and age group in 2022.

Figure 3

Monetary percentage spent on alcoholic drinks, confections, fruits, salty snacks, sugary drinks and vegetables, respectively, by hour and sex in 2022.

The SMIL cohort is a recent establishment and one study based on its data has been published so far,30 which investigated differences in the monetary percentage spent on groceries in 2019 in households with and without at least one individual with diabetes, respectively. Several studies are ongoing, including an investigation of purchases in households that include an individual who had their body mass index measured in childhood, categorised as either underweight/normal weight or overweight/obesity.

Future plans

The SMIL cohort comprises a substantial number of individuals, providing the opportunity to examine hypotheses about consumer purchases and their association with both prevalent and some incident diseases, along with objectively measured demographic characteristics. Future objectives involve building on recent developments in the field, particularly the development of a healthy purchase index,31 32 which addresses challenges associated with the analysis of high-dimensional data, such as consumer purchase data. The cohort also has the potential for investigations relating to non-edible goods, such as tobacco or cosmetics. Furthermore, a central challenge in using consumer purchase data to approximate diet has been to disentangle individual from household exposure. Previous research has demonstrated that consumer purchases most accurately reflect dietary intake in single-person households.10 However, the SMIL cohort, in conjunction with the detailed household composition data from the Danish registries, offers the potential to better isolate individual dietary and other exposures within households. Data collection began in 2018 and undergoes annual updates, with the most recent one including data up to 31 December 2022. Data collection is continuing, and we are taking several steps to increase the cohort. Each annual update entails using algorithmic methods to cleanse the raw consumer purchase data and enrich it with information from Danish registries.

Strengths and limitations

The strengths of the SMIL cohort include its large sample size and reliance on objective measures rather than self-reported assessments to evaluate dietary habits. Additionally, the inclusion of repeated and longitudinal assessments of grocery purchases over an extended period from three out of the five largest retailers in Denmark adds to the cohort’s robustness.

Limitations include the fact that although the receipt data received from the receipt collection service cover purchases made in some of the largest Danish supermarkets, we do not have access to the entirety of the household’s grocery purchases, nor do we have access to purchases made by all household members. Moreover, purchases alone do not capture dietary intake when eating out. However, in Europe, a significant proportion of dietary intake is derived from foods consumed at home.31 33 Although grocery shopping is usually done by the household and may not directly reflect individual intake, it is a reasonable assumption that purchases in supermarkets are representative of the home food environment, which in turn shapes what is ingested by the individual.34 35 The home food environment is a complex concept, encompassing several dimensions such as food availability, accessibility and meal preparation as well as the physical and sociocultural context.36 It is important to note, however, that consumer purchase data primarily capture only one aspect of the home food environment, namely the availability of food items. Furthermore, we urge careful consideration in selecting the appropriate outcome measure. Given the significant absolute variations in spending, a relative metric might offer more relevant insights. Importantly, while relative nutrient estimates have been validated as reliable indicators of dietary composition,7 the use of monetary percentage as a marker has, to our knowledge, not yet been validated in a similar way. Another limitation is that the project experienced data delivery issues involving specific months, resulting in very few receipts from these months. Additionally, data delivery issues, such as duplicated receipts, have required decisions during data cleaning that are prone to errors and are not easily repeatable. Finally, it is important to acknowledge that external generalisability may be limited. As the receipt collection service is used by a selected population, the study’s findings may not be generalisable to the broader population.

In conclusion, this cohort holds promise for investigations of hypotheses regarding health and consumer purchase patterns.

Ethics statements

Patient consent for publication

Ethics approval

This study involves human participants. Danish registry-based studies that are performed for the sole purpose of statistics and scientific research require neither patient consent nor ethical committee approval, as stated in The Danish Data Protection Act. However, all individuals gave informed consent to the use of their receipt data before taking part in the study. Additional approval to use the data sources for research purposes was granted by the data responsible institute of the Capital Region of Denmark (approval numbers P-2019-263 and P-2019-280) in accordance with the General Data Protection Regulation (GDPR).

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors Guarantor: KKS; Conceptualisation: KKS, CT-P and MPA; Methodology: BZ, TGJ and CE; Formal analysis and investigation: KKS; Writing–original draft preparation: KKS; Writing–review and editing: MPA, FTM, CE, TGJ, BZ and CT-P; Funding acquisition and resources: CT-P; Supervision: MPA, FTM and CT-P.

  • Funding This study was supported by the Danish Heart Foundation (grant number R116-A7517) and Helsefonden (grant number 20-B-0195). Frederik Trier Møller, Caroline Eves and Thor Grønborg Junker have received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 874662.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.