Phenotypes of non-alcoholic fatty liver disease (NAFLD) and all-cause mortality: unsupervised machine learning analysis of NHANES III

Objectives Non-alcoholic fatty liver disease (NAFLD) is a non-communicable disease with a rising prevalence worldwide and a large burden for patients and health systems. To date, the presence of unique phenotypes in patients with NAFLD has not been studied, and their identification could inform precision medicine and public health with pragmatic implications in personalised management and care for patients with NAFLD. Design Cross-sectional and prospective (up to 31 December 2019) analysis of National Health and Nutrition Examination Survey III (1988–1994). Primary and secondary outcome measures NAFLD diagnosis was based on liver ultrasound. The following predictors informed an unsupervised machine learning algorithm (k-means): body mass index, waist circumference, systolic blood pressure (SBP), plasma glucose, total cholesterol, triglycerides, and the liver enzymes alanine aminotransferase, aspartate aminotransferase and gamma glutamyl transferase. We summarised (means) and compared the predictors across clusters. We used Cox proportional hazard models to quantify the all-cause mortality risk associated with each cluster. Results 1652 patients with NAFLD (mean age 47.2 years and 51.5% women) were grouped into 3 clusters: anthro-SBP-glucose (6.36%; highest levels of anthropometrics, SBP and glucose), lipid-liver (10.35%; highest levels of lipid and liver enzymes) and average (83.29%; predictors at average levels). Compared with the average phenotype, the anthro-SBP-glucose phenotype had higher all-cause mortality risk (aHR=2.88; 95% CI: 2.26 to 3.67); the lipid-liver phenotype was not associated with higher all-cause mortality risk (aHR=1.11; 95% CI: 0.86 to 1.42). Conclusions There is heterogeneity in patients with NAFLD, who can be divided into three phenotypes with different mortality risk. These phenotypes could guide specific interventions and management plans, thus advancing precision medicine and public health for patients with NAFLD.


Supplementary Methods
All analyses were conducted with R and Python. The code (R scripts and Python Jupyter notebooks) is available as supplementary material with this manuscript.

Data preparation
We used individual-level data from the National Health and Nutrition Examination Survey (NHANES) III, conducted between 1988 and 1994. All datasets were downloaded from the NHANES website on March 3rd, 2022; the mortality dataset was downloaded on May 27th, 2022. The authors did not have privileged access.
Data were pre-processed with R (NHANESiii_extraction.R). We used these datasets: adult.dat.txt (n=20,050); lab.sas.txt (n=29,314); exam.sas.txt (n=31,311); examdr.sas.txt (n=30,818); and HGUHS (n=14,797). The adult dataset was used as the prime; that is, the other datasets were merged into the adult dataset, so the sample size stayed at 20,050 after each merge: adult + lab (n=20,050); + exam (n=20,050); + examdr (n=20,050); + HGUHS (n=20,050). We kept only the variables of interest, checked each variable, and recoded them as needed; for example, 9999 or equivalent values were set to missing, and values of 2 were recoded to 0 when they meant 'no'. We then merged the mortality dataset (n=33,994) with the pooled dataset; the sample size was kept at 20,050 (i.e., the pooled dataset was used as the prime), with 79 variables.
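The merge-and-recode logic described above can be illustrated with a minimal Python sketch. It is not the authors' R script (NHANESiii_extraction.R); the join key `SEQN`, the variable names, the toy rows and the set of missing-value codes are assumptions made for illustration only.

```python
# Minimal sketch of a "prime" (left) merge followed by recoding.
# Assumptions: records are keyed by a respondent ID called SEQN, and the
# codes in MISSING_CODES stand for "missing"; real codes vary by variable.
MISSING_CODES = {8888, 9999}


def left_merge(prime, other, key="SEQN"):
    """Attach columns of `other` to `prime`, keeping every prime row once."""
    lookup = {row[key]: row for row in other}
    other_columns = {col for row in other for col in row if col != key}
    merged = []
    for row in prime:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for col in other_columns:
            # Columns already in the prime dataset are left untouched;
            # unmatched rows receive None for the new columns.
            new_row.setdefault(col, match.get(col))
        merged.append(new_row)
    return merged


def recode(value, yes_no=False):
    """Set sentinel codes to missing; map 2 ('no') to 0 for yes/no items."""
    if value in MISSING_CODES:
        return None
    if yes_no and value == 2:
        return 0
    return value


# Toy data (invented values, not NHANES records).
adult = [
    {"SEQN": 1, "diabetes": 2},
    {"SEQN": 2, "diabetes": 1},
    {"SEQN": 3, "diabetes": 9999},
]
lab = [
    {"SEQN": 1, "glucose": 92},
    {"SEQN": 3, "glucose": 9999},
    {"SEQN": 4, "glucose": 110},  # not in the prime dataset, so dropped
]

pooled = left_merge(adult, lab)
for row in pooled:
    row["diabetes"] = recode(row["diabetes"], yes_no=True)
    row["glucose"] = recode(row["glucose"])
```

Because the adult dataset drives the join, the pooled dataset keeps exactly as many rows as the adult dataset, which mirrors the constant n=20,050 reported after each merge.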

Study sample
We followed these criteria to select the study sample (Flowchart 1). First, we only included people aged 20 to 74 years (inclusive). Second, we only kept observations with hepatic imaging data. Third, we excluded people with positive evidence of Hepatitis B (HBsAG) or Hepatitis C (antiHCV). Fourth, we excluded people with high alcohol consumption. We used two questions to define high alcohol consumption: i) In the past 12 months, how many days of the year did you drink any alcoholic beverages? and ii) On the average, on the days that you drank alcohol, how many drinks did you have a day? The answer to the second question (drinks per drinking day) was multiplied by the answer to the first question (number of days on which they drank alcohol) and divided by 365.25 to compute the average number of alcoholic drinks per day in the last year. For example, for a person who drank alcoholic beverages on 52 days in the last year and had three drinks on each occasion, the daily consumption would be 52 x 3/365.25 = 0.43 alcoholic beverages per day in the last year. Men with more than two and women with more than one alcoholic beverage per day in the last year were excluded. Fifth, we only included observations for which the hepatic imaging was deemed 'confident' or 'absolute', to secure the highest quality of the outcome of interest (NAFLD); in addition, we only included people whose hepatic imaging revealed 'moderate-severe' hepatic steatosis.
Flowchart 1. Study population.
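The alcohol exclusion rule can be written as a short Python sketch. The function and variable names are hypothetical and are not taken from the authors' notebook; only the formula, the worked example and the sex-specific limits come from the text.

```python
DAYS_PER_YEAR = 365.25

# Sex-specific limits (drinks per day) above which a person is excluded.
LIMITS = {"male": 2.0, "female": 1.0}


def drinks_per_day(drinking_days, drinks_per_occasion):
    """Average daily drinks over the last year."""
    return drinking_days * drinks_per_occasion / DAYS_PER_YEAR


def high_alcohol(sex, drinking_days, drinks_per_occasion):
    """True when the person exceeds the limit for their sex."""
    return drinks_per_day(drinking_days, drinks_per_occasion) > LIMITS[sex]


# Worked example from the text: 52 drinking days, three drinks each time.
example = drinks_per_day(52, 3)
print(round(example, 2))  # 0.43
```

In this sketch the person in the worked example is retained regardless of sex, because 0.43 drinks per day is below both limits.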
Sixth, we only kept the 10 predictors of interest and dropped all observations with missing values; in other words, we conducted a complete-case analysis. Seventh, to secure high-quality data, we excluded observations outside the following plausibility ranges: BMI below 10 kg/m² or above 80 kg/m²; waist circumference below 30 cm or above 200 cm; systolic blood pressure below 70 mmHg or above 270 mmHg; fasting plasma glucose below 45 mg/dL or above 540 mg/dL; total cholesterol below 20 mg/dL or above 773 mg/dL; and triglycerides below 17 mg/dL or above 1771 mg/dL. After applying these criteria, we included 1,652 observations in the analysis. For further details about this selection process, please refer to the Jupyter notebook 1.Cleaning_data.ipynb.
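A minimal Python sketch of the complete-case and plausibility filters follows. The column names and toy rows are assumptions; the ranges are those stated above, treated as inclusive because only values below or above the limits were excluded. Predictors without a stated range (such as the liver enzymes) are omitted from this illustration.

```python
# Plausibility ranges from the text: (lower limit, upper limit), inclusive.
PLAUSIBLE_RANGES = {
    "bmi": (10, 80),                       # kg/m²
    "waist_cm": (30, 200),
    "sbp_mmhg": (70, 270),
    "glucose_mgdl": (45, 540),
    "total_cholesterol_mgdl": (20, 773),
    "triglycerides_mgdl": (17, 1771),
}


def is_complete(row, predictors):
    """Complete case: every predictor is present and not missing."""
    return all(row.get(name) is not None for name in predictors)


def is_plausible(row):
    """Every ranged predictor lies within its plausibility range."""
    return all(
        low <= row[name] <= high
        for name, (low, high) in PLAUSIBLE_RANGES.items()
    )


def select_sample(rows):
    predictors = list(PLAUSIBLE_RANGES)
    return [r for r in rows if is_complete(r, predictors) and is_plausible(r)]


# Toy rows (invented values, not NHANES records).
rows = [
    {"id": "a", "bmi": 31.2, "waist_cm": 104, "sbp_mmhg": 128,
     "glucose_mgdl": 101, "total_cholesterol_mgdl": 212,
     "triglycerides_mgdl": 180},
    {"id": "b", "bmi": 29.0, "waist_cm": 98, "sbp_mmhg": None,   # missing SBP
     "glucose_mgdl": 95, "total_cholesterol_mgdl": 190,
     "triglycerides_mgdl": 150},
    {"id": "c", "bmi": 85.0, "waist_cm": 150, "sbp_mmhg": 140,   # BMI > 80
     "glucose_mgdl": 110, "total_cholesterol_mgdl": 200,
     "triglycerides_mgdl": 160},
]

kept = select_sample(rows)
```

The completeness check runs first so that the range comparison never encounters a missing value.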

Number of clusters
Selecting the ideal number of clusters in an unsupervised machine learning model is informed by both the data and expert knowledge. Here, we describe in detail the process we followed to reach the final number of clusters used in the analysis. Details about the analytical steps are available in the Jupyter notebook accompanying the paper (2.Number_clusters.ipynb). First, we displayed a dendrogram with Euclidean distances; this plot suggested there were five clusters. Of note, two of these clusters grouped (very) few observations. Second, we displayed the Elbow plot (Figure 1) for one through 10 clusters. The ideal number of clusters would be that after which the Cost function does not change substantially and is the smallest. Table 1 shows the Cost function for each number of clusters, and the absolute arithmetic difference between two consecutive Cost functions. According to these figures (Table 1), the optimal number of clusters could be between five and six clusters; moreover, five clusters appeared to be ideal because the cost function was smallest and almost constant (~1.4) thereafter. Third, according to the Silhouette scores (Table 2), two clusters would be ideal (i.e., highest Silhouette score), closely followed by three and four clusters. At this point, three, four and five clusters appeared to be the best options. Fourth, we therefore calculated the Jaccard index for three, four and five clusters (Table 3). Jaccard scores ≥0.80 suggest good reproducibility of the cluster and would thus be preferred. According to these figures (Table 3), three clusters had the highest Jaccard scores, all of which were ≥0.88. The Jaccard analysis was conducted in R (2.1.Jaccard.R).
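Two of the quantities used in this process can be sketched in a few lines of Python: the absolute difference between consecutive Cost functions (as in the Elbow analysis) and the Jaccard index between an original cluster and its best-matching cluster after re-estimation. This only illustrates the general idea; it is not the authors' notebook or their R script (2.1.Jaccard.R), and all names and numbers below are invented for the example.

```python
def consecutive_differences(costs):
    """Absolute difference between each pair of consecutive cost values."""
    return [abs(a - b) for a, b in zip(costs, costs[1:])]


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_stability(original, reestimated):
    """For each original cluster, the highest Jaccard index achieved
    against any cluster of the re-estimated solution."""
    return {
        label: max(jaccard(members, other) for other in reestimated.values())
        for label, members in original.items()
    }


# Invented cost values for k = 1, 2, 3, 4 clusters.
costs = [100.0, 60.0, 40.0, 30.0]
diffs = consecutive_differences(costs)

# Invented cluster memberships (observation IDs).
original = {"A": {1, 2, 3, 4}, "B": {5, 6}}
reestimated = {"x": {1, 2, 3}, "y": {4, 5, 6}}
stability = cluster_stability(original, reestimated)
```

In a resampling-based stability analysis, such Jaccard values would typically be averaged over many re-estimated solutions; a cluster whose average stays at or above 0.80 would meet the threshold described above.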