Table 2

Psychometric tests and criteria used in the evaluation of the BREAST-Q Utility module

Psychometric property	A priori hypothesis	Tests and criteria
Reliability: the extent to which a measurement is consistent and free from error
Test–retest reliability: the degree to which repeated measurements in stable individuals (ie, no clinical/life change) provide similar answers.33 Measurement error: the systematic and random error of a patient’s score that is not due to true changes in the construct to be measured.33	The BREAST-Q Utility module will demonstrate high test–retest reliability, that is, the responses at the first and second administrations (1 week apart) will be similar.	Weighted kappa ≥0.70.33 41 Percentage of positive and negative agreement.
Construct validity: the degree to which the scores of an instrument are consistent with the hypotheses, assuming the new instrument validly measures the construct of interest
Hypothesis testing: the degree to which the scores of an item/scale are consistent with a priori hypotheses.33	Direction and magnitude of the correlation between the BREAST-Q Utility module and the comparison instruments: we hypothesise that the BREAST-Q Utility module score will show a positive correlation (≥0.3) with similar domains on the EQ-5D-5L, EORTC-QLQ-C30 and SF-12. Known groups validity: based on published evidence on HRQOL outcomes in breast cancer,42–45 we hypothesise that the BREAST-Q Utility module score will be: higher (ie, worse HRQOL) in women currently undergoing (neo)adjuvant treatment(s) compared with women who have not had/had neoadjuvant treatment(s) in the past for breast cancer; lower for women who had breast cancer surgery alone compared with women who had breast cancer surgery and (neo)adjuvant treatments; lower for women diagnosed with early versus advanced stage breast cancer.	ANOVA or Kruskal-Wallis, depending on the distribution of the data, for differences in mean scores (p<0.05). Pearson’s r or Spearman’s r, depending on the distribution of the data: ≥0.5 will be considered a strong correlation, 0.3–0.49 moderate and 0.10–0.29 small.33 46 47
Acceptability and data quality
Response distributions of the instruments and missing data. Floor and ceiling effects: >15% of respondents scoring the lowest or highest possible score.33	We hypothesise that the Utility module will have less than 15% missing data. We hypothesise that the responses of the Utility module will be evenly distributed across the response categories (ie, no floor or ceiling effect).	Distribution of responses by instrument, item-level, stage of cancer and type of treatment will be summarised using descriptive statistics (mean, SD, % of item-level missing data).

ANOVA, analysis of variance; EQ-5D-5L, EuroQol-5 dimension-5 level; HRQOL, health-related quality of life; SF-12, Short Form 12.
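
The criteria in Table 2 can be illustrated with a minimal Python sketch. It is not the study's analysis code. The function names, the choice of linear weights for kappa, the "negligible" label for correlations below 0.10, and the example score range of 0 to 100 are all assumptions made for illustration; only the thresholds (0.70, 0.5/0.3/0.10 and 15%) come from the table.

```python
def weighted_kappa(first, second, categories, weights="linear"):
    """Cohen's weighted kappa for one ordinal item answered twice.

    `categories` lists the response options in order. Linear weights are
    an assumption here; pass weights="quadratic" for squared distances.
    """
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(first)
    # Joint proportions of (first administration, second administration).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    obs = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if exp == 0:
        # Everyone chose the same single category; kappa is undefined.
        return float("nan")
    return 1.0 - obs / exp


def correlation_strength(r):
    """Label a correlation coefficient using the cut-offs in Table 2."""
    size = abs(r)
    if size >= 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.10:
        return "small"
    return "negligible"  # label assumed; the table does not name this band


def floor_ceiling_flags(scores, lowest, highest, threshold=0.15):
    """Flag a floor/ceiling effect when more than 15% of respondents
    score the lowest or highest possible score."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return {"floor": floor_share > threshold,
            "ceiling": ceiling_share > threshold}
```

Under these assumptions, an item would meet the reliability criterion when `weighted_kappa(...)` returns 0.70 or more, and a correlation with a comparison instrument would support the hypothesis when `correlation_strength(r)` is at least "moderate" and `r` is positive.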