A longitudinal observational study of home-based conversations for detecting early dementia: protocol for the CUBOId TV task

Introduction
Limitations in effective dementia therapies mean that early diagnosis and monitoring are critical for disease management, but current clinical tools are impractical and/or unreliable, and disregard short-term symptom variability. Behavioural biomarkers of cognitive decline, such as speech, sleep and activity patterns, can manifest prodromal pathological changes. They can be continuously measured at home with smart sensing technologies, and permit leveraging of interpersonal interactions for optimising diagnostic and prognostic performance. Here we describe the ContinUous behavioural Biomarkers Of cognitive Impairment (CUBOId) study, which explores the feasibility of multimodal data fusion for in-home monitoring of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). The report focuses on a subset of CUBOId participants who perform a novel speech task, the 'TV task', designed to track changes in ecologically valid conversations with disease progression.

Methods and analysis
CUBOId is a longitudinal observational study. Participants have diagnoses of MCI or AD, and controls are their live-in partners with no such diagnosis. Multimodal activity data were passively acquired from wearables and in-home fixed sensors over timespans of 8–25 months. At two time points participants completed the TV task over 5 days by recording audio of their conversations as they watched a favourite TV programme, with further testing to be completed after removal of the sensor installations. Behavioural testing is supported by neuropsychological assessment for deriving ground truths on cognitive status. Deep learning will be used to generate fused multimodal activity-speech embeddings for optimisation of diagnostic and predictive performance from speech alone.

Ethics and dissemination
CUBOId was approved by an NHS Research Ethics Committee (Wales REC; ref: 18/WA/0158) and is sponsored by University of Bristol. It is supported by the National Institute for Health Research Clinical Research Network West of England. Results will be reported at conferences and in peer-reviewed scientific journals.
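The deep-learning fusion step summarised in the abstract can be pictured as a two-branch network that embeds each modality and classifies from the joint embedding. The following is an illustrative sketch only, assuming PyTorch; the FusionClassifier name, feature dimensions and layer sizes are hypothetical and are not specified by the protocol.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Hypothetical two-branch fusion model: speech features and
    activity features are embedded separately, concatenated, and
    classified (e.g., control vs MCI vs AD)."""

    def __init__(self, speech_dim=128, activity_dim=64, embed_dim=32, n_classes=3):
        super().__init__()
        self.speech_enc = nn.Sequential(nn.Linear(speech_dim, embed_dim), nn.ReLU())
        self.activity_enc = nn.Sequential(nn.Linear(activity_dim, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, n_classes)

    def forward(self, speech, activity):
        # Concatenate the per-modality embeddings and classify
        z = torch.cat([self.speech_enc(speech), self.activity_enc(activity)], dim=-1)
        return self.head(z)  # logits over diagnostic classes

# Toy usage with random stand-in features for a batch of 8 recordings
model = FusionClassifier()
print(model(torch.randn(8, 128), torch.randn(8, 64)).shape)  # torch.Size([8, 3])
```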

I only have the following two points: It is not clear what multimodal data will be fused. Objectives 2 and 3 in the introduction are a bit confusing, given that the notion behind the protocol is mainly to leverage speech analysis. Please clarify, especially in the objectives section of the introduction.
It is not clear how, in the authors' opinion, the minimum dataset for optimising meaningful power calculations and for efficient monitoring will be determined, as also claimed in the introduction.

REVIEWER
Liang, Xiaohui, University of Massachusetts System
REVIEW RETURNED
07-Oct-2022

GENERAL COMMENTS
The authors may want to add some more explanation of the content of the conversation. I was not aware of "Gogglebox" and had to google it to understand the task.
This task gathers rich audio and video data about users at home, which may raise privacy concerns.
The authors may want to refer to the following paper.

CUBOId is a longitudinal study whose data collection is ongoing until 31/12/2022. Interestingly, prior to the study, given the importance of co-design and "co-creation" in the domain of digital medicine [1], the authors performed workshops involving study participants in the pilot experiments of the study.
• We thank Dr. Alfalahi for pointing out this systematic review, which is relevant background for our paper due to its discussion of keystroke monitoring and other modalities as passive diagnostic sensing technologies for detecting dementia in naturalistic environments. We have cited it in the Introduction (p.4, paragraph 2; reference [18]).
The methodology behind the study is fully explained in an excellent and clear way.
• We thank Dr. Alfalahi for these comments.
I only have the following two points: It is not clear what multimodal data will be fused. Objectives 2 and 3 in the introduction are a bit confusing, given that the notion behind the protocol is mainly to leverage speech analysis. Please clarify, especially in the objectives section of the introduction.
• The three analysis stages of the protocol all involve speech analysis, and indeed for Objective 1 we intend to develop baseline diagnostic models based on speech only, but for Objectives 2 and 3 we will also include data from SPHERE to explore how speech and SPHERE data covary (Objective 2), and how best to fuse data from the different modalities (Objective 3). We agree that this could be made clearer. We have made changes to the Objectives section of the Introduction (pages 5-6). To better define the scope of the paper before listing the objectives (page 5), we have added the text:
o Here we report on the data acquisition and analysis protocol for CUBOId's novel conversational speech task, the "TV task". We give an overview of CUBOId's in-home sensing methods and outline how concurrent sensor readings from environmental sensors, cameras and wearable accelerometers can be combined, and further fused with conversational speech features, to improve diagnosis from speech alone.
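To make the combination of concurrent sensor readings with speech features concrete, the streams must first share a common timeline. Below is a minimal sketch, assuming pandas; the column names, sampling rates and values are hypothetical stand-ins, not the SPHERE data format.

```python
import pandas as pd

# Hypothetical streams: accelerometer sampled every second, ambient
# sensors every minute, and per-utterance speech features with timestamps.
accel = pd.DataFrame({"t": pd.date_range("2022-01-01 20:00", periods=600, freq="s"),
                      "accel_mag": 1.0}).set_index("t")
ambient = pd.DataFrame({"t": pd.date_range("2022-01-01 20:00", periods=10, freq="min"),
                        "room_temp": 21.5}).set_index("t")
speech = pd.DataFrame({"t": pd.date_range("2022-01-01 20:01", periods=5, freq="2min"),
                       "speech_rate": 3.2})

# Resample everything onto a common 1-minute grid, then attach the
# nearest preceding sensor summary to each speech segment.
sensors = accel.resample("1min").mean().join(ambient.resample("1min").ffill())
fused = pd.merge_asof(speech.sort_values("t"), sensors.reset_index(),
                      on="t", direction="backward")
print(fused.head())
```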
• The text of Objective 2 (page 6) has been edited to clarify/emphasise that obtaining measures of linguistic proficiency is the initial prime focus of the TV Task study, and to remove some potentially confusing text that is repetitive with Objective 3. The text now reads:
o 2. To investigate how temporal variability in linguistic proficiency is reflected in clinically relevant variability in ADL behaviours. For example, we hypothesise that periods of agitation, wandering and sleep disturbances will precede disturbed speech as revealed by acoustic and prosodic changes, hesitations, mnemonic search markers and endogenous (e.g., rapid speech) and exogenous (e.g., calming conversational interactions) linguistic features.
• The text of Objective 3 (page 6) has been edited to reflect that data fusion will follow naturally from the investigations of covariance in speech and ADL behaviours planned for Objective 2, and to emphasise that data fusion will be between speech and data from multiple SPHERE sensor readouts. The text now reads:
o 3. To use insights from Objective 2 and data fusion techniques to leverage our multimodal data streams by deriving latent relational embeddings between speech and multiple sensor readouts for optimisation of diagnostic and predictive performance from speech alone.
• We have also added some text to the final paragraph on page 11, to expand on how speech and SPHERE sensor fusion will be explored for optimising diagnostic performance. This section now reads:
o We will project these novel features onto unseen speech samples after exploring the best subset of SPHERE sensors to combine with our paralinguistic and linguistic models, and also whether diagnostic performance is better at this low level of SPHERE sensor integration compared with, for example, combining speech with actograms derived from pre-integrated sensor data. Fusion techniques will include multiview and transfer learning.
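As a simple illustration of the multiview idea described above (learning latent relational embeddings from paired speech and sensor data, then applying them to speech alone), consider canonical correlation analysis. This is a hedged sketch with simulated arrays, assuming scikit-learn; it is not the study's actual fusion pipeline.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
speech = rng.normal(size=(n, 20))                          # paired speech features
sensors = speech[:, :5] + 0.1 * rng.normal(size=(n, 5))    # correlated sensor view

# Learn a shared latent space from the paired views (training data)
cca = CCA(n_components=3).fit(speech, sensors)

# At test time, project unseen SPEECH ALONE into the shared space;
# the sensor view is no longer required.
unseen_speech = rng.normal(size=(10, 20))
speech_embedding = cca.transform(unseen_speech)
print(speech_embedding.shape)  # (10, 3)
```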
It is not clear how, in the authors' opinion, the minimum dataset for optimising meaningful power calculations and for efficient monitoring will be determined, as also claimed in the introduction.
• We thank Dr. Alfalahi for this observation. We have now provided additional information on how this will be done at the beginning of the data analysis section (page 10), and have provided an additional reference:
o After standardisation of data by the within-population standard deviation, effect sizes for power calculations in future studies will be estimated from differences between a) controls (partners) and patients; b) houses with MCI and AD patients; and c) baseline performance and follow-up testing or analysis (for progression over time) [58].
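For readers wanting the mechanics of this step, a minimal sketch of a standardised effect size (Cohen's d with pooled SD) feeding a power calculation, assuming NumPy and statsmodels; the group values are simulated stand-ins, not CUBOId data.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, size=20)   # e.g., partners' standardised scores
patients = rng.normal(-0.6, 1.0, size=20)  # e.g., MCI/AD participants' scores

# Standardised effect size: mean difference over the pooled SD
pooled_sd = np.sqrt((controls.var(ddof=1) + patients.var(ddof=1)) / 2)
d = (controls.mean() - patients.mean()) / pooled_sd

# Sample size per group for a future two-sample test at 80% power
n_per_group = TTestIndPower().solve_power(effect_size=d, power=0.8, alpha=0.05)
print(f"d = {d:.2f}, n per group = {n_per_group:.1f}")
```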