Increasing access to cognitive screening in the elderly: applying natural language processing methods to speech collected over the telephone.

Barriers to healthcare access are widespread in elderly populations, with a major consequence that older people are not benefiting from the latest technologies to diagnose disease. Recent advances in the automated analysis of speech show promising results in the identification of cognitive decline associated with Alzheimer’s disease (AD), as well as its purported pre-clinical stage. We used automated methods to analyze speech recorded over the telephone from 91 community-dwelling older adults who had a diagnosis of mild AD or amnestic mild cognitive impairment (aMCI), or who were cognitively healthy. We asked whether natural language processing (NLP) and machine learning could identify the groups more accurately than traditional screening tools and be sensitive to subtle differences in speech between the groups. Despite variable recording quality, NLP methods differentiated the three groups with greater accuracy than two traditional dementia screeners and a clinician who read transcripts of their speech. Imperfect speech data collected via a telephone is of sufficient quality to be examined with the latest speech technologies. Critically, these data reveal significant differences in speech that closely match the clinical diagnoses of AD, aMCI and healthy control.


INTRODUCTION
Unequal access to health services is a growing problem for the elderly, and is attributable to vast differences in financial resources, geographical location and the obvious physical challenge of attending clinics in person when old and frail [1]. Such inequity in access to basic services compounds the effects of aging and means that the elderly population cannot uniformly take advantage of proactive health monitoring services and some advances in diagnostic methods. Therefore, this study assessed cognitive function using brief conversational tasks administered via the telephone, thereby obviating the need for in-person attendance. Detecting cognitive decline as early as possible is important to enable planning for the future, increase quality of life, reduce care costs and potentially gain added benefit from therapeutic drug trials [2]. However, current screening methods typically fail to detect decline until deficits in memory and other cognitive functions are clearly evident.
Early signs of cognitive decline may be evident in speech [3,4], such that it is possible to differentiate cognitively healthy individuals from individuals with Alzheimer's disease (AD) via speech alone [5,6] in highly controlled settings, but it is unknown whether this generalizes to naturalistic settings. The value of remote screening could be enormous, both for earlier detection and for increased access. This study sought to establish whether it is possible to detect these early signs in the speech of individuals with amnestic mild cognitive impairment (aMCI), a group at high risk for later conversion to AD [7,8] who often remain undiagnosed due to the poor sensitivity of traditional screening tests, by leveraging natural language processing (NLP) methods on speech collected in naturalistic settings. Specifically, we asked three questions: (1) What are the clinically relevant language features that best differentiate mild AD, aMCI and healthy controls?
We hypothesized that speech coherence or intelligibility would differentiate control participants from AD participants, with the aMCI group intermediate between the two [9][10][11]. (2) How well do NLP features and machine learning methods classify the three groups? We hypothesized that our models would be able to separate the three groups from one another, with the healthy controls most obviously different from the individuals with AD, and the aMCI group intermediate between the other two. Previous studies in controlled settings [5,6] have reported models capable of distinguishing healthy, aMCI and AD groups with good measures of separability (i.e., an area under the receiver operating characteristic curve (AUC) in the 0.70 range; the higher the AUC, the better the model is at predicting who is in which group). We expected to find similar measures of separability using our real-world data, collected under significantly less controlled conditions. (3) Can automated methods provide more accurate diagnostic predictions than traditional dementia screening tools or expert humans? We hypothesized that automated methods would yield more accurate group categorizations than traditional screening tools and experts (with no contextual knowledge beyond a speech sample), due to highly sensitive machine learning techniques.

MATERIALS & METHODS
Here we report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study. No part of the study procedures or analyses was pre-registered prior to the research being conducted.

Participants
Participants (N = 91) were community-dwelling English speakers, recruited via the Memory Disorders Program (MDP) at Georgetown University (Table 1). One third carried a diagnosis of mild AD, one third amnestic MCI, and one third were cognitively healthy. Clinical diagnoses of AD [12] and mild cognitive impairment [13] followed established criteria. Since mild cognitive impairment is a heterogeneous clinical syndrome, individuals with aMCI (single or multiple domain [14]) were included to reduce clinical variability, as this subgroup is at greatest risk for conversion to AD [7,8].
Control participants had no significant medical history or subjective cognitive complaint. Participants had Mini-Mental State Examination (MMSE; [15]) scores within the last 6 months, had adequate hearing, and no self-reported history of neurological disease (e.g., Parkinson's disease or epilepsy), drug or alcohol abuse, psychiatric hospitalization, current cancer treatment, or stroke or heart attack within the last year. Individuals with minor physical ailments (e.g., diabetes with no serious complications, essential hypertension) were included. Participant recruitment, written informed consent (with authorized representatives also providing consent for participants in the mild AD group), medical history and administration of the MMSE were conducted in the Georgetown University MDP prior to the telephone interview. Only contact details for each participant were shared with the telephone interviewer, who remained blind to participant diagnostic group. The study was approved by the institutional review boards of Marymount University and Georgetown University (MU IRB#260). All inclusion/exclusion criteria were established prior to participant recruitment.

Materials & Procedures
The telephone interview (approx. 20 mins) (i) collected speech samples and (ii) administered a screening test for cognitive decline (in counterbalanced order).
Telephone Screener: A modified version of a telephone-based screening instrument for cognitive decline, the Telephone Interview for Cognitive Status (TICS-M) [16], was employed. The TICS-M is modeled after the MMSE in providing a brief, global measure of cognitive functioning, and has good sensitivity and specificity to detect dementia [17], but its utility to screen for milder cognitive syndromes is unknown [18,19]. Legal copyright restrictions prevent public archiving of the TICS-M and MMSE, which can be obtained from the copyright holders in the cited references.
Speech Samples: Participants generated as many 'animal' words as possible in one minute (semantic word fluency), and described a favorite memory from childhood (free speech) (Table 1).
Participants were telephoned at home via the Cisco Jabber interface on a laptop computer, and the semantic word fluency and free speech portions of the interview were recorded onto the device and later uploaded to a secure cloud-based application.
Spouses/companions were asked to remove visual memory aids (e.g., calendars) and turn off audible distractors prior to the interview. The speech samples were digitally recorded and transcribed by the first author or a trained research assistant (intraclass correlation coefficient = 0.988) to check for accuracy and screen for personally identifying information. (The conditions of our ethics approval do not permit public archiving of anonymized study data. Readers seeking access to the data should contact the corresponding author. Access will be granted to named individuals in accordance with ethical procedures governing the reuse of sensitive data. Specifically, requestors must meet the following conditions to obtain the data: completion of a formal data sharing agreement; approval by the Marymount and Georgetown University IRBs.)

Data Analysis
A range of natural language features were extracted from participant responses in the free speech task and the semantic fluency task. In general, features were extracted automatically using custom-written Python code and various packages for data management, statistical calculations, NLP analyses, and word vector creation. For each classification setting, the most predictive and clinically relevant features were chosen to train and test machine learning models. The best performing models are reported in the Results.

NLP Feature Selection
For the free speech task, three classes of NLP features (a set of 73 total) were extracted, namely (i) word-level (lexeme), (ii) sentence-level (syntactic), and (iii) meaning (semantics) of expressions [20]. The first class of language features included simple counts of word tokens and word types (i.e., unique words), and slightly more sophisticated metrics: type to token ratio (TTR; a measure of lexical richness), content density (a measure of actual information spoken, as opposed to filler words), Brunét's Index (a measure of lexical richness less affected by text length), Moving Average Type Token Ratio (a version of TTR that is calculated on a sliding window of the text and is less affected by text length), Honoré's Statistic (emphasizes words that are only spoken once), and counts and frequencies of specific parts of speech that were computed with NLTK's standard TreeBank tagger (https://www.nltk.org/). Some of these features are more impacted by text length, as longer utterances will receive a higher score. Since poverty of speech is a common symptom in conditions such as AD, features that are sensitive to text length tend to discriminate the AD group better than those that are not.
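The word-level richness metrics named above can be sketched in a few lines. This is a minimal illustration and not the study's code: the formulas are the commonly cited definitions (Brunét's W = N^(V^-0.165), Honoré's R = 100 ln N / (1 - V1/V)), the function names and the toy transcript are hypothetical, and the exact constants, window size, and tokenization used in the study are assumptions. Each function expects a non-empty token list.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Unique word types divided by total word tokens."""
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window=10):
    """Mean TTR over every contiguous window of `window` tokens."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [type_token_ratio(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


def brunet_index(tokens, exponent=0.165):
    """Brunet's W = N ** (V ** -exponent); smaller W means richer vocabulary."""
    n, v = len(tokens), len(set(tokens))
    return n ** (v ** -exponent)


def honore_statistic(tokens):
    """Honore's R = 100 * ln(N) / (1 - V1 / V); V1 counts words used once."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    if v1 == v:  # every type used exactly once: R is undefined
        return float("inf")
    return 100 * math.log(n) / (1 - v1 / v)


# Toy transcript (hypothetical, lower-cased, whitespace-tokenized).
sample = "it was a house it was not a house it was a development".split()
print(type_token_ratio(sample), moving_average_ttr(sample, window=5))
print(brunet_index(sample), honore_statistic(sample))
```

Because TTR falls as a response grows longer, the windowed variant is the one to prefer when responses differ widely in length.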
The second class of language features were syntactic features, or those that seek to measure the complexity and arrangement of sentences. These included measures extracted from dependency parses or speech graphs. Examples of such metrics are distances of dependencies in parses or the number of nodes, edges, or loops in speech graphs.
The third class, semantic features, were computed in a few different ways. Generally, semantic analyses are performed using high-dimensional vector space word embeddings of text. These embeddings operate under the premise that the meaning of a word is derived from the context in which it tends to appear. Words that tend to appear in similar contexts are semantically related and thus should be close to each other in a derived vector space. Examples of word embedding techniques are Latent Semantic Analysis (LSA; [21]), word2vec [22], Embeddings from Language Models (ELMo; [23]), and Bidirectional Encoder Representations from Transformers (BERT; [24]). LSA performs a singular value decomposition on a sparse type-to-document matrix to obtain lower dimensional vectors of each of the types. Word2vec is a neural network-based word embedding model trained on a large corpus of text with the goal of predicting either a word given its context or the context surrounding a word given the word. ELMo and BERT are deep neural language models that are built on long short-term memory neural language models and transformers, respectively. Metrics are computed on the cosine distances between consecutive embeddings or windows of embeddings, or by calculating the slope of coherence through the text. For end-to-end models like BERT, the entire network can be harnessed and subsequently tuned with a new layer to produce predictions.
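A window-based coherence score of the kind described here might be computed as in the sketch below. It is illustrative only: the two-dimensional `toy` vectors stand in for real pretrained embeddings, coherence is taken as the mean cosine similarity between adjacent windows, and the choices to use non-overlapping windows (`step` equal to `window`) and to skip out-of-vocabulary tokens are assumptions, not details reported in the paper.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def window_coherence(tokens, embeddings, window=4, step=4):
    """Mean cosine similarity between consecutive windows of word vectors.

    Tokens missing from `embeddings` are skipped (an assumption).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    windows = [mean_vector(vectors[i:i + window])
               for i in range(0, len(vectors) - window + 1, step)]
    sims = [cosine_similarity(a, b) for a, b in zip(windows, windows[1:])]
    return sum(sims) / len(sims) if sims else float("nan")


# Hypothetical 2-d embeddings, for illustration only.
toy = {
    "read": [1.0, 0.0], "write": [0.9, 0.1], "she": [0.5, 0.5],
    "and": [0.4, 0.6], "house": [0.0, 1.0], "development": [0.1, 0.9],
    "project": [0.2, 0.8], "a": [0.5, 0.5],
}
repeated = ["read", "and", "write", "she"] * 2
varied = ["read", "write", "she", "and", "house", "development", "project", "a"]
print(window_coherence(repeated, toy), window_coherence(varied, toy))
```

In this toy case the verbatim repetition scores higher than the response that moves to a new topic, which is one reason a high coherence value should not be read as a sign of richer content.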
For the semantic fluency task, a task-specific feature set (comprising 26 features) was extracted from participant responses. Traditionally, the semantic fluency task is administered and scored by trained humans who count the number of unique items (in this case, animals) spoken. More detailed analyses of responses to this task have been proposed by researchers that can provide additional insights into human cognitive performance [25][26][27][28]. Classically, Troyer et al. [29] proposed two metrics that measure important components of the animal fluency task: clustering (i.e., producing words within the same subcategory of animals, like safari animals or house pets) and switching (i.e., changing between clusters). This approach can be implemented with hand-coded categories of animals or by using semantic distances. Using semantic distances entails computing the cosine distance of the word embedding of each exemplar to the next and setting a threshold of belonging to a category or not.
A number of features were extracted from the semantic fluency task, namely the number of unique animals spoken, the number of categories produced (employing both the hand-coded Troyer categories as well as a BERT [24] word embedding-based thresholding method where cosine distances between consecutive BERT representations of animal words are computed and those distances that fall below a predetermined threshold are considered a jump to a new category of animals), the average number of animals per category, the average cosine similarity between successive animals and successive categories, and the average vector length of each exemplar's word embedding.
Each time the average was computed, so too was the standard deviation, minimum, and maximum. The length of the animal vectors has been shown previously to be an indicator of the "usualness" of the animals spoken [30]. Researchers in NLP have shown that words that occur in many different contexts, and thus have less meaning (such as stop words or other commonly used words), move around in vector space during computation and are shortened with each move due to an averaging computation. Thus, the longer the vector representation is, the more unusual the word tends to be [31].
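The embedding-based clustering and switching features can be illustrated with a short sketch. It is not the study's implementation: the two-dimensional vectors are hypothetical stand-ins for BERT representations, the threshold of 0.5 is arbitrary, and a switch is counted when the cosine similarity between successive animals drops below the threshold, which is one assumed reading of the thresholding rule. `fluency_features` expects a non-empty response.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_clusters(words, embeddings, threshold=0.5):
    """Split a fluency response into runs of semantically similar words."""
    if not words:
        return []
    clusters = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cosine_similarity(embeddings[prev], embeddings[cur]) < threshold:
            clusters.append([cur])    # similarity dropped: count a switch
        else:
            clusters[-1].append(cur)  # still within the current cluster
    return clusters


def fluency_features(words, embeddings, threshold=0.5):
    """Summary features for one semantic fluency response."""
    clusters = segment_clusters(words, embeddings, threshold)
    sizes = [len(c) for c in clusters]
    norms = [math.sqrt(sum(x * x for x in embeddings[w])) for w in words]
    return {
        "unique_animals": len(set(words)),
        "clusters": len(clusters),
        "switches": len(clusters) - 1,
        "mean_cluster_size": sum(sizes) / len(sizes),
        "max_cluster_size": max(sizes),
        "mean_vector_length": sum(norms) / len(norms),
    }


# Hypothetical 2-d embeddings: house pets point one way, safari animals another.
toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2],
       "lion": [0.1, 1.0], "zebra": [0.2, 0.9], "giraffe": [0.15, 0.95]}
response = ["cat", "dog", "lion", "zebra", "giraffe"]
features = fluency_features(response, toy)
print(features)
```

The hand-coded alternative replaces the similarity test with a lookup of each animal's category, leaving the rest of the bookkeeping unchanged.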
The discriminability of each feature was determined by multivariate statistical analyses (specifically f-statistics) and feature importance in machine learning models.
Specifically, the NLP features with an f-statistic greater than 5.0 (range of f-statistics for features in all prediction scenarios: free speech, 0.00 to 11.72; animal fluency, 0.00 to 35.01) were initially chosen for experimentation and the machine learning models further narrowed down this choice by eliminating those features that were not critical for increasing model performance (e.g., due to multicollinearity with other features).
Features were computed for the entire dataset, but each prediction setting followed its own distinct feature selection process on its corresponding labels and data.
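The initial screening step can be sketched as follows, assuming the f-statistic is a one-way ANOVA F computed per feature across diagnostic groups (the exact test is not specified here, so this is an assumption). The feature names and values are invented for illustration, and `select_features` is a hypothetical helper.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(x for g in groups for x in g) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    if ss_within == 0:  # no within-group variance to divide by
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_features(table, threshold=5.0):
    """Keep the features whose F exceeds the screening threshold."""
    return [name for name, groups in table.items()
            if f_statistic(groups) > threshold]


# Invented values: one list per group (healthy, aMCI, AD) for each feature.
toy_table = {
    "unique_animals": [[20, 22, 21], [16, 17, 15], [10, 11, 9]],
    "filler_count": [[5, 7, 6], [6, 5, 7], [7, 6, 5]],
}
print({name: f_statistic(groups) for name, groups in toy_table.items()})
print(select_features(toy_table))
```

A feature that passes this screen may still be dropped later if it is redundant with other selected features, which is the role the models play in the second stage.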

Classification
We sought to answer how well contemporary NLP methods can differentiate the three groups and whether machine learning methods can inform further about the relative importance of language variables in different stages of decline. When performing these experiments, some models overfitted the samples' idiosyncratic characteristics such that some features were statistically important in differentiating groups but lacked clinical relevance (e.g., the count of numbers used in free speech); these features were omitted to improve potential generalizability.
For each classification setting, we first performed a feature selection process that narrowed down our feature set to those that had the highest discriminability, yet were also clinically relevant (detailed in the first section of the Results). Then we used a grid search methodology to optimize the hyperparameters, and investigated 7 different machine learning model architectures (specifically a Decision Tree Classifier, Extra Trees Classifier, Gradient Boosting Classifier, K Neighbors Classifier, Logistic Regression Classifier, Random Forest Classifier, and Support Vector Classifier), including those with 0-4 tunable hyperparameters with 1-13 options each. The grid search was performed with the goal not just of building the best model, but also of understanding the relevance of features and how they may be used for the detection of dementia. If certain features were consistently implicated in each model, it would be clear that they were not simply idiosyncratic to a particular algorithm. Furthermore, it was important to explore which model and hyperparameter combinations tended to work best with the distribution of the chosen features. Decision Tree Classifiers, Extra Trees Classifiers and Gradient Boosting Classifiers worked particularly well, and placed consistently in the top 10% of model architectures for each scenario, as they assume no prior distribution of the data, do not depend on probability distribution assumptions, and allow the data to be partitioned on different combinations of the chosen features. They also tend to have excellent accuracy with high-dimensional datasets.
In the Results sections, we report statistics of the accuracy of not only the top performing model, but also the top 10% of the models tested so as to offer transparency around the level of accuracy consistency in the overall results of the grid search. We used leave-one-out cross-validation in each setting as this type of cross-validation allowed us to simulate how the model would predict a new participant after being fully trained on our initial dataset.
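The evaluation loop, a grid search scored by leave-one-out cross-validation, can be sketched in plain Python. To keep the example self-contained it uses a small k-nearest-neighbors classifier as a stand-in for the architectures listed above; the feature values, labels, and candidate values of `k` are invented, and none of this reproduces the study's actual models or hyperparameter grid.

```python
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority label among the k training points closest to `point`."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k):
    """Hold out each participant once, train on the rest, score the held-out one."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(train_x, train_y, xs[i], k) == ys[i]
    return hits / len(xs)


def grid_search(xs, ys, k_values):
    """Leave-one-out accuracy for every candidate hyperparameter value."""
    return {k: leave_one_out_accuracy(xs, ys, k) for k in k_values}


# Invented (unique animals, categories) pairs for six participants.
xs = [(21, 7), (22, 8), (20, 7), (11, 4), (10, 3), (12, 4)]
ys = ["healthy", "healthy", "healthy", "decline", "decline", "decline"]
results = grid_search(xs, ys, k_values=[1, 3, 5])
print(results)
```

On this toy set the largest `k` performs worst because each held-out participant is outvoted by the other group, a reminder that hyperparameter choices interact with sample size in small datasets such as this one.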
Code for feature extraction and model training can be accessed at: https://github.com/ckchandler/Increasing-Access-to-Cognitive-Screening

What are the clinically relevant language features that best differentiate between the three groups?
Aberrations in meaning and language have been identified as critical indicators of cognitive decline in both aMCI and AD [9,19], thus we focused on text-based analyses.
All NLP features with significant f-values in group comparisons for both the free speech task and the semantic fluency task are listed in Table 2.
For the free speech task, certain word-level features consistent with poverty of speech (raw count of nouns, determiners, present participle verbs, and modals) had statistically significant f-values when comparing the AD group to the cognitively healthy group and to the aMCI group. (A modal is a type of verb that is used to indicate modality such as likelihood, requests, suggestions, and so on, for example, can, could, may, and might; the frequency is computed by dividing the number of modals spoken by the total number of words spoken.)
Other word-level features (frequency of modals, past participle verbs, non-third person singular verbs, and all verb types) had statistically significant f-values only when comparing cognitively healthy participants to those with an aMCI diagnosis. For syntactic features, the mean distance of all dependencies between words in a sentence in a participant response served as a discriminable feature for the AD group when comparing to both the cognitively healthy group and the aMCI group, but did not significantly differentiate the cognitively healthy group from the aMCI group. The semantic feature that proved to be most discriminable in our dataset was the mean coherence of a 4-word sliding window of the 300-dimensional word2vec word embeddings based on 3 million words from the Google News corpus. The window size (4 words) is a hyper-parameter that is generally tuned to be whatever size produces the most accurate representation of pieces of text; at a high level, each window should represent a distinct phrase so as to smooth out the noise that would be produced if comparing consecutive words. We found that this feature discriminated the AD group from both the cognitively healthy group and the aMCI group, but failed to discriminate the aMCI group from the participants labeled as cognitively healthy.
For the semantic fluency task, the number of unique animals spoken was the most discriminable feature overall; it was the highest for separating AD from cognitively healthy, fairly high when comparing aMCI to AD, and lower, yet still significant, for separating cognitively healthy from aMCI. The same can be said for the number of categories, based on the hand-coded Troyer et al. [29] categories, but with slightly less discriminatory power. Finally, with even less discriminatory power, yet still a significant amount, the maximum number of animals spoken per category (i.e., the maximum number of animals spoken consecutively from one category) discriminated all groups fairly well.

Task Specificity
Another finding of the feature extraction portion of our work is that no single language feature is consistently discriminable across tasks for varying levels of cognitive decline. As an example, we discuss coherence in language and how it varies between tasks. Previous work by our team has found that coherence in recalling a short story is generally lower in a group of individuals with varying levels of mental illness (see [32] and [33] for an overview of this work) and that higher coherence in story recalls generally received higher expert ratings of recall as well. In the current study, we found the opposite: the cognitively healthy group had the lowest coherence, followed by the aMCI group, and the AD group had the highest coherence.
The methodology was the same in both approaches: the average cosine distance between consecutive windows of size 4 was computed for each response. There certainly will be differences when coherence is operationalized in different ways (see [34] for an overview of different approaches to computing coherence), but this is not a factor here. The only difference between the two experimental settings was the task.
One is a constrained task where the participants try to remember specific details of a short story recently told to them, and the other is free speech where the response is given in a narrative manner, relying on long-term autobiographical memory, and likely retold with greater emotion and enthusiasm. The higher coherence of the AD group in this dataset could be attributed to more repeated words and less detail overall. To illustrate, we include portions of text from a participant from the AD group with the highest computed coherence to show how a wordy response that consists of repeated statements would generate a high coherence: "...And uh, it wasn't a project actually it was a it was a um. It, it was a, it was a house. Um. It was a not a house, it was a it was a um, a development… She, she never learned to read and write. Um, but she um uh, I don't, I don't think she ever learned to read and write, but she um, uh she may have. I, I think she may have learned to, to read and write..." Since this contrast spans two different studies and thus two different participant pools, we also explored whether the coherence between individual exemplars in participants' animal fluency responses correlated with the coherence of participants' childhood memory response. As noted earlier, these are two dissimilar tasks, tapping quite different cognitive processes, and it is perhaps unsurprising that we found a low correlation between the two features (Pearson r correlation of 0.26). Thus, we conclude that there is little commonality across tasks and advocate for task-specific measures and methods of computing such measures (e.g., either using larger window sizes to account for long, drawn out phrases, or removing verbatim repeated clauses in the case that repeated words and more verbosity in general are expected). We further advocate here that researchers working with computational methods must be explicit in reporting the manner in which their metrics were computed, especially in cases where standardized methodologies do not yet exist.

Classifying Cognitively Healthy, aMCI, and AD
In the setting of classifying the three groups together as an assay of level of cognitive decline, the top three features chosen for machine learning modeling were the average coherence in free speech, and the number of unique animals and categories spoken in semantic fluency.
We used these three features to train a model for classifying cognitively healthy, aMCI, and AD participants. The best model in this experimental setting was a Decision Tree Classifier with a maximum tree depth of 3. This model was 62% accurate overall when performing leave-one-out cross-validation (Table 3; Appendix A). Figure 1 shows the ROC curve of each of the three groups.

INSERT FIGURE 1 & TABLE 3 HERE
The model was most accurate at predicting cognitively healthy, then aMCI, and was least accurate in the AD setting.
As a method to visualize the diagnostic groups based solely on these three features, we applied Principal Component Analysis (PCA) to the data. Figure 2 (top left panel) shows density plots of the first dimension of this reduction, separated by diagnosis. The distributions are ordered as expected, with the cognitively healthy and AD groups at the two extremes. The left edge of the aMCI peak aligned with the peak of the cognitively healthy group, while the right edge of the aMCI peak aligned with the peak of the AD group. If an aMCI participant was incorrectly predicted as cognitively healthy, that is because they were within the healthy range for this sample. Similarly, if an aMCI participant was incorrectly predicted as a member of the AD group, that is because they performed more within the AD range. For these individuals, this prediction "error" could signpost a future conversion to AD.
Of the models tested, the top 10% of the models were on average 53% accurate (SD 0.03, minimum 51%, maximum 62%).

Classifying Cognitively Healthy against "Cognitive Decline"
Next, we tested the setting of classifying the cognitively healthy group against cognitive decline in general (i.e., aMCI and AD participants were treated as belonging to the same group). Since coherence was not a significant indicator for differentiating cognitively healthy from aMCI, in this setting, we replaced coherence with the frequency of modals in the language. This feature, plus the number of unique animals spoken and the number of categories spoken, were used to train a machine learning classifier to classify cognitively healthy against cognitive decline. The best model found in this experimental setting was an Extra Trees Classifier (a classifier comprising 32 Decision Trees) with a maximum tree depth of 10 and an entropy criterion for separating the data into subsets with more homogeneity within individual groups (Table 3; Appendix A). This model was 87% accurate overall when performing leave-one-out cross-validation and had an AUC of 0.86.
Of the models tested, the top 10% of the models were on average 84% accurate (SD 0.02, minimum 81%, maximum 87%).

Classifying Cognitively Healthy and aMCI
The next machine learning model implemented was that of distinguishing cognitively healthy from aMCI. This setting is where the dichotomy between the most accurate model and the most clinically relevant model was apparent. When allowing the machine learning model to choose the best features to differentiate the two groups, it overwhelmingly favored features such as cardinal (number) counts and frequencies, the frequency of non-third person singular verbs and wh-adverbs (e.g., when, where, why), and the number of unique animals spoken in semantic fluency. This model achieved an accuracy of 87% and an AUC of 0.88, which is high compared to other studies (e.g., [6]). These features are not supported by the literature and lack clinical relevance, so we report another model that, while less accurate on this dataset, has the potential to be more generalizable and have greater translational value.
The clinically relevant model for this comparison was one based only on the number of unique animals and categories generated in the fluency task. Interestingly, the clinically relevant features extracted for the cognitively healthy and aMCI groups from the free speech task did not add information that was distinct from the differences derived from the fluency task. The best model in this setting was a Decision Tree Classifier with a maximum tree depth of 4. It achieved an accuracy of 80% and an AUC of 0.78 (Table 3; Appendix A). The model correctly predicted 86% of the cognitively healthy group and 75% of the aMCI group. Figure 2 (top right panel) shows the PCA dimensionality reduction of these two groups performed on their top features as determined by the f-value.
Of the models tested, the top 10% of the models in the clinically relevant feature setting were on average 76% accurate (SD 0.02, minimum 71%, maximum 80%).

Classifying Cognitively Healthy and AD
The best model for differentiating cognitively healthy from AD was based on the number of turns spoken by the participant in the free speech task (broken up by prompts to continue speaking by the interviewer; some participants had one turn, but others needed to be asked many follow-up questions to continue talking), and the number of repeated animals, categories, and the average and maximum length of the animal word vectors spoken during the semantic fluency task. The best model was an Extra Trees Classifier with 32 estimators, a maximum tree depth of 5, and an entropy criterion. It achieved an accuracy of 88% and an AUC of 0.90 (Table 3; Appendix A). Out of the 29 cognitively healthy participants, 2 were predicted as AD and of the 30 AD participants, 5 were predicted as cognitively healthy. Figure 2 (bottom left panel) shows the PCA dimensionality reduction of the data used in the machine learning model.
Of the models tested, the top 10% of the models were on average 87% accurate (SD 0.01, minimum 87%, maximum 88%).

Classifying aMCI and AD
Finally, we discuss the setting of differentiating aMCI from AD. This is of critical interest for clinical translational value as those with aMCI are at an increased risk to convert to AD. Thus, incorrect predictions for the aMCI group may indicate people who are more likely to convert to AD. A Decision Tree Classifier with a maximum depth of 3 based on the mean coherence in free speech and the number of unique animals, categories, and maximum coherence between successive animals in semantic fluency resulted in an impressive 79% accuracy and 0.74 AUC (Table 3; Appendix A). Figure 2 (bottom right panel) shows the PCA dimensionality reduction of these two groups alone performed on their top features as determined by the f-value.

INSERT FIGURE 2 & TABLE 4 HERE
Of the models tested, the top 10% of the models were on average 75% accurate (SD 0.03, minimum 69%, maximum 79%).

Can natural language processing models provide more accurate diagnostic predictions than traditional dementia screening methods?

Comparison to human judgement
Blind to diagnosis, co-author R.S.T., a neurologist specializing in the diagnosis of dementia, labeled the transcript of each participant's free speech response as belonging to one of the three groups. The resulting labels assigned to each participant were 49.45% accurate (Appendix B). We present two comparisons of the human classification to our machine learning models. The first used our best machine learning model (based on both the free speech task and the fluency task; labeled ML in Figure 3), whereas the second used a separate model based only on the free speech task (labeled ML (fs) in Figure 3) to more accurately compare to the resources available for R.S.T.'s labeling.
The human classifications were less accurate than our best model in predicting who was cognitively healthy (58.6% accurate versus 75.8% in the machine learning model) and aMCI (34.4% accurate versus 65.6% in the machine learning model).However, the human was more accurate in predicting AD than the machine learning model (56.7% accurate versus 43.3% in the machine learning model).
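The per-group accuracies for the clinician can be reconciled with the overall figure of 49.45% by converting each percentage back to a count of participants. In the sketch below, the group sizes of 29 (cognitively healthy) and 30 (AD) are stated elsewhere in the text, whereas the aMCI group size of 32 is inferred as 91 - 29 - 30 and should be read as an assumption; the function names are ours.

```python
# Group sizes: 29 healthy and 30 AD are stated in the text; 32 aMCI is
# inferred here as 91 - 29 - 30 (an assumption, not a reported figure).
group_sizes = {"healthy": 29, "aMCI": 32, "AD": 30}

# Per-group accuracies (in percent) as reported for the clinician.
human_pct = {"healthy": 58.6, "aMCI": 34.4, "AD": 56.7}


def correct_counts(pct_by_group, sizes):
    """Convert per-group accuracy percentages back to whole-person counts."""
    return {g: round(pct_by_group[g] / 100 * sizes[g]) for g in sizes}


def overall_accuracy(counts, sizes):
    """Pooled accuracy: all correct labels over all participants."""
    return sum(counts.values()) / sum(sizes.values())


human_counts = correct_counts(human_pct, group_sizes)
human_overall = overall_accuracy(human_counts, group_sizes)

print(human_counts)
print(f"overall accuracy = {human_overall:.4f}")
```

Under these assumptions the clinician labeled 45 of 91 transcripts correctly, which matches the reported 49.45%.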

The Modified Telephone Interview for Cognitive Status
We used the Knopman et al. [18] thresholds to differentiate our three groups. The point cutoffs are between 0 and 27 (inclusive) for the AD group, between 27 and 31 (inclusive) for the aMCI group, and any score above 31 for the cognitively healthy group.
The confusion matrix for this cutoff on our participants' scores is contained in Appendix B. This shows a high skew towards cognitively healthy predictions.
As our sample was highly educated, we sought to control for this effect on predictions with an education scaling factor. Following the guidelines set forth by Knopman et al. [18], we adjusted for participant education level by adding no points to the raw TICS-M score for participants with between 11 and 15 years of education and subtracting 2 points for participants with 16 or more years of education. Surprisingly, accuracy declined for the AD group when scaling for education, as three previously correctly predicted AD participants were mislabeled into the aMCI group. The cognitively healthy and aMCI groups were unaffected.
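The cutoff and education rules described above can be written as a small decision procedure. This is a minimal sketch rather than the scoring code used in the study, and it rests on two labeled assumptions: a score of exactly 27, which appears in both the AD and aMCI ranges as stated, is assigned to AD; and scores for participants with fewer than 11 years of education, for whom no adjustment is described here, are left unchanged. All function names are hypothetical.

```python
def adjust_for_education(raw_score, years_of_education):
    """Apply the education scaling described in the text.

    Only two bands are given there: no change for 11-15 years and
    minus 2 points for 16 or more years. Leaving scores unchanged below
    11 years is an assumption made for this sketch.
    """
    if years_of_education >= 16:
        return raw_score - 2
    return raw_score


def classify_tics_m(score):
    """Map a TICS-M score to a group using the stated cutoffs.

    The text lists 27 in both the AD and aMCI ranges; this sketch
    assigns 27 to AD, which is an assumption.
    """
    if score <= 27:
        return "AD"
    if score <= 31:
        return "aMCI"
    return "cognitively healthy"


def screen(raw_score, years_of_education):
    """Education-adjusted TICS-M classification for one participant."""
    adjusted = adjust_for_education(raw_score, years_of_education)
    return classify_tics_m(adjusted)


# A raw score of 33 is labeled differently depending on education.
print(screen(33, 12))  # no adjustment applied
print(screen(33, 16))  # 2 points subtracted before applying the cutoffs
```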
Our best machine learning model for classifying cognitively healthy, aMCI, and AD was more accurate than the TICS-M test. The TICS-M test was 45% accurate on our participants, even when scaled for education effects. The TICS-M test overwhelmingly classified the participants as cognitively healthy, whereas there was a more even spread of predictions in the machine learning approach.
[...] speech of variable quality, recorded remotely from telephone conversations conducted from participants' homes.
We do note, however, that the participants in our study were recruited from a database of research volunteers who were relatively homogeneous in terms of important demographics such as race and education (being predominantly White and highly educated). This is a widespread characteristic of clinical research [38], which notably limits the generalizability of findings. Furthermore, machine learning models tend to learn from and propagate societal biases between demographic groups [39], and off-the-shelf NLP models themselves are known to inflate disparities [40]. This is a critical issue that the entire field is facing [41]; thus we are careful not to make generalization claims and acknowledge that further widespread work must be done to decrease this inequity.
Another limitation of this study is the small sample size. Computational studies in the clinical domain rarely have a dataset large enough to hold out an evaluation set distinct from the training and validation sets used to train the models [42].
In our case, we chose to implement leave-one-out cross-validation, as these results will be the best estimate of how the model would behave when fully trained and applied to new data. Although this sort of cross-validation is a common approach with small datasets (e.g., [6] [43]), we do note that it has issues with overfitting and variable test results.
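For readers unfamiliar with the procedure, leave-one-out cross-validation can be sketched as follows: each participant is held out once, the model is fit on everyone else, and accuracy is the fraction of held-out participants labeled correctly. The one-nearest-neighbor classifier and the single-feature toy data below are stand-ins chosen to keep the sketch self-contained; they are not the models or data used in this study.

```python
def leave_one_out_splits(n_samples):
    """Yield (train_indices, test_index) with each sample held out once."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


def nearest_neighbor_predict(train_x, train_y, query):
    """Label the query with the class of its closest training point (1-NN)."""
    distances = [abs(x - query) for x in train_x]
    return train_y[distances.index(min(distances))]


def loocv_accuracy(xs, ys):
    """Fraction of held-out samples whose label is predicted correctly."""
    correct = 0
    for train, test in leave_one_out_splits(len(xs)):
        prediction = nearest_neighbor_predict(
            [xs[i] for i in train], [ys[i] for i in train], xs[test]
        )
        correct += prediction == ys[test]
    return correct / len(xs)


# Synthetic one-feature data for illustration only (not study data).
toy_x = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
toy_y = ["healthy", "healthy", "healthy", "AD", "AD", "AD"]
score = loocv_accuracy(toy_x, toy_y)
print(f"leave-one-out accuracy = {score:.2f}")
```

Because every fold tests on a single participant, each fold's result is either fully right or fully wrong, which is one reason why estimates from this scheme can be variable with small samples.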
Our machine learning predictions, based on a small amount of speech data recorded in suboptimal conditions, showed strong AUCs for classifying groups, and outperformed the judgements of an expert clinician and two traditional screening tests. Our intent with these analyses was not to pit "human versus machine," but rather to demonstrate that a machine learning approach can detect subtle diagnostic differences from just small samples of speech, and as such, could be a potential adjunct screener in a clinician's battery to reach those individuals who might not otherwise be seen. As with traditional screening tools, a concerning result would signal the need for a comprehensive dementia workup.
Despite the proof of concept for this approach, much remains to be done. The next steps are to measure the predictive ability of our models to identify at an early stage who will go on to decline over time. Of particular interest in these analyses would be the cognitively healthy participants identified in our current models as belonging to the aMCI group, and the aMCI individuals identified as belonging in the AD group. If these participants were indeed to subsequently display cognitive decline and convert to aMCI and AD respectively, then the predictive ability of our models would be confirmed.
Further, it remains unclear how well these models will perform with different speech and written language tasks, and this has important implications for future protocol development and to answer the question of whether a single ideal task can elicit the most accurate predictions. Finally, to address possible etiological heterogeneity in participant groupings, future studies would be strengthened by comparing clinical diagnostic categories against validated biomarkers of disease.
[...] is that speech can be recorded countless times without the confounding influence of practice effects or interrater variability. Hence, in future work, intra-individual variability should be measured via repeated testing at various time intervals to tease apart the effects of comorbidities, medications and the like on cognition. By demonstrating that state-of-the-art automated methods can successfully be applied to suboptimal speech data, we address both the issue of early identification of cognitive decline, and accessibility of health care. Consequently, we are one step closer to the development of a remote, low cost, sensitive and highly accessible tool for cognitive screening on a large scale.

Figure 2 :
Top left panel: Using the top 3 features, applied to all three groups. The three groups are ordered as expected, with the peaks of cognitively healthy, aMCI, and AD ordered left to right with some overlap between each.
[...] two groups showed much overlap and tended to be difficult to differentiate from one another based on even their most discriminable features.
Bottom left panel: Using the top 6 features, applied to cognitively healthy vs AD. The two groups show some overlap, especially right at the peak of the AD group, but generally have distinct distributions with more discriminability.
Bottom right panel: Using the top 6 features, applied to aMCI vs AD. Again, the two groups show much overlap and tend to be difficult to differentiate from one another based on even their most discriminable features.

Figure 3 :
Accuracy of each model type (ML: the machine learning model based on the [...]

Table 1 :
Demographic & descriptive characteristics of the participants (N=91). † The equivalent of a high school education is 12 years. ‡ Possible range of scores = 0-50. § Possible range of scores = 0-30.

Table 2 :
Summary of significant features extracted by NLP across two speech tasks. F-values are reported in brackets (all p-values < .05).

Table 3 :
Confusion matrices. First panel: the cognitively healthy, aMCI, AD classifier; Second panel: the cognitively healthy vs aMCI/AD classifier; Third panel: cognitively healthy vs aMCI in the most clinically relevant model; Fourth panel: cognitively healthy vs AD in the most clinically relevant and accurate model; Fifth panel: aMCI vs AD in the most clinically relevant model.