
Remote monitoring of heart failure exacerbations using a smartwatch

Abstract Heart failure (HF) involves cycles of remission and exacerbation, which are poorly characterized by static disease measures. Consumer wearables have an understudied potential for daily monitoring of HF symptoms. Here we report results from an observational cohort of free-living patients with HF, followed over a median of 94.5 d in the Ted Rogers Understanding Exacerbations of HF (TRUE-HF) study. The study measured the ability of Apple Watch data to predict peak oxygen uptake (pVO2) as measured using in-clinic cardiopulmonary exercise testing (CPET). A deep learning model was trained with data from 154 patients (46 women, 108 men) and validated on a held-out set of 63 patients (24 women, 39 men) for determining wearable-derived daily pVO2, which correlated strongly with CPET-measured pVO2 (Pearson’s correlation = 0.85). Each 10% drop in wearable-derived daily pVO2 was associated with a 3.62-fold increased hazard ratio (HR) for unplanned healthcare events (95% confidence interval (CI), 1.37–9.55; P < 0.01), which occurred at a median of 7.4 d after the first 10% drop in wearable-derived pVO2. These findings were validated in an independent external cohort from the All of Us Research Program using a cross-platform model that accounted for the reduced-sensor capacities available in this external cohort. Using this reduced-sensor variant of the model, drops in wearable-derived daily pVO2 were associated with unplanned healthcare utilization (HR 1.32, 95% CI 1.03–1.69; P = 0.03), which occurred at a median of 21 d after the first 10% drop in wearable-derived pVO2. These results indicate that wearable-derived daily pVO2 provides earlier and improved risk discrimination compared with existing wearable fitness estimates and established clinical markers and offers a scalable and generalizable approach for longitudinal HF research and monitoring.
Main Heart failure (HF) is a global health crisis affecting an estimated 64.3 million individuals worldwide1. The annual global cost of HF is estimated at US$346 billion, with a substantial portion attributed to hospitalizations, missed work and healthcare service utilization2,3. HF is also associated with reduced life expectancy, with a median survival of 3.2 years for women and 1.7 years for men4. Despite recent medical advances, patients with HF continue to experience a high risk of adverse outcomes, indicating a critical need for better prognostic markers to enhance risk stratification and enable more effective and timely interventions1. HF is characterized by periods of relative stability interspersed with acute exacerbations requiring timely intervention to prevent hospitalization5. In contrast, risk evaluation in HF traditionally relies on intermittent and static clinical evaluation tools, making accurate and timely risk stratification a challenge6. Many cardiac centers use cardiopulmonary exercise testing (CPET) to estimate 1-year prognosis, enabling stratification of patients based on established exercise parameters7,8. However, although serial CPET trends are clinically informative, its high cost, limited geographic accessibility and patient burden make daily or weekly monitoring impractical9. The clinically supervised 6-min walk test (6MWT) offers an alternative measurement of exercise intolerance in patients with HF10. However, evidence supporting its prognostic value compared to CPET is limited. In addition, the accuracy of unsupervised 6MWT compared to supervised is uncertain, restricting its widespread use in free-living remote monitoring10. The New York Heart Association (NYHA) functional classification is commonly used in practice, but is subjective and fails to capture a patient’s fluctuating clinical status11.
Thus, although these traditional tools for HF risk stratification provide essential insights into patient health, they cannot also account for the episodic course of HF. This highlights the need for better free-living prognostic markers to improve risk assessment and enable more effective and timely intervention. Remote patient monitoring modalities shift from static and intermittent measurements of patient status to daily measurements and have shown up to a 15% reduction in total hospitalizations12. The effectiveness of remote patient monitoring has been demonstrated in various studies, which report improved patient outcomes and reduced healthcare utilization when these systems are proactive and correctly implemented. Nevertheless, the demands of manual data entry for patients and the constant oversight by clinicians hinder adherence and broader implementation of these programs13. Advances in wearable technology and artificial intelligence offer promising avenues for addressing these challenges. Consumer-wearable devices, like Apple Watch, can provide near-continuous capture of a wide range of physiological data, including heart rate and physical activity, offering a promising avenue for passive, noninvasive, real-time health monitoring14,15,16,17. Wearables have been investigated to monitor heart rate and exercise for outpatient patients with HF and to detect atrial fibrillation18. Concerning cardiopulmonary fitness, the Apple cardio fitness estimate on Apple Watch can accurately estimate a critical prognostic metric from CPET: the peak oxygen consumption rate (pVO2) in a healthy population19. However, it remains unclear whether wearable-derived algorithms validated in nondiseased populations can be reliably translated to patients with HF, whose distinct pathophysiological profiles may require tailored assessment approaches. 
To investigate this knowledge gap, we initiated the Ted Rogers Understanding Exacerbations of Heart Failure study (TRUE-HF), a prospective 3-month observational cohort study of patients with HF using an Apple Watch in free living20. We hypothesized that: (1) consumer-wearable data can be used to estimate cardiopulmonary fitness of patients with HF in free-living conditions; and (2) daily fitness monitoring markers would predict worsening patient health as indicated by unplanned healthcare utilization. Results Wearable TRUE-HF cohort The primary goals of this observational study were to investigate the relationships among widely available consumer-wearable data, cardiopulmonary fitness and worsening HF in a free-living environment. Between December 2019 and April 2024, 217 patients with HF were enrolled in the TRUE-HF study (NCT05008692)21. Patients were provided with Apple Watch Series 6 and/or compatible iPhones for the study duration or allowed to use their existing compatible devices. All participants were instructed to capture pedestrian activities using their Apple Watch if they were going to partake in an activity; however, they were not instructed to perform additional pedestrian activities. Each patient completed an in-person study-entry clinic visit and 3-month end-of-study clinic visit, where they underwent formal CPET testing, bloodwork, clinical examination, a supervised 6MWT and a Tecumseh cube test. All demographic and clinical measurements recorded at the study-entry and 3-month end-of-study clinic visits were captured. Participants had a median observation period from study entry to end of study of 94.5 d. Apple Watch data were collected via HealthKit during clinic visits and free-living observations. During this window, daily self-administered surveys about patient health status and unplanned healthcare utilization were conducted (Fig. 1). 
We defined unplanned healthcare utilization as hospital admissions, unscheduled clinic visits or intravenous furosemide treatment. These unscheduled events were manually verified using electronic health records (EHRs) and physician notes, where available. Of the 217 enrolled patients, 191 completed the study, of whom 187 completed an end-of-study CPET. During the study period, 25 patients experienced decreased CPET pVO2 from study entry to end of study and 32 had unplanned healthcare utilization. Among the 32 total events in our cohort, 19 were confirmed via clinical records to be directly HF related through agreement of 2 board-certified cardiologists, 4 were confirmed as noncardiac related and 9 were self-reported hospitalizations. The nine self-reported hospitalizations were patient-reported outcomes that occurred in patients without centralized linking of healthcare records and thus could not be verified against EHRs or physician notes; 30 patients were excluded from the primary analysis because they died (n = 2), received a left ventricular assist device (n = 2), withdrew (n = 7), could not complete the end-of-study assessment process (n = 4) or had insufficient device usage (n = 15) (Fig. 2). Insufficient device usage was defined as wearing their Apple Watch <10 d throughout the 90-d free-living period (with a median of 4 d of usage for the 15 participants). Patient demographics are summarized in Table 1. We pre-allocated the last 50 successful end-of-study assessments for a held-out test to be conducted for evaluation only at study completion, reducing the risk for test-set overfitting bias from repeated model refinement22. Successful end-of-study assessments were defined as those completing an end-of-study CPET. This resulted in the first 154 patients being utilized for model development and training using cross validation (n = 137 successful assessments) and the last 63 patients for a held-out test (n = 50 successful assessments). 
All model performance results reported herein were calculated from that held-out test set. In our held-out test cohort (n = 63), 5 patients were excluded from evaluation due to insufficient device usage. Of the remaining 58 held-out patients, 50 completed the study and an end-of-study CPET. Eight experienced unplanned healthcare utilization events: five unexpected hospital admissions, two unscheduled clinic visits and one instance requiring intravenous furosemide treatment. Among these, four events were adjudicated as HF related, three were self-reported and one was unrelated to HF. Demographic details are highlighted in Extended Data Table 1. TRUE-HF wearable-based prediction model We developed an autoregressive transformer framework, termed the TRUE-HF model (Extended Data Fig. 1), which combines wearable data with patient-specific clinical information to forecast daily predictions. Our approach processes 30 d of wearable data, initially aggregated at 90-min intervals and progressively pooled into larger time windows, allowing the model to learn relationships at multiple resolutions, as detailed in Methods. Throughout our study, we considered predicted daily TRUE-HF pVO2 as a naturally continuous measure. We prespecified a clinically meaningful drop in CPET pVO2 as a ≥10% reduction, based on prior literature observing a ≥6% to 10% decrease in CPET pVO2 being associated with an increased risk of medium-to-long-term hospitalization or death in patients with HF23,24,25,26. Therefore, hereafter, a ≥10% drop in TRUE-HF pVO2 as predicted by our model is considered test positive. External validation in the All of Us cohort We conducted external validation on an independent patient cohort from the National Institutes of Health (NIH) All of Us Research Program which uses Fitbit wearables27,28. 
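The multi-resolution aggregation described for the TRUE-HF model (30 d of wearable data, aggregated at 90-min intervals and then pooled into larger windows) can be illustrated with a minimal sketch. This is not the authors' implementation: the actual pooling happens inside the transformer described in Methods, whereas the code below uses plain averaging, and the day and week pooling levels, the function names `pool` and `multi_resolution`, and the use of `None` for missing minutes are all assumptions made for illustration.

```python
from statistics import fmean

MINUTES_PER_BIN = 90                        # bin width stated in the text
BINS_PER_DAY = 24 * 60 // MINUTES_PER_BIN   # 16 bins of 90 min per day


def pool(values, factor):
    """Average consecutive groups of `factor` values.

    Missing samples are represented as None and skipped; a group with
    no valid samples yields None.
    """
    pooled = []
    for start in range(0, len(values), factor):
        group = [v for v in values[start:start + factor] if v is not None]
        pooled.append(fmean(group) if group else None)
    return pooled


def multi_resolution(per_minute):
    """Build coarser views of one per-minute signal (for example, heart rate).

    The day and week levels are hypothetical choices for this sketch.
    """
    bins_90min = pool(per_minute, MINUTES_PER_BIN)
    daily = pool(bins_90min, BINS_PER_DAY)
    weekly = pool(daily, 7)
    return {"90min": bins_90min, "day": daily, "week": weekly}


# Hypothetical input: 30 d of a constant per-minute heart rate.
levels = multi_resolution([60.0] * (30 * 24 * 60))
print(len(levels["90min"]), len(levels["day"]), len(levels["week"]))
```

A 30-d window yields 480 bins at the 90-min resolution, which is the sequence length a model would see before any further pooling.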
We identified 193 patients who closely matched our TRUE-HF cohort inclusion criteria, including confirmed diagnosis of HF documented in EHRs, sufficient wearable data coverage (defined as wearable use on at least 40% of study days, nonconsecutively) and detailed EHR records available for event adjudication (Extended Data Fig. 2). However, regarding HF severity, the NYHA class was missing in the EHRs for 1,653 of 1,664 (99%) of all All of Us patients with HF and wearable data, and BNP was missing in 917 of 1,664 (55%). Therefore, to more closely reflect the clinical severity observed in the TRUE-HF population, in which 77% were NYHA class ≥II, we added the inclusion criterion of at least one previously recorded unplanned healthcare event before enrollment. To establish a consistent starting point across patients in the All of Us cohort, we defined the study-entry time point as the first point at which both the index event had occurred and wearable data were available for each patient. Furthermore, due to data availability in the All of Us Research Program, we applied a more stringent definition of unplanned healthcare utilization, restricting events to inpatient hospital admission or intravenous furosemide administration. Outpatient visits were excluded because we could not reliably determine whether they were scheduled or unscheduled, and self-reported outcomes were not available in this dataset. Patient demographics are summarized in Extended Data Table 2. The All of Us external validation cohort had a median observation period of 120 d, during which 20 patients experienced at least one unplanned healthcare utilization event. Note that, to provide a calibration and adjustment period for patients initiating wearable use and to account for a refractory interval after the initial clinical encounter, we introduced a 7-d buffer after the study-entry date, which was not used in TRUE-HF.
To ensure timely model predictions aligned with TRUE-HF, which require only 30 d of wearable data before pVO2 predictions, while accommodating the necessary 7-d buffer period, we allowed predictions to begin after collecting 20 d of wearable data. To enable compatibility with data available in the All of Us Research Program, we trained and tested a reduced-sensor variant of the TRUE-HF pVO2 model, herein referred to as TRUE-HF-RS, which relies only on HeartRate, StepCount and a reduced clinical feature set (Methods). For fair comparison, this model was trained exclusively on the same training set as the TRUE-HF model (n = 154). Analysis of model prediction of absolute CPET pVO2 in the TRUE-HF cohort We trained the TRUE-HF framework to predict absolute CPET pVO2 (l min−1) measurements and performed regression analysis of this TRUE-HF pVO2 model against the end-of-study absolute CPET pVO2 measurement as a validated performance indicator in the held-out test set (n = 50 successful end-of-study assessments). Absolute pVO2 (l min−1) was chosen because it removes the need for linear indexing to body weight and alleviates the confounding effects of body mass on the model’s predicted outcomes. As a baseline, we compared TRUE-HF pVO2 performance against the Apple Watch model: the Apple cardio fitness VO2max algorithm. The TRUE-HF model delivers daily predictions by evaluating the preceding 30-d data and provides near-continuous remote assessment of cardiopulmonary fitness fluctuations, averaging one prediction per day over the observation period. Conversely, Apple’s algorithm produced fewer predictions given design requirements for minimal required physical activity levels, averaging 4.6 predictions for patients with CPET pVO2 < 1 l min−1, 6.3 for CPET pVO2 between 1 l min−1 and 2 l min−1 and 18.0 for patients with CPET pVO2 > 2 l min−1 over the observation period. Model performance was evaluated using Spearman’s coefficient (ρ), Pearson’s r and mean absolute error (m.a.e.)
on the held-out test set, comparing observed and predicted values for the end-of-study CPET test. Strong concordance was observed between the TRUE-HF pVO2 predictions the day before the patients’ end-of-study date and the gold-standard end-of-study CPET pVO2 measurements (Fig. 3a). TRUE-HF achieved a Pearson’s r of 0.85 and an m.a.e. of 0.25 l min−1 in the held-out test set. Spearman’s coefficient measures (Fig. 3b) also suggest the enhanced capability of the TRUE-HF model to rank patients with HF accurately based on their pVO2. Note, however, that a model might have a high correlation with a static clinical measurement but poor sensitivity to changes in the clinical measurement over time if the patients with higher prediction errors undergo the largest changes. Thus, we measured each model’s ability to detect clinically meaningful declines at the end-of-study clinic visit, measured against study-entry clinical measurements. Positive patients were defined as those with a prespecified ≥10% decline in CPET pVO2 from study-entry to end-of-study clinic visit. For each model and patient, we calculated a percentage difference between the first and last wearable-derived pVO2 on study for comparison. The area under the receiver operating characteristic (AUROC) metric was used to evaluate the diagnostic accuracy29,30. The TRUE-HF model pVO2 predictions showed significant accuracy for detecting end-of-study declines in CPET pVO2, achieving an AUROC of 0.82 (95% CI 0.69–0.92) compared to Apple’s algorithm with an AUROC of 0.52 (95% CI 0.39–0.70), based on Delong’s two-tailed test (P < 0.01) (Fig. 3c). TRUE-HF daily pVO2 monitoring in unplanned healthcare utilization in TRUE-HF We performed a secondary exploratory analysis to investigate an unexamined potential relationship between wearable-derived daily pVO2 and near-term unplanned healthcare utilization (defined above) in both the TRUE-HF held-out and All of Us cohorts. 
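The evaluation metrics named above (Pearson's r and m.a.e. for agreement with CPET pVO2, and AUROC for detecting a ≥10% decline) can be written out in a few lines. The sketch below is a plain-Python illustration with hypothetical toy numbers, not the study data or the authors' analysis code; the AUROC is computed from its pairwise definition (the probability that a randomly chosen positive patient scores higher than a randomly chosen negative one, with ties counted as one half).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_absolute_error(observed, predicted):
    """Average absolute difference, in the units of the inputs (here l/min)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def auroc(scores, labels):
    """Pairwise AUROC: higher score should indicate the positive class."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy values, chosen only to exercise the functions.
cpet_pvo2 = [1.0, 1.5, 2.0, 2.5]        # measured, l/min
wearable_pvo2 = [1.2, 1.4, 2.1, 2.3]    # predicted, l/min
r = pearson_r(cpet_pvo2, wearable_pvo2)
mae = mean_absolute_error(cpet_pvo2, wearable_pvo2)

pct_drop = [0.1, 0.4, 0.35, 0.8]        # wearable-derived decline score
declined = [0, 0, 1, 1]                 # 1 = >=10% decline on CPET
auc = auroc(pct_drop, declined)
print(round(r, 3), round(mae, 3), auc)
```

The pairwise form is quadratic in the number of patients, which is acceptable for cohorts of this size; a rank-based formulation gives the same value for larger datasets.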
The TRUE-HF pVO2 model requires 30 d of data and, hence, our retrospective risk-stratification analysis for each patient in the held-out TRUE-HF test set commenced on day 31. In the TRUE-HF held-out cohort, all held-out test patients with sufficient data, including those unable to complete an end-of-study clinic visit, were included, resulting in 58 patients being analyzed. All patients were censored at the time of unplanned healthcare utilization or end-of-study visit. We calculated the AUROC using the maximum percentage drop in the TRUE-HF model, wearable-derived, daily pVO2 from first model prediction to the prediction on that day for each patient before censoring, and compared it to canonical clinical static measures at the time of the study-entry clinic visit (CPET pVO2, 6MWT distance (6MWTD), NYHA class and NT-proBNP) and clinical HF models: Seattle Heart Failure Model (SHFM), Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) and PRognostic Evaluation During Invasive CaTheterization for Heart Failure (PREDICT-HF)31,32,33. Patients with drops in wearable-derived daily pVO2 were at significantly higher risk of unscheduled healthcare utilization on study (Fig. 4). The predictive power of daily pVO2 predictions was measured by AUROC at 0.77 (95% CI 0.62–0.90) for predicting these events. The TRUE-HF model’s ability to predict unplanned healthcare utilization was significantly higher than that of baseline clinical metrics: CPET pVO2 (P = 0.04, adjusted P (Padj) = 0.05), NT-proBNP (P < 0.01, Padj = 0.01), 6MWTD (P = 0.05, Padj = 0.05), NYHA (P = 0.01, Padj = 0.02) based on Delong’s one-tailed test and Padj values from the Benjamini–Hochberg correction for the false discovery rate (0.05). Our risk-stratification analysis, focused on our prespecified ≥10% drop in TRUE-HF daily pVO2 (see above), is represented as a specific point on the ROC curve (indicating the sensitivity and specificity at this threshold) in Fig. 4a.
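The risk score used in this analysis, the maximum percentage drop in daily pVO2 relative to the first model prediction, and the prespecified ≥10% threshold can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input is taken to be one prediction per day already truncated at censoring, and any smoothing or handling of missing days used in the study is not represented.

```python
def max_pct_drop(daily_pvo2):
    """Largest percentage decline relative to the first prediction.

    `daily_pvo2` holds one predicted value per day (l/min), truncated at
    the censoring time. The result is 0.0 if the value never falls below
    the first prediction.
    """
    baseline = daily_pvo2[0]
    return max(100.0 * (baseline - v) / baseline for v in daily_pvo2)


def first_drop_day(daily_pvo2, threshold_pct=10.0):
    """Index of the first day whose drop from baseline meets the threshold.

    Returns None if the threshold is never reached (test negative).
    """
    baseline = daily_pvo2[0]
    for day, value in enumerate(daily_pvo2):
        if 100.0 * (baseline - value) / baseline >= threshold_pct:
            return day
    return None


# Hypothetical trajectories, not study data.
declining = [2.0, 1.9, 1.7, 1.8]
stable = [2.0, 2.05, 1.95]
print(max_pct_drop(declining), first_drop_day(declining), first_drop_day(stable))
```

Using the continuous `max_pct_drop` value as the score gives the full ROC curve, while `first_drop_day` corresponds to the single operating point at the 10% threshold and also supplies the lead time before an event.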
At this ≥10% drop, TRUE-HF sensitivity was 88% and specificity was 62% across the complete observation period. Time-varying, extended Simon and Makuch’s Kaplan–Meier cumulative risk curves were drawn for an observed ≥10% drop in their TRUE-HF predicted daily pVO2 (Fig. 4c). During free living, a TRUE-HF wearable-derived daily pVO2 drop was found in 26 of 58 patients. Of the 26 patients, 7 (26.9%) in the drop group experienced unplanned healthcare utilization, compared with 1 (3.1%) of the 32 participants in the no-drop detected group. Drops detected by the TRUE-HF model preceded unplanned healthcare utilization by a median of 7.4 d (interquartile range (IQR), 4.5–8.5 d). Longitudinal qualitative analysis of forward-filled Locally Weighted Scatterplot Smoothing (LOWESS)-smoothed, TRUE-HF-predicted trajectories demonstrated that patients experiencing an unplanned healthcare event exhibited a progressive decline in daily TRUE-HF pVO2 compared to event-free patients, who maintained stable or slightly increasing pVO2 values (Fig. 4c). Extended Cox’s proportional hazards models and two-sided Wald’s test quantified the unadjusted HR for the association between continuous-change TRUE-HF daily pVO2 covariate and subsequent unplanned healthcare utilization, demonstrating that each 10% drop in pVO2 was associated with a significantly elevated risk (HR 3.62, 95% CI 1.37–9.55, P < 0.01; Table 2). Sensitivity analyses adjusting for independent covariates: age, sex, ethnicity, body mass index, smoking or left ventricular ejection fraction confirmed that the association remained significant (Extended Data Table 3). Further analysis of common clinical risk measurements and traditional clinical HF models (measured at study-entry clinic visit) did not achieve a significant HR (Table 2). 
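The sensitivity, specificity and hazard ratio reported above can be cross-checked from the numbers given in the text. The sketch below rebuilds the 2×2 table from the stated counts (26 patients with a drop, 7 of whom had an event; 32 without a drop, 1 of whom had an event) and back-computes the Wald statistic from the reported HR and 95% CI. The second step assumes the interval is a symmetric Wald interval on the log-hazard scale, which is an assumption of this check rather than something the text states; it does not refit the extended Cox model.

```python
from math import erfc, log, sqrt

# 2x2 table from the counts reported for the held-out cohort (n = 58).
tp = 7            # drop detected, event occurred
fp = 26 - 7       # drop detected, no event
fn = 1            # no drop detected, event occurred
tn = 32 - 1       # no drop detected, no event

sensitivity = tp / (tp + fn)   # 7 / 8
specificity = tn / (tn + fp)   # 31 / 50


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the log-scale standard error, z statistic and two-sided
    P value from a hazard ratio and its 95% CI (assumed symmetric on
    the log scale)."""
    se = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(hr) / se
    p = erfc(abs(z) / sqrt(2))   # two-sided normal tail probability
    return se, z, p


se, z, p = wald_from_ci(3.62, 1.37, 9.55)
geometric_midpoint = sqrt(1.37 * 9.55)   # should sit close to the HR

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.2f}")
print(f"z={z:.2f} p={p:.4f} midpoint={geometric_midpoint:.2f}")
```

The counts give a sensitivity of 87.5% and a specificity of 62%, matching the rounded values in the text, and the back-computed P value falls below 0.01, in line with the reported result.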
Using the reduced-sensor model, TRUE-HF-RS, on the internal held-out (n = 63), we observed modest discrimination for unplanned healthcare utilization, with an AUROC of 0.57 (95% CI 0.35–0.79) and a positive but nonsignificant association in time-to-event analysis (HR = 1.79, 95% CI 0.77–4.16, P = 0.18; Extended Data Table 4). All of Us external validation of unplanned healthcare utilization External validation in the All of Us cohort using the TRUE-HF-RS model (described above) reveals positive trends in risk characterization of unplanned healthcare utilization. Consistent with the trends observed in the TRUE-HF test set, each 10% drop in predicted daily pVO2 was associated with a higher risk of unplanned healthcare use in All of Us (HR 1.32, 95% CI 1.03–1.69, P = 0.03; Table 2). The TRUE-HF-RS model predicted unplanned healthcare utilization a median of 21 d (IQR 9.25–67.25) before events. Repeating sensitivity analyses for All of Us, we found that our HR remained significant against each covariate (Extended Data Table 5). Across the observational period in patients with HF in the All of Us cohort, the TRUE-HF-RS model at our prespecified threshold of ≥10% drop in daily pVO2 showed a sensitivity of 50% and a specificity of 59% for unplanned healthcare use (Fig. 4d). Qualitatively, we also observed similar trends in patients with and without events, as shown in Fig. 4e, where predicted pVO2 decreases over time in patients with events compared to those without events. Aligning with these observed trends in the All of Us Research Program, the time-dependent landmark AUROC evaluated over month-long windows was 0.51 (95% CI 0.33–0.68) at day 10 versus 0.72 (95% CI 0.39–0.92) by day 80 (Extended Data Fig. 3). Extension of method to 6MWT distance in the TRUE-HF cohort We extended our framework and methods to 6MWTD as an additional variable that could be predicted using wearable devices and further examined its prediction of unplanned healthcare utilization. 
A separate TRUE-HF model was trained for 6MWTD and compared to Apple’s algorithm, the sixMinuteWalkTestDistance. As with the CPET pVO2 assessment, a drop in 6MWTD (10% reduction in 6MWTD from baseline to end-of-study follow-up) and healthcare utilization was evaluated. Extended Data Fig. 4 highlights the 6MWTD prediction using biometrics from Apple Watch. The sixMinuteWalkTestDistance in HealthKit performed comparably to the 6MWTD study model across all correlation metrics and effectively detected drops in 6MWTD performance (Extended Data Fig. 5). However, neither achieved statistical significance for unplanned healthcare utilization. Feature analysis of the TRUE-HF model To evaluate the contribution of structured exercise sessions, we masked wearable data collected during the monthly unsupervised 6MWT and Tecumseh cube tests. Exclusion of these structured sessions resulted in negligible differences in predictive performance (m.a.e. change of 0.00575 in predicted pVO2; Supplementary Fig. 2). Specifically, regression performance and classification accuracy for detecting 10% drops in predicted pVO2, as well as predicting unplanned healthcare utilization, remained unchanged, with AUROCs of 0.82 (95% CI 0.69–0.92) and 0.77 (95% CI 0.62–0.90), respectively. We also examined the variation in TRUE-HF prediction performance with shorter input window durations (20 d and 10 d) during both training and evaluation, where we observed reduced model performance (unplanned healthcare utilization at 20-d windows: AUROC of 0.61 (95% CI 0.40–0.80); at 10-d windows: AUROC of 0.61 (95% CI 0.39–0.82)); compared to the complete 30-d input window (AUROC of 0.77 (95% CI 0.62–0.90)) (Supplementary Tables 1 and 2), indicating that longer trends may provide more reliable information. 
Saliency-based analyses revealed that wearable sensor data, especially AppleExerciseTime, StepCount and OxygenSaturation, contributed to model predictions, with higher activity levels generally associated with a higher predicted pVO2 (Supplementary Fig. 3). Clinical variables (age, sex, ethnicity and furosemide dosage) also influenced predictions but exhibited context-dependent effects (Supplementary Fig. 4), with no consistent directional trends observed independently. Importantly, as wearable technologies have shown reduced performance on darker skin, subgroup analysis between white and non-white participants in both cohorts showed comparable model performance across both the TRUE-HF and the All of Us validation sets34 (Supplementary Table 4). Finally, ablation analyses assessing the independent and combined contributions of wearable and clinical inputs (Supplementary Table 3) demonstrated that wearable data alone provided substantial predictive capacity (unplanned healthcare utilization AUROC of 0.60 (95% CI 0.42–0.78)). In contrast, predictions based solely on clinical variables were weaker (unplanned healthcare utilization AUROC of 0.52 (95% CI 0.32–0.71)). However, integration of both wearable and clinical data (that is, TRUE-HF model) yielded the strongest performance across all outcomes. Discussion The results of this study provide strong evidence to support the potential of consumer-wearable data in estimating cardiorespiratory fitness and building remote monitoring methods for HF. Consumer-wearable data in a free-living environment, analyzed using a transformer model, demonstrated strong concordance with gold-standard CPET measurements. Wearable-derived daily cardiopulmonary fitness assessments enabled new insights: clinically meaningful drops in daily pVO2 forecast imminent unplanned healthcare utilization events in outpatient patients with HF, with external validation of a reduced-sensor model that maintains a positive risk relationship. 
Finally, ancillary analysis found that wearable data showed strong concordance with 6MWTD. The findings herein are an essential addition to risk evaluation in HF for several reasons. First, our findings demonstrated the promise of wearables for accurate serial pVO2 assessment in an outpatient HF population, expanding the ability to assess pVO2 from the clinic to a daily, free-living environment. The TRUE-HF model achieved high concordance with gold-standard CPET measurements of pVO2, with an m.a.e. within the previously reported CPET test–retest CI for patients with HF35. This is further exemplified by the TRUE-HF model’s high accuracy in predicting decreases in CPET pVO2 (10% drops) between study-entry and end-of-study clinic visits. The high accuracy of wearable-derived pVO2 using TRUE-HF enables a potential shift from static, infrequent clinical CPET to frequent, dynamic, daily pVO2. Second, shifting from infrequent clinical CPET to daily pVO2 monitoring allows detection of previously unrecognized clinically meaningful declines in cardiopulmonary fitness (10% daily drops in pVO2) that preceded unplanned healthcare utilization. The TRUE-HF model enhances early detection of deterioration, acting as an early warning signal to identify patients at risk before clinical symptoms appear. At a 10% drop, the TRUE-HF model demonstrated a sensitivity of 88% and specificity of 62% in predicting unplanned healthcare utilization. The sensitivity underscores the TRUE-HF model’s effectiveness in accurately identifying low-risk patients, thereby ruling out individuals unlikely to require immediate intervention and supporting early low-risk intervention (a clinical touchpoint) for higher-risk individuals. Prioritizing sensitivity is crucial, because the early identification of at-risk patients enables proactive measures, such as expedited clinical visits, which can reduce the risk of unplanned healthcare utilization36. 
Continuous wearable-derived changes in daily pVO2 were strongly associated with unplanned healthcare utilization, with a significant HR per 10% drop. Static measures inherently capture a single snapshot of physiological status, whereas our model captures evolving physiological changes closer to the time of clinical events, potentially enhancing predictive performance. The near-continuous monitoring of TRUE-HF pVO2 represents a more dynamic and sensitive risk-stratification method or a potential dynamically monitored endpoint in future trials. Thus, although practical, conventional static metrics, such as NT-proBNP levels, CPET pVO2, 6MWTD or NYHA functional class, failed to provide the same proactive early warning signal as our model, which leverages a 30-d window of wearable data to predict unplanned healthcare utilization a median of 7.4 d before it occurred36. External validation analysis in the All of Us Research Program reaffirmed that a drop in daily wearable-derived pVO2 remains a meaningful predictor of short-term unplanned healthcare utilization. This association held despite reduced-sensor and demographic differences. Specifically, the All of Us cohort provided fewer Fitbit per min measurements, limited to HeartRate and StepCount, compared to the more comprehensive Apple Watch metrics available in the original cohort. There was no AppleExerciseTime or OxygenSaturation, two measurements highlighted as necessary in our feature analysis. Although predictive performance decreased modestly with fewer wearable sensors, the consistent directionality, the significance of HR and the predictive lead time support the practical feasibility of wearable-derived daily pVO2 in real-world heterogeneous settings. Furthermore, our findings indicate that reliance on specific exercise thresholds to trigger health predictions poses a barrier to tracking health in sicker patients, whereas use of longer-term trends yields more consistent measurements. 
Patients with HF often suffer from exercise intolerance due to impaired pulmonary vasodilatation, reduced cardiac reserves, skeletal muscle dysfunction and other comorbidities that may limit their ability to trigger intensity-based exercise thresholds. For instance, at the time of the study, Apple’s pVO2 algorithm required a patient to reach at least 60% of their maximum heart rate and certain activity parameters before observations were reported. Thus, we noted far fewer predictions in sicker patients with HF. In contrast, focus on trends over a longer time window, as with the TRUE-HF pVO2 model, yielded consistent daily predictions for all patients in our cohort, including those with more advanced HF. The 6MWTD predictions were also assessed, allowing for a separate comparison of wearable-derived remote measurements. Although our findings indicate that wearables can reliably track daily physical activity and predict the 6MWTD measured in the clinic, the direct correlation between changes in 6MWTD and risk of unplanned healthcare utilization was less pronounced. It should be noted that the HR for remote 6MWTD measurements exceeds that for static 6MWTD measurements. However, this difference did not reach statistical significance. Differences in downstream unplanned healthcare utilization may stem from inherent distinctions between the 6MWT and CPET as clinical endpoints. The 6MWT reflects different physiological constructs from CPET. The CPET pVO2 captures maximal volitional capacity (validated by achieving a respiratory exchange ratio of ≥1.1 during CPET). In contrast, 6MWT may be more influenced by submaximal efforts. Thus, although 6MWTD indicates overall health status and functional ability, it may not independently predict acute clinical events in the HF population. Despite the promise of remote monitoring with wearables, challenges remain. Remote monitoring with wearables can generate over 5 gigabytes of data per patient per week.
Analysis of these data to identify biomarker signatures is best suited for modern artificial intelligence techniques, particularly deep learning (DL). We used a transformer-based backbone to interpret wearable data sequences and assessed functional changes in cardiorespiratory fitness among patients with HF. Our new TRUE-HF DL model also employs pooling-based methods, allowing it to assess 30 d of wearable data and aggregate them into increasingly larger periods for analysis while retaining temporal relationships37. TRUE-HF looks at how each hour fits into the rest of that day, how that day fits into the week and how that week fits into 30 d. In contrast, other artificial intelligence methods treat each moment independently; thus, they struggle to understand how a Tuesday workout might affect activities on Thursday38. Here we used the TRUE-HF transformer model to understand wearable data sequences and their ability to measure functional changes in cardiorespiratory fitness in patients with HF, which has not been previously explored. The study has some limitations that merit consideration. First, the necessity of dividing our TRUE-HF cohort into a development and final held-out test set reduced the sample size available for testing. Although the TRUE-HF model’s predictive power for unplanned healthcare utilization is highly encouraging, and our patient enrollment was highly diverse, the low incidence of events among our patient cohort limited our ability to perform subanalyses of our results for specific subgroups of patients with HF. For example, in patients with HF with preserved ejection fraction, only six were present in the test set. Accordingly, the present study was not powered to support phenotype-specific inference and subgroup analyses were considered exploratory. Consequently, our sample size constrained the feasibility of fully adjusted multivariate analyses without risking overfitting. 
Future studies with a larger validation set will be needed to fully assess the independent prognostic value of TRUE-HF pVO2 drop when comprehensively adjusted for potential confounders. Traditional prognostic markers, including NT-proBNP, baseline pVO2 and 6MWT distance, showed limited prognostic utility in our analyses. Although point estimates were generally consistent with established prognostic relationships, the corresponding HRs did not reach statistical significance, likely reflecting limited statistical power given the small number of clinical events. Similar patterns were observed for established static risk scores (SHFM, MAGGIC and PREDICT-HF), which are typically calibrated for 1-year to 2-year risk horizons and offer only modest discrimination in many settings. Finally, Apple’s pVO2 estimates may be biased upward under conditions common in this cohort. Heart rate-limiting medications can blunt exercise heart-rate responses and may lead to overestimation by Apple’s pVO2 algorithm. In addition, not all participants engaged in sufficient outdoor activity, which may further reduce Apple’s pVO2 accuracy19. The findings from the TRUE-HF observational cohort study provide compelling evidence for using consumer-wearable devices to remotely monitor the daily cardiopulmonary fitness of patients with HF in an outpatient, free-living environment. Specifically, drops in wearable-derived daily pVO2 are early indicators of unplanned HF-related healthcare utilization. The potential of the TRUE-HF model to revolutionize remote patient monitoring in cardiology is a cause for optimism. This observed association between detected pVO2 deteriorations and an increased risk of unplanned healthcare utilization underscores the need for further research. Methods Ethics statement The TRUE-HF study (NCT05008692) was conducted under approved protocols from the University Health Network (UHN) Research Ethics Board (Toronto, Canada; REB no. 20-5205). 
Written informed consent was obtained from all participants before enrollment. Details of the study protocol are available21 and a summary is provided herein. The statistical analysis plan is included in Supplementary information. Retrospective analyses of the All of Us Research Program used de-identified data accessed through its controlled-access platform. The program operates under a centralized institutional review board and all participants provided written informed consent for data collection and secondary research use. Analyses complied with program policies and applicable ethical and regulatory requirements. Study participants of the TRUE-HF observational study The TRUE-HF study (NCT05008692) enrolled outpatients with HF receiving care at the UHN (Fig. 1). Eligible participants were aged ≥18 years and could adequately comprehend English independently or with a caregiver’s help. Research coordinators provided study information at the time of informed consent and contacted patients for follow-up. During the initial enrollment visit (study entry), we provided patients guidance on setting up their Apple Watch. During this session, we also educated patients on how to use Apple Watch. The appropriate electronic case report form (eCRF) and electrocardiogram (ECG) applications were downloaded and available on iPhone. Apple Watch ECG has only been validated for patients aged >22 years; therefore, only participants older than 22 years were asked to download the Apple ECG app. Participants interacted with an eCRF iOS mobile application. The data gathered in the application were not used to treat the patient. All patients underwent CPET, comprehensive bloodwork, clinical examination and a supervised 6MWT during study-entry and end-of-study clinic visits. All demographic and clinical measurements recorded during the study-entry and end-of-study clinic visits were captured in Research Electronic Data Capture, ensuring data transcription for the study cohort. 
In collaboration with Apple developers, an eCRF iOS mobile application was developed using Swift to communicate study information, conduct daily surveys and securely gather de-identified HealthKit wearable data from patients for the duration of the study. The wearable-derived data were not used to inform clinical decision-making. Daily surveys captured unplanned healthcare utilization events during the free-living observation period. Patients were instructed to complete daily surveys assessing the following symptoms: increasing shortness of breath, leg swelling, palpitations, chest pain, light-headedness and fainting. In addition, the daily surveys assessed whether a patient had required any of the following in the last 24 h: changes to medication, intravenous furosemide, an unscheduled health visit, an emergency room visit and/or hospital admission. Once a month, patients were also instructed to complete an unsupervised 6MWT and a Tecumseh cube test. Instructional videos could be used asynchronously to support these tests. We excluded patients with nonadherence, defined as wearing their Apple Watch for <1 d throughout the 90-d free-living period. In deriving prediction models, there is no hypothesis test to guide sample size calculation. Therefore, we calculated sample size requirements for various CI widths around clinically acceptable measures of model discrimination. The study’s sample size was determined using a classification-based approach to predict a >10% decline in pVO2, a threshold previously associated with worse outcomes in patients with HF23,24,25,26. With an assumed AUROC of 0.70, a sample size of n = 200 participants with completed CPETs was estimated to provide a robust lower bound of the CI for the study objective, split between model development (n = 150) and held-out testing (n = 50).
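To illustrate how a CI width around an assumed AUROC can inform sample size, the sketch below uses the Hanley and McNeil approximation for the standard error of an AUROC. The text does not state which formula was used or what proportion of patients was expected to show a >10% decline, so the function names and the 25% decliner proportion are assumptions made only for illustration.

```python
import math

def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Approximate standard error of an AUROC (Hanley and McNeil approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

def auc_confidence_interval(auc, n_total, frac_pos, z=1.96):
    """Normal-approximation CI for a given cohort size and positive fraction."""
    n_pos = round(n_total * frac_pos)
    n_neg = n_total - n_pos
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return auc - z * se, auc + z * se

# Assumed AUROC of 0.70 (from the text); the 25% positive fraction is hypothetical.
for n_total in (50, 150, 200):
    lower, upper = auc_confidence_interval(0.70, n_total, frac_pos=0.25)
    print(f"n={n_total}: 95% CI approx {lower:.3f} to {upper:.3f}")
```

Under these assumptions the interval narrows as the cohort grows, which is the trade-off a CI-width-based sample size calculation weighs.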
To prevent analytical bias, model evaluation on the held-out test set was prespecified and conducted only after the study was completed and all participants had exited follow-up. External validation cohort: All of Us The external validation cohort was constructed using data from the NIH All of Us Research Program v8 to validate the unplanned healthcare utilization experiments. Initially, 1,664 participants within the All of Us dataset had a documented diagnosis of HF and were available for analysis using wearable data from Fitbit devices. Among these, 400 individuals were excluded due to incomplete or missing EHR data necessary for event adjudication and accurate cohort characterization (Extended Data Fig. 2). To align the All of Us cohort’s clinical severity with the TRUE-HF population, we restricted the cohort to include only patients with documented prior unplanned healthcare utilization. We defined unplanned events as inpatient hospitalizations (excluding planned procedures) or intravenous furosemide administration40. For each participant, the study-entry date was set as the later of the following two dates: the discharge date of their qualifying unplanned healthcare event or the first day of available wearable sensor data collection. To ensure adequate data availability, we excluded participants with <30 d of wearable data after study entry. Participants were then filtered to ensure adequate wearable data coverage, defined as at least 40% daily measurement coverage during the observational period after study entry, resulting in the exclusion of additional participants. Finally, we defined the observation endpoint (‘end-of-study’ visit) as either the occurrence of a second unplanned healthcare utilization event within 120 d or the completion of a follow-up period that matched the TRUE-HF cohort median duration (approximately 94.5 d), with a maximum of 120 d. 
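The cohort construction rules above (study entry as the later of two dates, at least 30 d of wearable data and at least 40% coverage) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and coverage is assumed here to mean the share of calendar days in the observation window with any wearable data.

```python
from datetime import date, timedelta

MIN_DAYS_WITH_DATA = 30  # from the text: exclude <30 d of wearable data
MIN_COVERAGE = 0.40      # from the text: at least 40% daily measurement coverage

def study_entry_date(discharge: date, first_wearable_day: date) -> date:
    """Later of the qualifying discharge date and the first day of wearable data."""
    return max(discharge, first_wearable_day)

def has_adequate_wearable_data(entry: date, obs_end: date, wear_days: set) -> bool:
    """Assumption: coverage = days with any data / calendar days in [entry, obs_end]."""
    days_in_window = (obs_end - entry).days + 1
    observed = sum(1 for d in wear_days if entry <= d <= obs_end)
    return observed >= MIN_DAYS_WITH_DATA and observed / days_in_window >= MIN_COVERAGE

# Hypothetical participant: discharged 10 Jan, wearable data available since 1 Jan.
entry = study_entry_date(date(2023, 1, 10), date(2023, 1, 1))
obs_end = entry + timedelta(days=94)
every_other_day = {entry + timedelta(days=i) for i in range(0, 95, 2)}
print(entry, has_adequate_wearable_data(entry, obs_end, every_other_day))
```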
Participants who did not meet either endpoint criterion were excluded, resulting in a final external validation cohort comprising 193 individuals. Patient demographics are summarized in Extended Data Table 2. Wearable data and feature engineering The following data were collected, with informed consent from study participants, from Apple Watch through HealthKit during the approximately 90-d free-living period (that is, excluding days of baseline and end-of-study follow-up clinic visits): step count, exercise time, distance traveled, stand time, active energy burned, basal energy burned, heart rate, heart rate variability and O2 saturation. These variables were selected because they were the most frequently and consistently recorded during the interim analysis of the monitoring period; each is defined in the Apple HealthKit documentation41. A standardized summarization protocol addressed the varying temporal resolutions across data types. First, abnormal data records were removed using an outlier approach: records with values >3 s.d. from the population mean for each data type were removed. Next, we constructed a representation of the wearable data that could integrate large-scale data for downstream usage. Specifically, we normalized the disparate data streams by synthesizing 90-min aggregated metrics (mean, median, minimum, maximum and s.d.) of the HealthKit variables defined above; the sum was used instead of the mean to more accurately capture the overall exercise quality of these HealthKit data types during the 90-min time window. To maintain an estimate of variable trajectories during periods of sensor nonrecording, we employed intrapatient forward-filling imputation to address gaps in the 90-min summary data, thereby preserving the autoregressive integrity of the time series.
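The summarization steps described above (3-s.d. outlier removal, 90-min aggregation and intrapatient forward filling) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the published preprocessing code: the function names are hypothetical, the population s.d. is used, the `cumulative` flag stands in for the data types that are summed rather than averaged, and windows before the first recording are left empty.

```python
from datetime import datetime, timedelta
from statistics import mean, median, pstdev

WINDOW = timedelta(minutes=90)

def drop_outliers(values, pop_mean, pop_sd, k=3.0):
    """Keep values within k s.d. of the population mean for this data type."""
    return [v for v in values if abs(v - pop_mean) <= k * pop_sd]

def summarize(records, start, n_windows, cumulative=False):
    """Aggregate (timestamp, value) records into consecutive 90-min windows."""
    bins = [[] for _ in range(n_windows)]
    for ts, value in records:
        i = (ts - start) // WINDOW  # integer index of the 90-min window
        if 0 <= i < n_windows:
            bins[i].append(value)
    summaries = []
    for vals in bins:
        if not vals:
            summaries.append(None)  # no sensor recording in this window
            continue
        summaries.append({
            "center": sum(vals) if cumulative else mean(vals),
            "median": median(vals),
            "min": min(vals),
            "max": max(vals),
            "sd": pstdev(vals),
        })
    return summaries

def forward_fill(summaries):
    """Carry the last observed summary forward within one patient."""
    filled, last = [], None
    for s in summaries:
        if s is not None:
            last = s
        filled.append(last)
    return filled

# Hypothetical heart-rate records for one patient.
start = datetime(2024, 1, 1)
records = [
    (start + timedelta(minutes=10), 60.0),
    (start + timedelta(minutes=50), 80.0),
    (start + timedelta(minutes=200), 70.0),
]
filled = forward_fill(summarize(records, start, n_windows=4))
print([s["center"] for s in filled])
```

Forward filling only uses a patient's own earlier windows, so no information from later windows leaks backward in time.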
TRUE-HF framework details Using wearable data together with patient-specific clinical information such as sex, race, age, prescription dosages, weight and height, our model predicts an individual patient’s cardiopulmonary fitness and changes in their fitness over time. All nine wearable-derived features presented in Fig. 1b were included as model inputs. Our new method leveraged a contextualized DL model to retain and analyze temporal trends across 30 d of patient-wearable data, providing near-continuous daily monitoring through next-day predictions. To achieve this, our model incorporated three distinct components: (1) it contextualized temporal representations of the data using a bottom-up approach, extending from 90-min intervals to full-day aggregation; (2) it integrated patient-specific clinical information directly into the wearable data features, allowing for adaptive feature calculations; and (3) it explicitly considered the temporal constraints, recognizing that daily activities are influenced by preceding days, and used this to make ongoing predictions for each day. The TRUE-HF model (Extended Data Fig. 1) processed 30 d of wearable data, starting with sequential 90-min summaries and assembling them into larger time windows, allowing the model to learn temporal relationships at different temporal resolutions while improving processing efficiency. This approach is embodied in the model architecture, a bottom-up variant of the transformer model, which optimizes the feature map and reduces the temporal resolution through pooling37,42,43. HealthKit data are first tokenized through one-dimensional convolutions to collapse features along the temporal domain44.
The input is then processed through a TRUE-HF block consisting of a transformer layer followed by a pooling layer. The transformer layer learns relationships within the temporal resolution in each TRUE-HF block. Subsequently, the pooling layer aggregates pairs of consecutive time points (that is, from 90 min to 180 min). Using four consecutive TRUE-HF blocks, we effectively analyzed time resolutions of 90 min, 180 min, 360 min, 720 min and 1,440 min (daily resolution). The final prediction layer then aggregated information across the previous 30 d, inclusive, to predict the current day’s measurement. To enhance the model’s understanding of wearable data, TRUE-HF incorporates patient-specific clinical information (described above), enabling it to learn different operations based on input attributes rather than treating all inputs equally. We achieved this by augmenting TRUE-HF with a feature-wise linear modulation block that modulates activations in the neural networks based on clinical details45. The model used only demographic information from the baseline clinic visit to maintain temporal causality. Finally, by incorporating a causal self-attention mechanism, we explicitly constrained temporal learning to move forward only (autocorrelation) in the TRUE-HF framework42,46,47. This mechanism introduces autoregressive properties into temporal learning and ensures that predictions for any given day are influenced only by data from that day and preceding days. We leveraged causal attention to enable additional semi-supervised training, as described below47. Model training All iterative model training was performed exclusively within the first 154 patients, whereas the final 63 patients were used solely for held-out testing. We excluded the days of the study-entry and end-of-study clinic visits from training.
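The temporal bookkeeping behind the TRUE-HF blocks described above (pairwise pooling from 90-min to daily resolution, plus causal masking) can be sketched without a DL framework. This is only a shape-level illustration under assumptions: the function names are hypothetical, average pooling is assumed because the text does not name the pooling operator, and no transformer weights or feature-wise modulation are modeled.

```python
def pool_pairs(sequence):
    """Average consecutive pairs of time points, halving the temporal resolution."""
    return [(sequence[i] + sequence[i + 1]) / 2.0 for i in range(0, len(sequence) - 1, 2)]

def causal_mask(n_tokens):
    """mask[i][j] is True when token i may attend to token j (j not in the future)."""
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]

def resolution_schedule(n_days=30, base_minutes=90, n_blocks=4):
    """(minutes per token, number of tokens) before and after each pooling step."""
    tokens = n_days * (24 * 60 // base_minutes)
    schedule = [(base_minutes, tokens)]
    for _ in range(n_blocks):
        base_minutes *= 2
        tokens //= 2
        schedule.append((base_minutes, tokens))
    return schedule

for minutes, tokens in resolution_schedule():
    print(f"{minutes:>5} min per token -> {tokens} tokens across 30 d")

print(pool_pairs([60.0, 80.0, 70.0, 90.0]))
print(causal_mask(3))
```

Four pooling steps turn 480 tokens of 90 min each into 30 daily tokens, and the lower-triangular mask is what keeps each position from attending to later time points.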
To mitigate the lack of daily clinical CPET and 6MWT outcome labels, we used semi-supervised learning targeting linearly approximated values of each test across the study48,49,50. In our study, explicit daily labels were absent, and only the baseline and end-of-study clinic visits provided clinical status for our target outcomes (CPET pVO2 or clinical 6MWT). To establish a reasonable approximation of the target outcome for each 30-d window, we used linear interpolation of the clinical outcomes recorded at the initial study-entry assessment and follow-up visits, yielding daily outcomes (Supplementary Fig. 2). The TRUE-HF model uses exclusively wearable data and clinical data to predict future states, never past states. The final TRUE-HF model was an ensemble of ten models, each trained using the same TRUE-HF framework but with different random seeds; the average prediction from these models was used to derive TRUE-HF predictions. Extending the TRUE-HF model to All of Us The All of Us dataset provided per-min wearable measurements exclusively for HeartRate and StepCount, whereas other critical features of the original TRUE-HF model (ActiveEnergyBurned, BasalEnergyBurned, Distance, AppleStandPlusTime, AppleExerciseTime, OxygenSaturation and HeartRateVariabilitySDNN) were unavailable. To address these differences, we employed a knowledge-distillation approach, specifically a teacher–student training strategy51,52. In this approach, predictions generated by a more comprehensive teacher model (the TRUE-HF model) served as training targets (pseudo-labels) for a streamlined student model that accommodated the reduced feature set. Given the substantial feature gap between the original TRUE-HF model and the All of Us-compatible variant, we introduced a ‘teacher-assistant’ model to facilitate knowledge transfer and mitigate performance degradation53.
This teacher-assistant model retained all original wearable features but used a reduced clinical feature set aligned with the All of Us cohort. For each training batch, pseudo-labels were generated from a randomly selected member of the original TRUE-HF ten-model ensemble. Subsequently, the ensemble of trained teacher-assistant models provided pseudo-labels to train the final All of Us-compatible TRUE-HF-RS model, which relies exclusively on HeartRate, StepCount and the reduced clinical feature set. All TRUE-HF and TRUE-HF-RS models were trained exclusively on the training set of the TRUE-HF cohort (n = 154). The All of Us Research Program was used exclusively for external validation. Model validation and outcomes We compared our TRUE-HF pVO2 and TRUE-HF 6MWTD against clinically measured CPET pVO2 and 6MWTD, as well as against Apple VO2Max and sixMinuteWalkTestDistance, respectively. To predict the CPET value from the end-of-study clinic visit, we used wearable data collected over the 30 d preceding it. This ensured that all model inputs reflected free-living conditions, unaffected by structured CPET or tests conducted on the visit day. Apple VO2Max and Apple sixMinuteWalkTestDistance were collected through our iOS mobile application. Note that certain conditions or medications that limit heart rate may cause the Apple VO2Max algorithm to overestimate, as communicated in Apple’s user interface19,54. To further assess our predictions’ accuracy in detecting pVO2 changes over time, we measured the model’s ability to detect declines at the end-of-study clinic visit, measured against the study-entry clinical measurements. For CPET pVO2, an end-of-study drop was defined as a ≥10% reduction from the study-entry visit to the end-of-study clinic visit; this threshold was chosen because a ≥6% to 10% decrease in pVO2 is associated with an increased risk of medium-to-long-term hospitalization or death in patients with HF23,24,25,26.
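The ≥10% drop criterion defined above reduces to a percentage change relative to the study-entry value. A minimal sketch, in which the function names and the example pVO2 values are hypothetical:

```python
def percent_drop(baseline: float, later: float) -> float:
    """Percentage decrease from the baseline value (negative if the value rose)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - later) / baseline

def is_meaningful_drop(baseline: float, later: float, threshold_pct: float = 10.0) -> bool:
    """True when the decrease from baseline meets the >=10% threshold from the text."""
    return percent_drop(baseline, later) >= threshold_pct

# Hypothetical values in ml/kg/min: study-entry pVO2 of 20.0.
print(percent_drop(20.0, 18.0), is_meaningful_drop(20.0, 18.0))
print(percent_drop(20.0, 18.5), is_meaningful_drop(20.0, 18.5))
```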
To classify patients as having a ≥10% drop in pVO2, we calculated the percentage difference between the last model prediction (that is, the TRUE-HF prediction from the day before the clinic visit) and the study-entry CPET pVO2. The same assessment method was used for the TRUE-HF 6MWTD model, for which we tested correlation measures and classified a decline in 6MWTD (10% reduction in distance walked). Our secondary objective was to evaluate the association between TRUE-HF-predicted declines in daily pVO2 and unplanned healthcare utilization during the 90-d follow-up period. This association was compared with the associations of traditional static risk factors measured at baseline and of clinical risk models (MAGGIC, SHFM and PREDICT-HF). We defined unplanned healthcare utilization as hospitalization, unscheduled clinical visits or urgent intravenous furosemide treatment taking place between the study-entry and the end-of-study visits. The first prediction made by our TRUE-HF pVO2 model required 30 d of data. Hence, this objective was evaluated only among patients free from unplanned healthcare utilization in the first 31 d of our study. Explainability and feature analysis We examined the impact of removing structured monthly exercise sessions. All wearable data collected during these sessions were masked by excluding measurements within a 90-min window around the start and end times. A 30-d window was chosen before final analyses (described in the statistical analysis plan). We also evaluated how input window length affected accuracy by comparing shorter windows (10 d or 20 d) with the 30-d window using retrained models and zero-shot inference. Saliency analyses on the combined TRUE-HF model quantified feature importance, averaging saliency values daily for visualization. To assess feature contributions, we trained and compared (1) a fully connected neural network with clinical baseline variables and (2) a model using only wearable data.
Statistical analysis The analyses and methods were prespecified before data evaluation and performed only on the held-out test set (n = 63), consisting of the last 63 patients who successfully completed an end-of-study clinic visit. We followed the STROBE and MI-CLAIM reporting guidelines55,56. The primary analysis compared our model’s observed and predicted CPET pVO2 values. We used Spearman’s coefficient to measure the rank-based association between observed and predicted values. Pearson’s r was employed to quantify linear correlation. Together, these two correlation measures capture complementary aspects of the model’s predictive fidelity. The m.a.e. was chosen for its interpretability in the same units as the outcome. AUROC was used to evaluate the diagnostic accuracy of correctly predicting a meaningful drop in outcome measurements. To estimate AUROC CIs, we employed 1,000 stratified bootstrap resamples. Differences between models were assessed using DeLong’s test, with a two-sided P < 0.05 considered significant. The secondary analysis used time-varying, extended, Simon and Makuch’s Kaplan–Meier cumulative risk curves to assess the association between an observed drop of ≥10% in a patient’s TRUE-HF daily pVO2 and unplanned healthcare utilization57. In this analysis, the cohort was stratified into two groups: (1) patients with a ≥10% reduction in daily pVO2; and (2) patients with <10% reduction in daily pVO2. We constructed cumulative risk curves for unplanned healthcare utilization that accounted for the time-varying nature of our daily pVO2 covariate, facilitating a longitudinal comparison of event-free survival and clinical outcomes between these groups.
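The AUROC and its stratified bootstrap CI mentioned in this section can be sketched in plain Python. This is an illustration of the general technique rather than the authors' implementation, which used R packages; the function names, the percentile method for the interval and the toy scores are assumptions.

```python
import random

def auroc(scores, labels):
    """AUROC as the probability that a positive case outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI; positives and negatives are resampled separately."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        stats.append(auroc(boot_pos + boot_neg, [1] * len(boot_pos) + [0] * len(boot_neg)))
    stats.sort()
    lower = stats[round(alpha / 2 * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy example: predicted percentage drops (scores) and observed >=10% drops (labels).
scores = [2.0, 11.0, 9.0, 15.0, 4.0, 12.0]
labels = [0, 0, 1, 1, 0, 1]
print(auroc(scores, labels), stratified_bootstrap_ci(scores, labels))
```

Resampling within each class keeps the number of positive and negative cases fixed in every replicate, so no bootstrap sample ends up without events.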
An extended Cox’s proportional hazards model quantified the HR and the strength of association between the time-varying occurrence of drops in TRUE-HF daily pVO2 (scaled in 10% drops) and subsequent clinical events58. Sensitivity analyses involving minimal covariate adjustment were performed. Due to the limited sample size, these analyses used single covariate adjustments rather than comprehensive multivariate adjustments. For the secondary analysis, AUROC was computed from the maximum percentage drop in model prediction, for each patient, between the first model prediction and the daily predictions preceding any event or end-of-study clinic visit (censoring). DeLong’s one-tailed test was performed to evaluate the a priori hypothesis that continuous monitoring methods would surpass static measures in discriminative performance. Multiple testing correction was applied using the Benjamini–Hochberg method (target false discovery rate = 0.05). Landmark-based, time-dependent AUROC analysis was performed during external validation to assess the model’s ability to predict outcomes based on the largest percentage drop identified up to each landmark time t59. This analysis was repeated at 10-d intervals with 30-d outcome windows. Qualitative results for the trends observed with TRUE-HF predictions were created using LOWESS-smoothed trajectories, generated by forward-filling censored or missing data across the entire study window using the last observed value before censoring or event. Data preprocessing and model development were performed in Python using Pandas (v1.5.3), NumPy (v1.21.3) and PyTorch (v2.0.0). Correlation analyses, m.a.e. calculations and Simon and Makuch’s Kaplan–Meier estimations were conducted with SciPy (v1.7.1) and lifelines (v0.28.0). AUROC, DeLong’s test, the Benjamini–Hochberg correction and Cox’s proportional hazards models were implemented in R using the pROC (v1.18.5), survival (v3.6-4) and timeROC (v0.4) packages.
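The drop covariate used in the secondary and landmark analyses above, namely the largest percentage drop from the first model prediction observed up to a given day, can be sketched as follows. The function names and the daily values are hypothetical, and treating the covariate as a running maximum that never falls below zero is an assumption consistent with, but not confirmed by, the description.

```python
def running_max_drop(daily_predictions):
    """Largest % drop from the first prediction seen up to and including each day."""
    first = daily_predictions[0]
    drops, worst = [], 0.0
    for value in daily_predictions:
        worst = max(worst, 100.0 * (first - value) / first)
        drops.append(worst)
    return drops

def first_day_reaching(drops, threshold_pct=10.0):
    """Index of the first day whose running drop reaches the threshold, else None."""
    for day, drop in enumerate(drops):
        if drop >= threshold_pct:
            return day
    return None

# Hypothetical daily wearable-derived pVO2 predictions (ml/kg/min).
predictions = [20.0, 19.0, 18.0, 19.5, 17.0]
drops = running_max_drop(predictions)
print(drops, first_day_reaching(drops))
```

Dividing each running drop by 10 would give a covariate expressed in units of 10% drops, matching the scaling described for the extended Cox model.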
TRUE-HF model architecture definitions, wearable data preprocessing and inference workflows have been made publicly available at https://github.com/mcintoshML/TRUEHF. Reporting summary Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article. Data availability The data that support the findings of this study include a clinical wearable dataset from the TRUE-HF study and de-identified participant data from the All of Us Research Program. TRUE-HF data were collected under research ethics board-approved protocols with written informed consent that do not permit public release of individual-level data. Access to de-identified TRUE-HF data may be considered for qualified researchers, subject to approval by the UHN Research Ethics Board and execution of a data-usage agreement restricting use to noncommercial research purposes and prohibiting re-identification or onward sharing. Requests should be directed to the corresponding authors and will be reviewed on a case-by-case basis, with a response typically provided within 4–6 weeks of receipt of a complete request. All of Us data are available through the program’s controlled-access research platform to registered researchers who meet eligibility requirements and agree to the All of Us data use policies, which restrict re-identification and unauthorized data sharing. Access requests are reviewed by the All of Us Research Program in accordance with its governance framework, with review timelines determined by the program. Code availability Model code has been made available at https://github.com/mcintoshML/TRUEHF. Model weights will not be publicly released, because they were trained on sensitive clinical wearable data subject to data use and consent restrictions. 
A secure interface for running the TRUE-HF model weights may be made available to qualified researchers for external validation, subject to institutional approvals at the requesting institution and approval by the UHN Research Ethics Board, data-usage agreements and applicable privacy regulations. Requests should be directed to the corresponding authors and will be reviewed on a case-by-case basis and a response can be expected within 4–6 weeks.
References
1. Foroutan, F. et al. Global comparison of readmission rates for patients with heart failure. J. Am. Coll. Cardiol. 82, 430–444 (2023).
2. Lippi, G. & Sanchis-Gomar, F. Global epidemiology and future trends of heart failure. AME Méd. J. 5, 15 (2020).
3. Heidenreich, P. A. et al. 2022 AHA/ACC/HFSA Guideline for the Management of Heart Failure: Executive Summary: a report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. Circulation 145, e876–e894 (2022).
4. Heart disease. Heart and Stroke Foundation of Canada https://www.heartandstroke.ca/heart (2024).
5. Arrigo, M. et al. Acute heart failure. Nat. Rev. Dis. Prim. 6, 16 (2020).
6. Heidenreich, P. A. et al. 2022 AHA/ACC/HFSA Guideline for the Management of Heart Failure: a report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. Circulation 145, e895–e1032 (2022).
7. Arena, R., Myers, J. & Guazzi, M. The clinical and research applications of aerobic capacity and ventilatory efficiency in heart failure: an evidence-based review. Heart Fail. Rev. 13, 245–269 (2008).
8. Guazzi, M. et al. 2016 Focused update: Clinical recommendations for cardiopulmonary exercise testing data assessment in specific patient populations. Circulation 133, e694–e711 (2016).
9. Balady, G. J. et al. Clinician’s guide to cardiopulmonary exercise testing in adults: a scientific statement from the American Heart Association. Circulation 122, 191–225 (2010).
10. Guazzi, M., Dickstein, K., Vicenzi, M. & Arena, R. Six-minute walk test and cardiopulmonary exercise testing in patients with chronic heart failure: a comparative analysis on clinical and prognostic insights. Circ. Heart Fail. 2, 549–555 (2009).
11. Raphael, C. et al. Limitations of the New York Heart Association functional classification system and self-reported walking distances in chronic heart failure. Heart 93, 476 (2007).
12. Scholte, N. T. B. et al. Telemonitoring for heart failure: a meta-analysis. Eur. Heart J. 44, 2911–2926 (2023).
13. Kim, B. Y. & Lee, J. Smart devices for older adults managing chronic disease: a scoping review. JMIR mHealth uHealth 5, e69 (2017).
14. Dunn, J. et al. Wearable sensors enable personalized predictions of clinical laboratory measurements. Nat. Med. 27, 1105–1112 (2021).
15. Friend, S. H., Ginsburg, G. S. & Picard, R. W. Wearable digital health technology. N. Engl. J. Med. 389, 2100–2101 (2023).
16. Attia, Z. I. et al. Prospective evaluation of smartwatch-enabled detection of left ventricular dysfunction. Nat. Med. 28, 2497–2503 (2022).
17. Gill, S. K. et al. Consumer wearable devices for evaluation of heart rate control using digoxin versus beta-blockers: the RATE-AF randomized trial. Nat. Med. 30, 2030–2036 (2024).
18. Perino, A. C. et al. Arrhythmias other than atrial fibrillation in those with an irregular pulse detected with a smartwatch: findings from the Apple Heart Study. Circ.: Arrhythmia Electrophysiol. 14, e010063 (2021).
19. Using Apple Watch to Estimate Cardio Fitness with VO2 max (Apple, 2021); https://www.apple.com/healthcare/docs/site/Using_Apple_Watch_to_Estimate_Cardio_Fitness_with_VO2_max.pdf
20. Ross, H. J. Apple-CPET Ted Rogers Understanding Exacerbations of Heart Failure (TRUE-HF). ClinicalTrials.gov https://clinicaltrials.gov/study/NCT05008692 (2021).
21. Moayedi, Y. et al. Developments in digital wearable in heart failure and the rationale for the design of TRUE-HF (Ted Rogers Understanding of Exacerbations in Heart Failure) Apple CPET Study. Circ. Heart Fail. 18, e012204 (2025).
22. Hastie, T., Tibshirani, R. & Friedman, J. The elements of statistical learning. Springer Ser. Stat. https://doi.org/10.1007/b94608_3 (2008).
23. Frankenstein, L. et al. Prognostic impact of peakVO2-changes in stable CHF on chronic beta-blocker treatment. Int. J. Cardiol. 122, 125–130 (2007).
24. Hearn, J. et al. Neural networks for prognostication of patients with heart failure. Circ. Heart Fail. 11, e005193 (2018).
25. Corrà, U. et al. Cardiopulmonary exercise testing in systolic heart failure in 2014: the evolving prognostic role. Eur. J. Heart Fail. 16, 929–941 (2014).
26. Grigioni, F. et al. Serial versus isolated assessment of clinical and instrumental parameters in heart failure: prognostic and therapeutic implications. Am. Heart J. 146, 298–303 (2003).
27. Bailey, C. P. et al. Fitbit physical activity and sleep data in the All of Us Research Program: data exploration and processing considerations for research. Med. Sci. Sports Exerc. https://doi.org/10.1249/mss.0000000000003804 (2025).
28. The All of Us Research Program Investigators. The ‘All of Us’ Research Program. N. Engl. J. Med. 381, 668–676 (2019).
29. Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).
30. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006).
31. Levy, W. C. et al. The Seattle Heart Failure Model. Circulation 113, 1424–1433 (2006).
32. Rich, J. D. et al. Meta-Analysis Global Group in Chronic (MAGGIC) heart failure risk score: validation of a simple tool for the prediction of morbidity and mortality in heart failure with preserved ejection fraction. J. Am. Heart Assoc. 7, e009594 (2018).
33. Cyrille-Superville, N. et al. PREDICT HF: risk stratification in advanced heart failure using novel hemodynamic parameters. Clin. Cardiol. 47, e24277 (2024).
34. Shcherbina, A. et al. Accuracy in wrist-worn, sensor-based measurements of heart rate and energy expenditure in a diverse cohort. J. Pers. Med. 7, 3 (2017).
35. Barron, A. et al. Test–retest repeatability of cardiopulmonary exercise test variables in patients with cardiac or respiratory disease. Eur. J. Prev. Cardiol. 21, 445–453 (2014).
36. Stehlik, J. et al. Continuous wearable monitoring analytics predict heart failure hospitalization: the LINK-HF multicenter study. Circ. Heart Fail. 13, e006513 (2020).
37. Heo, B. et al. Rethinking spatial dimensions of vision transformers. In Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (eds Berg, T. et al.) 11916–11925 (IEEE, 2021).
38. Marvasti, T. B. et al. Unlocking tomorrow’s health care: expanding the clinical scope of wearables by applying artificial intelligence. Can. J. Cardiol. https://doi.org/10.1016/j.cjca.2024.07.009 (2024).
39. Rørth, R. et al. Comparison of BNP and NT-proBNP in patients with heart failure and reduced ejection fraction. Circ. Heart Fail. 13, e006541 (2020).
40. Zhou, H., Della, P. R., Roberts, P., Goh, L. & Dhaliwal, S. S. Utility of models to predict 28-day or 30-day unplanned hospital readmissions: an updated systematic review. BMJ Open 6, e011060 (2016).
41. Apple Data Types. Apple Developer https://developer.apple.com/documentation/healthkit/data_types (2024).
42. Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017) https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).
43. Wu, H., Zhou, H., Long, M. & Wang, J. Interpretable weather forecasting for worldwide stations with a unified deep model. Nat. Mach. Intell. 5, 602–611 (2023).
44. Wu, H. et al. CvT: introducing convolutions to vision transformers. Proc. of the IEEE/CVF International Conference on Computer Vision (ICCV) 22–31 (2021).
45. Perez, E., Strub, F., de Vries, H., Dumoulin, V. & Courville, A. FiLM: visual reasoning with a general conditioning layer. Proc. of the AAAI Conference on Artificial Intelligence 32, 3942–3951 (2018).
46. Radford, A. et al. Improving language understanding by generative pre-training. Preprint at Mike Captain https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (2018).
47. Das, A., Kong, W., Sen, R. & Zhou, Y. A decoder-only foundation model for time-series forecasting. In Proc. 41st International Conference on Machine Learning 10148–10167 (PMLR, 2024).
48. Ma, Q. et al. A survey on time-series pre-trained models. IEEE Trans. on Knowl. and Data Eng. https://doi.org/10.1109/TKDE.2024.3475809 (2024).
49. Nie, Y., Nguyen, N. H., Sinthong, P. & Kalagnanam, J. A time series is worth 64 words: long-term forecasting with transformers. In Eleventh International Conference on Learning Representations https://openreview.net/pdf?id=Jbdc0vTOcol (ICLR, 2023).
50. Kotei, E. & Thirunavukarasu, R. A systematic review of transformer-based pre-trained language models through self-supervised learning. Information 14, 187 (2023).
51. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://arxiv.org/abs/1503.02531 (2015).
52. Zhang, L., Bao, C. & Ma, K. Self-distillation: towards efficient and compact neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 44, 4388–4403 (2021).
53. Ding, Z., Jiang, G., Zhang, S., Guo, L. & Lin, W. How to trade off the quantity and capacity of teacher ensemble: learning categorical distribution to stochastically employ a teacher for distillation. Proc. of the AAAI Conference on Artificial Intelligence 38, 17915–17923 (2024).
54. Brawner, C. A., Ehrman, J. K., Schairer, J. R., Cao, J. J. & Keteyian, S. J. Predicting maximum heart rate among patients with coronary heart disease receiving β-adrenergic blockade therapy. Am. Heart J. 148, 910–914 (2004).
55. Elm, E. V. et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. J. Clin. Epidemiol. 61, 344–349 (2008).
56. Norgeot, B. et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat.
Med. 26, 1320–1324 (2020). Schultz, L. R., Peterson, E. L. & Breslau, N. Graphing survival curve estimates for time-dependent covariates. Int. J. Methods Psychiatr. Res. 11, 68–74 (2002). Therneau, T. M. & Grambsch, P. M. Modeling survival data: extending the Cox model. Stat. Biol. Heal. https://doi.org/10.1007/978-1-4757-3294-8_9 (2000). Bansal, A. & Heagerty, P. J. A comparison of landmark methods and time-dependent ROC methods to evaluate the time-varying performance of prognostic markers for survival outcomes. Diagn. Progn. Res. 3, 14 (2019).

Acknowledgements

We thank all patients for participating in this study and for their dedication to enabling research. The study was funded by the Ted Rogers Centre for Heart Research. This study was also funded in part by C.M.'s Natural Sciences and Engineering Research Council of Canada grants (grant nos. RGPIN-2022-05117 and DGECR-2022-00137). C.M. holds the Chair in Medical Imaging at the Joint Department of Medical Imaging at the UHN and the Department of Medical Imaging at the University of Toronto. H.R. is the Loretta Rogers Chair in Heart Function. Y.G. was funded in part by a Canadian Institutes of Health Research Canada Graduate Scholarship Doctoral Award and the University of Toronto’s TRANSFORM HF Trainee Award. We gratefully acknowledge participants in the All of Us Research Program for their contributions, without whom this research would not have been possible. We also thank the NIH’s All of Us Research Program for making available the participant data examined in this study. This study used data from the All of Us Research Program’s Controlled Tier Dataset v8, available to authorized users on the Research Workbench. The funders had no role in study design, data analysis, data interpretation or the decision to submit the paper for publication. Apple Incorporated provided iPhones and Apple Watch devices for the study, collaborated on building the study-specific mobile application and provided feedback on the manuscript.
No other funders had a role in data collection or manuscript preparation.

Contributions

All authors contributed to the study design, investigation, data verification and result interpretation and reviewed, edited and approved the manuscript for publication. H.R. conceptualized the study. Y.M., H.R. and F.F. wrote the protocol. Y.G., J.D. and C.M. processed and verified the wearables data. Y.G. and C.M. conceptualized, developed and validated the machine learning model and wrote the first draft of the manuscript. B.V. conducted the analysis of the All of Us data. Y.G., B.V., C.M. and F.F. performed the statistical analysis. E.D.L., M.B. and B.K. conducted study-entry and end-of-study non-clinic assessments and verified clinical and demographic data. B.K. and A.S. coordinated and collected patient partner feedback. H.R., Y.M. and D.H.B. provided clinical interpretation of the data and results. H.R., C.M. and Y.M. supervised the study.

Ethics declarations

Competing interests

Apple Incorporated provided 200 iPhones and Apple Watch devices for the study, provided feedback on the manuscript and collaborated with all authors to build the study-specific mobile application. All authors are investigating patenting the TRUE-HF model described herein.

Peer review

Peer review information

Nature Medicine thanks Yogesh Reddy and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Michael Basson, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 TRUE-HF transformer model framework.

The TRUE-HF model processes 30 days of data, summarized in 90-min windows, from nine prevalent HealthKit data types to predict outcomes. First, a TRUE-HF Transformer block learns temporal relationships in the data sequence while reducing the representations and the temporal resolution (that is, each step covers a larger time window). The sequence length is reduced further deeper in the model through pooling. TRUE-HF integrates demographic and drug data (clinical features) with the time series through an affine transformation, so that these clinical details can affect the interpretation of the HealthKit variables. A linear layer generates the final prediction of outcomes (that is, daily pVO2 regression).

Extended Data Fig. 2 CONSORT diagram for All of Us external validation set.

Extended Data Fig. 3 Landmark analysis of time-dependent performance of TRUE-HF-RS for predicting unplanned healthcare utilization events in All of Us.

Solid lines indicate mean estimates of the AUROC, dashed lines represent specificity and dot–dash lines represent sensitivity. Shaded regions denote 95% confidence intervals derived from bootstrapping.

Extended Data Fig. 4 Biometric data from Apple Watch can be used to estimate 6MWTD.

(A) Scatterplot of predicted 6MWTD against the measured ground truth of clinical 6MWTD at the end-of-study clinical visit (median 94.5 d after study entry). Points represent individual participants (n = 49). The solid line indicates the fitted linear relationship between predicted and measured 6MWTD. Shaded bands denote the 95% confidence band around the fitted linear relationship. (B) Spearman’s correlation of the TRUE-HF 6MWTD model (n = 49) and Apple’s SixMinuteWalkDistance model (n = 44) with the clinical 6MWTD measured at the end-of-study clinical visit. 6MWTD according to Apple’s model was not measured for 5 participants. (C) ROC curves for detecting declines in clinical 6MWTD from study entry to the end-of-study clinical visit. AUROCs were calculated by comparing each model’s predicted 6MWTD decline (the last prediction, that is, the one temporally closest to the end-of-study clinical visit date, compared with the study-entry clinical 6MWTD) against the observed declines in clinical 6MWTD. Solid lines represent the mean ROC curve, and shaded bands indicate the 95% confidence interval, estimated by bootstrap resampling. No difference was observed in AUROC between 6MWTD predicted by the TRUE-HF 6MWTD model and the Apple SixMinuteWalkDistance algorithm for detecting declines in 6MWTD (DeLong’s two-tailed test, p = 0.4782).

Extended Data Fig. 5 TRUE-HF 6MWTD and Apple Algorithm 6MWTD predictions of declines in daily 6MWTD prior to unplanned healthcare use.

(A–B) Experiments were conducted in the full TRUE-HF held-out test set (n = 63). (A) AUROC for predicting unplanned healthcare utilization with the TRUE-HF 6MWTD model compared with the Apple 6MWTD model. AUROC was calculated using the maximum decrease in model 6MWTD, relative to the first model prediction, observed before the event. Solid lines represent mean estimates, and shaded regions indicate 95% confidence intervals. (B) Forest plot of hazard ratios from Cox proportional hazards models evaluating the association between declines in daily 6MWTD and time to unplanned healthcare utilization for TRUE-HF and Apple estimates. Points denote HR estimates, horizontal lines indicate 95% confidence intervals, and p-values correspond to two-sided Wald tests. The vertical dashed line denotes HR = 1.

Supplementary information

Supplementary Information: Supplementary Figs. 1–4, Tables 1–4 and Statistical analysis plan.
Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Gao, Y., Moayedi, Y., Foroutan, F. et al. Remote monitoring of heart failure exacerbations using a smartwatch. Nat Med 32, 924–933 (2026). https://doi.org/10.1038/s41591-026-04247-3
