Saving Lives, Saving Costs: Predicting Heart Failure with Teradata

January 12, 2023 | 4 min read

According to the Centers for Disease Control and Prevention (CDC), heart disease is one of the leading causes of death in the United States, accounting for nearly 20% of all deaths in 2020. In 2017 and 2018, the estimated cost of health care services, medication to treat heart disease, and missed days of work amounted to a staggering $229 billion per year.

For our customer, a U.S. healthcare insurance company, a sample analysis of heart-related chronic conditions showed that a fraction of the population accounted for high costs because of heart-related illnesses. If interventions were targeted at high-risk patients on the path to chronic illness, a significant proportion of that cost could be recovered. The analysis showed that in California alone, the potential savings based on a sample of approximately 5,000 patients with heart failure-related illness were $515 million.

Hence, our customer wanted to undertake a proof of concept on their data to identify high-risk patients 6 months in advance, giving doctors enough time to develop intervention plans and improve their patients’ health. A team of Teradata data scientists at the Global Delivery Center (GDC) Pakistan and industry experts, working alongside the customer, aimed to develop a solution that would predict the onset of heart failure 6 months in advance with high accuracy, to avert the high claims costs that arise after onset. In addition to identifying high-risk patients, the customer also wanted to understand the driving factors behind each patient’s predicted outcome.

Data Prep and Feature Engineering Using Teradata

We sampled two groups of patients based on the claims data: those who experienced heart failure (cases) and those with similar demographics but no heart failure (controls). Patients were between 40 and 85 years old.

Patient data was anonymized (NOPHI - No Personal Health Information) and complied with HIPAA standards.

We extracted patients’ visit records consisting of diagnoses, medications, procedures, and demographics. We also added a temporal aspect to the medical features by differentiating between events occurring 1-3 months, 3-6 months, and 6-12 months before the onset of heart failure.
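
The time-window idea can be sketched in a few lines of Python. This is an illustration only, not the project's in-database implementation: the window edges follow the text, while the function names, the event codes, the feature naming scheme, and the choice to make each window's upper edge exclusive are all assumptions.

```python
from datetime import date

# Window edges in months before onset, taken from the text.
# Treating each window as [lower, upper) is an assumption.
WINDOWS = [("1-3m", 1, 3), ("3-6m", 3, 6), ("6-12m", 6, 12)]
DAYS_PER_MONTH = 30.4375  # average month length, an approximation


def window_label(event_date, onset_date):
    """Return the window an event falls in, or None if it is outside all windows."""
    months_before = (onset_date - event_date).days / DAYS_PER_MONTH
    for label, lower, upper in WINDOWS:
        if lower <= months_before < upper:
            return label
    return None


def temporal_features(events, onset_date):
    """Count one patient's events per (code, window) pair."""
    counts = {}
    for code, event_date in events:
        label = window_label(event_date, onset_date)
        if label is not None:
            key = f"{code}_{label}"
            counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical patient history (codes and dates are made up).
onset = date(2020, 12, 1)
events = [
    ("hypertension", date(2020, 10, 1)),  # about 2 months before onset
    ("hypertension", date(2020, 7, 15)),  # about 4.5 months before onset
    ("diuretic_rx", date(2020, 2, 1)),    # about 10 months before onset
    ("diuretic_rx", date(2019, 6, 1)),    # older than 12 months, ignored
]
features = temporal_features(events, onset)
print(features)
```

Each medical event becomes up to three features, one per window, so the model can distinguish a recent diagnosis from an older one.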

To reduce the number of features for model building, we used official medical groupers to aggregate the codes into broader categories. In addition, we applied a regression-based feature reduction technique to eliminate uninformative features. Together, these two steps reduced the total number of features many times over, which reduced the complexity of our model while improving its predictive power.
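
As a rough illustration of the grouper step, the sketch below collapses detailed diagnosis codes into broader categories. The three-entry mapping is hypothetical and stands in for a real medical grouper, which is far larger; the regression-based reduction step is not shown.

```python
from collections import Counter

# Hypothetical grouper: ICD-10 code prefix -> broader category.
# Real groupers contain many more entries; these are illustrative only.
GROUPER = {
    "I10": "hypertension",
    "E11": "type_2_diabetes",
    "N18": "chronic_kidney_disease",
}


def group_code(code):
    """Map a detailed diagnosis code to a category via its 3-character prefix."""
    return GROUPER.get(code[:3], "other")


def grouped_counts(codes):
    """Collapse one patient's detailed codes into per-category counts."""
    return Counter(group_code(c) for c in codes)


# Made-up example: five detailed codes collapse into four categories.
patient_codes = ["E11.9", "E11.65", "I10", "N18.3", "Z00.00"]
counts = grouped_counts(patient_codes)
print(dict(counts))
```

Across thousands of distinct codes, this kind of aggregation is what shrinks the feature space before any statistical feature selection is applied.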

Model Development and Results

Using the remaining time-based variables, we compared several models with different hyperparameter settings and selected the one that most accurately identified heart failure patients. Using Teradata Vantage in-database functions to prepare the customer claims data, we were able to build a model that correctly predicts ~70% of heart failure occurrences with ~90% accuracy, 6 months in advance, for tens of thousands of patients at scale.
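
The two headline figures measure different things. One plausible reading, and it is only an assumption, is that ~70% refers to recall (the share of actual heart failure cases the model flags) and ~90% to overall accuracy. The sketch below shows how both come out of a single confusion matrix; the counts are invented to reproduce the headline figures and are not the project's results.

```python
def recall(tp, fn):
    """Share of actual positive cases that the model flagged."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions, positive and negative, that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of flagged patients who actually were positive cases."""
    return tp / (tp + fp)


# Made-up confusion-matrix counts for 1,000 patients (illustrative only).
tp, fn = 70, 30    # heart failure cases: flagged vs. missed
tn, fp = 830, 70   # non-cases: correctly cleared vs. falsely flagged

model_recall = recall(tp, fn)
model_accuracy = accuracy(tp, tn, fp, fn)
model_precision = precision(tp, fp)
print(model_recall, model_accuracy, model_precision)
```

With these invented counts, recall is 0.7 and accuracy is 0.9, while precision is only 0.5, which is why a single accuracy number is rarely enough when the condition being predicted is uncommon.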

Moreover, we estimated how much each variable increases the odds of heart failure on average, and displayed these temporal variables as a sequence of events leading up to heart failure.

Translating Model Accuracy to Financial Savings

According to our analysis, the cost per patient nearly triples after the onset of heart failure, which provides a substantial incentive to avert the outcome wherever possible.

If health practitioners can intervene at the right time based on the prediction of our model and prevent even a fraction of patients from having heart failure, we can not only improve and extend patients’ lives but also potentially save millions of dollars in cost of care. 
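
A back-of-the-envelope version of this reasoning is sketched below. Only the roughly threefold cost increase comes from our analysis; the patient count, prevention rate, and baseline cost are placeholder inputs, and the calculation ignores the cost of the interventions themselves.

```python
def estimated_savings(flagged_patients, prevented_fraction,
                      baseline_cost, cost_multiplier=3.0):
    """Avoided cost if a fraction of flagged patients never reach onset.

    cost_multiplier reflects the 'nearly triples' finding; every other
    input is a placeholder the reader must supply.
    """
    prevented = flagged_patients * prevented_fraction
    avoided_cost_per_patient = baseline_cost * (cost_multiplier - 1.0)
    return prevented * avoided_cost_per_patient


# Placeholder inputs: 1,000 flagged patients, 10% of onsets prevented,
# and a baseline cost of 10,000 per patient before onset.
savings = estimated_savings(1000, 0.10, 10_000.0)
print(savings)
```

With these placeholder inputs the avoided cost comes to 2,000,000, which shows how even a modest prevention rate adds up across a large member population.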

Interpreting Model Decisions

For clinicians and healthcare analysts, it is important that the predictive model is both accurate and interpretable. They need to know the reasoning behind the predictions for the following reasons:

1.    To ascertain that the predicted outcome is reasonable and logically follows from the driving factors (explanation) given by the model.
2.    To inform clinical decisions and design appropriate intervention plans based on the explanation.

Our model is a nonparametric, tree-based learner, meaning that it does not output inherently meaningful feature contributions or parameters. To achieve interpretability, we estimated feature contributions using Shapley values, computed with the SHAP (SHapley Additive exPlanations) technique. It gauges the contribution of each feature by isolating its effect on the outcome. This let us infer how much each feature affects the odds of having heart failure. As a result, we obtained both the superior predictive accuracy of our selected tree-based model and the explainability of a highly interpretable model, without trading one for the other.
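
To make the idea concrete, the sketch below computes exact Shapley values by brute force for a toy scoring function. It is not the project's model or the SHAP library: the scoring function, feature names, and weights are hypothetical, and absent features are simply treated as switched off, whereas SHAP averages over a background dataset and uses much faster tree-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values for a set function over the given feature names.

    Brute force: cost grows exponentially with the number of features.
    """
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        result[f] = total
    return result


def toy_log_odds(present):
    """Hypothetical risk score in log-odds units (weights are made up)."""
    score = 0.0
    if "hypertension" in present:
        score += 0.8
    if "diabetes" in present:
        score += 0.5
    if "hypertension" in present and "diabetes" in present:
        score += 0.4  # interaction, split equally between the two features
    return score


phi = shapley_values(toy_log_odds, ["hypertension", "diabetes", "statin_rx"])
print(phi)
```

The contributions sum to the difference between the full score and the empty score, and a feature the score never uses receives a contribution of zero. When the score is in log-odds units, exponentiating a contribution gives the factor by which that feature multiplies the odds.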

Summing up, our solution combines Teradata Vantage’s scalable technology for advanced analytics with seamless integration of open-source tools that offer a business-friendly user interface. It reduces the effort of sifting through millions of data points to a few clicks in a simple, easy-to-use tool, driving fast, actionable insights for clinical decision support at scale. As a next step, we have extended our initial use case so that a doctor or other user can choose any condition of interest to predict over a period of their choice.

*Dr. Bilal Khaliq also contributed to this project and article

About Bilal Khaliq:

Bilal is a Principal Data Scientist at Teradata focused on the healthcare industry. He has been working with U.S. account teams to provide advanced analytics solutions for payors and providers. He leads a team of data scientists focusing on engagements to analyze and manage cost of care, as well as modelling and prediction of medical outcomes and patient behavior.

About Ahmed Javed:

Ahmed is currently working with Teradata in the Data Science Practice (DS) and was previously part of the Managed Services Practice (MS) as an Application Operations and ETL Consultant. He is a Computer Engineer turned Data Scientist. Having received exposure to various subject areas during his time at Ghulam Ishaq Khan Institute and the School of Mathematics and Statistics at the University of Glasgow, he combines expertise from both the business and technology sectors.

About Tehreem Farooqi:

With a background in front and back-end web development, Tehreem is currently working at Teradata as an Associate Data Scientist. She has been working with the Healthcare team to build visuals and predictive models showing trends and predicting chronic conditions using the client’s healthcare data.

About Tehseen Niaz:

Tehseen is a Data Scientist at Teradata GDC Pakistan specializing in Machine Learning and Statistical Analysis. He is currently pursuing a part-time MS in Analytics from Georgia Tech and has a Bachelor's in Business Administration from Carnegie Mellon University with a focus on Operations Research and domain knowledge across Corporate Finance, Marketing, and Strategy.