Heart Disease Prediction

Heart disease is one of the leading causes of death in the United States. More than 30 million US adults have been diagnosed with heart disease, and about 650,000 Americans die from it every year (1). This analysis examines data on heart disease patients to identify factors that may be associated with an increased risk of heart disease and builds a model to predict it. The dataset covers 303 patients: 207 males and 96 females.

About the project

Project Summary

This project covers data exploration and cleaning, a comparison of the disease and non-disease groups, and visualization of the data with Matplotlib. A heart disease prediction model was developed using three machine learning algorithms (Logistic Regression, K-Nearest Neighbor Classifier, and Random Forest Classifier). After hyperparameter optimization, the Logistic Regression model achieved the highest accuracy at 88%. The tuned classifier was evaluated with the ROC curve, AUC score, confusion matrix, classification report, precision, recall, and F1-score.
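The three-model comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with scikit-learn's `make_classification` (303 rows to match the patient count; the 13 feature columns, the 80/20 split, and the helper name `fit_and_score` are assumptions made for the example), so the printed accuracies will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data (assumption): 303 rows like the study,
# with an arbitrary 13 numeric features standing in for the real columns.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The three algorithms named in the summary, with default hyperparameters.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its accuracy on the test split."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy
    return scores


model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)
for name, accuracy in model_scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Keeping the models in a dictionary makes it easy to add or swap algorithms while scoring them all on the same split.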

Objectives

• Conducted data exploration and cleaning by comparing disease and non-disease groups, checking for missing values, and finding heart disease frequency by gender, age, and chest pain. Visualized the data using Matplotlib.

• Developed a heart disease prediction model using three machine learning algorithms: Logistic Regression, K-Nearest Neighbor Classifier, and Random Forest Classifier. Used randomized search cross-validation to optimize hyperparameters; the tuned Logistic Regression model achieved the highest accuracy at 88%.

• Evaluated the tuned machine learning classifier using various metrics including ROC curve, AUC score, confusion matrix, classification report, precision, recall, and F1-score.
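The tuning and evaluation steps listed above might look something like the sketch below. It assumes scikit-learn's `RandomizedSearchCV` and again uses synthetic placeholder data; the search range for `C`, the number of iterations, and the 5-fold setting are illustrative assumptions rather than the project's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic placeholder data (assumption), sized like the 303-patient study.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate regularization strengths to sample from (assumed range).
param_distributions = {"C": np.logspace(-4, 4, 20)}

# Randomized search: try 10 sampled settings, each scored with 5-fold CV.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

# Evaluate the best estimator on the held-out test split.
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]  # probability of class 1

fpr, tpr, thresholds = roc_curve(y_test, y_prob)  # points on the ROC curve
auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("AUC:", round(auc, 3))
print("Confusion matrix:\n", cm)
print("Precision:", round(precision_score(y_test, y_pred), 3))
print("Recall:", round(recall_score(y_test, y_pred), 3))
print("F1-score:", round(f1_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred))
```

The ROC curve and AUC are computed from predicted probabilities, while the confusion matrix, precision, recall, and F1-score use the hard class predictions.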

Visualization

Figure: Distribution of patient age

Figure: Resting blood pressure by sex
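Plots like the two named above could be produced with Matplotlib along the following lines. This is only a sketch: the values are randomly generated placeholders (only the 207/96 male/female counts come from the study), and the variable names are hypothetical, so the resulting shapes say nothing about the real data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Group sizes from the study; all values below are synthetic placeholders.
n_male, n_female = 207, 96
age = rng.integers(30, 80, size=n_male + n_female)
resting_bp_male = rng.normal(130, 15, size=n_male)
resting_bp_female = rng.normal(130, 15, size=n_female)

fig, (ax_age, ax_bp) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: histogram of patient age.
ax_age.hist(age, bins=15, edgecolor="black")
ax_age.set_title("Distribution of age")
ax_age.set_xlabel("Age (years)")
ax_age.set_ylabel("Number of patients")

# Right panel: resting blood pressure compared across sexes.
ax_bp.boxplot([resting_bp_male, resting_bp_female])
ax_bp.set_xticks([1, 2])
ax_bp.set_xticklabels(["Male", "Female"])
ax_bp.set_title("Resting blood pressure by sex")
ax_bp.set_ylabel("Resting blood pressure (mm Hg)")

fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name, or `plt.show()` under an interactive backend, would render the figure.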

Key Findings

  • Males outnumber females both among patients with heart disease and among those without it

  • Patients aged 55 to 65 show the highest prevalence of high risk for heart disease

  • Chest pain emerges as the primary indicator of heart disease, making it important for identification and diagnosis

  • The heart disease prediction models achieved the following accuracies: Logistic Regression (88%), Random Forest Classifier (83.6%), and K-Nearest Neighbor (KNN) Classifier (68.8%). Logistic Regression performed best of the three algorithms.