The AI Commons Project is a proof of concept of a new methodology for developing Artificial Intelligence solutions that allows anyone, anywhere to benefit from the possibilities that AI can provide. The project aims to improve the accessibility, reproducibility, contextualization and enhancement of Artificial Intelligence solutions globally, and especially in emerging markets.

The project aims to demonstrate how a global community of AI experts can learn and co-create mutually beneficial solutions with the opportunity for cross-country incremental enhancement.




Ezekiel Adebayo Ogundepo (Data Science Nigeria)
Prof. W.B. Yahya (University of Ilorin)
Dr Olubayo Adekanmbi (Data Science Nigeria)
Rising Odegua (Data Science Nigeria)
David Olubukola Akanji (Data Science Nigeria)


Heart disease refers to several types of heart conditions. According to research, it is one of the most prevalent diseases worldwide and often results in death. Medical research has identified lifestyle, obesity, eating habits and physical inactivity as major factors that can lead to heart-related diseases. Other factors, such as smoking, high blood pressure, gender, ECG rate, cholesterol level, age, weight, height and hypertension, can also increase the chances of heart disease.

The World Health Organization estimates that about 17 million lives are lost yearly to heart-related diseases and that heart disease is prevalent in Asia, India and the United States. Research has also shown that heart disease is more common in men than in women and mostly affects middle-aged and older people.

Various tests are carried out to diagnose heart disease. A doctor examines test results, medical history and past symptoms to determine whether a patient has heart disease.

Malaria detection using machine learning
Cancer detection using machine learning


The solution used patients’ health information and cardiovascular statistics to predict the likelihood of heart disease. Ten (10) different classifiers and seven (7) evaluation metrics were used in developing the solution, and the first version was released in 2018.
Article: Here

The output of the solution is a prediction of whether or not a patient has heart disease using the best performing model (SVM) out of the 10 trained models.

Health practitioners

Medical health personnel: Provide historical data and context useful in training the models.

Data scientists: Clean the data, perform exploratory data analysis (EDA) before training the models, and use analytical storytelling to communicate the results to those who need them.


The intended use of the solution is to predict whether a patient has heart disease. A sample use case: a patient arrives at a hospital with symptoms of heart disease, the necessary patient information is collected and fed into the system, and within seconds the system should accurately predict whether the patient has heart disease.

Medical personnel

The input involves measurements on patients’ health and cardiovascular statistics as shown below:

Input – Description
Age – age in years
Sex – sex of the patient
Cp – chest pain type
Trestbps – resting blood pressure (in mm Hg on admission to the hospital)
Chol – serum cholesterol in mg/dl
Fbs – whether fasting blood sugar exceeds 120 mg/dl
Restecg – resting electrocardiographic results
Thalach – maximum heart rate achieved (beats per minute)
Exang – exercise-induced angina
Oldpeak – ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
Slope – the slope of the peak exercise ST segment, an electrocardiography readout indicating the quality of blood flow to the heart
Ca – number of major vessels colored by fluoroscopy
Thal – results of the thallium stress test measuring blood flow to the heart
The input is supplied to the system, and the output, which states whether or not the patient has heart disease, is displayed on the screen.
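As a rough illustration of how the 13 inputs listed above could be packaged before being passed to a model, here is a minimal Python sketch. It is not the project's code (the original work was done in R); the class name `PatientRecord`, the helper `to_feature_row`, the numeric codings and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class PatientRecord:
    """One patient's inputs, named after the 13 features in the input list."""
    age: float       # age in years
    sex: int         # sex of the patient (numeric coding is an assumption)
    cp: int          # chest pain type
    trestbps: float  # resting blood pressure, mm Hg
    chol: float      # serum cholesterol, mg/dl
    fbs: int         # 1 if fasting blood sugar > 120 mg/dl, else 0
    restecg: int     # resting electrocardiographic results
    thalach: float   # maximum heart rate achieved, beats per minute
    exang: int       # exercise-induced angina (assumed coding: 1 = yes, 0 = no)
    oldpeak: float   # ST depression induced by exercise relative to rest
    slope: int       # slope of the peak exercise ST segment
    ca: int          # number of major vessels colored by fluoroscopy
    thal: int        # thallium stress test result (coding is an assumption)


# Feature order follows the order of the dataclass fields.
FEATURE_NAMES = [f.name for f in fields(PatientRecord)]


def to_feature_row(record: PatientRecord) -> list:
    """Flatten a record into the ordered list a trained model would consume."""
    return [getattr(record, name) for name in FEATURE_NAMES]


# Hypothetical patient; the values are made up for illustration only.
example = PatientRecord(age=54, sex=1, cp=2, trestbps=130, chol=246, fbs=0,
                        restecg=0, thalach=150, exang=0, oldpeak=1.0,
                        slope=1, ca=0, thal=3)
row = to_feature_row(example)
```

Keeping the feature order in one place (`FEATURE_NAMES`) helps ensure that a record entered at the point of care is laid out the same way as the rows used for training.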



This study used two independent datasets. The first comes from a study of heart disease in the Cleveland database, which is publicly available from the UCI (University of California, Irvine) Machine Learning Repository; the second was obtained from DrivenData. Both datasets contain various measurements of patients’ health and cardiovascular statistics that are used to predict whether or not a patient has heart disease.

The UCI dataset was created by V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

David W. Aha (aha ‘@’ (714) 856-8779


The UCI heart disease dataset contains 13 features and 303 records.

The DrivenData heart disease dataset contains 13 features and 180 records.

Each instance in both datasets contains medical information for a patient with or without heart disease.

Yes. The class label in each dataset indicates the presence or absence of heart disease.








Model date: 2019. Ten classifiers were implemented using the caret package in RStudio: Logistic Regression, Random Forest, Decision Tree (DT), Naive Bayes (NB), K-Nearest Neighbour (K-NN), Support Vector Machine (SVM), XGBTree, CForest, Linear Discriminant Analysis (LDA), and Artificial Neural Network (ANN).

We one-hot encoded the categorical features and rescaled the numeric features so that they have a similar range of values. We also applied a simple backward feature selection, also known as recursive feature elimination (RFE), using random forest on the dataset.

Models such as K-NN, SVM, and neural networks require features to be centered and/or scaled before training.
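To make the one-hot encoding step concrete, the following Python sketch shows the idea on a single categorical column. It is an illustration only, not the authors' R code; the function name `one_hot` and the sample chest pain codes are hypothetical.

```python
def one_hot(values):
    """Return the sorted category list and one indicator row per input value.

    Each row has a 1 in the position of its category and 0 elsewhere.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical chest pain type codes for four patients.
cp_codes = [1, 3, 1, 4]
categories, encoded = one_hot(cp_codes)
```

Here `categories` is `[1, 3, 4]` and the first patient is encoded as `[1, 0, 0]`. Encoding this way prevents a model from treating category codes as if they were ordered quantities.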
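Centering and scaling can be sketched as follows in Python. This is an assumption-laden illustration rather than the project's implementation: the helper names `fit_center_scale` and `apply_center_scale` are hypothetical, the sample standard deviation is an assumed choice, and the blood pressure values are made up. The statistics are estimated on training data only and then reused on new data, so that no information from held-out patients leaks into training.

```python
import statistics


def fit_center_scale(train_values):
    """Estimate the mean and sample standard deviation from training data."""
    mean = statistics.mean(train_values)
    sd = statistics.stdev(train_values)  # sample SD; needs at least 2 values
    if sd == 0:
        raise ValueError("cannot scale a constant feature")
    return mean, sd


def apply_center_scale(values, mean, sd):
    """Subtract the training mean and divide by the training SD."""
    return [(value - mean) / sd for value in values]


# Hypothetical resting blood pressure readings (mm Hg).
train_bp = [120, 130, 140]
mean, sd = fit_center_scale(train_bp)
scaled_train = apply_center_scale(train_bp, mean, sd)
scaled_new = apply_center_scale([150], mean, sd)
```

Distance-based and margin-based models such as K-NN and SVM are sensitive to feature magnitude, so without this step a feature measured in large units (for example cholesterol in mg/dl) could dominate one measured in small units (for example Oldpeak).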



SVM outperformed all other classifiers. It achieved Accuracy: 83.89%, Sensitivity: 82.5%, Specificity: 85%, Precision: 81.5%, AUC: 90.5%, LogLoss: 0.379, and F1 score: 82%. The hyperparameters of the models were tuned with the caret package using 10-fold cross-validation to improve performance.
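For readers unfamiliar with k-fold cross-validation, the Python sketch below shows how record indices can be split into 10 folds, each fold serving once as the held-out set. It is a simplified illustration, not the resampling code used in the study: the function name `k_fold_indices` and the seed are hypothetical, and the sketch shuffles records without stratifying by class.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [idx for fold_no, fold in enumerate(folds)
                     if fold_no != held_out for idx in fold]
        yield train_idx, test_idx


# Example with 180 records, the size of the smaller dataset described above.
splits = list(k_fold_indices(180, k=10))
```

Each candidate hyperparameter setting would be trained on the 9 training folds and scored on the held-out fold, and the scores averaged over the 10 rounds.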

Evaluation metrics included Accuracy, Sensitivity, Specificity, Precision, F1 score, Area under a ROC curve (AUC), and LogLoss.
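The threshold-based metrics and LogLoss can be computed from a confusion matrix and predicted probabilities, as in the Python sketch below. The function names are hypothetical, and the counts in the usage example are illustrative: they were chosen so that the outputs line up with the reported percentages and are not taken from the source. AUC is omitted for brevity.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts.

    tp/fn: patients with heart disease predicted positive/negative.
    tn/fp: patients without heart disease predicted negative/positive.
    """
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Illustrative counts only (not from the source).
metrics = classification_metrics(tp=66, fp=15, tn=85, fn=14)
example_loss = log_loss([1, 0], [0.9, 0.1])
```

With these illustrative counts, accuracy is 151/180 (about 83.89%), sensitivity is 82.5%, specificity is 85%, precision is about 81.5% and the F1 score is about 82%.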


We used 64-bit Windows 10, and the statistical analysis was done in RStudio with R version 3.6.3. Packages used included rmarkdown for running the code chunk by chunk, tidyverse for data analysis and visualization, and caret for training the classifiers.







Copyright © 2020 Data Science Nigeria.