Breast Cancer Survival Prediction with Machine Learning

Breast cancer is a cancer that begins in the breast. It primarily affects women, though men can also develop it, and it remains the second leading cause of cancer-related deaths among women. As healthcare data has grown, machine learning has become a useful tool for predicting patient outcomes, including the likelihood of surviving serious illnesses such as breast cancer. In this article, we will build a machine learning model in Python to predict survival outcomes for breast cancer patients.

Breast Cancer Survival Prediction with Machine Learning

We have a dataset of over 400 breast cancer patients who underwent surgery as part of their treatment. The dataset contains the following columns:

Patient_ID: ID of the patient
Age: Age of the patient
Gender: Gender of the patient
Protein1, Protein2, Protein3, Protein4: Expression levels
Tumour_Stage: Breast cancer stage of the patient
Histology: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Mucinous Carcinoma
ER status: Positive/Negative
PR status: Positive/Negative
HER2 status: Positive/Negative
Surgery_type: Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy, Other
Date_of_Surgery: The date of surgery
Date_of_Last_Visit: The date of the last visit of the patient
Patient_Status: Alive/Dead

Using this dataset, our task is to predict whether a breast cancer patient will survive after surgery. The dataset was collected from Kaggle. In the next section, we will walk through the process of predicting breast cancer survival with machine learning using Python.

Breast Cancer Survival Prediction using Python

We will begin the task of breast cancer survival prediction by importing the necessary Python libraries and loading the required dataset:

import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Adjust this path to wherever you saved BRCA.csv
data = pd.read_csv("/Users/rahul_anand/Downloads/BRCA.csv")
print(data.head())

We now proceed to examine whether any of the columns in this dataset contain null values.

print(data.isnull().sum())

Since this dataset contains null values in several columns, we will drop the rows that contain them and confirm that none remain.

data = data.dropna()
print(data.isnull().sum())
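
Before dropping rows, it can help to see how many of them will be lost. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from BRCA.csv). It shows that `dropna()` removes every row that has at least one missing value:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Age": [36, 43, np.nan, 56],
    "Gender": ["FEMALE", "FEMALE", "MALE", None],
    "Patient_Status": ["Alive", None, "Dead", "Alive"],
})

rows_before = len(toy)
cleaned = toy.dropna()        # drops every row containing at least one null
rows_after = len(cleaned)

print(f"Rows before: {rows_before}, rows after: {rows_after}")
print(cleaned.isnull().sum())  # every count is zero after dropna()
```

Comparing the row counts before and after on the real data shows how much of the dataset survives the cleaning step.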

Now, let’s take a closer look at the dataset’s columns, including their data types and non-null counts:

data.info()

Since breast cancer is most commonly found in females, let’s check the Gender column to see how many female and male patients are included.

print(data.Gender.value_counts())

As expected, the dataset contains more female than male patients. Now, let’s look at the tumour stages of the patients.

# Tumour stage: count the patients in each stage and plot the shares
stage = data["Tumour_Stage"].value_counts()
labels = stage.index
counts = stage.values

fig = px.pie(values=counts,
             names=labels,
             title="Tumour Stages of Patients")
fig.show()
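
The pie charts in this article rely on `value_counts()`, which returns a Series whose index holds the category labels and whose values hold the counts, sorted from most to least frequent. A minimal sketch with made-up stage labels (not the real data) shows the two pieces that are passed to the chart:

```python
import pandas as pd

# Made-up stage labels, chosen so that no two stages have the same count.
stages = pd.Series(["II", "II", "II", "III", "III", "I"])

stage_counts = stages.value_counts()
stage_labels = stage_counts.index.tolist()   # category names
stage_sizes = stage_counts.values.tolist()   # number of rows per category

print(stage_labels)
print(stage_sizes)
```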

Most patients in the dataset are in the second stage of breast cancer. Next, let’s explore the histology of the patients. Histology describes a tumour based on how abnormal its cells and tissue look under a microscope, which gives an indication of how quickly the cancer might grow and spread.

# Histology: count the patients per histology type and plot the shares
histology = data["Histology"].value_counts()
labels = histology.index
counts = histology.values

fig = px.pie(values=counts,
             names=labels,
             title="Histology of Patients")
fig.show()

Now, let’s check the values of ER status, PR status, and HER2 status for the patients.

# ER status
print(data["ER status"].value_counts())
# PR status
print(data["PR status"].value_counts())
# HER2 status
print(data["HER2 status"].value_counts())

Now, let’s take a look at the different types of surgeries the patients underwent.

# Surgery type: count the patients per surgery type and plot the shares
surgery = data["Surgery_type"].value_counts()
labels = surgery.index
counts = surgery.values

fig = px.pie(values=counts,
             names=labels,
             title="Type of Surgery of Patients")
fig.show()

We’ve now explored the dataset and noticed that it contains many categorical features. To train a machine learning model, we need to convert these categorical values into a suitable format. Here’s how we can transform them:

# Encode each categorical column as integers.
# Any value missing from a mapping becomes NaN, so both Positive and Negative are listed.
data["Tumour_Stage"] = data["Tumour_Stage"].map({"I": 1, "II": 2, "III": 3})
data["Histology"] = data["Histology"].map({"Infiltrating Ductal Carcinoma": 1,
                                           "Infiltrating Lobular Carcinoma": 2,
                                           "Mucinous Carcinoma": 3})
data["ER status"] = data["ER status"].map({"Positive": 1, "Negative": 2})
data["PR status"] = data["PR status"].map({"Positive": 1, "Negative": 2})
data["HER2 status"] = data["HER2 status"].map({"Positive": 1, "Negative": 2})
data["Gender"] = data["Gender"].map({"MALE": 0, "FEMALE": 1})
data["Surgery_type"] = data["Surgery_type"].map({"Other": 1, "Modified Radical Mastectomy": 2,
                                                 "Lumpectomy": 3, "Simple Mastectomy": 4})
print(data.head())
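
One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary is turned into NaN rather than raising an error. The sketch below uses a made-up column (not the real data) to show the difference between an incomplete and a complete mapping, and how a null count exposes the problem:

```python
import pandas as pd

# Made-up status column containing both labels.
status = pd.Series(["Positive", "Negative", "Positive"])

incomplete = status.map({"Positive": 1})                # "Negative" becomes NaN
complete = status.map({"Positive": 1, "Negative": 2})   # every label is covered

missing_incomplete = int(incomplete.isnull().sum())
missing_complete = int(complete.isnull().sum())

print(missing_incomplete, missing_complete)
```

Running `data.isnull().sum()` again after the encoding step is a quick way to confirm that no category was left out of a mapping.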

Breast Cancer Survival Prediction Model

Now, let’s move on to building a machine learning model to predict whether a breast cancer patient will survive. Before training, we need to split the dataset into training and testing sets.

# Splitting data
# Feature matrix with the 12 input columns
x = np.array(data[['Age', 'Gender', 'Protein1', 'Protein2', 'Protein3', 'Protein4',
                   'Tumour_Stage', 'Histology', 'ER status', 'PR status',
                   'HER2 status', 'Surgery_type']])
# 1-D target vector, which is the shape scikit-learn expects for labels
y = np.array(data['Patient_Status'])
# Hold out 10% of the rows for testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10, random_state=42)
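
With a test set this small, a purely random split can leave it with very few examples of the less common class. One option, shown here as a sketch on made-up data rather than as part of the original workflow, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. The 80/20 class ratio below is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 80 labelled "Alive" and 20 labelled "Dead".
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array(["Alive"] * 80 + ["Dead"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=42, stratify=y_toy
)

# With stratify, the 10-sample test set keeps the 80/20 class ratio.
n_dead_test = int((y_te == "Dead").sum())
print(len(y_te), n_dead_test)
```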

Now we can train a support vector classifier (SVC) on the training set:

model = SVC()
model.fit(xtrain, ytrain)
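
Support vector machines are sensitive to the scale of their inputs, and the features here mix ages in the tens with protein expression levels that are much smaller. A common variation, sketched below on synthetic data from `make_classification` (not the article's dataset), is to standardise the features inside a pipeline before the SVC. This is an optional alternative, not the model trained above:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 12 features, matching the number of input columns.
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=42)

# The scaler is fitted on the training data and reapplied automatically at predict time.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_demo, y_demo)

demo_predictions = scaled_model.predict(X_demo[:5])
print(demo_predictions)
```

Keeping the scaler in the pipeline means new samples passed to `predict` are scaled the same way as the training data.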

Now, let’s pass one patient’s feature values to the trained model, in the same column order used for training, and predict whether that patient will survive breast cancer.

# Prediction
# Feature order: Age, Gender, Protein1, Protein2, Protein3, Protein4,
#                Tumour_Stage, Histology, ER status, PR status, HER2 status, Surgery_type
features = np.array([[36.0, 1, 0.080353, 0.42638, 0.54715, 0.273680, 3, 1, 1, 1, 2, 2]])
print(model.predict(features))
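
A single prediction does not say how reliable the model is. The held-out test set from the split can be used for that. The sketch below runs on synthetic data from `make_classification` (the 80/20 class balance is an assumption for illustration, and the variable names are hypothetical) and reports accuracy together with per-class precision and recall:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded patient data.
X_eval, y_eval = make_classification(
    n_samples=300, n_features=12, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_eval, y_eval, test_size=0.10, random_state=42, stratify=y_eval
)

clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# zero_division=0 avoids a warning if one class is never predicted
print(classification_report(y_test, y_pred, zero_division=0))
```

On the real data, the same calls would take `xtest` and `ytest` from the split above. When one class dominates, the per-class recall is usually more informative than accuracy alone.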

Conclusion

This project shows how machine learning can be used to predict breast cancer survival from clinical data, which could give doctors and patients additional input for decision-making. The techniques applied here, data preprocessing and model training, are not limited to healthcare. Similar approaches can be extended to other real-world problems, such as food delivery time prediction, where accurate forecasts can improve efficiency and user satisfaction.
