Breast cancer is a form of cancer that begins in the breast. While it primarily affects women, men can also develop it. It remains the second leading cause of cancer-related deaths among women. With the growing use of data in healthcare, machine learning has become a powerful tool to predict patient outcomes, including the likelihood of surviving serious illnesses like breast cancer. If you’re interested in learning how to predict breast cancer survival using data-driven methods, this article is for you. Here, I’ll guide you through building a machine learning model in Python to predict survival outcomes for breast cancer patients.
Breast Cancer Survival Prediction with Machine Learning
We have a dataset of over 400 breast cancer patients who underwent surgery as part of their treatment. The columns of the dataset contain the following information:
- Patient_ID: ID of the patient
- Age: Age of the patient
- Gender: Gender of the patient
- Protein1, Protein2, Protein3, Protein4: Expression levels of four proteins
- Tumor_Stage: Breast cancer stage of the patient
- Histology: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Mucinous Carcinoma
- ER status: Positive/Negative
- PR status: Positive/Negative
- HER2 status: Positive/Negative
- Surgery_type: Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy, Other
- DateofSurgery: The date of surgery
- DateofLast_Visit: The date of the last visit of the patient
- Patient_Status: Alive/Dead
Using this dataset, our task is to predict whether a breast cancer patient will survive after surgery. The dataset was collected from Kaggle, where it is available for download. In the next section, we will walk through the process of predicting breast cancer survival with machine learning using Python.
Breast Cancer Survival Prediction using Python
We will begin the task of breast cancer survival prediction by importing the necessary Python libraries and loading the required dataset:
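As a minimal sketch of this step, the snippet below loads the data with pandas. The three-row sample is made up so that the snippet runs on its own, and only a few of the columns are included; with the real data, you would pass the path of the CSV file you downloaded to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample that stands in for the downloaded file.
# With the real data, replace this with: data = pd.read_csv("<path to your CSV>")
sample_csv = io.StringIO(
    "Patient_ID,Age,Gender,Tumor_Stage,Patient_Status\n"
    "P001,36,FEMALE,III,Alive\n"
    "P002,43,FEMALE,II,Dead\n"
    "P003,69,MALE,II,Alive\n"
)

data = pd.read_csv(sample_csv)
print(data.head())
```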
We now proceed to examine whether any of the columns in this dataset contain null values.
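One way to do this, assuming the data is in a pandas DataFrame, is to count the missing values per column with `isnull().sum()`. The small DataFrame here is a made-up stand-in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset, with a few missing values.
data = pd.DataFrame({
    "Age": [36, 43, np.nan, 56],
    "Gender": ["FEMALE", "FEMALE", "MALE", None],
    "Patient_Status": ["Alive", "Dead", "Alive", None],
})

# Number of missing values in each column.
null_counts = data.isnull().sum()
print(null_counts)
```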
Since this dataset contains null values in several columns, we will remove the rows that contain them.
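A minimal sketch of this cleanup, again on a made-up DataFrame, uses `dropna()`, which drops every row that has at least one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset, with a few missing values.
data = pd.DataFrame({
    "Age": [36, 43, np.nan, 56],
    "Gender": ["FEMALE", "FEMALE", "MALE", None],
    "Patient_Status": ["Alive", "Dead", "Alive", None],
})

# Keep only the rows in which every column has a value.
cleaned = data.dropna()
print(len(data), "rows before,", len(cleaned), "rows after")
```

Dropping rows is the simplest option; it is reasonable when only a small share of the rows is affected.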
Now, let’s take a closer look at the insights from the dataset’s columns.
Since breast cancer is most commonly found in females, let’s check the Gender column to see how many female and male patients are included.
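A simple way to check this is `value_counts()` on the Gender column. The values below (including the upper-case labels) are invented for illustration and may not match the real file.

```python
import pandas as pd

# Hypothetical Gender values; the real labels may be spelled differently.
data = pd.DataFrame({"Gender": ["FEMALE", "FEMALE", "MALE", "FEMALE"]})

# Number of patients per gender.
gender_counts = data["Gender"].value_counts()
print(gender_counts)
```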
As expected, the dataset shows more females than males in the Gender column. Now, let’s look at the tumor stages of the patients.
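The same idea works for the tumor stages. In this sketch the stages are assumed to be stored as the strings "I", "II" and "III", and the rows are made up:

```python
import pandas as pd

# Hypothetical stage values, assumed to be stored as Roman numerals.
data = pd.DataFrame({"Tumor_Stage": ["II", "III", "II", "I", "II"]})

# Count the patients per stage and list the stages in order.
stage_counts = data["Tumor_Stage"].value_counts().sort_index()
print(stage_counts)
```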
Most patients in the dataset are in the second stage of breast cancer. Next, let’s explore the histology of the patients. Histology describes a tumour based on how abnormal the cancer cells and tissue look under a microscope and how quickly the cancer is likely to grow and spread.
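For the histology column, it can be handy to look at shares rather than raw counts. This sketch uses `value_counts(normalize=True)` on five invented rows:

```python
import pandas as pd

# Hypothetical histology values taken from the categories listed earlier.
data = pd.DataFrame({
    "Histology": [
        "Infiltrating Ductal Carcinoma",
        "Infiltrating Ductal Carcinoma",
        "Infiltrating Lobular Carcinoma",
        "Infiltrating Ductal Carcinoma",
        "Mucinous Carcinoma",
    ]
})

# Fraction of patients in each histology category.
histology_share = data["Histology"].value_counts(normalize=True)
print(histology_share)
```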
Now, let’s check the values of ER status, PR status, and HER2 status for the patients.
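Since these are three columns of the same kind, one option is to loop over them. The rows below are made up, and the column names are assumed to match the list at the top of the article:

```python
import pandas as pd

# Hypothetical receptor status values for three patients.
data = pd.DataFrame({
    "ER status": ["Positive", "Positive", "Positive"],
    "PR status": ["Positive", "Positive", "Positive"],
    "HER2 status": ["Negative", "Negative", "Positive"],
})

status_columns = ["ER status", "PR status", "HER2 status"]

# Count the values of each status column.
status_counts = {col: data[col].value_counts() for col in status_columns}
for col, counts in status_counts.items():
    print(col, counts.to_dict())
```

A column that only ever takes one value carries no information for a model, so it is worth checking for that here.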
Now, let’s take a look at the different types of surgeries the patients underwent.
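As before, a quick count per category gives the overview. The rows are invented; the category names come from the column description above.

```python
import pandas as pd

# Hypothetical surgery types for five patients.
data = pd.DataFrame({
    "Surgery_type": [
        "Lumpectomy",
        "Other",
        "Simple Mastectomy",
        "Other",
        "Modified Radical Mastectomy",
    ]
})

# Number of patients per type of surgery.
surgery_counts = data["Surgery_type"].value_counts()
print(surgery_counts)
```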
We’ve now explored the dataset and noticed that it contains many categorical features. To train a machine learning model, we need to convert these categorical values into a suitable format. Here’s how we can transform them:
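One possible approach, shown here as a sketch, is to map each category to an integer with a dictionary per column. The category spellings and the integer codes are assumptions chosen for illustration, and only four of the categorical columns are shown.

```python
import pandas as pd

# Hypothetical rows; the real category spellings may differ.
data = pd.DataFrame({
    "Gender": ["FEMALE", "MALE", "FEMALE"],
    "Tumor_Stage": ["I", "II", "III"],
    "ER status": ["Positive", "Positive", "Negative"],
    "Surgery_type": ["Lumpectomy", "Other", "Simple Mastectomy"],
})

# One dictionary per column: category -> integer code (codes are arbitrary).
mappings = {
    "Gender": {"MALE": 0, "FEMALE": 1},
    "Tumor_Stage": {"I": 1, "II": 2, "III": 3},
    "ER status": {"Negative": 0, "Positive": 1},
    "Surgery_type": {
        "Other": 1,
        "Modified Radical Mastectomy": 2,
        "Lumpectomy": 3,
        "Simple Mastectomy": 4,
    },
}

# Replace the text categories with their integer codes.
for column, mapping in mappings.items():
    data[column] = data[column].map(mapping)

print(data)
```

Note that `map` turns any value missing from the dictionary into NaN, so it is worth checking for nulls again after this step.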
Breast Cancer Survival Prediction Model
Now, let’s move on to building a machine learning model to predict whether a breast cancer patient will survive. Before training, we need to split the dataset into training and testing sets.
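A minimal sketch of the split, assuming scikit-learn is installed, uses `train_test_split`. The ten rows, the two feature columns, the 20% test share and the random seed are all placeholders picked for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, already-encoded data for ten patients.
data = pd.DataFrame({
    "Age": np.arange(30, 40),
    "Tumor_Stage": [1, 2, 3, 2, 2, 1, 3, 2, 2, 1],
    "Patient_Status": ["Alive", "Dead"] * 5,
})

# Features (x) and the label we want to predict (y).
x = data[["Age", "Tumor_Stage"]].to_numpy()
y = data["Patient_Status"].to_numpy()

# Hold out 20% of the rows for testing; the seed makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```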
Here’s how we can go about training a machine learning model.
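The sketch below assumes a support vector classifier (scikit-learn's `SVC`) purely as an example; any scikit-learn classifier could be swapped in. The data is artificial, and the rule that generates the labels exists only so that the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Artificial data for 40 hypothetical patients: two numeric features.
ages = np.arange(30, 70)
protein1 = np.linspace(-1.0, 1.0, 40)
x = np.column_stack([ages, protein1])

# Toy labelling rule, not a medical claim: it only makes the example runnable.
y = np.where(ages >= 50, "Dead", "Alive")

# Stratify so that both classes appear in the training and the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the classifier on the training set.
model = SVC()
model.fit(x_train, y_train)

# Share of correct predictions on the held-out test set.
accuracy = model.score(x_test, y_test)
print("Test accuracy:", accuracy)
```

On the real dataset it is worth comparing this score with the share of the majority class, since a model that always predicts "Alive" can look accurate when most patients survive.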
Now, let’s pass in values for all the features the model was trained on and predict whether a patient will survive breast cancer.
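As a sketch of this step, the snippet below trains a classifier on artificial data and then predicts the status of one new, made-up patient. The important detail is that the new row must contain the same features, in the same order, as the training data. `SVC` and the two features are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Artificial training data: columns are [Age, Protein1].
ages = np.arange(30, 70)
protein1 = np.linspace(-1.0, 1.0, 40)
x = np.column_stack([ages, protein1])

# Toy labelling rule so that the example runs; not a medical claim.
y = np.where(ages >= 50, "Dead", "Alive")

model = SVC()
model.fit(x, y)

# One hypothetical new patient, with features in the training order.
new_patient = np.array([[36, -0.5]])
prediction = model.predict(new_patient)
print("Predicted status:", prediction[0])
```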
Conclusion
This project demonstrates how machine learning can help predict breast cancer survival, offering doctors and patients insights that can support better decision-making. The techniques applied here (data preprocessing, model training, and evaluation) are not limited to healthcare. Similar approaches can be extended to other real-world challenges, such as food delivery time prediction, where accurate forecasts can improve efficiency and user satisfaction. By applying predictive analytics across different domains, we can use data science to solve problems that directly affect everyday life.