In this guest post, Ankit Dixit, a deep learning expert, talks about breast cancer identification using machine learning.
Disease Identification: Breast Cancer
In its simplest form, cancer can be defined as the uncontrolled growth of abnormal cells; when it starts in the breast, it is known as breast cancer. Worldwide, breast cancer is the fifth most common cause of cancer death, after lung, stomach, liver, and colon cancer. In 2005, breast cancer caused 502,000 deaths worldwide (7% of cancer deaths; almost 1% of all deaths).
In the United States, breast cancer is the most common cancer in women, and the second most common cause of cancer death in women (after lung cancer). In 2007, breast cancer caused about 40,910 deaths (7% of cancer deaths; almost 2% of all deaths) in the U.S. Women in the United States have a 1 in 8 chance of developing breast cancer in their lifetime and a 1 in 33 chance of dying from it.
Breast cancer incidence has risen markedly since the 1970s, a trend often attributed to modern Western lifestyles. Because the breast is composed of the same tissues in males and females, breast cancer also occurs in males, though it is less common.
With that introduction, you can see how important it is to detect such diseases at an early stage, so that doctors can stop the disease from spreading and start the right treatment, potentially saving many lives each year.
To reduce the complexity of the problem, we will identify the type of tumor rather than detect the disease itself: we will classify each input instance as malignant or benign. A benign tumor does not invade its surrounding tissue or spread throughout the body, whereas a malignant tumor may invade its surrounding tissue.
You can download the data set from the UCI Machine Learning Repository. The data is free to use, but it requires some pre-processing. First, it doesn’t come in any specific file format, so you need to copy and paste it from your browser into a text file. Second, the target variable is categorical: M for malignant and B for benign.
I expect that you will copy the data into a text file and save it in .csv format, so that we can use a Python library to read the CSV and load the data set in matrix form. To help you, I have put the pre-processed CSV file in the following GitHub repo:
You can easily download it from there.
Let’s jump straight into the code: create a Python file named BreastCancerIdentification.py and start by importing the libraries we need.
Let’s talk briefly about pandas before proceeding further. Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.
We will work with the pandas data frame, which lets us store a data set in its original form. By original form, I mean that if our data set contains non-numerical values, we need not worry about fitting the data matrix into a NumPy array, and we can keep the attribute names in the data frame for better readability and understanding. Apart from that, we can apply algebraic operations and indexing to the data frame directly, without converting it into NumPy arrays.
Now that we have imported the libraries, we will fetch the data from the repo using the TensorFlow utility and load it into a variable using the read_csv() function. We will do it in the following steps:
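The loading step can be sketched roughly as follows. This is only a minimal illustration: it reads a few made-up rows from an in-memory string instead of downloading the real file, and it calls pandas directly rather than going through the TensorFlow download utility. The column names (id, diagnosis, radius_mean, texture_mean) are assumptions about the CSV header, and the numeric values are placeholders, not real measurements.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few made-up rows, NOT real measurements.
# In practice you would pass the path of your own CSV to pd.read_csv instead.
csv_text = """id,diagnosis,radius_mean,texture_mean
1,M,18.0,10.4
2,B,12.5,17.9
3,M,20.6,21.3
4,B,11.4,14.2
"""

data = pd.read_csv(io.StringIO(csv_text))

# Add a numeric copy of the target: 1 for malignant (M), 0 for benign (B).
data["diagnosis_numerical"] = data["diagnosis"].map({"M": 1, "B": 0})

print(data.head())
```

Keeping both the categorical column and its numeric copy makes it easy to plot by class label while still having a numeric target available for a classifier.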
Let’s see the dimensions of our data:
So we can see here that there are 33 attributes in 569 instances; one of the attributes is the output variable, that is, the class of the corresponding instance. Let’s see the features present in the data set:
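Both checks, the dimensions and the attribute names, can be sketched like this. The frame below is a tiny stand-in with assumed column names and placeholder values, so it prints 3 rows and 4 columns rather than the 569 and 33 of the real data set.

```python
import pandas as pd

# Toy frame standing in for the loaded data set (values are placeholders).
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [18.0, 12.5, 11.4],
    "texture_mean": [10.4, 17.9, 14.2],
})

# shape is a (number of instances, number of attributes) tuple.
n_rows, n_cols = data.shape
print(n_rows, n_cols)

# columns holds the attribute names in file order.
print(list(data.columns))
```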
After executing the above line, we get the following features as a result:
As you can see, these are cellular features, that is, measurements of each cell such as its area, diameter, texture, and fractal dimension.
Let’s select some of the features from the above data set and see what values are present there.
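One way to do this, as a rough sketch: shuffle the rows, keep a handful of columns, and look at the first ten. The data here is randomly generated, and the feature names are assumed examples of the data set's columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the data set: 50 random rows, assumed column names.
rng = np.random.default_rng(0)
features = ["radius_mean", "texture_mean", "perimeter_mean",
            "area_mean", "smoothness_mean", "compactness_mean"]
data = pd.DataFrame(rng.random((50, len(features))), columns=features)
data.insert(0, "diagnosis", rng.choice(["M", "B"], size=50))

# Shuffle all rows (frac=1 keeps every row), then keep six feature columns
# and the first ten shuffled instances.
shuffled = data.sample(frac=1, random_state=42)
subset = shuffled[features].head(10)

print(subset)
```

Fixing random_state makes the shuffle reproducible, so the same ten instances appear on every run.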
As you can see, we have selected six features for 10 random instances to look at the numeric values. We have selected this subset because it is difficult to show all the variables on one page. Let’s see how it looks:
The leftmost column of the above feature table is the instance number, where you can clearly see the effect of shuffling. We have shuffled the data set so that we can see values for both classes.
But do you think this is a good way to explore the data set? Certainly not; you need to do a proper exploratory data analysis (EDA) to visualize the feature relationships with the output class.
Exploratory Data Analysis (EDA)
But why do we need to do EDA on our data set? If you’re working with image or text-based data, you can understand it simply by looking at the instances, but if you have only numerical instances in hand (and with many dimensions at that), it is nearly impossible for the human mind to grasp the data set. To understand it well, we need to plot these values.
Let’s start with plotting the class distribution in the data set. We will use seaborn for creating the plots and matplotlib for showing those plots.
We will first extract the output variable from pandas data frame and then use seaborn to create a count plot; matplotlib will help us to visualize it.
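A minimal sketch of this step, assuming the target column is named diagnosis; the five labels below are placeholders, so the counts will not match the real data set.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for the loaded data set; only the target column matters here.
data = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

y = data["diagnosis"]        # output variable
counts = y.value_counts()    # number of instances per class
print(counts)

# Draw one bar per class on a matplotlib axis.
fig, ax = plt.subplots()
sns.countplot(x="diagnosis", data=data, ax=ax)
ax.set_title("Class distribution")
# plt.show()  # uncomment to display the figure in an interactive session
```

Printing value_counts() alongside the plot gives the exact numbers behind each bar.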
After execution, we get:
Okay, there are 357 cases of the benign class and 212 cases of the malignant class. But can we understand how each feature of our data set behaves for both classes? Yes: we can visualize the distribution of each feature with seaborn violin plots. What do we gain from them? Violin plots help us visualize, in a single plot, statistical measures such as the centre of the distribution (the median) and the spread of each feature for the different classes.
But before plotting these values, we need to remove both diagnosis columns (diagnosis and diagnosis_numerical) as well as the instance ID from our data set. The following lines prepare the attributes for the violin plot:
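Dropping those columns could look something like the following. The frame is a toy stand-in, and the name id for the instance ID column is an assumption.

```python
import pandas as pd

# Toy stand-in with the three non-feature columns named in the text
# (the "id" column name is an assumption) plus two placeholder features.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "diagnosis_numerical": [1, 0, 0],
    "radius_mean": [18.0, 12.5, 11.4],
    "texture_mean": [10.4, 17.9, 14.2],
})

y = data["diagnosis"]                                              # keep the class labels
x = data.drop(columns=["id", "diagnosis", "diagnosis_numerical"])  # features only

print(x.head())
```

Note that drop() returns a new frame, so data itself still holds every column.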
After the execution of the above lines, we get the modified feature set:
Now, before plotting the attributes, we need to normalize the data set using each feature’s mean and standard deviation; this puts every feature on a comparable scale, centred at 0 with a standard deviation of 1.
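As a sketch of what scaling by mean and standard deviation does (the values are placeholders): each feature ends up centred at 0 with unit standard deviation. This is not the same as a strict 0-to-1 range, which would need min-max scaling; that is shown alongside for comparison.

```python
import pandas as pd

# Placeholder feature values, not real measurements.
x = pd.DataFrame({
    "radius_mean": [18.0, 12.5, 20.6, 11.4],
    "texture_mean": [10.4, 17.9, 21.3, 14.2],
})

# Standardization (z-score): subtract the column mean, divide by the column std.
x_std = (x - x.mean()) / x.std()

# Min-max scaling, for comparison: maps each column onto the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_std.round(2))
print(x_minmax.round(2))
```

Either scaling makes features with very different units comparable on a single violin plot.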
Now that the data set is scaled for plotting, our next task is to reshape it so that the column for the class variable stays the same but all other features are arranged row-wise instead of column-wise. I know this is quite difficult to picture, so let me illustrate it with an example.
Our original data looks something like this:
And we want to convert it into the following form:
Do you get my point? As you can see, we have arranged all the features against the output class; this conversion lets us plot our data as x versus y, where x is our class variable and y is our feature values. We will do this with the following line of code.
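This wide-to-long reshaping is what pandas calls melting. Here is a small sketch with two instances and two features; the values are placeholders, and the output column names features and value are arbitrary choices.

```python
import pandas as pd

# Wide form: one row per instance, one column per feature (placeholder values).
wide = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [1.2, -0.8],
    "texture_mean": [-0.5, 0.3],
})

# Long form: the class column is kept, every other column is stacked into
# (feature name, value) pairs, one pair per row.
long_form = pd.melt(wide, id_vars="diagnosis",
                    var_name="features", value_name="value")

print(wide)
print(long_form)
```

Two instances with two features become four rows, each carrying the class label, the feature name, and the value.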
We have made the required modification. We can now plot our feature relationships as follows:
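A violin plot over the reshaped data might be drawn along these lines. The data is random and the feature names are assumptions. Putting the feature names on the x axis and splitting each violin by class is one possible layout, not necessarily the one used for the figure in this post.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random stand-in for the standardized features (not real measurements).
rng = np.random.default_rng(0)
n = 40
wide = pd.DataFrame({
    "diagnosis": ["M", "B"] * (n // 2),
    "texture_mean": rng.normal(size=n),
    "perimeter_mean": rng.normal(size=n),
})
long_form = pd.melt(wide, id_vars="diagnosis",
                    var_name="features", value_name="value")

# One violin per feature, split into a half for each class;
# inner="quart" draws the quartiles (including the median) inside each half.
fig, ax = plt.subplots(figsize=(8, 5))
sns.violinplot(x="features", y="value", hue="diagnosis",
               data=long_form, split=True, inner="quart", ax=ax)
ax.tick_params(axis="x", labelrotation=90)
# plt.show()  # uncomment to display the figure in an interactive session
```

Because each violin is split by class, the two halves can be compared directly: well-separated medians hint at a discriminative feature.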
Can you see that? It has become easy to visualize the modes of each feature for both classes. Now let’s try to interpret this graph. If you look closely at the texture and perimeter features, you can see that the medians of the two class distributions differ, which suggests how significant these features could be in the classification task. These plots can also help with feature selection, where we can remove redundant features on the basis of class overlap.
A similar kind of analysis can be done with box plots too; unfortunately, this kind of analysis can’t tell us about the relationship between features. To find that out, we need to plot the correlation between different features. Correlation can help us identify redundant features: if two features have a high correlation, we can easily remove one of them.
To visualize the correlation between two features, we can use pairplot() from seaborn as follows:
Here’s the result of the above execution:
In the above plot, the Pearson correlation coefficient between these two variables is 0.86 (1 is the maximum). This means that the two variables are highly correlated and, if required, we can remove one of them.
If you enjoyed reading this article and want to learn more about AI and ML, you can explore Hands-On Artificial Intelligence with TensorFlow to implement useful techniques in ML and DL for building intelligent applications. Taking an in-depth look at the underlying concepts, Hands-On Artificial Intelligence with TensorFlow is a must-read for machine learning developers, data scientists, AI researchers, and anyone who wants to build artificial intelligence applications using TensorFlow.