
PREDICTING THE CLASS OF BREAST CANCER WITH NEURAL NETWORKS

An example of a multivariate data type classification problem using Neuroph

by Jovana Trisic, Faculty of Organisation Sciences, University of Belgrade

an experiment for Intelligent Systems course

 

Introduction

This experiment shows how neural networks and Neuroph Studio can be applied to classification problems. Several architectures will be tried out, and we will determine which of them represent a good solution to the problem and which do not.

Classification is a task that is often encountered in everyday life. A classification process involves assigning objects to predefined groups or classes based on a number of observed attributes of those objects. Although there are more traditional tools for classification, such as certain statistical procedures, neural networks have proven to be an effective solution for this type of problem. There are several advantages to using neural networks: they are data driven, they are self-adaptive, and they can approximate any function, linear as well as non-linear (which is quite important in this case because classes often cannot be separated by linear functions). Neural networks classify objects rather simply: they take data as input, derive rules based on those data, and make decisions.

 

Introducing the problem

The objective is to train a neural network to predict whether a breast cancer is malignant or benign, given the other attributes as input. The first thing needed to do that is a data set. A data set found at http://archive.ics.uci.edu/ml/ will be used here: the Wisconsin Breast Cancer Database (January 8, 1991). The database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.

The data set contains 699 instances. The first attribute is the ID of an instance, and the remaining 9 represent different characteristics of an instance. Each instance belongs to one of 2 possible classes (benign or malignant). The characteristics used in the prediction process are:

  1. Clump Thickness
  2. Uniformity of Cell Size
  3. Uniformity of Cell Shape
  4. Marginal Adhesion
  5. Single Epithelial Cell Size
  6. Bare Nuclei
  7. Bland Chromatin
  8. Normal Nucleoli
  9. Mitoses

Each attribute has the domain 1-10. The last attribute, the class, takes the values 2 and 4 (2 for benign, 4 for malignant).

The data set can be downloaded here; however, it cannot be fed to Neuroph in its original form. For it to be usable in this classification problem, we need to normalize the data first. The type of neural network that will be used in this experiment is a multi-layer perceptron with backpropagation.

 

Procedure of training a neural network

In order to train a neural network, there are six steps to be taken:

1. Normalize the data

2. Create a Neuroph project

3. Create a training set

4. Create a neural network

5. Train the network

6. Test the network to make sure that it is trained properly

 

Step 1. Normalizing the data

Any neural network must be trained before it can be considered intelligent and ready to use. Neural networks are trained using training sets, and now a training set will be created to help us with the breast cancer classification problem. As mentioned above, we first need to normalize the data.

The first attribute in our data set is just an ID number, and, as such, is not relevant for the training of the neural network. We won't include it in the training set. The other 9 attributes are integer values, and the first thing that comes to mind is to use the standard Min Max normalization formula:

B = (A - A_min) / (A_max - A_min) * (C - D) + D

where B is the standardized value, A the given value, A_min and A_max the smallest and largest values the attribute takes, and D and C determine the range in which we want our value to be. In this case, D = 0 and C = 1.

However, this formula isn't really needed here, because the range of every attribute is, conveniently, 1-10, so the data can be standardized even more easily, simply by multiplying the values by 0.1. As for the last attribute, the class of the cancer, this method cannot be used because the class only takes the values 2 and 4; a more appropriate method is to turn the two classes into two outputs. If an instance belongs to the first class, benign, the value of the first output will be 1 and the value of the second output 0. Similarly, if an instance belongs to the second class (the tumor is malignant), the first output will be 0 and the second output 1. That way, both outputs take values 0 or 1, which fits perfectly into our model where all the data are in the 0-1 range.
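As a sketch of this preprocessing step, the whole transformation can be done with a few lines of plain Java before the file is ever loaded into Neuroph Studio. The file names here are assumptions; the raw UCI file is comma-separated and also contains a few rows with missing values marked '?', which this sketch simply skips:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class NormalizeWisconsin {
        public static void main(String[] args) throws IOException {
            List<String> out = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get("breast-cancer-wisconsin.data"))) {
                if (line.isEmpty() || line.contains("?")) continue; // skip rows with missing values
                String[] cols = line.split(",");
                StringBuilder row = new StringBuilder();
                // cols[0] is the sample ID and is not relevant for training, so it is skipped.
                for (int i = 1; i <= 9; i++) {
                    // Attributes are integers in 1..10; dividing by 10 maps them into the 0-1 range.
                    row.append(Integer.parseInt(cols[i].trim()) / 10.0).append('\t');
                }
                // Class: 2 = benign, 4 = malignant, encoded as two 0/1 output columns.
                boolean benign = cols[10].trim().equals("2");
                row.append(benign ? "1\t0" : "0\t1");
                out.add(row.toString());
            }
            Files.write(Paths.get("breast-cancer-normalized.txt"), out);
        }
    }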

 

Step 2. Creating a new Neuroph project

Now that all the data are standardized, we just need to put them in a .txt file, and everything is set for the creation of a new training set. First, a new Neuroph project needs to be created by clicking the 'File' menu, and then 'New project'. The project will be called 'BreastCancerWisconsin'.

After we have clicked 'Finish', a new project is created and it will appear in the 'Projects' window, in the top left corner of Neuroph Studio.

 

Step 3. Creating a training set

Next, we need to create a new training set by right-clicking our project and selecting 'New', then 'Training set'. We give it a name, and then set the parameters. The type is chosen to be 'Supervised', because we want to minimize the prediction error through an iterative procedure. Supervised training is accomplished by giving the neural network a set of sample data along with the anticipated output for each of these samples; that sample data will be our data set. Supervised training is the most common way of training a neural network. As supervised training proceeds, the neural network is taken through a number of iterations until its output matches the anticipated output with a reasonably small error. The error rate we consider small enough for the network to count as well trained is set just before the training starts; usually, that number will be around 0.01.

Next, we set the number of inputs, which is clearly 9, because there are 9 input attributes, and the number of outputs, which is 2, as explained above.

After clicking 'Next', we need to edit the training set table. In this case, we will not type the table in ourselves, but rather click 'Load from file' to select a file from which the table will be loaded. We click 'Choose file', find the file with the data we need, and then select a values separator. In this case it is a tab, but it can also be a space, comma, or semicolon.

Then we click 'Next', and a window representing our table of data will appear. We can check that everything is in order: there are 9 input and 2 output columns, and all the data are in the range 0-1. We can now click 'Finish'.
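For reference, the same training set can also be built programmatically with the Neuroph library itself. This is a sketch against the assumed Neuroph 2.x API (exact class names and signatures may differ between versions), using the tab-separated file produced in Step 1:

    import org.neuroph.core.data.DataSet;

    public class CreateTrainingSet {
        public static void main(String[] args) {
            // 9 input columns, 2 output columns, tab as the value separator,
            // exactly as chosen in the training set wizard.
            DataSet trainingSet = DataSet.createFromFile(
                    "breast-cancer-normalized.txt", 9, 2, "\t");
            trainingSet.setLabel("TrainingSet1");
            trainingSet.save("TrainingSet1.tset"); // persist it, as the Studio does
        }
    }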

Our newly created training set will appear in the 'Projects' window. After completing this, everything is ready for the creation of neural networks. We will create several neural networks, all with different sets of parameters, and determine which is the best solution for our problem by testing them. This is the reason why there will be several options for steps 4, 5 and 6.

 

Training attempt 1

Step 4.1 Creating a neural network

The first neural network we will test will be called NeuralNetwork1. We create it by right-clicking our project in the 'Projects' window and then clicking 'New' and 'Neural Network'. A wizard will appear where we set the name and the type of the network. 'Multi Layer Perceptron' is selected. The multi-layer perceptron is the most widely studied and used neural network classifier. It is capable of modeling complex functions, it is robust (good at ignoring irrelevant inputs and noise), and it can adapt its weights and/or topology in response to changes in the environment. Another reason to use this type of network is simply that it is very easy to use: it offers a black-box point of view and can be used with little knowledge about the relationship of the function to be modeled.

When we have selected the type of the perceptron, we can click 'Next'. A new window will appear, where we set some more parameters that are characteristic of the multi-layer perceptron. The number of input and output neurons is the same as the number of inputs and outputs in the training set. However, now we have to select the number of hidden layers and the number of neurons in each layer. Guided by the rule that problems requiring two hidden layers are rarely encountered (and that there is currently no theoretical reason to use neural networks with more than two hidden layers), we decide on a single hidden layer. As for the number of units in that layer, since one should use only as many neurons as are needed to solve the problem, we choose as few as we can for the first experiment: only one. Networks with many hidden neurons can represent functions of any shape, but this flexibility can cause the network to learn the noise in the data. This is called 'overtraining'.

We check 'Use Bias Neurons' and choose the sigmoid transfer function (because the range of our data is 0-1; had it been -1 to 1, we would have checked 'Tanh'). As the learning rule we choose 'Backpropagation with Momentum'. This learning rule will be used in all the networks we create, because backpropagation is the most commonly used training technique and is well suited to this type of problem. In this method, the objects in the training set are given to the network one by one in random order, and the weights are updated each time to make the current prediction error as small as possible. This process continues until the weights converge. We also add an extra term, momentum, to the standard backpropagation formula in order to improve the efficiency of the algorithm.
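In code, the same network could be assembled roughly as below (again a sketch against the assumed Neuroph 2.x API); the layer sizes 9, 1, 2 mirror the choices made in the wizard, and bias neurons are used by default:

    import org.neuroph.nnet.MultiLayerPerceptron;
    import org.neuroph.nnet.learning.MomentumBackpropagation;
    import org.neuroph.util.TransferFunctionType;

    public class CreateNetwork {
        public static void main(String[] args) {
            // 9 inputs, one hidden layer with a single neuron, 2 outputs;
            // sigmoid transfer function because all our data lie in the 0-1 range.
            MultiLayerPerceptron network =
                    new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 9, 1, 2);
            // Backpropagation with momentum, as selected in the wizard.
            network.setLearningRule(new MomentumBackpropagation());
            network.save("NeuralNetwork1.nnet");
        }
    }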

Next, we click 'Finish', and the first neural network that we will test is complete.

Step 5.1 Training the neural network

Now we need to train the network using the training set we have created. We select the training set and click 'Train'. A new window opens where we set the learning parameters. The maximum error will be 0.03, and the learning rate and momentum will both be 0.4. The learning rate is basically the size of the 'steps' the algorithm takes when minimizing the error function in the iterative process. We click 'Train' and see what happens.
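The equivalent of this training dialog in code might look like the following sketch (assumed Neuroph 2.x API, file names carried over from the earlier sketches):

    import org.neuroph.core.NeuralNetwork;
    import org.neuroph.core.data.DataSet;
    import org.neuroph.nnet.learning.MomentumBackpropagation;

    public class TrainNetwork {
        public static void main(String[] args) {
            NeuralNetwork network = NeuralNetwork.createFromFile("NeuralNetwork1.nnet");
            DataSet trainingSet = DataSet.load("TrainingSet1.tset");

            // The learning parameters set in the training dialog.
            MomentumBackpropagation rule = (MomentumBackpropagation) network.getLearningRule();
            rule.setMaxError(0.03);
            rule.setLearningRate(0.4);
            rule.setMomentum(0.4);

            network.learn(trainingSet); // blocks until the total error falls below maxError
            network.save("NeuralNetwork1.nnet");
        }
    }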

A graph will appear where we can see how the error changes in every iteration. The error is minimized very quickly, after only 3 iterations, but that is expected because we set the learning rate rather high.

    

 

Step 6.1 Testing the neural network

After the network is trained, we click 'Test' in order to see the total error and all the individual errors. The results show that the total error is 0.028729293065036215. That is not a perfect result, because the objective is to have all individual errors under 0.01. After examining all the errors, we notice that there are not many extreme values; most of them are under 0.08. However, one pair of error values of -0.91; 0.9085 is found, and another of -0.9173; 0.9159, and that is a problem for the success of this neural network.
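The 'Test' button essentially runs every row of the set through the network and aggregates the squared output errors. A hand-rolled equivalent (a sketch; the exact normalization Neuroph uses for its reported total error may differ) makes it easy to print the extreme individual errors mentioned above:

    import org.neuroph.core.NeuralNetwork;
    import org.neuroph.core.data.DataSet;
    import org.neuroph.core.data.DataSetRow;

    public class TestNetwork {
        public static void main(String[] args) {
            NeuralNetwork network = NeuralNetwork.createFromFile("NeuralNetwork1.nnet");
            DataSet testSet = DataSet.load("TrainingSet1.tset");

            double sumSquaredError = 0;
            for (DataSetRow row : testSet.getRows()) {
                network.setInput(row.getInput());
                network.calculate();
                double[] actual = network.getOutput();
                double[] desired = row.getDesiredOutput();
                for (int i = 0; i < desired.length; i++) {
                    double error = desired[i] - actual[i];
                    sumSquaredError += error * error;
                    if (Math.abs(error) > 0.8) { // flag the extreme individual errors
                        System.out.println("Large error " + error + " on row: " + row);
                    }
                }
            }
            System.out.println("Mean squared error: " + sumSquaredError / testSet.size());
        }
    }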

The final part of testing this network is testing it with several input values. To do that, we will select 5 random input values from our data set. Those are:

number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | benign | malignant
1. | 0.8 | 0.7 | 0.5 | 1 | 0.7 | 0.9 | 0.5 | 0.5 | 0.4 | 0 | 1
2. | 0.3 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 1 | 0
3. | 0.7 | 0.1 | 0.2 | 0.3 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 | 1 | 0
4. | 0.2 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.3 | 0.1 | 0.1 | 1 | 0
5. | 0.5 | 0.7 | 0.4 | 0.1 | 0.6 | 0.1 | 0.7 | 0.3 | 0.1 | 0 | 1

The outputs the neural network produced for these inputs are, respectively:

number | benign | malignant
1. | 0.07716655057332161 | 0.9214471413224766
2. | 0.9793454117109439 | 0.021508997249983227
3. | 0.9764503543388439 | 0.024490109709635915
4. | 0.9792398863490355 | 0.02161773749116436
5. | 0.12013777604294436 | 0.878308923899619

The network guessed correctly in all five instances. After this test, we can conclude that this solution does not need to be rejected: it gives good results in most cases. However, after the error report Neuroph gave, it is clear that in some cases the network will be wrong. That is why we will try to create a better solution. The logical way to go is to train the network again, with the same parameters except the maximum error; this time we will set it to 0.01.

After clicking 'Train', the iterative process of minimizing the error function starts. The error function minimum drops dramatically in the first 10 iterations, but struggles for the next several thousand. Finally, it stops at iteration 6140.

In the end, after testing this training, we realize that the error still hasn't dropped below 0.01. Now it is 0.021152557835284996, which is no significant improvement over the previous attempt, and there are still three or four error values above 0.9.

We can only conclude that the chance for improvement lies in another approach: we should try to increase the number of hidden neurons.

 

Training attempt 2

Step 4.2 Creating a neural network

In our second attempt to create the optimal neural network architecture for the problem, we pay more attention to the number of neurons in the hidden layer. There is no exact formula that would calculate that number for every possible problem; there are only some rules and guidelines. For example, as stated above, more neurons make the network more flexible, but also make it more sensitive to noise. So the answer is to have just enough neurons to solve the problem appropriately, but no more. Some of the rule-of-thumb methods for determining the number of hidden neurons are listed below (a quick calculation of all three for our network follows the list):

  • The number of hidden neurons should be between the size of the input layer and the size of the output layer - in this case, the number should be between 2 and 9
  • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer - applied to this problem: 2/3 * 9 + 2 = 8
  • The number of hidden neurons should be less than twice the size of the input layer - here, less than 18
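A trivial check of all three heuristics for our 9-input, 2-output network:

    public class HiddenNeuronHeuristics {
        public static void main(String[] args) {
            int inputs = 9, outputs = 2;
            // Rule 1: between the output layer size and the input layer size.
            System.out.println("between " + outputs + " and " + inputs);
            // Rule 2: 2/3 of the input layer size plus the output layer size.
            System.out.println("2/3 * 9 + 2 = " + Math.round(2.0 / 3 * inputs + outputs));
            // Rule 3: less than twice the input layer size.
            System.out.println("less than " + (2 * inputs));
        }
    }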

Following these rules, we now settle on a neural network with 8 hidden neurons in one hidden layer. Again, we type in the standard number of inputs and outputs, check 'Use Bias Neurons', choose the Sigmoid transfer function, and select 'Backpropagation with Momentum' as the learning rule.

Step 5.2 Training the network

Now the neural network that will serve as our second solution to the problem has been created. Just like the previous neural network, we train this one with the training set we created before. We select 'TrainingSet1', click 'Train', and a new window appears asking us to fill in the parameters. This time, since there are more neurons in the hidden layer, we can set the maximum error to 0.01. We do not limit the maximum number of iterations. As for the learning parameters, the learning rate will be 0.2 and the momentum 0.7. After we click 'Train', the iteration process starts. After the initial drop, the error function minimum slowly descends until iteration 361, when it finally stops.

Step 6.2 Testing the network

Now we can test the network. We click 'Test' and wait for the results. The test results show the difference between the guess made by the network and the correct output value. The test showed that the total mean square error is 0.008335250564192133. That is a better result than we got in Training attempt 1. But now we need to examine all the individual errors for every single instance and check whether there are any extreme values.

After examining all the errors in the test results, we find that there is one error value above 0.9, one above 0.8, and one above 0.7. All the other errors are very low, and a large majority are below 0.01. The neural network made 3 errors out of 699 instances. After calculating the percentage (3 * 100 / 699), we get an error rate of roughly 0.43%, and given that, statistically, the allowable error is 5%, we can conclude that this neural network passed the test smoothly.

The only thing left is to put the random inputs stated above into the neural network. The result of the test are shown in the table. The network guessed right in all five cases.

number | benign | malignant
1. | 0.01459664110747588 | 0.985790584886506
2. | 0.9999999998361511 | 1.5973288826440708E-10
3. | 0.9994025163762201 | 6.442407679665791E
4. | 0.9999999990099195 | 9.687590530679986E
5. | 0.001126039504163212 | 0.9988638705178747

This proves that following rule-of-thumb methods for calculating the number of hidden neurons is a good way to start when deciding on parameters of a neural network.

Training attempt 3

Step 5.3 Training the network

We can try to change the parameters and get a better result. This time, we train the network the same way, but lower the momentum to 0.5. The graph that appeared after clicking 'Train' showed that the error function reached its minimum at the 320th iteration.

Step 6.3 Testing the network

After training, we click 'Test' to see if this set of parameters made the training better. Apparently, it did not. The total mean square error is higher, and the network made 10 wrong predictions. We did not succeed in getting a better result.

Training attempt 4

Steps 5.4 and 6.4

We will try the next set of parameters: this time we set the learning rate higher, to 0.4, and the momentum even lower, to 0.3. The error function minimum graph looks similar to the previous one, and the minimum is reached at iteration 384. The test of the network showed even worse results than the previous example: the total error is higher, and the number of wrong predictions is higher as well; it is now 20. We can conclude that the first set of parameters in this solution was the best choice for this network.

Training attempt 5

Step 4.5 Creating a neural network

This solution will try to give the same or better results than the previous one while using fewer hidden neurons. The rule is to use as few neurons as we can, so we will try to lower the number of hidden units from 8 to 5. We create a new neural network and set the number of hidden units to five. All other parameters are the same as in the previous solution.

Step 5.5 Training the network

After creating it, we need to train the network. We select TrainingSet1 and click 'Train'. This time, the learning rate will be set to 0.2, and the momentum a little lower than in Training attempt 2: its value will be 0.4. We leave 'Limit max iterations' unchecked and 'Display Error Graph' checked. After clicking 'Train', the total network error graph appears. It shows that the error reached the desired value (under 0.01) at the 317th iteration. We also see that after a rapid drop in the first few iterations, the error slowly descended over the remaining three hundred iterations.

Step 6.5 Testing the network

After training the network, we click 'Test' so that we can see how well the network was trained. A tab showing the results will appear. We can see that the total error is 0.0094071381988423, which is below 0.01, as we wanted. But other than the total error, we need to examine all the individual errors. Again, there is one value above 0.9, and it is for the same instance as in Training attempt 2, the one with attribute values 0.8, 0.4, 0.4, 0.5, 0.4, 0.7, 0.7, 0.8, 0.2. Other than that, the network made a few more wrong predictions, but the number of wrong guesses is still under 1% of all instances, which makes this a well-trained network.

The last thing to do is to test the network again, but this time with the five inputs we randomly selected before. As we can see in the table below, the network guessed right in all five cases, just like in training attempt 2. We can conclude that this network is as good as the one with 8 hidden neurons.

number | benign | malignant
1. | 0.0013193914348073998 | 0.9986990957002823
2. | 0.999999635037597 | 3.98630480160515E
3. | 0.9901212869679441 | 0.009915567822750443
4. | 0.999996995384561 | 3.2893588020388247E
5. | 2.486530453895989E | 0.9997557692760204

 

Training attempt 6

Step 5.6 Training the network

We will try another set of parameters. This time, we will see how the network behaves when it is trained with a higher learning rate. We set it to 0.5. The error minimization graph showed what could be expected: the number of iterations needed to find the minimum was a lot lower, only 60. That makes sense, because when we take bigger steps towards something, we get there faster. But it remains to be seen whether this sort of training is adequate.

 

Step 6.6 Testing the network

To check that, we need to test the network. Again, as expected, the total mean square error is rather high: 0.226550463973492. And none of the individual errors were below 0.3. This is one solution that we can reject immediately. We can conclude that, in this case, it is better to find the error function minimum through a high number of iterations than to try to lower that number by setting a high learning rate.

Training attempt 7

Step 5.7 Training the network

Next, we will see what happens when we set the momentum to zero. We put 0.01 as the maximum error, 0.2 as the learning rate, and 0 as the momentum. It turned out that the minimum error could not be found, even after 12000 iterations. So we then set the momentum slightly higher, to 0.05. Naturally, the number of iterations is still very high, but the minimum is found at around the 1600th iteration.

Step 6.7 Testing the network

Next, we test the network and find that the total mean square error is 0.010411286345269628. That is a reasonable error, but we still have to examine all the individual errors. After doing that, we see that the network made 9 mistakes, which is just above 1% of all instances. We can conclude that this set of parameters produced an adequate training, but it still did not give better results than the first set of parameters in this solution.

Training attempt 8

Step 4.8 Creating a neural network

In this attempt, we will try to beat the result of Training attempt 2 by considering the possibility that our problem may require 2 hidden layers. As said before, such problems are rarely encountered, but we will experiment to see if this is one of those rare problems. We create a new neural network, a multi-layer perceptron, with 2 hidden layers. The first layer will have 7 units, and the second will have 2. Hidden layers are set in Neuroph Studio by typing the number of neurons in each layer, separating those numbers with a space. All the other parameters remain the same as in the previous examples. After setting all of them, we can click 'Finish'.

Now the neural network with two hidden layers is created, and is ready to be trained and tested.
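In code, the two hidden layers are simply two extra layer sizes passed to the constructor (a sketch against the assumed Neuroph 2.x API), mirroring the '7 2' typed into the wizard; the file name is illustrative:

    import org.neuroph.nnet.MultiLayerPerceptron;
    import org.neuroph.nnet.learning.MomentumBackpropagation;
    import org.neuroph.util.TransferFunctionType;

    public class CreateTwoLayerNetwork {
        public static void main(String[] args) {
            // 9 inputs, a first hidden layer of 7 neurons, a second of 2, and 2 outputs.
            MultiLayerPerceptron network =
                    new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 9, 7, 2, 2);
            network.setLearningRule(new MomentumBackpropagation());
            network.save("TwoLayerNetwork.nnet");
        }
    }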

Step 5.8 Training the network

We select the training set we created earlier and click 'Train'. Next, we fill in the parameters. We set the maximum error to 0.01, because that is the desired value. Next, we set the learning rate to 0.2 and the momentum to 0.1. We click 'Train' and the iteration process starts. After the first iteration, the error function minimum does not undergo any dramatic changes until it finally settles at the 110th iteration. Now that the process is finished, our neural network is trained and we can test it.

Step 6.8 Testing the network

After we click 'Test', the test results show in a new tab. The results confirm that more layers do not equal a smarter neural network. The total mean square error is 0.2317652571011325, and there are a lot of individual error values above 0.7. That result is not satisfactory, because the network made a significant number of wrong predictions.

Now there is nothing left to do but test the network with the 5 random inputs shown above. The network gave the results shown in the table below. It made two wrong predictions out of five; not a good result. Even worse, it produced exactly the same output values for all 5 instances. Again, the problem of using too many hidden neurons in a multi-layer perceptron has shown itself.

number | benign | malignant
1. | 0.7317661594562251 | 0.26823384054377486
2. | 0.7317661594562251 | 0.26823384054377486
3. | 0.7317661594562251 | 0.26823384054377486
4. | 0.7317661594562251 | 0.26823384054377486
5. | 0.7317661594562251 | 0.26823384054377486

Training attempt 9

Step 5.9 Training the network

If we raise the momentum, the result is a little better, but still not as good as the one from the networks with fewer hidden neurons. We set the momentum to 0.6 and leave all other parameters the same as before. The error function minimum drops dramatically in the first few iterations, and then slowly descends until iteration 310, when it finally falls below 0.01. Now the network is trained, and we can test it.

Step 6.9 Testing the network

The total mean square error is around 0.0126, which seems like an acceptable value, but after checking all the predictions and errors the network made, we see that there are as many as 17 error values above 0.9. In total, the network made 27 errors. That is definitely not a better result than the previous architectures gave. This proved that raising the number of hidden neurons did not improve the efficiency of the neural network.

Advanced training techniques

This chapter shows another technique for training a neural network, one that involves validation and generalization. This type of training is usually used with huge data sets. The idea is to use only a part of the data set when training the network, and then test the network with inputs from the other, unused part of the data set. That way we can determine whether the neural network has the power of generalization. This type of training will be demonstrated in Training attempts 10 and 11.
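A simple way to produce such a split is to shuffle the full data set and copy the rows into two new sets. This is a sketch against the assumed Neuroph 2.x API (recent versions also ship a built-in helper for creating training and test subsets); the file names are carried over from the earlier sketches:

    import org.neuroph.core.data.DataSet;
    import org.neuroph.core.data.DataSetRow;

    public class SplitDataSet {
        public static void main(String[] args) {
            DataSet full = DataSet.load("TrainingSet1.tset");
            full.shuffle(); // randomize the row order before splitting

            DataSet training = new DataSet(9, 2);
            DataSet validation = new DataSet(9, 2);
            int trainCount = (int) (full.size() * 0.8); // e.g. an 80/20 split

            int i = 0;
            for (DataSetRow row : full.getRows()) {
                if (i++ < trainCount) {
                    training.addRow(row);
                } else {
                    validation.addRow(row);
                }
            }
            training.save("TrainingSet3.tset");
            validation.save("ValidationSet.tset");
        }
    }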

Training attempt 10

Step 3.10 Creating a training set

Now we will try changing the training set used for training the neural networks. The existing training set has 699 instances. We will now use only 10 percent of those instances and test whether the result changes. In other words, we will test whether a large number of rows is necessary to make a good training set.

When making this new training set, we use exactly the same parameters as in the first one. We set the name to 'TrainingSet2'; the type of the training set is, again, supervised, and the number of inputs and outputs remains the same: nine and two. After setting those characteristics, we choose a different file to load: the new, short version of the original data set, which contains only 69 instances. We finish creating the training set and then choose a neural network. This time, we choose the network that gave good results with the first training set: the network from Training attempt 2, which contains 8 hidden neurons in one hidden layer. We will now train that network with the newly created TrainingSet2 and observe whether the result worsens.

Step 5.10 Training the network

We open NeuralNetwork8, select TrainingSet2, and click 'Train'. The parameters we now set are the same as in Training attempt 2: the maximum error will be 0.01, the learning rate 0.2, and the momentum 0.7. We will not limit the maximum number of iterations, and we check 'Display error graph', as we want to see how the error changes throughout the iteration sequence. After clicking 'Train', a window with the graph showing the minimization process appears. We notice that the error stops changing at around the 800th iteration. That is a lot more iterations than the network needed when it was trained with TrainingSet1, the one that contains all 699 instances, but that is understandable: the network learns faster when it has more data to rely on.

Step 6.10 Testing the network

After the network is trained, we can test it and find out whether the result is as good as the one made with TrainingSet1. When the test results tab appears, we see that the total error is 0.008499272055612413, pretty close to 0.008335250564192133, the total error that Training attempt 2 produced. This result even appears better, because there are no individual errors higher than 0.5; there is only one above 0.4, and one above 0.3. But we definitely cannot take this result at face value, because it may simply mean that no problem instances were included in the ten percent of instances that TrainingSet2 contains.

To prove that this seemingly better result is not sound, we will test the network with the one problem instance that caused an error above 0.9 in Training attempt 2. That instance is given below.

Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | benign | malignant
0.8 | 0.4 | 0.4 | 0.5 | 0.4 | 0.7 | 0.7 | 0.8 | 0.2 | 1 | 0

The network produced 1.0006209410691472E as the first output and 0.9998896760851216 as the second output. Clearly, it is wrong. This proves that, while the result may appear better in the test results window, it does not hold up with real inputs. Naturally, this training attempt cannot be better than Training attempt 2. However, given the total error, and after testing the new solution with the 5 random inputs selected above, we can conclude that this solution gives almost as good results as the one with the same parameters but a bigger data set. Our network guessed correctly in all five cases, as shown below.

number | benign | malignant
1. | 0.0014505077413263282 | 0.9983944955382711
2. | 0.999999216900991 | 9.193959997089956E
3. | 0.9314886720538755 | 0.07484934572122069
4. | 0.99999998962661 | 1.296082554976374E
5. | 0.0036262131174748585 | 0.9961880361165389

Training attempt 11

Step 3.11 Creating a training set

The new training set will consist of 80% of the whole data set: 560 instances. We create this training set in exactly the same way as the previous one; all the parameters remain the same, but this time we load a new file, the one with only 560 instances. After creating the training set, we choose an existing neural network. Again, we choose the network from Training attempt 2, the one with 8 hidden neurons in one hidden layer. We will now train that network with the new training set, TrainingSet3.

Step 5.11 Training the network

We again choose the same parameters as in the previous training attempt. Then we click 'Train' and the iteration window appears. This time, it stopped after 208 iterations.

Step 6.11 Testing the network

The only thing left to do is to click 'Test' and test the network. The window with the results shows that the total mean square error is 0.00915. After carefully examining all the individual errors, we can see that the network made 5 wrong predictions, which is still around 1%. So far, this network seems properly trained.

Now it is time to test the network's generalization ability. We will test it with inputs from the remaining part of the data set, the part we did not use for training. Those inputs are:

number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | benign | malignant
1. | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 1 | 0.3 | 0.1 | 0.1 | 1 | 0
2. | 0.4 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 | 1 | 0
3. | 0.4 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.3 | 0.1 | 0.1 | 1 | 0
4. | 1 | 0.7 | 0.7 | 0.6 | 0.4 | 1 | 0.4 | 0.1 | 0.2 | 0 | 1
5. | 0.6 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.3 | 0.1 | 0.1 | 1 | 0
6. | 0.7 | 0.3 | 0.2 | 1 | 0.5 | 1 | 0.5 | 0.4 | 0.4 | 0 | 1
7. | 1 | 0.5 | 0.5 | 0.3 | 0.6 | 0.7 | 0.7 | 1 | 0.1 | 0 | 1

We test the network with each of these inputs. The results are shown in the table below: the network guessed all of them right. We can conclude that this network has a good ability to generalize, and thus the training of this network has been validated.

number | benign | malignant
1. | 0.6251001200767711 | 0.35673479500990596
2. | 0.9999534290681341 | 5.498421331773889E
3. | 0.9999298271492874 | 8.138948034149853E
4. | 0.014450516934488565 | 0.985466458943596
5. | 0.9998361147377537 | 1.7670812782834435E
6. | 0.022140831543623302 | 0.9781626958202649
7. | 0.024755338689938625 | 0.9754092161626104

Conclusion

Five different solutions tested in this experiment have shown that the choice of the number of hidden neurons is crucial to the effectiveness of a neural network. We have concluded that one layer of hidden neurons is enough in this case. The experiment also showed that the success of a neural network is very sensitive to the parameters chosen in the training process: the learning rate must not be too high, and the maximum error must not be set too low. Next, the results have shown that the total mean square error does not directly reflect the success of a network's training; it can sometimes be misleading, and the individual errors made for every input must be examined. Finally, after including only 10% of the instances in the training set, we learned that even that number can be sufficient to make a good training set and a reasonably well-trained neural network.

Below is a table that summarizes this experiment. The two best solutions for the problem are Training attempts 2 and 5.

Training attempt | Number of hidden neurons | Number of hidden layers | Training set | Maximum error | Learning rate | Momentum | Total mean square error | Number of iterations | Random inputs test - correct guesses | Network trained
1 | 1 | 1 | full | 0.03 | 0.4 | 0.4 | 0.0287 | 3 | 5/5 | yes
2 | 8 | 1 | full | 0.01 | 0.2 | 0.7 | 0.0083 | 361 | 5/5 | yes
3 | 8 | 1 | full | 0.01 | 0.2 | 0.5 | 0.0130 | 320 | / | yes
4 | 8 | 1 | full | 0.01 | 0.4 | 0.3 | 0.0109 | 384 | / | yes
5 | 5 | 1 | full | 0.01 | 0.2 | 0.4 | 0.0094 | 317 | 5/5 | yes
6 | 5 | 1 | full | 0.01 | 0.5 | 0.7 | 0.2265 | 60 | / | no
7 | 5 | 1 | full | 0.01 | 0.2 | 0.05 | 0.0104 | 1600 | / | yes
8 | 7, 2 | 2 | full | 0.01 | 0.2 | 0.1 | 0.2317 | 110 | 3/5 | no
9 | 7, 2 | 2 | full | 0.01 | 0.2 | 0.6 | 0.0126 | 310 | / | no
10 | 8 | 1 | only 10% of instances used | 0.01 | 0.2 | 0.7 | 0.0084 | 800 | 5/5 | yes
11 | 8 | 1 | only 80% of instances used | 0.01 | 0.2 | 0.7 | 0.0091 | 208 | 7/7 | yes

DOWNLOAD
Data set used in this tutorial

See also:
Multi Layer Perceptron Tutorial

 
