PREDICTING THE CLASS OF HABERMAN'S SURVIVAL WITH NEURAL NETWORKS
An example of a multivariate data type classification problem using Neuroph
by Marija Jovanovic, Faculty of Organizational Sciences, University of Belgrade
An experiment for the Intelligent Systems course
Introduction
In this experiment we will show how neural networks and Neuroph Studio are used for classification problems. We will work with several architectures, and determine which ones are good solutions to the problem and which ones are not.
Classification is a task that is often found in everyday life. A classification process involves assigning objects to predefined groups or classes based on a number of observed attributes related to those objects. Although there are more traditional tools for classification, such as certain statistical procedures, neural networks have shown to be an effective solution for this type of problem. There are a number of advantages to using neural networks - they are data driven, they are self-adaptive, and they can approximate any function, linear as well as non-linear (which is quite important in this case because groups often cannot be divided by linear functions). Neural networks classify objects rather simply - they take data as input, derive rules based on those data, and make decisions.
For better understanding of our experiment, we suggest that you first look at the links below:
Neuroph Studio - Getting started
Multi Layer Perceptron
Introducing the problem
The goal is to train the neural network to predict whether a patient survived after breast cancer surgery, when it is given the other attributes as input. The first thing needed in order to do that is a data set. A data set can be found here. The name of the data set is Haberman's Survival Dataset (March 4, 1991). The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
The data set contains 306 instances, and the number of attributes is 3. The first attribute is the age of the patient, the second is the year of the operation, and the third is the number of positive axillary nodes detected. Each instance belongs to one of 2 possible classes (the patient survived 5 years or longer, or the patient died within 5 years).
Input attributes are:
- Age of patient at time of operation (numerical)
- Patient's year of operation (year - 1900, numerical)
- Number of positive axillary nodes detected (numerical)
Output attribute is: Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
The first three attributes have values from 0 to 100. The last attribute - class, takes the values 1 and 2 (1 the patient survived, 2 the patient died).
When the data set is downloaded, it cannot be inserted into Neuroph in its original form. For it to be able to help us with this classification problem, we first need to normalize the data. The type of neural network that will be used in this experiment is a multi layer perceptron with backpropagation.
Procedure of training a neural network
In order to train a neural network, there are six steps to be taken:
- Prepare the data set
- Create a Neuroph project
- Create a training set
- Create a neural network
- Train the network
- Test the network to make sure that it is trained properly
1. Step Preparing the data set
Any neural network must be trained before it can be considered intelligent and ready to use. Neural networks are trained using training sets, and now a training set will be created to help us with the Haberman's survival classification problem. As we said, we first need to normalize the data.
Values of all attributes are integers, so we use the standard min-max normalization formula for the first three attributes (inputs):
B = (A - min(A)) / (max(A) - min(A)) * (C - D) + D
where:
B is the standardized (normalized) value
A is the given value
D and C determine the range in which we want our value to be. In this case, D = 0 and C = 1, so every input is mapped into the 0-1 range.
The last attribute, the class of survival (output), is not normalized with this method, because the attribute takes the value 1 or 2, and a more appropriate method is to turn the two classes into two outputs. That means one output column is replaced with two columns. If an instance belongs to the first class (number 1 is the output), i.e. the patient survived 5 years or longer, the value of the first output will be 0, and of the second output 1. Similarly, if an instance belongs to the second class (number 2 is the output), i.e. the patient died within 5 years, the value of the first output for that instance will be 1, and the value of the second output will be 0. That way, both of the outputs have values 0 or 1, which fits perfectly in our model where all the data are in the 0-1 range.
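For readers who prefer code, here is a minimal sketch (plain Java) of the preprocessing described above; the attribute minimum and maximum values are assumptions used only for illustration and should be read from the actual data set:

```java
import java.util.Arrays;

// Minimal sketch (plain Java) of the preprocessing described above.
// The attribute minimums and maximums are assumptions for illustration;
// in practice they are read from the actual data set.
public class HabermanPreprocessing {

    // Min-max normalization: maps a value A from [min, max] into the 0-1 range
    static double normalize(double a, double min, double max) {
        return (a - min) / (max - min);
    }

    public static void main(String[] args) {
        // Example raw instance: age 34, year of operation 66 (i.e. 1966),
        // 9 positive axillary nodes, class 1 (survived 5 years or longer)
        double age = 34, year = 66, nodes = 9;
        int survivalClass = 1;

        double[] inputs = {
                normalize(age, 30, 83),   // age of patient (assumed range)
                normalize(year, 58, 69),  // year of operation - 1900 (assumed range)
                normalize(nodes, 0, 52)   // positive axillary nodes (assumed range)
        };

        // Two output columns instead of one class column:
        // class 1 (survived) -> (0, 1), class 2 (died) -> (1, 0)
        double[] outputs = (survivalClass == 1) ? new double[]{0, 1} : new double[]{1, 0};

        System.out.println(Arrays.toString(inputs) + " -> " + Arrays.toString(outputs));
    }
}
```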
2. Step Creating a new Neuroph project
When all the data are standardized, we just need to put them in a .csv file and everything is set for the creation of a new training set. First, a new Neuroph project needs to be created by clicking on the 'File' menu, and then 'New project'.
The project will be called 'HabermanSurvival'.
When we click 'Finish', a new project is created and it appears in the 'Projects' window, in the top left corner of Neuroph Studio.
3. Step Creating a training set
Now we need to create a new training set by right-clicking our project, and selecting 'New', then 'Training set'. We give it a name, and then set the parameters. The type is chosen to be 'Supervised' training, because we want to minimize the prediction error through an iterative procedure. Supervised training is accomplished by giving the neural network a set of sample data along with the anticipated outputs for each of those samples; that sample data will be our data set. Supervised training is the most common way of training a neural network. As supervised training proceeds, the neural network is taken through a number of iterations, until its output matches the anticipated output with a reasonably small error. The error rate we find appropriate for a well-trained network is set just before the training starts; usually, that number will be around 0.01.
Next, we set the number of inputs, which is 3, because there are 3 input attributes, and the number of outputs, which is 2, as explained above.
After clicking 'Next', we need to edit the training set table. In this case we click 'Load from file' to select a file from which the table will be loaded. We click 'Choose file', find the file with the data we need, and then select a values separator. In this case it is a comma, but it can also be a space, tab, or semicolon.
Then we click 'Next', and a window that represents our table of data will appear. We can see that everything is in order - there are 3 input and 2 output columns, and all the data are in the range of 0-1 - so we can now click 'Finish'.
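The same training set can also be created in code; the sketch below assumes a Neuroph 2.9-style API and a hypothetical file name for the normalized CSV:

```java
import org.neuroph.core.data.DataSet;

// Sketch: load the normalized CSV as a supervised training set in code.
// Assumes a Neuroph 2.9-style API; the file name is hypothetical.
public class CreateTrainingSet {
    public static void main(String[] args) {
        // 3 input columns, 2 output columns, comma as the values separator
        DataSet trainingSet = DataSet.createFromFile("habermans_data_normalized.csv", 3, 2, ",");
        System.out.println("Loaded " + trainingSet.getRows().size() + " rows");
    }
}
```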
After we have done this, everything is ready for the creation of neural networks. We will create several neural networks, all with different sets of parameters, and determine which is the best solution for our problem by testing them. This is the reason why there will be several versions of steps 4, 5 and 6.
Training attempt 1
4.1 Step Creating a neural network
To create the optimal neural network architecture for the problem, we will pay particular attention to the number of neurons in the hidden layer. There is no formula that would calculate that number for every possible problem; there are only some rules and guidelines. For example, more neurons make the network more flexible, but also more sensitive to noise. So the answer is to have just enough neurons to solve the problem appropriately, but no more. Some of the rule-of-thumb methods for determining the number of neurons to use in the hidden layers are listed below (a short code sketch after the list applies them to our problem):
- The number of hidden neurons should be between the size of the input layer and the size of the output layer - in this case, the number should be between 2 and 3.
- The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer - applied to this problem: 2/3 * 3 + 2 = 4.
- The number of hidden neurons should be less than twice the size of the input layer - here, less than 6, although we can put in more neurons, since this is not a strict limit.
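A minimal sketch (plain Java) that evaluates the three rules of thumb for our problem:

```java
// Sketch: the three rules of thumb applied to this problem (3 inputs, 2 outputs)
public class HiddenNeuronHeuristics {
    public static void main(String[] args) {
        int inputs = 3, outputs = 2;
        System.out.println("Between output and input layer size: " + outputs + " to " + inputs);
        System.out.println("2/3 of input size plus output size: " + (2.0 / 3 * inputs + outputs)); // 4.0
        System.out.println("Less than twice the input size: < " + (2 * inputs));                   // < 6
    }
}
```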
The first neural network that we will test will be called NewNeuralNetwork1. We create it by right-clicking our project in the 'Projects' window, and then clicking 'New' and 'Neural Network'. We set the name and the type of the network; 'Multi Layer Perceptron' will be selected. The multi layer perceptron is the most widely studied and used neural network classifier. It is capable of modeling complex functions, it is robust (good at ignoring irrelevant inputs and noise), and it can adapt its weights and/or topology in response to environment changes. Another reason we use this type of perceptron is simply that it is very easy to use - it implements a black-box point of view, and can be used with little knowledge about the relationship of the function to be modeled.
When we have selected the type of the perceptron, we can click 'Next'. A new window will appear, where we set some more parameters that are characteristic of the multi layer perceptron. The number of input and output neurons is the same as the number of inputs and outputs in the training set. However, now we also have to select the number of hidden layers, and the number of neurons in each layer. Guided by the rule that problems requiring two hidden layers are rarely encountered (and that there is currently no theoretical reason to use neural networks with more than two hidden layers), we will decide on only one hidden layer. As for the number of units in that layer, since one should only use as many neurons as are needed to solve the problem, we will choose as few as we can for the first experiment - only one. Networks with many hidden neurons can represent functions of any shape, but this flexibility can cause the network to learn the noise in the data. This is called 'overtraining'.
We have checked 'Use Bias Neurons', and chosen the Sigmoid transfer function (because the range of our data is 0-1; had it been -1 to 1, we would have checked 'Tanh'). As the learning rule we have chosen 'Backpropagation with Momentum'. This learning rule will be used in all the networks we create, because backpropagation is the most commonly used technique and is well suited to this type of problem. In this method, the objects in the training set are given to the network one by one in random order, and the weights are updated each time in order to make the current prediction error as small as possible. This process continues until the weights converge. Also, we have chosen to add an extra term, momentum, to the standard backpropagation formula in order to improve the efficiency of the algorithm.
Next, we click 'Finish', and the first neural network which we will test is completed.
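For completeness, the same network could also be built programmatically; this is only a sketch, assuming a Neuroph 2.9-style API:

```java
import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.nnet.learning.MomentumBackpropagation;
import org.neuroph.util.TransferFunctionType;

// Sketch: the architecture of training attempt 1 - 3 inputs, one hidden layer
// with a single neuron, 2 outputs, sigmoid transfer function, backpropagation
// with momentum. Assumes a Neuroph 2.9-style API.
public class CreateNetwork {
    public static void main(String[] args) {
        MultiLayerPerceptron neuralNet =
                new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 3, 1, 2);
        neuralNet.setLearningRule(new MomentumBackpropagation());
        neuralNet.save("NewNeuralNetwork1.nnet");
    }
}
```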
If you want to see the neural network as a graph, just select 'Graph View'. The rightmost nodes in the first and second levels are the bias neurons we explained above.
5.1 Step Training the neural network
Now we need to train the network using the training set we have created. We select the training set and click 'Train'. A new window will open, where we need to set the learning parameters. The maximum error will be 0.02, the learning rate 0.3 and the momentum 0.4. The learning rate is basically the size of the 'steps' the algorithm takes when minimizing the error function in an iterative process. We click 'Train' and see what happens.
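In code, the equivalent of this dialog would be configuring the learning rule before calling learn(); again a sketch under the same Neuroph 2.9-style API assumption, reusing the hypothetical CSV file from the earlier sketch:

```java
import org.neuroph.core.data.DataSet;
import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.nnet.learning.MomentumBackpropagation;
import org.neuroph.util.TransferFunctionType;

// Sketch: training attempt 1 parameters (max error 0.02, learning rate 0.3, momentum 0.4).
// Assumes a Neuroph 2.9-style API and a hypothetical CSV file name.
public class TrainNetwork {
    public static void main(String[] args) {
        DataSet trainingSet = DataSet.createFromFile("habermans_data_normalized.csv", 3, 2, ",");
        MultiLayerPerceptron neuralNet =
                new MultiLayerPerceptron(TransferFunctionType.SIGMOID, 3, 1, 2);

        MomentumBackpropagation learningRule = new MomentumBackpropagation();
        learningRule.setMaxError(0.02);
        learningRule.setLearningRate(0.3);
        learningRule.setMomentum(0.4);
        neuralNet.setLearningRule(learningRule);

        neuralNet.learn(trainingSet); // runs until the total net error falls below the max error
        System.out.println("Trained in " + learningRule.getCurrentIteration() + " iterations");
    }
}
```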
The error graph dropped towards a horizontal asymptote immediately, after only 7 iterations. This means that the neural network quickly learned the training set (80 percent of the data set).
6.1 Step Testing the neural network
After the network is trained, we click 'Test' in order to see the total error and all the individual errors. The results show that the total error is 0.297020. This is quite a large error, and perhaps a consequence of the network learning the training set so quickly.
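Testing can be scripted in a similar way; the sketch below (same Neuroph 2.9-style API assumption, hypothetical file names) runs every row of a set through the network, accumulates the squared errors and counts how many instances were classified correctly, taking the larger of the two outputs as the predicted class. The reported value may differ slightly from the total mean square error Neuroph Studio displays, depending on how Neuroph averages the error internally:

```java
import org.neuroph.core.NeuralNetwork;
import org.neuroph.core.data.DataSet;
import org.neuroph.core.data.DataSetRow;

// Sketch: evaluate a trained network on a data set - accumulate squared errors and
// count instances whose winning output neuron matches the desired one.
// Assumes a Neuroph 2.9-style API and hypothetical file names.
public class TestNetwork {
    public static void main(String[] args) {
        NeuralNetwork neuralNet = NeuralNetwork.createFromFile("NewNeuralNetwork1.nnet");
        DataSet testSet = DataSet.createFromFile("habermans_data_normalized.csv", 3, 2, ",");

        double squaredErrorSum = 0;
        int correct = 0;
        for (DataSetRow row : testSet.getRows()) {
            neuralNet.setInput(row.getInput());
            neuralNet.calculate();
            double[] output = neuralNet.getOutput();
            double[] desired = row.getDesiredOutput();

            for (int i = 0; i < output.length; i++) {
                squaredErrorSum += Math.pow(output[i] - desired[i], 2);
            }
            // correct if the larger output is on the same side as the desired 1
            if ((output[0] > output[1]) == (desired[0] > desired[1])) {
                correct++;
            }
        }
        int n = testSet.getRows().size();
        System.out.println("Mean square error: " + squaredErrorSum / (2 * n));
        System.out.println("Correctly classified: " + correct + " / " + n);
    }
}
```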
The final part of testing this network is testing it with several input values. To do that, we will select 5 random input values from our data set.
The outputs the neural network produced for these inputs are shown in the last two columns.
| observation | age of patient | patient's year of operation | number of positive axillary nodes detected | patient survived 5 years or longer | patient died within 5 years | patient survived - obtained output | patient died - obtained output |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0 | 0.636364 | 0 | 0 | 1 | 0.212883 | 0.787116 |
| 2. | 0.264151 | 0.818182 | 0.307692 | 0 | 1 | 0.535963 | 0.464037 |
| 3. | 0.358491 | 0.454545 | 0.057692 | 0 | 1 | 0.290078 | 0.709921 |
| 4. | 0.509434 | 0.272727 | 0 | 0 | 1 | 0.236215 | 0.763785 |
| 5. | 0.811321 | 0.909091 | 0 | 0 | 1 | 0.175651 | 0.824348 |
The network guessed correctly in four of the five instances. After this test, we can conclude that this solution does not need to be rejected - it can be expected to give good results in most cases. However, after the error report Neuroph gave, it is clear that in some cases the network will be wrong.
Training attempt 1.1
Let's train the same network with some different values for the learning rate and momentum. We won't be changing the value of the max error - it remains 0.02.
If we enter 0.3 as value for learning rate and 0.5 for momentum, this is what happens:
We see that in this case the total net error is much bigger than the set value for the max error, which means that the training is not complete; after 13800 iterations the network cannot be tested.
Training attempt 1.2
Let's increase the value of learning rate to 0.4, while the value for momentum remains the same. The result is as follows:
We now see that with the increased value of the learning rate the total net error is even bigger, with an increased number of iterations. After this, we conclude that increasing the value of the learning rate leads to oscillations of the objective error function, and the network reaches a state where no useful training takes place.
Training attempt 4
4.4 Step Creating a neural network
Following these rules, we now decide on a neural network that contains 3 hidden neurons in one hidden layer. Again, we type in the standard number of inputs and outputs, check 'Use Bias Neurons', choose the Sigmoid transfer function, and select 'Backpropagation with Momentum' as the learning rule.
In this case, we chose three hidden neurons.
Graphical representation of neural network
5.4 Step Train the network
The neural network that will be used as our second solution to the problem has been created. Like the previous neural network, we will train this one with the training set we created before, with the entire sample. We select 'NewTrainingSet1', click 'Train' and a new window appears, asking us to fill in the parameters. This time, since there are more neurons in the hidden layer, we can set the maximum error to 0.01. We do not limit the maximum number of iterations. As for the learning parameters, the learning rate will be 0.2 and the momentum 0.7. After we click 'Train', the iteration process starts. The total net error falls very fast and the training stops at iteration 23 with an error of 0.002637448922.
6.4 Step Testing the network
The total mean square error measures the average of the squares of the "errors". The error is the amount by which the value implied by the estimator differs from the quantity to be estimated. A mean square error of zero, meaning that the estimator predicts observations of the parameter with perfect accuracy, is the ideal, but is practically never possible. The unbiased model with the smallest mean square error is generally interpreted as best explaining the variability in the observations. The test showed that the total mean square error is 0.16207228477883195. The goal of experimental design is to construct experiments in such a way that, when the observations are analyzed, the mean square error is close to zero relative to the magnitude of at least one of the estimated treatment effects.
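In symbols (this is the standard definition, not anything specific to Neuroph): for n output values with desired outputs y_i and obtained outputs ŷ_i,
MSE = (1/n) * Σ (ŷ_i - y_i)²
and the individual errors listed in the test results correspond to the differences ŷ_i - y_i for each output.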
After examining all the errors in the test results, we find that some error values are around 0.5 and 0.4, while all the other errors are very low - a large majority are around 0.0259.
The only thing left is to test the network with several input values. To do that, we again use the 5 random input values selected from our data set for the first network. The outputs the neural network produced for these inputs are shown in the last two columns of the table below.
| observation | age of patient | patient's year of operation | number of positive axillary nodes detected | patient survived 5 years or longer | patient died within 5 years | patient survived - obtained output | patient died - obtained output |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0 | 0.636364 | 0 | 0 | 1 | 0.10344 | 0.89639 |
| 2. | 0.264151 | 0.818182 | 0.307692 | 0 | 1 | 0.39274 | 0.60676 |
| 3. | 0.358491 | 0.454545 | 0.057692 | 0 | 1 | 0.10799 | 0.89188 |
| 4. | 0.509434 | 0.272727 | 0 | 0 | 1 | 0.11041 | 0.88954 |
| 5. | 0.811321 | 0.909091 | 0 | 0 | 1 | 0.13654 | 0.86339 |
The network guessed correctly in all five instances. After this test, we can conclude that this solution does not need to be rejected. We see that the output values approximate those found in the data set; only for the 80th sample are the values a little different from what they should be.
Training attempt 4.1
Now, using the same neural network with 3 hidden neurons, let's run a few more trainings, setting the learning rate to 0.3 and just changing the momentum.
The total net error is still too big.
Training attempt 4.2
We go back to the NewNeuralNetwork window, click the Randomize button, then click the Train button again and enter 0.5 as a new value for momentum. The result is this:
Training attempt 4.3
Let's decrease momentum once again to see what effect it has on the training. Let new momentum be 0.2:
We see that at the constant learning rate of 0.3 and with gradually decreasing momentum (0.7, 0.5, 0.2) the total net error does not change much (it is always around 0.17) and the network learns slowly, as the large number of iterations shows.
Training attempt 4.4
If we set the momentum back to the initial value of 0.7 and use a new value of 0.5 for the learning rate, the following happens:
At iteration 165 the total net error is smaller than the set max error (0.01), the training is complete and we can test the network.
6.4.4 Step Testing the network
After the network is trained, we click 'Test', in order to see the total error, and all the individual errors.
The test showed that the total mean square error is 0.18931258619161326. Looking at the individual errors, we can observe that most of them are at a low level - around 0.0289. However, there are still some cases where those errors are considerably larger (around 0.3622), which means we should try some other neural network.
Values of inputs, outputs and individual errors for 5 randomly selected observations are shown in the table below:
| observation | age of patient (input) | patient's year of operation (input) | number of positive axillary nodes detected (input) | patient survived 5 years or longer (output) | patient died within 5 years (output) | patient survived (individual error) | patient died (individual error) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0.6981 | 0.2727 | 0 | 0.0295 | 0.9763 | 0.0295 | -0.0237 |
| 2. | 0.7358 | 0.1818 | 0 | 0.3622 | 0.6323 | 0.3622 | -0.3677 |
| 3. | 0.7547 | 0.8182 | 0 | 0.0289 | 0.9769 | 0.0289 | -0.0231 |
| 4. | 0.8113 | 0.9091 | 0 | 0.0289 | 0.9769 | 0.0289 | -0.0231 |
| 5. | 0.8679 | 0.8182 | 0 | 0.0363 | 0.9703 | 0.0363 | -0.0297 |
The network guessed correctly in all five instances. After this test, we can conclude that this solution does not need to be rejected.
Training attempt 6
4.6 Creating new neural network
The next neural network will have the same number of input and output neurons, but a different number of neurons in the hidden layer - we will use 4 hidden neurons. The network is named NewNeuralNetwork3.
The neural network looks like this:
5.6 Step Training the network
We will train the network the same way, with learning rate value 0.5 and momentum 0.7, and max error 0.01.
The error function first moves almost horizontally for most of the path, then suddenly begins to oscillate, and the training stops at iteration 117.
6.6 Step Testing the network
We clicked 'Test' after the training, to see whether more neurons contribute to better training. As we can see, they do not reduce the error - it is approximately the same as in the previous networks.
Training attempt 6.1
We'll do a few more trainings to see how changing the momentum affects the training results. First, at the learning rate of 0.5 and momentum of 0.6 we have this:
Training attempt 6.2
We decrease momentum to 0.4 now:
Our conclusion is that decreasing the momentum leads to a growing number of iterations, which means that the network learns more slowly at lower momentum values.
Training attempt 10
5.10 Step Training the network
In our project HabermanSurvival, we create a new network NewNeuralNetwork4.
This neural network will contain 6 neurons in the hidden layer, as we see in the picture below, and the same options as the previous networks, i.e. 3 inputs and 2 outputs.
For the learning parameters we set a maximum error of 0.01, a learning rate of 0.2 and a momentum of 0.7.
Then we click "Train" and wait.
The total net error slowly descends, with high oscillation, and stops when it reaches a level lower than the given maximum error (0.01), at iteration 69.
6.10 Step Testing the Neural Network
As explained earlier, the total mean square error measures the average of the squares of the individual errors, and the model with the smallest mean square error is generally interpreted as best explaining the variability in the observations. The test showed that the total mean square error is 0.16044272735952059.
Now we need to examine all the individual errors for every single instance and check if there are any extreme values. When you have a large data set, individual testing requires a lot of time. Instead of testing all 306 observations, we will randomly choose 5 observations to subject to individual testing.
In the introduction we mentioned that the result can belong to one of two groups: if the patient survived 5 years or longer the desired output is 0, 1 and if the patient died within 5 years the desired output is 1, 0. After testing, it would be ideal if the obtained output values were the same as the desired output values. As with other statistical methods, classification using neural networks involves errors that arise during the approximation.
Values of inputs, outputs and individual errors for 5 randomly selected observations are shown in the table below:
| observation | age of patient (input) | patient's year of operation (input) | number of positive axillary nodes detected (input) | patient survived 5 years or longer (output) | patient died within 5 years (output) | patient survived (individual error) | patient died (individual error) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0.6981 | 0.7273 | 0 | 0.0058 | 0.9942 | 0.0058 | -0.0058 |
| 2. | 0.7547 | 0 | 0 | 0.4528 | 0.5472 | 0.4528 | -0.4528 |
| 3. | 0.8113 | 0.3636 | 0 | 0.4528 | 0.5472 | 0.4528 | -0.4528 |
| 4. | 0.717 | 0.8182 | 0 | 0.0339 | 0.9661 | 0.0339 | -0.0339 |
| 5. | 0.8679 | 0.8182 | 0 | 0.0018 | 0.9982 | 0.0018 | -0.0018 |
The network guessed all of them right. We can conclude that this network has a good ability to generalize, and the training of this network has been validated.
Training attempt 10.1
Now we will try some variations. Let's set learning rate to 0.4, while momentum remains the same: 0.7. These are the results:
As we can see, the training is not complete because the error is too big. The number of iterations is also huge.
Training attempt 10.2
We can try to increase learning rate to, let's say, 0.6, which could lead to faster learning. This is the result:
As we can see, the number of iterations dropped.
In the following table we show the results of all the trainings done using this neural network. First, we change the values of the learning rate from 0.0 to 1.0 while keeping the momentum fixed (0.7). We keep this value for the momentum fixed because it has previously given the best result. Then we keep the value of the learning rate fixed - the one that has given the best result, which is 0.2 - and change the values of the momentum.
| Training attempt | Number of hidden neurons | Number of hidden layers | Training set | Maximum error | Learning rate | Momentum | Total net error | Number of iterations | Total mean square error | 5 random inputs test - number of correct guesses | Network trained |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 6 | 1 | Full | 0.01 | 0.2 | 0.7 | 0.00865 | 69 | 0.16044 | 5/5 | yes |
| 10.1 | 6 | 1 | Full | 0.01 | 0.4 | 0.7 | 0.17375 | 61143 | - | - | no |
| 10.2 | 6 | 1 | Full | 0.01 | 0.6 | 0.7 | 0.17693 | 15696 | - | - | no |
| 10.3 | 6 | 1 | Full | 0.01 | 0.0 | 0.7 | 0.09439 | 671428 | - | - | no |
| 10.4 | 6 | 1 | Full | 0.01 | 0.1 | 0.7 | 0.15933 | 191993 | - | - | no |
| 10.5 | 6 | 1 | Full | 0.01 | 0.3 | 0.7 | 0.17693 | 16596 | - | - | no |
| 10.6 | 6 | 1 | Full | 0.01 | 0.5 | 0.7 | 0.18518 | 4125 | - | - | no |
| 10.7 | 6 | 1 | Full | 0.01 | 0.7 | 0.7 | 0.19316 | 6568 | - | - | no |
| 10.8 | 6 | 1 | Full | 0.01 | 0.8 | 0.7 | 0.19686 | 6960 | - | - | no |
| 10.9 | 6 | 1 | Full | 0.01 | 0.9 | 0.7 | 0.19431 | 188454 | - | - | no |
| 10.10 | 6 | 1 | Full | 0.01 | 1.0 | 0.7 | 0.19693 | 48296 | - | - | no |
| 10.11 | 6 | 1 | Full | 0.01 | 0.2 | 0.0 | 0.06984 | 320496 | - | - | no |
| 10.12 | 6 | 1 | Full | 0.01 | 0.2 | 0.1 | 0.06144 | 163163 | - | - | no |
| 10.13 | 6 | 1 | Full | 0.01 | 0.2 | 0.2 | 0.00711 | 79 | 0.16942 | 5/5 | yes |
| 10.14 | 6 | 1 | Full | 0.01 | 0.2 | 0.3 | 0.13897 | 121350 | - | - | no |
| 10.15 | 6 | 1 | Full | 0.01 | 0.2 | 0.4 | 0.16768 | 147553 | - | - | no |
| 10.16 | 6 | 1 | Full | 0.01 | 0.2 | 0.5 | 0.15256 | 32627 | - | - | no |
| 10.17 | 6 | 1 | Full | 0.01 | 0.2 | 0.6 | 0.16898 | 28604 | - | - | no |
| 10.18 | 6 | 1 | Full | 0.01 | 0.2 | 0.8 | 0.17167 | 16112 | - | - | no |
| 10.19 | 6 | 1 | Full | 0.01 | 0.2 | 0.9 | 0.00846 | 57 | 0.22436 | 5/5 | yes |
| 10.20 | 6 | 1 | Full | 0.01 | 0.2 | 1.0 | 0.23856 | 755 | - | - | no |
We see from the table that training attempts 10.13 and 10.19 were successful (training attempt 10 was presented before). We tested them both on 5 random inputs each, and the network guessed all of them right in both cases. Here are the graphic presentations:
Training attempt 10.13:
Five different inputs for training attempt 10.13:
| observation | age of patient | patient's year of operation | number of positive axillary nodes detected | patient survived 5 years or longer | patient died within 5 years | patient survived - obtained output | patient died - obtained output |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0.47169 | 0 | 0.01923 | 0 | 1 | 0.11496 | 0.89144 |
| 2. | 0.49057 | 0.72727 | 0.03846 | 0 | 1 | 0.11496 | 0.89145 |
| 3. | 0.50943 | 0.54545 | 0.17308 | 0 | 1 | 0.40814 | 0.58609 |
| 4. | 0.71698 | 0.90909 | 0 | 0 | 1 | 0.11496 | 0.89144 |
| 5. | 0.84901 | 0.36364 | 0.19231 | 0 | 1 | 0.40814 | 0.58609 |
Graphics for training attempt 10.19:
Five random inputs for training attempt 10.19:
| observation | age of patient | patient's year of operation | number of positive axillary nodes detected | patient survived 5 years or longer | patient died within 5 years | patient survived - obtained output | patient died - obtained output |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0.37736 | 0.54545 | 0 | 0 | 1 | 0.00445 | 0.99552 |
| 2. | 0.41509 | 1 | 0 | 0 | 1 | 0.00445 | 0.99552 |
| 3. | 0.50943 | 0.54545 | 0.17308 | 0 | 1 | 0.13442 | 0.89227 |
| 4. | 0.67924 | 0 | 0 | 0 | 1 | 0.00445 | 0.99552 |
| 5. | 0.88679 | 0.63636 | 0.05769 | 0 | 1 | 0.13442 | 0.89227 |
Training attempt 11
4.11 Step Creating a neural network
We now decide for a neural network that contains 8 hidden neurons in one hidden layer. Again, we type in the standard number of inputs and outputs, check 'Use Bias Neurons', choose a Sigmoid Transfer function, and select 'Backpropagation with Momentum' as the Learning rule.
In this case, we chose eight hidden neurons.
Graphical representation of neural network
5.11 Step Train the network
The neural network has been created. We will train this one with a training set we created before, namely with 80% of the sample. We select 'HabermanSurvival80', click 'Train' and a new window appears, asking us to fill in the parameters. This time, since there are more neurons in the hidden layer, we can set the maximum error to 0.01. We do not limit the maximum number of iterations. As for the learning parameters, the learning rate will be 0.2 and the momentum 0.3. After we click 'Train', the iteration process starts. The total net error does not reach the maximum error; the training stops at iteration 47696 with an error of 0.2845048425.
Training attempt 11.1
Since the learning rate determines how fast a neural network learns, the smaller its value, the more time it takes the network to learn. So now we can increase the learning rate to 0.6 and keep the momentum as it is, that is 0.3:
As we have previously stated, a higher learning rate speeds up the process of learning, which can be seen here - the number of iterations dropped from 47696 to only 316. Still, the total net error is very high, which means the training is not complete.
Training attempt 11.2
If we now keep the same value of 0.6 for the learning rate and increase the momentum to 0.5, we see that the network is completely trained after 20 iterations and the total net error is smaller than the max error. We can test it.
6.11 Step Testing the network
Values of inputs, outputs and individual errors for 5 randomly selected observations are shown in the table below:
| observation | age of patient | patient's year of operation | number of positive axillary nodes detected | patient survived 5 years or longer | patient died within 5 years | patient survived - obtained output | patient died - obtained output |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. | 0.37736 | 0.54545 | 0 | 0 | 1 | 0.23923 | 0.76102 |
| 2. | 0.41509 | 1 | 0 | 0 | 1 | 0.37159 | 0.62857 |
| 3. | 0.50943 | 0.54545 | 0.17308 | 0 | 1 | 0.08768 | 0.91255 |
| 4. | 0.67924 | 0 | 0 | 0 | 1 | 0.99148 | 0.00855 |
| 5. | 0.88679 | 0.63636 | 0.05769 | 0 | 1 | 0.08757 | 0.91265 |
We see from the table that the network guessed 4 out of 5 instances correctly.
Below is a table that summarizes this experiment. The two best solutions for the problem are shown in bold.
| Training attempt | Number of hidden neurons | Number of hidden layers | Training set | Maximum error | Learning rate | Momentum | Total mean square error | Number of iterations | 5 random inputs test - number of correct guesses | Network trained |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 1 | 80% of full data set | 0.02 | 0.3 | 0.4 | 0.29702 | 7 | 4/5 | yes |
| 1.2 | 2 | 1 | 80% of full data set | 0.02 | 0.3 | 0.5 | - | 13800 | - | no |
| 1.3 | 2 | 1 | 80% of full data set | 0.02 | 0.4 | 0.5 | - | 568412 | - | no |
| 2 | 2 | 1 | full | 0.01 | 0.2 | 0.7 | - | 51 | - | no |
| 3 | 2 | 1 | full | 0.01 | 0.3 | 0.7 | - | 2876 | - | no |
| **4** | 3 | 1 | full | 0.01 | 0.2 | 0.7 | 0.16207 | 23 | 5/5 | yes |
| 4.1 | 3 | 1 | full | 0.01 | 0.3 | 0.6 | - | 26138 | - | no |
| 4.2 | 3 | 1 | full | 0.01 | 0.3 | 0.5 | - | 37758 | - | no |
| 4.3 | 3 | 1 | full | 0.01 | 0.3 | 0.2 | - | 19988 | - | no |
| 4.4 | 3 | 1 | full | 0.01 | 0.5 | 0.7 | 0.18931 | 165 | - | yes |
| 5 | 3 | 1 | full | 0.01 | 0.3 | 0.7 | - | 7109 | - | no |
| 6 | 4 | 1 | full | 0.01 | 0.5 | 0.7 | 0.18417 | 117 | - | yes |
| 6.1 | 4 | 1 | full | 0.01 | 0.5 | 0.6 | - | 121094 | - | no |
| 6.2 | 4 | 1 | full | 0.01 | 0.5 | 0.4 | - | 623369 | - | no |
| 6.3 | 4 | 1 | full | 0.01 | 0.4 | 0.6 | - | 10456 | - | no |
| 7 | 4 | 1 | full | 0.01 | 0.2 | 0.7 | - | 42300 | - | no |
| 8 | 4 | 1 | full | 0.01 | 0.3 | 0.7 | - | 20009 | - | no |
| 9 | 4 | 1 | full | 0.01 | 0.7 | 0.7 | - | 9535 | - | no |
| **10** | 6 | 1 | full | 0.01 | 0.2 | 0.7 | 0.16044 | 69 | 5/5 | yes |
| 10.1 | 6 | 1 | full | 0.01 | 0.4 | 0.7 | - | 61143 | - | no |
| 10.2 | 6 | 1 | full | 0.01 | 0.6 | 0.7 | - | 25551 | - | no |
| 10.3 | 6 | 1 | full | 0.01 | 0.0 | 0.7 | - | 671428 | - | no |
| 10.4 | 6 | 1 | full | 0.01 | 0.1 | 0.7 | - | 191993 | - | no |
| 10.5 | 6 | 1 | full | 0.01 | 0.3 | 0.7 | - | 16596 | - | no |
| 10.6 | 6 | 1 | full | 0.01 | 0.5 | 0.7 | - | 4125 | - | no |
| 10.7 | 6 | 1 | full | 0.01 | 0.7 | 0.7 | - | 6568 | - | no |
| 10.8 | 6 | 1 | full | 0.01 | 0.8 | 0.7 | - | 6960 | - | no |
| 10.9 | 6 | 1 | full | 0.01 | 0.9 | 0.7 | - | 188454 | - | no |
| 10.10 | 6 | 1 | full | 0.01 | 1.0 | 0.7 | - | 48296 | - | no |
| 10.11 | 6 | 1 | full | 0.01 | 0.2 | 0.0 | - | 320496 | - | no |
| 10.12 | 6 | 1 | full | 0.01 | 0.2 | 0.1 | - | 163163 | - | no |
| 10.13 | 6 | 1 | full | 0.01 | 0.2 | 0.2 | 0.16942 | 79 | 5/5 | yes |
| 10.14 | 6 | 1 | full | 0.01 | 0.2 | 0.3 | - | 121350 | - | no |
| 10.15 | 6 | 1 | full | 0.01 | 0.2 | 0.4 | - | 147553 | - | no |
| 10.16 | 6 | 1 | full | 0.01 | 0.2 | 0.5 | - | 32627 | - | no |
| 10.17 | 6 | 1 | full | 0.01 | 0.2 | 0.6 | - | 28604 | - | no |
| 10.18 | 6 | 1 | full | 0.01 | 0.2 | 0.8 | - | 16112 | - | no |
| 10.19 | 6 | 1 | full | 0.01 | 0.2 | 0.9 | 0.22436 | 57 | 5/5 | yes |
| 10.20 | 6 | 1 | full | 0.01 | 0.2 | 1.0 | - | 755 | - | no |
| 11 | 8 | 1 | 80% of full data set | 0.01 | 0.2 | 0.3 | - | 47696 | - | no |
| 11.1 | 8 | 1 | 80% of full data set | 0.01 | 0.6 | 0.3 | - | 316 | - | no |
| 11.2 | 8 | 1 | 80% of full data set | 0.01 | 0.6 | 0.5 | 0.18884 | 20 | 4/5 | yes |
| 12 | 8 | 1 | 90% of full data set | 0.01 | 0.5 | 0.2 | - | 153 | - | no |
| 13 | 10 | 1 | full | 0.01 | 0.2 | 0.7 | - | 538 | - | no |
We also tried to train networks with more than 10 hidden neurons, but the results were bad - those networks would not train.
Advanced Training Techniques
We want to check the network's performance when the training is complete. A learning neural network is expected to extract rules from a finite set of examples. It is often the case that the neural network memorizes the training data well, but fails to generate correct output for some of the new test data. Therefore, it is desirable to come up with some form of regularization.
One form of regularization is to split the training set into a new training set and a validation set. After each step through the new training set, the neural network is evaluated on the validation set, and the network with the best performance on the validation set is then used for actual testing. The new training set would consist of, for example, 80% - 90% of the original training set, and the remaining 10% - 20% would form the validation set. The validation error rate is computed periodically during training, and training is stopped when the validation error rate starts to go up. However, the validation error is not a good estimate of the generalization error if the initial set consists of a relatively small number of instances. Our initial set consists of only 306 instances, so the validation set would contain only a few dozen instances, which is an insufficient number to perform validation. In this case, instead of validation, we will estimate the generalization error directly.
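A minimal sketch of such a split in code (Neuroph 2.9-style API assumed; the split is done manually by shuffling the rows, and the 80/20 percentages are only an example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.neuroph.core.data.DataSet;
import org.neuroph.core.data.DataSetRow;

// Sketch: split the original set into a new training set (80%) and a validation set (20%)
// by shuffling the rows. Assumes a Neuroph 2.9-style API and a hypothetical file name.
public class SplitDataSet {
    public static void main(String[] args) {
        DataSet fullSet = DataSet.createFromFile("habermans_data_normalized.csv", 3, 2, ",");

        List<DataSetRow> rows = new ArrayList<>(fullSet.getRows());
        Collections.shuffle(rows); // random selection of instances
        int trainCount = (int) (rows.size() * 0.8);

        DataSet trainingSet = new DataSet(3, 2);
        DataSet validationSet = new DataSet(3, 2);
        for (int i = 0; i < rows.size(); i++) {
            (i < trainCount ? trainingSet : validationSet).addRow(rows.get(i));
        }

        System.out.println("Training rows: " + trainingSet.getRows().size()
                + ", validation rows: " + validationSet.getRows().size());
    }
}
```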
One way to get an appropriate estimate of the generalization error is to run the neural network on a test set of data that is not used at all during the training process. The generalization error is usually defined as the expected value of the square of the difference between the learned function and the exact target.
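In symbols, if f is the exact target function and f̂ is the function learned by the network, the generalization error can be written as
E_gen = E[ (f̂(x) - f(x))² ]
i.e. the expected squared difference over new, unseen inputs x (this is the standard definition restated from the sentence above).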
In the following examples we will check the generalization error: from example to example we will increase the number of instances in the training set used for training, and decrease the number of instances in the sets used for testing.
Training attempt 14
3.14 Step Create a Training Set
We will randomly choose 90% of the instances of the data set for training and the remaining 10% for testing. The first set will be called HebermanSurvival90, and the second HebermanSurvival10.
5.14 Step Train the network
Unlike the previous trainings, now there is no need to create a new neural network. The advanced training techniques consist of examining the performance of existing architectures using new training and test sets of data. We found satisfactory results using the architecture NewNeuralNetwork4, so until the end of this article we will use this architecture, along with the training parameters that previously brought us the desired results. But before opening the existing architecture, we create the new training sets: the first one is named HebermanSurvival90 and the second HebermanSurvival10.
Now we open the neural network NewNeuralNetwork4, select the training set HebermanSurvival90 and in the network window press the 'Train' button. The parameters that we now need to set are the same as in the previous training attempt: the maximum error will be 0.01, the learning rate 0.2, and the momentum 0.7. We will not limit the maximum number of iterations, and we will check 'Display error graph', as we want to see how the error changes throughout the iteration sequence. Then we press the 'Train' button again and see what happens.
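The same retraining can be expressed in code; a sketch assuming the Neuroph 2.9-style API, with the network and the 90% subset exported to hypothetical files:

```java
import org.neuroph.core.NeuralNetwork;
import org.neuroph.core.data.DataSet;
import org.neuroph.nnet.learning.MomentumBackpropagation;

// Sketch: reuse the saved NewNeuralNetwork4 (6 hidden neurons) and train it on the
// 90% subset with the same parameters as before (max error 0.01, learning rate 0.2,
// momentum 0.7). Assumes a Neuroph 2.9-style API; the file names are hypothetical.
public class RetrainNetwork {
    public static void main(String[] args) {
        NeuralNetwork neuralNet = NeuralNetwork.createFromFile("NewNeuralNetwork4.nnet");
        DataSet trainingSet90 = DataSet.createFromFile("HebermanSurvival90.csv", 3, 2, ",");

        MomentumBackpropagation learningRule = new MomentumBackpropagation();
        learningRule.setMaxError(0.01);
        learningRule.setLearningRate(0.2);
        learningRule.setMomentum(0.7);
        neuralNet.setLearningRule(learningRule);

        neuralNet.learn(trainingSet90);
    }
}
```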
The error function does not fluctuate much - it moves almost in a straight horizontal line - and the training stops at iteration 18604 without finding an optimal solution.
We train on 90, 80 and 70 percent of the data set and test on the remaining 10, 20 and 30 percent, randomly selected.
We obtain the following table of the results:
| Training attempt | Number of hidden neurons | Number of hidden layers | Training set | Test set | Maximum error | Learning rate | Momentum | Iterations | Total net error (during training) | Total mean square error (during testing) | Network trained |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14 | 6 | 1 | 90% | 10% | 0.01 | 0.2 | 0.7 | 18604 | 0.054564 | 0.161705 | no |
| 15 | 6 | 1 | 80% | 20% | 0.01 | 0.2 | 0.7 | 2380 | 0.05456 | 0.16170 | no |
| 16 | 6 | 1 | 70% | 30% | 0.01 | 0.2 | 0.7 | 668 | 0.15566 | 0.28574 | no |
After the 17th training attempt we concluded that there are some cases that make a big impact on the total mean square error, for example errors around 0.8, 0.5 and 0.4.
In the 18th and 19th attempts we found the biggest errors and the instances that were classified with a big error (out of 24). By a big error we mean that the network classified an instance completely wrong (for example, the output is 1, 0 but it should be 0, 1), and such errors make a huge impact on the total mean square error.
Because all of these trainings failed to bring the error below 0.01, we can say that this network failed to generalize this problem.
Conclusion
The different solutions tested in this experiment have shown that the choice of the number of hidden neurons is crucial to the effectiveness of a neural network. Also, the experiment showed that the success of a neural network is very sensitive to the parameters chosen in the training process: the learning rate must not be too high, and the maximum error must not be too low. The results have also shown that the total mean square error does not directly reflect the success of a network. In the end, after including only 10% of instances in the training set, we learned that even that number can be sufficient to make a good training set and a reasonably trained neural network.
DOWNLOAD
Data set used in this tutorial
The prepared data set
Neuroph projects
The samples used for advanced techniques
See also:
Multi Layer Perceptron Tutorial