Classification of animal species using neural network

Classification is one of the most frequently encountered decision making tasks of human activity. A classification problem occurs when an object needs to be assigned into a predefined group or class based on a number of observed attributes related to that object. Because of this the aim of cluster analysis is to classify the objects into clusters, especially in such a way that two objects of the same cluster are more similar than the objects of other clusters. The objects can be of various characteristics. It is possible to cluster animals, plants, text documents, economic data etc. In addition to classification by neural networks, there are other statistical methods dealing with the problem of classification, such as discriminant analysis. One major limitation of the statistical models is that they work well only when the underlying assumptions are satisfied. On the other hand advantage of neural networks lies in the following aspects. First, they can adjust themselves to the data without any explicit specification of functional or distributional form for the underling model, because they are data driven self-adaptive methods. Second, neural networks are nonlinear models, which makes them flexible in modeling real world complex relationships. Finally, neural networks can approximate any function with arbitrary accuracy.

The purpose of this experiment is to study the feasibility of classification animal species using neural networks. An animal class is made up of animal that are all alike in important ways. So we need to train a neural network to make it able to predict which species belong to a particular group. Once we have decided on a problem to solve using neural networks, we will need to gather data for training purposes. The training data set includes a number of cases, each containing values for a range of input and output variables. The data set that we use in this experiment can be found at http://archive.ics.uci.edu/ml/datasets.html under the category classification. In this category there are many sets of data but for the purposes of this experiment we will use the data set named Zoo. This set of data was published by Richard Forsyth (date donated: 1990-05-15).

This database includes 101 cases. Each case is the name of animal. It was found that each of these animals belonged to one of seven classes.

This variable, named type, represents the output variable. Except the output variable, there were 17 input variables for each animal species. Information of input variables:

Each variable is type of Boolean, except variable animal name which is nominal variable and variable legs is a numeric variable (set of values: {0, 2, 4, 6, 8}).

Handling non-numeric data, such as Boolean = {true, false}, is more difficult. However, nominal-valued variables can be represented numerically. Value true will be replaced with value 1, and value false will be replaced with value 0. We wil not use variable animal name in experiment, because this variable is unique for each case.

Once the most appropriate raw input data has been selected, it must be preprocessed; otherwise, the neural network will not produce accurate forecasts.

Transformation and normalization are two widely used preprocessing methods. Transformation involves manipulating raw data inputs to create a single input to a network, while normalization is a transformation performed on a single data input to distribute the data evenly and scale it into an acceptable range for the network. Acceptable rang of value, in Neuroph studio, varies in the range between zero and one. In our experiment we will use normalization, as preprocessing method, to normalized variable legs, because this variable have set of values 0, 2, 4, 6, 8. Set of values of other input variables are zero or one, so we will not apply this method over them. However, we will have to normalize the values of output variable type, because this variable have set of values between 1 and 7.

In order to train a neural network to predict the class of animal species using this data set, there are procedure that has to be made.

Linear scaling of data is one of the methods of data normalization. Linear scaling requires that a minimum and maximum values associated with the facts for a single data input be found. Let's call these values Dmin and Dmax, respectively. The input range required for the network must also be determined. Let's assume that the input range is from Imin to Dmax. The formula for transforming each data value D to an input value I is:

Our desired range varies in the interval between zero and one, so that the constant value of Imin is 0 and the constant value of Imax is 1. Dmin and Dmax must be computed on an input-by-input basis. The original data set indicate that the variable legs is associated with a set of values {0, 2, 4, 6, 8}, so the constant value of Dmin is 0 and the constant value of Dmax is 8. This method of normalization will scale input data into the appropriate range.

Variable type is the output value and it represents the seven different classes of animals. We will not use previous formula to normalize this variable. In this case, it is suitable to represent seven of the various classes as seven output variables. If an animal belongs to the first class, output values, for that animal, will be respectively: 0, 0, 0, 0, 0, 0, 1. If an animal belongs to the second class, output values, for that animal, will be respectively: 0, 0, 0, 0, 0, 1, 0 and so on for all seven groups.

After Normalization of original data, data should be retained as .txt file. This file can be saved under the name zoo.normalized.data.txt and it will be essential for creating training set.

A new project is created and it will appear in the 'Projects' window, in the top left corner of Neuroph Studio.

To teach the neural network we need training data set. The training data set consists of input signals assigned with corresponding target (desired output). The neural network is then trained using one of the supervised learning algorithms, which uses the data to adjust the network's weights and thresholds so as to minimize the error in its predictions on the training set. If the network is properly trained, it has then learned to model the (unknown) function that relates the input variables to the output variables, and can subsequently be used to make predictions where the output is not known.

We are now ready to create new training set. To create training set do the foolowing:

Enter training set name. Select the type of supervised. In general, if you use a neural network, you will not know the exact nature of the relationship between inputs and outputs – if you knew the relationship, you would model it directly. The other key feature of neural networks is that they learn input/output relationship through training. There are two types of training used in neural networks, with different types of networks using different types of training. These are supervised and unsupervised training, of which supervised is the most common. In supervised learning, the network user assembles a set of training data. The training data contains examples of inputs together with the corresponding outputs, and the network learns to infer the relationship between the two. For an unsupervised learning rule, the training set consist of input training patterns only. Our, normalized, data set, that we create above, consists input and output values. Therefore we choose supervised learning. In field Number of inputs enter 16 and in field number of outputs enter 7 and click next:

Then you can create set in two ways. You can either create training set by entering elements as input and desired output values of neurons in input and output label, or you can create training set by choosing an option load file. The first method of data entry is time consuming, and there is also a risk to make a mistake when entering data. Therefore, choose second way and click load from file.

Click on Choose File and find file named zoo.normalized.data.txt. Then select tab as values separator. In our case values have been separated with tab. In some other case values of data set can be separated on the other way. When you finish this, click on Load.

A new window will appear and table to the window is our training set. We can see that this table has a total of 23 columns which is fine because the 16 columns represents input values and other 7 columns represents output values. We can also see that all data are in the certain range, range between 0 and 1. Click Finish and new training set will appear in the Projects window.

Now we need to create neural network. In this experiment we will analyze several architecture. Each neural network which we create will be type of Multi Layer Perceptron and each will differ from one another according to parameters of Multi Layer Perceptron.

This is perhaps the most popular network architecture in use today: the units each perform a biased weighted sum of their inputs and pass this activation level through a transfer function to produce their output, and the units are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers, and the number of units in each layer, determining the function complexity.

Select desired project from Project drop-down menu, Neuroph as category, Neural Network file type and click next.

In new Multi Layer Perceptron dialog enter number of neurons. The number of input and output units is defined by the problem, so you need to enter 16 as number of input neurons and 7 as number of output neurons.

The number of hidden units to use is far from clear. If too many hidden neurons are used, the network will be unable to model complex data, resulting in a poor fit. If too few hidden neurons are used, then training will become excessively long and the network may over fit.

How about the number of hidden layers? For most problems, one hidden layer is normally sufficient. Therefore, we will choose one hidden layer. The goal is try to quickly find the smallest network that converges and then refine the answer by working back from there. Because of that, we will start with 2 hidden neurons and if the network fails to converge after reasonable period, we will restart training up to ten times, thus ensuring that it has not fallen into local minimum. If the network still fails to converge we will add another hidden neuron and repeat procedure.

Further, we check option Use Bias Neuron. Bias neurons are added to neural networks to help them learn patterns. A bias neuron is nothing more than a neuron that has a constant output of 1. Because the bias neurons have a constant output of one they are not connected to the previous layer. The value of 1, which is called the bias activation, can be set to values other than 1. However, 1 is the most common bias activation.

If your values in the data set are in the interval between -1 and 1, choose Tanh transfer function. In our data set, values are in the interval between 0 and 1, so we used Sigmoid transfer function.

As learning rule choose Backpropagation With Momentum. Backpropagation With Momentum algorithm shows a much higher rate of convergence than the Backpropagation algorithm. Choose Dynamic Backpropagation algorithm if you have to training dynamic neural network, which contain both feedforward and feedback connections between the neural layers.

Appearance of the neural network, we just created, you can see in the figure bellow. Just select Graph View.

Figure shows the input, the output and hidden neurons and how they are connected with each other. Except for two neurons with activation level 1 (bias activation), all other neurons have an activation level 0. These two neurons represent bias neurons, as we explained above.

Now that we have created a neural network it is time to do some training. To start network training procedure, in network window select training set, named TS1,and click Train button. In Set Learning parameters dialog use default learning parameters. When the Total Net Error value drops below max error, which is by default 0.01, the training is complete. If the error would be smaller we would get a better approximation.

Next thing we should do is determine the values of learning parameters, learning rate and momentum.

Learning rate is one of the parameters which governs how fast a neural network learns and how effective the training is. Let us assume that the weight of some synapse in the partially trained network is 0.2. When the network is introduced with a new training sample, the training algorithm demands the synapse to change its weight to 0.7 (say) so that it can learn the new sample appropriately. If we update the weight straightaway, the neural network will definitely learn the new sample, but it tends to forget all the samples it had learnt previously. This is because the current weight (0.2) is a result of all the learning that it has undergone so far. So we do not directly change the weight to 0.7. Instead, we increase it by a fraction (say 25%) of the required change. So, the weight of the synapse gets changed to 0.3 and we move on to the next training sample. This factor (0.25 in this case) is called Learning Rate. Proceeding this way, all the training samples are trained in some random order. Learning rate is a value ranging from zero to unity. Choosing a value very close to zero, requires a large number of training cycles. This makes the training process extremely slow. On the other hand, if the learning rate is very large, the weights diverge and the objective error function heavily oscillates and the network reaches a state where no useful training takes place.

The momentum parameter is used to prevent the system from converging to a local minimum or saddle point. A high momentum parameter can also help to increase the speed of convergence of the system. However, setting the momentum parameter too high can create a risk of overshooting the minimum, which can cause the system to become unstable. A momentum coefficient that is too low cannot reliably avoid local minima, and can also slow down the training of the system.

During the testing we unsuccessfully trained the neural network. The summary of the results are shown in the Table 1.

Network fails to converge after reasonable period and training is not complete after 19540 iterations. Training is not completed so we can’t test the network.

So let we try something else. We will update the weight of learning rate and increase it by 25%. In network window click Randomize button and then click Train button. Value of 0.2 in learning rate label replace with a new value 0.3 and click Train button.

During the testing same network with another parameters we unsuccessfully trained the neural network. Results of this training, together with the previous case, are shown in the Table 1.

Increasing the value of learning rate we conclude that the objective error function heavily oscillates and the network reaches a state where no useful training takes place.

First 4000 iteration leads to large shifts. At 12000 iterations total net error has a tendency to fall, after that again total net error begin to increase.

In the table below for the next three sessions we will present the results of other trainings for the first architecture. For other trainings is not given graphic.

Based on data from Table 1 can be seen that regardless of the parameters of training error do not falls below a specified level, even if we train the network through a different number of iterations.

This all may be due to the small number of hidden neurons. In the following solution we will increase the number of hidden neurons.

In this section we put more emphasis on studying the dependence between the number of iterations, on the one hand, and the learning rate and momentum, on the other hand. We will use the same training set, TS1, as above. Before proceeding to examine dependence, we first create a new neural network. I called it MultiLayerPerceptron3. It is necessary to select the same options as in the previous architecture just as we label the hidden neurons, instead of the two (hidden neurons) enter the number four.

First training course, of second architecture, we will start with extremely low values of learning rate and momentum. First click on button Train. In Set Learning parameters dialog, field set Stopping criteria enter 0.01 as max error. In the same field set check on option Limit max iterations and in text field next to this label enter number 2000. We limit the number of iterations to 2000 in order to graphically display, the training of this network, was clearer. In field set Learning parameters, enter 0.001 for Learning rate and 0.05 for Momentum. After entering this values click on button Train.

During the testing we unsuccessfully trained the neural network named MultiLayerPerceptron3. The summary of the results are shown in the Table 2.

From the graphics can be seen from iteration to iteration there are no large shifts in the prediction. More accurate in predicting, fluctuations are very small and the values are around 0.1. Reason for such a small fluctuation is that the learning rate is very close to zero. Also because such a small coefficient, of the learning rate, neural network has no the ability to learn quickly. On the other hand small value of momentum also slow down the training of the system.

Now in this training, of the second architecture, we will try to rely on extremely high values of learning rate and momentum. Compared to previous training, just replace the values of learning rate and momentum. For learning rate enter 0.9 and for momentum also enter 0.9. Leave the other options the same as in the previous training.

During the testing we unsuccessfully trained the neural network named MultiLayerPerceptron3. The summary of the results are shown in the Table 2.

That picture says a thousand words, see the chart below and you will see a clear distinction between small and large values of the parameters of training. We set the momentum parameter too high and we have created a risk of overshooting the minimum, which caused the system to become unstable. On the other hand, the learning rate is very large, the weights diverge and the objective error function heavily oscillates and the network reaches a state where no useful training takes place.

Since we are in the previous two sessions using two extreme values, the parameters of training, in this training we will take the golden mean of this parameters. If the golden mean, and does not lead to the desired approximation of the total net error, we will move to a new architecture with more hidden neurons. Compared to previous two trainings, just replace the values of learning rate and momentum. For learning rate enter 0.5 and for momentum also enter 0.5. Leave the other options the same as in the previous training. After that, click button Train and see what will happen.

The golden medium seems to be not so golden. During the testing we unsuccessfully trained the neural network named MultiLayerPerceptron3. The summary of the results are shown in the Table 2.

But the following useful conclusion can be drawn from this training. First, we see that the architecture of 4 hidden neurons is not appropriate for this training set, because for continuing the training of the neural network we do not get the desired approximation of max error. Error is still higher than desired level.

Second, the oscillations are less than second training (which was expected because the parameters of training is less than in the previous case), but on the other side neural network has no the ability to learn quickly and the training of the system is slow (just like in first training).

In the table below for the previous three sessions we will present the results of all trainings for the second architecture.

Create a new neural network with six hidden neurons and select the same options as in the previous architectures. I called it MultiLayerPerceptron4. We will use the same training set, TS1, as above.

We will start with the value 0.6 for learning rate and momentum of 0.4. After entering values, for learning rate and momentum, click button Train.

During the testing we successfully trained the neural network named MultiLayerPerceptron4. The summary of the results are shown at the final table at the end of this article.

The total net error slowly descends up until the iteration 71, when it finally stops when reaches a level lower than a given (0.01).

From the graphics can be seen that in the first five iterations total net error is extremely high. It varies in the range of 0.2 to 0.4. then in the next 25 iterations shows a tendency to fall slower than in the first 5 iterations. After 30 iterations error, from step to step, continue with the fall and in the 71 iteration reaches the desired level.

As we mentioned above, we could not test the network after training the previous architecture, because the total error is not reached a satisfactory level. Now we are now able to test network to make sure that it is trained properly. Go to network window and click button Test. New window will show up with a test result.

Total Mean Square Error measures the average of the squares of the "errors". The error is the amount by which the value implied by the estimator differs from the quantity to be estimated. An mean square error of zero, meaning that the estimator predicts observations of the parameter with perfect accuracy, is the ideal, but is practically never possible. The unbiased model with the smallest mean square error is generally interpreted as best explaining the variability in the observations. The test showed that total mean square is 0.002678718637271287. The goal of experimental design is to construct experiments in such a way that when the observations are analyzed, the mean square error is close to zero relative to the magnitude of at least one of the estimated treatment effects.

Now we need to examine all the individual errors for every single instance and check if there are any extreme values. When you have a large data set, individual testing requires a lot of time. Instead of testing 101 observations we will random choose 5 observations which will be subjected to individual testing. Three following table will show the value of input, output and errors in 5 randomly selected observations. These values are taken from the window Test Results.

In the introduction we mentioned that any animal belonging to one of a seven given groups. Normalization of the initial data we have made so that if an animal belong to the first group, the output values for that instance will be 0, 0, 0, 0, 0, 0, 1. If belong to the second group, the output values for that instance will be 0, 0, 0, 0, 0, 1, 0 and so on. Columns of the Table 3.2. show the instances belonging to the appropriate group after appropriate training the neural network. In the original data set is shown that bear belong to a first group. After completion of testing would be ideal if the value of output after the test were the same as the output values before testing. As with other statistical methods, and classification using neural networks include errors that arise during the approximation. Individual error between the original and the assessed values are shown in Table 3.3.

For bear and deer we can say that we are wrong 0.48%, 0.32% respectively if we classify them in the first group. You will agree with me that the error is very small, less than desired 1%. But what happens to the remaining three, randomly selected, animals: moth, piranha and swan? We are wrong 1.35% if classify moth in sixth group, 1.55% if we classify piranha in fourth group and 1.58% if classify swan in second group. Although these errors are still small they are not less than desired 1%.

At the beginning we said that the goal is try to quickly find the smallest network that converges and then refine the answer by working back from there. since we find the smallest neural network do the following:

After only one iteration total net error is 0.0094. Total mean square error is 0.00255727454351991. But what is most interesting are the individual values of bear, deer, moth, piranha and swan. We will show them in two tables below.

Output values for bear and deer are now perfect and the output value moth, piranha and swan are significantly improved closer to the ideal value. As the output values of the instances after testing are closer to the baseline values, the approximation error is less.

For all randomly selected animals now we can say that we are wrong less than 1% if we placed them into the starting group.

Recommendation: If you do not get the desired results, continue to gradually increase the training parameters. The neural network will definitely learn the new sample, and it would not forget all the samples it had learnt previously.

Advanced Training Techniques

When the training is complete, you will want to check the network performance. A learning neural network is expected to extract rules from a finite set of examples. It is often the case that the neural network memorizes the training data well, but fails to generate correct output for some of the new test data. Therefore, it is desirable to come up with some form of regularization.

One form of regularization is to split the training set into a new training set and a validation set. After each step through the new training set, the neural network is evaluated on the validation set. The network with the best performance on the validation set is then used for actual testing. Your new training set consisted of the say it for example 80% - 90% of the original training set, and the remaining 10% - 20% would be classified in the validation set. Then you have to compute the validation error rate periodically during training and stop training when the validation error rate starts to go up. However, validation error is not a good estimate of the generalization error, if your initial set consists of a relatively small number of instances. Our initial set, we named it TS1, consists only of 101 instances (animal species). In this case 10% or 20%, of the original training set, consisted of the 10 or 20 instances. This is the insufficient number of instances to perform validation. In this case instead validation we will use a generalization as a form of regularization.

One way to get appropriate estimate of the generalization error is to run the neural network on the test set of data that is not used at all during the training process. The generalization error is usually defined as the expected value of the square of the difference between the learned function and the exact target.

In the following examples we will check the generalization error, such as from the example to the example we will increase the number of instances in the training set, which we use for training, and we will decrease the number of instances in the sets that we used for testing.

Let’s go on. First I will choose random 70% instances of training set which I termed TS1. These 70 (70% of 101 animal species) animal species we will use to create a new training set which we will call TS70. The remaining 31 animal species we will use to create the test set which we will call TS30.

Training set TS70 consists of the following species: antelope, bear, buffalo, calf, cavy, deer, elephant, fruitbat, girl, goat, gorilla, hamster, hare, lion, lynx, mongoose, mole, oryx, platypus, polecat, pony, raccon, reindeer, seal, sealion, vole, wallaby, wolf, crow, dove, flamingo, hawk, kiwi, parakeet, penguin, pheasant, swan, vulture, wren, skua, rhea, pitviper, seasnake, tuatara, bass, carp, catfish, herring, pike, piranha, sole, stingray, tuna, frog, newt, flea, gnat, moth, termite, wasp, clam, crab, crayfish, lobster, slug, starfish, worm, slowworm, tortoise, seawasp

Training set TS30 consists of the following species: aardvark, boar, cheetah, dolphin, giraffe, leopard, mink, opossum, porpoise, puma, pussycat, squirrel, vampire, chicken, duck, gull, lark, ostrich, skimmer, sparrow, chub, dogfish, haddock, seahorse, frog, toad, honeybee, housefly, ladybird, octopus, scorpion

Unlike previous training, now there is no need to create new neural network. Advanced Training Techniques consist in the fact that we examine the performance of existing architectures, using a new training and test set of data. Satisfactory results we found using architecture MultiLayerPerceptron4. By the end of this article we will use not only this architecture, but also the parameters of the training that we used in this architecture previously which brought us desired results. But before you open an existing architecture, create new training sets. First training set name it TS70 and second one name it TS30.

Now open neural network MultiLayerPerceptron4, select training set TS70 and in new network window press button Train. The parameters that we now need to set will be the same as the ones in previous training attempt: the maximum error will be 0.01, the Learning rate 0.7, and the Momentum 0.4. We will not limit the maximum number of iterations, and we will check 'Display error graph', as we want the see how the error changes throughout the iteration sequence. Then press Train button again and see what will happen.

During the testing we again successfully trained the neural network named MultiLayerPerceptron4. The summary of the results are shown in the Table 4.

The total net error slowly descends up until the iteration 53, when it finally stops when reaches a level lower than a given (0.01).

Fewer instances of the same architecture and same parameters of training leading up to the fact that we trained our architecture through a lower number of iterations. Graph has approximately the same shape as in the previous training. No large deviations and for that reason we declare this training successful.

After successful training the neural network, we can test the same to discover wheter the results will be as good as the previous testing.

Unlike previous practice, where we have to train and test neural networks using the same training set, now we will use the second training set, named TS30, to test network in which there are data that a neural network has not been seen.

So go to network window, select training set TS30 and press button Test. You should see something like this.

Total Mean Square Error is 0.01526, which is about 0.005 higher than the desired error or 0.5% higher than the desired error. Looking at the percentage that is not a great deviation. But what about the individual errors? Now that is training set relatively small you can go through each instance and look at its variation or error. See the ninth and twenty-ninth row of the test results. Ninth row refers to animal species and thirtieth row refers to animal species toad. You can see that this instance is characterized by high error. In the first case, the error is 39.74%, while in the second case, the error is 34.06%.

From this the conclusion is drawn that the neural network memorizes the training data well, but fails to generate correct output for some of the new test data. The problem may lie in the fact that we used 31 instances for the test vs. 70 instances that are used to train neural network. So how many data should be used for testing? Some authors believe that the 10% could be a practical choice. We will create four new training sets. More precisely we will make two training set to train and two training set to test the same architecture. Two training sets, which we use to train the network, will consist of 85% and 90% of the initial instances of our training set TS1, and the remaining two training sets, which we use to test the network, will consist of 15% and 10% of the initial instances of our training set TS1. Final results of the advanced training you can see in Table 4. And further we will restricted to a maximum error of 0.01, 0.7 for learning rate and 0.4 for momentum.

Training attempt	Training set	Testing set	Hidden Neurons	Network trained	Iterations	Total Net Error	Total Mean Square Error
11.	TS70	TS30	6	yes	53	0.00985	0.01526
12.	TS85	TS15	6	yes	250	0.009998	0.02003
13.	TS90	TS10	6	yes	119	0.009997	0.01005

After training attempt 12 and testing the network with 15% of the basic sample and after checking every single instance, we noticed that the largest error is 99.92% only in one instance (instance number 12). In other instances errors are range in the desired interval. It is sufficient that only one instance has such a big error and that the Total Mean Square Error reaches value of 0.02003. Because of error 0.9992, network training can not be accepted as correct one. We have not achieved generalization using training set TS85.

After training attempt 13 and testing the network with 10% of the basic sample and after checking every single instance, we noticed that the largest error is 1.3% only in one instance which is 0.3% more than the desired. Approximation error is not greater than 1% for the remaining ten instances.

During this experiment, we created three different architectures, one basic training set and six training sets derived from the basic training set. We normalize the original data set using a linear scaling method. Through six basic steps we explained in detail the creation, training and testing neural networks. If the network architecture using a small number of hidden neurons training will become excessively and the network may over fit no matter what are the values of training parameters. We point out major differences between the Perceptron and MultiLayerPerceptron, as network types. Through the various tests we have demonstrated the sensitivity of neural networks to high and low values of learning parameters. We have shown that the best solution to the problem of classification of animal species, in seven different groups, is architecture with one hidden layer and six hidden neurons. Finally, we explain the importance of generalization and we pointed to the importance of validation as an important form of regularization. In the table below can been seen the overall results of this experiment. Best solution is indicated in green color.

Training attempt 1

Training attempt 2

Training attempt 6

Training attempt 7

Training attempt 8

Training attempt 9

Advanced Training Techniques

Training attempt 11