Automatic Recognition of Tea Diseases Based on Deep Learning

With the rapid development of intelligent agriculture and precision agriculture, computer image processing technology has been widely used to solve various problems in the agricultural field. In particular, convolutional neural networks (CNNs), with their advantages in image classification, have been widely used in the automatic recognition and classification of plant diseases. In this paper, a deep convolutional neural network named LeafNet, capable of recognizing seven types of diseases from tea leaf disease images, was established with an accuracy of up to 90.23%, aiming to provide timely and accurate diagnostic services for remote, mountainous tea plantations in China. For comparative analysis, a traditional machine learning pipeline is also applied, which extracts dense scale-invariant feature transform (DSIFT) descriptors from the image and constructs a bag of visual words (BOVW) model to represent the image based on the DSIFT descriptors. Support vector machine (SVM) and multilayer perceptron (MLP) classifiers were then used to identify tea leaf diseases, with accuracies of 60.91% and 70.94%, respectively.


Introduction
Tea has a long history of cultivation in China, and the tea planting area and yield rank first in the world. According to statistical data, in 2016 17 provinces in China had a total of 2.87 million hectares of tea plantations, and the total output of tea reached 2.4 million tons [1]. The main tea-producing areas in China are mostly distributed in subtropical regions, where the natural environment varies with latitude and topography. The tea tree is a perennial evergreen woody plant that grows in warm and humid environments. However, these conditions are also conducive to the breeding and spread of diseases. In recent years, the tea planting area has increased year by year, and the incidence of tea leaf diseases has risen continuously, which seriously threatens the quality and yield of tea. Because most tea areas in China are located in high mountain areas where infrastructure construction lags behind, outbreaks of tea leaf diseases are often not controlled in a timely and effective manner, resulting in huge economic losses. Therefore, detecting and identifying diseases early in the field is an important task for ensuring the sustainable development of the tea industry.
The diagnosis of plant diseases is usually based on the appearance of the disease. When the leaves of a plant are infected, their appearance changes significantly. Each disease usually produces discernible leaf color and texture symptoms, and plant diseases can be diagnosed based on these characteristics. However, farmers mainly rely on their own experience and senses to diagnose plant diseases, and because of their limited background knowledge, such diagnoses can be ambiguous. Most tea trees in China are planted in large mountainous areas, where field surveys are difficult and inefficient and where transportation and infrastructure are limited, so relying on agricultural experts to diagnose tea leaf diseases is both time-consuming and costly. Moreover, an expert must have experience and knowledge in several disciplines and must understand all the symptoms of each disease and the causes of their diversity. At the same time, because China's agricultural population is relatively large and the number of experts engaged in agricultural services is extremely limited, it is necessary to establish a system that can diagnose tea leaf diseases in a timely and accurate manner.
The current diagnostic methods for plant diseases mainly include microscope identification, molecular biology technology, and spectroscopic technology. The first method is time-consuming and subjective: even experienced plant pathologists may make wrong judgments, leading to inaccurate conclusions. The latter two methods are currently considered more accurate, but their main disadvantages are high labor intensity and the need for specific instruments.
With the rapid development of intelligent agriculture and precision agriculture, machine learning methods and computer image processing technologies have been applied to the identification of plant diseases [2,3], providing a new method for detecting plant diseases that can help farmers and researchers quickly and accurately identify the types of plant diseases. The general approach based on machine learning and computer image processing is first to manually design and extract disease image features. These include global features, such as color features [4], shape features [5], texture features [6], or combinations of two or more of these features [7][8][9][10][11], and local features, obtained using the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), dense scale-invariant feature transform (dense SIFT), and pyramid histograms of visual words (PHOW) [12][13][14]. After the features are extracted, the diseases are identified and classified using different classifiers, such as artificial neural networks [15,16] and support vector machines [17,18]. Because traditional machine learning relies on features extracted manually, the resulting recognition system is not fully automated.
At present, most research on tea using computer vision technology focuses on tea quality detection [19], tea species identification [20], and tea leaf disease information query and management based on expert systems [21]. Because an expert system has limited knowledge and needs to be updated and maintained on a regular basis, it is also of limited use to technicians without a computing background. In some studies, the identification of tea diseases is based on hyperspectral [22] or infrared thermal images [1]. These methods are easy to operate and have high accuracy, but the cost of the instruments makes them unsuitable for widespread promotion.
In recent years, the popularity of the Internet has led to the explosive growth of Internet data, and the technical performance of computers and smartphones has continued to improve. These factors are the main reasons for the widespread attention paid to deep learning. Deep learning refers to the process of learning from sample data through a certain training method to obtain a deep network structure containing multiple levels [23]. Deep learning is a branch of machine learning. Its essence is also a neural network, but with more than one hidden layer, which makes it an extension of artificial neural networks. The neural network is thus the basic building block of deep learning.
The concept of deep learning was first mentioned by Professor Geoffrey Hinton of the University of Toronto in a paper on back-propagation algorithms. The concept of "depth" was used to describe large artificial neural networks. With the introduction of deep learning, more and more researchers have begun to develop large-scale neural network systems. These deep neural network systems can learn features directly from the raw data, can work without human intervention, and can then use what they have learned to learn new things.
The advantage of deep learning is that it does not require manual feature extraction; instead, the features are learned automatically by the network. It can solve nonlinearly separable problems and has strong generalization ability and robustness. Among deep neural networks, the most widely used is the convolutional neural network. Images can be used directly as input data, eliminating the complicated process of feature extraction and data reconstruction in traditional machine learning algorithms. At the same time, the multilayer network structure of the convolutional neural network maintains a high degree of invariance to image translation, scaling, or lighting changes [18]. At present, convolutional neural networks have been applied to the identification and diagnosis of plant diseases [24][25][26].
In recent years, many researchers in the world have used machine learning algorithms to build many disease recognition systems, but because the characteristics of each plant disease are different, the different machine learning methods will have different recognition effects. Hence, based on previous studies, this paper uses deep convolutional neural networks to identify and classify tea leaf diseases. At the same time, the traditional machine learning algorithm is compared with the proposed convolutional neural network, and a recognition system suitable for the tea leaf disease is found through comparative analysis.

Data acquisition
Existing public databases such as ImageNet, PlantVillage, and CIFAR-10 do not contain sufficient tea leaf disease images, and some studies have collected disease photos only in indoor or controlled environments. These factors limit recognition systems intended to identify diseases under natural light conditions, so a new disease data set is constructed in this paper.
Tea leaf disease images were all captured using a Canon PowerShot G12 camera in the natural light environment of tea gardens in Chibi and Yichang within Hubei Province. The images were taken about 20 cm directly above the leaves in autofocus mode at a resolution of 4000 × 3000 pixels. A total of 3810 disease images were collected, covering 7 diseases, and all disease images were identified by plant pathologists. The identification criteria used for the tea leaf diseases were based on previously described identification schemes [27,28]. In order to meet the requirements of the model algorithms and reduce the computational complexity of the network, all disease images are resized to 256 × 256 pixels and 750 × 750 pixels, respectively. Figure 1 shows the types of tea leaf diseases used in this experiment. Data amplification is performed on the diseases with fewer images so that the numbers of images of the seven diseases are balanced. Data amplification improves the generalization ability of the classifier and is more conducive to network training. Three different methods were used to alter the image input and improve classification (Figure 2). A total of 7905 tea leaf disease images were obtained after the amplification treatment (Table 1). The 80/20 ratio of training/test data is the most commonly used ratio in neural network applications. In addition, a 10% subset of the test dataset was used to validate the dataset [29].

Tea leaf disease identification based on BOVW model
Traditional machine learning algorithms have shallow architectures that contain one or two nonlinear transformation layers. They can automatically learn the underlying laws in the data and use the learned rules to make predictions. In the field of computer vision, many such models work by manually designing and extracting the visual features of the image in advance, so that the image content is converted into a quantitatively computable description, which is then processed by the shallow model.

Image visual feature
The extraction and selection of image visual features is an important means of transforming the image content into a quantitatively computable description. Visual features mainly include global features and local features. Global features refer to the overall attributes of the entire image, mainly including color features, texture features, and shape features. These are features that can be directly observed by the eye. Global features are pixel-level shallow features with good stability, real-time performance, and simple, easy-to-implement algorithms. However, their shortcomings are high feature dimensions, a large amount of calculation, and sensitivity to changes in image scale, lighting, and perspective. Local features are features extracted from local areas of the image, including corners, lines, edges, and areas with special attributes. Local features are distinctive and robust to changes in lighting, rotation, perspective, and scale, and they also have low dimensions and are easy to implement.
The scale-invariant feature transform (SIFT) is a local feature descriptor proposed by David G. Lowe in 1999 [30]. The SIFT descriptor is invariant to image rotation, translation, and scaling, and it remains stable under affine transformation, changes in perspective and brightness, and noise. It can also be combined with other algorithms to form new optimized algorithms, thereby increasing the operation speed. The traditional SIFT descriptor mainly extracts stable feature points in the image, which leads to the loss of some information in the image and a long calculation time. In addition, the number of feature points extracted from each image is different, which inevitably leads to feature sets of different sizes. Lazebnik et al. improved the number and distribution of SIFT descriptors to obtain dense SIFT [31]. The main difference between the dense SIFT descriptor and the traditional SIFT descriptor is the sampling method. The SIFT descriptor constructs a scale space to detect and filter feature points. The dense SIFT algorithm instead slides a fixed-size rectangular window over the image from left to right and from top to bottom with a specified step size. The center of the window is used as a key point, and a 16 × 16 pixel image block around the center is divided into 4 × 4 units. Within each unit, the SIFT algorithm is used to calculate the gradient histogram in 8 directions, giving a 4 × 4 × 8 = 128-dimensional feature vector that forms the DSIFT descriptor. The feature points extracted by this method are uniformly distributed and of the same size; they remain stable under changes in illumination and perspective, affine transformation, scaling, and rotation.
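The dense sampling and the 4 × 4 × 8 histogram layout described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the step size of 8 pixels, and the plain central-difference gradients are assumptions made for illustration, and the Gaussian weighting and interpolation of full SIFT are omitted.

```python
import math

def dense_keypoints(width, height, patch=16, step=8):
    """Centers of fixed-size windows scanned left to right, top to bottom."""
    half = patch // 2
    return [(x, y)
            for y in range(half, height - half + 1, step)
            for x in range(half, width - half + 1, step)]

def patch_descriptor(img, cx, cy, patch=16, cells=4, bins=8):
    """Simplified 4 x 4 x 8 = 128-D gradient histogram for one window.

    Assumption: no Gaussian weighting or interpolation, unlike full SIFT.
    """
    half, cell = patch // 2, patch // cells
    h, w = len(img), len(img[0])
    hist = [0.0] * (cells * cells * bins)
    for py in range(patch):
        for px in range(patch):
            x, y = cx - half + px, cy - half + py
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins   # orientation bin
            c = (py // cell) * cells + (px // cell)        # spatial unit index
            hist[c * bins + b] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

# Toy 32 x 32 image whose intensity increases from left to right.
image = [[float(x) for x in range(32)] for _ in range(32)]
keypoints = dense_keypoints(32, 32)
descriptors = [patch_descriptor(image, x, y) for x, y in keypoints]
```

Because every window has the same size and layout, every key point yields a vector of the same length, which is what allows the descriptors to be clustered directly in the BOVW step.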

Bag of visual word-based feature representation
The bag of words (BOW) model, from which the bag of visual words (BOVW) model derives, was originally applied to text classification and retrieval. Its core idea is to treat a text as a collection of words, ignoring word order, grammar, and syntax; the words are treated as discrete and independent of each other. The frequency of each word in the text is counted and represented as a histogram, so that each text is represented as a vector.
Following the successful application of the BOW model in text retrieval, Csurka et al. introduced it to the field of computer vision as the BOVW model [32]. An image is regarded as a document, and the features of the image (usually local features) as the words that make up the image. Unlike a text, an image contains no ready-made words, so independent features must be extracted from the image; these are called visual words. Similar features are regarded as the same visual word. In this way, the image can be described as an unordered set of visual words (local features). Although local features (such as SIFT) can also describe an image directly, each SIFT descriptor is a 128-dimensional vector, and an image contains hundreds or thousands of SIFT descriptors. The amount of calculation is therefore very large, so these vectors are clustered, and each cluster center is used to represent a visual word.
Image classification using the BOVW model mainly includes the following steps:
1. Image feature extraction and description: Local feature vectors of all training set images are obtained through methods such as point-of-interest detection, dense sampling, or random sampling. Commonly used local features include the SIFT descriptor and the SURF descriptor.
2. Construct a visual vocabulary: After obtaining the local feature vectors of all sample images, use the k-means algorithm to cluster the local feature vectors. The k-means algorithm is an unsupervised learning algorithm. It partitions the data into clusters through an iterative process in which the Euclidean distance between each data point and each cluster center is calculated [33]. The smaller the distance, the higher the similarity. k represents the number of clusters, and means refers to the mean of the data in each cluster. If there are k cluster centers (i.e., visual words), then the size of the visual vocabulary is also k. This manuscript selects 1000 visual words, so the size of the visual vocabulary is 1000.
3. Represent images by word frequency: Using the vocabulary as a standard, count the number of occurrences of each visual word in the image. Each image becomes a word frequency vector corresponding to the visual word sequence in the vocabulary, that is, each image is represented by a 1000-dimensional numerical vector.
4. Classify: Select a classifier and use the 1000-dimensional numerical vector generated in the previous step as its input.
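Steps 2 and 3 above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline used in this study: the helper names are hypothetical, the descriptors are made-up 2-dimensional points rather than 128-dimensional DSIFT vectors, and the vocabulary has k = 2 words instead of 1000.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centers):
    """Index of the cluster center closest to vec."""
    return min(range(len(centers)), key=lambda i: sq_dist(vec, centers[i]))

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means: assign each vector to its nearest center, then re-average."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centers)].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the previous center if a cluster becomes empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def bovw_histogram(descriptors, vocabulary):
    """Count how often each visual word is the nearest one to a descriptor."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d, vocabulary)] += 1
    return hist

# Made-up descriptors forming two well-separated groups.
rng = random.Random(1)
train_desc = ([[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
              + [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)])
vocabulary = kmeans(train_desc, k=2)

# One "image" containing 3 descriptors of the first kind and 7 of the second.
image_desc = train_desc[:3] + train_desc[50:57]
hist = bovw_histogram(image_desc, vocabulary)
```

The histogram length equals the vocabulary size regardless of how many descriptors an image contains, which is why every image ends up as a fixed-length vector suitable for the classifier in step 4.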

Support vector machines
Support vector machines (SVMs) were proposed by Cortes and Vapnik in 1995 [34]. The SVM is a learning method based on Vapnik-Chervonenkis (VC) statistical learning theory and the structural risk minimization criterion. It has advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems. The basic idea of the SVM is to map vectors from a low-dimensional space to a high-dimensional space through a nonlinear transformation defined by an inner product. In this high-dimensional space, the optimal classification hyperplane is the one that maximizes the geometric distance between the support vectors and the classification plane. SVMs were initially developed for two-class problems in the linearly separable case; because they require relatively small sample sizes and have a well-founded training rule, they have come into widespread use in image classification and recognition.
With the deepening of research on support vector machines, many scholars have developed toolkits to make them suitable for specific fields. In this manuscript, the linear classifier LIBLINEAR, developed by Professor Chih-Jen Lin of National Taiwan University mainly for processing large-scale data and features, is used [35]. LIBLINEAR can be used in the following three cases: when the number of features is much larger than the number of samples; when both the number of features and the number of samples are large; and when the number of features is much smaller than the number of samples. Because the complexity of a linear classifier is lower than that of a nonlinear classifier, the training time is greatly reduced, and with a large amount of data the performance of linear and nonlinear classifiers is comparable.

Multi-layer perceptron
The perceptron was proposed by Rosenblatt in 1958 [36]. It is an artificial neural network structure and the earliest feed-forward neural network. A single-layer perceptron contains only two layers, namely, the input layer and the output layer. Due to its limited mapping capability, it can only solve linearly separable classification problems. A multi-layer perceptron has one or more hidden layers between the input layer and the output layer and is mainly used for nonlinear classification and regression. Its training algorithm is the same as that of the traditional multilayer neural network, namely back-propagation.
The multi-layer perceptron in this manuscript uses a three-layer structure. Because the extracted features are 1000-dimensional vectors, the input layer contains 1000 nodes, the hidden layer contains 100 nodes, and the output layer contains 7 nodes, one for each type of tea leaf disease.
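To make the size of this 1000-100-7 network concrete, the following sketch counts its parameters and runs one forward pass. Only the layer sizes come from the text; the tanh hidden activation, the Gaussian initialization, and all function names are assumptions for illustration.

```python
import math
import random

LAYER_SIZES = [1000, 100, 7]  # input, hidden, output nodes (from the text)

def count_parameters(sizes):
    """Each layer has fan_in * fan_out weights plus fan_out biases."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def init_layers(sizes, seed=0):
    """Small random weights and zero biases (an assumed initialization)."""
    rng = random.Random(seed)
    return [([[rng.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(x, layers):
    """Propagate one BOVW vector through the network."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]  # assumed hidden activation
    return x  # one raw score per disease class

layers = init_layers(LAYER_SIZES)
scores = forward([0.001] * 1000, layers)
n_params = count_parameters(LAYER_SIZES)  # 100,100 + 707 = 100,807
```

Almost all of the parameters sit between the input and hidden layers, which follows directly from the 1000-dimensional BOVW representation.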

Deep learning network construction
The network architecture designed in this manuscript, named LeafNet, is an improved version of the classic AlexNet model. The total number of parameters (weights and biases) of the classic AlexNet network exceeds 60 million; the parameters of the convolutional layers comprise 3.8% of the total network parameters, and the parameters of the fully connected layers comprise 96.2% of the total. Therefore, LeafNet reduces the number of convolutional layer filters and the number of fully connected layer nodes, which reduces both the total number of network parameters and the computational complexity. The resulting recognition model has a relatively simple structure and a small amount of calculation, which effectively alleviates the problem of overfitting.
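The 3.8% and 96.2% shares quoted above can be checked with a short calculation. The layer shapes below are not given in this manuscript; they are assumed from the original two-group AlexNet design (Krizhevsky et al.), in which the second, fourth, and fifth convolutional layers each see only half of the input channels.

```python
# (kernel_height, kernel_width, input_channels_per_group, output_channels)
ALEXNET_CONV = [
    (11, 11, 3, 96),
    (5, 5, 48, 256),    # grouped: each filter sees 48 of the 96 channels
    (3, 3, 256, 384),
    (3, 3, 192, 384),   # grouped
    (3, 3, 192, 256),   # grouped
]
# (inputs, outputs) of the three fully connected layers
ALEXNET_FC = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]

def conv_params(layers):
    """Weights plus one bias per output channel."""
    return sum(kh * kw * cin * cout + cout for kh, kw, cin, cout in layers)

def fc_params(layers):
    """Weights plus one bias per output node."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)

conv_total = conv_params(ALEXNET_CONV)   # 2,334,080
fc_total = fc_params(ALEXNET_FC)         # 58,631,144
total = conv_total + fc_total            # 60,965,224
conv_share = 100 * conv_total / total    # about 3.8 percent
fc_share = 100 * fc_total / total        # about 96.2 percent
```

Under these assumed shapes the first fully connected layer alone holds about 37.7 million parameters, which is why shrinking the fully connected layers removes most of the parameters.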

Network structure
LeafNet consists of five convolutional layers, two fully connected layers, and a classification layer. The numbers of filters in the first, second, and fifth convolutional layers are half of those used in AlexNet. In addition, the numbers of neurons in the two fully connected layers and the classification layer are set to 500, 100, and 7, respectively. The entire network structure is shown in Table 2.
In this experiment, the rectified linear unit (ReLU) activation function is used in every layer except the last, instead of the traditional sigmoid and tanh functions. The main disadvantages of the sigmoid and tanh functions are that they are expensive to compute and that they saturate: when the input is very large or very small, the output is nearly flat and the gradient is small, which hinders the weight update and can ultimately prevent the network from completing training. ReLU is more in line with the principle of neuron signal excitation. It sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters, and effectively alleviates overfitting. At the same time, ReLU propagates errors better and mitigates the vanishing gradient problem, so the network converges faster during training. After the nonlinear output of the first two convolutional layers, a local response normalization operation is introduced. It mimics the lateral inhibition phenomenon of neurobiology. Local response normalization creates a competition mechanism for the output of local neurons, making the neurons with large responses relatively larger, thereby enhancing the generalization ability of the model.
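The saturation argument can be illustrated numerically. The sketch below compares the derivatives of the three activation functions at a few inputs; the function names and the chosen inputs are illustrative only.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

At x = 10 the sigmoid and tanh derivatives are already close to zero, whereas the ReLU derivative stays at 1 for any positive input; for negative inputs ReLU outputs exactly 0, which is the source of the sparsity mentioned above.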
Dropout is applied to the first two fully connected layers. The dropout technique is an effective remedy for overfitting: at each training step, only a randomly selected subset of the nodes is trained rather than the entire network [37]. In this article, the dropout ratio is set to 0.5.
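A minimal sketch of dropout with a ratio of 0.5 follows. It uses the common "inverted" formulation, in which surviving activations are rescaled during training so that nothing needs to change at test time; whether the toolbox used in this study scales in the same way is an assumption, and the function name is hypothetical.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout (assumed variant).

    During training each unit is zeroed with probability `rate`, and the
    survivors are divided by (1 - rate) so the expected activation is
    unchanged. At test time the input is returned as is.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
dropped = dropout([1.0] * 10000, rate=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)          # stays close to 1.0
unchanged = dropout([1.0, 2.0], training=False)   # test-time behaviour
```

With a ratio of 0.5, roughly half of the nodes are silenced in each training step, so a different thinned network is trained every time.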
Softmax is the activation function of the last fully connected layer; it is mainly used in the output layer for multi-class classification problems. It makes the sum of all output values equal to 1. That is, the output values for the classes are converted into relative probabilities, and the class with the highest relative probability is taken as the prediction.
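As a small worked example, the sketch below applies softmax to seven made-up scores, one per disease class. The scores are illustrative and not taken from LeafNet; subtracting the maximum before exponentiation is a standard numerical safeguard that does not change the result.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0]  # made-up scores, 7 classes
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)  # most probable class
```

The ordering of the classes is preserved, so the predicted class is simply the one with the largest raw score; the softmax only turns the scores into comparable probabilities.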

Training network
LeafNet is trained using stochastic gradient descent (SGD). The weights of all convolutional layers and fully connected layers are initialized from a Gaussian distribution, and the biases are initialized with a constant of 1. This setting helps keep the input of the ReLU activation function positive at the start of training and can also speed up the training of the network [25]. Because the number of samples is small, the batch size is set to 16. Batch training can improve the convergence speed of the network and keep the memory usage at a low level. The initial learning rate of all layers of the network is set to 0.1. The learning rate is reduced as the error declines: each time, it is reduced to 0.1 times its previous value, with the minimum learning rate set to 0.0001. The number of epochs was set to 100, the weight decay was set to 0.0005, and the momentum was set to 0.9 [38]. LeafNet is implemented using Matlab's MatConvNet toolbox. The network training is performed on a Windows system with a Core i7-3770K CPU and 8 GB of RAM, and training is accelerated by two NVIDIA GeForce GTX 980 GPUs.
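The hyperparameters above can be put together in a short sketch of one update rule and one learning rate schedule. Only the numerical values (0.1, 0.1 times, 0.0001, 0.9, 0.0005) come from the text. The exact criterion for lowering the learning rate is not specified, so the "error did not improve" trigger is an assumption, and the update follows the momentum form with weight decay given in the AlexNet paper rather than any particular toolbox.

```python
def next_learning_rate(lr, error_improved, factor=0.1, floor=1e-4):
    """Multiply lr by `factor` when the error stops improving (assumed
    trigger), but never let it fall below `floor`."""
    return lr if error_improved else max(lr * factor, floor)

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update for a single weight with momentum and L2 weight decay."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Learning rate over a made-up sequence of "did the error improve?" flags.
lr = 0.1
history = [lr]
for improved in (True, False, False, False, False):
    lr = next_learning_rate(lr, improved)
    history.append(lr)

# A single update starting from w = 1.0 with gradient 0.5 and zero velocity.
w_new, v_new = sgd_momentum_step(w=1.0, grad=0.5, velocity=0.0, lr=0.1)
```

In this toy run the learning rate steps down by a factor of ten three times and then stays at the 0.0001 floor, however many further plateaus occur.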

Performance measurements
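As mentioned in [39], the classification accuracy and mean class accuracy (MCA) are used to evaluate the performance of the algorithm. $\mathrm{CCR}_k$ is first defined as the correct classification rate for class $k$, as shown in Eq. (1):
$$\mathrm{CCR}_k = \frac{C_k}{N_k} \tag{1}$$
where $C_k$ is the number of correctly identified elements of class $k$ and $N_k$ is the total number of elements in class $k$. Classification accuracy is then defined by Eq. (2):
$$\mathrm{Accuracy} = \frac{\sum_{k=1}^{K} \mathrm{CCR}_k \, N_k}{\sum_{k=1}^{K} N_k} \tag{2}$$
Lastly, MCA is determined using Eq. (3):
$$\mathrm{MCA} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{CCR}_k \tag{3}$$
where $K$ is the number of classes ($K = 7$ in this study).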
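These measures can be computed directly from an error (confusion) matrix. The sketch below uses a made-up 3-class matrix rather than the results of this study, and the function names are illustrative. Classification accuracy is taken as the total number of correctly identified elements divided by the total number of elements, and MCA as the unweighted mean of the per-class rates.

```python
def ccr_per_class(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    return [row[k] / sum(row) for k, row in enumerate(confusion)]

def classification_accuracy(confusion):
    """All correctly identified samples divided by all samples."""
    correct = sum(row[k] for k, row in enumerate(confusion))
    return correct / sum(sum(row) for row in confusion)

def mean_class_accuracy(confusion):
    """Unweighted mean of the per-class correct classification rates."""
    rates = ccr_per_class(confusion)
    return sum(rates) / len(rates)

# Made-up matrix with unequal class sizes (10, 5, and 25 samples).
toy = [[9, 1, 0],
       [2, 2, 1],
       [0, 0, 25]]
rates = ccr_per_class(toy)            # 0.9, 0.4, 1.0
acc = classification_accuracy(toy)    # 36 / 40
mca = mean_class_accuracy(toy)        # (0.9 + 0.4 + 1.0) / 3
```

In this toy case the accuracy (0.90) is higher than the MCA (about 0.77) because the poorly recognized class is small. When the classes are balanced, as after the data amplification described earlier, the two measures are close.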

Results and analysis
In this study, the accuracy of the SVM, MLP, and CNN classifiers in determining disease states for tea leaves from images was evaluated. The results of these analyses are shown in Figure 3. Error matrices were used to evaluate the accuracy of the tea leaf disease recognition classifiers (Tables 3-5). These data show that, although LeafNet is significantly better than the SVM and MLP algorithms, all three recognition algorithms can usually identify most tea leaf diseases correctly. Traditional machine learning algorithms extract only the surface features of images, and the number of such features is limited. Their ability to represent image content is therefore weak, resulting in a low accuracy rate for identifying diseases. In contrast, the CNN can automatically extract the deep features of the image, which express the characteristics of the disease image more accurately, so its recognition accuracy is higher.
It can be seen from the error matrix that the recognition accuracy of MLP and SVM for the seven tea leaf diseases is 70.94% and 60.91%, respectively, and the MCA is 70.77% and 60.62%, respectively. In the two algorithms, the correct rate of the bird's-eye spot is the highest, but there is no obvious regularity for the rest of diseases. The bird's-eye spot is clearly distinguishable, characterized by small and dense red-brown dots, which are significantly different from other disease characteristics, so its accuracy of identification is high.
The recognition accuracy of the SVM and MLP algorithms for tea leaf disease is not high, which is caused by the manual selection of features. The recognition performance of the SVM and MLP algorithms largely depends on whether the manually selected features are reasonable, and researchers usually rely on personal experience when selecting features. Although good results can be obtained using manually designed features, these features are specific to certain datasets. If the same features are used to analyze different data sets, the results may be very different, which is a problem inherent in these techniques. LeafNet has the best recognition performance on the bird's-eye spot, which may be due to the obvious pathological symptoms of this disease and the strong recognition ability of the LeafNet algorithm. The white spot disease was second, while the accuracies for the other diseases range from 84 to 93%. Because of the similar pathological characteristics of gray blight, red leaf spot, and brown blight, the classification accuracy of these three diseases is the lowest. The symptoms of gray blight and brown blight are very similar: both exhibit annulations in their late stage and are then hard to distinguish. In addition, the symptoms of white spot and bird's-eye spot both include reddish brown spots at early stages, and both anthracnose and brown blight are typified by waterlogged leaves during early disease stages, while their symptoms differ in the later stages. Some diseases can occur in tea plants throughout the year, whereas others occur at distinct times. Consequently, the time at which a disease is diagnosed may affect the accuracy of disease identification. Another factor that may affect the accuracy is that a tea leaf can be infected with two or more diseases at the same time. This is because a leaf infected by one pathogen is physiologically weakened, so a second pathogen can infect it easily. Together, these factors are the main reasons for the low accuracy of the model on some diseases.
In addition, the performance of LeafNet is compared with the method of Reference [40], which covers 10 diseases of 3 crops and reaches a maximum accuracy of 97.3%. The performance of LeafNet is thus slightly lower than that of Reference [40], which used the currently popular transfer learning approach. The main advantages of transfer learning are as follows: the network can converge quickly when the data set is small; it is easy to implement; and the training time is shorter. Therefore, in future work we will continue to study and apply transfer learning algorithms to identify more plant diseases.

Conclusion
CNNs have developed into mature techniques that have been increasingly applied in image recognition. The computational complexity needed for neural network analyses is considerably reduced compared to other algorithms, and it also significantly improves computing precision. Concomitantly, the high fault tolerance of CNNs allows the use of incomplete or fuzzy background images, thereby effectively enhancing the precision of image recognition.
Feature extraction is an important step in image classification and directly affects classification accuracy. Thus, two feature extraction methods and three classifiers were compared in their abilities to identify seven tea leaf diseases in the present manuscript. These analyses revealed that LeafNet yielded higher accuracy than the SVM and MLP classification algorithms. CNNs thus have obvious advantages for identifying tea leaf diseases. Importantly, the results from the present study highlight the feasibility of applying CNNs in the identification of tea leaf diseases, which would significantly improve disease recognition for tea plant agriculture. Although the disease classification accuracy of LeafNet was not 100%, the present method can be improved in future studies to provide more efficient and accurate guidance for the control of tea leaf diseases.
In this manuscript, the expansion process of sample data is a time-consuming process, but with the continuous growth of network information resources, the number of tea tree disease images will continue to increase, so we must collect images of different morphological features in the early, middle, and late stages of each disease and continuously expand the tea tree disease data set to make the data set more detailed and comprehensive.
At present, disease recognition is carried out on a desktop computer system. However, as the performance of smartphones continues to improve, the deep convolutional neural network recognition model can be migrated to Android-based mobile applications. Such an application could provide timely and accurate information about diseases and help with the control of tea tree diseases.