
Dataset expansion and accelerated computation for image classification: A practical approach

Aditya Mohan 1, Nafis Uddin Khan 2

1 Jaypee University of Information Technology, Waknaghat, Distt. Solan, H.P., [email protected]
2 Jaypee University of Information Technology, Waknaghat, Distt. Solan, H.P., [email protected]

Abstract. When training machine learning algorithms on datasets consisting mainly of images, the major hindrance and setback is the non-availability of three resources of paramount significance: quantity of data, availability of GPUs (Graphics Processing Units), and fast computation. Because of these, independent researchers have trouble training on such datasets and specifying features, which can be very numerous for images. In the following paper we present an approach for leveraging the power of "transfer learning" and easily accessible examples in the form of raw content on the internet, not only to use already-prepared datasets made specifically for neural network training but also to bring additional training examples into use, sampling the average accuracy output rate over the images, while reducing model training and execution time through parallel operation on different nodes.

Keywords: machine learning · transfer learning · datasets · neural networks · convolutional neural network (CNN)

1 Introduction

A neural network is an altogether different approach compared to supervised machine learning algorithms for various reasons, the most prominent being the cost of running and processing the algorithms on hardware.

The reasons for this can be understood as follows. Random Access Memory (RAM) on machines is cheap and readily available; the hundreds of gigabytes of RAM required to execute a complex supervised machine learning problem can easily be afforded by users. On the other hand, general access to Graphics Processing Units (GPUs) is not cheap: gaining access to, for instance, a hundred gigabytes of VRAM (video RAM) on GPUs is a time-consuming procedure that can involve significant costs.

A huge number of machine learning approaches perform well only under a commonly accepted assumption: that the test data and the training data are drawn from the same distribution and the same feature space. When the distribution changes (which happens frequently with image-based datasets), most statistical models must be rebuilt from the beginning using freshly accumulated training data. In most real-life applications it is expensive (as mentioned above), highly time-consuming and painstaking to re-gather the required training data and then rebuild the models, so it would be of great help to reduce the effort and time invested in re-collecting data for training. It is exactly for these cases that transfer learning, or knowledge transfer between the domains of various tasks, becomes highly appropriate. "Transfer learning" bestows upon us the ability to use pre-trained models from other sources (Google, independent researchers) after introducing small changes.

Throughout the process of implementing transfer learning, the development of pre-trained models is a prerequisite. A pre-trained model is a model created by some other source for solving a problem in the same domain, e.g. voice or image classification.
The general approach to developing with a pre-trained model is to use a model trained on another problem as a starting point, instead of building a model from scratch.

Various approaches are available for implementing pre-trained models. Traditionally, this requires handcrafting a set of features, for example running edge detection to find an individual's outline, or storing color histograms for sections of an individual (hair, teeth). This is not feasible given the myriad of variations among features belonging to the same category, which arise from differences in image source (cell-phone cameras, surveillance cameras, etc.) and from image rotation. It is worthwhile to note that the faces here represent multidimensional, meaningful visual expressions, and developing a computational model for face recognition is therefore strenuous [1]. To cater to these discrepancies [2], neural networks are taken into consideration. A neural network is a powerful technology for classifying visual (graphic) inputs arising from documents from multiple sources. From research and comparison of results, it has been inferred that the most important practice is acquiring a training set whose size is as substantial as possible; in the context of this paper, this means getting the maximum number of closely related images for training. The next most important practice is that convolutional neural networks (CNNs) are better suited to operating on visual documents than fully connected networks [1]. Convolutional neural networks are equipped with partial invariance to rotation, translation, deformation and scale, and operate by extracting large features in a set of layers based on a neatly defined hierarchy.

2 Convolutional Neural Networks

2.1 Functioning of a CNN

A convolutional neural network consists of three types of layers [3]:

Convolutional: These layers consist of a rectangular grid of neurons and require that the previous layer also consist of neurons set up in the same grid shape. Each neuron accepts its inputs from a rectangular section of the previous layer; it is worth noting that the weights for this section are the same for every neuron in that specific convolutional layer.

Max-pooling: After each convolutional layer, a pooling layer may be present. The pooling layer's task is to pick up small rectangular blocks from the previous convolutional layer and subsample each block, producing a single output from that block.

Fully connected: Finally, after the operations done by several convolutional and max-pooling layers, the final, high-level inference in the neural network is performed by fully connected layers. A fully connected layer consumes all the neurons from the previous layer, connecting each of them to every single neuron it possesses.

Forward propagation in convolutional layers

Given an N×N square layer of neurons followed by a convolutional layer using an m×m filter ω, the output of the convolutional layer has size (N−m+1)×(N−m+1). To compute the pre-nonlinearity input $x_{ij}^{\ell}$ to a unit of that layer, the contributions from the cells of the previous layer, weighted by the filter components, are summed:

$$x_{ij}^{\ell} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \omega_{ab}\, y_{(i+a)(j+b)}^{\ell-1} \qquad (1)$$
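To make equation (1) concrete, the following minimal NumPy sketch (our illustration, not code from the original experiments) computes the pre-nonlinearity outputs of a single-channel convolutional layer by sliding an m×m filter over an N×N input:

```python
import numpy as np

def conv_forward(y_prev, w):
    """Valid convolution per equation (1):
    x[i,j] = sum_{a,b} w[a,b] * y_prev[i+a, j+b].

    y_prev: (N, N) activations of the previous layer
    w:      (m, m) shared filter weights
    returns (N-m+1, N-m+1) pre-nonlinearity inputs x
    """
    N, m = y_prev.shape[0], w.shape[0]
    out = N - m + 1
    x = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # each unit sums a weighted m-by-m window of the previous layer
            x[i, j] = np.sum(w * y_prev[i:i + m, j:j + m])
    return x

# example: 5x5 input, 3x3 filter -> 3x3 output
y = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0          # a simple averaging filter
print(conv_forward(y, w).shape)    # (3, 3)
```

In practice, frameworks such as TensorFlow implement this operation with heavily optimized kernels; the explicit loops here only mirror the indexing of equation (1).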
Forward propagation in max-pooling layers

Here a k×k region is simply passed over the input, and the output is a single value, namely the maximum within that region.

Backward propagation in convolutional layers

For some error function E whose error values at the convolutional layer are known, the errors that need to be computed are the partials of E with respect to each neuron output. To compute the gradient component for each weight, the values of the "deltas" are first computed as

$$\frac{\partial E}{\partial x_{ij}^{\ell}} = \frac{\partial E}{\partial y_{ij}^{\ell}}\,\frac{\partial y_{ij}^{\ell}}{\partial x_{ij}^{\ell}} = \frac{\partial E}{\partial y_{ij}^{\ell}}\,\sigma'\!\left(x_{ij}^{\ell}\right) \qquad (2)$$

and the errors are then propagated back to the previous layer using the chain rule:

$$\frac{\partial E}{\partial y_{ij}^{\ell-1}} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \frac{\partial E}{\partial x_{(i-a)(j-b)}^{\ell}}\,\omega_{ab} \qquad (3)$$

Backward propagation in max-pooling layers

The backpropagated errors belonging to the max-pooling layers remain sparse: the error is passed back only to the unit that produced the maximum, and the pooling layers do no learning by themselves.

2.2 Integrating pre-trained CNN models [4]

Pre-trained CNN models save a great deal of time, computation power and cost. They require the implementation of transfer learning, i.e. re-training the last layer on features of the targeted image set to be used for classification training, and this holds true for multiple classes. The deep learning library Keras provides five convolutional neural networks that have been pre-trained on the ImageNet dataset, namely VGG16, VGG19, ResNet50, Inception V3 and Xception, out of which this paper uses the Inception V3 pre-trained CNN model. Inception was used because it was trained by Google on ImageNet's 1.2 million images in 1000 categories, making it highly preferable for achieving maximum training accuracy.

ImageNet

ImageNet is a project targeted at categorizing and labeling images into approximately 22,000 distinct object categories for the purpose of computer vision research. In the context of deep learning and convolutional neural networks, ImageNet refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), whose primary goal is to train a model capable of correctly classifying an input image into 1000 distinct object categories. Here models are trained on 1.2 million training images, with a further 50,000 images for validation and 100,000 images for testing. The 1000 image categories represent object classes encountered in our everyday lives, such as species of squirrels and cats, indoor and outdoor objects, and vehicle types.

The Inception model

The Inception micro-architecture was first introduced and presented by Szegedy et al. in their 2014 paper "Going deeper with convolutions" [5]. The primary goal of this module is to act as a multi-level feature extractor by computing 1×1, 3×3 and 5×5 convolutions within the same module of the network; the outputs of these filters are stacked along the channel dimension before being passed on to the next layer of the network. The original version of this architecture was called GoogLeNet; subsequent manifestations are simply called Inception vM, where M refers to the version number set out by Google. The Inception v3 (M = 3) architecture included in the core of the Keras deep learning library comes from Szegedy et al.'s later publication, "Rethinking the Inception Architecture for Computer Vision" (2015) [6], which proposes various updates to the inception module to further boost ImageNet classification accuracy. The weights for Inception V3 are smaller than those of both ResNet and VGG, at close to 96 MB. Since ResNet uses global average pooling rather than fully connected layers, its model size also turns out to be smaller, coming down to 102 MB for ResNet50 (50 layers of weights). Owing to its depth and its number of fully connected nodes, VGG is over 533 MB for VGG16 and 574 MB for VGG19, which makes deploying VGG a time-consuming task.
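As a brief illustration of how one of these pre-trained networks is pulled in, here is a minimal Keras sketch (ours, with an illustrative file name; exact import paths vary between Keras versions) that loads ImageNet-pretrained Inception V3 and classifies a single image:

```python
# Minimal sketch: load ImageNet-pretrained Inception V3 in Keras and
# classify one image. "example.jpg" is an illustrative placeholder.
import numpy as np
from keras.applications.inception_v3 import (InceptionV3,
                                             preprocess_input,
                                             decode_predictions)
from keras.preprocessing import image

model = InceptionV3(weights="imagenet")      # downloads ~96 MB of weights

img = image.load_img("example.jpg", target_size=(299, 299))  # V3 input size
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
# top-5 ImageNet labels with their softmax probabilities
print(decode_predictions(preds, top=5)[0])
```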
Figure 1. General layout of a convolutional neural network [7].

Figure 2. Layout of the inception module [4].

2.3 Training the Inception model

On feeding images as input, each layer performs a series of operations on the data until the network outputs a label and a classification percentage. Each layer has its own abstractions, for example edge detection in one and shape detection in another, and these become increasingly abstract as the layers proceed. The last two layers of the Inception module comprise the highest-level detectors for whole objects, and as per the training specification, these last two layers are the ones to be trained on the features of the desired classification parameters (image clusters in separate folders to be used for training the classifier, and so on).

2.4 Proposed approach to implementation

The proposed approach is to execute the following steps on two different systems simultaneously and take the required observations.

- For retrieving images from the internet, we used specific Google search keywords to obtain the most relevant and useful images for training the model. For instance, combinations of operators [8] including but not limited to allintext:, allintitle:, inurl:, intitle:, filetype: and site: were used in Google searches to surface the relevant images (example queries are sketched after this list).
- For accelerated and hassle-free image extraction, web browser extensions were deployed to save the images on the respective nodes; depending on the system speed, these extracted the image results directly into separate folders. This is a simple, easy-to-use method which, combined with the other steps, can prove highly beneficial for increasing dataset size and reducing dataset preparation time.
- The entire training dataset was arranged in different folders according to the classification data they contained (for example, images of one person in one folder, images of another person in a second folder, and so on).
- After all the dependencies were set up (TensorFlow, a virtual environment to run the entire process in, deep learning libraries such as Keras), the Inception model classifier was trained on the images in the folders created above.
- Since the training touches only the last two layers of the model, commands were implemented to cache the outputs of the lower layers on disk so that they do not have to be calculated repeatedly.
- The number of epochs (iterations, roughly speaking) was defined.
- Output labels (which are the same as the training folder names) and graphs were defined for ease of reading and interpreting the outputs.
- The training was executed to train the classifier (a sketch of this retraining step follows the list).
- After the training, TensorFlow scripts were implemented to use the newly trained classifier to classify images from the test dataset as belonging to one of the training classes (folders of images, in this case).
- The scripts comprise the trained model and the test data images stored in variables; the operational part feeds the image data into the (now retrained) model to obtain the output. Here a softmax function is used to map the input data to probabilities for each anticipated (and specified) output class.
- The same process was undertaken from the different systems to accomplish the same goal.
- IMPORTANT: the training accuracy, classification accuracy and processing time (model training and classification time) were observed and noted on both systems.
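As an illustration of the retrieval step, the queries used were of the following shape (hypothetical examples built from the operators listed in [8]; the actual keyword combinations depend on the target classes):

```python
# Hypothetical operator-based Google image queries for collecting
# class-specific training images (operators per [8]).
queries = [
    'allintitle: "golden retriever" filetype:jpg',    # title-restricted hits
    'intitle:portrait inurl:gallery "golden retriever"',
    'site:flickr.com "golden retriever"',             # restrict to one host
]
for q in queries:
    # tbm=isch selects the Google Images results tab
    print("https://www.google.com/search?tbm=isch&q=" + q)
```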
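The retraining step itself can be sketched as follows. This is our minimal Keras illustration rather than the exact retraining script used in the experiments: the pre-trained Inception V3 base is frozen so it acts as a fixed feature extractor, and only a new softmax classification head is trained on the image folders (folder names hypothetical):

```python
# Sketch, assuming one folder per class under data/train/<class_name>/
# and a Keras install with a TensorFlow backend.
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D

# Pre-trained base without its original 1000-way ImageNet head.
base = InceptionV3(weights="imagenet", include_top=False)
for layer in base.layers:
    layer.trainable = False          # freeze: only the new head learns

# New top layers: pooled features -> softmax over our folder labels.
x = GlobalAveragePooling2D()(base.output)
out = Dense(2, activation="softmax")(x)   # 2 = number of training folders
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Folder-per-class training data, as described in section 2.4.
gen = ImageDataGenerator(preprocessing_function=preprocess_input)
train = gen.flow_from_directory("data/train", target_size=(299, 299),
                                batch_size=32, class_mode="categorical")
model.fit_generator(train, epochs=10)     # epochs as defined above
```

Freezing the base corresponds to the caching step above: since the lower layers' weights never change during retraining, their activations for each training image can be computed once and reused across epochs.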
Figure 3. Training layers of the Inception model.

3 Results

- The classifier was trained on training datasets comprising photographs separated into two folders.
- The training of the classifier took place step-wise, displaying the training accuracy, cross-entropy and validation accuracy for each training image as it was used to train the model.
- Commands were executed to run the classification; the target images used were images bearing resemblance to the images in both datasets.
- Applying the above step to test various images, an accuracy score for each image was produced, and graphs were plotted to understand the execution time.

Figure 4. Inception module preparation before training.

Figure 5. Image dataset (zoomed out to 50%) of one folder (one class) of images for training.

Figure 6. Training accuracy, cross-entropy and validation accuracy for each training image while being used to train the model.

Figure 7. Final result after training the model and testing it on individual target images bearing resemblance to each of the training datasets, one by one.

Figure 8. Graphical comparison of the time required for training the model and testing images for classification with (i) data concentrated on a single node and (ii) data distributed on two different nodes for simultaneous processing.
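The two-node comparison of Figure 8 can be emulated on a single machine. Below is a toy sketch (our illustration, not the paper's actual setup) in which two worker processes stand in for the two nodes; the test images are split across them and the classification pass is timed in each configuration:

```python
# Toy sketch of the single-node vs. two-node timing comparison
# (local processes stand in for separate machines; illustrative only).
import time
from multiprocessing import Pool

def classify(path):
    """Placeholder for running the retrained classifier on one image."""
    time.sleep(0.05)          # pretend per-image inference cost
    return path, 0.97         # (image, softmax score) stand-in

if __name__ == "__main__":
    images = ["img_%03d.jpg" % i for i in range(40)]

    t0 = time.time()
    single = [classify(p) for p in images]          # one node, sequential
    t1 = time.time()

    with Pool(processes=2) as pool:                 # "two nodes"
        double = pool.map(classify, images)
    t2 = time.time()

    print("single node: %.2fs, two nodes: %.2fs" % (t1 - t0, t2 - t1))
```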
4 Conclusion

We have successfully trained on readily available images extracted from the internet, both by performing highly specific Google searches with specific search operators pointed in the appropriate direction, so as to obtain the most relevant images for training the model, and by using a pre-trained, well-tested and highly popular model, Inception, provided by a reliable source (Google), which required us to train only certain parts of it.

The model we trained was able to classify the test dataset images with high accuracy, and in less time when run on distributed nodes than when run on a single node (as is visible in the graph, the time difference between the two cases is significant). This demonstrates that as a dataset of relevant, standard and thematically unambiguous images grows in size, classification accuracy increases, and that the time costs of training the model and predicting target images can be reduced significantly if the task is distributed across different nodes and performed simultaneously.

As mentioned above, we successfully dispatched the training work to different nodes, and by combining their average output accuracy for specific test images we obtained a high accuracy rate for classifying the test dataset images. Distributing the model training and classification across nodes sped up training and, in turn, classification, which demonstrates that distributing tasks over nodes accelerates the training and classification of large amounts of data; the approach also benefits users by being non-centralized and customizable.

Thus we have justified the aim of this paper: to leverage the power of transfer learning and easily accessible examples in the form of raw content on the internet, not only using already-prepared datasets made specifically for neural network training but also bringing additional training examples into use, sampling the average accuracy output over the images, and using distributed nodes for faster training and classification of the images, thereby boosting the efficiency of testing, debugging and building machine learning projects.

References

1. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.: Face Recognition: A Convolutional Neural Network Approach. IEEE Transactions on Neural Networks 8, 98-113 (1997). doi:10.1109/72.554195
2. Simard, P., Steinkraus, D., Platt, J.: Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In: ICDAR 2003, pp. 958-962 (2003). doi:10.1109/ICDAR.2003.1227801
3. Gibiansky, A.: Convolutional Neural Networks, http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks
4. Rosebrock, A.: ImageNet: VGGNet, ResNet, Inception, and Xception with Keras, https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras
5. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going Deeper with Convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9 (2015). doi:10.1109/CVPR.2015.7298594
6. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception Architecture for Computer Vision. In: CVPR, pp. 2818-2826 (2016). doi:10.1109/CVPR.2016.308
7. Algobeans.com: Introduction to Convolutional Neural Networks, https://algobeans.com/2016/01/26/introduction-to-convolutional-neural-network
8. Google search operators, Googleguide.com, http://www.googleguide.com/advanced_operators_reference.html