Resham N. Waykole
Department of Computer Engineering
Pimpri Chinchwad College of [email protected]

D. Thakare
Department of Computer Engineering
Pimpri Chinchwad College of [email protected]

Natural Language Processing (NLP) and Machine Learning are widely used in today’s digitalization of data. Over time, the value of data keeps changing, and it is important to capture that value in order to perform in-depth research in various domains. Over the past decade, natural language processing has gained much importance because it reveals a lot of hidden information in texts. It is difficult to discover the information of interest in a huge volume of text data; thus, information extraction based on computational text processing is necessary. For many information management goals, recognising the phrases and words in free text that fall under particular classes of interest is an important first step. It is crucial to manage the huge amount of text being generated, for example clinical and biomedical text. Features can be extracted for classification of the documents. Feature extraction selects an important subset of features from the data to improve the classification task, so correctly identifying the relevant features in a text is important. Therefore, applying and expanding NLP techniques can help to better understand and study the data. This paper aims at analysing the clinical literature on cancer. The feature extraction methods bag of words, TF-IDF and Word2Vec are compared for clinical text analysis. The extracted features are evaluated using Logistic Regression and Random Forest classifiers.

Keywords: Natural Language Processing, Feature Extraction, Classification, Bag of Words, TF-IDF, Word2Vec, Logistic Regression, Random Forest Classifier.

II. Introduction

Text data is the simplest form of data and is unstructured in nature. It is generated in huge amounts in most scenarios.
Humans can clearly perceive and process unstructured text data, but it is difficult for machines to understand it. This voluminous text data is an important source of knowledge and information. Therefore, methods and algorithms are needed to use the information extracted from text data effectively in a variety of applications. NLP has gained a great deal of attention in the past few years because of the huge amount of text data generated in many forms, such as social networks, patient records, news outlets and healthcare insurance data. A report generated by EMC predicts that, by 2020, the volume of data will grow to 40 zettabytes [4]. It is difficult for humans to go through all such text data, find the information of interest, and organize large amounts of data. To enable the effective transformation and representation of such data, the process includes calculating the word frequencies within each document and across the entire collection of documents. Therefore, it is important to extract the needed information from the unstructured text data [6].

III. Related Work

Extracting information from text helps in analyzing the text data for various applications, reports, clinical records, automated terminology management, research subject identification, data mining and studying the effect of research on them, etc. Feature extraction is a vital technique in dimensionality reduction for extracting the important features. Samina Khalid et al. [1] have reviewed some common feature selection and feature extraction methods, analyzing how effective these techniques are at achieving high performance of learning algorithms, which ultimately improves the prediction accuracy of the classifier. They have also analyzed the strengths and weaknesses of some widely used dimensionality reduction techniques. In [2], several basic text mining tasks and techniques such as text data pre-processing, clustering and classification are described.
Text mining in the healthcare and biomedical domains is also briefly explained there. The authors of [3] proposed a term frequency (TF) with stemmer-based feature extraction algorithm and tested its performance using various classifiers. The results show that the proposed method outperforms other methods. In [5], [6], [7], [8], various feature selection methods such as document frequency and information gain, and feature extraction techniques such as principal component analysis (PCA) and latent semantic indexing (LSI), are discussed, along with the classifiers used for classification of documents. In [16], [19], several automatically extracted features are compared; the features are extracted for sentiment analysis of Twitter. The commonly used feature extraction method TF-IDF is used. The TF-IDF technique is improved for feature extraction for better accuracy [15], [21]. Faheema AG et al. [14] have introduced an efficient technique which increases the accuracy by using a bag of visual words representation as a feature selection method. Word2vec is another widely used feature extraction technique. The authors of [9] and [10] have discussed the word2vec method; in addition, [9] proposes a hybrid method that extracts features from the data using both LDA and Word2Vec. The method derives the relationships between topics and documents and also combines the contextual relationships among the words. The results show that features generated by this hybrid technique are useful for improving classification performance.

IV. Data

The dataset is taken from Kaggle. It is related to cancer and contains genes, genetic mutations caused by cancer, and clinical text. The data are provided via two different files for each of the training and test sets. One, called training/test_variants, has the information about the genetic mutations.
The other, training/test_text, provides the clinical text, that is, clinical research papers related to cancer that human experts used to classify the genetic mutations. Both files can be linked via the ID field. The training dataset contains 3000 instances.

V. Methods

Text feature extraction is the process of taking out a list of words from the text data and transforming them into a feature set that is usable by a classifier. This work emphasizes the review of available feature extraction methods. The following techniques can be used for extracting features from text data.

A. Bag of words:

The bag of words is the most common and the simplest of the feature extraction methods; it forms a word presence feature set from all the words of an instance. It is known as a “bag” of words because the method discards the order of the words; in its simplest form it also ignores how many times a word occurs, and all that matters is whether the word is present in a list of known words. The resulting features can be used in modelling with machine learning algorithms. The method is flexible and simple, and it can be used for extracting features from text data in various ways. A bag of words is a representation of text data that describes the occurrence of words in the document. It includes:
1. A lexicon of known words
2. A measure, such as the frequency, of the presence of those known words.
The complexity of the bag of words model lies both in determining how to score the presence of known words and in how to design the vocabulary of known words.
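
The two ingredients listed above, a lexicon of known words and a score for each known word, can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the tokenizer, the function names (`tokenize`, `build_vocabulary`, `bag_of_words`) and the two example sentences are assumptions made purely for illustration. The `binary` flag switches between the two scoring choices discussed above, word presence and word frequency.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Return a sorted lexicon of every known word in the collection."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bag_of_words(text, vocabulary, binary=False):
    """Score each known word in `text`: 1/0 presence if binary, else a count."""
    counts = Counter(tokenize(text))
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


# Hypothetical sentences, invented purely for illustration.
documents = [
    "The mutation activates the gene",
    "The gene is not affected",
]

vocabulary = build_vocabulary(documents)
count_vectors = [bag_of_words(doc, vocabulary) for doc in documents]
presence_vectors = [bag_of_words(doc, vocabulary, binary=True) for doc in documents]
```

Each document becomes a vector whose length equals the vocabulary size, and word order is lost in both variants. In practice a library implementation such as scikit-learn's `CountVectorizer`, which offers a `binary` option for the presence variant, would typically be preferred over hand-written code.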