Text classification can be a secret weapon for building cutting-edge systems and organizing business information. Turning textual data into quantitative data yields actionable insights that can drive business decisions, and it lets you automate manual, repetitive tasks and get more done.
Most users want their documents stored in a systematic and secure way. Imagine a huge collection of books: novels, storybooks, fiction, and books on culture and heritage, history, geography, and so on. If someone asks for a book on history, finding it is difficult; a manual search could take hours, perhaps even days.
If we categorize the books, searching becomes both efficient and more secure. Yet a major problem faced by institutions, organizations, and businesses is information overload: sorting useful documents out of a collection full of irrelevant ones challenges the ingenuity and resources of individuals and organizations alike.
Keyword search engines can help, but they have limitations: they do not discriminate by context.
Manual classifications and clustering
On the other hand, manually classifying and clustering large volumes of documents is not feasible. We therefore need an automatic classifier to manage documents more securely: by classifying a document we can establish the required level of protection with far less manual effort. Document classification and clustering are two important techniques for achieving this goal. Having completed the development of a full-fledged text classification system for English and Urdu using Scikit-Learn, I want to share the bottlenecks and crucial phases of text classification.
In this article I will discuss supervised text classification, the process of assigning text documents or sentences to one or more predefined labels. Text classification has many commercial applications, such as categorizing news articles by topic, classifying user sentiment towards a product, organization, or service, validating loan applications for a bank, organizing movies or music by genre, auto-tagging customer queries, and many more.
Several APIs provide text classification pipelines, such as NLTK, spaCy, Gensim, TextBlob, FastText, and Flair. However, I will use Scikit-Learn: it is an excellent API for a wide range of machine learning tasks, is built on top of SciPy, NumPy, and matplotlib, and provides a robust pipeline to train and test several classifiers in no time. If you are looking for an implementation of text classification using Keras, check this article.
Types of Text classification
2.1 Supervised classification
Supervised classification classifies documents on the basis of externally provided knowledge. Each training example is a pair consisting of an input object and a desired output value; a supervised algorithm analyses the training data and produces an inferred function that is then used to map new examples. Labels, or predefined classes, are assigned to all data samples in advance, as if a "teacher" supplied the classes (supervision).
Unlike a human learning from past experiences, a computer has no "experiences" of its own: it learns from data that represent past experiences of an application domain. Our focus is to learn a target function that can predict the values of a discrete class attribute, e.g. approved or not-approved, high-risk or low-risk. This is commonly called supervised learning, classification, or inductive learning, and it proceeds in two steps:
• Learning (training): learn a model using the training data.
• Testing: test the model using unseen test data to assess its accuracy.
2.2 Unsupervised classification
Unsupervised classification works entirely without reference to external information and can be achieved through clustering.
Clustering groups together data samples that are similar in some way according to criteria we choose. It is a method of data exploration, a way of looking for patterns or structure of interest in the data. It involves descriptors and descriptor extraction, where descriptors are sets of words that describe the contents of a cluster, and it is considered a centralized process. A typical example is web document clustering for search users. Clustering requires no predefined classes or categories: in unsupervised clustering we start with an unlabelled collection of documents.
The aim is to cluster the documents such that documents within a cluster are more similar to each other than to documents in other clusters.
2.3 Types of Supervised Text Classification
There are numerous tutorials on text classification over the web; however, most of them cover only the simplest form of classification. I will cover three different types of text classification, with code snippets drawn from real-world scenarios. The different types are briefly illustrated below.
2.4 Binary Classification
Binary classification is the simplest form of classification: it categorizes text into exactly two classes, such as identifying whether an email is spam or not.
2.5 Multiclass Classification
Multiclass classification implies a classification task with more than two predefined classes, for instance, classifying a collection of fruit descriptions into apples, oranges, pears, and grapes. Although this type of classification may have several labels available for a collection of samples, each sample gets exactly one label: a description of a particular fruit is either pear or apple, but cannot belong to both classes at the same time.
2.6 Multilabel Classification
Multilabel classification assigns a set of predefined labels to each sample, and the labels are not mutually exclusive, such as categorizing a movie into more than one genre.
In other words, a sample may belong to multiple labels: a business news story may contain content related to sports or entertainment along with investments and profits.
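To make the multilabel idea concrete, here is a small sketch (the movie titles and genres are hypothetical) showing how Scikit-Learn's MultiLabelBinarizer turns overlapping label sets into an indicator matrix that multilabel classifiers can consume:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each movie may carry several genres at once; labels are not exclusive.
movie_genres = [
    {"action", "comedy"},             # movie 1
    {"drama"},                        # movie 2
    {"action", "drama", "thriller"},  # movie 3
]

# Binarize: one column per genre, 1 if the movie has that genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(movie_genres)
print(mlb.classes_)  # columns in alphabetical order
print(Y)
```

Each row of `Y` can then serve as a multi-output target for a classifier such as `OneVsRestClassifier`.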
2.7 Generic Methodology of Text Classification using Scikit-Learn
Regardless of the type of classification task, the process typically starts by loading the dataset and resolving duplicates and empty values. A preprocessing pipeline comprising tokenization, stop word removal, and stemming is then applied in order to reduce the dimensionality of the data.
After preprocessing, a set of features is extracted and transformed into feature vectors using a bag-of-words representation, and a machine learning algorithm is trained over the transformed features. Once trained, the algorithm becomes a text classification model that can make predictions on unseen data. Finally, the performance of the model is assessed using performance metrics, and its parameters are tuned further to raise classification performance.
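The whole methodology can be sketched end to end in a few lines. This is a minimal illustration, not the article's exact system: the sample documents and class names are invented, and the vectorizer's built-in tokenization and stop word removal stand in for the full preprocessing pipeline described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical dataset: two classes, two documents each.
docs = [
    "the batsman scored a century in the final match",
    "the bowler took five wickets yesterday",
    "wheat and rice crops need timely irrigation",
    "farmers harvested the mango orchards this season",
]
labels = ["sports", "sports", "agriculture", "agriculture"]

# The vectorizer handles tokenization, lowercasing and stop word removal;
# the classifier is then trained on the resulting TF-IDF feature vectors.
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
model.fit(docs, labels)

# Predict the class of an unseen document.
print(model.predict(["the wickets fell quickly"]))
```

The same pipeline object can later be cross-validated or grid-searched for the tuning step mentioned above.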
3 Sequence of Steps For Text Classification
I will implement text classification by completing the below mentioned steps in sequence.
1 Python Ecosystem
2 Load Dataset
3 Feature Engineering
4 Model Selection and Training
5 Model Evaluation
4 Python Ecosystem
First, you will need the following libraries in order to build a text classification engine.
You can install the mentioned libraries using pip, or via the library search window provided by a full-featured Python IDE such as PyCharm. If you are using a lightweight IDE such as IDLE, run the following commands in a terminal:

pip3 install pandas
pip3 install nltk
pip3 install scikit-learn

Go ahead and install the latest version of each library in your favorite environment.
5 Load Dataset
For the task of text classification I use a short version of the famous 20_newsgroup classification dataset, which is readily available in Scikit-Learn and contains almost 18,000 newsgroup posts across 20 different categories. Since the full dataset is very large, I prepared a short version of 200 newsgroup posts by slicing only 10 samples from each category.
70% of the data is extracted for training and 30% for testing. All corpus labels are then encoded using a label encoder, which assigns a unique integer (starting from zero) to each label, since most learning algorithms can only process numeric data. For example, the category column is encoded as numerical in figure 10.
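The split-and-encode step can be sketched as follows. To keep the snippet self-contained I use a tiny stand-in corpus with invented texts and category names; in the article itself the 200-post slice of 20_newsgroup would be loaded (e.g. via `sklearn.datasets.fetch_20newsgroups`) in place of the lists below.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Stand-in corpus: in practice this is the 200-post 20_newsgroup slice.
texts = ["post about cricket", "post about hockey",
         "post about wheat", "post about rice",
         "post about linux", "post about windows"]
categories = ["sport", "sport", "agri", "agri", "comp", "comp"]

# Label encoder maps each category name to a unique integer starting at 0,
# because most learning algorithms can only process numeric targets.
encoder = LabelEncoder()
y = encoder.fit_transform(categories)
print(list(encoder.classes_))  # alphabetical order: ['agri', 'comp', 'sport']

# 70% for training, 30% for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))
```

`encoder.inverse_transform` can later map predicted integers back to the original category names.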
6 Feature Representation
Most classifiers and learning algorithms expect numerical feature vectors of fixed size rather than raw text documents of variable length. Therefore, during the preprocessing step, the texts are converted into a more manageable representation.
One common approach for extracting features from text is the bag-of-words model: for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored. The most extensively used feature representation approaches of this kind are term frequency (TF) and term frequency–inverse document frequency (TF-IDF).
Specifically, for each term in our dataset I will calculate both the term frequency and the TF-IDF measure, using sklearn.feature_extraction.text.CountVectorizer and TfidfVectorizer to build a tf and a tf-idf vector for each sample of the dataset. Both bag-of-words representations are discussed in detail below.
6.1 Term Frequency–Inverse Document Frequency (TF-IDF)
TF-IDF, short for term frequency–inverse document frequency, is a numeric measure that scores the importance of a word in a document based on how often it appears in that document and in a given collection of documents. The intuition is: if a word appears frequently in a document, it should be important and we should give it a high score; but if it appears in too many other documents, it is probably not a unique identifier, and we should assign it a lower score.
The formula for this measure is tfidf(t, d, D) = tf(t, d) · idf(t, D), where t denotes a term, d a document, and D the collection of documents. TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams):
• Word level TF-IDF: matrix representing tf-idf scores of every term in the different documents.
• N-gram level TF-IDF: n-grams are combinations of N consecutive terms; this matrix represents tf-idf scores of n-grams.
• Character level TF-IDF: matrix representing tf-idf scores of character-level n-grams in the corpus.
6.2 Term Frequency (TF)
The first part of the formula, tf(t, d), simply counts the number of times each word appears in each document. As is common in text mining, stop words such as "a" and "the" and punctuation marks are removed beforehand, and all words are converted to lower case. The TF matrix of a dataset is a matrix in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of that term in that document.
6.3 Creating TF and TF-IDF vectors
To illustrate, consider a hypothetical corpus of five documents across three classes (class 1, class 2, class 3), containing ten unique terms represented as t1…t10:
Class1Doc1 = t1 t2 t3 t4 t5
Class1Doc2 = t2 t6 t3 t7
Class2Doc1 = t4 t7 t8 t1
Class2Doc2 = t9 t7 t4
Class3Doc1 = t8 t10 t1
The total number of occurrences of a term in a particular class is the frequency of that term for that class. For example, the term frequency of t1 is 1 for each of the three classes, since it occurs once in a document of each class.
The TF-IDF of t1 without smoothing is computed as TF(t) = (number of times term t appears in a document) and IDF(t) = ln(total number of documents / number of documents containing term t), giving TF-IDF of t1 = 1 · ln(5/3) ≈ 0.51. TF and TF-IDF representations can be created using the CountVectorizer and TfidfVectorizer classes provided by Scikit-Learn; results of both representations are shown in figures 12 and 13 respectively.
7 Feature selection
Feature selection filters out the most discriminative features for classifiers, and several techniques exist for this purpose. For example, a feature ranking metric like ACC2 can be exploited to filter the most discriminative terms. ACC2 computes the absolute difference between the true positive rate and the false positive rate:
ACC2 = |tpr − fpr| (2)
where tpr and fpr are defined as:
tpr = tp / (size of positive class) (3)
fpr = fp / (size of negative class) (4)
Here tp is the number of positive-class documents in which the term t is present, and fp is the number of negative-class documents in which the term t is present. If the term "brown" appears in every positive-class document and in no negative-class document, its ACC2 score will be
ACC2 = |(1/1) − 0| = 1 (5)
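A hedged sketch of ACC2 ranking as just defined, written from scratch for this article (the function name, corpus, and labels are all illustrative): for each term, score it by the absolute difference between the fraction of positive-class documents containing it and the fraction of negative-class documents containing it.

```python
def acc2_scores(docs, labels, positive_label):
    """Score each term by |tpr - fpr| for a binary class split."""
    pos = [set(d.split()) for d, l in zip(docs, labels) if l == positive_label]
    neg = [set(d.split()) for d, l in zip(docs, labels) if l != positive_label]
    vocab = set().union(*pos, *neg)
    scores = {}
    for term in vocab:
        tpr = sum(term in d for d in pos) / len(pos)  # positive docs with term
        fpr = sum(term in d for d in neg) / len(neg)  # negative docs with term
        scores[term] = abs(tpr - fpr)
    return scores

# Toy corpus: "brown" appears in all positive docs and no negative docs.
docs = ["quick brown fox", "brown dog", "lazy cat", "sleepy cat"]
labels = ["animals_wild", "animals_wild", "pets", "pets"]
scores = acc2_scores(docs, labels, "animals_wild")
print(scores["brown"])  # |1 - 0| = 1.0, matching equation (5)
```

Terms with the highest scores are the most discriminative and can be kept while the rest are dropped.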
7.1 Implementing feature representation approaches on the small 20 newsgroup dataset
I have implemented CountVectorizer and TfidfVectorizer at word, n-gram, and character level; moreover, I have selected useful features through the max features threshold.
8 Model Selection and Training
After feature engineering, a machine learning algorithm is selected and trained over the resulting features. I have implemented a Naive Bayes and a one-vs-rest classifier with multifarious features. Naive Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors: a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. One-vs-all (one-vs-rest) classification, on the other hand, trains N distinct binary classifiers, each designed to recognize a particular class; those N classifiers are then used collectively for multi-class classification, as demonstrated below.
The code snippets for both classifiers are shown below:
9 Model Evaluation
Let's take a hypothetical document classification example with two classes, A and B, each containing two documents; D1, D2, D3, and D4 denote the four documents.
Class A (Agriculture)
D1. Agriculture experts have said that Pakistan has more than thirty types of fruits.
D2. Experts added that at this time Tashwara gardens have fallen in the endurance process.
Class B (Sports)
D3. Cricket South Africa’s annual awards were fostered by Fast Bowler Kigassu Rabada.
D4. This is the first time a player has won more than 5 awards for 2 times.
To evaluate the performance of the classifier we need a gold standard for the corpus: a manually labelled corpus in which each document is tagged with the name of the class it belongs to. In our case we have two classes, Agriculture and Sports.
Labeling according to gold standard
In the gold standard, the documents of each class are labelled with the class name: D1 and D2 belong to Agriculture, and D3 and D4 belong to Sports. The classifier predicts the label of each document, and to evaluate how many of its predictions are correct there exist various evaluation measures such as precision, recall, and F-measure. All of these measures use a confusion matrix to evaluate the performance of the classifier, so to understand the confusion matrix, let's take an example of an umpire's decision about a batsman.
| Decision | Actually out | Actually not out |
| Umpire decision out | tp | fp |
| Umpire decision not out | fn | tn |
Tp = the umpire gives out and it actually was out, so there is no mistake.
Fn = the umpire gives not out but it actually was out, so the umpire made a mistake.
Fp = the umpire gives out but it actually was not out; this is also a mistake of the umpire.
Tn = the umpire gives not out and it actually was also not out; there is no mistake.
For our hypothetical corpus, with Sports as the positive class, the confusion matrix (derived below) is:
| Class | Actually class Sports | Actually class Agriculture |
| Predicted class Sports | tp=1 | fp=1 |
| Predicted class Agriculture | fn=1 | tn=1 |
We can conclude that tp and tn mean no mistake was made, whereas fp and fn indicate a mismatch between the predicted and the actual decision. Similarly, for our hypothetical dataset, assume that after prediction the classifier assigns the following labels: D1 → Sports, D2 → Agriculture, D3 → Agriculture, D4 → Sports.
According to the gold standard, the label of D1 was Agriculture, but the classifier assigned it Sports. The gold standard label of D2 was Agriculture, and the classifier also assigned it Agriculture.
On the other hand, D3 and D4 actually belong to the Sports class, but the classifier labelled D3 Agriculture, which is wrong, and D4 Sports, which is right. This was just a hypothetical corpus; if a corpus has thousands of documents, evaluating performance this way is impractical, so at the experimental level we use precision, recall, and F-measure. Let's build the confusion matrix for these hypothetical predictions, taking Sports as the positive class.
- The actual class of D1 was Agriculture and the classifier predicted Sports, so this counts as an fp.
- The actual class of D2 was Agriculture and the classifier also labelled it Agriculture; since Agriculture is the negative class here, this counts as a tn.
- Document D3 belongs to Sports but the classifier labelled it Agriculture, so this counts as an fn.
- Document D4 actually belongs to Sports and the classifier also labelled it Sports, which counts as a tp.
Now we calculate the accuracy of the classifier using the above confusion matrix:
Accuracy = (tp + tn) / (tp + fp + fn + tn) = 2/4 = 0.5 (6)
Accuracy is a good measure when the target classes in the data are nearly balanced. For example, suppose 60% of the fruit descriptions in a dataset are apples and 40% are oranges; if a model predicts whether a new description is an apple or an orange correctly 97% of the time, accuracy is a perfectly good measure.
However, accuracy should never be used when one class forms the overwhelming majority. Consider a cancer detection example with 100 people, of whom only 5 have cancer. Suppose our model is very bad and predicts every case as no cancer. In doing so it classifies the 95 non-cancer patients correctly and the 5 cancer patients as non-cancerous. Even though the model is terrible at predicting cancer, its accuracy is 95%.
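A quick numeric check of the cancer example, using sklearn's accuracy_score on invented labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1] * 5 + [0] * 95   # 1 = cancer, 0 = no cancer: 5 of 100 sick
y_pred = [0] * 100            # degenerate model: "no cancer" for everyone

# The useless model still scores 95% accuracy on this imbalanced data.
print(accuracy_score(y_true, y_pred))  # 0.95
```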
We calculate the precision of the classifier using the above confusion matrix:
Precision = tp / (tp + fp) = 1/2 = 0.5 (7)
Precision addresses this shortcoming of accuracy. Consider the same cancer example with 100 people, only 5 of whom have cancer, and suppose our model is very bad and predicts every case as cancer. Since we predict everyone as having cancer, the denominator (true positives plus false positives) is 100, while the numerator, the people who have cancer and whose case the model predicted as cancer, is 5. So the precision of such a model is only 5%.
Then we calculate the recall of the classifier using the above confusion matrix:
Recall = tp / (tp + fn) = 1/2 = 0.5 (8)
Finally, we calculate the F1 measure of the classifier using the above confusion matrix:
F1 = 2 · (precision · recall) / (precision + recall) = 0.5/1 = 0.5 (9)
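The hand calculations above can be verified with sklearn.metrics, feeding in the gold labels and the hypothetical predictions for D1–D4 with Sports as the positive class:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Gold standard and predicted labels for D1, D2, D3, D4.
y_true = ["Agriculture", "Agriculture", "Sports", "Sports"]
y_pred = ["Sports", "Agriculture", "Agriculture", "Sports"]

# With Sports as positive class: tp=1, fp=1, fn=1, tn=1.
print(accuracy_score(y_true, y_pred))                       # 0.5
print(precision_score(y_true, y_pred, pos_label="Sports"))  # 0.5
print(recall_score(y_true, y_pred, pos_label="Sports"))     # 0.5
print(f1_score(y_true, y_pred, pos_label="Sports"))         # 0.5
```

All four values agree with equations (6)–(9).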