Reddit Comment Classification [Kaggle]


This was a competition hosted on Kaggle and served as a mini-project for the COMP 551: Applied Machine Learning course.

We analyze text from the website Reddit and develop a multi-class classification model to predict which subreddit (group) a queried comment came from. Reddit is an online forum where people discuss topics ranging from sports to cartoons, technology, and video games. The dataset is a list of comments drawn from 20 different subreddits (groups/topics). The problem is closely related to sentiment analysis, a well-known task in the Natural Language Processing (NLP) literature: sentiment analysis is a computational approach to identifying opinion, sentiment, and subjectivity in text, and the same text-classification machinery applies to predicting a comment's subreddit.

For this dataset, we implemented a Bernoulli Naive Bayes classifier and trained and tested it on the data. We also evaluated several models for improving classification accuracy, including Support Vector Machines, Logistic Regression, k-Nearest Neighbours, the ensemble method of stacking, and the deep learning model ULMFiT (J. Howard and S. Ruder, 2018). We also experimented with the FlairNLP library, concatenating combinations of embeddings such as FlairEmbeddings + BERT to obtain text features for classification.
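As a rough illustration of the Bernoulli Naive Bayes baseline, a minimal sketch using scikit-learn's `BernoulliNB` is shown below (the write-up does not specify our exact implementation; the file name and column names here are assumptions for illustration):

```python
# Minimal sketch of the Bernoulli Naive Bayes baseline.
# Assumptions: a CSV with 'comments' and 'subreddits' columns
# (hypothetical names), and scikit-learn's BernoulliNB as a
# stand-in for the classifier described in the report.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

df = pd.read_csv("reddit_train.csv")  # hypothetical file name
X_train, X_val, y_train, y_val = train_test_split(
    df["comments"], df["subreddits"], test_size=0.2, random_state=42
)

# Bernoulli NB models binary word presence/absence, so binarize counts.
vectorizer = CountVectorizer(binary=True, stop_words="english")
X_train_bin = vectorizer.fit_transform(X_train)
X_val_bin = vectorizer.transform(X_val)

nb = BernoulliNB(alpha=1.0)  # Laplace smoothing
nb.fit(X_train_bin, y_train)
print("Validation accuracy:", nb.score(X_val_bin, y_val))
```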

We compare the accuracy of these models under different feature extraction methods, namely Term Frequency–Inverse Document Frequency (TF-IDF) and binary and non-binary count vectorizers. We also analyze the performance gain or loss after applying dimensionality reduction to the dataset; in particular, we explore Latent Semantic Analysis (LSA), a method inspired by Principal Component Analysis (PCA).
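For concreteness, here is a sketch of the three feature extractors and the LSA step, which is commonly implemented as a truncated SVD of the TF-IDF matrix. The toy corpus and `n_components` value are illustrative only, not our tuned settings:

```python
# Sketch of the feature-extraction variants and the LSA reduction.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "habs win the game in overtime",
    "new anime episode drops today",
    "gpu prices keep climbing this year",
    "trade deadline rumours everywhere",
    "that plot twist was completely unreal",
]

count_vec = CountVectorizer()              # non-binary: raw term counts
binary_vec = CountVectorizer(binary=True)  # binary: 0/1 term presence
tfidf_vec = TfidfVectorizer()              # TF-IDF weighting

X_counts = count_vec.fit_transform(texts)
X_binary = binary_vec.fit_transform(texts)

# LSA: truncated SVD applied to the sparse TF-IDF matrix.
# n_components=2 is a toy value sized to this tiny corpus.
lsa = make_pipeline(tfidf_vec, TruncatedSVD(n_components=2, random_state=42))
X_lsa = lsa.fit_transform(texts)
print(X_counts.shape, X_binary.shape, X_lsa.shape)
```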

We observed that the best results were obtained by stacking various combinations of the models described above. For the final submission, we used an ensemble classifier with ’soft’ voting by Stacking SVM, Naive Bayes and Logistic Regression at their optimum parameter settings.which gave an accuracy of 57.97% on our validation data and 58.011% on kaggle public leaderboard. Adding ULMFit to the stack and using a logistic regression on top as meta classifier further bolstered the accuracy to 60.1%. We finished 10th and 8th out of 105 teams(Group 60) on the public and the private leaderboards of the competition respectively.
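A minimal sketch of a soft-voting ensemble over these three base models is shown below, using scikit-learn's `VotingClassifier`. The hyperparameters are placeholders rather than our tuned settings, and `train_texts`/`train_labels` are hypothetical names:

```python
# Sketch of the final ensemble: soft voting averages the predicted
# class probabilities of the SVM, Naive Bayes, and logistic
# regression base models. Placeholder hyperparameters only.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        # probability=True lets the SVM contribute class probabilities
        ("svm", SVC(kernel="linear", probability=True)),
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average probabilities instead of majority vote
)
model = make_pipeline(TfidfVectorizer(), ensemble)
# model.fit(train_texts, train_labels)   # hypothetical data
# predictions = model.predict(test_texts)
```

The variant with a logistic-regression meta-classifier corresponds to replacing `VotingClassifier` with scikit-learn's `StackingClassifier(final_estimator=LogisticRegression())`; folding ULMFiT's predictions into the stack would require exporting its probabilities as extra features, which goes beyond a pure scikit-learn pipeline.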

comments powered by Disqus

Related