NLP: Product Review Analysis
This project was developed in ".ipynb" notebooks written in Portuguese; therefore, its repository contains the files in that language.
To access the project repository, with all its files in Portuguese, click HERE
The ".ipynb" file used to develop the project is called "projeto_NLP_meduarda_final.ipynb". You can also click on the Google Colaboratory link to view the project.
This page presents, in English, the project's goal, the development process, and the results.
Project Goal
- The goal of the project is to use NLP to analyze product reviews, relating the text of each review to its rating.
- The product chosen was a "Tablet Mirage 7 Pol 64gb 4gb ram Quad core wi-fi cor Preto" from Mercado Livre
- The reviews are in Portuguese
- The data was obtained through the Mercado Livre API. Product ID: MLB4241052396
Project Development
- The technical development of this project was divided into three phases:
- Data Load
- Text Processing
- Model Development
Data Load
- The only data needed for the project was the set of reviews collected from the Mercado Livre API.
- The data was then converted to ".csv" format
- Data Description:
- The ".csv" spreadsheet contains 1255 rows and two columns.
- The column named "content" contains the reviewer's opinion about the product in text format
- The column named "rate" contains the rating the reviewer gave to the product (ranging from 1 to 5)
- After that, it is possible to analyze the data and count how many reviews each rating received. The table below presents these counts; the data is not uniformly distributed across ratings, and the extreme values (1 and 5) are the most frequent. A data-load sketch follows the table.
Rate | Count |
---|---|
1 | 234 |
2 | 85 |
3 | 160 |
4 | 181 |
5 | 595 |
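
A minimal sketch of this data-load step is shown below, assuming Python with `requests` and `pandas`. The reviews endpoint URL, its pagination parameters, and the output file name are illustrative assumptions, not the notebook's exact code; only the product ID and the column names come from the description above.

```python
# Sketch: fetch reviews for the product and save them as the CSV described above.
import pandas as pd
import requests

ITEM_ID = "MLB4241052396"
URL = f"https://api.mercadolibre.com/reviews/item/{ITEM_ID}"  # assumed endpoint

reviews = []
for offset in range(0, 1500, 50):          # the dataset has ~1255 reviews
    resp = requests.get(URL, params={"limit": 50, "offset": offset})
    resp.raise_for_status()
    batch = resp.json().get("reviews", [])
    if not batch:
        break
    reviews.extend({"content": r["content"], "rate": r["rate"]} for r in batch)

df = pd.DataFrame(reviews)
df.to_csv("reviews_tablet_mirage.csv", index=False)   # hypothetical file name

# Rating distribution (the counts shown in the table above)
print(df["rate"].value_counts().sort_index())
```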
Text Processing
- This phase normalizes the reviews, tokenizes them, and decides whether or not to apply lemmatization to the words.
- Text Normalization:
- The reviews are in Portuguese; therefore, the text was normalized using spaCy's Portuguese package (!python -m spacy download pt_core_news_sm)
- Uppercase letters were converted to lowercase
- Special characters (such as punctuation) were removed
- Stopwords ('a', 'o', 'de', 'e') were removed; the ones removed were among the most frequent elements
- Words Tokenization:
- Tokenization is the process in which each word of a phrase becomes a single element (a token)
- The NLTK tokenizer was used in this phase
- Words Lemmatization:
- Lemmatization is the process of reducing each word to its base form
- The reviews were tested with and without lemmatization, and it was decided not to apply it
- After this process, the longest token sequence contained 108 elements.
- The whole dataframe contained a total of 14552 tokens.
- The vocabulary of the whole dataframe contained 1832 unique elements.
- The table below shows some examples of the reviews before and after text processing; a processing sketch follows the table
Rate | Review (Before Text Processing) | Review (After Text Processing) |
---|---|---|
4 | Comprei pro meu filho de 5 anos. Baixei netfli... | comprei pro meu filho anos baixei netflix uns ... |
5 | Gostei muito, fácil de manusear, serve para o ... | gostei muito fácil manusear serve para meu dia... |
3 | O produto é bom,,, o problema é a tela q deixa... | produto é bom problema é tela q deixa desejar |
1 | Fica desligado toda hora sozinho não gostei. | fica desligado toda hora sozinho não gostei |
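
The sketch below illustrates the pipeline described above (lowercasing, punctuation removal, stopword removal, NLTK tokenization). It is an approximation rather than the original notebook code: the stopword set is limited to the examples cited above, and the notebook itself relies on spaCy's Portuguese resources for normalization.

```python
# Sketch of the text-processing step for the Portuguese reviews.
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

STOPWORDS = {"a", "o", "de", "e"}   # frequent stopwords cited above (illustrative subset)

def preprocess(review: str) -> list[str]:
    text = review.lower()                                 # uppercase -> lowercase
    text = re.sub(r"[^\w\s]", " ", text)                  # drop punctuation / special characters
    tokens = word_tokenize(text, language="portuguese")   # NLTK tokenizer
    return [t for t in tokens if t not in STOPWORDS]      # remove stopwords

print(preprocess("O produto é bom,,, o problema é a tela q deixa a desejar"))
# roughly: ['produto', 'é', 'bom', 'problema', 'é', 'tela', 'q', 'deixa', 'desejar']
```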
Model Development
- Three models were tested for analyzing the reviews:
- SVM + BOW
- SVM + EMBEDDINGS
- BERT
- SVM stands for "Support Vector Machine".
SVM + BOW
- This model relies on the "weight" of each element, i.e., how many times it appears in a sentence
- It does not keep track of element order
- The number of columns in the feature matrix is equal to the vocabulary size
- Model chosen and its parameters (found with GridSearch):
- SVC: Support Vector Classifier
- Parameters -- C: 100; Gamma: 0.1; Kernel: RBF
- The test set represented 20% of the data (a minimal sketch follows this list)
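
Below is a sketch of this setup. It assumes scikit-learn's CountVectorizer for the bag-of-words features (the exact vectorizer is not stated above) and a simple stratified train/test split; `df` is the preprocessed reviews dataframe from the earlier steps.

```python
# Sketch: bag-of-words features + SVC with the GridSearch parameters listed above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

vectorizer = CountVectorizer()                 # one column per vocabulary entry
X = vectorizer.fit_transform(df["content"])
y = df["rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y   # 20% test set (split details assumed)
)

model = SVC(C=100, gamma=0.1, kernel="rbf")
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```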
SVM + EMBEDDINGS
- In this model, each word is represented by a dense vector (for this project, the size chosen was 75)
- The "Word2Vec" Python implementation was used to generate the vectors
- The sentences were standardized with zero padding so that all of them matched the length of the longest review (108 words)
- Model chosen and its parameters (Found with GridSearch's help):
- SVC: Support Vector Classifier
- Parameters -- C: 100; Gamma: scale; Kernel: Linear
- The test set represented 20% of the data (a minimal sketch follows this list)
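
The sketch below assumes Gensim's Word2Vec implementation (the text only specifies "Word2Vec") and that each zero-padded review is flattened into a single feature vector for the SVC; both choices are assumptions about details not stated here. `tokenized_reviews` is the list of token lists produced by the text-processing step.

```python
# Sketch: Word2Vec embeddings (size 75), zero padding to 108 tokens, then a linear SVC.
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

MAX_LEN, VEC_SIZE = 108, 75

w2v = Word2Vec(sentences=tokenized_reviews, vector_size=VEC_SIZE, min_count=1)

def review_to_features(tokens):
    mat = np.zeros((MAX_LEN, VEC_SIZE))            # zero padding up to the longest review
    for i, tok in enumerate(tokens[:MAX_LEN]):
        if tok in w2v.wv:
            mat[i] = w2v.wv[tok]                   # dense vector for each word
    return mat.flatten()

X = np.vstack([review_to_features(t) for t in tokenized_reviews])
y = df["rate"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

model = SVC(C=100, gamma="scale", kernel="linear")  # parameters found with GridSearch
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```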
BERT
- BERT keeps track of a sentence's bidirectional context: it aims to predict a word based on the words that come before and after it
- The data was divided into three sets: training (0.70), validation (0.20), and test (0.10)
- The model was imported through the Transformers library
- BertTokenizer: tokenizes the words
- A pre-trained model was used: BertForSequenceClassification.from_pretrained('neuralmind/bert-base-portuguese-cased')
- A YouTube video tutorial was used as a reference for applying this model (a minimal sketch follows this list).
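
The sketch below shows only the tokenizer and model setup with the pre-trained checkpoint cited above, plus a single forward pass on a small batch. The batch size, maximum sequence length, and label shift are illustrative assumptions, and the full fine-tuning loop from the notebook is omitted.

```python
# Sketch: Portuguese BERT for 5-class rating classification via Transformers.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5)

# Small illustrative batch; `df` is the reviews dataframe from the earlier steps
texts = df["content"].tolist()[:8]
labels = torch.tensor(df["rate"].values[:8] - 1)   # shift ratings 1-5 to class indices 0-4

encodings = tokenizer(texts, truncation=True, padding=True, max_length=128,
                      return_tensors="pt")

# One forward pass with the loss, as would happen inside a training loop
outputs = model(**encodings, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)
```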
Results
Comparison between the three models
- The table below compares the results obtained from each model; a sketch of how the metrics can be computed follows the table
Model | ACC | F1-MACRO | F1-AVG |
---|---|---|---|
SVM + BOW | 0.709163 | 0.589962 | 0.682820 |
SVM + EMBEDDINGS | 0.657371 | 0.551792 | 0.653619 |
BERT | 0.666667 | 0.402572 | 0.614899 |
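
The metrics can be computed with scikit-learn as sketched below; treating "F1-AVG" as the weighted-average F1 score is an assumption about the notebook's choice.

```python
# Sketch: compute the comparison metrics from a model's test-set predictions.
from sklearn.metrics import accuracy_score, f1_score

def report(y_true, y_pred, name):
    print(name,
          "ACC:", round(accuracy_score(y_true, y_pred), 6),
          "F1-MACRO:", round(f1_score(y_true, y_pred, average="macro"), 6),
          "F1-AVG:", round(f1_score(y_true, y_pred, average="weighted"), 6))

# Example usage with the SVM + BOW sketch above:
# report(y_test, model.predict(X_test), "SVM + BOW")
```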
Conclusion
- Because the classes do not contain the same number of reviews, all three classifiers had difficulty analyzing the reviews. BERT could not classify rating-2 reviews at all, due to the small number of samples.
- A more balanced dataset, with a larger number of samples, could yield better results
- All classifiers were more accurate at identifying the extreme ratings
- This could be due to the larger number of examples or to the strong wording present in those reviews
- Finally, the classifier that presented the best results was SVM + BOW