NLP: Product Review Analysis
This project was developed in ".ipynb" notebooks written in Portuguese; therefore, its repository contains the files in that language.
To access the project repository, with all its files in Portuguese, click HERE
The ".ipynb" file used to develop the project is called "projeto_NLP_meduarda_final.ipynb". You can also click on the Google Colaboratory link to view the project.
This page presents, in English, the project's goal, the development process, and the results.
Project Goal
- The goal of the project is to use NLP to analyze product reviews, relating the text of each review to its rating.
- The product chosen was a "Tablet Mirage 7 Pol 64gb 4gb ram Quad core wi-fi cor Preto" from Mercado Livre
- The reviews are in Portuguese
- The data was obtained through the Mercado Livre API. Product ID: MLB4241052396
Project Development
- The technical development of this project was divided into three phases:
- Data Load
- Text Processing
- Model Development
Data Load
- The only data needed for the project was the set of reviews collected from the Mercado Livre API.
- The data was then converted to ".csv" format
- Data Description:
- The ".csv" spreadsheet contains 1255 rows and two columns.
- The column named "content" contains the reviewer's opinion about the product in text format
- The column named "rate" contains the rating the reviewer gave to the product (ranging from 1 to 5)
- After that, it is possible to analyze the data and count how many reviews each rating received. The table below presents these counts; the data is not uniformly distributed across ratings, and the extreme values (1 and 5) are the most frequent. A data-load sketch follows the table.
Rate | Count |
---|---|
1 | 234 |
2 | 85 |
3 | 160 |
4 | 181 |
5 | 595 |
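
A minimal sketch of this data-load step is shown below, assuming Python with `requests` and `pandas`. The reviews endpoint URL, its pagination parameters, and the output file name are illustrative assumptions, not the notebook's exact code; only the product ID and the column names come from the description above.

```python
# Sketch: fetch reviews for the product and save them as the CSV described above.
import pandas as pd
import requests

ITEM_ID = "MLB4241052396"
URL = f"https://api.mercadolibre.com/reviews/item/{ITEM_ID}"  # assumed endpoint

reviews = []
for offset in range(0, 1500, 50):          # the dataset has ~1255 reviews
    resp = requests.get(URL, params={"limit": 50, "offset": offset})
    resp.raise_for_status()
    batch = resp.json().get("reviews", [])
    if not batch:
        break
    reviews.extend({"content": r["content"], "rate": r["rate"]} for r in batch)

df = pd.DataFrame(reviews)
df.to_csv("reviews_tablet_mirage.csv", index=False)   # hypothetical file name

# Rating distribution (the counts shown in the table above)
print(df["rate"].value_counts().sort_index())
```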
Text Processing
- This phase normalizes the reviews, tokenizes them, and decides whether or not to apply lemmatization to the words.
- Text Normalization:
- The reviews are in Portuguese; therefore, the text was normalized using spaCy's Portuguese package (!python -m spacy download pt_core_news_sm)
- Uppercase letters were converted to lowercase
- Special characters (such as punctuation) were removed
- Stopwords ('a', 'o', 'de', 'e') were removed; the ones removed were among the most frequent elements
- Words Tokenization:
- Tokenization is the process in which each word of a phrase becomes a single element (a token)
- The NLTK tokenizer was used in this phase
- Words Lemmatization:
- Lemmatization is the process of reducing each word to its base form
- The reviews were tested with and without lemmatization, and it was decided not to apply it
- After this process, the longest token sequence contained 108 elements.
- The whole dataframe contained a total of 14552 tokens.
- The vocabulary of the whole dataframe contained 1832 unique elements.
- The table below shows some examples of the reviews before and after text processing; a processing sketch follows the table
Rate | Review (Before Text Processing) | Review (After Text Processing) |
---|---|---|
4 | Comprei pro meu filho de 5 anos. Baixei netfli... | comprei pro meu filho anos baixei netflix uns ... |
5 | Gostei muito, fácil de manusear, serve para o ... | gostei muito fácil manusear serve para meu dia... |
3 | O produto é bom,,, o problema é a tela q deixa... | produto é bom problema é tela q deixa desejar |
1 | Fica desligado toda hora sozinho não gostei. | fica desligado toda hora sozinho não gostei |
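
The sketch below illustrates the pipeline described above (lowercasing, punctuation removal, stopword removal, NLTK tokenization). It is an approximation rather than the original notebook code: the stopword set is limited to the examples cited above, and the notebook itself relies on spaCy's Portuguese resources for normalization.

```python
# Sketch of the text-processing step for the Portuguese reviews.
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

STOPWORDS = {"a", "o", "de", "e"}   # frequent stopwords cited above (illustrative subset)

def preprocess(review: str) -> list[str]:
    text = review.lower()                                 # uppercase -> lowercase
    text = re.sub(r"[^\w\s]", " ", text)                  # drop punctuation / special characters
    tokens = word_tokenize(text, language="portuguese")   # NLTK tokenizer
    return [t for t in tokens if t not in STOPWORDS]      # remove stopwords

print(preprocess("O produto é bom,,, o problema é a tela q deixa a desejar"))
# roughly: ['produto', 'é', 'bom', 'problema', 'é', 'tela', 'q', 'deixa', 'desejar']
```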
Model Development
- Three models were tested for analyzing the reviews:
- SVM + BOW
- SVM + EMBEDDINGS
- BERT
- SVM stands for "Support Vector Machine".
SVM + BOW
- This model relies on the "weight" of each element, i.e., how many times it appears in a sentence
- It does not keep track of element order
- The number of columns in the feature matrix is equal to the vocabulary size
- Model chosen and its parameters (found with GridSearch):
- SVC: Support Vector Classifier
- Parameters -- C: 100; Gamma: 0.1; Kernel: RBF
- The test set represented 20% of the data (a minimal sketch follows this list)
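
Below is a sketch of this setup. It assumes scikit-learn's CountVectorizer for the bag-of-words features (the exact vectorizer is not stated above) and a simple stratified train/test split; `df` is the preprocessed reviews dataframe from the earlier steps.

```python
# Sketch: bag-of-words features + SVC with the GridSearch parameters listed above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

vectorizer = CountVectorizer()                 # one column per vocabulary entry
X = vectorizer.fit_transform(df["content"])
y = df["rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y   # 20% test set (split details assumed)
)

model = SVC(C=100, gamma=0.1, kernel="rbf")
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```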
SVM + EMBEDDINGS
- In this model, each word is represented by a dense vector (for this project, the size chosen was 75)
- The "Word2Vec" Python implementation was used to generate the vectors
- The sentences were standardized with zero padding so that all of them matched the length of the longest review (108 words)
- Model chosen and its parameters (Found with GridSearch's help):
- SVC: Support Vector Classifier
- Parameters -- C: 100; Gamma: scale; Kernel: Linear
- The test set represented 20% of the data (a minimal sketch follows this list)
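
The sketch below assumes Gensim's Word2Vec implementation (the text only specifies "Word2Vec") and that each zero-padded review is flattened into a single feature vector for the SVC; both choices are assumptions about details not stated here. `tokenized_reviews` is the list of token lists produced by the text-processing step.

```python
# Sketch: Word2Vec embeddings (size 75), zero padding to 108 tokens, then a linear SVC.
import numpy as np
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

MAX_LEN, VEC_SIZE = 108, 75

w2v = Word2Vec(sentences=tokenized_reviews, vector_size=VEC_SIZE, min_count=1)

def review_to_features(tokens):
    mat = np.zeros((MAX_LEN, VEC_SIZE))            # zero padding up to the longest review
    for i, tok in enumerate(tokens[:MAX_LEN]):
        if tok in w2v.wv:
            mat[i] = w2v.wv[tok]                   # dense vector for each word
    return mat.flatten()

X = np.vstack([review_to_features(t) for t in tokenized_reviews])
y = df["rate"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

model = SVC(C=100, gamma="scale", kernel="linear")  # parameters found with GridSearch
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```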
BERT
- BERT keeps track of a sentence's bidirectional context: it aims to predict a word based on the words that come before and after it
- The data was divided into three sets: training (0.70), validation (0.20), and test (0.10)
- The model was imported through the Transformers library
- BertTokenizer: tokenizes the words
- A pre-trained model was used: BertForSequenceClassification.from_pretrained('neuralmind/bert-base-portuguese-cased')
- A YouTube video tutorial was used as a reference for applying this model (a minimal sketch follows this list).
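
The sketch below shows only the tokenizer and model setup with the pre-trained checkpoint cited above, plus a single forward pass on a small batch. The batch size, maximum sequence length, and label shift are illustrative assumptions, and the full fine-tuning loop from the notebook is omitted.

```python
# Sketch: Portuguese BERT for 5-class rating classification via Transformers.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5)

# Small illustrative batch; `df` is the reviews dataframe from the earlier steps
texts = df["content"].tolist()[:8]
labels = torch.tensor(df["rate"].values[:8] - 1)   # shift ratings 1-5 to class indices 0-4

encodings = tokenizer(texts, truncation=True, padding=True, max_length=128,
                      return_tensors="pt")

# One forward pass with the loss, as would happen inside a training loop
outputs = model(**encodings, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)
```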
Results
Comparison between the three models
- The table below compares the results obtained from each model; a sketch of how the metrics can be computed follows the table
Model | ACC | F1-MACRO | F1-AVG |
---|---|---|---|
SVM + BOW | 0.709163 | 0.589962 | 0.682820 |
SVM + EMBEDDINGS | 0.657371 | 0.551792 | 0.653619 |
BERT | 0.666667 | 0.402572 | 0.614899 |
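
The metrics can be computed with scikit-learn as sketched below; treating "F1-AVG" as the weighted-average F1 score is an assumption about the notebook's choice.

```python
# Sketch: compute the comparison metrics from a model's test-set predictions.
from sklearn.metrics import accuracy_score, f1_score

def report(y_true, y_pred, name):
    print(name,
          "ACC:", round(accuracy_score(y_true, y_pred), 6),
          "F1-MACRO:", round(f1_score(y_true, y_pred, average="macro"), 6),
          "F1-AVG:", round(f1_score(y_true, y_pred, average="weighted"), 6))

# Example usage with the SVM + BOW sketch above:
# report(y_test, model.predict(X_test), "SVM + BOW")
```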
Conclusion
- Because the classes do not contain the same number of reviews, all three classifiers had difficulty analyzing the reviews. BERT could not classify rating-2 reviews at all, due to the small number of samples.
- A more balanced dataset, with a larger number of samples, could yield better results
- All classifiers were more accurate at identifying the extreme ratings
- This could be due to the larger number of examples or to the strong wording present in those reviews
- Finally, the classifier that presented the best results was SVM + BOW