Data Representation: Text Transformation to Matrix
This project was developed in ".ipynb" notebooks written in Portuguese, so the files in its repository are in that language.
To access the project repository, with all its files in Portuguese, click HERE
The notebook used to develop the project is called "data_representation_text_matrix.ipynb". You can also follow the Google Colaboratory link to view the project.
This page presents, in English, the project's goal, the process used to develop it, and the results.
Project Goal
- The goal of the project is to build a numerical matrix from a .txt text file
- Once the matrix is built, the project also converts it back into the original text
Project Development
- The technical development of this project was divided into three phases:
- Data Load
- Text to matrix conversion
- Matrix to text conversion
Data Load
- The .txt text was provided during the postgraduate course and can be found in the project repository.
- Its opening sentences are shown below:
"De motu Circulari Fluidorum.
Hypothesis.
Resistentiam, quæ oritur ex defectu lubricitatis partium Fluidi, cæteris paribus, proportionalem esse velocitati, qua partes Fluidi separantur ab invicem."
Text to matrix conversion
- The idea is to convert the text into a matrix whose number of rows and columns depends on the structure of the text.
- Each line in the text is a row in the matrix
- The number of columns is set by the line with the most words.
- Lines with fewer words were padded with the text "None", as many times as necessary, until they had the same number of words as the longest line.
- The result was a text matrix, still made of words rather than numbers
- Each word in the text's vocabulary received a numerical value, creating a "conversion matrix"
- With the conversion matrix and the text matrix, it was possible to create a new numerical matrix that represents the text
- Below is a representation of the matrix, along with some information about the numbers used to convert the text and the words they map to:
- Total words found in the text: 3303
- Total vocabulary found in the text: 1322
- Example of the text-to-number conversion:
[..., (3, 'quiescentibus,'), (4, 'motus'), ...,(297, 'Circulari'),..., (434, 'motu'),..., (1125, 'De'),...]
- Representation of the numerical matrix created:
[[1125 434 297 ... 1322 1322 1322]
[1169 1322 1322 ... 1322 1322 1322]
[ 861 1322 1322 ... 1322 1322 1322]
...
[1040 1322 1322 ... 1322 1322 1322]
[ 25 763 231 ... 1187 541 1171]
[ 1 1322 1322 ... 1322 1322 1322]]
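The text-to-matrix steps described above can be sketched in Python as follows. This is a minimal illustration, not the notebook's actual code: the function name text_to_matrix, the split on whitespace, and the alphabetical numbering starting at 1 are all assumptions, so the numbers it produces will not match the ones listed above.

```python
PAD = "None"  # padding token named in the description above


def text_to_matrix(text):
    """Return (numerical matrix, word-to-number mapping) for a text.

    Hypothetical helper: the numbering scheme is an assumption.
    """
    # One row per line of text; words are separated by whitespace.
    rows = [line.split() for line in text.splitlines()]
    # The longest line sets the number of columns.
    width = max(len(row) for row in rows)
    # Pad shorter rows with "None" until every row has `width` words.
    padded = [row + [PAD] * (width - len(row)) for row in rows]
    # Give each distinct word a number (assumed: sorted order, from 1).
    vocab = sorted({word for row in rows for word in row})
    word_to_id = {word: i for i, word in enumerate(vocab, start=1)}
    # The padding token gets the next free number unless already present.
    word_to_id.setdefault(PAD, len(word_to_id) + 1)
    # Replace every word in the padded text matrix with its number.
    matrix = [[word_to_id[word] for word in row] for row in padded]
    return matrix, word_to_id


sample = "De motu Circulari Fluidorum.\nHypothesis."
matrix, word_to_id = text_to_matrix(sample)
print(matrix)  # [[2, 5, 1, 3], [4, 6, 6, 6]]
```

The sketch returns a plain list of lists to stay dependency-free; wrapping the result in np.array(matrix) would give a NumPy array similar to the printout above.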
Matrix to text conversion
- Converting the numerical matrix back into the original text reverses the process described in the previous section
- The "text-number converter" is used to turn the numerical values in the matrix back into words
- A text matrix is then obtained, and the elements in each row are joined with spaces between them
- Since each row in this matrix is now a line of the text, the final conversion joins all the lines together.
- A fraction of the final result is presented below. It matches the text shown at the beginning of this page.
"De motu Circulari Fluidorum.
Hypothesis.
Resistentiam, quæ oritur ex defectu lubricitatis partium Fluidi, cæteris paribus, proportionalem esse velocitati, qua partes Fluidi separantur ab invicem."
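The reverse conversion can be sketched in the same way. Again this is only an illustration under stated assumptions: matrix_to_text is a hypothetical name, the small mapping and matrix are made-up sample data, and dropping the trailing "None" padding is assumed here because the recovered text would otherwise not match the original.

```python
PAD = "None"  # padding token used when the matrix was built


def matrix_to_text(matrix, word_to_id):
    """Rebuild the text from a numerical matrix (hypothetical helper)."""
    # Invert the mapping so numbers can be turned back into words.
    id_to_word = {number: word for word, number in word_to_id.items()}
    lines = []
    for row in matrix:
        words = [id_to_word[number] for number in row]
        # Assumed step: strip the padding at the end of the row.
        # A line that truly ended in the word "None" would lose it.
        while words and words[-1] == PAD:
            words.pop()
        # Join the words of one row with spaces to form a line.
        lines.append(" ".join(words))
    # Join all lines to recover the full text.
    return "\n".join(lines)


# Made-up sample data for illustration only.
word_to_id = {"Circulari": 1, "De": 2, "Fluidorum.": 3,
              "Hypothesis.": 4, "motu": 5, "None": 6}
matrix = [[2, 5, 1, 3], [4, 6, 6, 6]]
text = matrix_to_text(matrix, word_to_id)
print(text)
```

Note that this round trip restores words and line breaks but not the original spacing: runs of several spaces or tabs would come back as single spaces.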