Plotly & Pandas: Movie Data Visual Analysis

This project was developed using ".ipynb" files that were written in Portuguese, therefore, its repository contains the files in this same language.

To access the project repository, with all it's files in Portuguese, click HERE

The ".ipynb" used to develop the project is called "plotly_pandas_movies_visual_analysis.ipynb". You can also click onto the Google Colaboratoy Link for project visualization

This page will present, in English, the project's goal, the process used to develop it and the results.

Project Goal

The project has the goal to apply a descriptive analysis of a "Movie Lens Data"
The data was collected through "Group Lens" and its "movielens" "ml-100k/" files.
The files used during the project were: u.data; u.item and u.user
This Project was developed during the Post-graduation in Deep Learning course and as Question and Answer format. Therefore, the project was organized in order to answer the questions that were made, using the PLOTLY and PANDAS python libraries.
Finally, this project main goal was to develop knowledge in PANDAS and PLOTLY libraries

Project Development

The technical development of this project was divided into data load, followed by it's 2 phases (2 main questions):
- Presentation in visual format (using Plotly) of Minessota's data files
- DataFrame manipulation using Pandas

Data Load

For the project, the only data necessary were the three mentioned before in the "Project Goal" section
The data contains user rating related to a specific type of movie within the three .csv files. The analysis made are related to the specification of each file and integration between them.
In each description, there is an example of the data

Data Description:

u.data: contains the relation between user, movie and the rate given, such as: user_id; movie_id; rating; timestamp
user_id movie_id rating timestamp
22 377 1 878887116
186 302 3 891717742

user_id	movie_id	rating	timestamp
22	377	1	878887116
186	302	3	891717742

u.item: contains movie information, such as: movie_id; movie_title; release_date; video_release_date; IMDb_URL; movie's genre

movie_id	movie_title	release_date	video_release_date	IMDb_URL	unknown	Action	...	Thriller	War	Western
1	Toy Story (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?Toy%20Story%2...	0	0	...	0	0	0
2	GoldenEye (1995)	01-Jan-1995	NaN	http://us.imdb.com/M/title-exact?GoldenEye%20(...	0	1	...	1	0	0

u.user: contains the user information, such as: user_id; age; gender; occupation; zip_code
user_id age gender occupation zip_code
1 24 M technician 85711
2 53 F other 94043

To better analyse the data, in further steps, the different .csv files will be joined by the user or movie ID

user_id	age	gender	occupation	zip_code
1	24	M	technician	85711
2	53	F	other	94043

Presentation in visual format (using Plotly) of Minessota's data files

Visual representation of every categorical information

Each one of the three separated data (users, data and item) has different categorical items, therefore, it analysis was made separating one another

u.user (Users):
- The two categorical data found were: gender and occupation
- The image below shows the visual information regarding those items

u.data (Rating per user):
- The only categorical data found important to present a graph at this moment was: rating
- The users and movies ids are also categorical data. It would be possible to either group them by movie or by users. This would allow an observation such as: Chose a single user and see the movies they ranked. This analysis was not done in the moment.
- The image below shows the visual information regarding the rating

u.item (Movies Information):
- For this data, the categorical data identified worth demonstration was movie gender
- A single movie can contain one or more genders
- The image below shows the visual information regarding the gender

Value Information regarding numerical data

In this phase, it was obtained the numerical value for each numerical data.
Those values were: maximum, minimum, mean, median, and STD
The image below shows the visual information regarding those values for: age (user information); rating (data information)

Creation of time graphs related to the data

In this phase, two graphs were created with the goal to analyse data through time
The first graph represents the release of movies, starting in 1922 and ending in 1998
The second graph represents the release of movies by genre, starting in 1922 and ending in 1998
From the second graph it is possible to visualize a huge increase in the Drama section followed by Comedy, compared to the others, especially after 1992

Rating Analysis (by user gender and movie genre)

In this phase, two more graphs were created in order to better understand the data
The first graph analyzed the rating by different user's gender
The second graph analyzed the rating by different movie's genre
The goal was to analyse if the rating was influenced by gender or movie genre
In conclusion, the average rating given by male and female was very similar with women giving more 1 and 5 scores, and man the main responsible for the other ones.
Regarding movie genre, the rating follows a pattern related to the total amount of each genre that exists. Even so, some aspects are interesting: There are more Drama then Comedy movies, but there are more Comedy movies with 1 and 2 scores, comparing to Drama.

DataFrame manipulation using Pandas

This final phase focused on manipulate the DataFrames by working with pandas elements
The first manipulation made was the generation of a new DataFrame with only "Male" and "Student" users that are over 20
- A function was applied into the USER dataset:
  student_users = users[(users.age > 20)&(users.occupation == "student")&(users.gender == "M")]
- The table below presents a fraction of the new DataFrame that was generated
  user_id age gender occupation zip_code
  9 29 M student 01002
  33 23 M student 27510
  73 24 M student 41850
The second manipulation made was to discover the amount of female programmers
- A function was applied into the USER dataset:
  programmer_F_users = users[(users.occupation == "programmer")&(users.gender == "F")]
- It was found a total of 6 users that fit the criteria
The last manipulation made was to discover the amount of rating over 3
- A function was applied into the DATA dataset:
  over3_data = data[data.rating > 3]
- Results per rating (4 and 5): Rating 4 with 34.174 units; Rating 5 with 21.201 units
- It was found a total of 55.375 ratings over 3

user_id	age	gender	occupation	zip_code
9	29	M	student	01002
33	23	M	student	27510
73	24	M	student	41850

Share on

X (formerly Twitter) Facebook LinkedIn

Maria Eduarda E. Neves