Plotly & Pandas: Movie Data Visual Analysis

This project was developed in ".ipynb" notebooks written in Portuguese, so the files in its repository are in that language.

To access the project repository, with all its files in Portuguese, click HERE

The notebook used to develop the project is called "plotly_pandas_movies_visual_analysis.ipynb". You can also open the Google Colaboratory link to view the project.

This page will present, in English, the project's goal, the process used to develop it and the results.

Project Goal

  • The goal of the project is to perform a descriptive analysis of the "MovieLens" data
  • The data was obtained from "GroupLens", specifically its "movielens" "ml-100k/" files.
  • The files used during the project were: u.data, u.item and u.user
  • This project was developed during the Post-graduation in Deep Learning course, in a question-and-answer format. It is therefore organized around answering the questions that were posed, using the Plotly and Pandas Python libraries.
  • Finally, the main goal of this project was to build knowledge of the Pandas and Plotly libraries

Project Development

  • The technical development of this project was divided into the data load, followed by its 2 phases (2 main questions):
    • Presentation in visual format (using Plotly) of Minnesota's data files
    • DataFrame manipulation using Pandas

Data Load

  • For the project, the only data needed were the three files mentioned in the "Project Goal" section
  • The three files contain the ratings that users gave to movies, along with information about each user and each movie. The analyses relate to the contents of each file and to the integration between them.
  • Each description below includes an example of the data
  • Data Description:
    • u.data: contains the relation between user, movie and the rating given, with the columns: user_id; movie_id; rating; timestamp
      user_id  movie_id  rating  timestamp
      22       377       1       878887116
      186      302       3       891717742
    • u.item: contains the movie information, with the columns: movie_id; movie_title; release_date; video_release_date; IMDb_URL; and one 0/1 column per movie genre
      movie_id  movie_title       release_date  video_release_date  IMDb_URL                                           unknown  Action  ...  Thriller  War  Western
      1         Toy Story (1995)  01-Jan-1995   NaN                 http://us.imdb.com/M/title-exact?Toy%20Story%2...  0        0       ...  0         0    0
      2         GoldenEye (1995)  01-Jan-1995   NaN                 http://us.imdb.com/M/title-exact?GoldenEye%20(...  0        1       ...  1         0    0
    • u.user: contains the user information, with the columns: user_id; age; gender; occupation; zip_code
      user_id  age  gender  occupation  zip_code
      1        24   M       technician  85711
      2        53   F       other       94043
  • To better analyse the data in later steps, the different files are joined by user ID or movie ID
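
The load and join described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the rows are made-up stand-ins for the real files, and the separators (tab for u.data, pipe for u.user and u.item) are an assumption about the ml-100k layout. The names U_DATA, U_USER, U_ITEM and ratings_full are hypothetical.

```python
import io
import pandas as pd

# Illustrative stand-ins for the ml-100k files (made-up rows, not real data).
# Assumption: u.data is tab-separated; u.user and u.item are pipe-separated.
U_DATA = "1\t10\t4\t881250949\n2\t10\t5\t891717742\n1\t20\t3\t878887116\n"
U_USER = "1|24|M|technician|85711\n2|53|F|other|94043\n"
U_ITEM = "10|Movie A (1995)|01-Jan-1995\n20|Movie B (1996)|01-Jan-1996\n"

# The files carry no header row, so column names are passed explicitly.
data = pd.read_csv(io.StringIO(U_DATA), sep="\t",
                   names=["user_id", "movie_id", "rating", "timestamp"])
users = pd.read_csv(io.StringIO(U_USER), sep="|",
                    names=["user_id", "age", "gender", "occupation", "zip_code"])
items = pd.read_csv(io.StringIO(U_ITEM), sep="|",
                    names=["movie_id", "movie_title", "release_date"])

# Join ratings with user details on user_id, then with movie details on movie_id.
ratings_full = data.merge(users, on="user_id").merge(items, on="movie_id")
print(ratings_full[["user_id", "movie_title", "rating", "gender"]])
```

With the real files, the io.StringIO(...) arguments would be replaced by the file paths; the real u.item also has more columns than this shortened stand-in.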

Presentation in visual format (using Plotly) of Minnesota's data files

Visual representation of every categorical information

  • Each of the three data sets (users, data and item) has different categorical fields, so each one was analysed separately
  • u.user (Users):
    • The two categorical fields found were: gender and occupation
    • The image below shows the visual information regarding those fields
[Image: Categorical USER]
  • u.data (Rating per user):
    • The only categorical field considered worth plotting at this stage was: rating
    • The user and movie IDs are also categorical data. They could be grouped either by movie or by user, which would allow observations such as choosing a single user and seeing the movies they rated. This analysis was not done at this point.
    • The image below shows the visual information regarding the rating
[Image: Categorical DATA]
  • u.item (Movies Information):
    • For this data, the categorical field considered worth showing was the movie genre
    • A single movie can belong to one or more genres
    • The image below shows the visual information regarding the genre
[Image: Categorical ITEM]

Value Information regarding numerical data

  • In this phase, summary statistics were computed for each numerical field.
  • Those statistics were: maximum, minimum, mean, median and standard deviation (STD)
  • The image below shows those values for: age (user information) and rating (data information)
[Image: summary statistics for age and rating]
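
The five statistics listed above can be computed in a single pandas call. The ages below are made up for illustration, and age_stats is a hypothetical name; this is one possible way to do it, not necessarily the one used in the notebook.

```python
import pandas as pd

# Made-up ages for illustration only.
users = pd.DataFrame({"age": [24, 53, 23, 24, 33]})

# agg returns one value per statistic, indexed by the statistic's name.
# Note: pandas computes the sample standard deviation (ddof=1) by default.
age_stats = users["age"].agg(["max", "min", "mean", "median", "std"])
print(age_stats)
```

The same call works for the rating column of the ratings table.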

Creation of time graphs related to the data

  • In this phase, two graphs were created to analyse the data over time
  • The first graph shows movie releases over time, from 1922 to 1998
  • The second graph shows movie releases by genre over the same period, from 1922 to 1998
  • The second graph shows a large increase in Drama releases, followed by Comedy, compared to the other genres, especially after 1992
[Image: movie releases over time, 1922 to 1998]
[Image: movie releases by genre over time, 1922 to 1998]

Rating Analysis (by user gender and movie genre)

  • In this phase, two more graphs were created to better understand the data
  • The first graph analysed the ratings by user gender
  • The second graph analysed the ratings by movie genre
  • The goal was to analyse whether the rating was influenced by user gender or movie genre
  • In conclusion, the average rating given by men and women was very similar, with women giving more 1 and 5 scores and men being mainly responsible for the other scores.
  • Regarding movie genre, the ratings follow a pattern related to the total number of movies in each genre. Even so, some aspects are interesting: there are more Drama than Comedy movies, but Comedy has more movies with scores of 1 and 2 than Drama.
[Image: rating by user gender]
[Image: rating by movie genre]
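
The comparison by gender can be sketched with a group mean and a cross-tabulation. The ratings below are made up, chosen so that both groups have the same mean but different score distributions. Normalising each row is a design choice of this sketch, and the source does not say whether the notebook compared counts or shares.

```python
import pandas as pd

# Made-up merged ratings (rating plus the gender of the user who gave it).
ratings = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "M", "F"],
    "rating": [4, 3, 5, 1, 4, 5],
})

# Average rating per gender.
mean_by_gender = ratings.groupby("gender")["rating"].mean()

# Share of each score within each gender, so that groups of
# different sizes can be compared on the same scale.
score_share = pd.crosstab(ratings["gender"], ratings["rating"], normalize="index")

print(mean_by_gender)
print(score_share)
```

The same pattern applies to movie genre once the ratings are joined with the genre columns of u.item.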

DataFrame manipulation using Pandas

  • This final phase focused on manipulating the DataFrames with pandas operations
  • The first manipulation was the creation of a new DataFrame containing only "Male" and "Student" users who are over 20
    • A boolean filter was applied to the USER dataset:
      student_users = users[(users.age > 20) & (users.occupation == "student") & (users.gender == "M")]
    • The table below presents a fraction of the new DataFrame that was generated
      user_id  age  gender  occupation  zip_code
      9        29   M       student     01002
      33       23   M       student     27510
      73       24   M       student     41850
  • The second manipulation was to find the number of female programmers
    • A boolean filter was applied to the USER dataset:
      programmer_F_users = users[(users.occupation == "programmer") & (users.gender == "F")]
    • A total of 6 users fit the criteria
  • The last manipulation was to find the number of ratings over 3
    • A boolean filter was applied to the DATA dataset:
      over3_data = data[data.rating > 3]
    • Results per rating (4 and 5): Rating 4 with 34,174 ratings; Rating 5 with 21,201 ratings
    • A total of 55,375 ratings over 3 was found
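
The three filters above can be tried on a small example. The rows here are made up, so the counts differ from the project's results. The helper names n_female_programmers and per_rating are hypothetical additions used to show how the totals can be obtained.

```python
import pandas as pd

# Made-up rows for illustration; the column names follow u.user and u.data.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [29, 19, 23, 31],
    "gender": ["M", "M", "F", "F"],
    "occupation": ["student", "student", "programmer", "programmer"],
})
data = pd.DataFrame({"rating": [5, 4, 3, 4, 1, 5]})

# Each comparison yields a boolean Series; & combines them element-wise,
# and the parentheses are required because & binds tighter than == and >.
student_users = users[
    (users["age"] > 20) & (users["occupation"] == "student") & (users["gender"] == "M")
]

# Number of rows matching the second filter.
n_female_programmers = len(
    users[(users["occupation"] == "programmer") & (users["gender"] == "F")]
)

# Ratings above 3, then the count for each remaining score.
over3_data = data[data["rating"] > 3]
per_rating = over3_data["rating"].value_counts().sort_index()

print(student_users)
print(n_female_programmers, len(over3_data))
print(per_rating)
```

On the real data, len(over3_data) gives the total and per_rating gives the breakdown for scores 4 and 5.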