This website details Exploratory Data Analysis and Visualizations on the Recipes Dataset
Introduction
In this project we analyze the ‘Recipes’ dataset, which details different recipes, their nutritional breakdown, the steps needed to complete them, and their user ratings. We were curious whether recipes from later years have the same rating distribution as recipes from earlier years.
The question we investigate is: Are the recipes from recent years (2018) rated the same as the recipes from earlier years?
We will be using the data from RAW_recipes.csv and RAW_interactions.csv. The first file contains the data about the recipes, while the second contains the users’ ratings of those recipes. Merging the two CSV files produces our DataFrame.
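A minimal sketch of this merge in pandas; the file paths and the `recipe_id` key in the interactions file are assumptions for illustration:

```python
import pandas as pd

# Load the two raw files (paths are assumptions for illustration).
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')

# Left-merge so every recipe is kept even if it has no ratings yet;
# 'recipe_id' as the interactions key is an assumption.
df = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')
```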
We wanted to know whether the year in which a recipe was published affects how users rate it. We hypothesize that as more recipes accumulate, new recipes are less likely to be unique, so the ratings would change. This matters because such information can help recipe websites grow and expand faster as businesses: the higher a website’s ratings, the more prestigious and reliable it appears. If something made the ratings in later years differ from those in earlier years, a follow-up project could investigate what caused the change.
Our dataset contains 234,429 rows.
The columns we are interested in are described below.
| Column | Description | Data Types |
|---|---|---|
| id | Recipe ID | string |
| minutes | Minutes it takes to prepare the recipe | int |
| contributor_id | ID of the user who submitted the recipe | string |
| submitted | The date when the user posted the recipe | string |
| nutrition | Nutrition information, including calories (#), total fat, sugar, sodium, protein, saturated fat, and carbohydrates | string |
| rating | The rating of the recipe | int |
| review | The review left with the rating | string |
| rating_avg | The average rating for a specific recipe | float |
| calories | The number of calories in the recipe | float |
| year | The year in which the recipe was submitted | int |
Cleaning and EDA
During our data cleaning process we addressed several columns.
rating:
We cleaned the rating column by replacing the “0” values with np.NaN, since a rating of 0 indicates that no rating was actually given. This changed the amount of actual data we have and can use. In fact, while conducting our permutation tests we chose to drop the rows with np.NaN values. We made this decision because, while shuffling the year column and tallying each rating, we noticed the results were inconsistent: the number of np.NaN values within each year was different after every permutation, making it difficult to compare our data.
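A sketch of this replacement, assuming the merged DataFrame `df` from above:

```python
import numpy as np

# A rating of 0 denotes "no rating given", not a true zero-star score,
# so treat it as a missing value.
df['rating'] = df['rating'].replace(0, np.nan)
```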
rating_avg:
For every recipe we computed the average rating and assigned it to that recipe. This helped us see the distribution of the ratings, since it significantly reduces the number of missing values.
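One way to compute this in pandas, assuming the `df` and recipe key `id` from above:

```python
# Per-recipe mean rating, broadcast back to every row of that recipe.
df['rating_avg'] = df.groupby('id')['rating'].transform('mean')
```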
year:
We used the submitted column to extract the year each recipe was submitted, to better suit our graphs and analysis. Since we are only interested in the year, we created a new column to store it.
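A sketch of this extraction, assuming `submitted` holds date strings as shown in the table below:

```python
# Parse the submission date and keep only the year.
df['year'] = pd.to_datetime(df['submitted']).dt.year
```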
calories:
The nutrition column contains several different nutrition values. We were interested in trends between calories and ratings, so we extracted the caloric value of each recipe as a float from the nutrition column and added it as a new column.
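One way to do this, assuming (as the table below suggests) that `nutrition` stores a list as a string with calories first:

```python
import ast

# nutrition is a list stored as a string, e.g. '[138.4, 10.0, ...]';
# calories is its first entry.
df['calories'] = df['nutrition'].apply(lambda s: ast.literal_eval(s)[0])
```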
Our Cleaned DataFrame
| id | minutes | contributor_id | submitted | nutrition | rating | review | rating_avg | calories | year |
|---|---|---|---|---|---|---|---|---|---|
| 333281 | 40 | 985201 | 2008-10-27 | [138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0] | 4.0 | These were pretty good, but took forever to bake. I would send it ... | 4.0 | 138.4 | 2008 |
| 453467 | 45 | 1848091 | 2011-04-11 | [595.1, 46.0, 211.0, 22.0, 13.0, 51.0, 26.0] | 5.0 | Originally I was gonna cut the recipe in half (just the 2 of us her... | 5.0 | 595.1 | 2011 |
| 306168 | 40 | 50969 | 2008-05-30 | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 5.0 | This was one of the best broccoli casseroles that I have ever made... | 5.0 | 194.8 | 2008 |
| 306168 | 40 | 50969 | 2008-05-30 | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 5.0 | I made this for my son's first birthday party this weekend. Our gues... | 5.0 | 194.8 | 2008 |
| 306168 | 40 | 50969 | 2008-05-30 | [194.8, 20.0, 6.0, 32.0, 22.0, 36.0, 3.0] | 5.0 | Loved this. Be sure to completely thaw the broccoli. I didn&... | 5.0 | 194.8 | 2008 |
Univariate Data
We created a density histogram showing the proportion of each rating. As the graph shows, the distribution is left-skewed: the majority of ratings fall into the 4- or 5-star categories, while the rest are 1, 2, or 3 stars. People tend to leave higher ratings rather than lower ones, perhaps because they feel more strongly about recipes they liked and want to share them with the world than about recipes they disliked or found merely okay.
Bivariate Data
We made a scatterplot with year on the x-axis and average rating on the y-axis. As the graph shows, points tend to concentrate at the higher ratings in general, and there are more points in the earlier years. For example, in 2008 the points from 3 to 5 form a dense band, showing that the majority of points lie in that region. In contrast, in 2018 there are still clusters of points around ratings of 4–5 but more gaps in the data. So while the tendency toward higher ratings is consistent, there is less data in 2018.
Aggregate Data
This table groups the data by year and takes the mean of the rating column for each year (reproduced in the sketch after the table). The mean values range between 4.46 and 4.72. We wanted to know whether such differences are significant.
| year | rating |
|---|---|
| 2008 | 4.66142 |
| 2009 | 4.68105 |
| 2010 | 4.69894 |
| 2011 | 4.70747 |
| 2012 | 4.72201 |
| 2013 | 4.70401 |
| 2014 | 4.71736 |
| 2015 | 4.68498 |
| 2016 | 4.50763 |
| 2017 | 4.46356 |
| 2018 | 4.49537 |
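The table above can be reproduced with a single groupby, assuming the cleaned `df`:

```python
# Mean rating per submission year, as shown in the table above.
mean_rating_by_year = df.groupby('year')['rating'].mean()
```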
Assessment of Missingness
NMAR Missingness in review column
In this dataset, we believe the missingness of values in the review column is NMAR. Users who had strong positive or negative feelings about a recipe are more likely to leave a review, while a user who felt only mildly about a recipe is less likely to leave one. This is why review would be NMAR: the values are missing depending on how users feel about the recipe, which is essentially the content of the review itself.
Permutation Testing
As our question centers on the distribution of rating across years, our selected column is rating, and we test its missingness dependency on two columns: calories and review. In each case, our observed statistic is the absolute difference in means between the group where rating is missing and the group where it is not. We then ran 1,000 permutations to obtain 1,000 test statistics and calculated the probability of seeing values equal to or greater than our observed mean difference.
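A minimal sketch of the calories test, assuming the cleaned `df` from earlier; for the review test, the review column would first need a numeric summary (its length, for instance), which is our assumption:

```python
import numpy as np

def missingness_perm_test(data, value_col, n_reps=1000):
    """p-value for whether rating's missingness depends on value_col,
    using the absolute difference in group means as the statistic."""
    is_missing = data['rating'].isna().to_numpy()
    values = data[value_col].to_numpy()
    observed = abs(values[is_missing].mean() - values[~is_missing].mean())

    stats = []
    for _ in range(n_reps):
        shuffled = np.random.permutation(values)
        stats.append(abs(shuffled[is_missing].mean()
                         - shuffled[~is_missing].mean()))

    # p-value: share of shuffled statistics at least as large as observed.
    return (np.array(stats) >= observed).mean()

p_calories = missingness_perm_test(df, 'calories')
```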
When testing whether rating missingness depends on calories, our p-value is 0.0. Since 0.0 < 0.05, we reject the null hypothesis; the missingness of rating appears to depend on the calories column.
When testing whether rating missingness depends on review, the p-value is 0.2. Since 0.2 > 0.05 (our significance level), we fail to reject the null hypothesis; the missingness of rating does not appear to depend on the review column.
Hypothesis Testing
In our hypothesis testing we aim to answer the question: Do the ratings for the recipes published in 2018 differ from the ratings for the recipes published before 2018?
Our null hypothesis is that the distribution of ratings for recipes in 2018 is the same as the distribution of ratings for recipes from years before 2018.
Our alternative hypothesis is that the distribution of ratings for recipes in 2018 is different from the distribution of ratings for recipes from years before 2018.
Because we are examining differences between categorical distributions, we use the total variation distance (TVD) between the two rating distributions, for 2018 and for the years before 2018. Since the numbers of recipes posted in 2018 and in earlier years differ, we calculate the TVD using the proportions in each rating category. We chose a significance level of 0.05, as it is the standard threshold for statistical significance.
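Concretely, if $p_i$ and $q_i$ are the proportions of rating category $i$ in 2018 and in the years before 2018, respectively, the total variation distance is

$$\mathrm{TVD}(p, q) = \frac{1}{2} \sum_{i} \lvert p_i - q_i \rvert,$$

which is 0 when the two distributions are identical and 1 when they share no categories at all.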
In our permutation test, we first drop the rows where rating is NaN, as we would not want the number of missing rating values to differ between our two groups each time we permute. We then shuffle the year column and assign it as shuffled_year, and group the data into 2018 versus years before 2018 based on shuffled_year. After grouping the DataFrame into two groups, we compute the proportions of the rating distribution for each group and take the total variation distance between the distributions. We repeat this process 1,000 times to estimate the probability of seeing values equal to or more extreme than our observed total variation distance. A sketch of this procedure follows.
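This sketch assumes the cleaned `df` from earlier; helper names such as `tvd` and `is_2018` are ours, and shuffling the 2018 indicator is equivalent to shuffling the year column and re-splitting:

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two categorical distributions."""
    return np.abs(p - q).sum() / 2

# Drop missing ratings so the two groups stay comparable across permutations.
valid = df.dropna(subset=['rating']).copy()
valid['is_2018'] = valid['year'] == 2018

def rating_tvd(data, group_col):
    # Rating proportions within each group, then the TVD between them.
    dist = (data.groupby(group_col)['rating']
                .value_counts(normalize=True)
                .unstack(fill_value=0))
    return tvd(dist.loc[True], dist.loc[False])

observed_tvd = rating_tvd(valid, 'is_2018')

tvds = []
for _ in range(1000):
    # Shuffling the 2018 indicator plays the role of shuffled_year.
    valid['shuffled_year'] = np.random.permutation(valid['is_2018'].to_numpy())
    tvds.append(rating_tvd(valid, 'shuffled_year'))

# p-value: share of permuted TVDs at least as large as the observed one.
p_value = (np.array(tvds) >= observed_tvd).mean()
```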
The graph shows the distribution of TVDs compared to our observed statistic. With our significance level fixed at 0.05 and a p-value of 0.09, we fail to reject the null hypothesis (0.09 > 0.05). The distribution of ratings for recipes in 2018 is plausibly the same as the distribution of ratings for recipes from years before 2018.