FINAL ASSESSMENT – SUMEDH NAKOD
INTRODUCTION
Starbucks has over 87,000 possible drink combinations. It is one of the most famous multinational coffeehouse chains on the planet thanks to its convenience, good-tasting coffee, and widespread franchises at over 30,000 locations. But have you ever wondered what makes up our beverages? More and more people are trying to eat healthier foods in order to maintain their health, and I believe people should have a better understanding of the nutritional contents of their drinks. Therefore, the following analysis tries to answer two questions and test one hypothesis:
Which Starbucks drinks are relatively healthy?
Is there any correlation between the different nutritional values, and how do they affect the calorie count?
Hypothesis : Caffeine does not impact the calorie count of Starbucks beverages.
Please note: Since Plotly graphs and animated graphs do not work in an R Markdown PDF, I have prepared this report in MS Word.
The aim is to help customers make an informed decision about their beverage of choice and to determine whether nutritional features such as fat, sodium and cholesterol are correlated with calories. First, the data will be modified as required using the ‘dplyr’ and ‘tidyverse’ packages; this includes mutating columns with string matching to create beverage categories, and imputing missing values. Next, I will fit multiple linear regression models and interpret the resulting statistics. These statistical inferences will then be used to answer the questions above and to draw a conclusion on the hypothesis. Lastly, reproducible code is attached at the end of the report.
The data used for this analysis was taken from Kaggle, a website known for hosting many real-life datasets. I selected the Starbucks dataset, which provides the nutritional values of more than 1000 drinks across 16 different parameters. To make this data useful and meaningful, it needs to be altered using data wrangling techniques. A new column is added that specifies the category of each drink, so that beverages can be analysed by category rather than individually. The milk column is converted from numeric to a factor, because its values (0 to 5) encode the type of milk used, and another new column is mutated to specify whether a drink is vegan or not. The dplyr and tidyverse packages are used for these steps. Cleaning the dataset is another useful technique that will be covered. As my dataset had no missing values, I used a library called missForest, which is mainly used to impute missing values with a random forest algorithm but also contains a function called prodNA() that randomly deletes a given percentage of observations. The resulting incomplete dataset is then imputed several times with the Amelia package, and the imputed datasets are transformed with tidyverse and dplyr. After this process, the mi.meld() function of Amelia combines the results of the models fitted to the different imputations using Rubin’s rules. With the help of these statistical inferences and the correlations between nutrients, general advice will be given to people drinking coffee from Starbucks.
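The category column described above can be illustrated with a small sketch. The report's own code is written in R with dplyr; the version below is plain Python and is only meant to show the idea of string matching on the product name. The keyword list, the category labels and the function name `categorize` are assumptions made for illustration, not the rules used in the report.

```python
import re

# Hypothetical keyword-to-category rules, checked in order.
# The report's actual categories and matching rules may differ.
CATEGORY_KEYWORDS = [
    ("frappuccino", "Frappuccino"),
    ("refresher", "Refreshers"),
    ("tea", "Tea"),
    ("espresso", "Espresso"),
    ("brewed", "Brewed Coffee"),
]


def categorize(product_name: str) -> str:
    """Return the first category whose keyword appears as a whole word in the name."""
    # Matching on whole words avoids false hits such as "tea" inside "steamed".
    words = set(re.findall(r"[a-z]+", product_name.lower()))
    for keyword, category in CATEGORY_KEYWORDS:
        if keyword in words:
            return category
    return "Other"


if __name__ == "__main__":
    for name in ["Caramel Frappuccino Blended", "Chai Tea Latte", "Steamed Milk"]:
        print(name, "->", categorize(name))
```

Matching on whole words rather than raw substrings is a deliberate choice here: a plain substring test would wrongly place "Steamed Milk" in the tea category.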
ANALYSIS AND RESULTS
For the graphical analysis, we start with some exploratory data analysis. Figure 1 below shows an overall summary of the dataset with the basic statistical measures, to give a quick understanding of the data. Instead of just printing a table, I used the kableExtra package to create an interactive table that highlights a row on hover; this feature may not work in a PDF file. Before doing this, I had to do some data wrangling, such as converting column types to factors and mutating new columns using string manipulation and logical conditions.
Figure(1) : Interactive summary of Starbucks nutrient contents
After the summary, let’s look at the sugar content per category of drink. As mentioned above, a new column called Category was created, which divides the drinks into different categories of coffee. To give the viewer a reference point, the plot marks the daily sugar intake levels for men (36 g) and women (24 g). For our analysis, an animated plot (saved as a GIF) was coded that cycles through the categories (transition_states(); refer to the code at the bottom) and indicates which categories of drink surpass these daily levels. As we can see, people on a low-sugar diet should avoid Frappuccinos at all times, while the other categories contain options that they can opt for.
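The comparison against the daily sugar levels can be sketched as follows. This is plain Python rather than the report's R code, and the drinks listed are made-up values used only to show the check; they are not rows from the Starbucks dataset. The function name `categories_over_limit` is likewise hypothetical.

```python
# Daily sugar reference values quoted in the text, in grams.
DAILY_SUGAR_G = {"men": 36, "women": 24}


def categories_over_limit(drinks, limit_g):
    """Return the sorted categories with at least one drink whose sugar exceeds limit_g."""
    return sorted({category for category, sugar_g in drinks if sugar_g > limit_g})


# Made-up (category, sugar in grams) pairs, purely for illustration.
example_drinks = [
    ("Frappuccino", 60.0),
    ("Frappuccino", 45.0),
    ("Espresso", 30.0),
    ("Espresso", 10.0),
    ("Brewed Coffee", 0.0),
]

if __name__ == "__main__":
    for group, limit in DAILY_SUGAR_G.items():
        print(group, categories_over_limit(example_drinks, limit))
```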
Figure(2) : Animated Bar plot of sugar content per category of beverage
Next, we analyse whether there is any trend between the fat content and the calorie count of drinks within each category. Instead of a static plot, the graph below uses the plotly library, which provides more customizable options: clicking on a category type filters the plot, and hovering over a data point shows more information about it, such as the exact product name.
From the plot, we can conclude that calories and fat do have a roughly linear relationship. Using the hover feature, we can clearly see that “White hot chocolate”, the furthest point at the top, and the drinks around it are must-avoid drinks for anyone who is on a diet or cares about the number of calories they consume.
Figure(3) : Interactive scatter plot showing relationship between calories and total fat.
Now, instead of going through each remaining parameter one by one, we use a facet grid to analyse the relationship between calories and the rest of our nutrients. Also, remember that we have a custom-made column that specifies whether a product is vegan or non-vegan. This provides more detail, so that customers have clearer information about their drinks.
Below, we have used geom_smooth() with the lm method, which fits a straight line and shows whether each relationship is linear.
As we can see below, all of these nutrients are related to the number of calories in a drink. It is noteworthy, however, that for cholesterol the data points are not tightly fitted around the linear prediction line, which could decrease the accuracy of our model.
Figure(4) : Grid plot showing relationship between other nutrients and calories.
A correlation heat map tells us whether there is any linear trend between two variables. In our case, we are concerned with the linear relationship between calories and caffeine. The closer the correlation is to 0, the weaker the linear relationship, and this is indeed the case for caffeine. We can also note the other nutrients that have a stronger correlation with calories: total fat, sodium, total carbs, sugar and cholesterol. We will use these features further in our modelling.
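Each cell of such a heat map is a correlation coefficient between two columns. Assuming the heat map shows the standard Pearson coefficient (the text does not name the method), a single cell can be computed as in the sketch below. It is plain Python rather than the report's R code, and the sample numbers are invented to show the three typical outcomes.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented numbers: a perfect positive, a perfect negative and a zero correlation.
    print(pearson_r([1, 2, 3], [2, 4, 6]))
    print(pearson_r([1, 2, 3], [6, 4, 2]))
    print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))
```

A value near 1 or -1 indicates a strong linear trend, and a value near 0 indicates none. A near-zero value does not rule out a non-linear relationship.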
Figure(5) : Correlation heat map for overall relationship of nutrients
MISSING DATA
Now, let’s look at how to handle missing data. Originally the Starbucks dataset did not have any missing values, so I had to delete some data at random. I came across two methods to achieve this. The first uses the sample() function in R to choose random observations from the dataset and replace them with missing values (NA). With this method, however, I ran into bugs that did not allow me to delete as much data as I wanted. On doing some research, I came across a library called missForest, which is mainly used to impute missing values with the random forest algorithm, an algorithm that is very popular in the machine learning community. This library also has a function, prodNA(), which produces random missing values in a dataset.
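The idea of deleting a fixed share of the data completely at random can be sketched as follows. This is a plain Python illustration in the spirit of prodNA(), not a reimplementation of it and not the report's R code. The function name `inject_missing`, the use of None as the missing marker and the small example table are assumptions made for illustration.

```python
import math
import random


def inject_missing(rows, fraction, seed=None):
    """Return a copy of a rectangular table with a share of its cells set to None.

    rows is a list of equal-length lists; fraction is the share of all cells
    to blank out, chosen completely at random without replacement.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    result = [list(row) for row in rows]  # copy, so the input stays untouched
    cells = [(i, j) for i, row in enumerate(result) for j in range(len(row))]
    n_missing = math.floor(fraction * len(cells))
    for i, j in rng.sample(cells, n_missing):
        result[i][j] = None
    return result


if __name__ == "__main__":
    # Invented 4 x 5 table of numbers, purely for illustration.
    table = [[float(r * 5 + c) for c in range(5)] for r in range(4)]
    for row in inject_missing(table, fraction=0.25, seed=1):
        print(row)
```

Sampling cell positions without replacement guarantees that exactly the requested number of cells is removed, which avoids the shortfall that can occur when the same cell is drawn more than once.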
Below, let’s look at the summary of our dataset with missing values:
Figure(6) : Data summary of the Starbucks dataset with missing values, using the skimr package
The following plot, produced with the “missmap()” function, graphically represents the missingness created in our dataset, which we are going to handle using multiple imputation.
Figure(7) : Missing data map
After imputing the dataset with Amelia, we use tidyverse to store the imputations and the multiple linear models fitted to them together in one data frame. We first draw added-variable plots. These are effective for showing the relationship between a single predictor (here, a nutrient) and calories, because each predictor is plotted after conditioning on the other covariates, which makes the relationship easier to visualize.
We can also note that almost every plot below, except the one for sugar, contains some outliers that can affect the estimated regression parameters.
Figure(8) : Added variable plot for our imputed dataset
The normal QQ plot gives an indication of the univariate normality of the data. If the data is normally distributed, the points fall on the 45-degree reference line; otherwise they deviate from it. In the QQ plot below, most of the data points lie on the reference line. As the theoretical quantiles increase, the points start to depart from the line, but only slightly, so this is not a concern. A log transformation could possibly bring the data points closer to the reference line, but we won’t discuss that in this project.
Next, we plot a residual vs fitted graph which is used to detect non-linearity, unequal error variances and outliers. This plot shows us 3 major points:
The residuals bounce randomly around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.
The residuals roughly form a horizontal band, which suggests that the variances of the error terms are equal.
Some of the residuals stand out as outliers from the basic random pattern of residuals.
Figure(9) : Model-based diagnostic plots (normal QQ plot and residuals vs fitted)
We then extract the coefficient estimates and their standard errors for all imputations. From this summary we can also write down the multiple linear regression equation for each imputation. For example, for imputation 1 we get:
Calories = 15.99 + 0.044*caffeine + 0.268*cholesterol + 3.754*sugar + 0.0951*sodium + 10.282*total fat - 29.49*trans fat
The equations for the other imputations are similar and can be seen in the summary below.
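To see how the equation for imputation 1 is used, the sketch below evaluates it for one drink. The coefficients are the ones quoted above; everything else is an assumption made for illustration: the function name `predict_calories`, the dictionary keys, and the nutrient values of the example drink, which are invented and not taken from the dataset. The sketch is plain Python, not the report's R code.

```python
# Coefficients of the imputation 1 equation quoted in the text.
INTERCEPT = 15.99
COEFFICIENTS = {
    "caffeine": 0.044,
    "cholesterol": 0.268,
    "sugar": 3.754,
    "sodium": 0.0951,
    "total_fat": 10.282,
    "trans_fat": -29.49,
}


def predict_calories(drink):
    """Evaluate the linear equation; nutrients missing from the dict count as 0."""
    return INTERCEPT + sum(
        coefficient * drink.get(name, 0.0)
        for name, coefficient in COEFFICIENTS.items()
    )


if __name__ == "__main__":
    # Invented nutrient values for a hypothetical drink.
    drink = {"caffeine": 150, "cholesterol": 10, "sugar": 20,
             "sodium": 100, "total_fat": 5, "trans_fat": 0}
    print(round(predict_calories(drink), 2))

    # Effect of 100 extra units of caffeine with everything else unchanged.
    more_caffeine = dict(drink, caffeine=drink["caffeine"] + 100)
    print(round(predict_calories(more_caffeine) - predict_calories(drink), 2))
```

For this invented drink the equation gives about 161.27 calories, and 100 additional units of caffeine change the prediction by only about 4.4, whereas 5 additional units of total fat would add about 51.4.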
Figure(10) : Summary statistics of all imputations
Instead of looking at every imputation individually, we can combine the estimates across all imputations. For this we use Rubin’s rules, through the `mi.meld()` function. Armed with the melded estimates and standard errors, we can create our regression summary table with some more dplyr wizardry. To calculate the p-values and confidence intervals, we also need the degrees of freedom, which we extract from one of the imputed models.
The narrower the confidence interval, the more precise our estimate. For each coefficient, we can say that we are 95% confident that the interval between the confidence limits given below contains its true value.
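Rubin's rules for a single coefficient can be sketched in a few lines. The pooled estimate is the mean of the per-imputation estimates, and the pooled variance is the average squared standard error (within-imputation variance) plus (1 + 1/m) times the variance of the estimates across the m imputations (between-imputation variance). The sketch below is plain Python, not the report's R code and not the source of mi.meld(). The function name `pool_rubin` and the input numbers are invented for illustration and are not the report's results.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputations using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors, m >= 2")
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


if __name__ == "__main__":
    # Invented estimates and standard errors for five imputations.
    estimate, std_error = pool_rubin(
        [3.70, 3.75, 3.80, 3.72, 3.78],
        [0.10, 0.11, 0.10, 0.12, 0.10],
    )
    print(estimate, std_error)
```

The between-imputation term is what makes the pooled standard error larger than a simple average of the standard errors: it carries the extra uncertainty that comes from having imputed the missing values.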
Figure(11) : Averaged summary statistics over all imputations
Let us not miss one of the most important statistical values for our modelling: adjusted R2. The two ways of combining R2 across the imputations, melding it correctly and simply averaging it (which is not the correct approach), give 0.969751 and 0.9697488, which are basically identical. This is probably because the models from the five imputed datasets are already fairly similar; there might be more variance in data that is less neat.
Repeating the same steps for a model without caffeine as a feature decreases our adjusted R2 by only 0.001, which is a very small change. Together with the analysis above, this suggests that caffeine has no meaningful relationship with calories.
LIMITATION
As we saw in the added-variable plots, some of the graphs had quite a few outliers, which can affect our model. This is a limitation that needs to be worked on going forward, since handling outliers is one of the most important steps in analysing data.
The residuals vs fitted plot should not be given much weight for a small dataset; it may not be worthwhile to look for conclusive answers from that plot when the dataset is small.
It should be noted that the dataset was collected at a certain point in time, and the details may have been updated since. Also, a few nutrients, such as protein and dietary fibre, are not included in the dataset and can affect the calorie count of the drinks as well.
CONCLUSION
Overall, the results of all the analyses corroborate one another and point to the following drink categories at Starbucks as the best choices for people following a particular diet: Cold & Ice Brewed Coffees, Freshly Brewed Coffees, and Starbucks Refreshers and Teas. Espressos are also a good choice, but only those with less sugar and cholesterol.
To conclude on our hypothesis: we do not reject our null hypothesis that caffeine has no impact on calories. Many other nutrients have a much greater impact on the calorie count of beverages.
For further research, we could look at the details of each drink at a more granular level and compare them with health index scores, to learn more about which drinks could contribute to heart-related issues.
Fact : Avoiding the afternoon caffeine crash. Caffeine is an adenosine antagonist. The longer you are awake, the more adenosine builds up, which makes you feel fatigued. Caffeine essentially blocks the adenosine receptors, but when the caffeine wears off the adenosine binds to the receptors, and you crash and feel really tired. To avoid this, do not drink coffee for at least 90 minutes after you wake up and let the adenosine be handled naturally; this way you won’t face the afternoon crash.