York University
MTH 3333
For all the questions involving R programming, please submit your R code and attach
the screenshot of the R output.
Question 1: Please download the bike sharing data set from the following website:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. (I also include the data set
on Crowdmark.) This dataset contains the hourly and daily count of rental bikes between
years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and
seasonal information. There are two data sets; we will use day.csv. Import the data
into R and perform the following exploratory analysis.
a) The variable "registered" records the number of registered users who used the bike
sharing service on a particular day. Please provide the mean value of the variable
"registered" for each day of the week in 2011.
b) Produce a conditional density plot of the variable "registered", conditioned on each
month of 2011.
c) Produce a two-dimensional levelplot of the variable "casual" against the combination
of temperature (variable "temp") and humidity (variable "hum").
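One possible way to approach parts a) to c) is sketched below. It is a starting point under stated assumptions, not a model answer. The column names yr (0 for 2011, 1 for 2012), weekday (0 to 6) and mnth (1 to 12) are assumed from the UCI description of day.csv, and the function names are made up for this illustration. Part b) is read here as one lattice density panel per month, and part c) bins temp and hum onto a grid before calling levelplot, because the raw observations do not lie on a regular grid. Other readings are possible.

```r
# Sketch only. The columns yr, weekday and mnth are assumed from the UCI
# description of day.csv; all function names are hypothetical.
library(lattice)  # recommended package shipped with standard R installs

# a) Mean of "registered" for each day of the week within one year.
mean_registered_by_weekday <- function(day, year_code = 0) {
  d <- day[day$yr == year_code, ]
  aggregate(registered ~ weekday, data = d, FUN = mean)
}

# b) One density panel of "registered" per month of the chosen year.
density_by_month <- function(day, year_code = 0) {
  d <- day[day$yr == year_code, ]
  densityplot(~ registered | factor(mnth), data = d,
              plot.points = FALSE, xlab = "registered")
}

# c) Level plot of the mean of "casual" over binned temp and hum.
casual_levelplot <- function(day, n_bins = 10) {
  temp_bin <- cut(day$temp, breaks = n_bins)
  hum_bin  <- cut(day$hum,  breaks = n_bins)
  # Matrix of cell means; cells with no observations are NA.
  grid <- tapply(day$casual, list(temp_bin, hum_bin), mean)
  levelplot(grid, xlab = "temp (binned)", ylab = "hum (binned)")
}

# Runs only when day.csv is in the working directory.
if (file.exists("day.csv")) {
  day <- read.csv("day.csv")
  print(mean_registered_by_weekday(day))
  print(density_by_month(day))
  print(casual_levelplot(day))
}
```

The number of bins in part c) is a free choice. Fewer bins give smoother but coarser cells, while more bins leave more empty cells.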
Question 2: Fit a linear regression model to the bike sharing data set from Question
1.
a. Provide the summary result of the regression model with "casual" as the response
variable and "temp", "hum" as the predictors.
b. What other predictors do you think might be important for the modelling of the
variable "casual"? Please construct another linear model including more predictors
and provide the summary result of the second model.
c. Repeat 5-fold cross-validation 1000 times, each time randomly partitioning the
dataset into five equal parts. In each fold, use 80% of the data as the training data
and 20% as the validation data. For the models in a) and b), calculate the total sum of
squared prediction errors divided by the size of the validation data and by the number
of cross-validations. Which model has better predictive power?
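The repeated cross-validation in part c can be sketched as follows. This is a minimal illustration under assumptions, not the required solution. The function name repeated_cv_error is hypothetical, and it averages the per-fold mean squared prediction error over all folds and repetitions, which matches the criterion above when the five parts have equal size. The second formula uses windspeed and workingday, column names assumed from the UCI description of day.csv, purely as placeholder predictors. Which predictors belong in the model for part b is left open.

```r
# Sketch only: average squared prediction error over repeated k-fold
# cross-validation for a linear model. The function name is hypothetical.
repeated_cv_error <- function(formula, data, n_rep = 1000, k = 5) {
  n <- nrow(data)
  response <- all.vars(formula)[1]  # name of the response variable
  rep_err <- numeric(n_rep)
  for (r in seq_len(n_rep)) {
    # Fresh random partition into k folds of (nearly) equal size.
    fold <- sample(rep(seq_len(k), length.out = n))
    fold_err <- numeric(k)
    for (j in seq_len(k)) {
      train <- data[fold != j, ]
      valid <- data[fold == j, ]
      fit <- lm(formula, data = train)
      pred <- predict(fit, newdata = valid)
      # Sum of squared errors divided by the validation-set size.
      fold_err[j] <- sum((valid[[response]] - pred)^2) / nrow(valid)
    }
    rep_err[r] <- mean(fold_err)
  }
  mean(rep_err)
}

# Runs only when day.csv is in the working directory.
if (file.exists("day.csv")) {
  day <- read.csv("day.csv")
  model_a <- casual ~ temp + hum
  # Placeholder predictors, assumed column names; illustrative only.
  model_b <- casual ~ temp + hum + windspeed + workingday
  print(summary(lm(model_a, data = day)))
  print(summary(lm(model_b, data = day)))
  set.seed(1)  # arbitrary seed so the random partitions can be reproduced
  print(c(model_a = repeated_cv_error(model_a, day),
          model_b = repeated_cv_error(model_b, day)))
}
```

The model with the smaller value has the lower estimated prediction error on held-out data.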
Question 3: In the following marketing data set, we have 9 years of data, with sales in
units of 10 million euros and advertising expenditure in millions of euros.