This post demonstrates an R Markdown document that I wrote for one of my undergraduate assignments, converted into blog-post form with the R package hugodown.
Question 8a
Import the College.csv dataset using read.csv
getwd()
#> [1] "/home/richard/Insync/hochuan97@gmail.com/Google Drive/1School/2122Sem1/ST3248/Homework questions/Tutorial1"
setwd("/home/richard/Insync/hochuan97@gmail.com/Google Drive/1School/2122Sem1/ST3248/Homework questions/Tutorial1")
college <- read.csv("College.csv")
head(college)
#> X Private Apps Accept Enroll Top10perc Top25perc
#> 1 Abilene Christian University Yes 1660 1232 721 23 52
#> 2 Adelphi University Yes 2186 1924 512 16 29
#> 3 Adrian College Yes 1428 1097 336 22 50
#> 4 Agnes Scott College Yes 417 349 137 60 89
#> 5 Alaska Pacific University Yes 193 146 55 16 44
#> 6 Albertson College Yes 587 479 158 38 62
#> F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal
#> 1 2885 537 7440 3300 450 2200 70 78
#> 2 2683 1227 12280 6450 750 1500 29 30
#> 3 1036 99 11250 3750 400 1165 53 66
#> 4 510 63 12960 5450 450 875 92 97
#> 5 249 869 7560 4120 800 1500 76 72
#> 6 678 41 13500 3335 500 675 67 73
#> S.F.Ratio perc.alumni Expend Grad.Rate
#> 1 18.1 12 7041 60
#> 2 12.2 16 10527 56
#> 3 12.9 30 8735 54
#> 4 7.7 37 19016 59
#> 5 11.9 2 10922 15
#> 6 9.4 11 9727 55
Question 8b
Rename the rows using the first column of the dataset, then drop that column
rownames(college) <- college[, 1] # select all rows of first col
college <- college[, -1]
head(college)
#> Private Apps Accept Enroll Top10perc Top25perc
#> Abilene Christian University Yes 1660 1232 721 23 52
#> Adelphi University Yes 2186 1924 512 16 29
#> Adrian College Yes 1428 1097 336 22 50
#> Agnes Scott College Yes 417 349 137 60 89
#> Alaska Pacific University Yes 193 146 55 16 44
#> Albertson College Yes 587 479 158 38 62
#> F.Undergrad P.Undergrad Outstate Room.Board Books
#> Abilene Christian University 2885 537 7440 3300 450
#> Adelphi University 2683 1227 12280 6450 750
#> Adrian College 1036 99 11250 3750 400
#> Agnes Scott College 510 63 12960 5450 450
#> Alaska Pacific University 249 869 7560 4120 800
#> Albertson College 678 41 13500 3335 500
#> Personal PhD Terminal S.F.Ratio perc.alumni Expend
#> Abilene Christian University 2200 70 78 18.1 12 7041
#> Adelphi University 1500 29 30 12.2 16 10527
#> Adrian College 1165 53 66 12.9 30 8735
#> Agnes Scott College 875 92 97 7.7 37 19016
#> Alaska Pacific University 1500 76 72 11.9 2 10922
#> Albertson College 675 67 73 9.4 11 9727
#> Grad.Rate
#> Abilene Christian University 60
#> Adelphi University 56
#> Adrian College 54
#> Agnes Scott College 59
#> Alaska Pacific University 15
#> Albertson College 55
Comments: the first column is now Private, and each row is now named after its university.
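Questions 8a and 8b can also be combined into a single step. A minimal sketch, assuming the first column of the CSV holds the university names: `read.csv()` accepts `row.names = 1`, which uses that column as row names and drops it from the data. The temporary file and the name `college_demo` are stand-ins so the example runs on its own; the two rows are copied from the `head()` output above and cut down to three columns.

```r
# Write a tiny stand-in for College.csv to a temporary file (illustration only;
# the two rows are copied from the head() output above, first three columns).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"","Private","Apps"',
  '"Abilene Christian University","Yes",1660',
  '"Adelphi University","Yes",2186'
), tmp)

# row.names = 1 uses the first column as row names and drops it from the data.
# stringsAsFactors = TRUE reads Private as a factor, not a character vector.
college_demo <- read.csv(tmp, row.names = 1, stringsAsFactors = TRUE)

rownames(college_demo)
class(college_demo$Private)
```

Reading `Private` as a factor up front also helps later, since the base plotting functions treat factors and character vectors differently.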
Question 8c i
Produce a numerical summary of the variables in the data set
summary(college)
#> Private Apps Accept Enroll
#> Length:777 Min. : 81 Min. : 72 Min. : 35
#> Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
#> Mode :character Median : 1558 Median : 1110 Median : 434
#> Mean : 3002 Mean : 2019 Mean : 780
#> 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
#> Max. :48094 Max. :26330 Max. :6392
#> Top10perc Top25perc F.Undergrad P.Undergrad
#> Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
#> 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
#> Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
#> Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
#> 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
#> Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
#> Outstate Room.Board Books Personal
#> Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
#> 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
#> Median : 9990 Median :4200 Median : 500.0 Median :1200
#> Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
#> 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
#> Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
#> PhD Terminal S.F.Ratio perc.alumni
#> Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
#> 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
#> Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
#> Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
#> 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
#> Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
#> Expend Grad.Rate
#> Min. : 3186 Min. : 10.00
#> 1st Qu.: 6751 1st Qu.: 53.00
#> Median : 8377 Median : 65.00
#> Mean : 9660 Mean : 65.46
#> 3rd Qu.:10830 3rd Qu.: 78.00
#> Max. :56233 Max. :118.00
Question 8c ii
Produce a scatterplot matrix of the first 10 columns
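No code accompanies this part in the post. A minimal sketch of one way to draw the matrix: `pairs()` only accepts numeric, logical or factor columns, and the summary above reports `Private` as class character, so it has to be converted first. The data frame `college_demo` is a stand-in built from the first six rows printed by `head()` (four of the ten columns), used only so the example runs on its own.

```r
# Stand-in for the first six rows of college (values copied from head() above).
college_demo <- data.frame(
  Private = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
  Apps    = c(1660, 2186, 1428, 417, 193, 587),
  Accept  = c(1232, 1924, 1097, 349, 146, 479),
  Enroll  = c(721, 512, 336, 137, 55, 158)
)

# pairs() rejects character columns, so convert Private to a factor first.
college_demo$Private <- as.factor(college_demo$Private)

# On the full data set this would be pairs(college[, 1:10]).
pairs(college_demo[, 1:4])
```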
Question 8c iii
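Boxplots of Outstate versus Private
# Private was read as a character column (see the summary above), so convert it
# to a factor; plot() then draws side-by-side boxplots
plot(as.factor(college$Private), college$Outstate,
xlab = 'Private',
ylab = 'Outstate')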
Question 8c iv
creating new qualitative variable Elite
Elite <- rep("No", nrow(college)) # start with a vector of "No", one entry per row of college
Elite[college$Top10perc > 50] <- "Yes" # set to "Yes" where more than 50% of new students come from the top 10% of their high school class
Elite <- as.factor(Elite) # convert from character to factor
college <- data.frame(college, Elite) # append Elite as a new column
check how many elite universities
summary(college$Elite) # 78 elite universities
#> No Yes
#> 699 78
comments: there are 78 elite universities
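The same variable can be built in one expression with `ifelse()`. This is a sketch of an alternative and not the approach used above; the vector `top10perc` holds only the six `Top10perc` values shown by `head()` earlier, so the counts differ from the full data set.

```r
# Top10perc for the first six universities, copied from head() above.
top10perc <- c(23, 16, 22, 60, 16, 38)

# ifelse() picks "Yes" or "No" element-wise; factor() fixes the level order.
elite <- factor(ifelse(top10perc > 50, "Yes", "No"), levels = c("No", "Yes"))

table(elite)
```

Setting `levels` explicitly keeps "No" before "Yes" even if one of the two values happens to be absent from the data.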
boxplot of Outstate vs Elite
plot(college$Elite, college$Outstate,
xlab = 'Elite',
ylab = 'Outstate')
Question 8c v
Create histograms for a few quantitative variables with differing numbers of bins
par(mfrow = c(2,2)) # arrange the plots in a 2 x 2 grid
# Apps
hist(college$Apps, breaks = 10, main = "Application histogram, bin 10")
hist(college$Apps, breaks = 50, main = "Application histogram, bin 50")
hist(college$Apps, breaks = 100, main = "Application histogram, bin 100")
hist(college$Apps, breaks = 500, main = "Application histogram, bin 500")
# Top25perc
hist(college$Top25perc, breaks = 10, main = "Top 25 histogram, bin 10")
hist(college$Top25perc, breaks = 50, main = "Top 25 histogram, bin 50")
hist(college$Top25perc, breaks = 100, main = "Top 25 histogram, bin 100")
hist(college$Top25perc, breaks = 500, main = "Top 25 histogram, bin 500")
# PhD
hist(college$PhD, breaks = 10, main = "PhD histogram, bin 10")
hist(college$PhD, breaks = 50, main = "PhD histogram, bin 50")
hist(college$PhD, breaks = 100, main = "PhD histogram, bin 100")
hist(college$PhD, breaks = 500, main = "PhD histogram, bin 500")
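One caveat on the plot titles: when `breaks` is a single number, `hist()` treats it as a suggestion and picks "pretty" break points, so the number of bins drawn can differ from the number requested. The sketch below shows how to check the actual count and how to force an exact count by passing a vector of break points. The input is simulated with `rnorm()` purely for illustration and is not the College data.

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)  # simulated values, illustration only

# A single number is only a suggestion; inspect how many bins were produced.
h_suggested <- hist(x, breaks = 50, plot = FALSE)
length(h_suggested$counts)

# A vector of break points is used as given: 11 points give exactly 10 bins.
exact_breaks <- seq(floor(min(x)), ceiling(max(x)), length.out = 11)
h_exact <- hist(x, breaks = exact_breaks, plot = FALSE)
length(h_exact$counts)
```

`plot = FALSE` returns the histogram object without drawing it, which is a convenient way to inspect the break points before plotting.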
Question 8c vi
Continued exploration.
The scatterplot matrix shows several groups of associated variables: Apps, Accept, Enroll and F.Undergrad are associated with one another, as are Top10perc and Top25perc.
The boxplots also show that out-of-state tuition (Outstate) tends to be higher for private universities and for elite universities, so part of the variation in out-of-state tuition can be explained by whether a university is private or elite.
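The associations read off the scatterplot matrix can also be checked numerically with `cor()`. A minimal sketch, using only the six rows printed by `head()` earlier as a stand-in named `college_demo`; on the full data the call would be `cor(college[, c("Apps", "Accept", "Enroll", "F.Undergrad")])`.

```r
# Stand-in for the first six rows of college (values copied from head() above).
college_demo <- data.frame(
  Apps        = c(1660, 2186, 1428, 417, 193, 587),
  Accept      = c(1232, 1924, 1097, 349, 146, 479),
  Enroll      = c(721, 512, 336, 137, 55, 158),
  F.Undergrad = c(2885, 2683, 1036, 510, 249, 678)
)

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(college_demo), 2)
cor_mat
```

With only six rows the values are rough, so they should be read as a check on direction and not as estimates for the full data set.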