# Import raw data
raw.df <- read.csv("https://raw.githubusercontent.com/vivienneprince/College-Scorecard---Munging-EDA-Project/main/data/starfishdf.csv")
# Import field variables (the sheet is shared publicly, so authentication is skipped)
library(googlesheets4)
gs4_deauth()
starfish.fields <- read_sheet('https://docs.google.com/spreadsheets/d/1PL5zn6QLU9GSSD8rRreL8r7xaoClrPIrf5qnioZ3eE4/edit?usp=sharing')
This is our function for checking the completeness and free-of-error dimensions of data quality as described by Pipino, Lee, and Wang.
For completeness, we calculated the ratio of non-NA values within each column. So a value of 1 would signify there are no NAs in that column.
For free-of-error, we returned the ratio of outliers within each numeric column. If the column is not numeric, it shows ‘not numeric’. By default, outliers are defined as values more than 3 times the IQR below the 1st quartile, or more than 3 times the IQR above the 3rd quartile. The IQR multiplier can be specified by the user by adding the argument ‘criterion = x’ to the function call, where x is the desired multiplier. So a value of 0 would signify there are no outliers under the specified criterion.
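To make the two ratios concrete, here is a minimal sketch that works them out by hand for a single toy column. The vector `x` and the helper names are made up for illustration; the outlier count is divided by the full column length, which is an assumption about how the ratio is meant to be normalised.

```r
# Toy column (made-up values): ten observed values, one of them extreme, plus two NAs.
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 500, NA, NA)

# Completeness: share of entries that are not NA.
completeness <- mean(!is.na(x))

# Free-of-error: share of entries falling outside the IQR fences.
multiplier <- 3  # the default multiplier described in the text
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr <- q[2] - q[1]
lower <- q[1] - multiplier * iqr
upper <- q[2] + multiplier * iqr
outlier_ratio <- sum(x < lower | x > upper, na.rm = TRUE) / length(x)

print(c(completeness = completeness, outlier_ratio = outlier_ratio))
```

Only the value 500 falls outside the fences here, so the outlier ratio is one entry out of twelve.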
qualitycheck <- function(df, criterion = 3) {
  # setup
  col_count <- ncol(df)
  row_count <- nrow(df)
  # return table setup (ends up as a character matrix, since Ratio_Outliers may hold 'not numeric')
  returntable <- matrix(nrow = col_count, ncol = 2, byrow = TRUE)
  colnames(returntable) <- c("Ratio_Completeness", "Ratio_Outliers")
  rownames(returntable) <- colnames(df)
  # returntable <- as.table(returntable)
  for (i in seq_len(col_count)) {
    # Completeness check: ratio of non-NA values
    na_count <- sum(is.na(df[, i]))
    naratio.count <- 1 - (na_count / row_count)
    # Free-of-error check: ratio of outliers (is.numeric() also covers integer columns)
    if (is.numeric(df[, i])) {
      q1 <- quantile(df[, i], .25, na.rm = TRUE)
      q3 <- quantile(df[, i], .75, na.rm = TRUE)
      iqr <- q3 - q1  # lower-case name avoids masking stats::IQR()
      threshold.upper <- q3 + iqr * criterion
      threshold.lower <- q1 - iqr * criterion
      outliers.index <- which(df[, i] > threshold.upper | df[, i] < threshold.lower)
      # divide by the number of rows, not columns, to get a per-column ratio
      outliers.count <- length(outliers.index) / row_count
    } else {
      outliers.count <- 'not numeric'
    }
    # Write check values into table:
    returntable[i, ] <- c(naratio.count, outliers.count)
  }
  returndf <- as.data.frame(returntable)
  returndf$Ratio_Completeness <- as.numeric(returndf$Ratio_Completeness)
  returndf <- cbind(VAR_NAME = rownames(returndf), returndf)
  rownames(returndf) <- NULL
  return(returndf)
}
ordered.df <- raw.df[,order(colnames(raw.df))]
data.quality <- qualitycheck(ordered.df)
# data.quality[data.quality$Ratio_Completeness == 0, 1:2]
# data.quality[data.quality$Ratio_Outliers == 'not numeric', ]
As we can see, a number of fields contain little to no data (less than 20% complete).
Remove all columns that are completely empty:
good.fields <- data.quality[data.quality$Ratio_Completeness != 0,1]
clean.df <- ordered.df[, good.fields]
qualitycheck(clean.df)
hist(qualitycheck(clean.df)$Ratio_Completeness)
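Only the completely empty columns are dropped above. If the sparsely populated fields should go as well, a completeness cutoff could be applied instead. The sketch below is one possible way to do that on a toy data frame; the 0.2 cutoff and the name `min_completeness` are assumptions chosen to mirror the 20% figure mentioned in the text, not something the analysis itself applies.

```r
# Toy data frame (made-up values) with columns of varying completeness.
toy <- data.frame(
  a = 1:10,                          # fully populated
  b = c(1, rep(NA, 9)),              # 10% complete
  c = rep(NA_real_, 10),             # empty
  d = c(1, 2, 3, 4, 5, rep(NA, 5))   # 50% complete
)

# Share of non-NA values per column.
completeness <- colMeans(!is.na(toy))

# Hypothetical cutoff: keep columns that are at least 20% complete.
min_completeness <- 0.2
kept <- toy[, completeness >= min_completeness, drop = FALSE]

print(names(kept))
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the cutoff.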
There are people who believe Asians are intrinsically ‘smarter’. I don’t think this is true, but I wondered whether there would be a negative relationship between the ratio of Asian students at a university and its admission rate. I thought about this because the Asians who go overseas to attend university in America are a small subset of all Asian students, and it would make sense that this subset consists of the highest-scoring (or the richest) Asian students (lurking variables, yay stats class). From the graphs, it looks like Asian students are the only group with a negative correlation, which is consistent with my hypothesis, but a correlation on its own does not establish causation.
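The sign of each relationship can also be checked numerically rather than read off the graphs. The following is a minimal sketch on a toy data frame with made-up values; the column names `ADM_RATE`, `UGDS_ASIAN`, and `UGDS_WHITE` follow College Scorecard naming and are assumed, not confirmed, to be present in the data used here.

```r
# Toy stand-in for the real data; all values are invented for illustration only.
toy <- data.frame(
  ADM_RATE   = c(0.90, 0.75, 0.60, 0.45, 0.30, 0.15),
  UGDS_ASIAN = c(0.02, 0.04, 0.05, 0.09, 0.12, 0.20),
  UGDS_WHITE = c(0.70, 0.65, 0.60, 0.62, 0.55, 0.50)
)

# Assumed names of the enrollment-share columns to compare against admission rate.
race_cols <- c("UGDS_ASIAN", "UGDS_WHITE")

# Pearson correlation of each share with admission rate, ignoring incomplete rows.
cors <- sapply(race_cols, function(col) {
  cor(toy[[col]], toy$ADM_RATE, use = "complete.obs")
})

print(cors)
```

A negative entry in `cors` corresponds to a downward-sloping trend in the matching scatter plot.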
My graphs look specifically at the state of Florida, to compare New College of Florida (NCF) with the rest of the state.
I subset the data accordingly and created a dummy variable marking which college is New College, for use when building the graphs.
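A minimal sketch of that subsetting step on a toy data frame follows. The column names `INSTNM` and `STABBR` follow College Scorecard naming and are assumptions here, as is the dummy variable name `is_ncf`.

```r
# Toy stand-in for the real data (institution names used only as labels).
toy <- data.frame(
  INSTNM = c("New College of Florida", "University of Florida",
             "Georgia Institute of Technology"),
  STABBR = c("FL", "FL", "GA"),
  stringsAsFactors = FALSE
)

# Keep Florida institutions only.
florida <- toy[toy$STABBR == "FL", ]

# Dummy variable: 1 for New College, 0 for every other Florida institution.
florida$is_ncf <- as.integer(florida$INSTNM == "New College of Florida")

print(florida)
```

The dummy can then be mapped to colour or shape in a plot so that New College stands out from the other institutions.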
In this graph, I wanted to look at the share of first-generation students at each university against its average family income. There is the expected negative relationship between the percentage of first-gen students and average family income, as first-gen students usually come from poorer families. What I am interested in is where New College sits. In my experience with the college, everyone came from highly educated, well-off families, and the data seem to bear this out: New College has the lowest share of first-gen students in the state and a fairly high average family income, the second highest among public schools.
In this graph, instead of first-gen students, I looked at the percentage of white students. Once again, I wanted to see the relationship between race and income, and how New College fits into the picture. As we can see, New College is fairly white and, once again, pretty well-off.
In these graphs I was curious about the cost of attendance at New College compared with the rest of the public schools. I chose to focus on public schools only, since private schools are expected to be more expensive. From these comparisons, we can see that New College is fairly expensive for a public school in Florida. Cost may be the reason New College has the lowest share of first-gen students, but there could be other explanations; it's worth investigating this lead further.
year = latest
distanceonly = 1 # “Online Only”
distanceonly = 0 # “Traditional”
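One way to carry these labels into the graphs is to recode the 0/1 flag as a labelled factor. This is a minimal sketch on toy data; the column name `DISTANCEONLY` and the new column `mode` are assumptions.

```r
# Toy stand-in; the flag values are invented and one is deliberately missing.
toy <- data.frame(DISTANCEONLY = c(0, 1, 0, NA, 1))

# Recode the flag so plots and tables show readable group labels.
toy$mode <- factor(
  toy$DISTANCEONLY,
  levels = c(0, 1),
  labels = c("Traditional", "Online Only")
)

# Count institutions per group, reporting missing flags separately.
counts <- table(toy$mode, useNA = "ifany")
print(counts)
```

Rows with a missing flag stay `NA` in the factor, so they are not silently assigned to either group.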
This graph looks at the distribution of admission rates for institutions, grouped by online-only status.
This graph looks at the distribution of the number of undergraduate and graduate students for institutions, grouped by online-only status.
This graph looks at the distribution of students’ age at entry for institutions, grouped by online-only status.
This graph looks at the relationship between an institution’s net tuition revenue per full-time equivalent student and instructional expenditures per full-time equivalent student. The relationship appears positive for both traditional and online only institutions.
This graph looks at the relationship between an institution’s average faculty salary and average cost of attendance (academic year institutions). The relationship appears positive for traditional institutions, but negative for online only institutions.
This graph looks at the relationship between an institution’s percent of students over 23 at entry and share of married students. The relationship appears positive for traditional institutions, but negative for online only institutions.
This graph looks at the relationship between an institution’s median family income in real 2015 dollars and median earnings of students working and not enrolled 6 years after entry. The relationship appears positive for both traditional and online only institutions.
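The direction of each relationship described above could be checked by fitting a separate simple regression per group and looking at the sign of the slope. The sketch below does this on toy data with made-up values; the column names `DISTANCEONLY`, `AVGFACSAL`, and `COSTT4_A` follow College Scorecard naming and are assumptions about the data used here.

```r
# Toy stand-in: four traditional (0) and four online-only (1) institutions, invented values.
toy <- data.frame(
  DISTANCEONLY = c(0, 0, 0, 0, 1, 1, 1, 1),
  AVGFACSAL    = c(5000, 6000, 7000, 8000, 5000, 6000, 7000, 8000),
  COSTT4_A     = c(20000, 24000, 27000, 33000, 30000, 27000, 25000, 21000)
)

# Fit cost ~ salary separately within each group and keep only the slope.
slopes <- sapply(split(toy, toy$DISTANCEONLY), function(d) {
  coef(lm(COSTT4_A ~ AVGFACSAL, data = d))[["AVGFACSAL"]]
})

print(slopes)  # named by group: "0" = Traditional, "1" = Online Only
```

A positive slope for one group and a negative slope for the other matches the pattern of opposite trends described for some of the graphs; with very few online-only institutions, such slopes can be unstable and should be read cautiously.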