Import data

# Import raw data
raw.df <- read.csv("https://raw.githubusercontent.com/vivienneprince/College-Scorecard---Munging-EDA-Project/main/data/starfishdf.csv")

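# Import field variables
library(googlesheets4)  # provides gs4_deauth() and read_sheet()
gs4_deauth()            # the sheet is public, so skip Google sign-in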
starfish.fields <- read_sheet('https://docs.google.com/spreadsheets/d/1PL5zn6QLU9GSSD8rRreL8r7xaoClrPIrf5qnioZ3eE4/edit?usp=sharing')

Data Quality

Data Quality Check Function

This is our function for checking the completeness and free-of-error dimensions of data quality as described by Pipino, Lee, and Wang.

For completeness, we calculate the ratio of non-NA values within each column, so a value of 1 means there are no NAs in that column.
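As a quick illustration of the completeness ratio, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for this example and are not part of our dataset.

```r
# Toy data frame (made-up values, for illustration only)
toy <- data.frame(
  a = c(1, 2, NA, 4),        # one missing value out of four
  b = c(NA, NA, NA, NA),     # completely empty column
  c = c("x", "y", "z", "w"), # no missing values
  stringsAsFactors = FALSE
)

# Ratio of non-NA values per column; 1 means no NAs, 0 means all NAs
completeness <- colMeans(!is.na(toy))
print(completeness)
```

Here `is.na()` returns a logical matrix, so `colMeans()` of its negation gives the share of non-missing values in each column.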

For free-of-error, we return the ratio of outliers within each numeric column. If the column is not numeric, it shows ‘not numeric’. By default, outliers are values more than 3 × IQR below the 1st quartile or more than 3 × IQR above the 3rd quartile. The IQR multiplier can be changed by adding the argument ‘criterion = x’ to the function call, where x is the desired multiplier. A value of 0 means there are no outliers under the specified criterion.
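To make the outlier rule concrete, here is a small sketch applied to a single toy vector. The vector `x` and the names `k`, `lower`, and `upper` are made up for this example. The ratio is taken over all rows, including NAs, which is one possible choice of denominator.

```r
# Toy numeric vector (made-up values) with one extreme value and one NA
x <- c(10, 12, 11, 13, 12, 11, 100, NA)
k <- 3  # IQR multiplier

# Quartiles and interquartile range, ignoring NAs
q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]

# Outlier thresholds: k * IQR beyond the 1st and 3rd quartiles
lower <- q[1] - k * iqr
upper <- q[2] + k * iqr

# NAs are never counted as outliers
is_outlier <- !is.na(x) & (x < lower | x > upper)
outlier_ratio <- sum(is_outlier) / length(x)
print(outlier_ratio)
```

With these toy values, only the extreme value 100 falls outside the thresholds.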

qualitycheck <- function(df, criterion = 3) {

  # setup
  col_count <- ncol(df)

  # return table setup (holds mixed values, so entries may end up as character)
  returntable <- matrix(nrow = col_count, ncol = 2, byrow = TRUE)
  colnames(returntable) <- c("Ratio_Completeness", "Ratio_Outliers")
  rownames(returntable) <- colnames(df)

  for (i in seq_len(col_count)) {

    # Completeness check: ratio of non-NA values
    na_count <- sum(is.na(df[, i]))
    naratio.count <- 1 - (na_count / nrow(df))

    # Free-of-error check: ratio of outliers (numeric and integer columns only)
    if (is.numeric(df[, i])) {
      q1 <- quantile(df[, i], .25, na.rm = TRUE)
      q3 <- quantile(df[, i], .75, na.rm = TRUE)
      iqr <- q3 - q1  # lower-case name avoids masking stats::IQR()
      threshold.upper <- q3 + iqr * criterion
      threshold.lower <- q1 - iqr * criterion
      outliers.index <- which(df[, i] > threshold.upper | df[, i] < threshold.lower)
      # divide by the number of rows (not columns) to get a ratio
      outliers.count <- length(outliers.index) / nrow(df)
    } else {
      outliers.count <- 'not numeric'
    }

    # Write check values into table:
    returntable[i, ] <- c(naratio.count, outliers.count)
  }

  returndf <- as.data.frame(returntable)
  returndf$Ratio_Completeness <- as.numeric(returndf$Ratio_Completeness)

  # Move the variable names from row names into their own column
  returndf <- cbind(VAR_NAME = rownames(returndf), returndf)
  rownames(returndf) <- NULL

  return(returndf)

}

Quality check:

ordered.df <- raw.df[,order(colnames(raw.df))]
data.quality <- qualitycheck(ordered.df)

# data.quality[data.quality$Ratio_Completeness == 0, 1:2]
# data.quality[data.quality$Ratio_Outliers == 'not numeric', ]

Data Quality table

Ratio Completeness

As we can see, a number of fields contain little to no data (less than 20% complete).
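One way to list the sparse fields is sketched below. It uses a toy stand-in for the quality table with the same `VAR_NAME` and `Ratio_Completeness` columns. The field names and values are made up, and the 0.2 cutoff simply mirrors the 20% mentioned above.

```r
# Toy stand-in for the data quality table (made-up fields and values)
dq <- data.frame(
  VAR_NAME = c("field_a", "field_b", "field_c"),
  Ratio_Completeness = c(0.05, 0.95, 0.00),
  stringsAsFactors = FALSE
)

# Keep fields that are less than 20% complete, emptiest first
sparse <- dq[dq$Ratio_Completeness < 0.2, ]
sparse <- sparse[order(sparse$Ratio_Completeness), ]
print(sparse$VAR_NAME)
```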

Ratio Outliers

Data clean up

Remove all columns that are empty

good.fields <- data.quality[data.quality$Ratio_Completeness != 0,1]
clean.df <- ordered.df[, good.fields]
qualitycheck(clean.df)

hist(qualitycheck(clean.df)$Ratio_Completeness)

Asians!

Where are the Asians?

Total share of enrollment of undergraduate degree-seeking students who are Asian (ugds_asian)

I asked the question, “where are the Asians?” To find out, I made a graph that shows the geographic concentration of Asian students, focusing on Florida, because that is where I live. It turns out that the highest concentration of Asian students is in Orlando. This is not surprising, because the tastiest Asian restaurants I’ve been to in Florida have been in Orlando. They must get good business there.

2015

2016

2017

2018

latest

Asians = Smart?

Race VS Admissions

There are people who believe Asians are intrinsically ‘smarter’. I don’t think this is true, but I wondered whether there would be a negative relationship between the ratio of Asian students at a university and its admission rate. I thought about this because the Asian students who go overseas to attend university in America are a small subset of all Asian students, and it would make sense that this subset consists of the highest-scoring (or the richest) students (lurking variables, yay stats class). From the graphs, it looks like Asians are the only group with a negative correlation, which is consistent with my hypothesis, but a correlation alone does not establish causation.
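The direction of a relationship like this can also be checked numerically with a Pearson correlation. The sketch below uses made-up toy values, not the Scorecard data, and it assumes an admission-rate column named `adm_rate` alongside `ugds_asian`; the real column name may differ.

```r
# Made-up toy values for illustration only; NOT taken from the Scorecard data
toy <- data.frame(
  ugds_asian = c(0.02, 0.05, 0.10, 0.20, NA),   # share of Asian undergraduates
  adm_rate   = c(0.80, 0.70, 0.55, 0.30, 0.60)  # hypothetical column name
)

# Pearson correlation, dropping rows where either value is missing
r <- cor(toy$ugds_asian, toy$adm_rate, use = "complete.obs")
print(r)
```

A negative `r` only indicates that the two variables move in opposite directions in the sample; it says nothing about why.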

Asians

White

Black

Hispanic

A look at NCF

My graphs look specifically at the state of Florida to compare NCF with the rest of the state:

I subset the data to Florida institutions and created a dummy variable that marks which college is New College, for use when creating the graphs.
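The subsetting and dummy variable described above could look something like the following sketch. The column names `stabbr` (state) and `instnm` (institution name), the dummy name `is_ncf`, and the toy rows are all assumptions for illustration.

```r
# Toy rows (made up); stabbr and instnm are assumed column names
toy <- data.frame(
  instnm = c("New College of Florida", "Some Florida University", "Some Georgia College"),
  stabbr = c("FL", "FL", "GA"),
  stringsAsFactors = FALSE
)

# Keep only Florida institutions
florida <- toy[toy$stabbr == "FL", ]

# Dummy variable: 1 for New College, 0 for every other Florida institution
florida$is_ncf <- as.integer(florida$instnm == "New College of Florida")
print(florida)
```

The dummy can then be mapped to color or shape in a plot so that New College stands out from the rest of the state.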

First Generation Students vs. Family Income of Florida Universities in 2015

In this graph, I wanted to look at the share of first generation students at each university and the average family income of its students. There is the obvious negative relationship between the percentage of first gen students and average family income, as first gen students usually come from poorer families. What I am interested in is where New College sits. In my experience with the college, everyone came from highly educated, well-off families, and the data seem to bear this out. New College has the lowest share of first gen students in the state and a rather high average family income, the second highest among public schools.

Percentage of White Undergraduate Students vs. Average Family Income of Florida Universities in 2015

In this graph, instead of first gen students, I looked at the percentage of white students. Once again, I wanted to see the relationship between race and income, and how New College fits into the picture. As we can see, New College is fairly white and, once again, pretty well-off.

Average Cost of Attendance / Tuition for All Public Florida Colleges in 2015

In these graphs, I was curious how New College’s cost of attendance compares with the rest of the public schools. I chose to focus on public schools because private schools can be expected to be more expensive. From these comparisons, we can see that New College is pretty expensive for a public school in Florida. Cost may be the reason New College has the lowest share of first gen students, but there could be other explanations; it’s worth investigating this lead further.

Online only learning vs traditional institutions comparison

year = latest
distanceonly = 1 # “Online Only”
distanceonly = 0 # “Traditional”
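A minimal sketch of this setup is shown below. It assumes a character `year` column that takes the value "latest" and a 0/1 `distanceonly` column; the toy rows and the new column name `mode` are made up for illustration.

```r
# Toy rows (made up); assumes year is stored as text and distanceonly as 0/1
toy <- data.frame(
  year = c("latest", "latest", "2015"),
  distanceonly = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Keep only the latest year
latest <- toy[toy$year == "latest", ]

# Recode the 0/1 flag as a labelled factor for use in plots
latest$mode <- factor(
  latest$distanceonly,
  levels = c(0, 1),
  labels = c("Traditional", "Online Only")
)
print(latest)
```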

Comparison of admission rate for online only to traditional institutions

This graph looks at the distribution of admission rates for institutions, grouped by whether they are online only.

Comparison of undergraduate enrollment and graduate enrollment for online only to traditional institutions

This graph looks at the distribution of the number of undergraduate students and graduate students for institutions, grouped by whether they are online only.

Comparison of entry age for online only to traditional institutions

This graph looks at the distribution of students’ age at entry for institutions, grouped by whether they are online only.

Comparison of net tuition revenue per full-time equivalent student vs instructional expenditures per full-time equivalent student

This graph looks at the relationship between an institution’s net tuition revenue per full-time equivalent student and instructional expenditures per full-time equivalent student. The relationship appears positive for both traditional and online only institutions.

Average faculty salary vs average cost of attendance (academic year institutions)

This graph looks at the relationship between an institution’s average faculty salary and average cost of attendance (academic year institutions). The relationship appears positive for traditional institutions, but negative for online only institutions.
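The sign of a relationship within each group can be checked by computing a correlation per group, as in the sketch below. The values are made up purely to demonstrate the computation and are not results from our data; the column names `avgfacsal`, `costt4_a`, and `mode` are assumptions.

```r
# Made-up toy values for illustration only; NOT results from the dataset
toy <- data.frame(
  mode = c(rep("Traditional", 4), rep("Online Only", 4)),
  avgfacsal = c(5000, 6000, 7000, 8000, 5000, 6000, 7000, 8000),
  costt4_a = c(20000, 24000, 27000, 33000, 30000, 26000, 25000, 21000),
  stringsAsFactors = FALSE
)

# Split by institution type, then compute one Pearson correlation per group
by_mode <- sapply(
  split(toy, toy$mode),
  function(g) cor(g$avgfacsal, g$costt4_a, use = "complete.obs")
)
print(by_mode)
```

`split()` returns one data frame per group, so `by_mode` is a named vector with one correlation per institution type.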

Percent of students over 23 at entry vs married

This graph looks at the relationship between an institution’s percent of students over 23 at entry and share of married students. The relationship appears positive for traditional institutions, but negative for online only institutions.

Median family income in real 2015 dollars vs Median earnings of students working and not enrolled 6 years after entry

This graph looks at the relationship between an institution’s median family income in real 2015 dollars and median earnings of students working and not enrolled 6 years after entry. The relationship appears positive for both traditional and online only institutions.

Team Starfish