Among the many excellent talks in the Cognitive Systems Institute Group Speaker Series, one of the best is Mine Cetinkaya-Rundel's talk on “Teaching Data Science”. One of the examples she presented was about teacher salaries and their relationship with SAT scores.

In this very simple example, we can apply linear regression, clustering and correlation.

About the dataset: SAT

The SAT data frame has 50 rows and 7 columns. The data were collected to study the relationship between expenditures on public education and test results. The data frame is provided by the faraway package.

Name of the column: Description

expend: Current expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars)

ratio: Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994

salary: Estimated average annual salary of teachers in public elementary and secondary schools (in thousands of dollars)

takers: Percentage of all eligible students taking the SAT

verbal: Average verbal SAT score

math: Average math SAT score

total: Average total score on the SAT

1) Is there a relationship between salary / expenditures and total score?

We first drop any incomplete rows from the dataset, then draw a simple scatterplot to check the relationship between salary and SAT score.

library(faraway)    # provides the sat data frame
library(tidyverse)  # loads ggplot2, dplyr and tibble

data("sat")

# keep only the rows without missing values
df <- sat[complete.cases(sat), ]
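The simple scatterplot mentioned above is not shown in the code, so here is a minimal sketch of what it could look like. It assumes the same labels and theme as the later plots in this post; the object name p is only for illustration.

```r
library(faraway)
library(ggplot2)

data("sat")
df <- sat[complete.cases(sat), ]

# Plain scatterplot: teacher salary against average total SAT score
p <- ggplot(df, aes(x = salary, y = total)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Salary ($1,000)", y = "Average SAT score",
       title = "Relationship between salary and total SAT score")

print(p)
```

With no color mapping, every state is drawn the same way, which is why a third variable is added in the next plots.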

There is nothing remarkable in this graphic. So, let’s consider including two more variables in our analysis: expenditures (money invested by the state for each student) and takers (percentage of all eligible students taking the SAT).

First idea: including the variable expenditures

# color the points by expenditure per pupil, the variable for this first idea
ggplot(df, aes(x = salary, y = total, color = expend)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Salary ($1,000)", y = "Average SAT score",
       title = "Relationship between salary and total SAT score",
       subtitle = "Points colored by expenditure per pupil ($1,000)",
       caption = "source: faraway package\nauthor: thinkingondata.com") +
  scale_colour_viridis_c()

Second idea: including the variable takers

# color the points by the percentage of eligible students taking the SAT
ggplot(df, aes(x = salary, y = total, color = takers)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Salary ($1,000)", y = "Average SAT score",
       title = "Relationship between salary and total SAT score",
       subtitle = "Fewer students taking the SAT, higher scores for those who take the test.",
       caption = "source: faraway package\nauthor: thinkingondata.com") +
  scale_colour_viridis_c()

Answer:

According to both visualizations, better scores are associated with the states where fewer students take the exam.

The main reason is that when only students who are fully prepared and willing to go to university take the test, we can logically expect better scores from that group, especially compared with states where everyone must take the exam whether or not they want to (and whether or not they are prepared).

2) Can we check the trend within different groups of SAT score and percentage of students taking the test?

Let’s separate the data into groups by organizing it in clusters:

set.seed(123)  # k-means starts from random centers; fix the seed for repeatable clusters
clusters <- kmeans(df %>% select(salary, total, takers), centers = 3)

# store the cluster label and bin takers into three categories
SAT <- df %>%
  mutate(cluster = factor(clusters$cluster),
         frac_cat = cut(takers, breaks = c(0, 22, 49, 81),
                        labels = c("low", "medium", "high")))

# scatterplot with one linear fit per takers category
ggplot(SAT, aes(x = salary, y = total, color = frac_cat)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Salary ($1,000)", y = "Average SAT score",
       title = "Groups per salary and total SAT",
       subtitle = "Percentage of eligible students by colors",
       caption = "source: faraway package\nauthor: thinkingondata.com") +
  theme_minimal() +
  scale_color_viridis_d()
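The k-means assignment computed above is stored but the plot is colored by the takers category. As a minimal sketch, the clusters themselves can be plotted too. Two things here are my assumptions and not part of the original analysis: the variables are standardized with scale() so that total (in the hundreds) does not dominate the distance, and nstart = 25 is used to try several random starts. The names vars_scaled and p_clusters are only for illustration.

```r
library(faraway)
library(ggplot2)

data("sat")
df <- sat[complete.cases(sat), ]

set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Standardize the three variables (assumption of this sketch, see text)
vars_scaled <- scale(df[, c("salary", "total", "takers")])

# Three clusters, keeping the best of 25 random starts
clusters <- kmeans(vars_scaled, centers = 3, nstart = 25)
df$cluster <- factor(clusters$cluster)

# Same axes as before, colored by k-means cluster
p_clusters <- ggplot(df, aes(x = salary, y = total, color = cluster)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Salary ($1,000)", y = "Average SAT score",
       title = "k-means clusters on salary, total SAT and takers") +
  scale_color_viridis_d()

print(p_clusters)
```

The cluster numbers are arbitrary labels, so they can change with a different seed even when the grouping is the same.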

Answer

Yes, we can create groups by salary, total SAT score and percentage of students taking the test. For each of those groups, geom_smooth(method = "lm") adds a fitted line from a linear regression model. If no method is given, geom_smooth uses loess by default for datasets with fewer than 1,000 observations.

Loess is a method for fitting a smooth curve between two variables.

This is a nonparametric method because the linearity assumptions of conventional regression methods have been relaxed. Instead of estimating parameters like m and c in y = mx + c, a nonparametric regression focuses on the fitted curve. The fitted points and their standard errors are estimated with respect to the whole curve rather than a particular estimate. So, the overall uncertainty is measured as how well the estimated curve fits the population curve. [1]
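The lines drawn by geom_smooth(method = "lm") can also be obtained as numbers. This is a minimal sketch that fits total ~ salary separately inside each takers category, using the same breaks as above; the names fits and coefs are only for illustration.

```r
library(faraway)

data("sat")
df <- sat[complete.cases(sat), ]

# Same three takers categories as in the plot
df$frac_cat <- cut(df$takers, breaks = c(0, 22, 49, 81),
                   labels = c("low", "medium", "high"))

# One linear model per category: total SAT score explained by salary
fits <- lapply(split(df, df$frac_cat),
               function(d) lm(total ~ salary, data = d))

# One row per category with the intercept and the slope of its fitted line
coefs <- t(sapply(fits, coef))
print(round(coefs, 2))
```

The salary column of coefs is the slope of each colored line in the plot, so its sign shows the direction of the trend inside each group.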

3) What about the correlation in our dataset?

round(cor(df), 2)

#        expend ratio salary takers verbal  math total
# expend   1.00 -0.37   0.87   0.59  -0.41 -0.35 -0.38
# ratio   -0.37  1.00   0.00  -0.21   0.06  0.10  0.08
# salary   0.87  0.00   1.00   0.62  -0.48 -0.40 -0.44
# takers   0.59 -0.21   0.62   1.00  -0.89 -0.87 -0.89
# verbal  -0.41  0.06  -0.48  -0.89   1.00  0.97  0.99
# math    -0.35  0.10  -0.40  -0.87   0.97  1.00  0.99
# total   -0.38  0.08  -0.44  -0.89   0.99  0.99  1.00

Answer

The most important correlation is negative (-0.89) and is between total (total SAT score) and takers (percentage of eligible students taking the SAT).
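To look at that single pair more closely, here is a minimal sketch using cor() and cor.test() from base R. Adding a confidence interval is my addition, not part of the original analysis; the names r and ct are only for illustration.

```r
library(faraway)

data("sat")
df <- sat[complete.cases(sat), ]

# Pearson correlation between percentage of takers and total score
r <- cor(df$takers, df$total)
print(round(r, 2))

# Test of the correlation, including a 95% confidence interval
ct <- cor.test(df$takers, df$total)
print(ct$conf.int)
```

The value of r is the same entry that appears in the takers row and total column of the matrix above.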

4) Is there any relationship between salary and SAT scores?

With this dataset from 1994, we don’t have evidence that higher salaries go with better scores.

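library(faraway)
library(lattice)  # parallelplot() comes from lattice

data("sat")
df <- sat[complete.cases(sat), ]

# drop the third column, then sort by the new third column in decreasing order
df1 <- df[, -3]
df1 <- df1[order(-df1[, 3]), ]

# color index 2 for states below the median total score, 1 otherwise
colorIndices <- (df1$total < median(df1$total)) + 1
satColors <- c("#482677FF", "#29AF7FFF")[colorIndices]

parallelplot(df1[, 1:3], horizontal.axis = FALSE, col = satColors, lwd = 1.5,
             main = "Average state SAT: best scores")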

Answer:

According to this visualization, there is no evidence of a relationship between better scores and higher salaries for teachers.

Conclusions

The smaller the percentage of high school graduates taking the SAT in a state, the better the average score: only fully prepared students are taking the exam, and for that reason the scores are better.

There are some ideas to explore in the future, and one in particular is calling my attention: class size, class composition and school funding are variables to consider when looking for better SAT scores.

Further material