The authors of this blog post are Zack Schwartz and Xi Yu, computer science graduate students at The George Washington University.

Among the diverse topics that a student can select for the Data Mining course, education is often overlooked. While not as hip and fun as mining Twitter and other social media, the American public school system is certainly no less important or relevant. The US Government spends billions of dollars on programs every year to ensure that its schools are providing adequate education to its citizens. It is no secret that inner-city youth lag far behind their suburban counterparts. Schools in low-income and high-minority areas are struggling or failing entirely to meet basic standards. But have we devoted the time, energy, and resources needed to actually analyze the available data? What do we genuinely know about our public school system? A simple Google search for “data mining public schools” reveals that there is strong resistance to data mining anything related to schools and children.

These concerns are not without merit. The possession of vast amounts of data on people is incredibly valuable, and it provides the government with greater power, a distressing thought indeed. Like anything else in our world, however, data can be used for either benevolent or malevolent purposes. The US Government has pledged a level of transparency and makes many of its data sets publicly available in the hope that individuals like us will mine them for the benefit of others and the community.

New York State has one of the largest and most diverse public school systems in the nation. The state publishes what are known as “Report Cards” for all of its public schools. Fortunately, the data is freely available at https://reportcards.nysed.gov/index.php and contains a wealth of useful information that can be data mined. From all of the data, we chose a set of attributes covering student performance on standardized tests, racial breakdowns, post-graduation plans, student behavior, and teacher qualifications. This data was extracted, cleaned, parsed, and run through common data mining algorithms to see if any interesting patterns could be discovered. Clear patterns did emerge, with some unexpected yet powerful results.

Pre-processing of the data consisted of the following:

Used data from the 2009-2010 academic year only

Considered only schools with “HIGH SCHOOL” in their names

Removed schools that had missing or obviously incorrect data for various columns.
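The filtering steps above can be sketched in a few lines of Python. This is only an illustration, not the code we actually ran: the column names (school_name, math_pass_pct, dropout_pct) and the sample rows are hypothetical, the file is assumed to already cover a single academic year, and "obviously incorrect" is interpreted here as a percentage outside the range 0 to 100.

```python
import csv
import io


def load_high_schools(csv_file, required_columns):
    """Keep rows whose name contains 'HIGH SCHOOL' and whose required
    columns all parse as percentages between 0 and 100."""
    kept = []
    for row in csv.DictReader(csv_file):
        if "HIGH SCHOOL" not in row["school_name"].upper():
            continue  # not a high school by name
        try:
            values = {col: float(row[col]) for col in required_columns}
        except (KeyError, ValueError, TypeError):
            continue  # missing or non-numeric value
        if any(v < 0 or v > 100 for v in values.values()):
            continue  # percentage outside the plausible range
        kept.append({"school_name": row["school_name"], **values})
    return kept


# Made-up rows for illustration only; these are not real schools.
sample = io.StringIO(
    "school_name,math_pass_pct,dropout_pct\n"
    "EXAMPLE HIGH SCHOOL,81.5,3.2\n"
    "EXAMPLE MIDDLE SCHOOL,75.0,1.0\n"
    "ANOTHER HIGH SCHOOL,,4.1\n"
    "THIRD HIGH SCHOOL,140.0,2.0\n"
)
schools = load_high_schools(sample, ["math_pass_pct", "dropout_pct"])
print([s["school_name"] for s in schools])
```

Only the first row survives: the second is not a high school, the third has a missing value, and the fourth has an impossible percentage.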

The resulting data set consisted of 779 schools and 42 attributes. Please note that for the tables in this document, we rounded the numbers to either one decimal place or the nearest whole number for ease of data entry. For exact numbers and the complete list of attributes, see the following:

List of raw results and full list of attributes for 10 clusters

For this post, we will focus on clustering the data into 10 clusters. This means that the data mining algorithm used (k-means) finds 10 groups of schools that are similar to each other based on their attributes. Therefore, all the schools in cluster 1 are similar to each other, but may be very different overall from the schools in cluster 6. Describing how these clusters are created is beyond the scope of this blog post, but you can read about unsupervised learning algorithms.
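For readers curious about the mechanics, here is a minimal sketch of the standard k-means procedure (often called Lloyd's algorithm) in plain Python. It is not the tool we used to produce our clusters. The function names, the seed, and the toy points are assumptions for illustration, and a real run on mixed attributes would likely need the columns scaled to comparable ranges first.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Alternate between assigning each point to its nearest centroid
    and moving each centroid to the mean of its assigned points."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of points")
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    for _ in range(iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy two-attribute points, made up for illustration only.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
toy_assignments, toy_centroids = k_means(toy_points, k=2)
print(toy_assignments)
```

Each school would be one point with one coordinate per attribute, and k would be 10. The result depends on the random starting centroids, so different seeds can produce different groupings.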

The raw results are a bit challenging to absorb, so we broke up the columns and tabulated the results for ease of analysis.

Our first task is to analyze school performance based on the percentage of students passing their math, English, and Regents standardized tests. We grouped the performance of schools into three categories, “Poor”, “Average”, and “Strong”, and put the category in the far-right column. Here are the results:

The difference in performance of the students on their test scores between the clusters is quite clear and consistent throughout. We have identified clusters 1, 4, 5, and 7 as poor performing clusters. Clusters 3 and 6 are about average, and clusters 0, 2, 8, and 9 are high performing. Let’s take a further look into the attributes of each of these groups.

We will start off with inspecting post-graduation plans. One reasonable hypothesis is that the more students who attend four-year colleges following graduation, the better the performance of the school. However, unexpectedly, we have discovered this is not necessarily the case. We have kept the categorization from the previous table for ease of reference.

This crucial table demonstrates the unexpected results that we have found. Clusters 0 and 3, strong and average performing respectively, have far fewer students attending four-year colleges upon graduation than the poor performing clusters 1, 2, and 4 (and equal to 5). The strongest clusters, 8 and 9, do have very large percentages attending four-year schools. Likewise, cluster 7, a poor performing cluster, has the lowest percentage of students attending four-year schools. Even so, we conclude that attendance at four-year colleges upon graduation is not a reliable indicator of the performance of a school’s students. The table does provide more useful information, however. The two clusters that are seemingly out of whack in terms of the link between four-year schools and performance, 0 and 3, have the highest rates of students directly entering employment. We do not believe this is a coincidence.

That is not all this table reveals. The most telling predictor of whether a school is a strong, average, or poor performer is the last column, percent unknown. It is evident from the data that schools whose students’ post-graduation plans are unknown are the weakest performers. While attendance at four-year schools is not a strong indicator of performance, the percentage of students with unknown plans definitely is.

Now let’s analyze additional student attributes: the percentage of students who drop out of high school, the percentage who get suspended, the percentage who receive free or reduced lunch, the percentage of non-completers, and the attendance rate.

There is little that is unexpected in this table. The most interesting tidbit is that the percentage of students who receive free or reduced lunch is the strongest indicator of student performance. Schools with under 30% free lunch perform the strongest, the average schools have 40% and 45%, and the poor performing schools are over 69%. While we do not have economic data, we can reasonably postulate that free and reduced lunch is linked with income levels. The other data strongly correlate higher dropout rates, lower attendance, and pronounced incompletion rates with poor performance. One might easily gloss over suspension rates, but there is something here to be cognizant of. Clusters 0 and 3, the “anomalies” when it came to four-year colleges, follow the same logic with suspensions. Despite being strong and average respectively, their suspension rates are higher than or equal to those of a number of poor performing clusters.
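One way to put a number on how strongly an attribute tracks performance is the Pearson correlation coefficient. We read our patterns off the cluster tables rather than computing this, so the sketch below is only an illustration of the idea. The function name and the values are made up and are not taken from the report card data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values for illustration only: percent free or reduced lunch
# and percent passing a test for five imaginary schools.
free_lunch_pct = [10, 20, 30, 40, 50]
pass_pct = [90, 80, 70, 60, 50]
r = pearson(free_lunch_pct, pass_pct)
print(r)
```

A value near -1 means the two attributes move in opposite directions, a value near +1 means they rise together, and a value near 0 means there is no linear relationship. A strong correlation in either direction still says nothing about cause and effect.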

We have tip-toed around the racial breakdowns of the schools, but we cannot ignore them forever. There is a strong correlation between the performance of a school and its racial makeup. Here is the table breaking down percentages for White, Asian, Black, Hispanic, and American Indian:

The educational and economic divide between the races in this country is no secret. It is an ugly social problem that we must face as a nation. We would like to highlight cluster 3, which has defied several trends so far (low four-year college attendance and higher suspension rates). It has the highest percentage of White students and only performs at an average level. Clusters 8 and 9, which have consistently and predictably been the absolute strongest clusters in all categories thus far, are the 4th and 5th most White respectively. At 84% and 75%, perhaps a little bit of diversity can go a long way. Cluster 6, however, is the most diverse cluster of all, with all races (except for American Indian) at very close percentages. As one might expect, it performs at an average level.

We have yet to analyze teacher data and how strongly it correlates with school performance. This table shows the percentage of teachers with fewer than 3 years of experience, the percentage of teachers who have Master’s degrees or further education, the percentage of teachers who instruct core classes and are considered “Not Highly Qualified”, the percentage of teachers without an appropriate certification, and the percentage of teacher turnover.

There are a few takeaways from this table of teacher data. The percentage of teachers who have fewer than three years of experience, the percentage of teachers who are not highly qualified to teach core courses, and the percentage of teachers who do not have an appropriate certification all have very strong correlations with school performance. A correlation with teacher turnover is also apparent. However, the percentage of teachers holding Master’s or advanced degrees does not seem to track performance at all. The percentages are all over the map and are completely unpredictable given the current amount of available data.

One may infer that poor performing schools could be in unsafe locations, making it difficult for them to attract more qualified teachers. As a result, they have to hire teachers who have little experience or improper certifications, which results in poor performance, and the cycle continues.

The final collection of data that we will look at is the average number of students in grade 10 math and English classrooms.

There is no surprise here. There are about 5 to 6 more students per classroom on average in poor performing schools compared to the stronger schools. Here the data shows the importance of the student-to-teacher ratio. The fewer students per teacher, the more time each teacher can spend with each student, thus improving the student’s educational experience.

Although it is beyond the scope of this project, the authors were curious to see whether the clusters correlate with geographical location. The authors plotted on Google Maps the locations of 20 random schools from clusters 1 (poor performer), 2 (strong performer), and 6 (average performer).
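Drawing such a random sample is simple once each school has a cluster label. The sketch below assumes 20 schools are drawn per cluster, which is one reading of the description above. The function name, the seed, and the placeholder school names are hypothetical, and the plotting step on Google Maps is not shown.

```python
import random


def sample_schools_by_cluster(school_names, assignments, clusters,
                              sample_size=20, seed=0):
    """Pick up to sample_size schools at random from each requested cluster."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    samples = {}
    for cluster in clusters:
        members = [name for name, a in zip(school_names, assignments)
                   if a == cluster]
        samples[cluster] = rng.sample(members, min(sample_size, len(members)))
    return samples


# Placeholder names and labels for illustration only.
names = [f"School {i}" for i in range(300)]
labels = [i % 10 for i in range(300)]
picked = sample_schools_by_cluster(names, labels, clusters=[1, 2, 6])
for cluster, chosen in picked.items():
    print(cluster, len(chosen))
```

Sampling without replacement guarantees that no school appears twice in the same cluster's sample.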

Cluster 1 (POOR PERFORMER)

Cluster 2 (STRONG PERFORMER)

Cluster 6 (AVERAGE PERFORMER)

The results coincide with our general assumptions about inner-city school standards lagging behind. Cluster 1 is a poor performing cluster and is highly concentrated in New York City. Cluster 2 contains strong performing schools, which are largely scattered across upstate New York. Cluster 6 is average and contains a mix of New York City and the surrounding suburbs.

We have examined a considerable amount of data in this paper regarding New York State public schools. There are clear patterns, real indicators among these attributes that can predict how well a school performs.

The above data, however, only demonstrates patterns. It does not indicate cause and effect. We cannot say that poor performing schools should stop providing free and reduced lunch, because we have no idea whether free and reduced lunch is a cause of poor performance. What we can say is that higher rates of free and reduced lunch are associated with poorer performance. The same holds for all of the other relevant data we discussed.

It is our hope that people find the content of this paper useful and that it may spur debate on the best methods to improve the public schools of the great state of New York.