Two weeks back, we decided to experiment with teaching data science to kids in grades 6 to 9! We think it is important to introduce students to thinking in a data-driven way early on in their lives; also, kids are way more fun than higher-ed students, so it was an easy choice for us to make!

We sent out a form asking kids to apply to our cool Kids Data Camp (the first in the world?!). We thought kids in 5th grade would be too young and those in 10th would be more focused on school academics. We had 18 people apply, most of them interested in science and math and a few in history/arts.

This weekend, 14 people turned up (there was no selection other than self-selection). This included one 10th grade student and one sophomore undergraduate who tagged along with the group to learn!

Some kids came in early. We put on a YouTube video on Scratch for them. It was fun to discuss it with them, and they related it to Lego right away. We asked the kids to install Scratch at home, make a dancing Shah Rukh Khan (the famous Bollywood actor) in it, and also have him jump from one building to another!

The Ice Breaker

Once all the kids had assembled, we had a quick ice breaker. Parth and Abhishek, interns from IIT Kanpur, divided the senior and junior students into two sets and picked one from each to form a group. A simple way to maximize group success: read here to learn how they did it!

Then Samarth, our intern from Harvard, introduced the idea of data science to the kids. He started with the famous John Snow cholera outbreak example. The kids were very quick: by a show of hands, everyone had seen a Google map. They understood that infected people were clustered around one pump while the other pumps stood vacant. A couple of questions came up: Why are some dots large and some small? Why did someone not go to a pump which was farther away? We answered. We told them there were three learnings for them:

a. Don’t waste water: it wasn’t as easily available 100 years back, and it still isn’t for many.

b. Don’t run away from problems; try to solve them, else they will catch up with you (a couple of them said that their way of solving the problem was to just run away from the city!).

c. You can solve problems with data: here is a medical problem you solved by plotting infected families on a map; you did not need any exposure to biology or medicine to come up with a preliminary inference.

Teaching data science using the cholera outbreak problem

The data exercise

We moved on and started with the key data set and experiment. Our aim was to give kids an idea of the whole cycle of data science: data collection, data entry/cleaning, feature extraction, visualization and model building (if we could get there; we had presumed we wouldn’t, for lack of time), and also to sensitize them to data security/permissions concerns.

We designed the following exercise: every kid got a set of 48 faces with names and hobbies. The kids had to give a rating of 5 if they would make the person a friend, 1 if not, and could choose other numbers in between. All 7 groups completed the exercise, with one mentor each. Out of these we pulled 16 samples out as a validation set :) The ‘train’ data sets were then exchanged among the groups.
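For readers who want to reproduce the hold-out step on a computer, here is a minimal sketch. It assumes the 16 validation samples were drawn from the 48 rated faces and that they were picked at random; the post does not say how they were chosen, and the function name and seed are ours.

```python
import random

def split_faces(face_ids, n_validation=16, seed=0):
    """Hold out n_validation faces; the rest form the 'train' set.

    Assumption: a random hold-out. The camp's actual selection
    method is not described in the post.
    """
    rng = random.Random(seed)
    shuffled = list(face_ids)
    rng.shuffle(shuffled)
    validation = sorted(shuffled[:n_validation])
    train = sorted(shuffled[n_validation:])
    return train, validation

# 48 faces numbered 1..48 (hypothetical numbering)
train_ids, validation_ids = split_faces(range(1, 49))
print(len(train_ids), len(validation_ids))  # 32 16
```

Keeping the validation faces out of sight (in an envelope, in our case) is what lets you later check a prediction against ratings that were not used to find the trend.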

The experiment exercise

Introducing Features

We then asked the following: if we wanted to know what kind of people Raghav (one of the kids) prefers to make friends with, how could they infer this by going through these sheets? One of the kids suggested that we could look at what kind of games his friends played and tell accordingly. We asked, what else? We then suggested that some kids might prefer making friends with boys and some with girls; we asked a boy whom he prefers to make friends with more often, and he said boys; a couple more said neutral.

Then we discussed two more features: we had smiling and neutral faces — would some people make smiling people friends more often? And also, we had old style names and new names — would some people prefer to make folks with new names friends more often? Kids seemed to have understood that people could possibly, not necessarily, make choices on this basis. For the workshop we decided to go with three features: gender, hobby and name style.

We used Excel as the platform for all experiments. We had a sheet with the features already entered for the data set. The kids had to enter the ratings and check the features. The kids did find some features wrongly entered and also some ambiguities: is squash indoor or outdoor, is Shilpy a new name or an old name? :)

Question 1: Is this kid a friendly person?

The first task for the kids was to find out whether the person they were analyzing was friendly or not: will s/he more often make friends than not? To get this right, the kids simply had to count how many people were marked 5, 4, 3, 2 and 1. Some of the kids used filters to do this and others counted manually. They finally made a graph. Here is the first graph we discussed with the whole group, where the red bar depicts percentages and the blue bar depicts the actual number in each bin.
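The same counting can be sketched in a few lines of Python. The ratings below are made up for illustration, and treating 4 and 5 as "friend" and 1 and 2 as "not a friend" is our assumption; the kids did this with filters and manual counts in Excel.

```python
from collections import Counter

def rating_histogram(ratings):
    """Return {score: (count, percentage)} for scores 5 down to 1."""
    counts = Counter(ratings)
    total = len(ratings)
    return {
        score: (counts[score], 100.0 * counts[score] / total if total else 0.0)
        for score in range(5, 0, -1)
    }

def is_friendly(ratings):
    """Assumption: 4-5 means 'friend', 1-2 means 'not a friend', 3 is undecided."""
    friends = sum(r >= 4 for r in ratings)
    non_friends = sum(r <= 2 for r in ratings)
    return friends > non_friends

example_ratings = [5, 5, 4, 1, 5, 1, 3, 5, 1, 5]  # made-up ratings
for score, (count, pct) in rating_histogram(example_ratings).items():
    print(f"rating {score}: {count} faces ({pct:.0f}%)")
print("friendly:", is_friendly(example_ratings))
```

The count and the percentage correspond to the blue and red bars of the kids' graph.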

* Original work by kids reproduced

We made two inferences:

K (anonymous) was a friendly person: s/he more often makes friends than not.

K is clear-headed and a fast decision-maker. S/he doesn’t have many maybe/maybe-not cases. S/he either decides to make a person a friend or not.

Then we discussed a couple more graphs of other kids. We phrased the statements positively :) V makes fewer friends, but that is because s/he likes to spend time studying. One group said she is confused, since she had many maybe/maybe-not cases; we corrected them: not confused, she takes time to decide whom to befriend, possibly because she is thinking deeply about it.

Question 2: What kind of people does s/he prefer to make friends with?

This was fun! Our next exercise was for them to find out whether, among the people the person chose to befriend, there were, say, more males than females, and similarly for the other features. (We had created a balanced data set with 50–50 of each feature type; this meant we did not have to look at the non-friends group.) Again the kids used filters, counted the two values of each input variable, and plotted graphs. We had already inserted a template in their Excel sheets for the kids to put in their counts; they then plotted the graphs themselves.

Here is a set of graphs we discussed.

Graphs made by kids

* Original work by kids reproduced

So, we learnt: the student for whom we’d made this graph definitely likes to make friends with people who play outdoor games; this is a clear trend. Next we talked about gender: the person makes male friends slightly more often, but this trend was not completely clear, since the difference between males and females is too small. It needs further investigation. Same for the third feature.
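The per-feature counting behind these graphs can be sketched as follows. The eight faces are made up (balanced 50-50 on each feature, like our data set), the field names are hypothetical, and treating a rating of 4 or more as "friend" is our assumption.

```python
from collections import Counter

FRIEND_THRESHOLD = 4  # assumption: ratings of 4 or 5 count as "friend"

def friend_feature_counts(rows, features=("gender", "hobby", "name_style")):
    """Count each feature value among the faces rated as friends."""
    friends = [r for r in rows if r["rating"] >= FRIEND_THRESHOLD]
    return {f: Counter(r[f] for r in friends) for f in features}

def share_of(counts, value):
    """Fraction of friends that have the given feature value."""
    total = sum(counts.values())
    return counts[value] / total if total else 0.0

def make_face(gender, hobby, name_style, rating):
    return {"gender": gender, "hobby": hobby,
            "name_style": name_style, "rating": rating}

# Made-up, balanced example data
rows = [
    make_face("male", "outdoor", "new", 5),
    make_face("female", "outdoor", "old", 5),
    make_face("male", "outdoor", "old", 4),
    make_face("female", "outdoor", "new", 4),
    make_face("male", "indoor", "new", 4),
    make_face("female", "indoor", "old", 2),
    make_face("male", "indoor", "old", 1),
    make_face("female", "indoor", "new", 1),
]

counts = friend_feature_counts(rows)
print(share_of(counts["hobby"], "outdoor"))  # 0.8, a clear trend
print(share_of(counts["gender"], "male"))    # 0.6, too close to call
```

Because each feature value appears equally often in a balanced set, a share well above one half among friends already signals a preference; with unbalanced data you would also have to look at the non-friends.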

The big takeaway was: we can find what kind of people each of us makes friends with! The kids seemed to understand and appreciate this. We told them that they could have done this differently, by interviewing the person and then trying to say whom he will befriend, but we do it differently, ‘learning by example’: we see whom they befriend, analyze it to figure out trends, and are then able to predict!

Making and validating a super simple predictor

Ideally we wanted the kids to make a predictor with a simple point-based system, but we didn’t get there. We did, however, go ahead and take the example of the kid just discussed, who had shown outdoor games as the key deciding factor, and used that feature as the predictor. We took her ‘validation’ data from the envelope and saw how well we did: it was only OK, honestly! But the kids got the concept. They could predict unseen data based on a set of seen data.
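A one-feature predictor and its check against held-out data might look like the sketch below. The validation rows are made up, the field names are hypothetical, and the "rating of 4 or more means friend" rule is our assumption; the number it prints is not the result we got at the camp.

```python
def predict_friend(face, feature="hobby", liked_value="outdoor"):
    """Predict 'friend' whenever the face has the liked feature value."""
    return face[feature] == liked_value

def accuracy(validation_rows, threshold=4):
    """Fraction of held-out faces where the prediction matches the rating."""
    hits = sum(
        predict_friend(row) == (row["rating"] >= threshold)
        for row in validation_rows
    )
    return hits / len(validation_rows)

# Made-up validation data, not the kid's actual sheet
validation_rows = [
    {"hobby": "outdoor", "rating": 5},  # predicted friend, is friend
    {"hobby": "outdoor", "rating": 4},  # predicted friend, is friend
    {"hobby": "indoor", "rating": 1},   # predicted not, is not
    {"hobby": "indoor", "rating": 5},   # predicted not, but is friend
]
print(accuracy(validation_rows))  # 0.75
```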

We then got a data release form signed by them and explained that they have the right not to have their data publicly disclosed, that we seek their permission, and that we will anonymize their data. One girl opted out. The rest of the data can be found here.

When one of the authors asked with a wink how many from the gathering would like to come over for a part 2 of the data camp the following week, ten of them raised their hands :) A good test for us. See the kids’ blog entries here and mentor experiences here! Harsh also suggested to them that they should start making data entries of their expenditure and pocket money! Some really interesting suggestions came from kids regarding what they would do with this knowledge.

Do note that we made a lot of simplifying assumptions here: correlation vs. causality, balanced sets, no significance testing, small sample size, etc. Our aim was to lead them to a naiver Naive Bayes. We think this is a fine approach, like the famous Arundhati nyaya.
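We never got to build the point-based predictor, so the following is only one possible sketch of what a "naiver Naive Bayes" could look like, not what was done at the camp. The scoring rule (one point per feature matching the value more common among friends, predict "friend" at two or more points), the field names and the data are all our assumptions.

```python
from collections import Counter

FEATURES = ("gender", "hobby", "name_style")

def learn_preferred_values(train_rows, threshold=4):
    """For each feature, find the value most common among friends."""
    friends = [r for r in train_rows if r["rating"] >= threshold]
    preferred = {}
    for feature in FEATURES:
        counts = Counter(r[feature] for r in friends)
        if counts:
            # Ties go to whichever value was seen first
            preferred[feature] = counts.most_common(1)[0][0]
    return preferred

def points(face, preferred):
    """One point for every feature that matches the preferred value."""
    return sum(face[f] == v for f, v in preferred.items())

def predict_friend(face, preferred, min_points=2):
    return points(face, preferred) >= min_points

def make_face(gender, hobby, name_style, rating=None):
    return {"gender": gender, "hobby": hobby,
            "name_style": name_style, "rating": rating}

# Made-up training data
train_rows = [
    make_face("male", "outdoor", "new", 5),
    make_face("female", "outdoor", "old", 5),
    make_face("male", "outdoor", "old", 4),
    make_face("female", "outdoor", "new", 4),
    make_face("male", "indoor", "new", 4),
    make_face("female", "indoor", "old", 2),
    make_face("male", "indoor", "old", 1),
    make_face("female", "indoor", "new", 1),
]

preferred = learn_preferred_values(train_rows)
new_face = make_face("female", "outdoor", "new")
print(points(new_face, preferred), predict_friend(new_face, preferred))  # 2 True
```

Unlike real Naive Bayes, this gives every feature the same weight, however strong or weak its trend, which is part of what makes it "naiver".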

Learnings:

We need 5–6 hours to run this right; with that, we would have done the model too and explained things a lot better.

We didn’t have a ‘what next?’: a strong takeaway, a resource sheet and continuity.

Kids need to know the concept of percentages — we think 7th to 9th might be a better target.

Currently, we have 1 mentor for every 2 kids. We need this to be more scalable. Should be possible.

Would want to emphasize explaining data science vs. other ways of doing things through some examples. We give them a problem, they try it and then we give the data way of doing it.

More visualizations to share.

And of course, this is just the TIP OF THE ICEBERG!

Read the entire blog, mentor stories and children’s feedback (cute) at DataScienceKids.

We’ve given out the entire exercise set, curriculum and experimental setup on the blog so that you can host your own data camp for kids too!

Thanks Harsh, Bhanu, Nishant, Gursimran (for the photos also!), Parth, Abhishek, Vishal, Samarth — good show. Thanks Una-May for the encouragement and helpful ideas!

From left: Bhanu, Varun, Tushar, Shashank, Nishant, Gursimran, Harsh and Vishal (bottom)

Varun & Shashank

Read more at www.datasciencekids.org