One of the more challenging parts of using real data to teach statistics is finding data sets that are accessible to your students, and then wrangling the data into a form that is usable within the available software. One of the many things that the free statistics software R has going for it is a large bank of pre-installed demo data sets that can be easily accessed and manipulated.
Once R is installed, to find out about the available data sets simply type the command data() and a window will appear with a complete list and a short description of each. Typing the name of one of the data sets and hitting enter will display it in all its glory.
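If the full listing feels overwhelming, the same information can be pulled into an object and inspected a piece at a time. The following is only a sketch: the variable name available is my own, and it assumes you only want the data sets from the built-in "datasets" package.

```r
# List the data sets that ship with R's built-in "datasets" package.
# 'available' is just an illustrative name for the resulting table.
available <- data(package = "datasets")$results[, c("Item", "Title")]
head(available)      # first few names and their short descriptions

# Take a quick look at one data set without printing every row.
head(faithful)       # first six rows
str(faithful)        # column names, types and number of observations
```

Typing ?faithful at the console opens the help page for the data set, which describes each column and its units.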
Once you have a sense of the datasets themselves, they can be very useful starting points for classroom activities and investigation. In this blog I will describe an easy way to get started with the “faithful” data set, which contains the duration of, and time between, eruptions of the Old Faithful geyser in Yellowstone National Park. Setting a homework to learn about Old Faithful in advance of the lesson may help students to familiarise themselves with the context before being asked to think about the data in detail.
So, now you have some data – what can be done? Get students to suggest a starting point; perhaps a graph of some kind would be useful? Try typing plot(faithful)
The plot() command chooses a default graph based on the kind of data it is given. In this case, as the data set has two numeric columns, it draws a scatter graph. Ask students: what does the graph appear to show? Are there any obvious features? The eagle-eyed may spot that although there is some sense of a positive correlation, there actually appear to be two sub-populations, each with little or no correlation.
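If students want to put numbers on what they see, one rough way is to compare the correlation for all the data with the correlation inside each cluster. This is a sketch rather than a definitive analysis: the 3 minute cut-off used to separate short and long eruptions is my own assumption, chosen by eye from the scatter graph, and the variable names are illustrative.

```r
# Scatter graph with explicit axis labels.
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption duration (minutes)",
     ylab = "Waiting time (minutes)")

# Correlation across all observations.
overall_cor <- cor(faithful$eruptions, faithful$waiting)

# ASSUMPTION: a 3 minute cut-off, picked by eye, splits the two clusters.
is_long <- faithful$eruptions > 3

# Correlation within each cluster separately.
short_cor <- cor(faithful$eruptions[!is_long], faithful$waiting[!is_long])
long_cor  <- cor(faithful$eruptions[is_long],  faithful$waiting[is_long])

c(overall = overall_cor, short = short_cor, long = long_cor)
```

Students could try moving the cut-off and see whether their conclusion changes.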
Challenge your students to suggest a possible explanation for this. Why might some be shorter and some be longer? They may notice that the longer duration eruptions appear to coincide with a longer waiting time between eruptions.
Ask students how they might choose to investigate further and try creating graphs of their suggestions. Perhaps plot just the waiting times: plot(faithful$waiting, type = "h") or just the eruption lengths: plot(faithful$eruptions, type = "h"). Adding the $ sign allows you to specify a named column from the data set and ignore the rest. Adding type = "h" tells R to draw a vertical line for each observation, in the order recorded, instead of plotting individual points. Note that the quotation marks must be plain straight quotes; curly quotes pasted from a word processor will cause an error.
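To compare the two columns directly, it can help to stack the graphs one above the other. The sketch below uses par(mfrow = ...) to do this; the axis labels are my own wording, based on the description of the data set given earlier.

```r
# Stack two graphs vertically, remembering the previous layout settings.
old_par <- par(mfrow = c(2, 1))

# One vertical line per observation, in the order recorded.
plot(faithful$waiting, type = "h",
     xlab = "Observation number", ylab = "Waiting time")
plot(faithful$eruptions, type = "h",
     xlab = "Observation number", ylab = "Eruption duration")

# Restore the original single-graph layout.
par(old_par)
```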
Either of these graphs shows a new pattern: the short and long durations appear to oscillate, so students can hypothesise that a long eruption is likely to be followed by a shorter one, or vice versa.
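One way to test the oscillation idea is to pair each eruption with the one that follows it and see how the two durations are related. This is a minimal sketch with variable names of my own choosing; a negative correlation would support the idea that long and short eruptions tend to alternate.

```r
durations <- faithful$eruptions

# Pair each eruption with the one immediately after it.
current   <- head(durations, -1)   # every value except the last
following <- tail(durations, -1)   # every value except the first

# Correlation between an eruption and the next one.
lag1_cor <- cor(current, following)
lag1_cor

# Scatter graph of each eruption against its successor.
plot(current, following,
     xlab = "Duration of this eruption",
     ylab = "Duration of next eruption")
```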
It can be difficult to see this pattern with all 272 data points, but using the head() command will allow you to select just the first n values as follows: plot(head(faithful$eruptions, n = 20), type = "h")
This is almost the same as before but head(faithful$eruptions, n=20) has simply replaced faithful$eruptions to give control over the amount of data plotted. Try changing the value of n to use more or less data.
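If you would rather not retype the command for each value of n, a short loop can draw several versions in a row. The particular values of n below are arbitrary choices of mine, and the titles are there only to show which graph is which.

```r
# Draw the same graph for several (arbitrarily chosen) values of n.
for (n in c(10, 20, 50)) {
  first_n <- head(faithful$eruptions, n = n)   # first n eruption durations
  plot(first_n, type = "h",
       main = paste("First", n, "eruptions"),
       xlab = "Observation number", ylab = "Eruption duration")
}
```

In an interactive session each new graph replaces the previous one, so step through the loop slowly or use the plot history in your R interface to flick between them.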
The more students are able to examine data from different viewpoints, the more they will be able to spot features of data, suggest explanations and create new avenues of investigation. The pre-installed data sets, alongside the quick and easy (when you know how) graphing capability in R, can help you to lead discussion-based lessons in which you can respond to student suggestions on the fly, rather than having to predict all their possible ideas in advance and prepare resources for them in more unwieldy software such as Excel.