How to generate meaningful fake data for learning, experimentation and teaching using {fakir}

The Problem

For a lot of people, one image of R is top of mind: the black-and-white plot of the iris dataset, which gives a thoroughly boring view of R. It is boring partly because of its aesthetics, but also because it is such a clichéd example, used over and over again. The other problem is finding the right dataset for the problem you want to teach, learn or experiment with. Say you want to teach time series: your spam/ham classification dataset isn’t going to be of any use there.

Solution

No more worries: that’s where fakir comes in. fakir is an R package by Colin Fay (of ThinkR), who has contributed a great deal to the R community.

Video Tutorial

https://www.youtube.com/watch?v=EhhljL5zaWs

About fakir

As the documentation puts it, the goal of fakir is to provide fake datasets that can be used to teach R.

Installation and Loading

fakir can be installed from GitHub (fakir isn’t available on CRAN yet):

    # install.packages("devtools")
    devtools::install_github("ThinkR-open/fakir")
    library(fakir)

Use-case: Clickstream / Web Data

Clickstream / web data is something a lot of organizations use in analytics these days, but it’s hard to get your hands on any, since no company would want to share theirs. There’s sample data in the Google Analytics test account, but it may not serve your purpose when learning data science in R or R’s ecosystem. This is a typical case where fakir can help you:

    library(tidyverse)
    fakir::fake_visits() %>% head()

    ## # A tibble: 6 x 8
    ##   timestamp   year month   day  home about  blog contact
    ##   <date>     <dbl> <dbl> <int> <int> <int> <int>   <int>
    ## 1 2017-01-01  2017     1     1    NA    64   446     145
    ## 2 2017-01-02  2017     1     2   159   102   487     250
    ## 3 2017-01-03  2017     1     3    NA    59   479     433
    ## 4 2017-01-04  2017     1     4   123   202   601     109
    ## 5 2017-01-05  2017     1     5   362   162   311     378
    ## 6 2017-01-06  2017     1     6    NA   244   450     350

That’s how simple it is to get sample (tidy) clickstream data with fakir. Another good thing to mention: if you look at the fake_visits() documentation, you’ll find an argument that takes a seed value, which means you are in control of randomizing the data and reproducing it.

    fake_visits(from = "2017-01-01", to = "2017-12-31", local = c("en_US", "fr_FR"), seed = 2811) %>% head()

    ## # A tibble: 6 x 8
    ##   timestamp   year month   day  home about  blog contact
    ##   <date>     <dbl> <dbl> <int> <int> <int> <int>   <int>
    ## 1 2017-01-01  2017     1     1    NA    64   446     145
    ## 2 2017-01-02  2017     1     2   159   102   487     250
    ## 3 2017-01-03  2017     1     3    NA    59   479     433
    ## 4 2017-01-04  2017     1     4   123   202   601     109
    ## 5 2017-01-05  2017     1     5   362   162   311     378
    ## 6 2017-01-06  2017     1     6    NA   244   450     350
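To see what the seed argument buys you, here is a minimal sketch. It assumes fakir is installed and that fake_visits() accepts seed as in the signature shown above; the seed values 42 and 7 are arbitrary choices for illustration, not anything recommended by the package.

```r
library(fakir)

# Two calls with the same seed (42 is an arbitrary, illustrative value)
visits_a <- fake_visits(seed = 42)
visits_b <- fake_visits(seed = 42)

# One call with a different seed (7 is equally arbitrary)
visits_c <- fake_visits(seed = 7)

# Same seed: the two tibbles are expected to match exactly
same_seed_matches <- identical(visits_a, visits_b)

# Different seed: the values are expected to differ, the columns stay the same
different_seed_matches <- identical(visits_a, visits_c)

same_seed_matches
different_seed_matches
```

For teaching, this means a whole class can generate the same "random" dataset by agreeing on a seed, while each exercise can still use a different one.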

Use-case: French Data

In the fake_visits() call above, you might also have noticed another argument, local, which lets you select French data instead of English. In my personal opinion, this is crucial if you are on a mission to improve data literacy or democratise data science.

    fake_ticket_client(vol = 10, local = "fr_FR") %>% head()

    ## # A tibble: 6 x 25
    ##   ref   num_client prenom nom   job     age region id_dpt departement
    ##   <chr> <chr>      <chr>  <chr> <chr> <dbl> <chr>  <chr>  <chr>
    ## 1 DOSS~ 79         Phili~ Bért~ Prof~    18 Poito~ 86     Vienne
    ## 2 DOSS~ 69         Étien~ Dupo~ Char~    42 Breta~ 22     Côtes-d'Ar~
    ## 3 DOSS~ 120        Roland Pasc~ Admi~    34 Île-d~ 77     Seine-et-M~
    ## 4 DOSS~ 31         Noël   Bena~ Cons~    43 Poito~ 79     Deux-Sèvres
    ## 5 DOSS~ 59         Jean   Pelt~ Ingé~    46 Picar~ 80     Somme
    ## 6 DOSS~ 118        Adèle  Pare~ <NA>     19 <NA>   41     Loir-et-Ch~
    ## # ... with 16 more variables: gestionnaire_cb <chr>, nom_complet <chr>,
    ## #   entry_date <dttm>, points_fidelite <dbl>, priorite_encodee <dbl>,
    ## #   priorite <fct>, timestamp <date>, annee <dbl>, mois <dbl>, jour <int>,
    ## #   pris_en_charge <chr>, pris_en_charge_code <int>, type <chr>,
    ## #   type_encoded <int>, etat <fct>, source_appel <fct>

In the example above, we used another fakir function, fake_ticket_client(), which gives us a typical ticket dataset (like the one you would get from ServiceNow or Zendesk).
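To see what the local argument actually changes, one option is to generate the same ticket data in both locales and compare the column names. This is only a sketch: it assumes fakir is installed and that fake_ticket_client() accepts local = "en_US" in the same way fake_visits() does, which the output above does not confirm.

```r
library(fakir)

# Same call in two locales ("en_US" is assumed to be accepted here as well)
tickets_en <- fake_ticket_client(vol = 10, local = "en_US")
tickets_fr <- fake_ticket_client(vol = 10, local = "fr_FR")

# Inspect the column names of each version
names(tickets_en)
names(tickets_fr)

# Columns that appear only in the French version (e.g. prenom, nom)
fr_only <- setdiff(names(tickets_fr), names(tickets_en))
fr_only
```

Comparing names first is a cheap way to find out which columns you would need to rename if you wanted to reuse an English-language exercise with the French data.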

Use-case: Scatter Plot

Remember the rant I made at the start of this post about iris? (Don’t get me wrong: I have huge respect for the scientists who created this dataset; it’s just the wrong use and overuse of it that I don’t appreciate.) We can now get past it with fakir’s datasets.

    fake_visits() %>%
      ggplot() +
      geom_point(aes(blog, about, color = as.factor(month)))

    ## Warning: Removed 51 rows containing missing values (geom_point).

(Perhaps not a good scatter plot for showing correlation, but hey, you can teach scatter plots without plotting Petal Length and Sepal Length.)
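The warning in the output above appears because ggplot2 drops rows where a plotted column is NA. The following base R sketch shows how you might inspect and remove such rows yourself before plotting. It uses only the six rows of fake_visits() printed earlier in this post, retyped by hand; in those rows only home contains NA, so the sketch filters on home and blog rather than on blog and about.

```r
# First six rows of fake_visits() as printed earlier in this post
visits <- data.frame(
  timestamp = as.Date("2017-01-01") + 0:5,
  home      = c(NA, 159L, NA, 123L, 362L, NA),
  about     = c(64L, 102L, 59L, 202L, 162L, 244L),
  blog      = c(446L, 487L, 479L, 601L, 311L, 450L),
  contact   = c(145L, 250L, 433L, 109L, 378L, 350L)
)

# Count missing values per column
na_counts <- colSums(is.na(visits))
na_counts

# Keep only the rows where both columns to be plotted are present
plot_cols <- c("home", "blog")
visits_complete <- visits[complete.cases(visits[, plot_cols]), ]
visits_complete
```

With the full dataset you would replace the hand-typed data frame with fakir::fake_visits() and set plot_cols to the columns you intend to plot. The built-in missing values are also a handy hook for teaching how to handle NA.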

Summary

If you are in the business of teaching, or like experimenting, and don’t want to use clichéd datasets, fakir is a very nice package to get to know. As the author of fakir mentions in the package description, charlatan is another R package that helps generate meaningful fake data.
