Context

While creating the exploratory analysis of the NASA Landslide dataset, I was thinking about the process of diving into a dataset: how it should be approached to obtain real information and to leave the data ready for deeper work.

1. From general to particular: what does our dataset look like? How many rows does it have? How many columns?
2. Completeness of the dataset: is it full of NAs or not? In which columns should we execute the cleanup?
3. Distribution of discrete and continuous variables, to understand their composition.
4. Correlation between variables: is there any correlation between the variables of our dataset?
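The first two points of this checklist can be answered with a handful of base R calls. This is a generic sketch on a toy data frame, not the landslide data itself:

```r
# toy data frame standing in for any freshly loaded dataset
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

dim(df)             # step 1: number of rows and columns
str(df)             # column types: discrete vs. continuous
colSums(is.na(df))  # step 2: NA count per column
mean(is.na(df))     # overall fraction of missing cells
```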

Obviously, there are different ways to obtain a snapshot of our data and to clean it, but the general idea is to discover how our data are composed and to leave them ready to work with. An EDA per se shouldn't be the end of our work, but just the beginning and the foundation for answering our questions about the data.

With this approach in mind, let’s analyze the dataset of landslides.

Data

Landslides are one of the most pervasive hazards in the world, causing injuries and fatalities in almost every country. They have several triggers, but one of the main ones is intense and prolonged rainfall over saturated soil on vulnerable slopes. The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impact, or location. The GLC considers all types of mass movements triggered by rainfall that have been reported in the media, disaster databases, scientific reports, or other sources. It has been compiled since 2007 at NASA Goddard Space Flight Center and is a unique dataset, with the ID tag "GLC" in the landslide editor.

Initial numbers

We are looking at a small dataset: around 11,000 records, with 17% of observations missing.

library(readr)
library(DataExplorer)

url1 <- "https://raw.githubusercontent.com/frm1789/landslide-ea/master/global_landslide_catalog_export.csv"
df <- read_csv(url1)
DataExplorer::introduce(df)

Name                    Value
Rows                    11,033
Columns                 31
Discrete columns        23
Continuous columns      8
All missing columns     0
Missing observations    58,287
Complete rows           0
Total observations      342,023
Memory allocation       11.3 Mb
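The 17% figure quoted above follows directly from this table: missing observations divided by total observations (rows × columns). A quick sanity check:

```r
missing <- 58287   # missing observations, from introduce(df)
total   <- 342023  # total observations (11,033 rows x 31 columns)
round(100 * missing / total, 1)  # about 17% of all cells are missing
```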

Distribution of data

Because we knew that the data came from two different origins, before moving forward with the EDA we can check, as a first step, how the information is distributed across the years. We can do that through a histogram by year. All the code for this image is here.

df$year <- substring(df$event_date, 9, 10)  # extract the two-digit year
count_byyear <- dplyr::count(df, year)

a <- ggplot(df, aes(year)) +
  geom_histogram(stat = "count", fill = "#453781FF") +
  labs(
    title = "Landslide 2007 - 2016",
    subtitle = "Distribution of events by year",
    caption = "source: GLC by NASA Goddard Space Flight Center\nby thinkingondata.com"
  ) +
  theme_minimal()
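Note that substring(df$event_date, 9, 10) assumes the year always sits at the same character positions, which is fragile if the date strings vary in length. A more defensive alternative is to parse the date explicitly; this is a sketch assuming event_date values in an m/d/yy format (the sample strings below are hypothetical — check the actual format in your export):

```r
# hypothetical sample values in m/d/yy format
dates <- c("3/2/07 0:00", "11/22/16 14:30")
d <- as.Date(dates, format = "%m/%d/%y")  # trailing time is ignored by the parser
format(d, "%y")  # two-digit year
```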

After seeing the distribution, we can conclude that it is a good idea to work only with the information from 2007 to 2017; otherwise, our results could be biased with respect to what really happened.

Cleaning by years

We executed the cleaning and, by counting the rows before and after, we can see that we lost 45 rows, less than 0.4% of the data: a tiny price to pay for complete and reliable information from 2007 to 2017.

nrow(df)  # 11033
c_year <- c("88", "93", "95", "96", "97", "98", "03", "04", "05", "06")
df_new <- df %>% filter(!(year %in% c_year))
nrow(df_new)  # 10988
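We can verify the "less than 0.4%" claim with a line of arithmetic:

```r
rows_before <- 11033
rows_after  <- 10988
rows_lost   <- rows_before - rows_after  # 45 rows removed
round(100 * rows_lost / rows_before, 2)  # about 0.41% of the data
```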

Missing observations

Using the DataExplorer library, we can check how many missing observations the dataset has, evaluate whether we need to execute a cleaning, and compare the before & after numbers of that cleaning.

Preliminary analysis

DataExplorer gives us the plot_missing function, which returns and plots the frequency of missing values for each feature. Let's check the status of our data.

b <- DataExplorer::plot_missing(df_new, title = NULL, ggtheme = theme_minimal(), theme_config = list(legend.position = c("bottom")))

Cleaning of the data

After checking the previous image, we can classify the columns into three groups:

Remove: columns that must be removed because they are almost empty in nearly all cases.

Bad: more than 50% NA. Deleting them is generally a good idea, but they deserve a closer look: one of these columns is "injury_count", and perhaps we simply don't have an injured person in every event.

Ok: in most cases, the decision to delete them or not depends on their ability to offer information that we consider relevant to our analysis.

c_del <- c("event_id", "event_import_id", "gazeteer_distance", "longitude", "latitude", "photo_link", "notes", "source_name", "source_link", "event_time", "event_title", "event_description", "location_description", "event_import_source", "event_date", "location_accuracy", "country_code", "admin_division_name", "admin_division_population", "gazeteer_closest_point", "submitted_date", "created_date", "last_edited_date", "storm_name")
df_new <- df_new[, -which(names(df_new) %in% c_del)]

Before and after

c <- DataExplorer::plot_missing(df_new, title = NULL, ggtheme = theme_minimal(), theme_config = list(legend.position = c("bottom")))
gridExtra::grid.arrange(b, c, ncol = 2)

Distribution of discrete variables

We can check the distribution for:

landslide_setting

landslide_category

landslide_size

landslide_trigger

First, we create each histogram and later, with the help of the gridExtra library, we reorganize all the histograms into a single image. The complete code is here.

gridExtra::grid.arrange(a, b, c, d)
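As a sketch of how those per-column histograms might be built and combined, using a small helper and a toy data frame (df_new is not reproduced here, so the values below are hypothetical; the column names follow the GLC schema):

```r
library(ggplot2)
library(gridExtra)

# toy stand-in for df_new with hypothetical values
df_demo <- data.frame(
  landslide_size    = sample(c("small", "medium", "large"), 100, replace = TRUE),
  landslide_trigger = sample(c("rain", "downpour", "unknown"), 100, replace = TRUE)
)

# one bar chart per discrete column
plot_discrete <- function(data, col) {
  ggplot(data, aes(x = .data[[col]])) +
    geom_bar(fill = "#453781FF") +
    labs(title = col) +
    theme_minimal()
}

plots <- lapply(names(df_demo), function(col) plot_discrete(df_demo, col))
gridExtra::grid.arrange(grobs = plots, ncol = 2)
```

The same helper applied to the four real columns (landslide_setting, landslide_category, landslide_size, landslide_trigger) reproduces the combined image described above.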