Exploratory Data Analysis of Forest Fire Data

Introduction

Forest fires can create ecological problems and endanger human lives and property. Understanding when they occur and what causes them is important for managing them.

Visualization is one of the most important skills that every Data Analyst should possess. In this project, I perform the simplest form of exploratory data analysis on the Forest Fire data set.

This data set describes forest fires occurring in Portugal and is based on this scientific research paper.

Main Goal: To find out patterns in the occurrence of forest fires

Initial Setup

#loading the tidyverse and importing data
library(tidyverse)
forest_fire <- read_csv("forestfires.csv")

Getting preliminary insights

glimpse(forest_fire)

## Rows: 517
## Columns: 13
## $ X     <dbl> 7, 7, 7, 8, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 5, 8, 6, 6, 6...
## $ Y     <dbl> 5, 4, 4, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4...
## $ month <chr> "mar", "oct", "oct", "mar", "mar", "aug", "aug", "aug", "sep"...
## $ day   <chr> "fri", "tue", "sat", "fri", "sun", "sun", "mon", "mon", "tue"...
## $ FFMC  <dbl> 86.2, 90.6, 90.6, 91.7, 89.3, 92.3, 92.3, 91.5, 91.0, 92.5, 9...
## $ DMC   <dbl> 26.2, 35.4, 43.7, 33.3, 51.3, 85.3, 88.9, 145.4, 129.5, 88.0,...
## $ DC    <dbl> 94.3, 669.1, 686.9, 77.5, 102.2, 488.0, 495.6, 608.2, 692.6, ...
## $ ISI   <dbl> 5.1, 6.7, 6.7, 9.0, 9.6, 14.7, 8.5, 10.7, 7.0, 7.1, 7.1, 22.6...
## $ temp  <dbl> 8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 8.0, 13.1, 22.8, 17.8...
## $ RH    <dbl> 51, 33, 33, 97, 99, 29, 27, 86, 63, 40, 51, 38, 72, 42, 21, 4...
## $ wind  <dbl> 6.7, 0.9, 1.3, 4.0, 1.8, 5.4, 3.1, 2.2, 5.4, 4.0, 7.2, 4.0, 6...
## $ rain  <dbl> 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...
## $ area  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

So there are 517 rows and 13 columns in our data set. Let’s see some of the rows.

head(forest_fire, n=11)

## # A tibble: 11 x 13
##        X     Y month day    FFMC   DMC    DC   ISI  temp    RH  wind  rain  area
##    <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     7     5 mar   fri    86.2  26.2  94.3   5.1   8.2    51   6.7   0       0
##  2     7     4 oct   tue    90.6  35.4 669.    6.7  18      33   0.9   0       0
##  3     7     4 oct   sat    90.6  43.7 687.    6.7  14.6    33   1.3   0       0
##  4     8     6 mar   fri    91.7  33.3  77.5   9     8.3    97   4     0.2     0
##  5     8     6 mar   sun    89.3  51.3 102.    9.6  11.4    99   1.8   0       0
##  6     8     6 aug   sun    92.3  85.3 488    14.7  22.2    29   5.4   0       0
##  7     8     6 aug   mon    92.3  88.9 496.    8.5  24.1    27   3.1   0       0
##  8     8     6 aug   mon    91.5 145.  608.   10.7   8      86   2.2   0       0
##  9     8     6 sep   tue    91   130.  693.    7    13.1    63   5.4   0       0
## 10     7     5 sep   sat    92.5  88   699.    7.1  22.8    40   4     0       0
## 11     7     5 sep   sat    92.5  88   699.    7.1  17.8    51   7.2   0       0

Description of Various Columns

Here are descriptions of the variables in the data set and the range of values for each taken from the paper:

X: X-axis spatial coordinate within the Montesinho park map: 1 to 9
Y: Y-axis spatial coordinate within the Montesinho park map: 2 to 9
month: Month of the year: ‘jan’ to ‘dec’
day: Day of the week: ‘mon’ to ‘sun’
FFMC: Fine Fuel Moisture Code index from the FWI system: 18.7 to 96.20
DMC: Duff Moisture Code index from the FWI system: 1.1 to 291.3
DC: Drought Code index from the FWI system: 7.9 to 860.6
ISI: Initial Spread Index from the FWI system: 0.0 to 56.10
temp: Temperature in Celsius degrees: 2.2 to 33.30
RH: Relative humidity in percentage: 15.0 to 100
wind: Wind speed in km/h: 0.40 to 9.40
rain: Outside rain in mm/m2: 0.0 to 6.4
area: The burned area of the forest (in ha): 0.00 to 1090.84

The X and Y variables are coordinates of fire locations.

The acronym FWI stands for “fire weather index”, a method used by scientists to quantify risk factors for forest fires.

Visualizing the relation between forest fires and Weekdays & Months

#generic function to create bar graphs
#x and y are column names passed as strings
create_bar <- function(dframe, x, y){
  ggplot(data = dframe,
         aes(x = .data[[x]], y = .data[[y]])  #aes_string() is deprecated
  )+
    geom_col()+
    labs(y="Total Forest Fires")+
    theme(panel.background = element_rect(fill = "white"))
}

forest_fire_by_month <- forest_fire %>%
  group_by(month) %>%
  summarise(
    count = n(),
    .groups="drop"
  )%>%
  mutate(
    month = factor(month,
                   levels=c("jan", "feb", "mar", "apr", "may",
                            "jun", "jul", "aug", "sep", "oct",
                            "nov", "dec")
                   )
  )

forest_fire_by_day <- forest_fire %>%
  group_by(day) %>%
  summarise(
    count = n(),
    .groups="drop"
  )%>%
  mutate(
    day = factor(day, 
                 levels=c("sun", "mon", "tue", "wed", "thu",
                          "fri", "sat")
    )
  )

#plotting the bar graphs
create_bar(forest_fire_by_day,"day", "count")

create_bar(forest_fire_by_month, "month", "count")
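The factor() calls above are what put the bars in calendar and weekday order, because ggplot2 orders a discrete axis by factor levels. Here is a minimal sketch of that effect using base R only. The vector months_seen is a made-up stand-in and not part of the forest fire data.

```r
# Hypothetical month labels, deliberately out of calendar order
months_seen <- c("sep", "aug", "mar", "aug", "feb")

# Without explicit levels, factor() sorts the labels alphabetically
alphabetical <- factor(months_seen)
levels(alphabetical)

# With explicit levels, the calendar order is kept.
# month.abb is a base R constant ("Jan", "Feb", ...), lower-cased here
# to match the labels used in the data set.
calendar <- factor(months_seen, levels = tolower(month.abb))
levels(calendar)

# table() counts per level, in level order, including months with no rows
table(calendar)
```

I typed the levels out by hand in the summaries above, which does the same thing as tolower(month.abb) and is easier to read at a glance.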

Examining the spread of various factors using box plots

#generic function to create box plots
#x and y are column names passed as strings
create_box <- function(x, y){
  ggplot(data=forest_fire,
         aes(x=.data[[x]],
             y=.data[[y]]
             )
         )+
    geom_boxplot()+
    theme(panel.background = element_rect(fill = "white"))
}

#Creating various box plots by vectorization

x_var_month <- names(forest_fire)[3] ## month
x_var_day <- names(forest_fire)[4] ## day
y_var <- names(forest_fire)[5:12]

month_box <- map2(x_var_month, y_var, create_box)
day_box <- map2(x_var_day, y_var, create_box)
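The two map2() calls above pass a single x name together with eight y names. This works because map2() recycles a length-1 argument to match the longer one. Here is a small sketch of that behaviour, where the labelling function is only a stand-in for create_box and the three names are a shortened example:

```r
library(purrr)

x_name <- "month"                   # length 1, recycled by map2()
y_names <- c("FFMC", "DMC", "DC")   # length 3

# Stand-in for create_box: returns a label instead of a plot
pairs <- map2(x_name, y_names, function(x, y) paste(y, "by", x))

length(pairs)   # one list element per y name
pairs[[2]]
```

So month_box and day_box are each lists of eight plots, one per column in y_var.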

month_box

[Figures: eight box plots, one each for FFMC, DMC, DC, ISI, temp, RH, wind and rain, grouped by month]

day_box

[Figures: eight box plots, one each for FFMC, DMC, DC, ISI, temp, RH, wind and rain, grouped by day]

Examining the relation between Area and various factors

#generic function to create scatter plots
#x and y are column names passed as strings
create_scatter <- function(x, y){
  ggplot(data=forest_fire,
         aes(
           x=.data[[x]],
           y=.data[[y]]
         ))+
    geom_point()+
    theme(panel.background = element_rect(fill = "white"))
}

x_var_scatter <- names(forest_fire)[5:12]
y_var_scatter <- names(forest_fire)[13] ##area

scatters <- map2(x_var_scatter, y_var_scatter, create_scatter)
scatters

[Figures: eight scatter plots of area against FFMC, DMC, DC, ISI, temp, RH, wind and rain]

We can notice that the points in each scatter plot are clustered at the bottom, because there are very few high values of area and most values fall in the lower range.

Let’s look at the range of the area column

summary(forest_fire$area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.52   12.85    6.57 1090.84

Looking at the 3rd quartile, we can conclude that about 75% of the values of area are at or below 6.57 ha.
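To make the quartile reading concrete, here is a small sketch that checks it directly. The helper share_at_or_below is a name I made up for illustration, and the vector toy_area is invented so the example runs without the CSV file.

```r
# Hypothetical helper: fraction of values at or below a cutoff
share_at_or_below <- function(x, cutoff){
  mean(x <= cutoff)
}

# Made-up, right-skewed values standing in for the area column
toy_area <- c(0, 0, 0.5, 1, 2, 3, 6, 100)

# 3rd quartile, using the default method of quantile()
q3 <- unname(quantile(toy_area, 0.75))

# Share of values that do not exceed the 3rd quartile
share <- share_at_or_below(toy_area, q3)
share
```

The same check on the real data would be share_at_or_below(forest_fire$area, 6.57).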

For more clarity, let’s plot a histogram of the area column
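The histogram code is not shown here, so the following is only a minimal sketch of how it could be drawn in the style of the helpers above. The function name create_hist and the default of 30 bins are my assumptions, not taken from the original analysis.

```r
library(ggplot2)

# Hypothetical helper: histogram of one numeric column.
# dframe is a data frame, x is a column name passed as a string,
# bins is the number of bins (30 is an arbitrary default).
create_hist <- function(dframe, x, bins = 30){
  ggplot(data = dframe,
         aes(x = .data[[x]])
  )+
    geom_histogram(bins = bins)+
    labs(y = "Total Forest Fires")+
    theme(panel.background = element_rect(fill = "white"))
}

# Intended call, with forest_fire as loaded in Initial Setup:
# create_hist(forest_fire, "area")
```

Given the summary above, most of the counts should land in the lowest bin, with a long thin tail stretching towards the maximum of 1090.84.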

Conclusion

This exercise shows how important it is to visualize data. Doing so will help us build statistical models for making predictions.