Introduction

Visualization is one of the most important skills a data analyst can have. In this project, I perform a simple exploratory data analysis of the Forest Fires data set.

This data set is about forest fires occurring in Portugal and is based on this scientific research paper.

Main Goal: To find patterns in the occurrence of forest fires

Initial Setup

#importing data
library(tidyverse)
forest_fire = read_csv("forestfires.csv")

Getting preliminary insights

glimpse(forest_fire)
## Rows: 517
## Columns: 13
## $ X     <dbl> 7, 7, 7, 8, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 5, 8, 6, 6, 6...
## $ Y     <dbl> 5, 4, 4, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4...
## $ month <chr> "mar", "oct", "oct", "mar", "mar", "aug", "aug", "aug", "sep"...
## $ day   <chr> "fri", "tue", "sat", "fri", "sun", "sun", "mon", "mon", "tue"...
## $ FFMC  <dbl> 86.2, 90.6, 90.6, 91.7, 89.3, 92.3, 92.3, 91.5, 91.0, 92.5, 9...
## $ DMC   <dbl> 26.2, 35.4, 43.7, 33.3, 51.3, 85.3, 88.9, 145.4, 129.5, 88.0,...
## $ DC    <dbl> 94.3, 669.1, 686.9, 77.5, 102.2, 488.0, 495.6, 608.2, 692.6, ...
## $ ISI   <dbl> 5.1, 6.7, 6.7, 9.0, 9.6, 14.7, 8.5, 10.7, 7.0, 7.1, 7.1, 22.6...
## $ temp  <dbl> 8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 8.0, 13.1, 22.8, 17.8...
## $ RH    <dbl> 51, 33, 33, 97, 99, 29, 27, 86, 63, 40, 51, 38, 72, 42, 21, 4...
## $ wind  <dbl> 6.7, 0.9, 1.3, 4.0, 1.8, 5.4, 3.1, 2.2, 5.4, 4.0, 7.2, 4.0, 6...
## $ rain  <dbl> 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...
## $ area  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

So there are 517 rows and 13 columns in our data set. Let’s see some of the rows.

head(forest_fire, n=11)
## # A tibble: 11 x 13
##        X     Y month day    FFMC   DMC    DC   ISI  temp    RH  wind  rain  area
##    <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     7     5 mar   fri    86.2  26.2  94.3   5.1   8.2    51   6.7   0       0
##  2     7     4 oct   tue    90.6  35.4 669.    6.7  18      33   0.9   0       0
##  3     7     4 oct   sat    90.6  43.7 687.    6.7  14.6    33   1.3   0       0
##  4     8     6 mar   fri    91.7  33.3  77.5   9     8.3    97   4     0.2     0
##  5     8     6 mar   sun    89.3  51.3 102.    9.6  11.4    99   1.8   0       0
##  6     8     6 aug   sun    92.3  85.3 488    14.7  22.2    29   5.4   0       0
##  7     8     6 aug   mon    92.3  88.9 496.    8.5  24.1    27   3.1   0       0
##  8     8     6 aug   mon    91.5 145.  608.   10.7   8      86   2.2   0       0
##  9     8     6 sep   tue    91   130.  693.    7    13.1    63   5.4   0       0
## 10     7     5 sep   sat    92.5  88   699.    7.1  22.8    40   4     0       0
## 11     7     5 sep   sat    92.5  88   699.    7.1  17.8    51   7.2   0       0

Description of Various Columns

Here are descriptions of the variables in the data set and the range of values for each taken from the paper:

  • X: X-axis spatial coordinate within the Montesinho park map: 1 to 9
  • Y: Y-axis spatial coordinate within the Montesinho park map: 2 to 9
  • month: Month of the year: ‘jan’ to ‘dec’
  • day: Day of the week: ‘mon’ to ‘sun’
  • FFMC: Fine Fuel Moisture Code index from the FWI system: 18.7 to 96.20
  • DMC: Duff Moisture Code index from the FWI system: 1.1 to 291.3
  • DC: Drought Code index from the FWI system: 7.9 to 860.6
  • ISI: Initial Spread Index from the FWI system: 0.0 to 56.10
  • temp: Temperature in Celsius degrees: 2.2 to 33.30
  • RH: Relative humidity in percentage: 15.0 to 100
  • wind: Wind speed in km/h: 0.40 to 9.40
  • rain: Outside rain in mm/m2: 0.0 to 6.4
  • area: The burned area of the forest (in ha): 0.00 to 1090.84

The X and Y variables are coordinates of fire locations.

The acronym FWI stands for “fire weather index”, a method used by scientists to quantify risk factors for forest fires.

Visualizing the relation between forest fires and weekdays & months

#generic function to create bar graphs
create_bar <- function(dframe,x, y){
  ggplot(data = dframe,
         aes_string(x, y)
  )+
    geom_bar(stat="identity")+
    labs(y="Total Forest Fires")+
    theme(panel.background = element_rect(fill = "white"))
}
forest_fire_by_month <- forest_fire %>%
  group_by(month) %>%
  summarise(
    count = n(),
    .groups="drop"
  )%>%
  mutate(
    month = factor(month,
                   levels=c("jan", "feb", "mar", "apr", "may",
                            "jun", "jul", "aug", "sep", "oct",
                            "nov", "dec")
                   )
  )
forest_fire_by_day <- forest_fire %>%
  group_by(day) %>%
  summarise(
    count = n(),
    .groups="drop"
  )%>%
  mutate(
    day = factor(day, 
                 levels=c("sun", "mon", "tue", "wed", "thu",
                          "fri", "sat")
    )
  )
#plotting the bar graphs
create_bar(forest_fire_by_day,"day", "count")

create_bar(forest_fire_by_month, "month", "count")

Examining spread of various factors using Box plots

create_box <- function( x, y){
  ggplot(data=forest_fire,
         aes_string(x=x,
                    y=y
                    )
         )+
    geom_boxplot()+
    theme(panel.background = element_rect(fill = "white"))
}
#Creating the box plots by mapping create_box over every y variable

x_var_month <- names(forest_fire)[3] ## month
x_var_day <- names(forest_fire)[4] ## day
y_var <- names(forest_fire)[5:12]

month_box <- map2(x_var_month, y_var, create_box)
day_box <- map2(x_var_day, y_var, create_box)
month_box
[Figures: eight box plots, one each for FFMC, DMC, DC, ISI, temp, RH, wind and rain, grouped by month]

day_box
[Figures: eight box plots, one each for FFMC, DMC, DC, ISI, temp, RH, wind and rain, grouped by day]

Examining the relation between area and various factors

#generic function to create scatter
create_scatter <- function(x, y){
  ggplot(data=forest_fire,
         aes_string(
           x=x,
           y=y 
         ))+
    geom_point()+
    theme(panel.background = element_rect(fill = "white"))
}
x_var_scatter = names(forest_fire)[5:12]
y_var_scatter = names(forest_fire)[13] ##area

scatters <- map2(x_var_scatter, y_var_scatter, create_scatter)
scatters
[Figures: eight scatter plots of area against FFMC, DMC, DC, ISI, temp, RH, wind and rain]

Notice that the points in each scatter plot are clustered at the bottom: there are very few high values of area, and most values lie in the lower range.
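One common way to make such a bottom-heavy scatter plot easier to read is to plot log(1 + area) instead of the raw area. The sketch below is a suggestion rather than part of the analysis above: `toy_fires` is a small made-up data frame standing in for `forest_fire`, and `create_log_scatter` is a hypothetical helper.

```r
library(ggplot2)

# Toy stand-in for forest_fire (made-up values, for illustration only)
toy_fires <- data.frame(
  temp = c(8.2, 18.0, 22.2, 24.1, 30.5),
  area = c(0, 0, 0.52, 6.57, 1090.84)
)

# Hypothetical helper: scatter of log(1 + area) against one factor.
# log1p() maps the many zero-area fires to 0 instead of -Inf.
create_log_scatter <- function(dframe, x) {
  ggplot(data = dframe,
         aes(x = .data[[x]], y = log1p(.data[["area"]]))) +
    geom_point() +
    labs(x = x, y = "log(1 + area)") +
    theme(panel.background = element_rect(fill = "white"))
}

log_scatter <- create_log_scatter(toy_fires, "temp")
```

With the real data, `create_log_scatter(forest_fire, "temp")` should spread the small fires out vertically while keeping the few very large ones on the same plot.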

  • Let’s look at the range of the area column:
summary(forest_fire$area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.52   12.85    6.57 1090.84

Looking at the 3rd quartile, we can conclude that about 75% of the area values are at or below 6.57 ha, even though the maximum is 1090.84 ha.
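This reading of the quartile can be checked by computing the share of values at or below the third quartile. Here is a minimal sketch on a made-up vector (`toy_area`), not on the real area column:

```r
# Made-up burned areas in hectares, skewed like the real column
toy_area <- c(0, 0, 0, 0.5, 1, 2, 7, 1000)

# Third quartile, using R's default quantile method (type 7)
q3 <- quantile(toy_area, probs = 0.75, names = FALSE)

# Share of values at or below the third quartile
share_below_q3 <- mean(toy_area <= q3)
```

Replacing `toy_area` with `forest_fire$area` should give a share of roughly 0.75 on the real data; ties at the quartile value can push it slightly higher.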

For more clarity, let’s plot a histogram of the area column.
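The histogram code is not shown here, so this is a minimal sketch of how it could be drawn. The helper name `create_area_hist`, the default of 30 bins, and the toy data frame are my assumptions; with the real data, pass `forest_fire` instead of `toy_fires`.

```r
library(ggplot2)

# Toy stand-in for forest_fire (made-up values, for illustration only)
toy_fires <- data.frame(area = c(0, 0, 0, 0.5, 1, 2, 7, 1000))

# Hypothetical helper: histogram of the burned area
create_area_hist <- function(dframe, bins = 30) {
  ggplot(data = dframe, aes(x = area)) +
    geom_histogram(bins = bins) +
    labs(x = "Burned area (ha)", y = "Number of fires") +
    theme(panel.background = element_rect(fill = "white"))
}

area_hist <- create_area_hist(toy_fires)
```

Given the summary statistics above, I would expect the real histogram to show one tall bar near zero and a long, thin tail to the right.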

Conclusion

This project shows how important it is to visualize data. The patterns found this way will help us build statistical models for making predictions.