Visualization is one of the most important skills that every data analyst should possess. In this project, I perform a simple exploratory data analysis of the Forest Fires data set.
This data set is about forest fires occurring in Portugal and is based on this scientific research paper.
Main Goal: To find out patterns in the occurrence of forest fires.
#importing data
library(tidyverse)
forest_fire <- read_csv("forestfires.csv")
glimpse(forest_fire)
## Rows: 517
## Columns: 13
## $ X <dbl> 7, 7, 7, 8, 8, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 5, 8, 6, 6, 6...
## $ Y <dbl> 5, 4, 4, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4...
## $ month <chr> "mar", "oct", "oct", "mar", "mar", "aug", "aug", "aug", "sep"...
## $ day <chr> "fri", "tue", "sat", "fri", "sun", "sun", "mon", "mon", "tue"...
## $ FFMC <dbl> 86.2, 90.6, 90.6, 91.7, 89.3, 92.3, 92.3, 91.5, 91.0, 92.5, 9...
## $ DMC <dbl> 26.2, 35.4, 43.7, 33.3, 51.3, 85.3, 88.9, 145.4, 129.5, 88.0,...
## $ DC <dbl> 94.3, 669.1, 686.9, 77.5, 102.2, 488.0, 495.6, 608.2, 692.6, ...
## $ ISI <dbl> 5.1, 6.7, 6.7, 9.0, 9.6, 14.7, 8.5, 10.7, 7.0, 7.1, 7.1, 22.6...
## $ temp <dbl> 8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 8.0, 13.1, 22.8, 17.8...
## $ RH <dbl> 51, 33, 33, 97, 99, 29, 27, 86, 63, 40, 51, 38, 72, 42, 21, 4...
## $ wind <dbl> 6.7, 0.9, 1.3, 4.0, 1.8, 5.4, 3.1, 2.2, 5.4, 4.0, 7.2, 4.0, 6...
## $ rain <dbl> 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...
## $ area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
So there are 517 rows and 13 columns in our data set. Let’s see some of the rows.
head(forest_fire, n=11)
## # A tibble: 11 x 13
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0 0
## 2 7 4 oct tue 90.6 35.4 669. 6.7 18 33 0.9 0 0
## 3 7 4 oct sat 90.6 43.7 687. 6.7 14.6 33 1.3 0 0
## 4 8 6 mar fri 91.7 33.3 77.5 9 8.3 97 4 0.2 0
## 5 8 6 mar sun 89.3 51.3 102. 9.6 11.4 99 1.8 0 0
## 6 8 6 aug sun 92.3 85.3 488 14.7 22.2 29 5.4 0 0
## 7 8 6 aug mon 92.3 88.9 496. 8.5 24.1 27 3.1 0 0
## 8 8 6 aug mon 91.5 145. 608. 10.7 8 86 2.2 0 0
## 9 8 6 sep tue 91 130. 693. 7 13.1 63 5.4 0 0
## 10 7 5 sep sat 92.5 88 699. 7.1 22.8 40 4 0 0
## 11 7 5 sep sat 92.5 88 699. 7.1 17.8 51 7.2 0 0
Here are descriptions of the variables in the data set and the range of values for each taken from the paper:
The X and Y variables are coordinates of fire locations.
The acronym FWI stands for “fire weather index”, a method used by scientists to quantify risk factors for forest fires. The FFMC, DMC, DC and ISI columns are components of the FWI system.
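The value ranges can also be checked against the data itself. The following is a minimal sketch, not part of the original analysis: `variable_ranges` is a hypothetical helper name, and `sample_rows` holds only three rows copied from the `head()` output above so that the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: minimum and maximum of every numeric column,
# returned with one row per variable.
variable_ranges <- function(dframe) {
  dframe %>%
    summarise(across(where(is.numeric),
                     list(min = min, max = max),
                     .names = "{.col}__{.fn}")) %>%
    pivot_longer(everything(),
                 names_to = c("variable", ".value"),
                 names_sep = "__")
}

# Three rows copied from the head() output above (illustration only).
sample_rows <- tibble(
  month = c("mar", "oct", "oct"),
  temp  = c(8.2, 18.0, 14.6),
  RH    = c(51, 33, 33),
  wind  = c(6.7, 0.9, 1.3)
)

variable_ranges(sample_rows)
```

Calling `variable_ranges(forest_fire)` on the full data should give the observed range of every numeric column, which can then be compared with the ranges reported in the paper.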
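#generic function to create bar graphs
create_bar <- function(dframe, x, y){
  # x and y are column names of dframe, given as strings;
  # .data[[ ]] replaces the deprecated aes_string()
  ggplot(data = dframe,
         aes(x = .data[[x]], y = .data[[y]])
  )+
    geom_col()+
    labs(x = x, y = "Total Forest Fires")+
    theme(panel.background = element_rect(fill = "white"))
}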
forest_fire_by_month <- forest_fire %>%
group_by(month) %>%
summarise(
count = n(),
.groups="drop"
)%>%
mutate(
month = factor(month,
levels=c("jan", "feb", "mar", "apr", "may",
"jun", "jul", "aug", "sep", "oct",
"nov", "dec")
)
)
forest_fire_by_day <- forest_fire %>%
group_by(day) %>%
summarise(
count = n(),
.groups="drop"
)%>%
mutate(
day = factor(day,
levels=c("sun", "mon", "tue", "wed", "thu",
"fri", "sat")
)
)
#plotting the bar graphs
create_bar(forest_fire_by_day,"day", "count")
create_bar(forest_fire_by_month, "month", "count")
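As a side note, the `group_by()` and `summarise()` steps above can be written more compactly with dplyr's `count()`. This is a minimal sketch under stated assumptions: `sample_rows` contains only the month values of the first nine rows shown in the `head()` output, so the counts it produces are not the real monthly totals.

```r
library(dplyr)

month_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec")

# Month values of the first nine rows shown earlier (illustration only).
sample_rows <- tibble(
  month = c("mar", "oct", "oct", "mar", "mar", "aug", "aug", "aug", "sep")
)

# count() groups by month and tallies the rows in one step;
# name = "count" keeps the column name that create_bar() is called with.
fires_by_month <- sample_rows %>%
  mutate(month = factor(month, levels = month_levels)) %>%
  count(month, name = "count")

fires_by_month
```

Converting `month` to a factor before counting keeps the rows, and later the bars, in calendar order rather than alphabetical order.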
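#generic function to create box plots
create_box <- function(x, y){
  # x and y are column names of forest_fire, given as strings
  ggplot(data = forest_fire,
         aes(x = .data[[x]], y = .data[[y]])
  )+
    geom_boxplot()+
    labs(x = x, y = y)+
    theme(panel.background = element_rect(fill = "white"))
}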
#Creating various box plots by vectorization
x_var_month <- names(forest_fire)[3] ## month
x_var_day <- names(forest_fire)[4] ## day
y_var <- names(forest_fire)[5:12]
month_box <- map2(x_var_month, y_var, create_box)
day_box <- map2(x_var_day, y_var, create_box)
month_box
[Figures: eight box plots, one each for FFMC, DMC, DC, ISI, temp, RH, wind and rain, grouped by month]
day_box
[Figures: eight box plots, one each for FFMC, DMC, DC, ISI, temp, RH, wind and rain, grouped by day]
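One thing to watch in the box plots: `month` and `day` are stored as character columns in `forest_fire` (the factor conversion earlier was applied only to the two summary tables), so ggplot2 orders the boxes alphabetically. A possible fix is sketched below. `create_month_box` is a hypothetical variant of `create_box()`, and `sample_rows` holds only the `month` and `temp` values of the first nine rows from the `head()` output.

```r
library(ggplot2)

month_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec")

# Hypothetical variant of create_box(): takes the data frame as an argument
# and converts month to a factor in calendar order before plotting.
create_month_box <- function(dframe, y) {
  dframe$month <- factor(dframe$month, levels = month_levels)
  ggplot(dframe, aes(x = month, y = .data[[y]])) +
    geom_boxplot() +
    labs(y = y) +
    theme(panel.background = element_rect(fill = "white"))
}

# month and temp of the first nine rows shown earlier (illustration only).
sample_rows <- data.frame(
  month = c("mar", "oct", "oct", "mar", "mar", "aug", "aug", "aug", "sep"),
  temp  = c(8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 8.0, 13.1)
)

create_month_box(sample_rows, "temp")
```

The same idea should work for `day` with the weekday levels used earlier.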
#generic function to create scatter plots
create_scatter <- function(x, y){
  # x and y are column names of forest_fire, given as strings
  ggplot(data = forest_fire,
         aes(x = .data[[x]], y = .data[[y]])
  )+
    geom_point()+
    labs(x = x, y = y)+
    theme(panel.background = element_rect(fill = "white"))
}
x_var_scatter <- names(forest_fire)[5:12]
y_var_scatter <- names(forest_fire)[13] ##area
scatters <- map2(x_var_scatter, y_var_scatter, create_scatter)
scatters
[Figures: eight scatter plots of area against FFMC, DMC, DC, ISI, temp, RH, wind and rain]
Notice that the points in each scatter plot are clustered near the bottom: only a few fires have very large values of area, while most values lie in the low range.
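One common way to spread out such points is to plot log(1 + area) instead of area; adding 1 keeps the zero values defined. The sketch below was not part of the original analysis, and `create_log_scatter` is a hypothetical variant of `create_scatter()` that takes the data frame as an argument.

```r
library(ggplot2)

# Hypothetical variant of create_scatter(): plots log(1 + y) instead of y.
create_log_scatter <- function(dframe, x, y = "area") {
  ggplot(dframe, aes(x = .data[[x]], y = log1p(.data[[y]]))) +
    geom_point() +
    labs(x = x, y = paste0("log(1 + ", y, ")")) +
    theme(panel.background = element_rect(fill = "white"))
}

# Effect of the transform on a few arbitrary values (not taken from the data):
# large values are compressed much more strongly than small ones.
log1p(c(0, 10, 1000))
```

With the full data loaded, `map(x_var_scatter, ~ create_log_scatter(forest_fire, .x))` should produce the eight transformed scatter plots.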
#summary of the area column
summary(forest_fire$area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.52 12.85 6.57 1090.84
Looking at the 3rd quartile, we can conclude that about 75% of the area values are at most 6.57, while the maximum is 1090.84, so the distribution is heavily right-skewed.
For more clarity, let’s plot a histogram of the area column.
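A histogram in the same style as the other plots could look like the sketch below. `create_hist` is a hypothetical helper name, the choice of 30 bins is arbitrary, and the snippet assumes `forestfires.csv` sits in the working directory as in the import step.

```r
library(ggplot2)

# Hypothetical helper in the style of create_scatter(): histogram of one column.
create_hist <- function(dframe, x, bins = 30) {
  ggplot(dframe, aes(x = .data[[x]])) +
    geom_histogram(bins = bins) +
    labs(x = x, y = "Number of fires") +
    theme(panel.background = element_rect(fill = "white"))
}

# Only runs when the data file is available in the working directory.
if (file.exists("forestfires.csv")) {
  forest_fire <- readr::read_csv("forestfires.csv")
  print(create_hist(forest_fire, "area"))
}
```

Given the summary above, most of the bars should pile up near zero, with a long tail towards the maximum.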
This shows how important it is to visualize data. It helps us when building statistical models for making predictions.