vignettes/foehnix.Rmd
foehnix.Rmd
This page shows a real world demo of the automated foehn classification method foehnix based on meteorological observations from two sites in Tyrol, Austria. One station is located in the Brenner valley south of Innsbruck on 1080 meters above mean sea level, the second station is located close to the main alpine ridge on 2107 meters above mean sea level.
## dd ff rh t
## 2006-01-01 01:00:00 171 0.6 90 -0.4
## 2006-01-01 02:00:00 268 0.3 100 -1.8
## 2006-01-01 03:00:00 115 5.2 79 0.9
## 2006-01-01 04:00:00 152 2.1 88 -0.6
## 2006-01-01 05:00:00 319 0.7 100 -2.6
## 2006-01-01 06:00:00 36 0.1 99 -1.7
## dd ff rh t
## 2006-01-01 01:00:00 180 10.8 100 -7.8
## 2006-01-01 02:00:00 186 12.5 100 -8.0
## 2006-01-01 03:00:00 181 11.3 100 -7.4
## 2006-01-01 04:00:00 178 13.3 100 -7.5
## 2006-01-01 05:00:00 176 13.1 100 -7.1
## 2006-01-01 06:00:00 184 10.0 100 -6.9
This example shows how to perform foehn classification for Ellbögen using the information of the Sattelberg station as additional source of information. The Sattelberg station is located south of Ellbögen close to the alpine crest and thus, during foehn events, measuring the upstream air mass descending down on the north side of the Alps.
To perform the classification we first of all have to prepare our
data sets. foehnix(...)
expects the input to be of class
zoo
(zoo time series object; based on the zoo
package). The data sets included in this package
(ellboegen
, sattelberg
, …; see demodata
) are already
zoo objects. Thus, we only have to load the meteorological observations
for Ellbögen (valley station) and Sattelberg (crest station):
# Load demo data sets (zoo time series objects)
data("ellboegen", package = "foehnix")
data("sattelberg", package = "foehnix")
As we want to use information from both stations: rename the
variables of the sattelberg
object and combine the two data
sets into one zoo
object called data
.
# Modify sattelberg variable names (crest_ identifies Sattelberg
# observations, our crest station) and combine both data sets.
names(sattelberg) <- paste0("crest_", names(sattelberg))
data <- merge(ellboegen, sattelberg)
# NOTE: for demonstration purpose we split the data set into
# two pieces: a training and a test data set.
# The training data set is used to estimate the mixture model, the
# test data set will be used for "predictions" (see "Predictions"
# section of this file).
train <- head(data, nrow(data) - 10)
test <- tail(data, 10)
rm(data) # Delete 'data'
In addition we are calculating the potential temperature difference between the two sites using a dry adiabatic lapse rate of one 10 degrees Celsius per 1000 meters. The difference in altitude between the two stations is 1027 meters which yields:
# Dry adiabatic temperature difference between
# Sattelberg (data$crest_t) and Ellboegen (data$t) corrected by
# 1027/10 degrees.
train$diff_t <- train$crest_t + 10.27 - train$t
South foehn flow at the valley station in Ellbögen has a wind direction of
about 133 degrees. We are using a (relatively weak) wind direction
filter with a wind sector of +/-90 degrees around 133 degrees. For the
classification only observations/times are used where the corresponding
wind direction (dd
) lies within the wind sector
>=43
and <= 223
(180 degrees sector). In
addition we can make use of the information provided by the crest
station (Sattelberg) in several
ways:
diff_t
).dd
) inside the wind sector of
>=43
and <=223
degrees for
classification (dd = c(43, 223)
).>= 90
and <= 270
,
sat_dd = c(90, 270)
).A set of S3 methods for object of class foehnix
are
available, This includes methods to access information criteria, methods
to check the estimated model (model settings), methods to return the
estimated coefficients and probabilities, and some default plots for
model assessment.
Let’s check our demo model mod
estimated above:
# Information criteria
# logLik: final log-likelihood sum of the model
# edf: effective degrees of freedom
# AIC: Akaike information criterion
# BIC: Bayesian information criterion
c(logLik(mod), edf(mod), AIC(mod), BIC(mod))
## loglik edf AIC BIC
## -55763.83 6.00 111539.67 111588.29
The print
and summary
methods show some
information about the model such as number of observations used for the
classification, the filter and its effect, and the corresponding
information criteria.
print(mod)
##
## Call: foehnix(formula = diff_t ~ ff, data = train, switch = TRUE, filter = filter,
## verbose = FALSE)
##
## Number of observations (total) 113942 (5527 due to inflation)
## Removed due to missing values 38700 (34.0 percent)
## Outside defined wind sector 50793 (44.6 percent)
## Used for classification 24449 (21.5 percent)
##
## Climatological foehn occurance 17.93 percent (on n = 75242)
## Mean foehn probability 17.69 percent (on n = 75242)
##
## Log-likelihood: -55763.8, 6 effective degrees of freedom
## Corresponding AIC = 111539.7, BIC = 111588.3
## Time required for model estimation: 2.7 seconds
##
## Use summary(object, detailed = TRUE) to get additional test statistics.
The coef
or coefficients
method returns the
estimated coefficients.
coef(mod)
## Coefficients of foehnix model
## Model formula: diff_t ~ ff
## foehnix family of class: Gaussian
##
## Coefficients
## mu1 sigma1 mu2 sigma2 (Intercept) ff
## 5.6735162 2.7399700 0.8667821 1.3340624 -5.7422844 1.0656787
## mu1
## 5.673516
print(c["mu2"])
## mu2
## 0.8667821
The summary
method allows to get the test statistics for
the estimated coefficients, namely the z statistics for the concomitant
model (if any) and the t statistics for the two location parameters of
the two components.
summary(mod, detailed = TRUE)
##
## Call: foehnix(formula = diff_t ~ ff, data = train, switch = TRUE, filter = filter,
## verbose = FALSE)
##
## Number of observations (total) 113942 (5527 due to inflation)
## Removed due to missing values 38700 (34.0 percent)
## Outside defined wind sector 50793 (44.6 percent)
## Used for classification 24449 (21.5 percent)
##
## Climatological foehn occurance 17.93 percent (on n = 75242)
## Mean foehn probability 17.69 percent (on n = 75242)
##
## Log-likelihood: -55763.8, 6 effective degrees of freedom
## Corresponding AIC = 111539.7, BIC = 111588.3
## Time required for model estimation: 2.7 seconds
##
## Cluster separation (ratios close to 1 indicate
## well separated clusters):
## prior size post>0 ratio
## Component 1 (foehn) 0.55 13488 19658 0.69
## Component 2 (no foehn) 0.45 10961 22457 0.49
##
## ---------------------------------
##
## Concomitant model: z test of coefficients
## Estimate Std. error z value Pr(>|z|)
## cc.(Intercept) 0.912295 0.075737 12.046 < 2.2e-16 ***
## cc.ff 3.996336 0.013613 293.567 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Number of IWLS iterations: 2 (algorithm converged)
## Dispersion parameter for binomial family taken to be 1.
The method plot
provides a set of model assessment
plots. Different types are available, if no which
argument
is set all will be shown one after another.
which = "loglik"
shows the log-likelihood path of the
two components of the model (the part for the two components of the
mixture model plus the one of the concomitant model) plus the full
likelihood path. The x-axis shows the log-iterations of the EM algorithm
(see statistical models for more
details). Increasing values indicate positive log-likelihood
contributions (improvement of the model).
plot(mod, which = "loglik", log = F)
which = "loglikcontribution"
shows the same information
as which = "loglik"
but with respect to the log-likelihood
of the initial parameters. Increasing values indicate positive
log-likelihood contributions (improvement of the model).
plot(mod, which = "loglikcontribution")
Similar to the plots above we can also check the paths of the
coefficients trough the iterations of the EM algorithm with
which = "coef"
:
plot(mod, which = "coef")
Last but not least the which = "hist"
plot shows the
conditional histograms of the two clusters. Two histograms are shown
given the main covariate of the foehnix
mixture model, in
this case (model mod
) it is delta_t
, the
potential temperature difference between the valley station Ellbögen and the crest station Sattelberg. The left panel
corresponds to the “no foehn” component or cluster, the right one to the
“foehn” component. If a filter
rule is used: only the data
which have been used for classification will be shown. Observations with
a higher probability to be in the left component are sown left,
observations with a higher probability to be a member of the second
component are shown right. On top (line) the estimated density of the
two components is shown.
plot(mod, which = "hist")
Fitted probabilities (the probabilities obtained by the model based
on the training data set) can be accessed by calling
fitted
:
## [1] "zoo"
## Index fitted(mod)
## Min. :2006-01-01 01:00:00 Min. :0.00
## 1st Qu.:2009-04-01 22:15:00 1st Qu.:0.00
## Median :2012-07-01 19:30:00 Median :0.00
## Mean :2012-07-01 19:30:00 Mean :0.18
## 3rd Qu.:2015-10-01 16:45:00 3rd Qu.:0.00
## Max. :2018-12-31 14:00:00 Max. :1.00
## NA's :38700
The method let’s you access the fitted probabilities
(which = "probability"
), a flag which comes with the fitted
values (which = "flag"
), or both
(which = "both"
).
## 2018-12-31 09:00:00 2018-12-31 10:00:00 2018-12-31 11:00:00 2018-12-31 12:00:00
## NA NA NA NA
## 2018-12-31 13:00:00 2018-12-31 14:00:00
## NA NA
## 2018-12-31 09:00:00 2018-12-31 10:00:00 2018-12-31 11:00:00 2018-12-31 12:00:00
## NA NA NA NA
## 2018-12-31 13:00:00 2018-12-31 14:00:00
## NA NA
## prob flag
## 2018-12-31 09:00:00 NA NA
## 2018-12-31 10:00:00 NA NA
## 2018-12-31 11:00:00 NA NA
## 2018-12-31 12:00:00 NA NA
## 2018-12-31 13:00:00 NA NA
## 2018-12-31 14:00:00 NA NA
Explanation of the flag: * NA
if we have had missing
data (missing covariates, or missing data to apply the
filter
rules provided when calling foehnix
). *
0
if observations have been outside the defined
filter
, not used for classification and thus the
probability has been set to 0
. * 1
if
observations have been used for classification with the corresponding
estimated probability.
In addition, windrose proides a convenient function to plot wind roses.
windrose(mod, ncol = 3)
The individual plots can also be plotted separately:
windrose(mod, which = "foehn", ncol = 2)
One type shows the density function (type = "density"
),
the other one a 2-D histogram (type = "histogram"
). The
argument which
allows to plot the unconditional windrose
(which = "unconditional"
), the ‘no foehn event’ windrose
(which = "nofoehn"
) which shows only observations where the
classification gives a probability of foehn below 0.5
, and
a wind rose for ‘foehn’ events where the classification returned a
probability of >= 0.5
.
The image
method applied to foehnix
objects
allows to plot Hovmöller diagrams with a set of options. Some of the
features are:
FUN = "mean"
: aggregates mean foehn probability.FUN = function(x) sum(x > 0.8, na.rm = TRUE)
: an
example of a custom function, in this case the occurrence of events with
a probability larger than 0.8.Example 1: Default plot, no aggregation on the time
axis (deltat = NULL
; takes increments of the underlying
data), weekly aggregation along the time of the year axis
(deltad = 7L
), the frequency (FUN = "freq"
) is
plotted without contours.
image(mod, FUN = "freq")
Example 2: Aggregated to three-hourly intervals
(deltat = 3600 * 3
), two-week mean values
(deltad = 14L
), occurrance of foehn
(FUN = "occ"
), with custom colors and contours (without
labels).
col <- rev(colorspace::sequential_hcl(51, h = c(-50, 120), c. = c(80, 10)))
image(mod, FUN = "occ", col = col,
deltat = 3 * 3600, deltad = 14L,
contours = TRUE, contour.col = "white", lwd = 2, drawlabels = FALSE,
main = paste("Hovmoeller Diagram, Occurrence of Foehn",
"2-Weekly/3-Hourly Sums", sep = "\n"))
A last important method is predict
. The term predict
refers to statistical prediction, applying a statistical model to a data
set which has never been seen by the model (not included in the training
data set).
As foehnix
is a statistical classification method a
prediction in this context is not a forecast (like a weather forecast),
but a classification to new data. This can, for example, be used to
retrieve foehn probabilities for new observation in an operational
context.
Assume we have estimated a foehnix
mixture model and a
set of new observations from our two stations (Ellbögen and Sattelberg) have become available.
We would like to get foehn probabilities this new data set. Luckily we
put aside 10 observations earlier (the test
data set):
# Calculate potential temperature difference (as we did above)
test$diff_t <- test$crest_t + 10.27 - test$t
print(test)
## dd ff rh t crest_dd crest_ff crest_rh crest_t diff_t
## 2018-12-31 15:00:00 NA NA 100 0.0 NA NA 100 -3.7 6.57
## 2018-12-31 16:00:00 NA NA 100 -0.2 NA NA 100 -4.0 6.47
## 2018-12-31 17:00:00 NA NA 100 -0.6 NA NA 100 -4.1 6.77
## 2018-12-31 18:00:00 360 2.5 100 -0.3 NA NA 100 -3.9 6.67
## 2018-12-31 19:00:00 358 0.7 100 -0.1 NA NA 100 -4.3 6.07
## 2018-12-31 20:00:00 360 0.9 100 -0.7 6 14.0 100 -4.2 6.77
## 2018-12-31 21:00:00 NA NA 100 -0.6 4 11.8 100 -4.2 6.67
## 2018-12-31 22:00:00 NA NA 100 -0.3 0 12.1 100 -4.5 6.07
## 2018-12-31 23:00:00 NA NA 100 0.1 4 5.3 100 -4.8 5.37
## 2019-01-01 00:00:00 NA NA 100 0.1 NA NA NA NA NA
Given our model we can now perform the classification on this new
data set which has not been used for training the foehnix
mixture model by calling:
predict(mod, newdata = test)
## prob flag
## 2018-12-31 15:00:00 NA NA
## 2018-12-31 16:00:00 NA NA
## 2018-12-31 17:00:00 NA NA
## 2018-12-31 18:00:00 NA NA
## 2018-12-31 19:00:00 NA NA
## 2018-12-31 20:00:00 0 0
## 2018-12-31 21:00:00 NA NA
## 2018-12-31 22:00:00 NA NA
## 2018-12-31 23:00:00 NA NA
## 2019-01-01 00:00:00 NA NA
For demonstration: we would have become the same (roughly the same,
as a slightly different training data set is used) if we would
re-estimate the model based on the whole data set (train
and test
combined) and check the estimated probabilities of
the last 10 entries, the very same times we have put aside as our
test
data set:
mod2 <- foehnix(diff_t ~ ff, data = rbind(train,test),
filter = filter, switch = TRUE, verbose = FALSE)
print(tail(fitted(mod2, 1:2), 10))
## prob flag
## 2018-12-31 15:00:00 NA NA
## 2018-12-31 16:00:00 NA NA
## 2018-12-31 17:00:00 NA NA
## 2018-12-31 18:00:00 NA NA
## 2018-12-31 19:00:00 NA NA
## 2018-12-31 20:00:00 0 0
## 2018-12-31 21:00:00 NA NA
## 2018-12-31 22:00:00 NA NA
## 2018-12-31 23:00:00 NA NA
## 2019-01-01 00:00:00 NA NA