Underlying concept in more detail

The article “Getting started with gridded data” shows the basic use of the eupp gridded data interface. For those interested, this article gives more insight into how the package works under the hood.

Some functions might be useful for (i) debugging or (ii) adding functionality around the eupp package. In more detail, the gridded dataset functionality works as follows:

  1. The user specifies the data set to be downloaded/retrieved using the eupp_config() function which returns an object of class eupp_config.
  2. The user calls eupp_download_gridded() or eupp_get_gridded() to retrieve the data in different formats (the first allows for GRIB version 1 and different NetCDF file formats; the latter for stars objects). Below the surface eupp performs the following steps:
    1. Defining the GRIB index files required to identify the necessary GRIB messages
    2. Downloading and parsing the GRIB index files to identify files and byte ranges
    3. Partially downloading the GRIB files (only the required messages, via curl) and storing the requested messages in a new GRIB version 1 file.
    4. If a NetCDF file has been requested: making the required manipulations on the GRIB file and converting it to NetCDF, which requires ecCodes to be installed.
    5. If a stars object has been requested: reading the NetCDF file. This goes through the intermediate step of creating a NetCDF file; thus ecCodes is necessary.

When calling eupp_download_gridded() a file will be created on success (GRIB version 1 or NetCDF), while eupp_get_gridded() returns a stars object in the active R session. Temporary files are stored in tempdir() and deleted as soon as they are no longer needed.

Specify dataset to be downloaded

To demonstrate the intermediate steps listed above, a data set specification (configuration) is required. For this purpose a small subset of gridded surface ensemble forecast data is used.

  • cache = "_cache": Enables GRIB index caching, which can be useful if the same GRIB index file has to be accessed multiple times (as in this article).
  • Imagine not knowing which parameters, forecast steps, or perturbation numbers (members) are available.
library("eupp")
(conf <- eupp_config(product   = "forecast",                    # forecasts
                     type      = "ens",                         # ensemble forecasts
                     level     = "surface",                     # surface fields
                     date      = c("2017-05-05", "2017-06-05"), # 'random' dates; ISO YYYY-mm-dd
                     cache     = "_cache"))                     # enable caching
## EUPP Config
##    Product:             forecast (fcs)
##    Level:               surface
##    Type:                ens
##    Date(s):             2017-05-05,2017-06-05
##    Parameter:           all available
##    Steps:               all available
##    Members:             all available
##    Version:             0
##    Cache:               _cache
##    Area:                not defined

So far, only an R object of class eupp_config has been created. It is used further down the pipeline to process the request.

Define required files

The next step is to define the URL(s) of the file(s) to be accessed to process the request. This is done by the function eupp_get_source_urls().

# Required GRIB index files:
eupp_get_source_urls(conf, fileext = "index")
## [1] "https://storage.ecmwf.europeanweather.cloud/eumetnet-postprocessing-benchmark-training-dataset/data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb.index"   
## [2] "https://storage.ecmwf.europeanweather.cloud/eumetnet-postprocessing-benchmark-training-dataset/data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb.index"
## [3] "https://storage.ecmwf.europeanweather.cloud/eumetnet-postprocessing-benchmark-training-dataset/data/fcs/surf/EU_forecast_ctr_surf_params_2017-06_0.grb.index"   
## [4] "https://storage.ecmwf.europeanweather.cloud/eumetnet-postprocessing-benchmark-training-dataset/data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb.index"

As shown above, four different files have to be accessed as we (i) are asking for forecasts issued on two different dates (date) and (ii) have not explicitly defined members, so we need both control run forecasts (handled as member = 0) and perturbed forecasts (members 1, 2, …).

When fileext is not defined (fileext = NULL; default) one gets the URLs for the corresponding GRIB files for direct access.
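The naming pattern visible in the four URLs above can be reproduced with a few lines of base R. This is only a sketch: index_url() and base_url are hypothetical names (not part of eupp), the base URL is copied from the output above, and the trailing _0 is assumed to correspond to version 0 of the configuration. Note that control run files are named per month while ensemble files are named per day.

```r
# Hypothetical helper (not part of eupp): rebuild the index URLs shown above.
# The base URL is copied from the printed output; treat it as an assumption.
base_url <- "https://storage.ecmwf.europeanweather.cloud/eumetnet-postprocessing-benchmark-training-dataset"

index_url <- function(date, kind = c("ens", "ctr")) {
  kind <- match.arg(kind)
  date <- as.Date(date)
  # Control run files carry a monthly stamp, ensemble files a daily stamp
  stamp <- format(date, if (kind == "ctr") "%Y-%m" else "%Y-%m-%d")
  sprintf("%s/data/fcs/surf/EU_forecast_%s_surf_params_%s_0.grb.index",
          base_url, kind, stamp)
}

dates <- c("2017-05-05", "2017-06-05")
# Interleave control and ensemble URLs per date (same order as above)
urls <- c(rbind(index_url(dates, "ctr"), index_url(dates, "ens")))
print(urls)
```

For real requests eupp_get_source_urls() should be used, as it takes product, level, and version from the configuration.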

Getting (full) inventory

In this scenario we assume that we do not yet know what is available. To find out, we can use the configuration conf from above to get a complete list of all messages in the GRIB index inventories listed above by calling eupp_get_inventory().

eupp_get_inventory() internally calls eupp_get_source_urls(..., fileext = "index"), downloads the index files (line-wise JSON strings), parses them, and puts them into an object of class c("eupp_inventory", "data.frame") (basic data.frame; no dedicated S3 methods so far).

# Getting inventory (based on `conf` from above)
inv <- eupp_get_inventory(conf)
class(inv)
## [1] "eupp_inventory" "data.frame"
dim(inv)
## [1] 292740     17
head(inv)
##                                                          path domain levtype
## 11001 data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb      g     sfc
## 11002 data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb      g     sfc
## 11003 data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb      g     sfc
## 11004 data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb      g     sfc
## 11005 data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb      g     sfc
## 11006 data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb      g     sfc
##       step_char param class type stream expver leg_number    offset length
## 11001         0    2t    od   cf   enfo   0001          1 253676400  23412
## 11002         0   10u    od   cf   enfo   0001          1 253699920  23412
## 11003         0   10v    od   cf   enfo   0001          1 253723440  23412
## 11004         0   tcc    od   cf   enfo   0001          1 253746960  23412
## 11005         0    tp    od   cf   enfo   0001          1 253770480  23412
## 11006         0  100u    od   cf   enfo   0001          1 253794000  23412
##       param_id number       init step      valid
## 11001      167      0 2017-05-05    0 2017-05-05
## 11002      165      0 2017-05-05    0 2017-05-05
## 11003      166      0 2017-05-05    0 2017-05-05
## 11004      164      0 2017-05-05    0 2017-05-05
## 11005      228      0 2017-05-05    0 2017-05-05
## 11006   228246      0 2017-05-05    0 2017-05-05

As cache is enabled, the resulting data.frame is stored in R's RDS file format in the cache folder, using an md5 checksum of the original URL to keep track of its origin. When another set of data stored in the same GRIB file (and thus described by the same GRIB index file) is downloaded, the cached file is used, which can significantly improve performance.
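The caching idea can be illustrated with base R alone. The following is a minimal sketch, not the code used inside eupp: url_md5() and cached_read() are hypothetical helpers, the file naming scheme (checksum plus .rds) is an assumption, and the checksum is computed by writing the URL to a temporary file because tools::md5sum() works on files.

```r
# Hypothetical helpers (not part of eupp) sketching URL-keyed RDS caching.
url_md5 <- function(url) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(url), tmp)   # URL bytes only, no trailing newline
  unname(tools::md5sum(tmp))
}

cached_read <- function(url, cache_dir, reader) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  rds <- file.path(cache_dir, paste0(url_md5(url), ".rds"))
  if (file.exists(rds)) return(readRDS(rds))   # cache hit: read from disk
  res <- reader(url)                           # cache miss: do the expensive work
  saveRDS(res, rds)
  res
}

# Stand-in "reader" which counts how often it is actually called
calls <- 0
fake_reader <- function(url) {
  calls <<- calls + 1
  data.frame(url = url, n = 1)
}

cache_dir <- tempfile("eupp_cache_")   # fresh folder for this demonstration
a <- cached_read("https://example.com/file.grb.index", cache_dir, fake_reader)
b <- cached_read("https://example.com/file.grb.index", cache_dir, fake_reader)
print(calls)
```

The second call is served from the RDS file, so the reader runs only once.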

The returned object contains the path of the GRIB file (not the full URL) along with a series of additional fields which differ between products. This inventory tells us that the following parameters (param), steps (step), and ensemble members (number; perturbation number) are available.

unique(inv$param)
##  [1] "2t"    "10u"   "10v"   "tcc"   "tp"    "100u"  "100v"  "cape"  "stl1" 
## [10] "sshf"  "slhf"  "tcw"   "tcwv"  "swvl1" "ssr"   "str"   "sd"    "cp"   
## [19] "cin"   "ssrd"  "strd"  "vis"   "10fg6" "mn2t6" "mx2t6"
unique(inv$step)
##   [1]   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [19]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
##  [37]  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
##  [55]  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
##  [73]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
##  [91]  90  93  96  99 102 105 108 111 114 117 120 123 126 129 132 135 138 141
## [109] 144 150 156 162 168 174 180 186 192 198 204 210 216 222 228 234 240
unique(inv$number)
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [26] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## [51] 50

The full URL to the GRIB files can be constructed from inv$path and $BASEURL from eupp:::eupp_get_url_config() (which can be redefined using the system environment variable EUPP_BASEURL). eupp:::eupp_get_url_config() returns not only the BASEURL but also a series of template strings for the different files on the bucket.
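As a small illustration, the full URLs can be assembled by pasting the base URL and the path column together. This is a sketch under assumptions: the base URL below is taken from the index URLs printed earlier rather than queried from the package, and inv_demo is a hand-made stand-in for a real inventory.

```r
# Assumed base URL (copied from the index URLs printed earlier in this article)
base_url <- "https://storage.ecmwf.europeanweather.cloud/eumetnet-postprocessing-benchmark-training-dataset"

# Hand-made stand-in for an inventory; only the 'path' column is needed here
inv_demo <- data.frame(
  path = c("data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb",
           "data/fcs/surf/EU_forecast_ctr_surf_params_2017-05_0.grb",
           "data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb"),
  stringsAsFactors = FALSE
)

# One URL per distinct GRIB file; drop a trailing slash from the base URL if any
grib_urls <- paste(sub("/+$", "", base_url), unique(inv_demo$path), sep = "/")
print(grib_urls)
```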

Refine data set specification

A more practical approach is to define the data set configuration more explicitly (as we now know what is needed). Given that cache was used above, the GRIB index file should be loaded from disk within a few seconds.

library("eupp")
(conf <- eupp_config(product   = "forecast",
                     type      = "ens",
                     level     = "surface",
                     date      = c("2017-05-05", "2017-06-05"),
                     parameter = c("tp", "sd"),                 # total precip + snow depth
                     steps     = seq(13, 15, by = 2L),          # +13 and +15 hour ahead forecast
                     members   = c(10, 14),                     # perturbation 10 and 14 (why not)
                     cache     = "_cache"))                     # use caching
## EUPP Config
##    Product:             forecast (fcs)
##    Level:               surface
##    Type:                ens
##    Date(s):             2017-05-05,2017-06-05
##    Parameter:           tp, sd
##    Steps:               13, 15
##    Members:             10, 14
##    Version:             0
##    Cache:               _cache
##    Area:                not defined

Getting the required part of the inventory given the configuration above:

(inv <- eupp_get_inventory(conf))
##                                                              path domain
## 14503  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 14515  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 14591  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 14603  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 16703  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 16715  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 16791  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 16803  data/fcs/surf/EU_forecast_ens_surf_params_2017-05-05_0.grb      g
## 158003 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 158015 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 158091 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 158103 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 160203 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 160215 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 160291 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
## 160303 data/fcs/surf/EU_forecast_ens_surf_params_2017-06-05_0.grb      g
##        levtype step_char param class type stream expver number leg_number
## 14503      sfc        13    tp    od   pf   enfo   0001     10          1
## 14515      sfc        13    sd    od   pf   enfo   0001     10          1
## 14591      sfc        13    tp    od   pf   enfo   0001     14          1
## 14603      sfc        13    sd    od   pf   enfo   0001     14          1
## 16703      sfc        15    tp    od   pf   enfo   0001     10          1
## 16715      sfc        15    sd    od   pf   enfo   0001     10          1
## 16791      sfc        15    tp    od   pf   enfo   0001     14          1
## 16803      sfc        15    sd    od   pf   enfo   0001     14          1
## 158003     sfc        13    tp    od   pf   enfo   0001     10          1
## 158015     sfc        13    sd    od   pf   enfo   0001     10          1
## 158091     sfc        13    tp    od   pf   enfo   0001     14          1
## 158103     sfc        13    sd    od   pf   enfo   0001     14          1
## 160203     sfc        15    tp    od   pf   enfo   0001     10          1
## 160215     sfc        15    sd    od   pf   enfo   0001     10          1
## 160291     sfc        15    tp    od   pf   enfo   0001     14          1
## 160303     sfc        15    sd    od   pf   enfo   0001     14          1
##           offset length param_id       init step               valid
## 14503  334505760  23412      228 2017-05-05   13 2017-05-05 13:00:00
## 14515  334788000  35036      141 2017-05-05   13 2017-05-05 13:00:00
## 14591  336534960  23412      228 2017-05-05   13 2017-05-05 13:00:00
## 14603  336817200  35036      141 2017-05-05   13 2017-05-05 13:00:00
## 16703  385237320  23412      228 2017-05-05   15 2017-05-05 15:00:00
## 16715  385519560  35036      141 2017-05-05   15 2017-05-05 15:00:00
## 16791  387266280  23412      228 2017-05-05   15 2017-05-05 15:00:00
## 16803  387548520  35036      141 2017-05-05   15 2017-05-05 15:00:00
## 158003 335894520  23412      228 2017-06-05   13 2017-06-05 13:00:00
## 158015 336176760  35036      141 2017-06-05   13 2017-06-05 13:00:00
## 158091 337931640  23412      228 2017-06-05   13 2017-06-05 13:00:00
## 158103 338213880  35036      141 2017-06-05   13 2017-06-05 13:00:00
## 160203 386824320  23412      228 2017-06-05   15 2017-06-05 15:00:00
## 160215 387106560  35036      141 2017-06-05   15 2017-06-05 15:00:00
## 160291 388860960  23412      228 2017-06-05   15 2017-06-05 15:00:00
## 160303 389143200  35036      141 2017-06-05   15 2017-06-05 15:00:00
dim(inv)
## [1] 16 17

The number of observations (rows) in inv matches our expectation, as we are asking for (i) two different initialization dates, (ii) two parameters, (iii) two forecast steps (lead times), and (iv) two different members (\(2^4 = 16\)).
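The expected count can be checked independently of the package by enumerating all combinations of the requested values with expand.grid(). The object name request is only used for this illustration.

```r
# All combinations of the requested dates, parameters, steps, and members
request <- expand.grid(
  date   = c("2017-05-05", "2017-06-05"),
  param  = c("tp", "sd"),
  step   = c(13, 15),
  number = c(10, 14),
  stringsAsFactors = FALSE
)
nrow(request)
## [1] 16
```

Each row corresponds to one GRIB message which has to be located in the inventory.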

Downloading data

The data sets can be retrieved in three different formats which, however, are connected (top down).

  1. GRIB version 1 (minimal requirements; curl/rcurl)
  2. NetCDF (requires ecCodes to be installed)
  3. stars (requires the stars package plus ecCodes)

Given the inventory above, the eupp package first downloads segments of the original GRIB file using byte-range requests via curl. The result is stored in one GRIB file. If this is what the user requested, that’s it (1). If the user requests a NetCDF file, the GRIB file is stored temporarily and then converted to NetCDF (the console tool grib_set is used to perform some ensemble-related manipulations; the file is then converted to NetCDF via grib_to_netcdf). When asking for stars objects we go through the two steps above before reading the data sets via read_stars() (stars package). The conversion GRIB > NetCDF > stars is required to do some naming manipulation.
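To make the byte-range step more concrete, the following sketch turns the offset and length columns of an inventory into range strings. It assumes the usual HTTP convention in which a range is given as first and last byte (both inclusive), e.g. in a header of the form Range: bytes=first-last; how exactly eupp builds its requests internally may differ. The two messages are copied from the inventory printed earlier (rows 14503 and 14515).

```r
# Two messages taken from the inventory above (offset and length in bytes)
msgs <- data.frame(offset = c(334505760, 334788000),
                   length = c(23412, 35036))

# HTTP byte ranges are inclusive on both ends: last = offset + length - 1
msgs$last  <- msgs$offset + msgs$length - 1
msgs$range <- sprintf("%.0f-%.0f", msgs$offset, msgs$last)
msgs$range
## [1] "334505760-334529171" "334788000-334823035"
```

Only these few kilobytes per message have to be transferred, not the full GRIB file.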

Download data as GRIB Version 1:

Download data and store as NetCDF:

Getting data as stars object: