Saturday, July 27, 2024

torch time series, final episode: Attention

This is the final post in a four-part introduction to time-series forecasting with torch. These posts have been the story of a quest for multiple-step prediction, and by now, we’ve seen three different approaches: forecasting in a loop, incorporating a multi-layer perceptron (MLP), and sequence-to-sequence models. Here’s a quick recap.

  • As one should when one sets out for an adventurous journey, we started with an in-depth study of the tools at our disposal: recurrent neural networks (RNNs). We trained a model to predict the very next observation in line, and then thought of a clever hack: How about we use this for multi-step prediction, feeding back individual predictions in a loop? The result, it turned out, was quite acceptable.

  • Then, the adventure really started. We built our first model “natively” for multi-step prediction, relieving the RNN a bit of its workload and involving a second player, a tiny-ish MLP. Now, it was the MLP’s task to project RNN output to several time points in the future. Although results were pretty satisfactory, we didn’t stop there.

  • Instead, we applied to numerical time series a technique commonly used in natural language processing (NLP): sequence-to-sequence (seq2seq) prediction. While forecast performance was not much different from the previous case, we found the technique to be more intuitively appealing, since it reflects the causal relationship between successive forecasts.

Today we’ll enrich the seq2seq approach by adding a new component: the attention module. Originally introduced around 2014, attention mechanisms have gained enormous traction, so much so that a recent paper title starts out “Attention is Not All You Need”.

The idea is the following.

In the classic encoder-decoder setup, the decoder gets “primed” with an encoder summary just a single time: the time it starts its forecasting loop. From then on, it’s on its own. With attention, however, it gets to see the complete sequence of encoder outputs again every time it forecasts a new value. What’s more, every time, it gets to zoom in on those outputs that seem relevant for the current prediction step.

This is a particularly useful strategy in translation: In generating the next word, a model will need to know what part of the source sentence to focus on. How much the technique helps with numerical sequences, in contrast, will likely depend on the features of the series in question.

As before, we work with vic_elec, but this time, we partly deviate from the way we used to employ it. With the original, half-hourly dataset, training the current model takes a long time, longer than readers will want to wait when experimenting. So instead, we aggregate observations by day. In order to have enough data, we train on years 2012 and 2013, reserving 2014 for validation as well as post-training inspection.
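The preparation step itself is not shown in this post. As a rough sketch of the idea only, the following base R snippet aggregates a made-up half-hourly series by day and standardizes it using training statistics alone. The toy data, the choice of a daily sum, and the 4/2 split are assumptions for illustration; only the names train_mean and train_sd mirror objects used later in the post.

```r
# Toy stand-in for a half-hourly demand series (the real data would be vic_elec)
set.seed(1)
n_days <- 6
time <- seq(as.POSIXct("2012-01-01 00:00:00", tz = "UTC"),
            by = "30 min", length.out = n_days * 48)
demand <- runif(length(time), min = 3000, max = 6000)

# Aggregate by calendar day (assumption: daily total; a daily mean would work the same way)
date <- as.Date(time, tz = "UTC")
daily <- tapply(demand, date, sum)

# Hypothetical split: first four days for training, last two for validation
train <- daily[1:4]
valid <- daily[5:6]

# Standardize with training statistics only, so no validation information leaks in
train_mean <- mean(train)
train_sd <- sd(train)
train_scaled <- (train - train_mean) / train_sd
valid_scaled <- (valid - train_mean) / train_sd
```

Reusing the training mean and standard deviation for the validation data is what allows predictions to be mapped back to the original scale with a single pair of numbers later on.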

We’ll attempt to forecast demand up to fourteen days ahead. How long, then, should the input sequences be? This is a matter of experimentation, all the more so now that we’re adding in the attention mechanism. (I suspect that it might not handle very long sequences so well.)

Below, we go with fourteen days for input length, too, but that may not necessarily be the best possible choice for this series.

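# Packages used throughout (assumed; this listing does not show them being loaded)
library(torch)
library(tidyverse)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(fable)

# elec_train, elec_valid, elec_test (daily demand as one-column matrices) as well as
# train_mean and train_sd are assumed to come from a preparation step not shown here

n_timesteps <- 7 * 2
n_forecast <- 7 * 2

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    # standardize using training-set statistics
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    # number of valid window start positions
    n <- length(self$x) - self$n_timesteps - 1
    
    # optionally subsample the start positions
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    # target: the input window, shifted by lag
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)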

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)
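To make the indexing in the dataset above easier to follow, here is a minimal base R sketch of the same windowing logic on a toy series. It reuses the start/end/lag arithmetic from .getitem; the toy series and the shortened window length are assumptions for illustration.

```r
x <- 1:10            # toy series standing in for the standardized demand
n_timesteps <- 4     # shortened window length, for readability
lag <- 1

# number of usable start positions, computed as in the dataset's initialize()
n <- length(x) - n_timesteps - 1

windows <- lapply(seq_len(n), function(start) {
  end <- start + n_timesteps - 1
  list(
    x = x[start:end],                   # input window
    y = x[(start + lag):(end + lag)]    # target: same window, shifted by lag
  )
})

# first and last (input, target) pairs
str(windows[[1]])
str(windows[[n]])
```

With these settings, every target is its input window moved forward by one observation, and the last start position leaves one observation at the end of the series unused.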

Model-wise, we again encounter the three modules familiar from the previous post: encoder, decoder, and top-level seq2seq module. However, there is an additional component: the attention module, used by the decoder to obtain attention weights.

Encoder

The encoder still works the same way. It wraps an RNN, and returns the outputs for all time steps together with the final state.

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    # GRU if requested, LSTM otherwise
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    # return outputs for all timesteps, as well as last-timestep states for all layers
    x %>% self$rnn()
    
  }
)

Attention module

In basic seq2seq, every time it had to generate a new value, the decoder took into account two things: its prior state, and the previous output generated. In an attention-enriched setup, the decoder additionally receives the complete output from the encoder. In deciding what subset of that output should matter, it gets help from a new agent, the attention module.

This, then, is the attention module’s raison d’être: Given the current decoder state as well as the complete encoder outputs, obtain a weighting of those outputs indicative of how relevant they are to what the decoder is currently up to. This procedure results in the so-called attention weights: a normalized score, for each time step in the encoding, that quantifies its importance.

Attention may be implemented in a number of different ways. Here, we show two implementation options, one additive, and one multiplicative.

Additive attention

In additive attention, encoder outputs and decoder state are commonly either added or concatenated (we choose to do the latter, below). The resulting tensor is run through a linear layer with a tanh activation, summed over the attention dimension, and a softmax is applied for normalization.

attention_module_additive <- nn_module(
  
  initialize = function(hidden_dim, attention_size) {
    
    self$attention <- nn_linear(2 * hidden_dim, attention_size)
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)
    
    # multiplex state to allow for concatenation (dimensions 1 and 2 must agree)
    seq_len <- dim(encoder_outputs)[2]
    # resulting shape: (bs, timesteps, hidden_dim)
    state_rep <- state$permute(c(2, 1, 3))$repeat_interleave(seq_len, 2)
    
    # concatenate along feature dimension
    concat <- torch_cat(list(state_rep, encoder_outputs), dim = 3)
    
    # run through linear layer with tanh
    # resulting shape: (bs, timesteps, attention_size)
    scores <- self$attention(concat) %>% 
      torch_tanh()
    
    # sum over attention dimension and normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores %>%
      torch_sum(dim = 3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)
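To see the arithmetic of the additive variant without any tensors involved, the following base R sketch performs the same steps for a single sequence (no batch dimension). The weight matrix W and bias b are random placeholders for the linear layer's parameters, and all sizes are made up; this is an illustration of the computation, not a substitute for the module above.

```r
set.seed(42)
timesteps <- 5
hidden_dim <- 3
attention_size <- 4

encoder_outputs <- matrix(rnorm(timesteps * hidden_dim), nrow = timesteps)  # (timesteps, hidden_dim)
state <- rnorm(hidden_dim)                                                  # (hidden_dim)

# hypothetical, randomly initialized parameters of the linear layer
W <- matrix(rnorm(2 * hidden_dim * attention_size), nrow = 2 * hidden_dim)  # (2 * hidden_dim, attention_size)
b <- rnorm(attention_size)

# repeat the state for every time step and concatenate along the feature dimension
state_rep <- matrix(state, nrow = timesteps, ncol = hidden_dim, byrow = TRUE)
concat <- cbind(state_rep, encoder_outputs)                # (timesteps, 2 * hidden_dim)

# linear layer followed by tanh, then sum over the attention dimension
scores <- rowSums(tanh(sweep(concat %*% W, 2, b, "+")))    # (timesteps)

# numerically stable softmax
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

attention_weights <- softmax(scores)
print(round(attention_weights, 3))
```

Because tanh bounds every summand to the interval from -1 to 1, each raw score lies between -attention_size and attention_size before the softmax is applied.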

Multiplicative attention

In multiplicative attention, scores are obtained by computing dot products between decoder state and all of the encoder outputs, scaled by the square root of the number of features. Here too, a softmax is then used for normalization.

attention_module_multiplicative <- nn_module(
  
  initialize = function() {
    
    NULL
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)

    # allow for matrix multiplication with encoder_outputs
    # resulting shape: (bs, hidden_dim, 1)
    state <- state$permute(c(2, 3, 1))
 
    # prepare for scaling by number of features
    d <- torch_tensor(dim(encoder_outputs)[3], dtype = torch_float())
       
    # scaled dot products between state and outputs
    # resulting shape: (bs, timesteps, 1)
    scores <- torch_bmm(encoder_outputs, state) %>%
      torch_div(torch_sqrt(d))
    
    # normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores$squeeze(3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)
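The multiplicative variant has no parameters at all, which makes it easy to reproduce by hand. Here is a small base R sketch for a single sequence, with random numbers standing in for encoder outputs and decoder state (sizes are arbitrary assumptions):

```r
set.seed(42)
timesteps <- 5
hidden_dim <- 3

encoder_outputs <- matrix(rnorm(timesteps * hidden_dim), nrow = timesteps)  # (timesteps, hidden_dim)
state <- rnorm(hidden_dim)                                                  # (hidden_dim)

# one dot product per time step, scaled by the square root of the feature count
scores <- as.numeric(encoder_outputs %*% state) / sqrt(hidden_dim)

# numerically stable softmax
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

attention_weights <- softmax(scores)
print(round(attention_weights, 3))
```

The time step whose encoder output is most aligned with the decoder state receives the largest weight, since the softmax preserves the ordering of the scores.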

Decoder

Once attention weights have been computed, their actual application is handled by the decoder. Concretely, the method in question, weighted_encoder_outputs(), computes a product of weights and encoder outputs, making sure that each output will have appropriate impact.

The rest of the action then happens in forward(). A concatenation of weighted encoder outputs (often called “context”) and current input is run through an RNN. Then, an ensemble of RNN output, context, and input is passed to an MLP. Finally, both RNN state and current prediction are returned.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size = 8, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    # input to the linear layer: RNN output, context, and the raw input value
    self$linear <- nn_linear(2 * hidden_size + 1, 1)
    
    self$attention <- if (attention_type == "multiplicative") attention_module_multiplicative()
      else attention_module_additive(hidden_size, attention_size)
    
  },
  
  weighted_encoder_outputs = function(state, encoder_outputs) {

    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    # resulting shape: (bs, timesteps)
    attention_weights <- self$attention(state, encoder_outputs)
    
    # resulting shape: (bs, 1, seq_len)
    attention_weights <- attention_weights$unsqueeze(2)
    
    # resulting shape: (bs, 1, hidden_size)
    weighted_encoder_outputs <- torch_bmm(attention_weights, encoder_outputs)
    
    weighted_encoder_outputs
    
  },
  
  forward = function(x, state, encoder_outputs) {
 
    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    
    # resulting shape: (bs, 1, hidden_size)
    context <- self$weighted_encoder_outputs(state, encoder_outputs)
    
    # concatenate input and context
    # NOTE: this repeating is done to compensate for the absence of an embedding module
    # that, in NLP, would give x a higher proportion in the concatenation
    x_rep <- x$repeat_interleave(dim(context)[3], 3) 
    rnn_input <- torch_cat(list(x_rep, context), dim = 3)
    
    # resulting shapes: (bs, 1, hidden_size) and (1, bs, hidden_size)
    rnn_out <- self$rnn(rnn_input, state)
    rnn_output <- rnn_out[[1]]
    next_hidden <- rnn_out[[2]]
    
    mlp_input <- torch_cat(list(rnn_output$squeeze(2), context$squeeze(2), x$squeeze(2)), dim = 2)
    
    output <- self$linear(mlp_input)
    
    # shapes: (bs, 1) and (1, bs, hidden_size)
    list(output, next_hidden)
  }
  
)
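What weighted_encoder_outputs() computes is, per batch item, a weighted sum of the encoder outputs. A tiny worked example in base R, with hand-picked numbers (all values here are made up for illustration):

```r
# three time steps, two hidden features
encoder_outputs <- rbind(
  c(1, 0),
  c(3, 2),
  c(10, 10)
)

# weights sum to one; the third time step is ignored entirely
attention_weights <- c(0.5, 0.5, 0)

# (1 x timesteps) %*% (timesteps x hidden_dim) gives a (1 x hidden_dim) context
context <- as.numeric(attention_weights %*% encoder_outputs)
print(context)
```

Here the context is simply the average of the first two encoder outputs, namely (2, 1). In the module, torch_bmm() performs this same matrix product for every item in the batch at once.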

seq2seq module

The seq2seq module is basically unchanged (apart from the fact that now, it allows for attention module configuration). For a detailed explanation of what happens here, please consult the previous post.

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size, n_forecast, 
                        num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size, hidden_size = hidden_size,
                                   num_layers = num_layers, dropout = encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = 2 * hidden_size, hidden_size = hidden_size,
                                   attention_type = attention_type, attention_size = attention_size,
                                   num_layers = num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)
    encoded <- self$encoder(x)
    encoder_outputs <- encoded[[1]]
    hidden <- encoded[[2]]
    # first decoder call: fed the last observed input value
    # list of (batch_size, 1), (1, batch_size, hidden_size)
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden, encoder_outputs)
    # (batch_size, 1)
    pred <- out[[1]]
    # (1, batch_size, hidden_size)
    state <- out[[2]]
    outputs[ , 1] <- pred$squeeze(2)
    
    for (t in 2:self$n_forecast) {
      
      # decide whether to feed the ground truth or the previous prediction
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) y[ , t - 1, drop = FALSE] else pred
      input <- input$unsqueeze(3)
      out <- self$decoder(input, state, encoder_outputs)
      pred <- out[[1]]
      state <- out[[2]]
      outputs[ , t] <- pred$squeeze(2)
      
    }
    
    outputs
  }
  
)
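The teacher-forcing switch inside the forecasting loop can be isolated in a toy sketch. Below, toy_decoder() is a hypothetical stand-in that just adds 0.1 to its input, and run_decoder() is an illustrative helper, not part of the model; the point is only to show which value gets fed back at each step.

```r
# hypothetical stand-in for a decoder step: prediction = input + 0.1
toy_decoder <- function(input) input + 0.1

run_decoder <- function(y, last_input, teacher_forcing_ratio) {
  outputs <- numeric(length(y))
  # first step: fed the last observed input value
  pred <- toy_decoder(last_input)
  outputs[1] <- pred
  for (t in 2:length(y)) {
    # with probability teacher_forcing_ratio, feed the ground truth
    teacher_forcing <- runif(1) < teacher_forcing_ratio
    input <- if (teacher_forcing) y[t - 1] else pred
    pred <- toy_decoder(input)
    outputs[t] <- pred
  }
  outputs
}

y <- c(0.2, 0.4, 0.6, 0.8)   # made-up ground truth

# no forcing: every step builds on the previous prediction
no_forcing <- run_decoder(y, last_input = 0, teacher_forcing_ratio = 0)

# full forcing: every step builds on the previous ground-truth value
full_forcing <- run_decoder(y, last_input = 0, teacher_forcing_ratio = 1)

print(no_forcing)
print(full_forcing)
```

With a ratio of 0, errors can accumulate from step to step, which matches the situation at prediction time; with a ratio of 1, each step is corrected by the true previous value.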

When instantiating the top-level model, we now have an additional choice: that between additive and multiplicative attention. In the “accuracy” sense of performance, my tests did not show any differences. However, the multiplicative variant is a lot faster.

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, attention_type = "multiplicative",
                      attention_size = 8, n_forecast = n_forecast)

Just like last time, in model training, we get to choose the degree of teacher forcing. Below, we go with a fraction of 0.0, that is, no forcing at all.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 1000

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x, b$y, teacher_forcing_ratio)
  target <- b$y
  
  loss <- nnf_mse_loss(output, target[ , 1:(dim(output)[2])])
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x, b$y, teacher_forcing_ratio)
  target <- b$y
  
  loss <- nnf_mse_loss(output, target[ , 1:(dim(output)[2])])
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <- train_batch(b, teacher_forcing_ratio = 0.0)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("\nEpoch %d, training: loss: %3.5f \n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("\nEpoch %d, validation: loss: %3.5f \n", epoch, mean(valid_loss)))
}
# Epoch 1, training: loss: 0.83752 
# Epoch 1, validation: loss: 0.83167

# Epoch 2, training: loss: 0.72803 
# Epoch 2, validation: loss: 0.80804 

# ...
# ...

# Epoch 99, training: loss: 0.10385 
# Epoch 99, validation: loss: 0.21259 

# Epoch 100, training: loss: 0.10396 
# Epoch 100, validation: loss: 0.20975 

For visual inspection, we pick a few forecasts from the test set.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

vic_elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4)


coro::loop(for (b in test_dl) {

  output <- net(b$x, b$y, teacher_forcing_ratio = 0)
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_test) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[21]]
test_pred2 <- c(rep(NA, n_timesteps + 20), test_pred2, rep(NA, nrow(vic_elec_test) - 20 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[41]]
test_pred3 <- c(rep(NA, n_timesteps + 40), test_pred3, rep(NA, nrow(vic_elec_test) - 40 - n_timesteps - n_forecast))

test_pred4 <- test_preds[[61]]
test_pred4 <- c(rep(NA, n_timesteps + 60), test_pred4, rep(NA, nrow(vic_elec_test) - 60 - n_timesteps - n_forecast))

test_pred5 <- test_preds[[81]]
test_pred5 <- c(rep(NA, n_timesteps + 80), test_pred5, rep(NA, nrow(vic_elec_test) - 80 - n_timesteps - n_forecast))


preds_ts <- vic_elec_test %>%
  select(Demand, Date) %>%
  add_column(
    ex_1 = test_pred1 * train_sd + train_mean,
    ex_2 = test_pred2 * train_sd + train_mean,
    ex_3 = test_pred3 * train_sd + train_mean,
    ex_4 = test_pred4 * train_sd + train_mean,
    ex_5 = test_pred5 * train_sd + train_mean) %>%
  pivot_longer(-Date) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_color_hue(h = c(80, 300), l = 70) +
  theme_minimal()

Figure 1: A sample of two-weeks-ahead predictions for the test set, 2014.
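The padding and rescaling used to line up each forecast with the test period can be sketched on toy numbers. Below, all values (series length, window sizes, training statistics, and the forecast itself) are made up for illustration; only the structure of the c(rep(NA, ...), pred, rep(NA, ...)) call mirrors the plotting code.

```r
n_timesteps <- 3
n_forecast <- 3
n_obs <- 10                 # length of the (toy) test period

train_mean <- 100           # hypothetical training statistics
train_sd <- 20

pred_scaled <- c(-0.5, 0, 0.5)   # a standardized forecast
offset <- 2                      # forecast taken from the (offset + 1)-th test item

# pad with NA so that the forecast lines up with the dates it refers to
padded <- c(
  rep(NA, n_timesteps + offset),
  pred_scaled,
  rep(NA, n_obs - offset - n_timesteps - n_forecast)
)

# undo the standardization; NA entries stay NA
padded_demand <- padded * train_sd + train_mean
print(padded_demand)
```

Each padded vector has the same length as the test period, which is what allows it to be added as a column next to the actual demand.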

We can’t directly compare performance here to that of previous models in our series, as we’ve pragmatically redefined the task. The main goal, however, has been to introduce the concept of attention. Specifically, we showed how to implement the technique manually, something that, once you’ve understood the concept, you may never have to do in practice. Instead, you’ll likely make use of existing tools that come with torch (multi-head attention and transformer modules), tools we may introduce in a future “season” of this series.

Thanks for reading!

Photo by David Clode on Unsplash

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.
Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.” arXiv e-Prints, March, arXiv:2103.03404. https://arxiv.org/abs/2103.03404.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv e-Prints, June, arXiv:1706.03762. https://arxiv.org/abs/1706.03762.
Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. “Grammar as a Foreign Language.” CoRR abs/1412.7449. http://arxiv.org/abs/1412.7449.
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” CoRR abs/1502.03044. http://arxiv.org/abs/1502.03044.
