\subsection{Diffusion}
Another type of model that can be used to generatively model the NRV is the diffusion model. This model class is very popular for image generation. In the context of images, the diffusion model is trained by iteratively adding noise to a training image until only noise is left; the model then learns to reverse this diffusion process to recover the original image. To sample new images, a noise vector is drawn and iteratively denoised by the model, resulting in a new image.
This training process is not limited to images. An image is just a 2D grid of data points, while a time series can be seen as a 1D sequence of data points. The diffusion model can thus be trained on the NRV data to generate new samples for a given day based on a given input.
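Concretely, the standard DDPM formulation \cite{ho_denoising_2020} adds Gaussian noise according to a variance schedule $\beta_1, \dots, \beta_T$, which admits a closed form for jumping directly to step $t$:
\begin{equation}
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s).
\end{equation}
A model $\epsilon_\theta$ is trained to predict the injected noise $\epsilon$ from the noisy sample $x_t$, the step index $t$ and, in the conditional case, the input features $c$, by minimizing
\begin{equation}
\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2 \right].
\end{equation}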
Once trained, the diffusion model can generate new samples efficiently. Unlike autoregressive models, it generates all time steps of a sample in parallel, so it combines the parallel sample generation of non-autoregressive models with dependence between the quarter-hourly NRV values. A batch of noise vectors can be sampled and passed through the model together to generate new samples. Each generated sample contains the 96 NRV values for the next day without needing to sample every quarter hour sequentially.
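As an illustration, this batched denoising loop can be sketched in PyTorch as follows. The `model` interface, the linear noise schedule, and all names here are hypothetical; this is a minimal sketch of standard DDPM ancestral sampling, not the exact implementation used in the experiments.

```python
import torch

@torch.no_grad()
def sample_batch(model, cond, n_steps=300, series_len=96):
    """Generate a batch of full-day NRV series by iterative denoising.

    All samples in the batch are denoised in parallel at every step;
    no sequential per-quarter sampling is needed.
    """
    # Linear variance schedule (a common, assumed choice).
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure noise for the whole batch at once.
    x = torch.randn(cond.shape[0], series_len)
    for t in reversed(range(n_steps)):
        t_idx = torch.full((cond.shape[0],), t)
        eps = model(x, t_idx, cond)  # predicted noise, whole batch in one pass
        # DDPM posterior mean: remove a scaled version of the predicted noise.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:  # add fresh noise at every step except the last
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # shape [batch, series_len]
```

A batch of, say, 100 conditioning vectors then yields 100 full-day samples from a single run of this loop.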
The model is trained in a completely different way than the quantile regression models. A simple implementation of the Denoising Diffusion Probabilistic Model (DDPM) \cite{ho_denoising_2020} is used for the experiments. More complex implementations with more advanced techniques could be used to improve the results, but this is out of scope for this thesis. The goal is to show that more recent generative models can also model the NRV data; the results can then be compared to the quantile regression models to see whether the diffusion model generates better samples.
First of all, the model architecture needs to be chosen. The model takes multiple inputs: the noisy NRV time series, the positional encoding of the current denoising step, and the conditional input features. The model needs to predict the noise in the current time series, which can then be removed step by step based on this prediction. Multiple architectures can be used as long as they can predict the noise in the time series; here, a simple feedforward neural network is used, consisting of multiple linear layers with ReLU activation functions. To predict the noise in a noisy time series, the current denoising step index must also be provided. This integer is transformed into a vector using sine and cosine functions. The resulting positional encoding is concatenated with the noisy time series and the conditional input features, and this tensor is passed through the first linear layer and activation function of the network, producing a tensor of the chosen hidden size. Before passing this tensor to the next layer, the positional encoding and conditional input features are concatenated to it again. This process is repeated until the last layer is reached, which provides every layer in the network with the information needed to predict the noise. The output of the last layer is the predicted noise in the time series. The model is trained by minimizing the mean squared error between the predicted noise and the noise that was actually added to the time series.
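A minimal PyTorch sketch of such a network could look as follows. The class name, the embedding dimension, and the exact layer sizes are hypothetical choices for illustration; only the overall structure (sinusoidal step encoding, re-concatenation of conditioning at every layer, ReLU activations) follows the description above.

```python
import math
import torch
import torch.nn as nn

def step_embedding(t, dim):
    """Sinusoidal encoding of the denoising step index t (shape [batch])."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # [batch, dim]

class NoisePredictor(nn.Module):
    """Feedforward noise predictor; conditioning is concatenated before every layer."""
    def __init__(self, series_len=96, cond_dim=96, emb_dim=32, hidden=1024, layers=2):
        super().__init__()
        self.emb_dim = emb_dim
        extra = emb_dim + cond_dim  # re-appended at every layer
        in_dims = [series_len + extra] + [hidden + extra] * (layers - 1)
        self.hidden_layers = nn.ModuleList(nn.Linear(d, hidden) for d in in_dims)
        self.out = nn.Linear(hidden + extra, series_len)

    def forward(self, x_noisy, t, cond):
        emb = step_embedding(t, self.emb_dim)
        h = x_noisy
        for layer in self.hidden_layers:
            # Every layer sees the step encoding and the conditional features.
            h = torch.relu(layer(torch.cat([h, emb, cond], dim=-1)))
        return self.out(torch.cat([h, emb, cond], dim=-1))  # predicted noise
```

The output has the same shape as the noisy input series, as required for the noise-prediction objective.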
Other hyperparameters that need to be chosen are the number of denoising steps, the number of layers, and the hidden size of the neural network. Experiments are performed to gain insight into the influence these parameters have on model performance. The results are shown in Table \ref{tab:diffusion_results}.
\begin{figure}[h]
\centering
\begin{tikzpicture}
% Node for Image 1
\node (img1) {\includegraphics[width=0.45\textwidth]{images/diffusion/results/intermediates/Testing Intermediates 864_Sample intermediate 1_00000000.jpeg}};
% Node for Image 2 with an arrow from Image 1
\node[right=of img1] (img2) {\includegraphics[width=0.45\textwidth]{images/diffusion/results/intermediates/Testing Intermediates 864_Sample intermediate 2_00000000.jpeg}};
\draw[-latex] (img1) -- (img2);
% Node for Image 3 below Image 1 with an arrow from Image 2
\node[below=of img1] (img3) {\includegraphics[width=0.45\textwidth]{images/diffusion/results/intermediates/Testing Intermediates 864_Sample intermediate 3_00000000.jpeg}};
% Node for Image 4 with an arrow from Image 3
\node[right=of img3] (img4) {\includegraphics[width=0.45\textwidth]{images/diffusion/results/intermediates/Testing Intermediates 864_Sample intermediate 4_00000000.jpeg}};
\draw[-latex] (img3) -- (img4);
% Complex arrow from Image 2 to Image 3
\coordinate (Middle) at ($(img2.south)!0.5!(img3.north)$);
\draw[-latex] (img2.south) |- (Middle) -| (img3.north);
\end{tikzpicture}
\caption{Intermediate steps of the diffusion model for example 864 from the test set. The confidence intervals shown in the plots are made using 100 samples.}
\label{fig:diffusion_intermediates}
\end{figure}
In Figure \ref{fig:diffusion_intermediates}, multiple intermediate steps of the denoising process are shown for an example from the test set. The model starts from noisy full-day NRV samples, which can be seen in the first steps. These noisy samples are then denoised over multiple steps until realistic samples are generated, as shown in the last image of the figure. It can be observed that the confidence intervals become narrower over time as the noise is removed from the samples.
\begin{table}[H]
\centering
\begin{adjustbox}{width=\textwidth,center}
\begin{tabular}{@{}ccccccc@{}}
\toprule
Features & Diffusion Steps & Layers & Hidden Size & MSE & MAE & CRPS \\
\midrule
NRV & & & & & & \\
& 300 & 2 & 256 & 57129.71 & 185.56 & 81.00 \\
& 300 & 2 & 512 & 48364.77 & 169.39 & 79.13 \\
& 300 & 2 & 1024 & 43540.50 & 159.17 & 78.27 \\
& 300 & 2 & 2048 & 41946.52 & 155.85 & 78.19 \\
& 300 & 3 & 256 & 52741.73 & 177.09 & 79.55 \\
& 300 & 3 & 512 & 45048.05 & 161.89 & 78.46 \\
& 300 & 3 & 1024 & 42089.13 & 155.97 & 78.25 \\
& 300 & 3 & 2048 & 41797.63 & 154.69 & 78.05 \\
& 300 & 3 & 4096 & 39943.93 & 151.62 & \textbf{77.59} \\
& 300 & 4 & 256 & 56939.68 & 185.07 & 81.16 \\
& 300 & 4 & 512 & 46225.72 & 164.74 & 79.19 \\
& 300 & 4 & 1024 & 42984.02 & 157.54 & 77.92 \\
& 300 & 4 & 2048 & 41145.32 & 154.14 & 78.18 \\
\midrule
NRV + Load + Wind + PV + NP & & & & & & \\
& 300 & 2 & 256 & 63337.36 & 196.21 & 84.29 \\
& 300 & 2 & 512 & 52745.92 & 177.16 & 81.57 \\
& 300 & 2 & 1024 & 47178.91 & 166.89 & \textbf{80.30} \\
& 300 & 3 & 256 & 66148.13 & 200.34 & 85.31 \\
& 300 & 3 & 512 & 53159.99 & 178.46 & 81.95 \\
& 300 & 3 & 1024 & 47815.13 & 167.22 & 81.16 \\
& 300 & 3 & 2048 & 46448.90 & 164.50 & 81.06 \\
& 300 & 4 & 1024 & 47483.05 & 166.97 & 81.32 \\
& 300 & 4 & 2048 & 47076.77 & 166.06 & 81.06 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{Results of the simple diffusion model for different hyperparameter settings. The best CRPS per feature set is shown in bold.}
\label{tab:diffusion_results}
\end{table}
In Table \ref{tab:diffusion_results}, the results of the experiments for the diffusion model can be seen. The diffusion model used is a simple implementation of the Denoising Diffusion Probabilistic Model (DDPM) \cite{ho_denoising_2020}; the model itself consists of multiple linear layers with ReLU activation functions. The number of diffusion steps was set to 300 for the experiments. This number was determined by a few preliminary experiments with more and fewer steps, in which performance did not improve when more steps were used. This parameter could be optimized jointly with the other hyperparameters to find the best-performing model, but this would take a lot of time and is not the goal of this thesis.
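For reference, the training objective described above can be sketched in PyTorch as follows. The `model` interface and the linear schedule are hypothetical; this illustrates the standard DDPM recipe (noise a clean series to a random step, regress the injected noise with an MSE loss) rather than the exact training code used.

```python
import torch

def ddpm_training_step(model, x0, cond, n_steps=300):
    """One DDPM training step on a batch of clean series x0 with conditioning cond."""
    # Assumed linear variance schedule, matching the sampling side.
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    # Pick a random denoising step per sample and noise x0 in closed form.
    t = torch.randint(0, n_steps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].unsqueeze(-1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

    # Regress the injected noise with a mean squared error loss.
    pred = model(x_t, t, cond)
    return torch.nn.functional.mse_loss(pred, noise)
```

In a training loop, the returned loss would be backpropagated and an optimizer step taken, as usual.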
The first observation is that the error metrics are higher when more input features are used. This is counterintuitive, because the model has more information available to generate the samples. The reason for this behavior is not immediately clear. One possible cause is suboptimal conditioning: the input features are currently passed to every layer of the model together with the time series that needs to be denoised. The model could be improved with more advanced conditioning mechanisms such as classifier guidance \cite{dhariwal_diffusion_2021} or classifier-free guidance \cite{ho_classifier-free_2022}.
\begin{figure}[ht]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_864.jpeg}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_4320.jpeg}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_6336.jpeg}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_7008.jpeg}
\end{subfigure}
\caption{The plots show generated samples for four examples from the test set. The diffusion model used to generate the samples consists of 2 layers with a hidden size of 1024, and the number of denoising steps is set to 300. The confidence intervals shown in the plots are made using 100 samples. All available input features are used, which include the \acs{NRV}, Load, Wind, \acs{PV} and \acs{NP} data.}
\label{fig:diffusion_test_set_examples}
\end{figure}
Figure \ref{fig:diffusion_test_set_examples} shows samples generated by the diffusion model for examples from the test set. The first observation that can be made from these plots is the narrow confidence intervals: the real NRV values are not always captured, so not enough variance is present in the generated samples. This issue originates from overfitting during the training of the model. The model is, however, capable of capturing the general trend of the NRV data, and in some cases the peaks in the generated samples are very close to the real NRV values, as can be seen in the first example in the figure.
\begin{figure}[ht]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_864_Only_NRV.jpeg}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_864.jpeg}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_4320_Only_NRV.jpeg}
\caption{Only NRV}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\includegraphics[width=\textwidth]{images/diffusion/results/samples/Diffusion_Test_Example_4320.jpeg}
\caption{NRV + Load + Wind + PV + NP}
\end{subfigure}
\caption{The plots show generated samples for two examples from the test set. Two diffusion models with 2 layers and 1024 hidden units are used: the first is conditioned only on the NRV of the previous day, while the second uses all available input features.}
\label{fig:diffusion_test_set_example_only_nrv_vs_all}
\end{figure}
The plots in Figure \ref{fig:diffusion_test_set_example_only_nrv_vs_all} show the difference between samples generated when only the NRV data is used as input and when all available input features are used. The model conditioned only on the NRV data generates samples with little variance: the confidence intervals are quite smooth and do not contain many peaks. The model trained on all available input features behaves differently: its confidence intervals contain more peaks and its samples show more variance. This indicates that the model does take the other input features into account when generating samples. Looking at the metrics, however, the model that uses all input features performs worse than the model that only uses the NRV data. The most obvious explanation for this behavior is overfitting. Another possible reason is the way the input features are used in the model: they are concatenated with the NRV data at every layer of the neural network, which may not be the most effective conditioning mechanism.