\section{Electricity market}
The electricity market consists of many different parties, each with its own role and commercial interest. An overview of the most important parties can be found in Table \ref{tab:parties}.

% table
\begin{table}[h]
\centering
\begin{tabularx}{\textwidth}{|C|C|}
\hline
\textbf{Party} & \textbf{Description} \\
\hline
Producers & Generate electricity, for example from coal, nuclear energy or wind parks. \\
\hline
Consumers & Use electricity. These range from households to companies and industry. \\
\hline
\acf{TSO} & Party responsible for the reliable transmission of electricity from generation plants to local distribution networks over the high-voltage grid. In Belgium, this party is Elia. \\
\hline
\acf{DSO} & Party responsible for the distribution of electricity to the end users over the low-voltage grid. \\
\hline
\acf{BRP} & Parties that forecast the electricity consumption and generation of their clients and submit balanced nominations to Elia. \\
\hline
\acf{BSP} & Parties that provide the \ac{TSO} (Elia) with balancing services. They submit balancing energy bids to Elia and, when activated, provide balancing energy at the agreed price. \\
\hline
\end{tabularx}
\caption{Overview of the most important parties in the electricity market}
\label{tab:parties}
\end{table}

The most important aspect of the electricity market is that the grid must be balanced at all times: the amount of electricity consumed and generated must always be equal. If this is not the case, the grid can become unstable, which can lead to blackouts and damage equipment. One company, the \ac{TSO}, is responsible for keeping the grid balanced; in Belgium this is Elia. The TSO keeps the grid balanced by activating reserves when needed. These reserves are expensive, however, and must be paid for by the market participants. The price paid for the activation of these reserves is called the imbalance price.

At every access point of the grid, there is a designated \acf{BRP}. This party may be a producer, major consumer, energy supplier or trader. The BRP must take all reasonable measures to maintain the balance between injections, offtakes and commercial power trades within its portfolio. Each day, the BRP submits a daily balance schedule for the next day to the TSO. This schedule contains the expected physical injections into and offtakes from the grid, as well as the commercial power trades with other BRPs or other countries. These schedules are forecasts and are not always accurate: many factors influence the production and consumption of electricity, such as the weather, the economy and the time of day. The BRP must take all reasonable measures to be balanced on a quarter-hourly basis, for example through day-ahead or intra-day trading with other BRPs. If the BRP is not balanced for a certain quarter-hour, it pays the imbalance price for the deviation. The imbalance of a BRP is the quarter-hourly difference between its total injections into and offtakes from the grid.

The imbalance price, which is a crucial factor in the management of electricity grids, is set by the TSO. This price is calculated based on the total imbalance within the grid. The net regulation volume (NRV) plays a key role in this process: it represents the amount of energy that Elia, the Belgian TSO, activates to keep the electricity grid in the Elia control area stable and balanced.

The Area Control Error (ACE) is another important concept in this context. It is the discrepancy between the scheduled and the actual power exchanges in the Belgian control area; essentially, it measures how much the actual conditions deviate from what was anticipated.

The System Imbalance (SI) is obtained by subtracting the NRV from the ACE. The SI directly determines the imbalance price: the TSO uses its magnitude to set a price that allocates the balancing costs to market participants according to their contribution to the overall grid imbalance. By calculating the imbalance price in this way, the TSO incentivizes market participants to adhere closely to their scheduled injections and offtakes, thereby promoting grid stability and reliability.

The TSO can activate reserves to maintain grid stability; these reserves are supplied by entities known as Balancing Service Providers (BSPs). BSPs are crucial participants in the electricity market, as they provide the reserve capacity that the TSO can call upon in times of need. Each BSP submits bids to the TSO for the potential activation of these reserves. These bids include several key components: the specific type of reserve being offered, the total volume of energy available for activation (in MWh), the price per MWh at which the BSP is willing to provide this reserve, and a start price at which the reserve's deployment is initiated. Through this bidding process, the TSO selects the most cost-effective and appropriate offers to keep the grid stable and balanced.

Elia, the \ac{TSO} in Belgium, maintains grid stability by activating three types of reserves, each designed to address specific imbalance conditions. These reserves ensure that the electricity supply continuously meets the demand, thereby keeping the frequency within the required operational limits. The reserves are:

1) \textbf{\acf{FCR}} \\
FCR responds automatically to frequency deviations in the grid within seconds, providing a response proportional to the deviation. A minimum share of this volume must be provided within the Belgian control area. This volume can also be offered by the \acsp{BSP}.

2) \textbf{\acf{aFRR}} \\
aFRR is the second reserve that Elia can activate to restore the frequency to 50 Hz; it is activated when the FCR is not sufficient. Every 4 seconds, Elia sends a set-point to the BSPs, who adjust their production or consumption accordingly. The BSPs have a 7.5-minute window to activate the full requested energy volume. This reserve can also be offered by the BSPs.

3) \textbf{\acf{mFRR}} \\
Sometimes the FCR and aFRR are not enough to resolve the imbalance between generation and consumption. Elia activates the mFRR manually, and the requested energy volume must be activated within 15 minutes. This is the slowest reserve and is used when the other reserves are not sufficient. It can also be offered by the BSPs.

The reserves are activated in the order FCR, aFRR, mFRR, according to their response times: the FCR responds automatically within seconds, the aFRR reaches full activation within 7.5 minutes, and the mFRR within 15 minutes. Activating the reserves in this order ensures that the grid remains stable and that the frequency stays within the required operational limits.

Elia selects the bids first by reserve type, following the order of activation, and then by price. The highest marginal price paid for upward or downward activation determines the imbalance price; in other words, the last activated bid sets the price. The imbalance price calculation is shown in Table \ref{tab:imbalance_price}. Four scenarios are possible: the System Imbalance (SI) can be positive or negative, and the imbalance of the balance responsible party can be positive or negative. These factors determine the direction of the payments: either the BRP pays Elia for its imbalance, or Elia pays the BRP. A positive imbalance corresponds to a surplus of injections into the grid; a negative imbalance indicates a deficit of injections or an excess of offtakes.

% list the scenarios
\begin{itemize}
\item \textbf{Positive SI + Positive BRP Imbalance} \\
The BRP injects more energy into the grid than it takes out and thus has a positive imbalance. The System Imbalance is also positive, meaning the grid has a surplus of injections. Elia pays the BRP for the surplus injections at the marginal price of downward activation (MDP) minus an extra parameter \(\alpha\).
\item \textbf{Positive SI + Negative BRP Imbalance} \\
The BRP takes more energy out of the grid than it injects and thus has a negative imbalance. The System Imbalance is positive, meaning the grid has a surplus of injections, and Elia must activate downward reserves to balance the grid. The BRP pays Elia for its deficit at the marginal price of downward activation (MDP) minus the extra parameter \(\alpha\).
\item \textbf{Negative SI + Positive BRP Imbalance} \\
The BRP injects more energy into the grid than it takes out and thus has a positive imbalance. The System Imbalance is negative, meaning the grid has a deficit of injections, and Elia must activate upward reserves to balance the grid. Elia pays the BRP for the surplus injections at the marginal price of upward activation (MIP) plus the extra parameter \(\alpha\).
\item \textbf{Negative SI + Negative BRP Imbalance} \\
The BRP takes more energy out of the grid than it injects and thus has a negative imbalance. The System Imbalance is negative, meaning the grid has a deficit of injections. The BRP pays Elia for its deficit of injections or surplus of offtakes at the marginal price of upward activation (MIP) plus the extra parameter \(\alpha\).
\end{itemize}

\begin{table}[h]
\centering
\begin{tabular}{|c|c|c|}
\hline
& \multicolumn{2}{c|}{\textbf{System Imbalance}} \\
\cline{2-3}
\textbf{Imbalance of the balance responsible party} & \textbf{Positive} & \textbf{Negative or zero} \\
\hline
\textbf{Positive} & MDP - \(\alpha\) & MIP + \(\alpha\) \\
\hline
\textbf{Negative} & MDP - \(\alpha\) & MIP + \(\alpha\) \\
\hline
\end{tabular}
\caption{Imbalance prices applied to the BRPs}
\label{tab:imbalance_price}
\end{table}

The imbalance price calculation includes the following variables: \\
- MDP: Marginal price of downward activation \\
- MIP: Marginal price of upward activation \\
- \(\alpha\): Extra parameter dependent on the System Imbalance

% TODO: Add more information about the imbalance price calculation, alpha?

Given the bids of the BSPs for a certain quarter-hour or day, and knowing the System Imbalance, the imbalance price can be reconstructed using the calculation published by Elia. In this thesis, the System Imbalance is assumed to be approximately equal to the Net Regulation Volume; this is a simplification, but a reasonable approximation. The goal of this thesis is to model the Net Regulation Volume, which can then be used to reconstruct the imbalance price and to decide when to buy or sell electricity.
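
As a small illustration of the price selection in Table \ref{tab:imbalance_price}, the sketch below reconstructs the imbalance price from the marginal activation prices and the sign of the System Imbalance. It is a minimal sketch: the function name is hypothetical, and it assumes MDP, MIP and \(\alpha\) are already known for the quarter-hour under consideration.

\begin{verbatim}
def imbalance_price(si, mdp, mip, alpha):
    """Imbalance price per Table 'tab:imbalance_price'.
    si: System Imbalance (MW); mdp, mip, alpha: EUR/MWh."""
    if si > 0:
        # Surplus of injections: downward regulation sets the price.
        return mdp - alpha
    # Deficit (negative or zero SI): upward regulation sets the price.
    return mip + alpha
\end{verbatim}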
\section{Generative modeling}

Forecasting the imbalance price is a difficult task. The price is influenced by many factors, such as the weather and the time of day, but also by the formulas the TSO uses to calculate it. These formulas can change, which changes the imbalance price distribution and makes it hard to train a model on historical data. An alternative is to forecast the Net Regulation Volume (NRV) and then apply the formulas provided by the TSO to calculate the imbalance price. This way, the model only needs to learn the NRV distribution instead of the imbalance price distribution.

Forecasting the NRV directly raises another problem. Forecasting a time series is difficult because of the uncertainty in the data and the many factors that influence it. A single point forecast of the NRV is often inaccurate, and a policy defined on such a forecast will lead to wrong decisions. A better method is to model the NRV and sample multiple full-day generations of it. This gives a better picture of the uncertainty of the NRV, and better decisions can be made based on these multiple generations.

Generative modeling is a type of machine learning used to generate new data samples that resemble the training data. The goal is to learn the true data distribution and use it to generate new samples. Generative modeling is applied in many fields, including image, text and audio generation.

In this thesis, generative modeling is used to model the NRV of the Belgian electricity market, conditioned on input features such as the weather and the load forecast. The model can then generate new full-day samples of the NRV, which can be used to make better decisions on when to buy or sell electricity.

Many different types of generative models exist. Some of the most popular ones are:
\begin{itemize}
\item Generative Adversarial Networks (GANs)
\item Variational Autoencoders (VAEs)
\item Normalizing Flows
\item Diffusion models
\end{itemize}

\subsection{Quantile Regression}
Any feedforward neural network can be used to output a distribution over the target values instead of a single forecast. For example, if the distribution is assumed to be normal, the model can output the mean and the variance of the target value. Multiple samples can then be drawn from this output distribution, which can be used to generate multiple full-day generations of the NRV.
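
As a sketch of this idea (layer sizes and names are assumptions), a network can output a mean and a variance per target and be trained with the Gaussian negative log-likelihood:

\begin{verbatim}
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Outputs mean and variance of a normal distribution
    over the target value (here: the NRV of one quarter-hour)."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, 1)
        self.log_var = nn.Linear(hidden, 1)  # log keeps the variance positive

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h).exp()

def gaussian_nll(mean, var, y):
    # Negative log-likelihood of y under N(mean, var), up to a constant.
    return 0.5 * (torch.log(var) + (y - mean) ** 2 / var).mean()
\end{verbatim}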

This method requires that the distribution of the target values be known in advance, or at least assumed. However, this distribution is often unknown. Fortunately, there is an alternative approach that can estimate the distribution of the target values without prior knowledge of its shape: quantile regression.

Quantile regression uses a neural network to estimate multiple quantiles of the target values. A quantile is the value of a random variable below which a certain proportion of observations fall; for example, 25\% of the observations fall below the 25\% quantile. By estimating multiple quantiles, the distribution of the target values can be reconstructed. For each quarter-hour of the day, the model estimates the quantiles of the NRV, a distribution is reconstructed from them, and samples are drawn from this distribution. This way, multiple full-day generations of the NRV can be generated.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/quantile_regression/cdf_quantiles_example.png}
\caption{Example of a cumulative distribution function and some quantiles. The quantiles are the values below which a certain proportion of observations fall.}
\label{fig:quantile_example}
\end{figure}

The model outputs quantiles that can be used to reconstruct the cumulative distribution function of a target NRV value, from which the NRV value for a quarter-hour can then be sampled. An example of the output of a quantile regression model is shown in Figure \ref{fig:quantile_regression_example}: the output values of the different quantiles are plotted and interpolated to obtain the cumulative distribution function. In this thesis, the quantiles used are 1\%, 5\%, 10\%, 15\%, 30\%, 40\%, 50\%, 60\%, 70\%, 85\%, 90\%, 95\%, and 99\%. These are chosen to get a good approximation of the cumulative distribution function, with extra quantiles at the tails because the extremes of the distribution matter most for the imbalance price calculation.
% TODO: edges important?
% TODO: figure goes under 0, maybe use other values or other interpolation? + inverse the values to real values

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/quantile_regression/reconstructed_cdf.png}
\caption{Example of quantile regression output for one quarter-hour of the NRV, showing interpolated values for the quantiles at 1\%, 5\%, 10\%, 15\%, 30\%, 40\%, 50\%, 60\%, 70\%, 85\%, 90\%, 95\%, and 99\%. These quantiles are used to reconstruct the cumulative distribution function.}
\label{fig:quantile_regression_example}
\end{figure}

The NRV value for a quarter-hour is sampled from the reconstructed cumulative distribution function. A full-day generation of the NRV consists of 96 values, so 96 cumulative distribution functions must be reconstructed and one sample drawn from each of them.
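
The sketch below shows how such a draw could be implemented with inverse-transform sampling: a uniform probability is drawn and the interpolated quantile curve (the approximate inverse CDF) is evaluated at it. The function name is hypothetical; since the tails beyond the outermost quantiles are not modeled, the uniform draw is restricted to the 1\%--99\% range.

\begin{verbatim}
import numpy as np

QUANTILES = np.array([0.01, 0.05, 0.10, 0.15, 0.30, 0.40, 0.50,
                      0.60, 0.70, 0.85, 0.90, 0.95, 0.99])

def sample_from_quantiles(q_values, n_samples=1, rng=None):
    """Inverse-transform sampling from quantile predictions.
    q_values: the 13 predicted quantile values for one quarter-hour."""
    rng = rng if rng is not None else np.random.default_rng()
    q_values = np.sort(q_values)  # enforce monotone quantiles
    u = rng.uniform(QUANTILES[0], QUANTILES[-1], size=n_samples)
    # Interpolate the inverse CDF (probability -> value) at u.
    return np.interp(u, QUANTILES, q_values)

# Full-day generation: one draw per quarter-hour, 96 in total.
# day_q has shape (96, 13):
# nrv_day = [sample_from_quantiles(q)[0] for q in day_q]
\end{verbatim}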

The model needs to learn the quantiles of the NRV values. These, however, are not available in the training data: only the historical NRV values are known. A special loss function, the pinball loss, is therefore used to train the model to output the quantiles. It is defined as:
\begin{equation}
L_\tau(y, \hat{y}) = \begin{cases}
\tau(y - \hat{y}) & \text{if } y \geq \hat{y} \\
(1 - \tau)(\hat{y} - y) & \text{if } y < \hat{y}
\end{cases}
\end{equation}
\begin{align*}
\textbf{Where:} \\
\tau & = \text{Quantile of interest} \\
y & = \text{Actual observed value of the NRV} \\
\hat{y} & = \text{Predicted quantile value of the NRV}
\end{align*}

The loss function penalizes underestimation and overestimation of the quantile predictions differently. When a predicted quantile is lower than or equal to the actual value, the loss is the difference between the actual value and the predicted quantile multiplied by the quantile of interest $\tau$. Underestimations are thus penalized more heavily for high quantiles than for low quantiles, as $\tau$ is larger for higher quantiles.

When the predicted quantile is higher than the real NRV value, the loss is the difference between the predicted quantile and the real NRV multiplied by $(1-\tau)$. Overestimations are thus penalized less for high quantiles of interest.

The total training loss sums the pinball loss over the quantiles of interest and averages it over the samples:
\begin{equation}
L = \frac{1}{N} \sum_{i=1}^{N} \sum_{\tau \in T} L_\tau(y_i, \hat{y}_i)
\end{equation}

\begin{align*}
\textbf{Where:} \\
N & = \text{Number of samples} \\
T & = \text{Set of quantiles of interest} \\
y_i & = \text{Actual observed value of the NRV for sample } i \\
\hat{y}_i & = \text{Predicted quantile value of the NRV for sample } i
\end{align*}

To calculate the total pinball loss, the per-quantile losses are summed over the quantiles of interest and averaged over the samples. This results in a scalar loss value that can be used for backpropagation. A lower pinball loss indicates a better modeling of the NRV distribution.
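
A vectorized form of the pinball loss, as a sketch in PyTorch (the shapes are assumptions): for predictions of all quantiles at once, both cases of the definition collapse into an elementwise maximum.

\begin{verbatim}
import torch

def pinball_loss(y, y_hat, taus):
    """y: (N, 1) observed NRV; y_hat: (N, Q) predicted quantiles,
    one column per quantile level; taus: (Q,) quantile levels."""
    diff = y - y_hat  # positive where the quantile underestimates
    # max(tau * diff, (tau - 1) * diff) equals the two-case definition.
    return torch.maximum(taus * diff, (taus - 1) * diff).mean()
\end{verbatim}

Note that this version also averages over the quantiles, which only rescales the loss by a constant factor and does not change the optimum.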

\subsection{Autoregressive vs Non-Autoregressive models}

Generative models can be broadly classified into two types: autoregressive and non-autoregressive models.

Autoregressive models generate samples sequentially, one step at a time. At each step, the model generates the next value based on the previously generated values. This sequential process naturally captures the dependencies between values, but it also makes sample generation slower, as each value must be generated in order.

Non-autoregressive models, in contrast, generate the entire sample in a single step: all values are produced simultaneously, allowing for parallel generation. This significantly speeds up sample generation compared to autoregressive models. However, these models are more complex and harder to train, since they must predict all values of the sample at once rather than one value at a time.

Quantile regression can be applied to both types of models. For autoregressive models, the model outputs the quantiles for the next time step based on the given input features. From these quantiles, the cumulative distribution function (CDF) is reconstructed and used to sample the NRV value. To obtain a full-day sample, the model runs sequentially for each quarter-hour, resulting in 96 iterations per day, where each sample for the next quarter-hour depends on the sample of the previous one.

For non-autoregressive models, the model outputs the quantiles for all quarter-hours of the day simultaneously based on the input features. The CDFs for each quarter-hour are reconstructed, and samples are drawn from these distributions. Since the samples are generated in parallel, they are independent of each other. This independence can lead to unrealistic samples, as the sample for one quarter-hour does not depend on the sample of the previous one.

The input features for autoregressive and non-autoregressive models also differ. When using forecasted features, the autoregressive model uses forecasted values for the next quarter-hour only, while the non-autoregressive model uses forecasted values for all quarter-hours of the day. Although the autoregressive model could in theory use forecasts further into the future, this complicates practical application: predicting the last quarter-hour of a day would require forecasted values for the next day, which may not be available. Therefore, in this thesis, the autoregressive model is only given forecasted values for the next quarter-hour.
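
The autoregressive generation loop could look as follows. This is a sketch under stated assumptions: \texttt{model.predict} is a hypothetical interface returning the 13 quantile values for the next quarter-hour, and \texttt{sample\_from\_quantiles} is the inverse-CDF helper sketched earlier.

\begin{verbatim}
import numpy as np

def generate_day_autoregressive(model, first_nrv, day_features, rng):
    """One full-day NRV sample: 96 sequential inverse-CDF draws,
    each conditioned on the draw of the previous quarter-hour."""
    samples, prev_nrv = [], first_nrv
    for t in range(96):
        q_values = model.predict(prev_nrv, day_features[t])
        nrv_t = sample_from_quantiles(q_values, rng=rng)[0]
        samples.append(nrv_t)
        prev_nrv = nrv_t  # feed the sampled value back in
    return np.array(samples)
\end{verbatim}

A non-autoregressive model would instead output a (96, 13) array of quantiles in one call, after which the 96 draws are made independently.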

\subsection{Model Types}
\subsubsection{Linear Model}
A simple linear model can be used as a baseline against which the more complex models are compared. This model assumes a linear relation between the input features and the output:
\begin{equation}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n
\end{equation}
\begin{align*}
\textbf{Where:} \\
y & = \text{Output value} \\
\beta_0 & = \text{Intercept} \\
\beta_1, ..., \beta_n & = \text{Coefficients} \\
x_1, ..., x_n & = \text{Input features}
\end{align*}

To use this model for quantile regression, it must output the quantiles of the target value. This can be done by training one linear model per quantile, using the pinball loss function. The number of parameters, $\text{number of quantiles} \times (\text{number of input features} + 1)$, is low, which makes the model easy and fast to train. The downside is that it is very simple and might not capture the complexity of the data.

\begin{equation}
\hat{y}_\tau = \beta_{0, \tau} + \beta_{1, \tau} x_1 + \beta_{2, \tau} x_2 + ... + \beta_{n, \tau} x_n
\end{equation}
\begin{align*}
\textbf{Where:} \\
\tau & = \text{Quantile of interest} \\
\hat{y}_\tau & = \text{Predicted quantile value for the target value} \\
\beta_{0, \tau} & = \text{Intercept for the quantile of interest} \\
\beta_{1, \tau}, ..., \beta_{n, \tau} & = \text{Coefficients for the quantile of interest} \\
x_1, ..., x_n & = \text{Input features}
\end{align*}
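
In code, the set of per-quantile linear models is equivalent to a single linear layer with one output per quantile, trained with the pinball loss above. A minimal sketch (feature and quantile counts are assumed values):

\begin{verbatim}
import torch.nn as nn

N_FEATURES, N_QUANTILES = 20, 13  # assumed sizes

# Parameter count: N_QUANTILES * (N_FEATURES + 1),
# the "+1" being the intercept (bias) per quantile.
linear_qr = nn.Linear(N_FEATURES, N_QUANTILES)
\end{verbatim}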

\subsubsection{Non-Linear Model}
A more complex model can also be used to model the NRV: a feedforward neural network with multiple hidden layers and activation functions. Such a model can capture non-linear relationships between the input features and the output, but it has more parameters and is harder to train than the linear model. It also has hyperparameters that need to be chosen, such as the number of hidden layers, the number of neurons per layer and the activation function. The model is trained to output the quantiles of the NRV based on the input features, again using the pinball loss function.
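
A sketch of such a network, with the same assumed sizes as before and an arbitrary choice of depth, width and activation:

\begin{verbatim}
import torch.nn as nn

N_FEATURES, N_QUANTILES = 20, 13  # assumed sizes

mlp_qr = nn.Sequential(
    nn.Linear(N_FEATURES, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_QUANTILES),  # one output per quantile
)
\end{verbatim}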

\subsubsection{Recurrent Neural Network (RNN)}
Another, more complex model that can be used is the Recurrent Neural Network (RNN). RNNs suit the NRV data because of the sequential nature of the input features. An RNN keeps a hidden state that is updated at every time step with the new input data; this hidden state contains information about the previous time steps and is used to make predictions for the next one. These models are used in multiple fields, such as natural language processing and time series forecasting.

For the NRV data, the input features are structured so that the model can learn the sequential patterns in the data, and the model is trained to output the quantiles of the NRV using the pinball loss function.

Multiple types of RNNs exist; the two most common are the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). The GRU is a simpler version of the LSTM with fewer parameters, which results in faster training times, while still capturing long-term dependencies in the data and achieving performance similar to the LSTM. The GRU has two gates: the reset gate, which determines how much of the past information to forget, and the update gate, which determines how much of the new information to keep.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/quantile_regression/rnn/RNN_diagram.png}
\caption{RNN model input and output visualization}
\label{fig:rnn_model_visualization}
\end{figure}

The input features for the RNN model are structured to capture the relevant information from the previous quarter-hours and the forecasted values. Each input feature vector represents one quarter-hour and consists of the following components:

\begin{itemize}
\item The actual NRV value of the current quarter-hour (T-1), which provides the model with the historical context of the NRV.
\item The forecasted or real values for the next quarter-hour (T), including load, PV, wind, and net position. If the next quarter-hour is not the one to predict, the real values are used; if it is, the forecasted values are used.
\item A quarter embedding vector representing the current quarter-hour (T-1). This embedding gives the model information about the time of day, which helps it learn the daily patterns in the NRV data.
\end{itemize}

This input structure gives the model a comprehensive view of the previous quarter-hours and the forecasted values for the quarter-hour to predict. By incorporating both historical and forecasted information sequentially, the model can learn to predict the NRV quantiles for the next quarter-hour more accurately.
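
A sketch of such a GRU-based quantile model in PyTorch (hidden size, embedding size and the exact input layout are assumptions):

\begin{verbatim}
import torch
import torch.nn as nn

class GRUQuantile(nn.Module):
    """GRU over the quarter-hour sequence. Each step sees the
    previous NRV, the features for the next quarter-hour and a
    quarter-of-day embedding, and outputs the NRV quantiles."""
    def __init__(self, n_features, n_quantiles=13, hidden=64, emb=8):
        super().__init__()
        self.quarter_emb = nn.Embedding(96, emb)  # 96 quarter-hours/day
        self.gru = nn.GRU(1 + n_features + emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_quantiles)

    def forward(self, prev_nrv, features, quarter_idx):
        # prev_nrv: (B, T, 1); features: (B, T, F);
        # quarter_idx: (B, T) integer indices in [0, 96)
        x = torch.cat([prev_nrv, features,
                       self.quarter_emb(quarter_idx)], dim=-1)
        h, _ = self.gru(x)
        return self.head(h)  # (B, T, n_quantiles)
\end{verbatim}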

\subsection{Diffusion models}
\subsubsection{Overview}
Diffusion models are probabilistic models designed to generate high-quality, diverse samples from complex data distributions. They are trained in a unique way: an iterative noise process, the diffusion process, is applied to the data, and the model is trained to reverse it. A training sample is transformed into a noise sample by applying the diffusion process, and the model learns to recover the original sample from the noise, denoising the data a little in each iteration. The model is trained to maximize the likelihood of the data given the noise. Starting from pure noise, the trained model can then generate samples that look like the data. The model can also be conditioned on additional information to steer the generation.

\subsubsection{Applications}
Diffusion models gained popularity in the field of computer vision, where they are used for inpainting, super-resolution, image generation and image editing. The paper introducing Denoising Diffusion Probabilistic Models (DDPM) \parencite{ho_denoising_2020} showed that diffusion models can achieve state-of-the-art results in image generation. The approach was subsequently applied to other fields such as text and audio generation, but image generation remains the most popular application: products such as DALL·E, Stable Diffusion and Midjourney use diffusion models to generate or edit images based on a text description.

In this thesis, diffusion models are explored to model time series data conditioned on additional information. A small example of the diffusion process is shown in Figure \ref{fig:diffusion_example}: an image of a cat is generated by starting from noise and iteratively denoising it.

\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/diffusion/Generation-with-Diffusion-Models.png}
\caption{Example of the diffusion process. The image of a cat is generated by starting from noise and iteratively denoising the image.}
\label{fig:diffusion_example}
\end{figure}

\subsubsection{Generation process}
The generation process differs from that of other models. GANs and VAEs, for example, generate samples by drawing from a noise distribution and transforming the noise into a data-like sample in a single step using a generator network. Diffusion models instead start from a noise distribution and apply a series of denoising steps. The diffusion framework consists of three main components: the forward process, the reverse process and the training procedure, with conditioning as an optional extension.

\begin{itemize}
\item \textbf{Forward process} \\
The forward process is a Markov chain that starts from the data and applies a series of diffusion steps. In each of the $T$ time steps, Gaussian noise is added to the data according to a variance schedule $\beta_1, ..., \beta_T$:

$q(\mathbf{x}_{1:T}|\mathbf{x}_0) \coloneqq \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1}) \quad$ with $\quad q(\mathbf{x}_t|\mathbf{x}_{t-1}) \coloneqq \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t\mathbf{I})$

This formula shows that the distribution of the noisy sequence after $T$ diffusion steps is the product of the transition probabilities at each step $t$. The noise added at time step $t$ follows a Gaussian distribution with mean $\sqrt{1-\beta_t}\mathbf{x}_{t-1}$ and variance $\beta_t\mathbf{I}$. The variance schedule $\beta_1, ..., \beta_T$ is a hyperparameter that is chosen or optimized during training.

\item \textbf{Reverse process} \\
The diffusion process must then be reversed. The model is trained to model the denoising distribution given the noisy data and the time step:

$p_{\theta}(\mathbf{x}_{0:T}) \coloneqq p(\mathbf{x}_T) \prod_{t=1}^{T} p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t) \quad$ with $\quad p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t) \coloneqq \mathcal{N}(\mathbf{x}_{t-1}; \mu_{\theta}(\mathbf{x}_t, t), \Sigma_{\theta}(\mathbf{x}_t, t))$

In the reverse process, each step aims to undo the diffusion by estimating what the previous, less noisy state might have been. This is done using a series of conditional Gaussian distributions $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$. For each of these Gaussians, a neural network with parameters $\theta$ estimates the mean $\mu_{\theta}(\mathbf{x}_t, t)$ and the covariance $\Sigma_{\theta}(\mathbf{x}_t, t)$. The joint distribution $p_{\theta}(\mathbf{x}_{0:T})$ is then the product of the marginal distribution of the last time step, $p(\mathbf{x}_T)$, and the conditional distributions $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ for each time step.

\item \textbf{Training} \\
% TODO: explain better!
The model is trained by optimizing a variational bound on the negative log-likelihood, also called the evidence lower bound (ELBO) in the context of generative models:
\begin{align*}
\log p(x) \geq & \ \mathbb{E}_{q(x_1 | x_0)} \left[ \log p_{\theta} (x_0 | x_1) \right] \\
& - D_{KL} \left( q(x_T | x_0) \,||\, p(x_T) \right) \\
& - \sum_{t=2}^{T} \mathbb{E}_{q(x_t | x_0)} \left[ D_{KL} \left( q(x_{t-1} | x_t, x_0) \,||\, p_{\theta}(x_{t-1} | x_t) \right) \right] \\
= & \ L_0 - L_T - \sum_{t=2}^{T} L_{t-1}
\end{align*}
Maximizing this bound amounts to minimizing, for each time step, the KL divergence between the true denoising distribution and the one modeled by the network. After further derivation, this objective can be simplified to minimizing the mean squared error between the noise predicted by the model and the noise actually added at each time step; a sketch of the resulting training step is given after this list.

\item \textbf{Conditioning} \\
The model can be conditioned on additional information to guide the generation process. In image generation, this is used to generate images of a certain class or with certain attributes. It requires some changes to the model architecture and training procedure; a simple way is to concatenate the additional information to the input of the model, so that it learns to generate samples from the data distribution conditioned on that information.
\end{itemize}
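
As an illustration of the simplified objective, a single DDPM training step could look as follows. This is a sketch, not the exact implementation used in this thesis: the model signature is hypothetical, and \texttt{alpha\_bar} holds the precomputed cumulative products $\bar{\alpha}_t = \prod_{s \leq t}(1-\beta_s)$, which give the closed-form forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$.

\begin{verbatim}
import torch

def ddpm_training_step(model, x0, cond, T, alpha_bar):
    """Simplified DDPM loss: predict the added noise with MSE.
    x0: (B, D) clean samples; cond: conditioning features."""
    t = torch.randint(0, T, (x0.shape[0],))     # random timestep per sample
    eps = torch.randn_like(x0)                  # noise to be predicted
    a = alpha_bar[t].view(-1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps  # closed-form forward process
    eps_hat = model(x_t, t, cond)               # conditioned noise prediction
    return ((eps - eps_hat) ** 2).mean()
\end{verbatim}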

The diffusion process is illustrated in Figure \ref{fig:diffusion_process}. The model is trained to reverse this process: starting from noise, it learns to generate samples that look like the data.

\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{images/diffusion/diffusion_graphical_model.png}
\caption[Diffusion process]{Diffusion process \parencite{ho_denoising_2020}.}
\label{fig:diffusion_process}
\end{figure}

\subsection{Evaluation}
To evaluate the performance of the quantile regression models, multiple metrics can be used. The pinball loss itself can be used to compare models on the test set. Other usable metrics are the mean absolute error (MAE) and the mean squared error (MSE): multiple full-day NRV samples are generated for each day of the test set, the error metrics are calculated for each sample, and the mean over the samples gives a single value per metric.

The MAE does not consider the direction of the error; it is the average of the absolute differences between the predicted and actual values. For full-day NRV samples, the formula is:
\begin{equation}
MAE = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{96} \sum_{j=1}^{96} |y_{ij} - \hat{y}_{ij}|
\end{equation}

\begin{align*}
\textbf{Where:} \\
N & = \text{Number of samples} \\
y_{ij} & = \text{Actual observed value of the NRV for sample } i \text{ and quarter-hour } j \\
\hat{y}_{ij} & = \text{Sampled value of the NRV for sample } i \text{ and quarter-hour } j
\end{align*}

The MSE is more sensitive to outliers than the MAE because it squares the error between the predicted and actual values. For full-day NRV samples, the formula is:
\begin{equation}
MSE = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{96} \sum_{j=1}^{96} (y_{ij} - \hat{y}_{ij})^2
\end{equation}
The MAE and MSE compare only the sampled values to the real NRV, not the full predicted distribution. Evaluating the output distribution requires a different metric: the Continuous Ranked Probability Score (CRPS), which measures the accuracy of the predicted cumulative distribution function against the observed value and can be seen as a generalization of the MAE to probabilistic forecasts. The formula for the CRPS is:

\begin{equation}
CRPS(F, x) = \int_{-\infty}^{\infty} (F(y) - \mathbbm{1}(y - x))^2 \, dy
\end{equation}

\begin{align*}
\textbf{Where:} \\
F & = \text{Predicted cumulative distribution function} \\
x & = \text{Real NRV value} \\
\mathbbm{1}(x) & = \text{Heaviside function} = \begin{cases}
1 & \text{if } x \geq 0 \\
0 & \text{if } x < 0
\end{cases}
\end{align*}

The mean CRPS over the different days gives a single value; the lower this value, the better the NRV is modeled. The CRPS can be visualized as in Figure \ref{fig:crps_visualization}: it is the area between the predicted cumulative distribution function and the Heaviside step at the observed value, so a smaller area means a better model.
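
Numerically, the CRPS can be approximated on a grid, with the CDF interpolated from the predicted quantiles. A sketch (reusing the \texttt{QUANTILES} array from the sampling sketch; the grid padding of one unit is arbitrary):

\begin{verbatim}
import numpy as np

def crps_from_quantiles(q_values, x, grid_size=1000):
    """Approximate CRPS(F, x) by integrating (F(y) - H(y - x))^2,
    with F interpolated from the 13 predicted quantile values."""
    q_values = np.sort(q_values)
    lo = min(q_values[0], x) - 1.0
    hi = max(q_values[-1], x) + 1.0
    y = np.linspace(lo, hi, grid_size)
    F = np.interp(y, q_values, QUANTILES, left=0.0, right=1.0)
    H = (y >= x).astype(float)  # Heaviside step at the observed value
    f = (F - H) ** 2
    # Trapezoidal integration of the squared difference.
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y)))
\end{verbatim}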

% TODO: improve visualisation? -> real NRV + y-axis: cumulative probability
\begin{figure}[H]
\centering
\includegraphics[width=0.8\textwidth]{images/quantile_regression/crps_visualization.png}
\caption{Visualization of the CRPS metric}
\label{fig:crps_visualization}
\end{figure}