DL Trading


awesome-deep-trading Awesome List of code, papers, and resources for AI/deep learning/machine learning/neural networks applied to algorithmic trading.
Deep Learning and Machine Learning for Stock Predictions
PyTrader – Drag & Drop MT4 & MT5 Python API Connector for Metatrader
Keras: Deep Learning library for Theano and TensorFlow

Matplotlib Quick Guide
NumPy Quick Guide
NumPy reference_manual.html


Pandas User Guide
Pandas TA – A Technical Analysis Library in Python 3
Pandas Getting started
Introduction to pandas: Data Analysis in Python
Python for Finance, Part 1: Yahoo and Google Finance API, pandas and matplotlib
Python for Finance, Part 2: Introduction to Quantitative Trading Strategies
Python for Finance, Part 3: Moving Average Trading Strategy

Gluon Time Series (GluonTS) is the Gluon toolkit for probabilistic time series modeling, focusing on deep learning-based models.
Writing forecasting models in GluonTS with PyTorch
Interpretability in Safety-Critical Financial Trading Systems


How to Predict a Time Series Part 1
Forecasting Time Series with Multiple Seasonalities using TBATS in Python 2019
BATS and TBATS time series forecasting Package provides BATS and TBATS time series forecasting methods described in: De Livera, A.M., Hyndman, R.J., & Snyder, R. D. (2011), Forecasting time series with complex seasonal patterns using exponential smoothing, Journal of the American Statistical Association, 106(496), 1513-1527.
BATS and TBATS time series forecasting
Python’s Best Automated Time Series Models 2020
Automated Time Series Models in Python (AtsPy)
Install: pip install atspy
Automated models
ARIMA – Automated ARIMA modelling
Prophet – Modelling multiple seasonality with linear or non-linear growth
HWAAS – Exponential smoothing with additive trend and additive seasonality
HWAMS – Exponential smoothing with additive trend and multiplicative seasonality
NBEATS – Neural basis expansion analysis (now fixed at 20 epochs)
Gluonts – RNN-based model (now fixed at 20 epochs)
TATS – Seasonal and trend, no Box Cox
TBAT – Trend and Box Cox
TBATS1 – Trend, seasonal (one), and Box Cox
TBATP1 – TBATS1, but seasonal inference is hardcoded by periodicity
TBATS2 – TBATS1 with two seasonal periods
Why AtsPy?
Visual Parameter Tuning with Facebook Prophet and Python

ARIMA Model – Complete Guide to Time Series Forecasting in Python
How to Create an ARIMA Model for Time Series Forecasting in Python

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible tools to build, tune and validate time series models for multiple learning problems, including: Forecasting, Time series classification, Time series regression. For deep learning, see our companion package: sktime-dl.

Comparison of ARIMA, ETS, NNAR, TBATS and hybrid models to forecast the second wave of COVID-19 hospitalizations in Italy

https://arxiv.org/pdf/2105.06643.pdf 14 may 2021 Monash Time Series Forecasting Archive
Building RNN, LSTM, and GRU for time series using PyTorch 2021

Tutorial on Univariate Single-Step Style LSTM in Time Series Forecasting

LSTM is also superior in short term data until 94% compared with ARIMA model that only has 56% ??????
Timeseries Forecast Comparison – RNN vs Neural vs ARIMA vs Tbats, awanish kumar, Aug 26, 2019
1. Accuracy achieved by existing models like ARIMA, Neural and Tbats is much better than the RNN LSTM forecast.
2. It is a bit tricky to implement RNN LSTM compared to other traditional models.
3. The trade-off is negative: the effort needed to implement an RNN is not matched by the benefit obtained.
4. It is a wrong assumption that Deep Learning (RNN LSTM) will give a better result for every problem or business requirement.

Adler Haymans Manurung (1), Widodo Budiharto (2) and Harjanto Prabowo (3). (1) Management Department, BINUS Business School, Doctor of Research in Management; (2) Computer Science Department, School of Computer Science; (3) Computer Science Department, BINUS Graduate Program, Doctor of Computer Science. Bina Nusantara University, Jl. K. H. Syahdan No. 9, DKI Jakarta 11480, Indonesia. Adler.manurung@binus.ac.id; {wbudiharto; harprabowo}@binus.edu. Received May 2018; accepted August 2018
Abstract. Time series analysis has significance in financial analytics and market forecasting and it can be utilized in any field. For stockbrokers, understanding trends and forecasting supported by software are very important to decision making and reacting to changes in behavioral patterns. This paper proposes an algorithm and model for market forecasting in Indonesian exchange based on the Long Short-Term Memory (LSTM) and compared with ARIMA model. We use data from Bank Central Asia (BCA) from 2013-2018 obtained from Yahoo finance. In our experiments, we predict and simulate the important prices called Open, High, Low and Closing (OHLC) with various parameters. Based on the experiment, the best accurate prediction in LSTM comes from the short term (1 year) with high epoch in training phase rather than using 3 years or 5 years of training data, and our model has better result compared with popular model such as ARIMA. These results should be very useful to be used in stock exchange office. Keywords: LSTM, Forecasting, Stock market, Finance, Deep learning, ARIMA

https://www.robots.ox.ac.uk/~parg/pubs/theses/bernardo_orozco.pdf Recurrent Neural Networks for Time Series Prediction

https://github.com/bperezorozco/ordinal_tsf MOrdReD: Ordinal autoregression with recurrent neural networks Introduction This Python library accompanies our work. MOrdReD enables time series forecasting in an autoregressive and ordinal fashion. This simply means that each new sample is forecasted by looking at the last previous observations for some lookback T. Our framework provides an implementation of our ordinal autoregression framework (via Keras) described in the paper above; however, it also provides a flexible and amicable interface to set up time series forecasting tasks (parameter optimisation, model selection, model evaluation, long-term prediction, plotting) with either our prediction framework, or other well-established techniques, such as Gaussian Processes (via GPy) or Dynamic AR models (via statsmodels).

Time series stationarity and fractional differencing
ARIMA Model Python Example — Time Series Forecasting 2019 reg need

BATS and TBATS for time series forecasting


14 may 2021
4. Baseline Forecasting Models
In this section, we evaluate the performance of different baseline forecasting models
including 6 traditional univariate forecasting models and a global forecasting model over the
datasets in our repository using a fixed origin evaluation scheme, so that researchers that use
the data in our repository can directly benchmark their forecasting algorithms against these
baselines. The following 7 baseline forecasting methods are considered for the experiments:
• Exponential Smoothing (ETS, Hyndman, 2008)
• Auto-Regressive Integrated Moving Average (ARIMA, Box and Jenkins, 1990)
• Simple Exponential Smoothing (SES)
• Theta (Assimakopoulos and Nikolopoulos, 2000)
• Trigonometric Box-Cox ARMA Trend Seasonal (TBATS, Livera et al., 2011)
• Dynamic Harmonic Regression ARIMA (DHR-ARIMA, Hyndman, 2018)
• A globally trained Pooled Regression model (PR, Trapero et al., 2015)

Finally, we have evaluated the performance of seven baseline
forecasting models including six traditional univariate forecasting models: SES, Theta, ETS,
ARIMA, TBATS, DHR-ARIMA, and a global forecasting model, PR, over all datasets across
eight error metrics to enable other researchers to benchmark their own forecasting algorithms

What is the difference between forecasting methods?
Forecasting methods applied to the same dataset and forecasting over the same horizon give different results. What is the difference between Holt-Winters, ARIMA, TBATS (R function), BATS (R function) and ETS (R function)? I have applied these methods to my data and am trying to work out the reasons for the conflicting results. I want to know whether these methods are intended for specific purposes, i.e.
A) when the data is daily, weekly, monthly or yearly
B) a changing trend
C) different seasonality
D) complex trend or seasonality.
21 May 2015, Ariel Linden
Linden Consulting Group, LLC
I am attaching a link to an article I wrote that introduces all of these time series analysis methods. I think you will find that it answers your questions. Of course, the references will give more guidance if needed.
Hope this helps
Article: Evaluating disease management program effectiveness: an introductory …

https://github.com/topics/stock-trading stock-trading Here are 221 public repositories matching this topic.
Robin-Stocks API Library This library provides a pure python interface to interact with the Robinhood API, Gemini API, and TD Ameritrade API. The code is simple to use, easy to understand, and easy to modify

Medium Sofien Kaabar
Combining the SuperTrend Indicator With Moving Averages in a Trading Strategy
GitHub Trend-Following-Strategies-in-Python
Pandas TA – A Technical Analysis Library in Python 3

Python Tutorial
Intermediate Python
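Lectures at the Faculty of Mathematics and Mechanics, SPbU (St. Petersburg State University)
Mathematics lectures, Faculty of Mechanics and Mathematics, MSU (Moscow State University)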
SciPy is open-source software for mathematics, science, and engineering.
SciPy Tutorial
SciPy Cookbook
Numpy and Scipy Documentation
SciPy Python-based ecosystem of open-source software for mathematics, science, and engineering
Python Numpy Tutorial
Mayavi Python scripting for 3D plotting
Mayavi: mlab

How to Save a NumPy Array to File for Machine Learning
NumPy Array Shape
NumPy array object

Markowitzify will implement a variety of portfolio and stock/cryptocurrency analysis methods to optimize portfolios or trading strategies. The two primary classes are portfolio and stonks.

Forex Tutorial
github.com/twopirllc/pandas-ta Pandas TA – A Technical Analysis Library in Python 3
github.com/AtomMe/Quant.Strategy Quant.Strategy I will upload and update my quant strategies which include CTA strategy, stock strategy etc. Not only Python is the Programming language, but also MATLAB and VBA.

Welcome to python-binance v1.0.15

MyTT Technical Indicators implemented in Python only using Numpy-Pandas as Magic – Very Very Fast! to Stock Market Financial Technical Analysis Python library MyTT.py
github.com/AdamTibi/LSTM-FX Practical LSTM Time Series Prediction for Forex with TensorFlow and Algorithmic Bot
Here are 140 public repositories matching this topic.

Exchange-core is an open source market exchange core based on LMAX Disruptor, Eclipse Collections (ex. Goldman Sachs GS Collections), Real Logic Agrona, OpenHFT Chronicle-Wire, LZ4 Java, and Adaptive Radix Trees.
Start Building Your Trading Strategies in 5 Minutes With Python and MetaTrader The fastest way to integrate collecting and analyzing financial market data for your analysis.
github.com/arseniyturin/Capstone-Project Machine Learning for Day Trading
FinRL: Deep Reinforcement Learning for Quantitative Finance
FinRL for Quantitative Finance: Tutorial for Portfolio Allocation
Machine Learning for Day Trading


q-learning-trader / tensorflow-rl
q-learning-trader / q-trading-pytorch
AI4Finance-Foundation / FinRL

PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.


TF Keras
9 Things You Should Know About TensorFlow
Hello, TensorFlow. Google's Machine Learning Library 2016

Migrating from TensorFlow 1.x to TensorFlow 2

TensorFlow (2): Saving and Restoring a Model

Six Backtesting Frameworks for Python

Standard open-source backtesting platforms for Python usually share a number of common characteristics:

event-driven design;
flexible licensing without major restrictions;
an extensive set of built-in technical indicators;
standard functionality for computing performance metrics, visualization and report generation.


PyAlgoTrade is an established framework that supports both backtesting on historical data and running simulations in real time. It supports data from Yahoo! Finance, Google Finance, NinjaTrader and any source that provides data as CSV (for example, Quandl). It supports market, limit, stop and stop-limit orders.

PyAlgoTrade supports Bitcoin trading via Bitstamp, as well as real-time processing of Twitter data.
PyAlgoTrade GitHub
PyAlgoTrade 0.20 documentation

bt — Backtesting for Python

The creators of the bt framework aim to make it easier to develop easily testable, flexible and reusable blocks of trading strategy logic, which should make it possible to build complex automated financial applications.

The framework is suited to testing so-called portfolio-based strategies, which include algorithms for weighting and rebalancing a portfolio. Modifying a strategy to run on different time frames or with different instrument weights in the portfolio requires minimal code changes. In addition, bt is built on top of ffn, a popular financial library for Python.
bt – Flexible Backtesting for Python
Project page
License: MIT


This platform is exceptionally well documented; the developers run a blog and maintain an active online community whose members are happy to help answer questions. Backtrader supports various data formats, including CSV, Pandas DataFrames, real-time data feeds from several foreign brokers, and various iterators. Data from different sources can be processed simultaneously, even on different time frames.

License: GPL v3.0


The developer of pysystemtrade, Rob Carver, published an excellent article on why he decided to create yet another backtesting framework in Python, listing the pros and cons of developing a new framework. pysystemtrade includes a number of important features, such as optimization and calibration modules, and also supports fully automated futures trading.

License: GPL v3.0


Zipline is an algorithmic trading simulator. It can be used through the browser-based IPython Notebook interface. The system is an alternative to tools based on a command-line interface. The service is developed and maintained by the Quantopian team, and it can be used either as a standalone backtester development tool or together with the Quantopian development and testing environment. The Zipline platform provides access to ten years of historical US stock data at 1-minute resolution, and several data import options are also available.

Project page
GitHub: zipline GitHub
License: Apache 2.0


Another framework with live trading functionality, launched by Michael Halls-Moore, founder of QuantStart, a resource for finance professionals. He wanted to create a tool suitable for both large hedge funds and retail investors. At present QSTrader supports bar-resolution data (OHLCV) on various time frames, but tick data is not yet supported.

Both modes of operation (backtesting and live trading) are fully event-driven, which makes it faster to move from strategy development to testing and then to live deployment. One of the system's main advantages is its modularity, which leaves ample room for customizing the code.

Project page: QSTrader
GitHub: qstrader GitHub
License: MIT

6 Open-Source Frameworks for Building Trading Strategy Backtesters in Python

Event-Driven Backtesting in Python Step by Step. Part 1

Forecasting RTS Index Futures Quotes Using Machine Learning
Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras
Multi-Step LSTM Time Series Forecasting Models for Power Usage
Applied Machine Learning
Time Series Forecasting with Recurrent Neural Networks
Interval Forecasting of Time Series Using Recurrent Neural Networks with Long Short-Term Memory…2020
Machine Learning Platform Visualizes Active Neurons in Real Time

Stock Price Prediction of Apple Inc. Using Recurrent Neural Network

Keras-Based LSTM Time Series Analysis, Using Apple Stock Price Forecasting as an Example

Time Series Forecasting with LSTM in Python
Time Series Analysis with LSTM Using the Keras Library in Python

Time Series. Simple Solutions
A Study of Deep Neural Networks with LSTM Architecture for Forecasting Financial Time Series pdf
Time Series Forecasting: Stock Price Prediction Using an LSTM Model

Keras LSTM: multi-step multivariate time series forecasting, poor results

Time Series Forecasting pdf

Adaptive Methods of Time Series Forecasting

Jupyter Widgets

A Technical Guide on RNN/LSTM/GRU for Stock Price Prediction

Interpretability in Safety-Critical Financial Trading Systems
Gabriel Deza, Adelin Travers, Colin Rowat, and Nicolas Papernot

1 University of Toronto and Vector Institute, Toronto, Canada; 2 University of Birmingham, Birmingham, England
Abstract. Sophisticated machine learning (ML) models to inform trading in the financial sector create problems of interpretability and risk
management. Seemingly robust forecasting models may behave erroneously in out of distribution settings. In 2020, some of the world’s most
sophisticated quant hedge funds suffered losses as their ML models were
first underhedged, and then overcompensated.
We implement a gradient-based approach for precisely stress-testing how
a trading model’s forecasts can be manipulated, and their effects on
downstream tasks at the trading execution level. We construct inputs
– whether in changes to sentiment or market variables – that efficiently
affect changes in the return distribution. In an industry-standard trading
pipeline, we perturb model inputs for eight S&P 500 stocks. We find our
approach discovers seemingly in-sample input settings that result in large
negative shifts in return distributions.
We provide the financial community with mechanisms to interpret ML
forecasts in trading systems. For the security community, we provide a
compelling application where studying ML robustness necessitates that
one capture an end-to-end system’s performance rather than study a
ML model in isolation. Indeed, we show in our evaluation that errors in
the forecasting model’s predictions alone are not sufficient for trading
decisions made based on these forecasts to yield a negative return.
Keywords: ML Interpretability · Financial Trading · Risk Management
1 Introduction
Deep Neural Networks (DNNs) have been proposed in finance for over three
decades [3, 35] due to their ability to learn more complex and non-linear mappings
compared to classical time series models. However, the additional complexity
of DNNs makes their interpretability of particular importance when it comes to
widely adopting them in production settings. We study financial trading systems
built on such DNNs in order to develop a method for better interpretability.
Modern financial regulations (e.g. Basel III [1]) require stress testing models,
including on past catastrophic events. For instance, regulators may be interested
in learning the worst-case outcomes of a model given a scenario of adverse market
conditions [27]. Existing approaches for interpretability of DNNs are not directly
applicable to this setting for two reasons. First, time series forecasting models
used in finance output distributions rather than single-point predictions. Second,
these distributions predicted by the model are not the outcome of the pipeline;
instead, they are used as inputs to inform trading decisions.
All correspondence to gabe.deza@mail.utoronto.ca
arXiv:2109.15112v1 [cs.LG] 24 Sep 2021
To address the specificities of safety-critical financial trading systems, we
propose an interpretability and risk estimation method to synthesize adverse
market conditions. These synthetic market conditions reveal to a financial institution the factors which influence the trading decisions made by its systems. In
particular, our method shows how seemingly regular market conditions can be
manipulated to introduce market adversity undermining a DNN’s forecast. In an
application to an industry-standard trading pipeline, we for instance show how
our method provides insights on feature importance, choice of trading signals
and the pipeline’s robustness. To summarize, our contributions are as follows:
– We instantiate a gradient-based optimization method that informs model
owners on the sensitivity of their forecasting model to its input features.
– We develop an industry-standard daily stock trading pipeline. Our pipeline
integrates Twitter sentiment analysis together with several trading strategies
and achieves non-trivial performance and returns across 8 S&P 500 tickers
under realistic conditions. We open-source our pipeline (see Section 2.1).
– We demonstrate on our trading pipeline that our method synthesizes adverse
market conditions that illustrate the influence of different model inputs on
the trading pipeline’s returns. For instance, increasing adversity of the market conditions results in decreasing returns by almost 1/3 (21.3% to 7.7%)
whereas decreasing adversity yields a 5× larger return (2.2% to 10.3%).
– Our method provides controllability on the adversity (or lack of adversity) of
the market conditions synthesized via the perturbation amount $\epsilon$ introduced
to each feature used to model market conditions. We show that varying $\epsilon$ can
shift and change the distribution of returns, providing an intuitive means to
understand how each of these features influences trading decisions.
– Our method exposes how the mean of the output distribution from our forecasting model is easily manipulated (requiring a smaller total perturbation
$\epsilon$) compared to other parameters of the distribution (such as the confidence).
Iterating on trading strategies that rely less on such non-robust parameters
directly can help make pipelines more robust to such adversity.
2 Background
2.1 Trading Pipelines for Stock Forecasting
A company may issue stock also called shares, units of ownership of the company,
which are traded (bought or sold) on exchanges. A ticker is used to designate
the stock on the exchange and look up price change information. We work with
tickers from the S&P 500, the largest 500 companies traded in the U.S. To a first
approximation, a stock may only be traded during an exchange’s market hours.
(For example, the New York Stock Exchange opens for trading on weekdays at
9:30 AM and closes at 4:00 PM.) For each trading day, a ticker’s open and
close prices are the prices quoted at market open and market close, respectively.
We consider a financial pipeline that tackles the problem of forecasting, i.e.,
tries to predict the price changes ahead of time. Formally, after collecting daily
open and close prices and potential price-change factors, e.g., sentiment data,
over $k_{\text{past}}$ days for a given ticker, a forecast generates a prediction of the daily
open and close prices for the next $k_{\text{future}}$ days of trading. This forecast is then
used to make trading decisions.
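The windowing described above can be illustrated with a minimal sketch. The function name `make_windows` and the sample prices are hypothetical and are not taken from the paper's pipeline; the sketch only shows how a series might be cut into pairs of $k_{\text{past}}$ observed days and $k_{\text{future}}$ days to forecast.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], k_past: int, k_future: int
) -> List[Tuple[List[float], List[float]]]:
    """Split a daily price series into (history, target) pairs.

    Each history holds k_past consecutive days; each target holds the
    k_future days that immediately follow it.
    """
    windows = []
    last_start = len(series) - k_past - k_future
    for start in range(last_start + 1):
        history = list(series[start:start + k_past])
        target = list(series[start + k_past:start + k_past + k_future])
        windows.append((history, target))
    return windows


# Hypothetical closing prices, for illustration only.
closes = [10.0, 10.5, 10.2, 10.8, 11.0, 10.9]
pairs = make_windows(closes, k_past=3, k_future=2)
```

With six prices, three past days and two future days, this produces two overlapping training pairs.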
General Trading Pipeline Formulation. The entire trading process, and resulting
algorithmic systems, can be abstracted into 3 major steps described below.
– Data Collection: Common data includes historical and/or real time financial data provided by exchanges, sentiment data directly collected from news
sources or provided by data vendors such as Bloomberg or Reuters [8, 30].
– Forecasting: Collected data is then analyzed for useful trading signals. This
analysis is performed using a model M, whose complexity can range from a
linear model or basic pattern analysis to Deep Neural Networks (DNNs).
– Trading Decisions: Using the forecasts from the previous step, a trading
decision is made to buy or sell V shares at price P.
Obtaining a realistic test bench encompassing all of these steps is complex
and time consuming. To aid future work in this setting, we include the code to
reproduce our pipeline and the results within our work.4
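As a rough illustration of the three steps (data collection, forecasting, trading decisions), the sketch below wires them together. Everything in it is an assumption made for illustration: the names `collect`, `forecast`, `decide` and `Decision`, the drift-based placeholder that stands in for the model M, and the threshold rule. It is not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    action: str   # "buy", "sell" or "hold"
    volume: int   # number of shares V
    price: float  # price P at which the order is placed


def collect(prices: List[float], k_past: int) -> List[float]:
    """Data collection: keep only the most recent k_past observations."""
    return prices[-k_past:]


def forecast(history: List[float]) -> float:
    """Forecasting: placeholder model M (last price plus mean daily change)."""
    changes = [b - a for a, b in zip(history, history[1:])]
    return history[-1] + sum(changes) / len(changes)


def decide(last_price: float, predicted: float,
           volume: int = 10, threshold: float = 0.0) -> Decision:
    """Trading decision: buy if the forecast is above the last price, sell if below."""
    if predicted > last_price + threshold:
        return Decision("buy", volume, last_price)
    if predicted < last_price - threshold:
        return Decision("sell", volume, last_price)
    return Decision("hold", 0, last_price)


# Hypothetical prices, for illustration only.
history = collect([100.0, 101.0, 103.0, 102.0, 104.0], k_past=4)
predicted = forecast(history)
order = decide(history[-1], predicted)
```

Keeping the three steps as separate functions mirrors the abstraction in the text: any of them can be swapped (for example, replacing the placeholder with a DNN) without touching the others.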
2.2 Deep Neural Networks for Probabilistic Forecasting
Finance is a data-rich domain [22] making methods that benefit from this information, such as DNNs, attractive. This potential has brought significant interest
from the financial and ML academic communities for finance and trading tasks
like forecasting or portfolio management [18, 3, 20, 13]. In addition, large banks
and investment firms have recently invested heavily in such research [17, 22].
When choosing a forecast model, it is possible to output either a single data
value (point forecast) [23] or a probability distribution [14]. Probabilistic forecasting is preferred over point forecasting as confidence scores or risk estimates
can be derived from the forecasted distribution [16]. While the forecast model
specifically outputs the parameters generating the distribution rather than the
distribution itself, for simplicity we henceforth conflate both. Architectures for
time series forecasting include recurrent neural networks (RNNs) [31, 33], convolutional neural networks [29, 11, 6] with 1-dimensional convolutions over time,
as well as transformers [36, 26, 25] using attention-based mechanisms. In our
work, we leverage RNNs for probabilistic forecasting, described in Section 4.1,
because of their superior performance in temporal settings like forecasting.
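To make concrete why a forecast distribution is more useful than a point forecast, the sketch below derives an interval and a downside probability from distribution parameters. It assumes, purely for illustration, that the forecast for one day is Gaussian with a given mean and standard deviation; the function names and numbers are hypothetical.

```python
from statistics import NormalDist
from typing import Tuple


def forecast_interval(mean: float, std: float,
                      coverage: float = 0.95) -> Tuple[float, float]:
    """Central interval that contains `coverage` of the forecast mass."""
    dist = NormalDist(mu=mean, sigma=std)
    tail = (1.0 - coverage) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def prob_below(mean: float, std: float, level: float) -> float:
    """Probability that the realised price ends below `level`."""
    return NormalDist(mu=mean, sigma=std).cdf(level)


# Hypothetical forecast parameters for a single day.
low, high = forecast_interval(mean=100.0, std=2.0)
risk = prob_below(mean=100.0, std=2.0, level=98.0)
```

A wider interval (larger standard deviation) signals lower confidence, which a trading strategy can use to scale down or skip a trade.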
Nonetheless, these proof-of-concept research advances may be difficult to
readily deploy in production settings as current DNNs lack proper input-output
relationship transparency. We address this limitation with our work.
4 Our code: https://anonymous.4open.science/r/FinancialML Interpretability-68F4
2.3 Interpretability in ML
In select domains like computer vision and natural language processing, interpretability of ML models is well studied [37]. In vision, techniques visualize
how a prediction stems from different input pixels [32], the learned weights of a
model [28], or its receptive field [38]. To the best of our knowledge, leveraging
gradients for interpretability in domains like finance remains unexplored. Financial systems raise several challenges. First and foremost, DNNs only inform
trading strategies; this calls for an end-to-end outlook over the entire pipeline.
In addition, financial data is inherently more difficult for humans to interpret
when compared to computer vision and NLP.
3 Gradient-based Interpretability for Forecasting Models
In this section, we introduce the proposed interpretability method with a formal
end-to-end pipeline for financial trading. We leverage gradient information of
model M and characteristics of the entire trading pipeline to understand edge
case model behaviour. While introduced in the financial trading domain, our
method is not limited to it; it is applicable broadly to forecasting problems.
3.1 Preliminary
Formally, let $z_{1:T}$ be a single target time series of length $T$ and $X$ a set of
associated covariate series $X_{1:N,1:T} \triangleq \{x_{i,1:T}\}_{i=1}^{N}$. A (DNN-based) forecast is a
generated probability distribution over future (unobserved) values $z_{T+1:T+\tau}$ of
length $\tau$ conditioned on past time series values $z_{1:T}$ and covariates $X_{1:N,1:T}$
using a neural network $M$ parameterized by $\phi$, the model’s weights and biases.
$$\Pr(z_{T+1:T+\tau} \mid z_{1:T}, X_{1:N,1:T}) = M_{\phi}(z_{1:T}, X_{1:N,1:T}) \tag{1}$$
It is common to follow the Markov assumption in forecasting where the next $\tau$
days are likely not a function of the entire history length available $T$ but rather a
smaller portion that represents the recent past. We only use the past $k$ temporal
observations of $z$ and $X$ for the next $\tau$ days forecast. The estimated probability
distribution is defined by its $n$ distribution parameters. Since the output of the
model is $\tau$ distributions, each defined by $n$ parameters, we denote the entire
output as $\theta_{1:n,1:\tau} \triangleq \{\theta_{i,1:\tau}\}_{i=1}^{n}$; this can be conceptualized as an $n \times \tau$ matrix.
Hence, Equation 1 can be rewritten as Equation 2.
$$\theta_{1:n,T+1:T+\tau} = M_{\phi}(z_{T-k:T}, X_{1:N,T-k:T}) \tag{2}$$
Definitions of interpretability vary with the goals of model owners and the
settings models are deployed in. In financial trading pipelines robustness is paramount, and thus we set out the following requirements for our interpretability:
G1: Understanding of how a pipeline’s outputs are going to behave in the face of
deviations from their expected inputs. For instance, given an input x based
on average historical data and a model prediction M(x) = y, we want to
understand $M(x + \epsilon)$ for $\epsilon$ capturing deviations from this historical data.
Note that we need to take care to upper bound $\epsilon$ to ensure that we consider
edge cases that remain realistic (i.e., that could be realized).
G2: Developing intuition on how the pipeline will react to an unknown input
(i.e., for some regular input x develop intuition for M(x)).
G3: Developing intuition on what the pipeline input must have been to achieve a
specific output (i.e., for some output y develop intuition on {x|M(x) = y}).
3.2 Our algorithm for model interpretability
At a high level, our method uses model gradients to synthesize adverse settings by
manipulating the covariate features $X$ resulting in forged features $X^*$ such that
the original output parameters $\theta$ change in a specific direction and amount.
The resulting changed output parameters are denoted $\theta^*$. These forged features
$X^*$ and the resulting outputs $\theta^*$ both provide interpretability of the model by
quantifying its sensitivity to changes to the inputs of the forecasting problem.
Specifically, once a direction (up ($\wedge$) or down ($\vee$)) and a distribution parameter index $p$ are selected, the model inputs are perturbed from $X$ to $X^*$ such
that the model outputs $\theta_{p,T+1:T+\tau}$ are manipulated in that given direction,
resulting in $\theta^*_{p,T+1:T+\tau}$. We accomplish this by leveraging partial derivative
computation through the model $M$. For instance, the partial derivative of the
output distribution parameter $\theta_{i,t}$ with respect to the input $X_{j,l}$ captures the
sensitivity of the $i$-th distribution parameter at time $t$ to the $j$-th input feature at
time $l$. Our work is based on two (independently, well known) facts: (i) partial
derivatives provide a wealth of information on a model’s behaviour and (ii) a
model owner has full access to model parameters and can thus explicitly compute
any partial derivative with regard to any input they wish. Together this means
that a model owner can explicitly explore the entire range of model behaviours
for risk analysis by leveraging the information contained in the gradients.
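The notion of sensitivity used here can be sketched numerically. The example below estimates the partial derivative of one output parameter with respect to one input feature by central finite differences; this is a stand-in chosen for self-containment, whereas a model owner with access to the network would normally obtain the same quantity by automatic differentiation. The names `sensitivity` and `toy_model`, and the toy model itself, are hypothetical.

```python
from typing import Callable, List


def sensitivity(model: Callable[[List[float]], List[float]],
                x: List[float], i: int, j: int, h: float = 1e-6) -> float:
    """Central finite-difference estimate of d model(x)[i] / d x[j]."""
    up = list(x)
    down = list(x)
    up[j] += h
    down[j] -= h
    return (model(up)[i] - model(down)[i]) / (2.0 * h)


def toy_model(x: List[float]) -> List[float]:
    """Toy forecaster returning [mean, std] from [price_change, sentiment]."""
    price_change, sentiment = x
    mean = 2.0 * price_change + 0.5 * sentiment
    std = 1.0 + sentiment ** 2
    return [mean, std]


x0 = [1.0, 3.0]
d_mean_d_sentiment = sensitivity(toy_model, x0, i=0, j=1)  # analytically 0.5
d_std_d_sentiment = sensitivity(toy_model, x0, i=1, j=1)   # analytically 2 * 3.0 = 6.0
```

In this toy case the standard deviation is far more sensitive to sentiment than the mean is, which is the kind of per-feature, per-parameter reading the gradients are meant to provide.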
In Algorithm 1, the features are perturbed as long as they stay within their
historical range (recall our requirement G1). This is checked by a function which
we refer to as Checkbounds; it serves as a simple non-ML heuristic for input
anomaly detection, likely on top of the already implemented security checks inherent to any realistic pipeline. Such anomalies are avoided as our interpretability
algorithm focuses on understanding how models behave in their most common
and close-to-valid settings (G1, G2, G3) and not in settings that are obviously
out of distribution. If the check fails, the feature with the next-largest gradient is
selected instead, and so on until a feature that satisfies this criterion is chosen.5
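The procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: `toy_forecast` and its weights `W` are hypothetical, gradients are estimated by finite differences rather than backpropagation, and `check_bounds` is assumed to test whether the perturbed value stays inside a per-feature historical range.

```python
import math

W = [[0.6, -0.3, 0.1],   # hypothetical weights for parameter 0 (mean)
     [0.1, 0.5, -0.2]]   # hypothetical weights for parameter 1 (std)

def toy_forecast(X, horizon=2):
    """Hypothetical model M: X[i][s] -> theta[p][t], p in {0: mean, 1: std}."""
    theta = [[], []]
    for t in range(horizon):
        for p in range(2):
            z = sum(W[p][i] * X[i][s] * (s + 1)
                    for i in range(len(X)) for s in range(len(X[0]))) / (t + 1)
            theta[p].append(z if p == 0 else math.log1p(math.exp(z)))
    return theta

def grad(X, p, t, i, s, h=1e-5):
    """Central-difference estimate of d theta[p][t] / d X[i][s]."""
    up = [row[:] for row in X]
    up[i][s] += h
    down = [row[:] for row in X]
    down[i][s] -= h
    return (toy_forecast(up)[p][t] - toy_forecast(down)[p][t]) / (2 * h)

def check_bounds(value, lo, hi):
    """Assumed Checkbounds: accept only values inside the historical range."""
    return lo <= value <= hi

def perturb(X, bounds, p, d, eps, R, horizon=2):
    """d = +1 pushes theta[p] up, d = -1 pushes it down."""
    Xs = [row[:] for row in X]                      # start from benign features
    for _ in range(R):                              # number of perturbations
        for t in range(horizon):                    # prediction length
            for s in range(len(Xs[0])):             # historical length
                grads = [grad(Xs, p, t, i, s) for i in range(len(Xs))]
                # try features from largest to smallest gradient magnitude
                for i in sorted(range(len(Xs)), key=lambda k: -abs(grads[k])):
                    step = d * math.copysign(eps, grads[i])
                    if check_bounds(Xs[i][s] + step, *bounds[i]):
                        Xs[i][s] += step
                        break                       # one feature per (t, s)
    return Xs

X = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.3]]
bounds = [(-0.5, 0.5)] * 3
X_star = perturb(X, bounds, p=0, d=+1, eps=0.01, R=5)
print(toy_forecast(X)[0], toy_forecast(X_star)[0])
```

If no feature passes the bound check at a given step, the inner loop simply makes no change, which mirrors the early-termination behaviour in spirit.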
Next, we outline two key design choices made for Algorithm 1. Together,
they address the lack of interpretability of ML models according to G1, G2, G3.
Gradients. Gradients offer rich information about the direction in which features need to be perturbed to achieve a desired change in outputs (G1). Specifically, we use the sign and magnitude of the partial derivative ∂y/∂x to decide on the direction and magnitude of the perturbation. As Algorithm 1 only perturbs the features with the largest gradients, the resulting perturbed features characterize which features best explain a model’s reaction to unknown inputs (G2 and G3).
Perturbing benign features. Algorithm 1 returns the original features perturbed minimally to achieve the intended change in model output. By perturbing from historical market conditions, we obtain conditions that seem like regular inputs to model owners yet can exhibit irregular output behaviour. Observe that this is similar to how mutation-based fuzzing [9] modifies valid inputs to identify edge-case vulnerabilities rather than resorting to random input generation.
5 Albeit this case did not occur in our experiments, if all features cannot be perturbed
Algorithm 1: Interpretability of ML Forecasting Models
input : index of distribution parameter p ∈ [1, n], direction to perturb distribution
        parameter d ∈ {↑, ↓}, forecasting length τ, historical length k, historical
        covariate features $X_{1:N,T-k:T}$, historical target values $z_{T-k:T}$, perturbation
        amount ε and number of iterations R
output: $X^*_{1:N,T-k:T}$
   /* Initially start with benign features */
1  $X^*_{1:N,T-k:T} = X_{1:N,T-k:T}$
   /* Iterate over number of perturbations */
2  for j ← 1 to R by 1 do
       /* Iterate over the prediction length */
3      for t ← T + 1 to T + τ do
           /* Iterate over the historical length */
4          for s ← T − k to T do
               /* Find input with largest gradient magnitude */
5              $i^* = \arg\max_i \left| \partial\theta^*_{p,t} / \partial X^*_{i,s} \right|$
               /* Check bound characteristic of feature i* */
6              if Checkbounds(i*) then
7                  $X^*_{i^*,s} = X^*_{i^*,s} + d \cdot \mathrm{sgn}\left( \partial\theta^*_{p,t} / \partial X^*_{i^*,s} \right) \cdot \epsilon$
8              else
                   /* Jump to line 5 but do not consider i = i* in the argmax */
9              end
10         end
11     end
12 end
3.3 Exploring model behaviour with algorithm hyper-parameters
Algorithm 1 accepts several hyper-parameters that allow model owners to explore a specific section of the range of possible model behaviours.
Parameter and direction. A practitioner can simulate different trading scenarios by controlling the direction in which an output distribution parameter is modified. For instance, increasing the standard deviation implies lower confidence in forecasts and hence higher risk if a trade is performed based on such forecasts. Such a scenario allows model owners to (1) understand which features affect model confidence and (2) investigate the robustness of trading strategies built on top of low-confidence forecasts (see Section 3.4).
In a similar fashion, the mean of the forecasts represents the expected value of the returns. If we increase the mean of the learned distributions, the inputs are perturbed to reflect synthetic settings where the model is more likely to perform a trade.
Perturbation amount. Algorithm 1 is run for R iterations, where the features are perturbed by a small value ε at each iteration. Both hyperparameters allow model owners to tune the adversity of the synthesized setting, as they control how much the perturbed input deviates from historical data for the financial pipeline’s inputs. Small perturbations can help diagnose model sensitivity to specific edge cases that give a worst case in settings that are closer to historically regular data; this is particularly true if “small” perturbations result in a large impact on the model output. A large ε can model the worst-case scenario achievable by exposing the bounds of the distribution parameters that the model can achieve. Such perturbations simulate the worst case when the input features are outside their historical range. This is not an entirely hypothetical scenario: for instance, events 25 standard deviations away from the mean prediction were observed by major financial industry players during the 2007 financial crisis [2].
3.4 Pipeline interpretability
The forecasting model represents only a fraction of an end-to-end trading pipeline (recall Section 2.1). In practical deployments, trading strategies use the outputted forecast distribution parameters for profit generation. A number of related works have considered the effect of pre-processing on gradient-based perturbations of ML models [24, 12] but, to the best of our knowledge, we are the first to consider the effects of post-processing on a model’s robustness and interpretability.
In our case, this means that a change in a distribution parameter might not result in any change in the trading strategy, depending on its complexity. We thus now consider these trading strategies to ensure our interpretability algorithm is able to capture the end-to-end pipeline’s behaviour. Here, we concentrate on algorithmic trading strategies for which the trading decision is automated: in its simplest form, rule-based decisions where a trade occurs if a condition is met, e.g., the forecast price passing a threshold, which can be modeled as a step function. We make this decision for two reasons: (i) human traders are known to correct the asset price of their models [19] and (ii) for risk estimation we require a large number of simulations, which is only feasible algorithmically.
For instance, a simple trading strategy might be that the forecasted mean must be positive to initiate a trade (buying and later selling). If the mean is originally 4% and decreases to 2% due to Algorithm 1, the trade still occurs even though the mean has indeed changed. To specifically aim for a change in the profits of a trading strategy, Algorithm 1 can be modified to take the gradient through the return of the trading strategy as well. In such a case, a non-smooth trading function (e.g., a thresholding strategy in Section 4.1) results in an uninformative gradient. In our case, the thresholding trading strategies are a function of the mean and the standard deviation of the distribution. We manipulate both of these distribution parameters such that the threshold is met (or no longer met), which results in certain trades occurring or no longer occurring, ultimately changing the return of the strategy.
5 (continued) such that they stay within the historical bounds, the algorithm terminates early.
Fig. 1: Diagram of trading pipeline implemented (components: Tweets, Stock Prices, FinBERT, Dataset, DeepAR, Threshold Strategy).
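The 4% to 2% example from Section 3.4 can be checked with a short sketch. The function names below are hypothetical and the rule is the simplest possible one (trade when the forecast mean reaches a threshold); the point is only to show that the decision is a step function whose gradient is zero away from the threshold.

```python
def trade_signal(mu, threshold=0.0):
    """Step-function rule: 1.0 if the forecast mean reaches the threshold."""
    return 1.0 if mu >= threshold else 0.0

def strategy_return(mu, realized_return, threshold=0.0):
    """Return earned on one day: the realized return if we trade, else 0."""
    return trade_signal(mu, threshold) * realized_return

def finite_difference(f, x, h=1e-4):
    """Central-difference slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

mu_before, mu_after = 0.04, 0.02
# The forecast mean changed, but the trading decision did not.
same_decision = trade_signal(mu_before) == trade_signal(mu_after)
# Away from the threshold the decision is flat, so its gradient carries no signal.
slope = finite_difference(trade_signal, mu_after)
print(same_decision, slope)
```

Because the slope is zero almost everywhere, differentiating through such a rule gives no direction to perturb in, which is why the distribution parameters themselves are manipulated until the threshold is crossed.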
Ticker  Company                 Sector                              Market Cap (B $)  Volume of Tweets (K)
HAS     Hasbro                  Leisure Products                    13.15             1200
ADSK    AutoDesk                Information Technology              60.40             764
XLNX    Xilinx                  Semiconductors                      30.93             84.7
CAH     Cardinal Health         Health Care Distributors            16.35             45.6
BWA     BorgWarner              Auto Parts & Equipment              13.02             37.9
CHTR    Charter Communications  Cable & Satellite                   145.45            28.1
CE      Celanese                Specialty Chemicals                 18.82             24.7
FANG    DiamondBack Energy      Oil & Gas Exploration & Production  15.55             16.5
Table 1: Information on the 8 randomly selected S&P 500 tickers considered.
4 A Sentiment-based Stock Forecasting and Trading Pipeline
To benchmark our approach for interpretability, we first implement a working industry-standard financial pipeline in Section 4.1, along with methods and metrics to evaluate its effectiveness in Section 4.2. We note that this is a significant contribution in itself, given that such pipelines are often proprietary and few details can be found in the public domain.
4.1 Trading System
Data Collection. To avoid selection bias and test generalizability, we randomly select 8 S&P 500 tickers, shown in Table 1 along with relevant ticker information. In the remainder of this manuscript, we use ticker and company names interchangeably to refer to the company’s stock and its price over time.
– Tweets: For each ticker in Table 1, we collect tweets from January 1st 2016 to January 1st 2021 (a 5-year period). Tweets are collected by searching for either the company name or the ticker name of each company.
– Prices: The open and close prices for each ticker were collected for the same period as the tweets via the Yahoo Finance API. Prices were collected at a daily frequency for the past 5 years. At each interval, there is a quote for the open price and the close price. In addition, price data is collected at an hourly frequency for the last 1.5 years.
As prices are non-stationary (changing mean and variance over time), we forecast the log difference of each day’s close price and open price, $\log(r_{close} / r_{open})$, to get a stationary time series.
– Sentiment Analysis: We extract sentiment scores for each tweet, which are used as inputs to our forecasting model. We use FinBERT, a BERT model trained on a financial sentiment dataset [5]. Given a tweet, FinBERT outputs 3 scores: (1) Positive, (2) Neutral and (3) Negative sentiment scores. Applying the softmax function, we get the probability of a tweet having each sentiment. To forecast the log difference in open and close price for day T, we aggregate 13 features (see Table 4 in the Appendix) derived from tweets up to 24 hours before 9:30 AM on day T. Monday’s forecast uses the past 72 hours to make use of tweets that occurred over the weekend while the markets were not open to trade.
Forecasting. As mentioned in Section 3, probabilistic forecasting is interested in forecasting the τ distributions, denoted $\theta_{1:n,T+1:T+\tau}$:
$$\theta_{1:n,T+1:T+\tau} = M(z_{T-k:T}, X_{1:N,T-k:T})$$
In our pipeline, we train a model M with input covariates $X_{1:N,T-k:T}$ as the 13 features derived from Twitter and the target time series $z_{T-k:T}$ as the log difference in price. We use the DeepAR [31] architecture (see Figure 7 in the Appendix), an autoregressive RNN tailored for time series forecasting in the univariate setting. RNNs are particularly well suited for this task because they keep an internal state that allows them to output forecasts of variable length. Here, autoregressive refers to the forecasting model using past observations of its input to forecast future behaviour. We use the Python implementation of this model available in the GluonTS [4] library for our experiments.
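The two data-preparation steps described in the bullets above, the log-difference target and the softmax over sentiment scores, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the score order (positive, neutral, negative) follows the text, and only a handful of the aggregated features are computed rather than the full set of Table 4.

```python
import math

def log_difference(open_price, close_price):
    """Target series value for one day: log(close / open)."""
    return math.log(close_price / open_price)

def softmax(scores):
    """Turn raw sentiment scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract max for stability
    exps = [math.exp(v - shift) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_sentiment(tweet_scores):
    """tweet_scores: one (positive, neutral, negative) score triple per tweet."""
    if not tweet_scores:
        raise ValueError("need at least one tweet")
    probs = [softmax(s) for s in tweet_scores]
    labels = [p.index(max(p)) for p in probs]    # most likely class per tweet
    n = len(probs)
    features = {"volume": n}
    for k, name in enumerate(("positive", "neutral", "negative")):
        features[f"avg_{name}_score"] = sum(p[k] for p in probs) / n
        features[f"pct_{name}"] = labels.count(k) / n
    return features

target = log_difference(open_price=100.0, close_price=101.5)
features = aggregate_sentiment([(2.0, 0.1, -1.0), (-0.5, 0.2, 1.5)])
print(round(target, 5), features["pct_positive"], features["volume"])
```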
DeepAR allows for flexible parametric and non-parametric output distributions by using a projection layer to map the RNN output to parameters defining the chosen distribution function.
Trading Decisions. The forecasting model outputs τ distributions. All distribution parameters are in the log-difference domain, and hence we iteratively undo the log difference to get a forecasted distribution in the return space. As we consider daily trading, we start the trading day entirely in cash and end the day entirely in cash. Assuming we must buy a stock before selling it (i.e., only taking long positions), we only trade when the close price is greater than the open price. At a high level, a trading strategy involves two decisions: (1) When do we take an action (buy or sell)? (2) How much to invest per trade?
When do we trade? We consider a simple thresholding strategy, defined as follows: if the predicted difference ŷ is at least a chosen non-negative threshold, we buy shares at 9:30 AM at the open price and sell them back at 4:00 PM at the close price for a return of y%. We work with two threshold values, 0 and µ_y + σ_y, where µ_y and σ_y refer to the mean and standard deviation of the ground truth returns over the past k days, respectively. The threshold-0 strategy trades on any positive signal, while the µ_y + σ_y strategy is more prudent, requiring the returns to be one standard deviation above those of the past k days.
How much to invest? The Kelly criterion [21] is often used in financial mathematics when determining how much to invest. For each trading opportunity, we compute the Kelly fraction f, which represents the fraction that should be invested depending on the expected return. The Kelly fraction is calculated as shown in Equation 3, where the win percentage W is the percentage of trades that resulted in a profit and R is the ratio of positive returns to negative returns (sometimes referred to as the ratio between historical gains and losses).
The Kelly fraction allows trading proportional to the strength of the signal from the forecast.
$$f = W - \frac{1 - W}{R} \tag{3}$$
4.2 Evaluation Setup
We divide our evaluation metrics into error, accuracy and financial metrics. Error metrics include Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE) and Continuous Ranked Probability Score (CRPS), where lower is preferred for all three. RMSE and MAPE measure the error of the forecasts in a point-forecast setting, while CRPS measures the error of the forecasted distribution. Accuracy metrics include the binary accuracy of the sign of the mean of the forecast. Although 50% seems like a natural baseline for the binary setting (i.e., the price either goes up or down), the S&P 500 has a historical upwards trend. Instead, we consider the historical binary accuracy as a non-trivial baseline. Financial metrics include the returns of the trading strategies and the return of a passive trading strategy as a baseline. An in-depth explanation of all three classes of metrics is given in Section 6.2 of the Appendix.
4.3 Validation of pipeline
Results for both the training and testing splits are shown in Table 2 for the 8 tickers in the daily setting (the hourly setting is shown in Table 6 in the Appendix). Accounting for the 8 tickers and both frequencies, we have 16 possibilities, referred to as settings. Across all 16 settings, the performance on training data is strong for both ML and financial metrics. On the testing split, performance is weaker, but 6 out of the 8 tickers still have at least 1 strategy with returns greater than that of the passive strategy. Hence, we believe that our pipeline is a good approximation (albeit a significantly simpler one) of a pipeline that could potentially be implemented in industry for assisting intra-day trading. Section 6.3 in the Appendix discusses both tables in more depth.
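The two trading decisions of Section 4.1 (when to trade and how much to invest) can be sketched together. This is a hedged illustration, not the pipeline's code: the function names are hypothetical, R is assumed to be the ratio of the average gain to the average loss, the standard deviation is assumed to be the population standard deviation, and clipping the fraction to [0, 1] is an added assumption for a long-only, unleveraged position.

```python
import statistics

def kelly_fraction(past_trade_returns):
    """Equation 3: f = W - (1 - W) / R."""
    wins = [r for r in past_trade_returns if r > 0]
    losses = [-r for r in past_trade_returns if r < 0]
    if not wins or not losses:
        raise ValueError("need at least one winning and one losing trade")
    W = len(wins) / len(past_trade_returns)              # win percentage
    R = statistics.mean(wins) / statistics.mean(losses)  # gain / loss ratio
    return W - (1 - W) / R

def prudent_threshold(past_returns):
    """Mean plus one standard deviation of the past k ground-truth returns."""
    return statistics.mean(past_returns) + statistics.pstdev(past_returns)

def position_size(forecast, threshold, past_trade_returns):
    """Fraction of cash to invest today; 0.0 means no trade."""
    if forecast < threshold:
        return 0.0
    return min(1.0, max(0.0, kelly_fraction(past_trade_returns)))

history = [0.02, -0.01, 0.03, -0.02]
thr = prudent_threshold(history)
print(round(kelly_fraction(history), 3), round(thr, 4))
print(position_size(0.01, thr, history), position_size(0.03, 0.0, history))
```

With this toy history, half of the trades win and the average gain is larger than the average loss, so the Kelly fraction is positive but well below 1.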
5 Evaluation Results
We now apply our interpretability method to our pipeline and evaluate the resulting interpretability benefits. In Section 5.1, we investigate the distribution parameters (µ, σ and ν) forecasted by our model in the presence of manipulations introduced by our algorithm. In Section 5.2, we analyze these changes with the entire pipeline in mind; that is, we measure the performance of trading strategies in the synthetic market setting generated by our algorithm. We see how the hyperparameters of Algorithm 1 provide controllability over the adversity of the synthetic setting, allowing us to draw insights into the forecasting pipeline’s modeling. In particular, we analyze the importance of different features in explaining the pipeline’s performance in Section 5.3.
5.1 Interpretability of the Forecasting Model
In Table 3, we show the performance degradation resulting from manipulating the student-T distribution parameters in both directions for $ADSK and $BWA for daily trading. We include the testing performance for comparison. Performance for the remaining 6 tickers and the hourly setting is shown in Tables 7 and 8 in the Appendix. Settings where performance on X* is worse than that of X are colored in green. In Table 3, all error metrics (RMSE, MAPE, CRPS) have their performance drop in all 6 possible synthetic settings.
Setting          Accuracy      Returns (by threshold)
Ticker  Set      T      P      Passive  0         Kelly 0   µ_y+σ_y  Kelly µ_y+σ_y
ADSK    train    55.7   70.4   260.7    6148.1    763.2     152.9    102.6
        test     52.6   46.3   18.0     -11.3     -1.3      2.2      -1.4
BWA     train    52.1   73.7   -16.4    3850.2    585.0     326.1    190.7
        test     50.5   60.0   3.4      4.3       -0.4      7.4      -0.8
CAH     train    52.6   56.3   -36.3    179.2     -2.5      10.0     1.3
        test     55.8   42.1   -1.7     -11.7     0.6       8.3      4.4
CE      train    51.4   73.7   47.2     1809.8    332.4     165.1    82.2
        test     52.6   66.3   33.7     42.8      21.3      5.9      5.0
CHTR    train    53.0   80.6   205.3    12382.9   3168.6    608.7    355.5
        test     54.7   55.8   10.7     12.2      -0.5      5.1      -0.1
HAS     train    50.0   63.3   7.2      661.6     -1.1      80.8     23.6
        test     50.0   58.3   20.9     19.2      -0.3      11.8     6.4
FANG    train    50.7   69.6   -29.5    5728.9    383.8     418.7    247.4
        test     54.7   58.9   26.7     43.2      5.3       0.8      -6.1
XLNX    train    52.3   54.6   104.1    104.9     -4.7      15.7     2.3
        test     52.6   54.7   34.8     11.5      -0.7      1.7      2.3
Table 2: Performance of pipeline for the 8 tickers in the daily setting. Error metrics are omitted as they are not all comparable across the training and testing splits. Strategies with returns above that of the passive strategy are bolded.
These drops in performance can be understood via Figure 2, where we plot the resulting forecasts for increasing the standard deviation. The adversarial confidence interval (red) overshadows that of the benign setting (green), which degrades the CRPS, for instance. The same is shown for increasing the mean in Figure 3, where the adversarial mean (red) lies above the regular mean (green), which degrades the RMSE and MAPE. When considering binary accuracy and financial returns, we see a mix of small and large improvements or degradations in the synthetic setting. For instance, the return for the $BWA Kelly threshold-0 strategy increases by 2% in the synthetic setting when increasing µ in Table 3, but decreases by almost 10% for the non-Kelly strategy. A similar trend is observed for all 8 companies and both daily and hourly frequencies (in the Appendix).
Performance of post-processing. The change in performance of the several metrics provides intuition on the importance of considering an end-to-end pipeline and not an ML model in isolation.
When calculating RMSE and MAPE, the distribution is collapsed to a point estimate by taking the mean of the distribution. Accordingly, as such metrics are directly related to the mean, any perturbation of the mean results in a large change in such metrics. For instance, the RMSE for ADSK changes by 10.2% on average when manipulating the mean. When we consider distribution parameters that are not related to such metrics, we do not see a similar change in performance (only 0.21% when manipulating the standard deviation). On the other hand, the return of a trading strategy is a function of several variables and is potentially non-smooth, as with the threshold strategies. For instance, if the mean µ originally meets the threshold and we decrease µ to get µ*, the trade still occurs if µ* still meets the threshold. Even if µ* falls below the threshold and the trade does not occur, we improve the overall return if the return of the trade y was negative. Lastly, as the correct value of a distribution parameter is not known, we can move a distribution parameter in the correct direction. These examples explain why more complex post-processing steps can result in both positive and negative changes that are equally useful to analyze.
Setting                                     Error                          Accuracy     Returns (by threshold)
Ticker  Set        Parameter  Direction  RMSE     MAPE     CRPS     T     P     Passive  0      Kelly 0  µ_y+σ_y  Kelly µ_y+σ_y
ADSK    Testing    –          –          0.01985  1.67056  0.91933  52.6  46.3  18.0     -11.3  -1.3     2.2      -1.4
        Synthetic  µ          ↑          0.02382  4.1041   1.09991        49.5  18.0     -8.5   -0.4     -6.6     -0.4
                   µ          ↓          0.02079  1.97913  0.95181        48.4  18.0     -8.1   -0.1     -4.9     -0.4
                   σ          ↑          0.02009  2.13263  0.95911        48.4  18.0     -6.9   -1.9     -2.3     -0.4
                   σ          ↓          0.02047  2.09409  0.96214        50.5  18.0     -11.8  -0.4     10.3     -1.4
                   ν          ↑          0.02103  2.2685   1.01209        42.1  18.0     -16.8  -2.3     -3.5     -0.4
                   ν          ↓          0.0207   2.43082  1.01386        52.6  18.0     -12.4  -0.4     3.1      -0.4
BWA     Testing    –          –          0.0191   1.34892  0.86445  50.5  60.0  3.4      4.3    -0.4     7.4      -0.8
        Synthetic  µ          ↑          0.02102  1.95462  0.99858        53.7  3.4      -4.6   -0.4     4.2      -0.1
                   µ          ↓          0.01965  1.61631  0.90919        53.7  3.4      -4.1   1.3      6.3      -0.1
                   σ          ↑          0.01921  1.69147  0.88259        62.1  3.4      8.7    -0.4     9.8      5.2
                   σ          ↓          0.01947  1.52496  0.88966        54.7  3.4      -9.6   -0.4     6.4      -0.1
                   ν          ↑          0.01901  1.62888  0.86666        54.7  3.4      -3.5   -0.1     3.0      -1.0
                   ν          ↓          0.02233  2.65821  1.05353        53.7  3.4      -6.0   -0.4     5.4      -0.1
Table 3: Performance of pipeline on synthetic features when varying all three distribution parameters in both directions for $ADSK and $BWA in the daily frequency. Performance on testing distribution shown for comparison. Metrics that degraded in the synthetic setting are in green.
5.2 Looking into the Trading Strategies
We now investigate the regular and synthetic settings at the trading level by looking at the return distributions of performed trades. The distribution of returns is a standard method used to diagnose the returns of a portfolio and to see where the majority of profit or loss occurs.
Parameter Distributions and Direction. We first investigate how the choice of distribution parameters and directions of change may simulate different market settings to elicit varying effects on trading strategies. Figure 4 shows the returns for the regular (blue) and synthetic (red) forecasts when increasing the mean for both thresholding strategies. For a threshold of 0 (subfigure 4a), we find that the number of days traded (i.e., days when µ ≥ 0) almost doubles (40.0% to 74.7%). However, new trades usually have a negative return, which lowers the average return. For a threshold of µ_y + σ_y (subfigure 4b), we see a similar but stronger effect, as the number of trades almost quadruples and the total return falls to roughly a third.
Magnitude of Perturbation ε. In Section 5.1, we fixed the perturbation size ε. We now study our returns as this perturbation size grows. In Figures 5a and 5b, we report the returns distribution of both threshold strategies when increasing the targeted distribution parameter for varying epsilons (ε = {0.01, 0.03, 0.1}). In Figure 5c, we fit a Gaussian kernel over each distribution to easily compare the distributions for varying ε values. We see that the means of the distributions shift to the left as ε increases for both strategies in Figure 5. In addition, we are able to see larger negative tails as ε increases, especially for the µ_y + σ_y strategy in Figure 5c. As ε increases, the percentage of negative returns increases. This is clear for ε = 0.1 (pink) for the µ_y + σ_y strategy in Figure 5b, where there are multiple trades with returns between -2% and -4% that do not occur in the regular setting (blue). Thus, the strength of the perturbation ε offers an intuitive and calibrated knob to simulate the level of adversity of the synthetic setting.
Fig. 2: Regular (green) and synthetic (red) forecast for increasing the standard deviation σ of $BWA. Ground truth is shown in blue. The confidence intervals of the synthetic setting overshadow those of the regular setting.
Fig. 6: Regular (blue) and synthetic (orange) features for $ADSK when increasing the mean µ. Synthetic features are minimally different to regular features.
5.3 Perturbations at the feature level
We now investigate the perturbations at the feature level to understand their relative importance. An example of the regular and synthesized features is shown in Figure 6 (X in blue, X* in orange). We are able to determine which features are most important for a certain distribution parameter. Although the specific perturbation mask varied across settings, we find across several companies that increasing the standard deviation resulted in a significant perturbation of the positive sentiment features, and decreasing the standard deviation was associated with negative sentiment features. Depending on the perturbation size ε, the synthetic setting can reflect varying adversity, including anomalous behaviour.
Fig. 3: Regular (green) and synthetic (red) forecast for increasing the mean µ of $HAS. Ground truth is shown in blue. The adversarial forecasts lie significantly above the regular forecasts.
Fig. 4: Distribution of returns for both thresholding strategies on $CHTR when increasing the mean.
(a) Threshold-0 strategy. Regular setting has a mean return of 0.32% and 40% days traded. Synthetic setting has mean 0.17% and 74.7% days traded.
(b) Threshold µ_y + σ_y strategy. Regular setting has a mean return of 0.73% and 7.4% days traded. Synthetic setting has mean 0.27% and 26.3% days traded.
5.4 Findings from our interpretability method
We showed how specific distribution parameters affect downstream calculations such as simple error metrics and trading strategies differently. Section 5.2 demonstrates that the total perturbation amount in Algorithm 1 provides more control over how to evaluate the performance of an ML forecasting model and the entire pipeline end-to-end by controlling the adversity of the synthesized data. In our experiments, we found that small perturbations that manipulate the mean resulted in many non-profitable trades occurring despite these trades not occurring in the regular setting. Often, this resulted in large drops in performance and financial returns. We encourage model owners to develop more robust models and pipelines that degrade more gracefully as the adversity of the synthetic setting increases.
Acknowledgements
This work was supported by CIFAR (through a Canada CIFAR AI Chair), by NSERC (under the Discovery Program, and COHESA strategic research network), and by a gift from Intel. We also thank the Vector Institute’s sponsors.
Fig. 5: Distribution of returns for $BWA when increasing the targeted distribution parameter for varying levels of epsilon. As ε increases (left to right), new (non-profitable) trades occur which decrease the mean return (vertical line). ε = 0 represents the regular forecast.
(a) Threshold-0 strategy for ε = {0, 0.01, 0.03, 0.1} (left to right). The mean returns are 0.09%, 0.05%, 0.04% and -0.23% and the percentages of days traded are 57.89%, 57.89%, 56.84% and 68.42%, from left to right.
(b) Threshold µ_y + σ_y strategy for ε = {0, 0.01, 0.03, 0.1} (left to right). The mean returns are 0.8%, 0.8%, 0.49% and 0.15% and the percentages of days traded are 9.47%, 9.47%, 10.53% and 18.95%, from left to right.
(c) Fitting a Gaussian kernel on the above distributions for threshold 0 (left) and threshold µ_y + σ_y (right).
6 Appendix
6.1 Forecasting Model Architecture
The model architecture of the DeepAR model is shown in Figure 7. When using DeepAR, we consider the log difference in open and close price as our single target time series and the Twitter features as covariates. In the forecasting setting, covariates are assumed to be known over the entire time period under consideration, including the prediction horizon of length τ. This is not the case with Twitter data, as we do not know future tweets. To alleviate this issue, we use lagged versions of the covariates such that no temporal violations occur.
Fig. 7: Overview of the DeepAR model (figure taken from [31]). Inputs $z_{i,t-1}$ and $x_{i,t}$ as well as the previous RNN hidden state $h_{i,t-1}$ are fed to the RNN’s current state to compute $h_{i,t}$ for each time step t. The RNN’s output is then mapped to the parameters $\theta_{i,t}$ governing the likelihood function $\ell(z_{i,t} | \theta_{i,t})$ associated with a specific distributional assumption over $z_{i,t}$. Training is depicted on the left, for which we require $z_{i,t}$ to be known; autoregressive prediction is shown on the right, where a sample $\tilde{z}_{i,t} \sim \ell(\cdot | \theta_{i,t})$ is drawn from the predictive distribution at t and fed back into the prediction for t + 1.
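The lagging of covariates mentioned in Section 6.1 can be sketched as below. The helper `lag_covariates` and the zero padding are hypothetical choices for illustration; the idea is only that the value used at time t is the one observed `lag` steps earlier, so a lag at least as long as the prediction horizon means no future tweets are needed.

```python
def lag_covariates(covariates, lag=1, pad=0.0):
    """Shift every feature series forward by `lag` steps.

    After lagging, position t holds the value observed at t - lag; the first
    `lag` positions have no history and are filled with `pad` (an assumption).
    """
    lagged = []
    for series in covariates:
        if not 0 <= lag <= len(series):
            raise ValueError("lag must be between 0 and the series length")
        lagged.append([pad] * lag + list(series[:len(series) - lag]))
    return lagged

sentiment = [[0.1, 0.4, 0.3, 0.8]]      # one hypothetical feature over 4 days
lagged = lag_covariates(sentiment, lag=1)
print(lagged)
```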
Features
Average positive score
Average negative score
Average neutral score
Percent & number of positive tweets
Percent & number of negative tweets
Percent & number of neutral tweets
Average of each logit
Volume of tweets
Average likes
Average replies
Table 4: Sentiment and metadata based features used as inputs to the forecasting model.
6.2 Evaluation Details
We provide details on our evaluation section below. We divide our historical data into training, validation and testing splits, shown in Table 5.
Frequency  Train    Validation  Test
Daily      0–1150   1150–1190   1190–1306
Hourly     0–2550   2550–2700   2700–2890
Table 5: Division of data into the training, validation and testing sets in the daily and hourly settings.
For our error metrics (RMSE, MAPE, CRPS), we leverage the ground truth values (y) and our predictions (ŷ), or the distributions themselves, to determine the quality of the forecasts. The Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE) are defined in Equations 4 and 5, respectively. They measure the residual error and percentage error between ŷ and y, respectively. Smaller RMSE and MAPE are ideal. Both RMSE and MAPE are effectively point-estimate metrics, as they only leverage the mean of the distribution. The Continuous Ranked Probability Score (CRPS) measures the difference between the cumulative distribution function (CDF) of the forecast and the ground truth observation, as shown in Equation 6. A lower CRPS is preferred.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{T}} \tag{4}$$
$$\mathrm{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \tag{5}$$
$$\mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty} \left( F(y) - \mathbb{1}(y \geq x) \right)^2 dy \tag{6}$$
For our accuracy metrics, we consider binary accuracy. Binary accuracy refers to whether our point forecast ŷ has the same sign as the ground truth observation y. In the binary setting, as the S&P 500 tickers have had a historical upwards trend, comparing binary accuracy to 50% is incorrect.
Instead, we compare against the historical accuracy, which is the maximum of the binary accuracies of a forecast that (1) always predicts up and (2) always predicts down. This is denoted by the T column in Table 2 (“T” for “truth”). P in Table 2 refers to the predicted binary accuracy from the forecasts. For all trading strategies, we can calculate the net return of trading with such a strategy. Similarly to binary accuracy, a return above 0% is not a sufficient baseline for determining financial strength, given the historical upwards trend of S&P 500 tickers. Instead, we compare against the passive strategy baseline: buying shares at the open price on the very first day of the split and selling them at the close price of the very last day. A non-trivial strategy has a return above 0% and also beats the passive strategy.
6.3 Strength of trading pipeline
We now provide a detailed discussion of the performance of the trading pipeline in the daily (Table 2) and hourly (Table 6) settings.
(i) Strong performance on training split: All 16 settings perform strongly across all metrics on the training split in Table 2. All 16 settings have predicted binary accuracies above 50%, and 13 out of 16 beat the baseline binary accuracy, often by 10-15%. For instance, daily trading of $CHTR achieves an 80.6% predicted binary accuracy, 27.6% above the baseline of 53.0%. All 16 settings have at least 1 of the 4 strategies beat the passive returns. For example, $CHTR and $ADSK in the daily setting have passive returns of 205.3% and 260.7% in the training set, while the threshold-0 strategy beats them significantly at 12382.9% and 6148.1%, respectively.
(ii) Weaker (but positive) performance on testing split: The performance drops significantly when looking at the testing set. In 7 out of 16 settings, the predicted binary accuracy beats the baseline predicted accuracy, often only by a couple of percent.
Even though the model is not trained for binary classification but rather the harder (probabilistic) regression task, performance on accuracy from point forecasts is biased. For the financial returns, all 16 settings have at least 1 of the 4 strategies with returns above 0%. Additionally, 6 out of the 8 tickers have at least 1 strategy over both frequencies that outperforms the passive strategy. In total, 7 out of 16 settings have at least 1 strategy that outperforms the passive returns. At one extreme, $ADSK and $XLNX have 0 out of 8 strategies that outperform the passive returns (albeit the passive returns are relatively high at 18% and 27%, while the best strategy was still positive at 8.3% and 8.4%, respectively). On the other hand, $BWA has 5 strategies out of 8 that beat the passive returns (albeit the passive returns during that period were low at 3.4% and 1.7%, respectively).
Are the forecasts meaningful? The weaker performance on the testing split compared to the training split is indicative of the difficulty of probabilistic regression tasks, especially for financial time series, which are believed to be random walk models [15]. Even though several best practices were taken to avoid overfitting,6 the results in Table 2 attest to the difficulty of generalization in financial forecasting when backtesting on out-of-sample splits. If we compare to most other works on financial forecasting, the criteria of success are often predicted binary accuracies above 50% and returns above 0% (if trading strategies are implemented at all) [10, 7]. Under such conditions, our pipeline can be considered non-trivial since, out of the 16 settings and 4 strategies (i.e., 64 possibilities), 45 have returns above 0%. If we consider more stringent and more realistic conditions (beating the baseline accuracy and the passive trading strategy), 6 out of the 8 tickers outperform the baseline accuracy and have at least 1 strategy that outperforms the passive strategy.
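The metrics and baselines of Sections 6.2 and 6.3 can be sketched in plain Python. The function names are hypothetical. Two assumptions are made: the CRPS is estimated from forecast samples with the standard identity CRPS = E|X - x| - 0.5 E|X - X'| rather than by integrating Equation 6 directly, and a return of exactly zero is counted as "down" when comparing signs.

```python
import math

def rmse(y, y_hat):
    """Equation 4: root mean squared error of the point forecasts."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    """Equation 5: mean absolute percentage error (y must be non-zero)."""
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

def crps_from_samples(samples, x):
    """Sample-based CRPS estimate for one observation x."""
    n = len(samples)
    to_obs = sum(abs(s - x) for s in samples) / n
    within = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return to_obs - 0.5 * within

def binary_accuracy(y, y_hat):
    """Fraction of forecasts whose sign matches the ground truth."""
    return sum((a > 0) == (b > 0) for a, b in zip(y, y_hat)) / len(y)

def baseline_accuracy(y):
    """Best of 'always predict up' and 'always predict down' (column T)."""
    up = sum(a > 0 for a in y) / len(y)
    return max(up, 1 - up)

def passive_return(first_open, last_close):
    """Buy at the first open of the split and sell at the last close."""
    return last_close / first_open - 1

y = [0.01, -0.02, 0.03, 0.01]
y_hat = [0.02, 0.01, 0.02, -0.01]
print(rmse(y, y_hat), mape(y, y_hat))
print(binary_accuracy(y, y_hat), baseline_accuracy(y))
print(crps_from_samples([0.0, 2.0], 1.0), passive_return(100.0, 110.0))
```

In this toy example the forecast gets the sign right half of the time while always predicting "up" would be right three times out of four, which illustrates why 50% is not the right reference point.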
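⁶ Dropout [34], weight decay, early stopping and validation splits.

| Ticker | Set | Accuracy: T | Accuracy: P | Returns: Passive | Returns: ? = 0 | Returns: Kelly ? = 0 | Returns: ? = µy + ?y | Returns: Kelly ? = µy + ?y |
|---|---|---|---|---|---|---|---|---|
| ADSK | training | 57.4 | 66.7 | 40.0 | 1103.9 | -0.1 | 107.7 | 41.9 |
| ADSK | test | 53.3 | 51.8 | 11.2 | 8.3 | -0.1 | 3.3 | 1.1 |
| BWA | training | 55.4 | 73.4 | 1.2 | 2836.0 | 352.7 | 309.9 | 141.4 |
| BWA | test | 56.4 | 54.4 | 1.7 | 10.9 | -0.0 | 5.2 | 2.3 |
| CAH | training | 55.6 | 74.1 | 9.7 | 2231.7 | 1.5 | 284.1 | 109.4 |
| CAH | test | 52.8 | 41.5 | 16.3 | -3.9 | -0.1 | 1.5 | 0.0 |
| CE | training | 54.2 | 62.0 | 10.6 | 387.1 | 31.7 | 52.8 | -0.7 |
| CE | test | 51.8 | 49.7 | 13.4 | 3.8 | -0.0 | 2.4 | 0.3 |
| CHTR | training | 56.5 | 54.4 | 63.7 | 128.6 | -0.0 | 32.6 | 0.0 |
| CHTR | test | 54.9 | 47.2 | 14.8 | 1.8 | -0.0 | 0.5 | -0.6 |
| HAS | training | 56.4 | 55.2 | -20.5 | 151.0 | 0.7 | 33.4 | -0.0 |
| HAS | test | 57.9 | 53.3 | 10.9 | 15.9 | 6.4 | 3.8 | 2.4 |
| FANG | training | 54.6 | 71.8 | -68.8 | 8969.7 | 0.2 | 546.1 | -1.1 |
| FANG | test | 52.3 | 55.4 | 54.3 | 31.0 | -0.2 | 3.5 | 2.0 |
| XLNX | training | 55.7 | 52.9 | -2.2 | 61.3 | 0.3 | 3.0 | -0.3 |
| XLNX | test | 60.5 | 52.8 | 27.0 | 8.4 | 1.6 | 1.6 | 1.0 |

Table 6: Performance of pipeline for the 8 tickers in the hourly setting. Error metrics are omitted as they are not all comparable across the training and testing splits. Strategies with returns above that of the passive strategy are bolded.

| Ticker | Set | Parameter | Direction | Error: RMSE | Error: MAPE | Error: CRSP | Accuracy: T | Accuracy: P | Returns: Passive | Returns: ? = 0 | Returns: Kelly ? = 0 | Returns: ? = µy + ?y | Returns: Kelly ? = µy + ?y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADSK | test | – | – | 0.01985 | 1.67056 | 0.91933 | 52.6 | 46.3 | 18.0 | -11.3 | -1.3 | 2.2 | -1.4 |
| ADSK | synthetic | µ | ↑ | 0.02382 | 4.1041 | 1.09991 | | 49.5 | 18.0 | -8.5 | -0.4 | -6.6 | -0.4 |
| ADSK | synthetic | µ | ↓ | 0.02079 | 1.97913 | 0.95181 | | 48.4 | 18.0 | -8.1 | -0.1 | -4.9 | -0.4 |
| ADSK | synthetic | ? | ↑ | 0.02009 | 2.13263 | 0.95911 | | 48.4 | 18.0 | -6.9 | -1.9 | -2.3 | -0.4 |
| ADSK | synthetic | ? | ↓ | 0.02047 | 2.09409 | 0.96214 | | 50.5 | 18.0 | -11.8 | -0.4 | 10.3 | -1.4 |
| ADSK | synthetic | ? | ↑ | 0.02103 | 2.2685 | 1.01209 | | 42.1 | 18.0 | -16.8 | -2.3 | -3.5 | -0.4 |
| ADSK | synthetic | ? | ↓ | 0.0207 | 2.43082 | 1.01386 | | 52.6 | 18.0 | -12.4 | -0.4 | 3.1 | -0.4 |
| BWA | test | – | – | 0.0191 | 1.34892 | 0.86445 | 50.5 | 60.0 | 3.4 | 4.3 | -0.4 | 7.4 | -0.8 |
| BWA | synthetic | µ | ↑ | 0.02102 | 1.95462 | 0.99858 | | 53.7 | 3.4 | -4.6 | -0.4 | 4.2 | -0.1 |
| BWA | synthetic | µ | ↓ | 0.01965 | 1.61631 | 0.90919 | | 53.7 | 3.4 | -4.1 | 1.3 | 6.3 | -0.1 |
| BWA | synthetic | ? | ↑ | 0.01921 | 1.69147 | 0.88259 | | 62.1 | 3.4 | 8.7 | -0.4 | 9.8 | 5.2 |
| BWA | synthetic | ? | ↓ | 0.01947 | 1.52496 | 0.88966 | | 54.7 | 3.4 | -9.6 | -0.4 | 6.4 | -0.1 |
| BWA | synthetic | ? | ↑ | 0.01901 | 1.62888 | 0.86666 | | 54.7 | 3.4 | -3.5 | -0.1 | 3.0 | -1.0 |
| BWA | synthetic | ? | ↓ | 0.02233 | 2.65821 | 1.05353 | | 53.7 | 3.4 | -6.0 | -0.4 | 5.4 | -0.1 |
| CAH | test | – | – | 0.01752 | 1.25558 | 0.81226 | 55.8 | 42.1 | -1.7 | -11.7 | 0.6 | 8.3 | 4.4 |
| CAH | synthetic | µ | ↑ | 0.0198 | 2.47616 | 0.92069 | | 42.1 | -1.7 | -15.6 | 0.4 | -8.1 | -0.3 |
| CAH | synthetic | µ | ↓ | 0.01723 | 1.19231 | 0.80999 | | 42.1 | -1.7 | -13.4 | 0.4 | 5.4 | 2.8 |
| CAH | synthetic | ? | ↑ | 0.01754 | 1.33194 | 0.81139 | | 47.4 | -1.7 | -14.9 | 0.3 | 5.9 | -0.5 |
| CAH | synthetic | ? | ↓ | 0.01749 | 1.24788 | 0.81288 | | 43.2 | -1.7 | -11.2 | 0.6 | 8.3 | 4.4 |
| CAH | synthetic | ? | ↑ | 0.0176 | 1.16013 | 0.82313 | | 45.3 | -1.7 | -14.2 | 0.3 | 8.4 | 3.6 |
| CAH | synthetic | ? | ↓ | 0.01744 | 1.21399 | 0.8002 | | 46.3 | -1.7 | -11.0 | 0.5 | 7.8 | 6.6 |
| CE | test | – | – | 0.01507 | 1.4043 | 0.79985 | 52.6 | 66.3 | 33.7 | 42.8 | 21.3 | 5.9 | 5.0 |
| CE | synthetic | µ | ↑ | 0.01571 | 1.53213 | 0.84797 | | 63.2 | 33.7 | 33.3 | 20.8 | 4.4 | 4.1 |
| CE | synthetic | µ | ↓ | 0.01503 | 1.40595 | 0.80708 | | 64.2 | 33.7 | 39.7 | 7.9 | 5.4 | 3.9 |
| CE | synthetic | ? | ↑ | 0.01529 | 1.35397 | 0.80005 | | 64.2 | 33.7 | 35.6 | 19.6 | 6.4 | 5.6 |
| CE | synthetic | ? | ↓ | 0.01603 | 1.62292 | 0.8911 | | 55.8 | 33.7 | 21.1 | 7.7 | 9.2 | 8.9 |
| CE | synthetic | ? | ↑ | 0.01523 | 1.40196 | 0.81581 | | 63.2 | 33.7 | 34.2 | 17.3 | 6.6 | 5.1 |
| CE | synthetic | ? | ↓ | 0.01594 | 1.3255 | 0.84064 | | 57.9 | 33.7 | 20.6 | 8.6 | -0.2 | -1.2 |
| CHTR | test | – | – | 0.01573 | 1.78706 | 0.90204 | 54.7 | 55.8 | 10.7 | 12.2 | -0.5 | 5.1 | -0.1 |
| CHTR | synthetic | µ | ↑ | 0.01739 | 1.86774 | 0.98482 | | 53.7 | 10.7 | 11.5 | 0.1 | -0.4 | -0.0 |
| CHTR | synthetic | µ | ↓ | 0.01549 | 2.22429 | 0.91671 | | 61.1 | 10.7 | 15.1 | 0.9 | 6.2 | -0.1 |
| CHTR | synthetic | ? | ↑ | 0.01557 | 1.78755 | 0.90237 | | 56.8 | 10.7 | 14.5 | -0.5 | 2.1 | -0.1 |
| CHTR | synthetic | ? | ↓ | 0.01636 | 1.95211 | 0.98315 | | 52.6 | 10.7 | 6.1 | -0.5 | 2.7 | -2.2 |
| CHTR | synthetic | ? | ↑ | 0.01618 | 2.01247 | 0.94647 | | 54.7 | 10.7 | 13.5 | -0.0 | -0.0 | -1.0 |
| CHTR | synthetic | ? | ↓ | 0.01661 | 1.90177 | 0.95431 | | 53.7 | 10.7 | 11.1 | 1.2 | 2.1 | -0.0 |
| HAS | test | – | – | 0.01735 | 1.90888 | 0.84553 | 50.0 | 58.3 | 20.9 | 19.2 | -0.3 | 11.8 | 6.4 |
| HAS | synthetic | µ | ↑ | 0.02635 | 4.9937 | 1.39095 | | 52.1 | 20.9 | 5.6 | 3.1 | 12.4 | -0.4 |
| HAS | synthetic | µ | ↓ | 0.01834 | 2.05575 | 0.89084 | | 53.1 | 20.9 | 15.1 | 3.3 | 10.5 | 9.2 |
| HAS | synthetic | ? | ↑ | 0.01922 | 2.18287 | 1.08325 | | 45.8 | 20.9 | 1.7 | -0.3 | 12.1 | 5.4 |
| HAS | synthetic | ? | ↓ | 0.01779 | 2.14564 | 0.92623 | | 55.2 | 20.9 | 14.6 | -0.3 | 2.3 | 1.1 |
| HAS | synthetic | ? | ↑ | 0.01728 | 1.95595 | 0.86672 | | 59.4 | 20.9 | 12.4 | 2.2 | 6.3 | 1.4 |
| HAS | synthetic | ? | ↓ | 0.02097 | 2.43545 | 1.05765 | | 47.9 | 20.9 | -3.0 | -0.3 | 11.2 | -0.0 |
| FANG | test | – | – | 0.03847 | 1.97975 | 0.81061 | 54.7 | 58.9 | 26.7 | 43.2 | 5.3 | 0.8 | -6.1 |
| FANG | synthetic | µ | ↑ | 0.04187 | 3.16024 | 0.91733 | | 51.6 | 26.7 | -2.1 | -0.7 | -7.9 | -0.4 |
| FANG | synthetic | µ | ↓ | 0.04246 | 2.5618 | 0.93104 | | 49.5 | 26.7 | -10.3 | -0.4 | 11.2 | -2.0 |
| FANG | synthetic | ? | ↑ | 0.03882 | 2.66116 | 0.83269 | | 58.9 | 26.7 | 38.9 | 4.9 | 0.7 | -2.1 |
| FANG | synthetic | ? | ↓ | 0.0393 | 1.9289 | 0.8365 | | 53.7 | 26.7 | 0.7 | -0.7 | 0.8 | -6.1 |
| FANG | synthetic | ? | ↑ | 0.0408 | 2.0224 | 0.87012 | | 58.9 | 26.7 | 33.6 | 3.8 | 1.8 | -4.5 |
| FANG | synthetic | ? | ↓ | 0.03878 | 2.45537 | 0.82008 | | 60.0 | 26.7 | 41.2 | -0.6 | 0.0 | -3.2 |
| XLNX | test | – | – | 0.01807 | 1.13326 | 0.80098 | 52.6 | 54.7 | 34.8 | 11.5 | -0.7 | 1.7 | 2.3 |
| XLNX | synthetic | µ | ↑ | 0.01809 | 1.20686 | 0.80813 | | 50.5 | 34.8 | 4.2 | 0.6 | 3.6 | 2.6 |
| XLNX | synthetic | µ | ↓ | 0.01976 | 1.53148 | 0.88331 | | 51.6 | 34.8 | -0.3 | -0.7 | 1.7 | 2.3 |
| XLNX | synthetic | ? | ↑ | 0.01836 | 1.0886 | 0.81061 | | 52.6 | 34.8 | 5.0 | 1.7 | 1.7 | 2.3 |
| XLNX | synthetic | ? | ↓ | 0.01837 | 1.10096 | 0.81082 | | 49.5 | 34.8 | -3.2 | -0.7 | 1.7 | 2.3 |
| XLNX | synthetic | ? | ↑ | 0.0182 | 1.10483 | 0.80442 | | 52.6 | 34.8 | 7.5 | 1.3 | -0.2 | 0.7 |
| XLNX | synthetic | ? | ↓ | 0.01837 | 1.15269 | 0.8134 | | 55.8 | 34.8 | 4.7 | -0.7 | 1.7 | 2.3 |

Table 7: Results of manipulating the 3 parameters of Student-T distribution (µ, ? and ?) in both directions using Algorithm 12 for all 8 tickers for the DeepAR model on daily forecasts. The metrics on the testing distribution are shown for comparison. Perturbations that degraded a metric are shown in green.

| Ticker | Set | Parameter | Direction | Error: RMSE | Error: MAPE | Error: CRSP | Accuracy: T | Accuracy: P | Returns: Passive | Returns: ? = 0 | Returns: Kelly ? = 0 | Returns: ? = µy + ?y | Returns: Kelly ? = µy + ?y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADSK | training | – | – | 0.00731 | 1.04171 | 0.66758 | 57.4 | 66.7 | 40.0 | 1103.9 | 685.0 | 107.7 | 83.1 |
| ADSK | test | – | – | 0.0067 | 1.31622 | 0.85351 | 53.3 | 51.8 | 11.2 | 8.3 | 1.9 | 3.3 | 1.8 |
| ADSK | synthetic | µ | ↑ | 0.00673 | 1.34907 | 0.86056 | | 51.3 | 11.2 | 8.1 | 1.6 | 2.2 | 0.8 |
| ADSK | synthetic | µ | ↓ | 0.0068 | 1.61973 | 0.87885 | | 50.8 | 11.2 | 6.3 | 1.2 | 9.3 | 7.7 |
| ADSK | synthetic | ? | ↑ | 0.00673 | 1.35349 | 0.86564 | | 50.3 | 11.2 | 8.3 | 2.9 | 3.1 | 1.8 |
| ADSK | synthetic | ? | ↓ | 0.00674 | 1.28695 | 0.86225 | | 51.8 | 11.2 | 7.6 | 0.9 | 1.0 | -0.5 |
| ADSK | synthetic | ? | ↑ | 0.0068 | 1.6105 | 0.8788 | | 50.8 | 11.2 | 8.1 | 2.9 | 2.6 | 1.4 |
| ADSK | synthetic | ? | ↓ | 0.00675 | 1.29013 | 0.86307 | | 51.3 | 11.2 | 7.5 | 1.3 | 2.2 | 0.5 |
| BWA | training | – | – | 0.00594 | 0.8368 | 0.54563 | 55.4 | 73.4 | 1.2 | 2836.0 | 2836.0 | 309.9 | 309.9 |
| BWA | test | – | – | 0.00558 | 3.58545 | 0.94565 | 56.4 | 54.4 | 1.7 | 10.9 | 10.9 | 5.2 | 5.2 |
| BWA | synthetic | µ | ↑ | 0.0058 | 3.9357 | 0.97229 | | 52.3 | 1.7 | 9.1 | 9.1 | 3.3 | 3.3 |
| BWA | synthetic | µ | ↓ | 0.00549 | 3.55744 | 0.93809 | | 49.2 | 1.7 | 6.4 | 6.4 | 5.0 | 5.0 |
| BWA | synthetic | ? | ↑ | 0.00625 | 6.32844 | 1.09095 | | 47.2 | 1.7 | 7.8 | 7.8 | 8.2 | 8.2 |
| BWA | synthetic | ? | ↓ | 0.00562 | 3.64542 | 0.9523 | | 52.8 | 1.7 | 8.9 | 8.9 | 5.1 | 5.1 |
| BWA | synthetic | ? | ↑ | 0.00574 | 4.12034 | 0.99432 | | 50.8 | 1.7 | 7.8 | 7.8 | 4.7 | 4.7 |
| BWA | synthetic | ? | ↓ | 0.00558 | 3.70664 | 0.94011 | | 52.8 | 1.7 | 10.0 | 10.0 | 5.1 | 5.1 |
| CAH | training | – | – | 0.0057 | 1.26388 | 0.55608 | 55.6 | 74.1 | 9.7 | 2231.7 | 2231.7 | 284.1 | 284.1 |
| CAH | test | – | – | 0.00516 | 2.74455 | 0.98187 | 52.8 | 41.5 | 16.3 | -3.9 | -3.9 | 1.5 | 1.5 |
| CAH | synthetic | µ | ↑ | 0.00519 | 2.77402 | 0.98636 | | 40.5 | 16.3 | -4.0 | -4.0 | 2.1 | 2.1 |
| CAH | synthetic | µ | ↓ | 0.00514 | 2.79521 | 0.98467 | | 37.9 | 16.3 | -5.8 | -5.8 | -0.2 | -0.2 |
| CAH | synthetic | ? | ↑ | 0.00569 | 2.99766 | 1.07947 | | 44.1 | 16.3 | -1.3 | -1.3 | -2.7 | -2.7 |
| CAH | synthetic | ? | ↓ | 0.0052 | 3.05535 | 0.99033 | | 39.5 | 16.3 | -5.2 | -5.2 | 1.3 | 1.3 |
| CAH | synthetic | ? | ↑ | 0.00517 | 2.7665 | 0.98229 | | 40.5 | 16.3 | -4.4 | -4.4 | 1.5 | 1.5 |
| CAH | synthetic | ? | ↓ | 0.00517 | 2.68121 | 0.99001 | | 39.0 | 16.3 | -5.4 | -5.4 | 1.7 | 1.7 |
| CE | training | – | – | 0.00633 | 1.1547 | 0.74773 | 54.2 | 62.0 | 10.6 | 387.1 | 387.1 | 52.8 | 52.8 |
| CE | test | – | – | 0.00476 | 1.3358 | 0.87501 | 51.8 | 49.7 | 13.4 | 3.8 | 3.8 | 2.4 | 2.4 |
| CE | synthetic | µ | ↑ | 0.00485 | 1.66177 | 0.88968 | | 47.7 | 13.4 | -2.1 | -2.1 | 1.3 | 1.3 |
| CE | synthetic | µ | ↓ | 0.00476 | 1.36677 | 0.88134 | | 46.7 | 13.4 | 1.0 | 1.0 | 1.9 | 1.9 |
| CE | synthetic | ? | ↑ | 0.00479 | 1.8357 | 0.87928 | | 47.2 | 13.4 | -0.7 | -0.7 | 1.2 | 1.2 |
| CE | synthetic | ? | ↓ | 0.00478 | 1.33705 | 0.8826 | | 47.7 | 13.4 | -0.2 | -0.2 | 2.4 | 2.4 |
| CE | synthetic | ? | ↑ | 0.00476 | 1.3745 | 0.87603 | | 49.2 | 13.4 | 2.8 | 2.8 | 2.0 | 2.0 |
| CE | synthetic | ? | ↓ | 0.0048 | 1.3534 | 0.87865 | | 46.7 | 13.4 | 1.3 | 1.3 | 1.6 | 1.6 |
| CHTR | training | – | – | 0.00498 | 1.47279 | 0.78936 | 56.5 | 54.4 | 63.7 | 128.6 | 66.1 | 32.6 | 13.5 |
| CHTR | test | – | – | 0.00444 | 1.2516 | 0.86571 | 54.9 | 47.2 | 14.8 | 1.8 | 1.4 | 0.5 | 0.3 |
| CHTR | synthetic | µ | ↑ | 0.00447 | 1.28213 | 0.87256 | | 44.1 | 14.8 | -0.6 | -0.8 | 0.1 | -0.1 |
| CHTR | synthetic | µ | ↓ | 0.00446 | 1.26073 | 0.86816 | | 46.2 | 14.8 | 0.4 | 0.0 | 0.4 | 0.2 |
| CHTR | synthetic | ? | ↑ | 0.00442 | 1.38467 | 0.87888 | | 44.1 | 14.8 | -0.5 | -0.4 | 0.1 | -0.0 |
| CHTR | synthetic | ? | ↓ | 0.00442 | 1.25654 | 0.86151 | | 48.7 | 14.8 | 3.5 | 3.0 | -0.4 | -0.7 |
| CHTR | synthetic | ? | ↑ | 0.00446 | 1.26407 | 0.8685 | | 44.1 | 14.8 | -0.9 | -1.1 | 0.2 | 0.1 |
| CHTR | synthetic | ? | ↓ | 0.00445 | 1.28078 | 0.86863 | | 45.6 | 14.8 | 0.7 | 0.2 | 1.0 | 0.7 |
| HAS | training | – | – | 0.00697 | 1.10188 | 0.7614 | 56.4 | 55.2 | -20.5 | 151.0 | 151.0 | 33.4 | 33.4 |
| HAS | test | – | – | 0.00564 | 6.14252 | 0.8151 | 57.9 | 53.3 | 10.9 | 15.9 | 15.9 | 3.8 | 3.8 |
| HAS | synthetic | µ | ↑ | 0.00573 | 10.10177 | 0.83015 | | 51.3 | 10.9 | 9.8 | 9.8 | 3.1 | 3.1 |
| HAS | synthetic | µ | ↓ | 0.00575 | 10.95073 | 0.84026 | | 50.8 | 10.9 | 11.1 | 11.1 | 1.2 | 1.2 |
| HAS | synthetic | ? | ↑ | 0.00566 | 9.35223 | 0.81801 | | 49.2 | 10.9 | 10.1 | 10.1 | 3.7 | 3.7 |
| HAS | synthetic | ? | ↓ | 0.00565 | 5.96019 | 0.8164 | | 50.8 | 10.9 | 13.5 | 13.5 | 3.1 | 3.1 |
| HAS | synthetic | ? | ↑ | 0.00575 | 4.12416 | 0.82732 | | 47.7 | 10.9 | 8.4 | 8.4 | 3.6 | 3.6 |
| HAS | synthetic | ? | ↓ | 0.00565 | 5.78568 | 0.81653 | | 50.8 | 10.9 | 13.5 | 13.5 | 3.1 | 3.1 |
| FANG | training | – | – | 0.00953 | 1.40412 | 0.5487 | 54.6 | 71.8 | -68.8 | 8969.7 | 8969.7 | 546.1 | 546.1 |
| FANG | test | – | – | 0.01225 | 7.92493 | 0.91119 | 52.3 | 55.4 | 54.3 | 31.0 | 31.0 | 3.5 | 3.5 |
| FANG | synthetic | µ | ↑ | 0.01229 | 7.86536 | 0.91588 | | 54.9 | 54.3 | 30.5 | 30.5 | 2.3 | 2.3 |
| FANG | synthetic | µ | ↓ | 0.0123 | 8.34055 | 0.91471 | | 54.9 | 54.3 | 30.9 | 30.9 | 3.4 | 3.4 |
| FANG | synthetic | ? | ↑ | 0.01219 | 7.65769 | 0.91185 | | 54.9 | 54.3 | 30.6 | 30.6 | 3.3 | 3.3 |
| FANG | synthetic | ? | ↓ | 0.01224 | 7.86871 | 0.90992 | | 54.9 | 54.3 | 27.3 | 27.3 | 3.5 | 3.5 |
| FANG | synthetic | ? | ↑ | 0.01212 | 7.44283 | 0.9023 | | 51.8 | 54.3 | 24.0 | 24.0 | 2.8 | 2.8 |
| FANG | synthetic | ? | ↓ | 0.01227 | 7.74312 | 0.91352 | | 54.9 | 54.3 | 26.1 | 26.1 | 3.5 | 3.5 |
| XLNX | training | – | – | 0.00665 | 1.446 | 0.81742 | 55.7 | 52.9 | -2.2 | 61.3 | 61.3 | 3.0 | 3.0 |
| XLNX | test | – | – | 0.00566 | 1.20706 | 0.85306 | 60.5 | 52.8 | 27.0 | 8.4 | 8.4 | 1.6 | 1.6 |
| XLNX | synthetic | µ | ↑ | 0.00575 | 1.45487 | 0.85287 | | 50.8 | 27.0 | 1.1 | 1.1 | 0.3 | 0.3 |
| XLNX | synthetic | µ | ↓ | 0.00567 | 1.21223 | 0.85517 | | 51.3 | 27.0 | 7.2 | 7.2 | 1.1 | 1.1 |
| XLNX | synthetic | ? | ↑ | 0.00566 | 1.20815 | 0.85172 | | 51.3 | 27.0 | 7.3 | 7.3 | 1.1 | 1.1 |
| XLNX | synthetic | ? | ↓ | 0.00567 | 1.21264 | 0.85555 | | 51.3 | 27.0 | 6.4 | 6.4 | 1.6 | 1.6 |
| XLNX | synthetic | ? | ↑ | 0.00569 | 1.22304 | 0.8551 | | 50.8 | 27.0 | 7.9 | 7.9 | 1.1 | 1.1 |
| XLNX | synthetic | ? | ↓ | 0.00571 | 1.20055 | 0.85365 | | 52.3 | 27.0 | 7.8 | 7.8 | 1.1 | 1.1 |

Table 8: DeepAR Hourly

Bibliography

[1] Basel framework, https://www.bis.org/basel_framework/
[2] Goldman pays the price of being big, https://www.ft.com/content/d2121cb6-49cb-11dc-9ffe-0000779fd2ac
[3] Agostino Capponi, C.A.L.: Machine Learning in Financial Markets: A guide to contemporary practices.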
Cambridge University Press (2021)
[4] Alexandrov, A., Benidis, K., Bohlke-Schneider, M., Flunkert, V., Gasthaus, J., Januschowski, T., Maddix, D.C., Rangapuram, S., Salinas, D., Schulz, J., et al.: Gluonts: Probabilistic time series models in python. arXiv preprint arXiv:1906.05264 (2019)
[5] Araci, D.: Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063 (2019)
[6] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
[7] Basak, S., Kar, S., Saha, S., Khaidem, L., Dey, S.: Predicting the direction of stock market prices using tree-based classifiers. The North American Journal of Economics and Finance 47 (07 2018). https://doi.org/10.1016/j.najef.2018.06.013
[8] Bloomberg: Finding novel ways to trade on sentiment data. https://www.bloomberg.com/professional/blog/finding-novel-ways-trade-sentiment-data/ (2017)
[9] Böhme, M., Pham, V.T., Roychoudhury, A.: Coverage-based greybox fuzzing as markov chain. IEEE Transactions on Software Engineering 45(5), 489–506 (2017)
[10] Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. Journal of computational science 2(1), 1–8 (2011)
[11] Borovykh, A., Bohte, S., Oosterlee, C.W.: Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691 (2017)
[12] Carlini, N., Wagner, D.: Audio adversarial examples: Targeted attacks on speech-to-text. In: 2018 IEEE Security and Privacy Workshops (SPW). pp. 1–7. IEEE (2018)
[13] Chen, L., Pelger, M., Zhu, J.: Deep learning in asset pricing (2021)
[14] Dang-Nhu, R., Singh, G., Bielik, P., Vechev, M.: Adversarial attacks on probabilistic autoregressive forecasting models. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 2356–2365. PMLR (13–18 Jul 2020), https://proceedings.mlr.press/v119/dang-nhu20a.html
[15] Fama, E.F.: Random walks in stock market prices. Financial Analysts Journal 21(5), 55–59 (1965), http://www.jstor.org/stable/4469865
[16] Fraccaro, M., Sonderby, S.K., Paquet, U., Winther, O.: Sequential neural models with stochastic layers. arXiv preprint arXiv:1605.07571 (2016)
[17] H.Burhani, Ding, G., P.Hernandez-Leal, Prince, S., D. Shi, S.S.: Aiden – reinforcement learning for order execution. https://www.borealisai.com/en/blog/aiden-reinforcement-learning-for-order-execution/ (2020)
[18] Heaton, J.B., Polson, N.G., Witte, J.H.: Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry 33(1), 3–12 (2017). https://doi.org/10.1002/asmb.2209, https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.2209
[19] Hull, J.: Options, futures, and other derivatives. Pearson Prentice Hall, Upper Saddle River, NJ [u.a.], 6. ed., pearson internat. ed edn. (2006), http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+563580607&sourceid=fbw_bibsonomy
[20] Jiang, Z., Liang, J.: Cryptocurrency portfolio management with deep reinforcement learning. In: 2017 Intelligent Systems Conference (IntelliSys). pp. 905–913 (2017). https://doi.org/10.1109/IntelliSys.2017.8324237
[21] Kelly, J.: A new interpretation of information rate. IRE Transactions on Information Theory 2(3), 185–189 (1956). https://doi.org/10.1109/TIT.1956.1056803
[22] Kolanovic, M., Krishnamachari, R.T.: Big data and ai strategies. research report, JP Morgan (18 May 2017)
[23] Krauss, C., Do, X.A., Huck, N.: Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the s&p 500. European Journal of Operational Research 259(2), 689–702 (2017). https://doi.org/10.1016/j.ejor.2016.10.031, https://www.sciencedirect.com/science/article/pii/S0377221716308657
[24] Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016)
[25] Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.X., Yan, X.: Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In: Advances in Neural Information Processing Systems. pp. 5243–5253 (2019)
[26] Lim, B., Arik, S.O., Loeff, N., Pfister, T.: Temporal fusion transformers for interpretable multi-horizon time series forecasting. arXiv preprint arXiv:1912.09363 (2019)
[27] McNeil, A.J., Frey, R., Embrechts, P.: Quantitative risk management: concepts, techniques and tools-revised edition. Princeton university press (2015)
[28] Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in neural information processing systems 29, 3387–3395 (2016)
[29] Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
[30] Reuters, T.: Thomson reuters adds unique Twitter and news sentiment analysis to Thomson Reuters Eikon. https://www.thomsonreuters.com/en/press-releases/2014/thomson-reuters-adds-unique-twitter-and-news-sentiment-analysis-to-thomson-reuters-eikon.html (2014)
[31] Salinas, D., Flunkert, V., Gasthaus, J., Januschowski, T.: Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36(3), 1181–1191 (2020)
[32] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
[33] Smyl, S.: A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting 36(1), 75–85 (2020)
[34] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. 15(1), 1929–1958 (Jan 2014)
[35] Trippi, R.R., Turban, E.: Neural networks in finance and investing: Using artificial intelligence to improve real world performance. McGraw-Hill, Inc. (1992)
[36] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
[37] Zhang, Y., Tiňo, P., Leonardis, A., Tang, K.: A survey on neural network interpretability. IEEE Transactions on Emerging Topics in Computational Intelligence (2021)
[38] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856 (2014)