An Assessment of the Prediction Quality of VPIN

VPIN is a tool designed to predict extreme events like flash crashes. Some concerns have been raised about its reliability. In this chapter we assess the prediction quality of VPIN (precision and recall rates) for extreme volatility events, including its sensitivity to the starting point of computation in a given data set. We benchmark the results against those of a "naive classifier." The test data used in this study contains 5.6 years' worth of trading data of the five most liquid futures contracts of this time period. We found that VPIN has poor "flash crash" prediction power with the traditional 0.99 decision threshold. Increasing the decision threshold does not significantly improve overall prediction quality. Nevertheless, we found that VPIN has more interesting predictive power for flash events of lower amplitude. Finally, we found that, for practice, the last bar price structure is the least sensitive to the starting point of computation.


Main study purpose
Easley et al. [1] designed a tool, nicknamed the volume-synchronized probability of informed trading (VPIN), with the aim of predicting flash crashes. It appeared it could have predicted the "flash crash" of May 6, 2010, a few hours before it happened [2]. Several papers were published on it [3][4][5], and it was proposed to use it for regulation through a VPIN contract [2,6]. However, critics pointed out some flaws, questioning its reliability [7][8][9][10][11], but without providing a quantitative evaluation of its prediction quality (e.g., in terms of precision and recall rates). In this study, we design a framework to detect flash crashes and thereby assess the behavior of the VPIN tool, while also enabling comparison and benchmarking with other predictive algorithms.

Motivation
The amount of trading data in finance has exploded thanks to the continuing progress of high-frequency techniques. It compels practitioners to use ever more state-of-the-art algorithms to deal with this overwhelming amount of information. Computers and algorithms are increasingly efficient, but decision-making still depends highly on both the quantity and the quality of information. Thus, errors and speculation that can make the financial market toxic, i.e., conducive to crashes, are possible. Past examples, such as the "flash crash" of May 6, 2010, have shown that this new paradigm in finance has introduced a new kind of crash characterized by its suddenness. Such quick crashes seem dangerous because they appear inherently unpredictable. However, predictive models for this new regime do exist.

Model
Easley et al. [12] designed a model of the high-frequency financial market based on flows of informed and uninformed traders. They showed that information is a key parameter of the bid-ask spread. The model works as follows. Each day, nature decides whether or not there is an event that can help predict the evolution of the price of a future. This is modeled with a Bernoulli law of parameter α. If an event occurs, nature also decides, with a Bernoulli law of parameter δ, whether this event is a low signal. Given these conditions, buys and sells for this future then come from flows of informed and uninformed traders, modeled by Poisson processes of respective parameters μ and ε. This framework is summarized by the tree in Figure 1 [13].
The whole trading process studied is thus a mixture of Poisson processes. It enabled the authors to compute the ask and the bid, and then the spread. They showed that, in reasonable cases, the spread is linearly linked with the following probability, which they named the probability of informed trading (PIN) [12]:

PIN = αμ / (αμ + 2ε). (1)

Later, Easley et al. [2] designed a new framework to compute this probability easily. Indeed, PIN numbers come from a parametrized framework, and one does not have access to all these parameters. They showed, however, that PIN can be well approximated through a volume-clock paradigm [14], thanks to futures data and a new formula. The approximated version of PIN was then called the volume-synchronized probability of informed trading (VPIN). It appeared that this new tool could have predicted the "flash crash" of May 6, 2010, a few hours before it happened [2]. Nevertheless, the model has received much criticism. For example, Andersen and Bondarenko have shown [7] that VPIN is quite sensitive to the point at which one starts computing it on a data set, which calls its prediction quality into question. Moreover, they have also shown that VPIN is sensitive to other parameters, such as the trade classification rule used [8] or how one defines the average daily volume of trades [9]. Changing the classification rule may drastically change VPIN behavior [9]. Pöppe et al. have reached the same conclusions with a different approach: using a different classification rule can change VPIN prediction power toward a crash (in their paper, a German blue-chip stock [11]). Besides, controlling parameters ex ante seems to give poorer prediction quality [8,9]. This point has also been checked by Abad et al. [10]. When realized volatility and trading intensity are controlled for ex ante, as done by Andersen and Bondarenko [9], prediction quality seems to vanish.
More deeply, they have also underlined that it is not obvious how one should define a VPIN prediction, analyzing more precisely toxic and nontoxic halts, as well as toxic events. Furthermore, Torben G. Andersen and Oleg Bondarenko interpret VPIN as being too sensitive to trading intensity. They have also explained that the VPIN metric is sometimes unexpectedly correlated with other usual metrics (such as VIX or RV) [7,8]. More recently, it has been shown theoretically that the volume-clock paradigm of the VPIN framework does not really allow one to fully approximate the PIN value, although the proposed formula is close [15,16].
More generally, all these criticisms have pointed out that:
• First, it is not obvious how one should use VPIN.
• Second, prediction quality has not been studied sufficiently to assess it as being reliable.
• Third, these studies lack an objective benchmark.

Goal
The purpose of this chapter is to quantify the prediction quality of VPIN in order to enable practitioners to assess whether or not it can be used in the real world (e.g., for trading or regulation). That is why:
• First, we want to design a proper framework to compute the precision and recall rates as well as the prediction length of VPIN. This will be possible by providing a formal definition of flash crashes. To be more precise, we will use the maximum intermediate return (MIR) [5] to define them.
• Second, we want to study through this framework how sensitive VPIN is to the starting point of the data set.

Plan
In the following, we first recall the VPIN model and propose a definition of flash crashes (Section 2). Second, we assess VPIN prediction quality within this framework (Section 3). Finally, we assess VPIN sensitivity to the starting point of the data set (Section 4).

VPIN software and formal flash crash definition
In this section, we first recall the VPIN model. Second, we propose a definition of flash crashes used to compute precision and recall rates. Finally, we present the data used in our tests.

VPIN software
Easley et al. [12] designed a model of the high-frequency financial market based on informed and uninformed traders. It makes it possible to compute a probability of informed trading (PIN). Easley et al. [1] use these results and define an easy way to compute PIN from trade data only. We briefly describe the VPIN model used in previous literature; the theoretical study of the model is treated in another research study.

Bars
Following Easley et al. [1], a bar is a fixed volume of trades that are successive in time. With such a definition, one can associate the following quantities with each bar:
• A nominal price, computed according to a given technique (mean price, median price, closing price, opening price, etc.)
• A nominal time (first trade time, last trade time)
• Local maximum and minimum prices of the trades
In practice, the last few trades that do not fill up a bar are deferred to the next bar.
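As an illustration, the bar construction above can be sketched as follows. This is a minimal sketch, not the authors' implementation; the trade-tuple format (time, price, volume) and the field names are illustrative assumptions:

```python
def make_bars(trades, bar_volume):
    """Group successive trades into bars of (approximately) fixed volume.

    Each trade is an illustrative (timestamp, price, volume) tuple.
    Trades that do not fill up the final bar are left over (not emitted)."""
    bars = []
    cur_trades, cur_vol = [], 0
    for ts, price, vol in trades:
        cur_trades.append((ts, price, vol))
        cur_vol += vol
        if cur_vol >= bar_volume:
            prices = [p for _, p, _ in cur_trades]
            bars.append({
                "open": prices[0],                    # first-trade price
                "close": prices[-1],                  # last-trade price
                "mean": sum(prices) / len(prices),
                "median": sorted(prices)[len(prices) // 2],
                "high": max(prices),                  # local maximum
                "low": min(prices),                   # local minimum
                "first_time": cur_trades[0][0],
                "last_time": cur_trades[-1][0],
            })
            cur_trades, cur_vol = [], 0
    return bars
```

Any of the nominal price fields ("open", "close", "mean", "median") can then serve as the bar price structure tested later in the chapter.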

Bulk volume classification
The computation of VPIN requires determining the directions of trades, i.e., classifying each trade as a buy or a sell. The method used here is bulk volume classification (BVC) [1,5]. Let V_bar denote the volume of a bar, j the label of bar number j (j > 0), and P_j its price (closing, opening, median, or mean). Then the buy volume V_b^j within bar j is determined according to the formula

V_b^j = V_bar · Z((P_j − P_{j−1}) / σ),

where Z is the cumulative distribution function of a given law (usually Student-t or normal) and σ is the standard deviation of the numerator P_j − P_{j−1} over successive bars. In our tests, σ is computed once over all successive values of the data set, and the Student law has parameter one. Within bar j, the sell volume is then obviously V_s^j = V_bar − V_b^j.
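Under the stated choices (Student law with one degree of freedom, whose CDF is the Cauchy CDF), the BVC split can be sketched as follows; function and variable names are illustrative assumptions:

```python
import math

def bvc_buy_volume(prices, bar_volume, sigma):
    """Bulk volume classification sketch: split each bar's volume into
    buys and sells based on the price change between successive bars,
    using the Student-t CDF with one degree of freedom (Cauchy CDF)."""
    def cauchy_cdf(x):
        # CDF of Student's t with 1 degree of freedom
        return 0.5 + math.atan(x) / math.pi

    buys = []
    for j in range(1, len(prices)):
        z = cauchy_cdf((prices[j] - prices[j - 1]) / sigma)
        buys.append(bar_volume * z)          # V_b^j: estimated buy volume
    sells = [bar_volume - b for b in buys]   # V_s^j = V_bar - V_b^j
    return buys, sells
```

A flat price splits the bar's volume evenly; a rising price classifies more than half of it as buys.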

Buckets
A bucket is defined to be a fixed volume of successive trades. Here, to simplify, as bars are also defined as a fixed volume of trades, a bucket will be m successive bars. Let us note V_bucket the fixed volume of a bucket. We naturally have V_bucket = m·V_bar.

VPIN formula
The VPIN formula is computed on n successive buckets, where n is the VPIN support. A buffer is defined as n successive buckets. Here is the VPIN formula, approximating (1) at bucket number j (j ≥ n):

VPIN_j = ( Σ_{i=j−n+1}^{j} |V_s^i − V_b^i| ) / (n · V_bucket),

where, for a given bucket i, V_b^i and V_s^i are the buy and sell volumes of the bucket obtained from the BVC classification of its bars. In order to distribute all VPIN values between 0 and 1, in practice, VPIN is normalized through a normal law (i.e., mapped through the cumulative distribution function of a normal distribution fitted to the raw VPIN values). We thus consider the normalized VPIN in the following.
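A minimal sketch of the (unnormalized) VPIN computation over the buffer, assuming the per-bucket buy and sell volumes have already been obtained via BVC; names are illustrative:

```python
def vpin_series(buy_vols, sell_vols, n, bucket_volume):
    """VPIN over a rolling window of n buckets (sketch): the order-flow
    imbalance |V_s - V_b| per bucket, summed over the buffer and
    normalized by the total buffer volume n * bucket_volume."""
    imbalance = [abs(s - b) for b, s in zip(buy_vols, sell_vols)]
    vpins = []
    for j in range(n - 1, len(imbalance)):
        vpins.append(sum(imbalance[j - n + 1 : j + 1]) / (n * bucket_volume))
    return vpins
```

Balanced buckets yield a VPIN of 0; completely one-sided flow yields 1, so the raw values already lie in [0, 1] before the normal-law normalization mentioned above.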

VPIN event
A VPIN event is declared at bucket j when VPIN_j ≥ θ_VPIN, where θ_VPIN is a given decision threshold. In practice [5], θ_VPIN = 0.99.

Formal definition
Let (p_t)_t be a time series (e.g., of prices). Here is the definition of MIR over a window of length η ending at time t:

MIR_{t,η} = max_{t−η ≤ i ≤ j ≤ t} |p_j − p_i| / p_i.

A flash crash will depend on two things here:
• The amplitude of the crash, which means extreme MIR values (e.g., 10%)
• The shortness of the fall, which means the shortness of the time window η that computes MIR_{t,η} (e.g., 10 minutes); more precisely, noting (i*, j*) the pair of indices achieving the maximum, the fall has length |j* − i*|
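The MIR definition above can be sketched by brute force (quadratic in the window length). This is an illustrative sketch; taking the return in absolute value is an assumption of the reconstruction:

```python
def mir(prices, eta):
    """Maximum intermediate return over a trailing window of eta points
    (brute-force sketch): the largest relative move |p_j/p_i - 1|
    between any two ordered points i <= j of the window."""
    best, best_pair = 0.0, (0, 0)
    n = len(prices)
    lo = max(0, n - eta)
    for i in range(lo, n):
        for j in range(i, n):
            r = abs(prices[j] / prices[i] - 1.0)
            if r > best:
                best, best_pair = r, (i, j)
    return best, best_pair   # the fall has length |j* - i*|
```

For a production setting, the maximum can be maintained incrementally, but the quadratic version makes the definition explicit.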

Empiric definition
We reported in this data set only one flash crash, i.e., on May 6, 2010, which lasted approximately 10 minutes according to media and financial institutions. Our definition of flash crash will obviously take into account this event.

Futures used
In this work, we use a comprehensive set of liquid futures trading data to illustrate the techniques to be introduced. More specifically, we will use 67 months' worth of tick data of the five most liquid futures traded across all asset classes. The data comes to us in the form of five CSV files, one for each futures contract traded. The source of our data is TickWrite, a data vendor that normalizes the data into a common structure after acquiring it directly from the relevant exchanges. The total size of the comma-separated value (CSV) files is about 45.1 GB. They contain over a billion trades spanning from the beginning of January 2007 to the end of July 2012. The data set contains five of the most heavily traded futures contracts, each with more than 100 million trades during this 67-month period. The most heavily traded futures contract, E-mini S&P500 (symbol ES), has about 500 million trades involving a total of about 3 billion contracts. The second most heavily traded is Euro exchange rates (symbol EC), with 188 million trades. The next three are Nasdaq 100 (NQ), 173 million trades; light crude oil (CL), 165 million trades; and E-mini Dow Jones (YM), 110 million trades. In Figure 2, one can see the evolution of prices with time (here each tick corresponds to a bucket).

Definition of flash crash
We want to define a flash crash empirically using the tools of the VPIN framework, namely, bars and buckets. As the volume-clock paradigm does not allow controlling the filling times of fixed volumes of trades, below is a summary of the steps we have followed to detect flash crashes using MIR. As the procedure is quite long and the main purpose of the study is the prediction results of the following section, we present the principles and do not go into technical details:
• To be sure not to miss a flash crash because a bar or bucket lasts too long in time, we have chosen a reasonable granularity level as in [5] (200 buckets per day and 30 bars per bucket).
• For each financial instrument, we have recorded the number of bars necessary to capture the local 10 minutes of maximum fall of May 6, 2010, known as the "flash crash"; we refer to these numbers as "window lengths" below.
• As the window lengths defined above do not have a stable distribution in time (because of the volume-clock paradigm), we have arbitrarily filtered out all events in which the time difference between the minimum and the maximum within a window length is longer than 20 minutes, in order to capture only quick events. Indeed, one given window length may be too big and thus, at some date, lead to measuring a time difference between the local minimum and maximum longer than 10 minutes, whereas the event would be a true flash crash with a smaller window length.¹
• For each instrument, we recorded the amplitude of the "flash crash" and the respective MIR values.
The results made it possible to classify the five financial instruments into two groups:
• Data sets where the "flash crash" and other flash crashes are significantly present: ES, NQ, and YM.
• Data sets where the "flash crash" and other flash crashes are not significantly present. More precisely, events of the "flash crash" magnitude are not rare in these data sets, and the magnitude levels of flash crashes are generally low compared to the other instruments.

Assessing VPIN prediction quality
In this section, we first present our methodology to find VPIN's optimal prediction quality (for which precision and recall rates are maximal and most useful for practice). Second, we present all the results: best parameters, associated remarks, and prediction lengths.

Parameters to test
¹ This is not perfect because we can still miss some crashes (although in this data set there will not be many, and they occur with a smaller probability). But, first, we do not want to change the temporal definition of a flash crash too much (we will not increase the tolerance level to 1 day), and, second, this problem is inherent to the fact that fixing the volume of bars and buckets prevents us from controlling bar and bucket filling times precisely. Finding a solution for this precise data set does not at all guarantee a general solution, neither for another data set nor for another financial instrument.

Here are the parameters we will test:
• Bar price: mean, median, last price, first price
• MIR decision threshold θ_MIR to detect a flash crash
• VPIN support n
• VPIN classifier (Student, normal)
• Prediction window ω (described below)
• VPIN decision threshold θ_VPIN to predict a flash crash

Defining true positive events
Here we describe how we define true-positive, false-negative, and false-positive events, for a given prediction window length ω:
• From a flash crash detected at a bucket j: if at least one VPIN event occurred within the ω buckets preceding j, then we consider it a true-positive event.² Otherwise it is a false-negative event.
• From a VPIN event at a bucket j: if at least one flash crash occurs within the ω buckets following j, then we consider it a true-positive event.³ Otherwise it is a false-positive event.
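The matching rules above can be sketched as follows; the exact window conventions (inclusive bounds, one-to-many matching) are assumptions of the sketch, not confirmed details of the original study:

```python
def confusion_counts(vpin_events, crash_events, omega):
    """Match VPIN events to crashes within a prediction window of omega
    buckets. vpin_events and crash_events are lists of bucket indices."""
    # A crash is a true positive if some VPIN event warned within omega
    # buckets before it; otherwise it is a false negative.
    tp_crashes = sum(
        1 for c in crash_events
        if any(c - omega <= v <= c for v in vpin_events)
    )
    fn = len(crash_events) - tp_crashes
    # A VPIN event is a true positive if a crash follows within omega
    # buckets; otherwise it is a false positive.
    tp_preds = sum(
        1 for v in vpin_events
        if any(v <= c <= v + omega for c in crash_events)
    )
    fp = len(vpin_events) - tp_preds
    precision = tp_preds / len(vpin_events) if vpin_events else 0.0
    recall = tp_crashes / len(crash_events) if crash_events else 0.0
    return precision, recall, fp, fn
```

With this convention, precision is computed over predictions and recall over crashes, which is what the precision+recall objective used below requires.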

Choosing the maximum value of ω
To make the deep search useful, we have computed the distribution of the time difference spanned by different amounts ω of buckets. Indeed, we want a temporal window that is reasonable for practitioners and still sufficiently wide that we can analyze which events VPIN can or cannot detect. We have chosen ω so as to obtain a stable, bounded distribution of the time difference between ω buckets of about 1 month. Below one can see the corresponding distribution for the S&P500 instrument; the distributions for the four other instruments studied look the same (Figure 3).
In Table 1 one can see the medians of the different distributions.
Remark: we first try to maximize the precision+recall rate. If the local maximum found is interesting for practice (at least 1.2) and more powerful than a "naive" algorithm, then it is worth making a more serious search of precision and recall rates separately to find a good trade-off between them (e.g., thanks to a ROC curve).

Best parameters found
In Tables 2-5 one can see the best parameters that maximize precision+recall for each financial instrument and bar price structure studied.

Remarks and first interpretation
We remark overall the following:
• The choice of bar structure does not really affect the optimal choice of the other parameters; nevertheless, the mean and median bar price structures have the best precision+recall rate on average.
• Recall rates are very close to 1.
• Since the ES, NQ, and YM precision rates are "low," their precision+recall rates are "low."
• Since the EC and CL precision rates are "high," their precision+recall rates are "high," recall being already "high."
• CL and EC had on May 6, 2010, a very low flash crash threshold, which greatly increases the number of crashes of the same magnitude detected in the data set.
• CL and EC obtain their maximum value at the minimum bound of the deep search (respectively, a 2.2% fall and a 0.6% fall). This is not the case for the other instruments (in the NQ case, the optimal precision+recall rate is constant from 0.8 to 0.9).
The results give two first findings:
• When the flash crash is significantly present for the instrument, i.e., of high magnitude and rare in the data set (ES, YM, and NQ cases), then recall is high, which means that VPIN makes a prediction before the crash happens, but precision is low: VPIN detects other events that are not flash crashes.
• When the flash crash is not significantly present for the instrument, i.e., of low magnitude and not rare (there are a lot of events of 10-20-minute length of same magnitude), then recall and precision are high.
This may suggest one of the following hypotheses:
• VPIN seems to be a poor indicator for flash crash prediction with the usual recommended threshold of 0.99.
• VPIN can be a better indicator of another type of event (crashes of less important amplitude).
We will compare the results of the same deep search with those of a naive classifier, to see whether or not the good prediction results in the CL and EC cases are relevant.

Benchmark with a "naive classifier"
We compared the VPIN prediction quality results with those of a "naive classifier," which randomly chooses at each bucket of the data set whether or not there will be a crash. In Table 6 one can see the results of the naive classifier for the first deep-search set of parameters.⁶ As it is a naive classifier, the results do not depend on the direction of prices (bar price classifier) or on the bar price structure. We remark the following:
• The "naive classifier" has poor results, comparable to those of VPIN, for the ES, NQ, and YM instruments; although poor, VPIN predictions are better than the "naive" algorithm's in the ES case.
• The "naive classifier" has better results than VPIN on the EC instrument.
• The "naive classifier" has worse results than VPIN on the CL instrument.
We can interpret this as follows:
• The EC flash crash definition is barely meaningful, with a MIR threshold of 0.6%; it is obvious that a naive algorithm gets better results, as the constraint for detecting a "flash crash" of such a magnitude is very weak.
• In the CL and ES cases, though, VPIN predictions are better, and these results are obtained when the θ_MIR threshold was at the lower bound of the deep search. This might indicate that the VPIN software has better predictive power than a "naive" algorithm not at "flash crash" amplitudes but at a lower amplitude level. Nevertheless, one may wonder whether or not this level of amplitude is useful for practitioners.
In any case, the previous results suggest that, for "flash crash" prediction with the traditional threshold θ_VPIN = 0.99, VPIN has overall poor predictive power, equivalent to that of a "naive" algorithm.
That is why, in the next paragraphs, we benchmark the predictive power of the "naive" and VPIN algorithms.

Table 6. Best parameters maximizing precision+recall rate for different futures for the naive classifier.

⁶ First tests conducted with the EC instrument were realized with an average to get more robust results. They are really close to the ones obtained here with a single realization of randomness.
Indeed, the first hypothesis is that there are too many false VPIN predictions, i.e., false-positive events, as the precision rate is too low while the recall rate is high. That is why one may hope that making the θ_VPIN constraint stricter will reduce the number of "useless" VPIN predictions while not reducing the recall rate too much.
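For reference, the "naive classifier" used as a benchmark can be sketched as a purely random predictor; the per-bucket event probability p_event and the seeding scheme are free parameters of this sketch:

```python
import random

def naive_classifier(num_buckets, p_event, seed=0):
    """Naive benchmark (sketch): at each bucket, independently declare a
    crash warning with probability p_event, ignoring market data entirely."""
    rng = random.Random(seed)  # seeded for reproducible benchmark runs
    return [j for j in range(num_buckets) if rng.random() < p_event]
```

Its precision and recall can then be computed with the same true-positive/false-positive definitions as for VPIN, which is what makes the comparison in Table 6 meaningful.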

Deep search allowing higher bounds for θ VPIN
In the following, we have allowed higher bounds for θ_VPIN, from 0.99 to 0.99999. All other parameters of the deep search are the same. One can see the results in Tables 7-10. The results for the naive algorithm are of course the same.
We remark the following:
• The precision rate has increased for each bar price structure for the ES instrument, while the recall rate remains the same as in the θ_VPIN = 0.99 case.
• The precision+recall rate has increased for the YM instrument only with a last or first bar price structure, but recall decreased a bit compared to the θ_VPIN = 0.99 case.

Table 9. Best parameters maximizing precision+recall rate for different futures and median bar price structure, allowing higher bounds for θ_VPIN.
• Compared to the "naive" algorithm, VPIN results are effectively better in the ES case. In the YM case, we still find comparable results.
• On average, mean and median bar price structures have the best precision +recall rate.
To verify whether or not we can get at least better results than a naive algorithm in data sets with a real flash crash, we study in the following, first, the results allowing lower bounds on θ_MIR while θ_VPIN = 0.99 and, second, the results allowing lower bounds on θ_MIR and higher constraints on θ_VPIN. Indeed, the intuition is that in the NQ case, the "flash crash" amplitude constraint is far too high to obtain a good precision rate, because there are then too few events detected with the MIR algorithm.

Deep search allowing lower bounds for θ MIR
We remark the following in Tables 11-14:
• The results have changed for every instrument except ES, which has kept the same local maximum as in the first deep search.
• Precision is far higher than before, while recall is still high. Therefore, overall precision+recall rates are "high."
• The optimal θ_MIR is around 0.015 for the ES, CL, NQ, and YM financial instruments, whereas for EC the previous local maximum around 0.006 remains higher.
• On average, the median bar price structure has the best precision+recall rate.

Table 11. Best parameters maximizing precision+recall rate for different futures and last bar price structure, allowing lower bounds for θ_MIR.
In the following, we will first compare the results to the case where we allow higher bounds on θ_VPIN, to see if there is a difference. Second, we will benchmark both results against those of a "naive" classifier.

Table 15. Best parameters maximizing precision+recall rate for different futures and last bar price structure, allowing lower bounds for θ_MIR and higher bounds for θ_VPIN.

We remark in Tables 15-18 the following:
• There are changes only for the NQ and YM instruments (in, respectively, the last, median, and mean bar price structures and the first bar price structure), where θ_VPIN equals 0.999.
• There is no general trend for precision or recall rates with the increase of θ VPIN .
• On average median bar price structure has the best precision+recall rate.

Benchmark with a "naive" classifier
We remark the following for the "naive" classifier (Table 19):
• It has worse results than VPIN in the ES and YM cases.
• It has comparable results to VPIN in the NQ case.
• It has better results than VPIN in the EC and CL cases, where the flash crash is not really present.
• It obviously reaches its best local results at the lowest MIR bound of the deep search.
We may partially conclude that:
• VPIN has an interesting predictive behavior on flash events of magnitude far lower (around 1.5%) than what would be considered a crash for specific, relatively liquid financial instruments such as NQ, YM, or ES.
• But VPIN has poor results, comparable to those of a "naive" classifier (precision+recall rate below 1.2), on flash crash events for these financial instruments.
• For other instruments such as CL or EC, VPIN behaves worse than a naive classifier on these flash events. On flash events of higher amplitude (at least 1.5%), VPIN behaves better than a "naive" classifier for the CL instrument.

VPIN sensitivity to the starting point of a data set
In this section, first we present the problem of VPIN's sensitivity to the starting point of the bucketing process. Second, we present different calibrations to test its sensitivity. Third we make a summary of our results.

The problem
Among the criticisms VPIN received, one is important to assess precisely. Andersen and Bondarenko [7] pointed out in their work that VPIN is sensitive to the starting point of the bucketing process. More precisely, if one removes the first buckets of the data set, the results change. This observation is indeed correct. We would like to know to what extent one can or cannot mitigate this effect. One idea is to test the different bar price structures. Indeed, a bar structure influences the trade imbalance and thus influences the appearance of VPIN events.

Methodology
There are at least two interesting ways of analyzing the sensitivity to the starting point of a data set: • Study the sensitivity of best precision+recall rate to the number of trades erased and to the bar price option.
• Given one set of local optimal parameters, study the sensitivity of precision and recall rates to bar price option and data removed.
We have removed l ∈ {0, 1000, 2000, 3000} bars to study the sensitivity in the two previous cases, which corresponds to several hours of trading data removed. Indeed, one does not want to erase the first flash crash detected in the data set, nor erase more buckets than the average prediction length needed to detect it. Moreover, we would like to study to what extent VPIN is locally sensitive.
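The removal of the first l bars' worth of data can be sketched as follows (the (time, price, volume) trade-tuple format is an illustrative assumption); bars, buckets, and VPIN are then recomputed on the shifted series and compared across the values of l:

```python
def drop_leading_bars(trades, bar_volume, l):
    """Skip trades from the start of the data set until l bars' worth of
    volume has been consumed, returning the shifted trade series."""
    target = l * bar_volume
    consumed, k = 0, 0
    while k < len(trades) and consumed < target:
        consumed += trades[k][2]   # accumulate trade volume
        k += 1
    return trades[k:]
```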

Sensitivity of precision+recall rate
We summarize in Table 20, for each bar price structure, the average percentage change of the new local best precision+recall rate with the number of bars erased.
We remark the following:
• The sensitivity mentioned by Andersen and Bondarenko does exist.
• Its amplitude is not very large, at least for the best precision+recall rate, as the maximum change is about 6%.
• The median bar price structure is far less sensitive than the other price structures.

Sensitivity to local best parameter choice
In Table 21 we summarize, for each bar price structure, the average percentage change of the initial local best precision+recall rate with the number of bars erased.
We remark the following:
• Again, the amplitude of the sensitivity is not very large, as the maximum change is about 3.5%.
• The last bar price structure is less sensitive than the other price structures to this phenomenon.

Conclusion
In this last section, we first present a general summary of our findings. Then we propose suggestions for further research on this precise subject.

Summary of results
We found that:
• VPIN has interesting predictive power (i.e., better than a naive algorithm, with a local precision+recall maximum of at least 1.2) for flash events of lower amplitude than flash crashes (about 1.5%) for a certain class of instruments, where flash crashes are at least present (which is not the case for the Euro FX currency or Light Crude NYMEX energy futures).
• VPIN is sensitive to the starting point of computation, but the amplitude of this sensitivity is not really high. For practice, which means not changing the local best parameters while erasing some data, the last bar price structure is the least sensitive to this phenomenon.

Suggestion for further studies
For further studies, the following might be worth analyzing:
• Define a stronger constraint to capture crashes, taking into account, for example, their V-shape. It would indeed filter out more events and enable analyzing more accurately which kinds of crashes VPIN predicts best.
• Benchmark other predictive tools against each other within this framework (VIX against a naive algorithm, against VPIN, etc.).
• Analyze VPIN time-clock version predictive power.
• If previous predictive power of lower amplitude flash events is interesting for practitioners, analyze more precisely parameters that would be interesting for them.
• Describe more precisely to which class of financial instrument VPIN predictive power is most effective (if such one is worth being more studied for practitioners).
• Define a normalization of the events defining crashes within a whole cluster of instruments. This is not easy to put in place, as instruments are more or less correlated through crashes and response times are not trivial to analyze, but it would indeed be interesting to assess prediction quality on common events shared by different instruments of a same cluster. It would make it possible to see whether or not VPIN predictive power is effective across different financial instruments embedding different aspects of the financial world to which VPIN is sensitive.
• This area of research studies a very particular class of events: those that are potentially very rare. Taking into account this setting, and that the algorithms used are fed with previous information and are sensitive to the starting point of computation, is it possible to build a consistent cross-validation approach? This aspect has not been treated yet, as others needed to be addressed first, but it remains important to study.

Author details
Antoine