# A Trader’s Crash Course to Statistics

This morning, a very true realisation hit me: all the theory I put forth (in Parts 1, 2, 3 and 4) will not help anybody other than the most quant-savvy people around. And even they might already know it. Having thought all that, I seriously want to cut myself some slack, because all this is going to set the stage for something truly grand, that much I can assure you. We will (in all likelihood) be developing some seriously quantitative *shit* to trade upon.

Alright, now that all this has apparently caught your eye, I decided to put up a small, swift, almost kung-fu swift blog post, introducing you to the various *weapons of choice* of a system trader.

This post will also set the stage for the next blog post, on SYSTEM TRADING.

**Mean, Median, Mode**

The mean is another word for average (those familiar with the Quantitative Finance series will know what I am talking about).

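You have some sequence,

$$x_1, x_2, \dots, x_n$$

The mean of this sequence is going to be:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The median of this sequence is the value that breaks the sorted sequence into exactly two sequences of equal length, whether or not that value is itself present in the sequence. So, with the terms indexed in sorted order, it will be

$$\text{median} = \frac{1}{2}\left( x_{\lfloor (n+1)/2 \rfloor} + x_{\lfloor (n+2)/2 \rfloor} \right)$$

Readers note that $\lfloor \cdot \rfloor$ is called the greatest integer function. For odd $n$ both indices coincide and the median is the middle term; for even $n$ it is the average of the two middle terms.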

The median often gives a better sense of the typical value of a sequence than the mean. The mean is extraordinarily prone to tail effects, that is, to the presence of observations that are extraordinarily high or low compared to all the others. This usually happens because the observations follow a distribution with excess kurtosis (i.e. fatter tails than the normal distribution; refer to Part 2).

To calculate the median, you first have to sort the sequence.

Mode is the number in a sequence which has the highest frequency. A sequence can have zero, one or more modes.
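To make the tail effect concrete, here is a minimal sketch using Python's standard `statistics` module. The trade values are made up purely for illustration; they do not come from any real system.

```python
import statistics

# Hypothetical trade results (illustrative numbers only).
trades = [120, 150, 150, 180, 200]
# The same trades plus one extreme observation in the right tail.
trades_with_outlier = trades + [5000]

mean_plain = statistics.mean(trades)
median_plain = statistics.median(trades)      # sorts internally
mean_outlier = statistics.mean(trades_with_outlier)
median_outlier = statistics.median(trades_with_outlier)

# multimode returns every value that shares the highest frequency.
modes = statistics.multimode(trades)
two_modes = statistics.multimode([1, 1, 2, 2])

print("mean:", mean_plain, "->", round(mean_outlier, 2))
print("median:", median_plain, "->", median_outlier)
print("modes:", modes, two_modes)
```

A single outlier drags the mean from 160 to roughly 967, while the median only moves from 150 to 165.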

**Variance, Standard Deviation**

Standard deviation is discussed in Part 3, so we will discuss variance here.

Variance is nothing but the square of the standard deviation.

Effectively, it tells you how “fat” the distribution is. Mind you, this is not fat as in the fat-tail effect, where it is the tails that are disproportionately wide. Here, “fat” means how thick the stem of the distribution is.

Have a look at the picture. Both curves are normal distributions centred on the same mean: one has a variance of 1 and the other (Fig 2) has a standard deviation of 3.
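Since the picture may not be in front of you, the comparison can be sketched numerically with `statistics.NormalDist`. A mean of zero is assumed here only for convenience; the text just says the two curves share the same mean.

```python
from statistics import NormalDist

# Two normal distributions with the same (assumed) mean of 0.
narrow = NormalDist(mu=0, sigma=1)   # variance 1
wide = NormalDist(mu=0, sigma=3)     # standard deviation 3

# Variance is the square of the standard deviation.
narrow_var = narrow.variance
wide_var = wide.variance

# Height of each curve at the mean: the wider curve is flatter.
narrow_peak = narrow.pdf(0)
wide_peak = wide.pdf(0)

print("variances:", narrow_var, wide_var)
print("peak heights:", round(narrow_peak, 4), round(wide_peak, 4))
```

The second distribution has a variance of 9, and its peak is one third as high as that of the first.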

**Hypothesis Testing**

Before I explain what exactly hypothesis testing is, we need to realise one thing about the normal distribution. The observations of a perfectly normal random variable approach the normal distribution only when the population is *large*.

Now, when we say mathematically that it is large, we actually mean large enough to be practically unrealisable. So how do we actually work with it?

And herein lies the problem. When we work with random variables (r.v.) in real life, it is usually with small sets of observations. So this *can* and *does* cause problems regarding distributions.

Hence the need for hypothesis testing. Hypothesis testing is the method of determining whether a set of observations matches our assumption or not. This assumption can be that a.) the observed r.v. matches a particular distribution, or b.) the observed r.v. has certain statistical characteristics.

There are many forms of hypothesis testing, each catering to one particular field.

We will talk about the most popular, common and useful of these tests.

*Z-Test*

Suppose we have a trading system with 35 trades, the average trade being worth 200 bucks. We want to prove statistically that the average trade is greater than 0. (Why? Because we are not sure whether these observations are representative: what if we unwittingly picked this set from the right-hand tail of a normal distribution whose average trade is zero?)

Hence we assume that the trades are drawn from a normal distribution, **and we want to test whether the average trade size is positive**.

So, we form the Z-test:

$$Z = \frac{M - D}{\sqrt{V / N}}$$

where,

M = mean of the sample (here 200)

D = the hypothesised mean we are testing against (here zero)

V = variance of the sample (let's say standard deviation = 50, or variance = 2500)

N = sample size (35)

This gives Z = 23.66. Checking this against the standard normal distribution tables, we can say for certain (>99% confidence) that the average trade size is positive.

Let us change the test so that D = 180, which yields Z = 2.366. Checking the table for Z = 2.36 (look for 2.3 vertically downwards, then adjust for 0.06 by moving horizontally right along the table), we get a probability of 99%. That is, there is only a 1% chance that this sample set was picked from the extreme right-hand tail of a distribution with an average trade size of 180.

Bottom line: you can be pretty confident that *over the long run* your average trade size will be bigger than 180.
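If you would rather not hunt through a table, the two Z values above can be reproduced in a few lines. This is only a sketch: the function name `z_statistic` is my own, and the table lookup is replaced by the normal CDF from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, hypothesised_mean, variance, n):
    """Z = (M - D) / sqrt(V / N), using the symbols from the text."""
    return (sample_mean - hypothesised_mean) / sqrt(variance / n)

# Numbers from the example: M = 200, V = 2500, N = 35.
z_zero = z_statistic(200, 0, 2500, 35)    # testing against D = 0
z_180 = z_statistic(200, 180, 2500, 35)   # testing against D = 180

# One-sided tail probability: chance of a Z at least this large
# if the true mean really were D.
p_zero = 1 - NormalDist().cdf(z_zero)
p_180 = 1 - NormalDist().cdf(z_180)

print(f"D=0:   Z={z_zero:.2f}")
print(f"D=180: Z={z_180:.3f}, tail probability={p_180:.4f}")
```

The tail probability for D = 180 comes out a little under 1%, in line with the table lookup.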

**Chi Squared Test**

Till now we have worked with continuous variables, like average trade size.

But suppose we have found a certain pattern in our markets, and found that on that particular setup, out of 200 instances, the markets rose 134 times. We want to check whether this is a fluke (i.e. a random occurrence; even tossing a coin 100 times can yield a slightly skewed number of heads and tails) or significant (something which we can exploit and profit from).

So the test statistic is

$$\chi^2 = \frac{(O - E)^2}{E}$$

$O$: Observed Frequency {134 times out of 200}

$E$: Expected Frequency {100 times out of 200, for it to be random}

Hence we get a test statistic of 11.56, which we now check against the chi-squared table. Here the market could have gone only one of two ways at any instant, i.e. either up or down. Hence there is only one degree of freedom. If there are two simultaneous variables deciding the outcome at any instant, then the degrees of freedom will be 2.

So check the statistic against 1 degree of freedom. There is less than a 0.1% chance that this phenomenon was a fluke, since 11.56 > 10.83.

Hence it could be exploitable!
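Here is a small sketch of the same calculation without the table. It relies on one fact that is specific to 1 degree of freedom: the upper-tail probability of a chi-squared value x equals `erfc(sqrt(x / 2))`. The function names are mine. Note also that the textbook goodness-of-fit statistic sums the same term over both outcomes (rose and fell), which gives 23.12 here; the conclusion does not change.

```python
from math import erfc, sqrt

def chi_squared_term(observed, expected):
    """One (O - E)^2 / E term, as used in the text."""
    return (observed - expected) ** 2 / expected

def p_value_one_dof(chi2):
    """Upper-tail probability for chi-squared with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# The setup from the text: market rose 134 times out of 200.
rose, total = 134, 200
expected = total / 2   # 100 rises if the setup were a coin toss

stat_text = chi_squared_term(rose, expected)
# Summing over both outcomes (rose and fell).
stat_full = stat_text + chi_squared_term(total - rose, expected)

p_text = p_value_one_dof(stat_text)
p_full = p_value_one_dof(stat_full)

print(f"single term: {stat_text:.2f}, p={p_text:.5f}")
print(f"both terms:  {stat_full:.2f}, p={p_full:.2e}")
```

Either way the probability of a fluke is below 0.1%.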

**t-test**

The previous test gave us the probability that a given distribution is similar to the expected one.

In this test we want to find out whether two observed distributions have statistically similar characteristics, namely mean and standard deviation.

The formula is as follows:

$$t = \frac{\bar{x}}{s / \sqrt{n}}$$

$\bar{x}$: average trade size

$s$: standard deviation

$n$: number of trades

So this is another test which tells us whether we may have uncovered something real, or just a purely random event (the Z-test above addresses the same question).

So putting the average trade size as 200, the deviation as 50 and *n* as 35, we get t = 23.66.

Anything above 1.6 is great! Had the deviation been around 800, we would have faced some problems: a t-statistic of about 1.48. Almost significant, but not quite there. You *can* think of moving ahead.
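As a quick check on the arithmetic, here is a sketch of the statistic as a function. I am assuming the one-sample form, average divided by (standard deviation over the square root of n), tested against a mean of zero; the name `t_statistic` is mine.

```python
from math import sqrt

def t_statistic(average, std_dev, n):
    """t = average / (std_dev / sqrt(n)), tested against a mean of zero."""
    return average / (std_dev / sqrt(n))

# Average trade 200 over 35 trades, with two different deviations.
t_tight = t_statistic(200, 50, 35)    # deviation of 50
t_loose = t_statistic(200, 800, 35)   # deviation of 800

print(f"deviation 50:  t={t_tight:.2f}")
print(f"deviation 800: t={t_loose:.2f}")
```

With a deviation of 50 this formula evaluates to about 23.66, and with a deviation of 800 it drops to about 1.48, just under the 1.6 rule of thumb.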

So here it is, my dear readers, a brief crash course in statistics. Hope you enjoyed it.

Further reading, for the interested:

Wikipedia entries on Chi Squared and the t-test.

How to choose a statistical test

Great article, esp. the parts about how to find whether the observations were part of a distribution, and whether the expected frequency is random or not.

Also one Q: quote “market rose 134 out of 200, might not be a fluke”, but what gives us the confidence that this might continue? Any measures for this?

@Narasimha

Thanks.

The only measure of continuity is robustness. Go back and test a long enough history to really have confidence that your setup is robust under varying market conditions. Because the more things change, the more they remain the same.

Finding confidence measures for continuity would certainly be novel and, to my limited knowledge, none exists as such.

Soham,

I have a further doubt about the statements in the Chi-square test. Yes, the observed frequency of 134 is greater than the expected frequency of 100. But what if the process itself is quite random? As you said, a random coin toss can itself lead to a +ve number of heads or tails. But this may not make it exploitable, right?

Consider the coin toss experiment itself. I know that at the end there could be a +ve number of heads (say, a bull market) or a +ve number of tails (say, a bear market). That information does not help me, right? Because when I invest it might have been raining heads, but suddenly it might start raining tails.

I know the market is not as simple as a coin toss; there are people involved, and people's expectations lead to a bias that can probably be exploited. But if you consider a truly random process like a coin toss, I don't think it can be exploited? Your views?

There are two subquestions in your question.

I will address them one by one. A discrete binary random process can indeed generate a skew when the number of trials is small. And that is exactly what we want to check: whether a 134/200 success rate is significant enough to merit attention, or should be ignored because it shows only a slight skew. (In a way, we are trying to find out how serious the skew is. The larger it is, the lower the probability that it is just a random process.)

So our Chi Squared test tells us there is less than a 0.1% chance of it being the result of a random process.

The second subquestion: what you are asking about with “raining heads” and “raining tails” is simply the concern about, and calculation of, streaks, winning or losing. You will have streaks everywhere. So you can’t do anything other than have effective money management.

You can estimate the maximum probable streak length using advanced mathematical simulations like Monte Carlo, but at the end of the day you have to address it through money management.
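For the curious, a bare-bones Monte Carlo sketch of the streak idea might look like the following. Everything in it is an assumption made for illustration: the win rate is taken from the 134/200 example, trades are treated as independent coin flips, and the function names, run count and seed are arbitrary.

```python
import random

def longest_losing_streak(outcomes):
    """Length of the longest run of losing (False) trades."""
    longest = current = 0
    for won in outcomes:
        current = 0 if won else current + 1
        longest = max(longest, current)
    return longest

def simulate_streaks(win_rate, n_trades, n_runs, seed=0):
    """Simulate n_runs trade sequences; return sorted longest losing streaks."""
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    results = []
    for _ in range(n_runs):
        outcomes = [rng.random() < win_rate for _ in range(n_trades)]
        results.append(longest_losing_streak(outcomes))
    return sorted(results)

streaks = simulate_streaks(win_rate=134 / 200, n_trades=200, n_runs=2000)
median_streak = streaks[len(streaks) // 2]
p95_streak = streaks[int(0.95 * len(streaks))]

print("typical longest losing streak:", median_streak)
print("95th percentile losing streak:", p95_streak)
```

The 95th percentile figure is the kind of number you would size your positions against, which is where money management takes over.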

Good article in simple words and I feel it is essential for all market participants to have knowledge of statistical concepts and thinking.

Recently I have been reading Dr Brett’s “The Daily Trading Coach”, where he uses simple statistics and Excel to find a trading edge. Your lessons very much help developing traders understand market data, and help experienced traders refresh their understanding.

*bows down and tips the hat*

Very well written article. I am embarrassed to say that I need to re-read it a couple of times slowly to really get the meaning.

Anyway, as I recently got a job at an investment bank, would you mind recommending a couple of good statistics books? Thanks!