Sunday, August 1, 2010

Genetic Program Generated Strategy

I've been experimenting with Genetic Programming for some time now and we're starting to see some interesting results. I recently tried some USD/EUR currency data with 350 days training and 150 days validation, one of the resulting strategies was:

COS(ADD(Low, Close))

In simple words, add the low and the close then take the cosine of the result. A value greater than 0 is a Buy signal, a value less than 0 is a Sell signal, and a value equal to 0 is a Hold signal.

Over the 150 validation days this strategy made 13.7% and projected over a year it yielded around 25% profit. In the last year the USD/EUR has gone up about 9.2% and using that simple strategy is projected to produce 2.5 times more gains than the buy and hold.

Saturday, June 26, 2010

Function Memoization Performance

For the last month I've been very busy with wrapping up a couple of projects for my Master's Degree, but now I'm done and I have some to start posting again. In my last post I demonstrated how to memoize a function and in this post I will discuss the performance of memoization.

Memoization offers significant performance benefits when the parameters of the function are relatively constant, the memoized function can retrieve the return value almost as fast as returning a constant. When the input values are diverse then the performance can be dramatically decreased.

There were two tests conducted to determine the performance of memoization. In the first test a function was memoized and was called multiple times without changing the input parameters. In the second test, the same function was memoized but was called with a randomly varying parameter (diverse). All the measurements are in CPU ticks to eliminate interference from any other processes running on the system at the given time. Below is the sample code for the tests where the input parameters were varied:

static Func<List<int>, int, int> twoParam = new Func<List<int>, int, int>(
(v, p) =>
{
return TwoParamFunc(v, p);
}).Memoize();

static int TwoParamFunc(List<int> vals, int period)
{
int sum = 0;
foreach (int val in vals)
{
sum += ((int)Math.Sqrt((double)val) * period);
}
return sum;
}

static void PerformanceTest()
{
Random rng = new Random(DateTime.Now.Millisecond);
int numRuns = 1000;
Stopwatch sw = new Stopwatch();
List<List<int>> valArrs = new List<List<int>>();
for (int i = 1; i < 10; ++i)
{
valArrs.Add(CreateList(10 * i, 1000 * i));
}

// two param memoized
double sumTicks = 0.0;
long count = 0;
for (int i = 0; i < numRuns; ++i)
{
foreach (List<int> vals in valArrs)
{
sw.Start();
twoParam(vals, rng.Next(1, numRuns) );
sw.Stop();
sumTicks += sw.ElapsedTicks;
sw.Reset();
count++;
}
}
Console.WriteLine("Avg ticks memoized:\t" + sumTicks / count);

// two param non-memoized
sumTicks = 0;
count = 0;
for (int i = 0; i < numRuns; ++i)
{
foreach (List<int> vals in valArrs)
{
sw.Start();
TwoParamFunc(vals, rng.Next(1, numRuns));
sw.Stop();
sumTicks += sw.ElapsedTicks;
sw.Reset();
count++;
}
}
Console.WriteLine("Avg ticks non-memoized:\t" + sumTicks / count);
}
A memoized function with uniform input parameters takes an average of 2.29 CPU ticks to execute, while a memoized function with a varying input can be over 40 times slower. If the input is diverse enough, then the memoized function will exceed the execution time of the direct call.



The benefit of the direct call is that the amount of CPU ticks it takes to execute will never change unless the function itself changes, but the drawback is that no matter what happens the performance cost will always be incurred. With memoization the performance cost can be significantly reduced, but it requires the user to analyze the input parameters and ensure that they’re uniform.

One more thing to note is that every time a memoized function is called with new input values, those values and the output will be cached in the memoization map. The map will grow indefinitely and the memory consumption is going to grow indefinitely... in order to reduce the memory consumption the map size must be limited. I came up with a simple solution where I keep track of the how frequently each value in the map is accessed and I remove those values which are used least frequently.


private static void RemoveLeastUsed<A,R>(Dictionary<A, Pair<R, int>> map)
{
List<A> orderedKeys = (from key in map.Keys
orderby map[key].AccessCount ascending
select key).Distinct().ToList<A>();

// Ignore the most frequently used keys
orderedKeys.RemoveRange(orderedKeys.Count/2, orderedKeys.Count/2);

// Remove teh least frequently used keys from the map
foreach (A key in orderedKeys)
{
map.Remove(key);
}

// Reset the access count
foreach(Pair<R, int> p in map.Values)
{
p.AccessCount = 0;
}
}

public static Func<A, R> Memoize<A, R>(this Func<A, R> function)
{
var map = new Dictionary<A, Pair<R, int>>();
return a =>
{
R r;
Pair<R, int> pair;
lock (sync)
{
if (!map.TryGetValue(a, out pair))
{
r = function(a);
map.Add(a, new Pair<R, int>(r, 0));
}
else
{
r = pair.Result;
pair.AccessCount++;
}

if (map.Count > Properties.Settings.Default.MemoizeMapSize)
{
RemoveLeastUsed<A,R>(map);
}
}
return r;
};
}

Tuesday, May 18, 2010

Function Memoization and Genetic Programming

In a GP nearly all of the computation is in the evaluation of the expression trees and any performance bumps in the area are always welcome. We can do a lot of optimization in the code, but we can only optimize to a certain point and after that we just have to take the performance hit for the function execution. Function memoization (not memorization) can be very helpful in this case. Wikipedia defines memoization as:
"an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously-processed inputs."

Genetic Programming is a particularly good domain for memoization, because you generally have a lot of specimens executing the same functions by providing similar or identical input parameters and the expected output only varies within the range of those parameters.

We've all seen the memoization examples with the Fibonacci sequence, but our world consists of more than just one parameter functions: we have two, three, four and more parameter functions, so we need a solution that provides memoization for a function with arbitrary number of parameters. Perhaps arbitrary number of parameters is an overstatement, but we should be able to at least get the same number of parameters as the C# Func delegate (up to four parameters). Of course if you want to memoize functions with more parameters, then you have to make your own Func-style delegate that supports more parameters.

First we'll start with the standard one parameter and one return value memoization function and we'll build on top of that:

public static Func<A, R> Memoize<A, R>(this Func<A, R> function)
{
var map = new Dictionary<A, R>();
return a =>
{
R r;
if (!map.TryGetValue(a, out r))
{
r = function(a);
map.Add(a, r);
}
return r;
};
}
As you can see there is nothing new there, but in order for two, three and four parameter functions to be memoizated we'll have to use anonymous types, lambdas and some tricky yet very innocent-looking programming :
private static Func<T, R> CastByExample<T, R>(Func<T, R> f, T example)
{
return f;
}

// Two parameter memoize function
public static Func<A, B, R> Memoize<A, B, R>(this Func<A, B, R> function)
{
var example = new { A = default(A), B = default(B) };
var tuplified = CastByExample(t => function(t.A, t.B), example);
var memoized = tuplified.Memoize();
return (a, b) => memoized(new { A = a, B = b });
}
When I first saw that I though it was a bunch of nonsense, but there is a method to the madness!

You can read up on the CastByExample function here and tuples + anonymous types here, but I'll tell you what it specifically achieves for us: it creates a tuple out of the input parameters, so since all the parameters are now a tuple we can pass the tuple into the one-param memoize function and the entire tuple will be treated as the key in the dictionary (i.e. all the parameters are automatically a key). In .NET 4.0 there is actually a Tuple class, but since I'm using .NET 3.5 I had to make do with anonymous types.

Here are the three and four parameter memoize functions:
// Three parameter memoize function
public static Func<A, B, C, R> Memoize<A, B, C, R>(this Func<A, B, C, R> function)
{
var example = new { A = default(A), B = default(B), C = default(C) };
var tuplified = CastByExample(t => function(t.A, t.B, t.C), example);
var memoized = tuplified.Memoize();
return (a, b, c) => memoized(new { A = a, B = b, C = c });
}

// Four parameter memoize function
public static Func<A, B, C, D, R> Memoize<A, B, C, D, R>(this Func<A, B, C, D, R> function)
{
var example = new { A = default(A), B = default(B), C = default(C), D = default(D) };
var tuplified = CastByExample(t => function(t.A, t.B, t.C, t.D), example);
var memoized = tuplified.Memoize();
return (a, b, c, d) => memoized(new { A = a, B = b, C = c, D = d });
}
That exhausts all the overloads of Func, so I'll show you a quick preview of how you can use these functions:

static Func<List<int>, int, int> twoParam = new Func<List<int>, int, int>(
(v, p) => { return TwoParamFunc(v, p); });

static int TwoParamFunc(List<int> vals, int period)
{
int sum = 0;
foreach (int val in vals)
{
sum += ((int)Math.Sqrt((double)val)*period);
}
return sum;
}

static List<int> CreateList(int start, int end)
{
List<int> vals = new List<int>();
for (int i = start; i < end; ++i)
{
vals.Add(i);
}
return vals;
}

static void Main(string[] args)
{
if (TwoParamFunc(CreateList(0, 1000), 10) == twoParam(CreateList(0, 1000), 10))
{
Console.WriteLine("The return values are correct!");
}
Console.ReadKey();
}
In the next blog I'll show you some of the tests and the performance analysis... I hope you enjoyed this.

References:

Monday, May 3, 2010

Genetic Programming: C# Reflection Performance and Improvements on Methods and Fields

I've already done three posts on reflection and the performance of the various method invocation approaches, so if we've learned anything from those posts it's that when it comes to Genetic Programming we can't use methods to return constants (like it wasn't obvious). If we want to have GP's that can keep up with the real-time market and are flexible enough to modify on the fly, then we should look for optimal performance.

I'd say that about 99% of all the computation occurring in a GP is in the execution of individual specimens (i.e. evaluating their expression trees). If we add functions and constants that are going to be evaluated millions of times, then these functions and constants should be REALLY fast! Using reflection to discover the fields, properties and methods that can be used by the GP does give us the ability to change the supported functions on-the-fly, but that ability comes at a price.

Using reflection to access the values of a field or a property is done in the same amount of ticks as accessing the field or property directly from the object:



The code that was used to get the above statistics is this:

static void VariablesAndPropertiesTest()
{
int numRuns = 100 * 1000;
var val = 0.0;
Functions fn = new Functions();
FieldInfo numTwo = fn.GetType().GetField("constNumTwo");
PropertyInfo numOne = fn.GetType().GetProperty("ConstNumOne");

Console.WriteLine("Reflection field");
_stopwatch.Start();
for (int i = 0; i < numRuns; ++i)
{
val = (double)numTwo.GetValue(fn);
}
_stopwatch.Stop();
Console.WriteLine("Avg Ticks \t " + _stopwatch.ElapsedTicks / numRuns);

Console.WriteLine("Direct field");
_stopwatch.Start();
for (int i = 0; i < numRuns; ++i)
{
val = fn.constNumTwo;
}
_stopwatch.Stop();
Console.WriteLine("Avg Ticks \t " + _stopwatch.ElapsedTicks / numRuns);

Console.WriteLine("Reflection property");
_stopwatch.Start();
for (int i = 0; i < numRuns; ++i)
{
val = (double)numOne.GetValue(fn, null);
}
_stopwatch.Stop();
Console.WriteLine("Avg Ticks \t " + _stopwatch.ElapsedTicks / numRuns);

Console.WriteLine("Direct property");
_stopwatch.Start();
for (int i = 0; i < numRuns; ++i)
{
val = fn.ConstNumOne;
}
_stopwatch.Stop();
Console.WriteLine("Avg Ticks \t " + _stopwatch.ElapsedTicks / numRuns);

Console.WriteLine("Direct method");
_stopwatch.Start();
for (int i = 0; i < numRuns; ++i)
{
val = fn.GPConstNumZero();
}
_stopwatch.Stop();
Console.WriteLine("Avg Ticks \t " + _stopwatch.ElapsedTicks / numRuns);
}

Sunday, May 2, 2010

Algorithmic Trading and Legos

I am a member of a High Frequency Trading group on linkedin and I found a post that was really interesting:
Statistical Modeling is like building with LEGOS

LEGOS:
1. Square red pieces can be stacked to build towers.
2. Long skinny pieces can be used as roads.
3. Long skinny pieces on top of square red towers can build bridges.
4. Curved skinny pieces can make turning roads or bridges.
5. Tubes, motors, wheels, etc. can make pretty much anything you can imagine.

STATISTICS:
1. Univariate statistics can be used to clean errors out of your data.
2. Correlations of past data with current data can predict future events.
3. Linear Regression can combine many correlations to predict even better.
4. Logistic Regression can predict curved relationships.
5. Simultaneous equations, Non-Linear Regression, etc. can be applied to improve pretty much any business process you can imagine.

THOUGHT PROVOKING QUESTION
Which of the following best describes how is your company is organized?

A. We hire really creative builders who have access to all the pieces?
B. We hire creative people who have only ever seen the red square pieces to tell the builders what to build.
C. We hire builders who have a big sack of Legos with lots of different pieces, but they’re not too interested in building anything they haven’t seen before.
D. We use robots to automatically build “stuff” that we use as bridges. The bridges usually work, but sometimes we fall through.
E. We don’t have any Legos.
Here is the original post: http://www.linkedin.com/groupAnswers?viewQuestionAndAnswers=&gid=86999&discussionID=16118824&sik=1272838667655&trk=ug_qa_q&goback=.ana_86999_1272838667655_3_1

I'm working on filling my bag with Legos!

Saturday, May 1, 2010

Genetic Programming: C# Reflection Performance and Improvements with Delegate

I think this is going to be the final installment in the Reflection Performance "Trilogy". I found several blogs, articles and discussions that show ways to get better performance from reflection by utilizing delegates. Hooking up a delegate using reflection provides a very noticeable increase in performance.

Here are the results:


Here are the resources I used:
Jon Skeet's article is pretty good: http://msmvps.com/blogs/jon_skeet/archive/2008/08/09/making-reflection-fly-and-exploring-delegates.aspx
Joel Pobar's article on reflection: http://msdn.microsoft.com/en-us/magazine/cc163759.aspx

Here is the code:

class Program
{
static Dictionary <int, List <Delegate>> _delegate =
new Dictionary <int, List <Delegate>>();

static Stopwatch _stopwatch = new Stopwatch();

static void Main(string[] args)
{
Console.WriteLine("Frequency = " + Stopwatch.Frequency);
DyanmicInvocationTest();
Console.WriteLine("Press any key to exit!");
Console.ReadKey();
}

static void DyanmicInvocationTest()
{
Functions fn = new Functions();
Type typeFn = fn.GetType();
MethodInfo[] methods = typeFn.GetMethods();
foreach (MethodInfo method in methods)
{
if (method.Name.StartsWith("GP"))
{
ParameterInfo[] pi = method.GetParameters();

if (!_delegate.ContainsKey(pi.Length))
{
_delegate.Add(pi.Length, new List());
}

switch(pi.Length)
{
case 0:
_delegate[pi.Length].Add(
Delegate.CreateDelegate(
typeof(Functions.ZeroParam), fn, method));
break;
case 1:
_delegate[pi.Length].Add(
Delegate.CreateDelegate(
typeof(Functions.OneParam), fn, method));
break;
default:
break;
}
}
}
int numRuns = 100 * 1000;
long time = 0;

// Delegate
Console.WriteLine("*** Running dynamic delegate test ***");
for (int i = 0; i < numRuns; ++i)
{
time += DynamicInvokeDelegate(fn);
_stopwatch.Reset();
}

Console.WriteLine("Avg dynamic delegate invoke = " + (double)time / (double)numRuns);
time = 0;

Console.WriteLine("Average time normal = " + (double)time / (double)numRuns);
}

static Int64 DynamicInvokeDelegate(Functions fn)
{
Type typeFn = fn.GetType();
object[] zeroParam = new object[0];
object[] oneParam = new object[1] { 1.0 };

_stopwatch.Start();
foreach (int key in _delegate.Keys)
{
//Console.WriteLine(
// String.Format("num param {0}, num fn {1}",
// key, _functions[key].Count));
foreach (Delegate method in _delegate[key])
{
switch (key)
{
case 0:
method.DynamicInvoke(zeroParam);
break;
case 1:
method.DynamicInvoke(oneParam);
break;
default:
break;
}
}
}

_stopwatch.Stop();
return _stopwatch.ElapsedTicks;
}
}

Genetic Programming: C# Reflection Performance Revisited

I noticed that I made some mistakes in my first test so I fixed them and I added another test. The results are not significantly different but let me demonstrate the changes: I added a dictionary that will contain the methodInfo for reach function which will allow me to Invoke the methods directly without looking them up by string.


class Program
{
static Dictionary<int, List<String>> _functions =
new Dictionary<int, List<string>>();
static Dictionary<int, List<MethodInfo>> _methods =
new Dictionary<int, List<MethodInfo>>();
static Dictionary<int, List<MethodBase>> _methodBases =
new Dictionary<int, List<MethodBase>>();
static Dictionary<int, List<Delegate>> _delegate =
new Dictionary<int, List<Delegate>>();

static Stopwatch _stopwatch = new Stopwatch();

static void Main(string[] args)
{
Console.WriteLine("Frequency = " + Stopwatch.Frequency);
DyanmicInvocationTest();
Console.WriteLine("Press any key to exit!");
Console.ReadKey();
}

static void DyanmicInvocationTest()
{
Functions fn = new Functions();
Type typeFn = fn.GetType();
MethodInfo[] methods = typeFn.GetMethods();
foreach (MethodInfo method in methods)
{
if (method.Name.StartsWith("GP"))
{
ParameterInfo[] pi = method.GetParameters();
if (!_functions.ContainsKey(pi.Length))
{
_functions.Add(pi.Length, new List<string>());
}

_functions[pi.Length].Add(method.Name);

if (!_methods.ContainsKey(pi.Length))
{
_methods.Add(pi.Length, new List<MethodInfo>());
}
_methods[pi.Length].Add(method);

if (!_methodBases.ContainsKey(pi.Length))
{
_methodBases.Add(pi.Length, new List<MethodBase>());
}
_methodBases[pi.Length].Add(method);

if (!_delegate.ContainsKey(pi.Length))
{
_delegate.Add(pi.Length, new List<Delegate>());
}

MemberInfo[] members = typeFn.GetMembers();
switch(pi.Length)
{
case 0:
_delegate[pi.Length].Add(
Delegate.CreateDelegate(
typeof(Functions.ZeroParam), fn, method));
break;
case 1:
_delegate[pi.Length].Add(
Delegate.CreateDelegate(
typeof(Functions.OneParam), fn, method));
break;
default:
break;
}
}
}
int numRuns = 100 * 1000;
long time = 0;

// Member
Console.WriteLine("*** Running dynamic member invoke test ***");
for (int i = 0; i < numRuns; ++i)
{
time += DynamicInvokeMember(fn);
_stopwatch.Reset();
}

Console.WriteLine("Avg dynamic member invoke = " + (double)time/(double)numRuns);
time = 0;

// Method
Console.WriteLine("*** Running dynamic method invoke test ***");
for (int i = 0; i < numRuns; ++i)
{
time += DynamicInvokeMethod(fn);
_stopwatch.Reset();
}

Console.WriteLine("Avg dynamic method invoke = " + (double)time / (double)numRuns);
time = 0;

// Method Base
Console.WriteLine("*** Running dynamic method base invoke test ***");
for (int i = 0; i < numRuns; ++i)
{
time += DynamicInvokeMethodBase(fn);
_stopwatch.Reset();
}

Console.WriteLine("Avg dynamic method base invoke = " + (double)time / (double)numRuns);
time = 0;

// Delegate
Console.WriteLine("*** Running dynamic delegate test ***");
for (int i = 0; i < numRuns; ++i)
{
time += DynamicInvokeDelegate(fn);
_stopwatch.Reset();
}

Console.WriteLine("Avg dynamic delegate invoke = " + (double)time / (double)numRuns);
time = 0;

// Normal
Console.WriteLine("*** Running normal invocation test ***");
for (int i = 0; i < numRuns; ++i)
{
time += NormalInvoke(fn);
_stopwatch.Reset();
}

Console.WriteLine("Average time normal = " + (double)time / (double)numRuns);
}

static Int64 DynamicInvokeMember(Functions fn)
{
Type typeFn = fn.GetType();

object[] zeroParam = new object[0];
object[] oneParam = new object[1] { 1.0 };

_stopwatch.Start();
foreach (int key in _functions.Keys)
{
//Console.WriteLine(
// String.Format("num param {0}, num fn {1}",
// key, _functions[key].Count));
foreach (String function in _functions[key])
{

switch (key)
{
case 0:
typeFn.InvokeMember(function,
BindingFlags.Default | BindingFlags.InvokeMethod,
null,
fn,
zeroParam,
null);
break;
case 1:
typeFn.InvokeMember(function,
BindingFlags.Default | BindingFlags.InvokeMethod,
null,
fn,
oneParam,
null);
break;
default:
break;
}
}
}

_stopwatch.Stop();
return _stopwatch.ElapsedTicks;
}

static Int64 DynamicInvokeMethod(Functions fn)
{
Type typeFn = fn.GetType();
object[] zeroParam = new object[0];
object[] oneParam = new object[1]{1.0};


_stopwatch.Start();
foreach (int key in _methods.Keys)
{
//Console.WriteLine(
// String.Format("num param {0}, num fn {1}",
// key, _functions[key].Count));
foreach (MethodInfo method in _methods[key])
{

switch (key)
{
case 0:
method.Invoke(fn, zeroParam);
break;
case 1:
method.Invoke(fn, oneParam);
break;
default:
break;
}
}
}

_stopwatch.Stop();
return _stopwatch.ElapsedTicks;
}

static Int64 DynamicInvokeMethodBase(Functions fn)
{
Type typeFn = fn.GetType();
object[] zeroParam = new object[0];
object[] oneParam = new object[1] { 1.0 };

_stopwatch.Start();
foreach (int key in _methodBases.Keys)
{
//Console.WriteLine(
// String.Format("num param {0}, num fn {1}",
// key, _functions[key].Count));
foreach (MethodBase method in _methodBases[key])
{
switch (key)
{
case 0:
method.Invoke(fn, zeroParam);
break;
case 1:
method.Invoke(fn, oneParam);
break;
default:
break;
}
}
}

_stopwatch.Stop();
return _stopwatch.ElapsedTicks;
}

static Int64 DynamicInvokeDelegate(Functions fn)
{
Type typeFn = fn.GetType();
object[] zeroParam = new object[0];
object[] oneParam = new object[1] { 1.0 };

_stopwatch.Start();
foreach (int key in _delegate.Keys)
{
//Console.WriteLine(
// String.Format("num param {0}, num fn {1}",
// key, _functions[key].Count));
foreach (Delegate method in _delegate[key])
{
switch (key)
{
case 0:
method.DynamicInvoke(zeroParam);//.Invoke(fn, zeroParam);
break;
case 1:
method.DynamicInvoke(oneParam);
break;
default:
break;
}
}
}

_stopwatch.Stop();
return _stopwatch.ElapsedTicks;
}

static long NormalInvoke(Functions fn)
{
_stopwatch.Start();
fn.GPConstNumPI();
fn.GPConstNumZero();
fn.GPMACD(1.0);
fn.GPSin(1.0);
_stopwatch.Stop();
return _stopwatch.ElapsedTicks;
}
}


The DynamicInvokeMethod directly invokes the method, rather than looking it up by string from the Type, which offers a pretty large performance increase. There is no change to the NormalInvoke method.

So the results for the run with calculations are as follows:
  • The normal method invocation took an average of 213 CPU Ticks.
  • The dynamic method invocation took on average about 272 CPU Ticks.
  • The dynamic member invocation took an average of 304 CPU Ticks.

The results for the run without calculations is:
  • The normal method invocation took an average of 2.52 CPU Ticks.
  • The dynamic method invocation took on average about 61.71 CPU Ticks.
  • The dynamic member invocation took an average of 97.93 CPU Ticks.


I found some resources from other people and their performance analysis:

Simon Lucas at the University of Essex:

Mattias Fagerlund's Coding Blog:

Genetic Programming: C# Reflection Performance

With Genetic Programming it's always good to have a flexible set of functions which one can plug in and out of the framework in order to test out different solution hypothesis.  After asking around what might be a good way to tackle this problem I heard about Dynamic Method Invocation via Reflection. With reflection it's extremely easy to find all the methods of a given class, so if we want to add a new method then all we have to do is write it and let the GP will automatically pick it up. I wrote a little program to test this out...

First I wrote a class containing some functions:
class Functions
{
    private readonly String _symbol;
    public Functions()
    {
        _symbol = "INVALID";
    }

    public Functions(String symbol)
    {
        _symbol = symbol;
    }
    
    #region ConstNumbers
    public double GPConstNumZero()
    {
        LongOperation();
        return 0.0;
    }
    
    public double GPConstNumPI()
    {
        LongOperation();
        return Math.PI;
    }

    #endregion

    #region OneParamFunctions
    public double GPSin(double val)
    {
        LongOperation();
        return Math.Sin(val);
    }
    
    public double GPMACD(double period)
    {
        LongOperation();
        // Do the MACD calculations here
        return Double.MaxValue;
    }
    #endregion

    #region TwoParameterFunctions
    
    // N/A
    #endregion

    private void LongOperation()
    {
        // I'll explain what goes here a little later
    }
}

Then I wrote my test program:
class Program
{
    // A dictionary that will map functions to the number of parameters
    // where the key is the number of parameters and the functions
    // that have the same number of parameters are added in the list
    // corresponding to the key.
    static Dictionary> _functions = 
        new Dictionary>();

    // A stopwatch for performance analysis
    static Stopwatch _stopwatch = new Stopwatch();

    static void Main(string[] args)
    {
        // Start the Dynamic Method Invocation Test
        DyanmicInvocationTest();
        Console.WriteLine("Press any key to exit!");
        Console.ReadKey();
    }

    static void DyanmicInvocationTest()
    {
        // Get all the method information
        Functions fn = new Functions();
        Type typeFn = fn.GetType();
        MethodInfo[] methods = typeFn.GetMethods();
        foreach (MethodInfo method in methods)
        {
            if (method.Name.StartsWith("GP"))
            {
                ParameterInfo[] pi = method.GetParameters();
                if (!_functions.ContainsKey(pi.Length))
                {
                    _functions.Add(pi.Length, new List());
                }

                _functions[pi.Length].Add(method.Name);
            }
        }
        
        // Setup the performance analysis
        int numRuns = 100 * 1000;
        long time = 0;

        // Run the dynamic invocation performance test
        Console.WriteLine("Running dynamic invocation performance test");
        for (int i = 0; i < numRuns; ++i)
        {
            time += DynamicInvoke(fn);
            _stopwatch.Reset();
        }

        Console.WriteLine("Average time dynamic = " + (double)time/(double)numRuns);
        time = 0;

        // Run the normal invocation performance test
        Console.WriteLine("Running normal invocation performance test");
        for (int i = 0; i < numRuns; ++i)
        {
            time += NormalInvoke(fn);
            _stopwatch.Reset();
        }

        Console.WriteLine("Average time normal = " + (double)time / (double)numRuns);
    }

    static Int64 DynamicInvoke(Functions fn)
    {
        Type typeFn = fn.GetType();
        foreach (int key in _functions.Keys)
        {
            //Console.WriteLine(
            //    String.Format("num param {0}, num fn {1}", 
            //    key, _functions[key].Count));
            foreach (String function in _functions[key])
            {
                _stopwatch.Start();
                switch (key)
                {
                    case 0:
                        typeFn.InvokeMember(function,
                            BindingFlags.Default | BindingFlags.InvokeMethod,
                            null,
                            fn,
                            new object[0],
                            null);
                        break;
                    case 1:
                        typeFn.InvokeMember(function,
                            BindingFlags.Default | BindingFlags.InvokeMethod,
                            null,
                            fn,
                            new object[1] { 1.0 },
                            null);
                        break;
                    default:
                        break;
                }
                _stopwatch.Stop();
            }
        }
        return _stopwatch.ElapsedTicks;
    }

    static long NormalInvoke(Functions fn)
    {
        _stopwatch.Start();
        fn.GPConstNumPI();
        fn.GPConstNumZero();
        fn.GPMACD(1.0);
        fn.GPSin(1.0);
        _stopwatch.Stop();
        return _stopwatch.ElapsedTicks;
    }
}

Earlier I didn't show you the LongOperation method of the Functions class because I wanted to explain what I was testing. Since remote method invocation can be expensive, I wanted to see when I was going to be hit the hardest if I'm going to do dynamic method invocation and I contrived two test cases:
  1. Invoke all the functions, but only return the constants inside them (e.g. when a lot of the data is pre-calculated or it already exists as constants).
    1. Invoke the methods dynamically.
    2. Invoke the methods directly.
  2. Invoke all the functions, do some calculations and return the constants inside them (e.g. when a lot of the data is being computed at run-time).
    1. Invoke the methods dynamically.
    2. Invoke the methods directly.
So for the 1st run with no calculations I just ran the  code as you see it, but for the second test I made a small change and let the LongOperation do some mathematical calculations (take the square root of a number):
private void LongOperation()
{
    for (double i = 0.0; i < 1000.0; ++i)
    {
        Math.Sqrt(i);
    }
}

When running the functions with the calculations I found that the performance hit was about 50% (i.e. methods invoked dynamically through reflection were about 50% slower than the methods invoked directly).




Table 1.0 Avg Ticks: dynamic (315.11694) normal (211.98494)


The truly horrifying part was when I ran the functions without any calculations and the performance hit was 3,967%... yes, you read it right! Invoking methods through reflection is nearly a 4,000% slower if you're not doing any calculations, but simply returning constants. See the graph below:



Table 2.0 Avg Ticks: dynamic (102.74611) normal (2.52593) 
 


So naturally those numbers are mortifying, but there is some hope on the horizon: I found a blog by Gerhard Stephan (no clue who he is) where he discusses how to get a 2000% performance increase over the methods invoked with reflection. I'm still trying to understand what he's doing, but for now it's not looking very promising.

Conclusion:
I try to have as much of the data pre-calculated as possible which reduces my calculation time during the evolution of the strategies and maximizes the evolution iterations, so reflection is not looking very promising for me. If I had a lot more calculations in my functions, then I might consider reflection as a decent alternative. For now I'll keep looking for any ways to get that performance increase and if I can't figure out something within the next couple of days I'll have to drop this idea.

Thursday, April 29, 2010

Getting Data: Real-Time Quotes

As promised I'll discuss the options for real-time data... it seems that getting real-time data has become increasingly easier and some brokerages actually have APIs that help you get real-time data or they allow you to download small interval historical prices (minutes, seconds or even ticks).

In this post I'll discuss how to get free "real-time" data and how you can use that data to build your own historical prices database. Yahoo offers delayed quotes (up to 15 minutes), but it also has "real-time" data and we can take advantage of that with a little programming.

First I'd like to introduce you to the Yahoo "API", which is nothing more than a collection of tags. You can see all the available tags here: http://www.gummy-stuff.org/Yahoo-data.htm

Getting real-time quotes is much like getting historical data from Yahoo: you will build a URL with the tags you're interested in and you'll use that URL to get a CSV file containing a quote. Here is how to construct a URL:

http://finance.yahoo.com/d/quotes.csv?s= + STOCK_SYMBOL(S) + &f= + TAG(S)


Let's try it with one symbol and one tag: the symbol will be "GOOG" (Google) and the tag wil be "g" (day's low):

http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=g


If we want to get multiple tags we can just string them together (will return a CSV file with the high and the low of the day):

http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=hg


Let's fetch a CSV file containing Google's high and low values for the day on the first line and Microsoft's high and low values for the day on the next line:

http://finance.yahoo.com/d/quotes.csv?s=GOOG+MSFT&f=hg


Here is an example that downloads all the data for the specified tags and prints them to the screen alongside their description:

// A dictionary with tags where the key is the tag
// and the value is the description of the tag.
private Dictionary _tags;

private void DownloadData(String symbol)
{
string url = String.Format(
"http://finance.yahoo.com/d/quotes.csv?s={0}&f=", symbol);

//Get page showing the table with the chosen indices
HttpWebRequest request = null;
DFDataSet ds = new DFDataSet();
Random rand = new Random(DateTime.Now.Millisecond);
try
{
while (_running)
{
foreach (String key in _tags.Keys)
{
lock (_sync)
{
request = (HttpWebRequest)WebRequest.CreateDefault(
new Uri(url + key));
request.Timeout = 30000;

using (var response = (HttpWebResponse)request.GetResponse())
using (StreamReader input = new StreamReader(
response.GetResponseStream()))
{
Console.WriteLine(String.Format("{0} {1} = {2}",
symbol, _tags[key], input.ReadLine());
}
}
}
Console.WriteLine(Thread.CurrentThread.Name + " running.");
Thread.Sleep(60*1000); // 60 seconds
}
}
catch (Exception exc)
{
Console.WriteLine(exc.Message);
}
}


At this point you should have everything you need to build your own historical database:
1. Download the CSV file with all the tags you're interested in.
2. Open a connection to your database.
3. Bulk insert the CSV file into your database.

Friday, April 23, 2010

Getting Data: EOD Historical Prices

One of the first issues that people encounter when trying to experiment with ML/AI algorithms in the trading field is getting (free) training data. It's pretty easy to get end of day (EOD) prices from google and yahoo- both providers allow you to download historical data in a csv format.

Let's take a look at both providers:

Google
  • Posts the EOD data earlier than Yahoo (2-3 hours before Yahoo posts the data).
  • Sample URL: http://www.google.com/finance/historical?q=NYSE:GOOG&output=csv
  • Has a limit on how many requests you can make per minute (don't know the exact figure, but if you request data too often google will block you for a period of time).
  • Good if you download the data manually and automatically (at low frequency).
Yahoo
  • Posts the EOD data a little later (around 7:00 p.m. CST).
  • Sample URL: http://ichart.finance.yahoo.com/table.csv?s=YHOO&d=3&e=23&f=2010&g=d&a=3&b=12&c=1996&ignore=.csv
  • Has no limit on how many requests you can make at all!
  • Good if you download the data both manually and automatically (no frequency limit).
The entire set of up-to-date EOD prices from Yahoo for all the stocks on all of the major exchanges is around 4GB. Google's database size is about the same.

I chose to use Yahoo, because it was important that there is no limit on the number of requests I can make... this allowed me to automate the process of downloading the data and populating my database.

Here is sample code on how to download the data:

void DownloadDataFromWeb(string symbol)
{
DateTime startDate = DateTime.Parse("1900-01-01");

string baseURL = "http://ichart.finance.yahoo.com/table.csv?";
string queryText = BuildHistoricalDataRequest(symbol, startDate, DateTime.Today);
string url = string.Format("{0}{1}", baseURL, queryText);

//Get page showing the table with the chosen indices
HttpWebRequest request = null;
HttpWebResponse response = null;
StreamReader stReader = null;

//csv content
string docText = string.Empty;
string csvLine = null;
try
{
request = (HttpWebRequest)WebRequest.CreateDefault(new Uri(url));
request.Timeout = 300000;

response = (HttpWebResponse)request.GetResponse();

stReader = new StreamReader(response.GetResponseStream(), true);

stReader.ReadLine();//skip the first (header row)
while ((csvLine = stReader.ReadLine()) != null)
{
string[] sa = csvLine.Split(new char[] { ',' });

DateTime date = DateTime.Parse(sa[0].Trim('"'));
Double open = double.Parse(sa[1]);
Double high = double.Parse(sa[2]);
Double low = double.Parse(sa[3]);
Double close = double.Parse(sa[4]);
Double volume = double.Parse(sa[5]);
Double adjClose = double.Parse(sa[6]);
// Process the data (e.g. insert into DB)
}
}
catch (Exception e)
{
throw e;
}
}

string BuildHistoricalDataRequest(string symbol, DateTime startDate, DateTime endDate)
{
// We're subtracting 1 from the month because yahoo
// counts the months from 0 to 11 not from 1 to 12.
StringBuilder request = new StringBuilder();
request.AppendFormat("s={0}", symbol);
request.AppendFormat("&a={0}", startDate.Month-1);
request.AppendFormat("&b={0}", startDate.Day);
request.AppendFormat("&c={0}", startDate.Year);
request.AppendFormat("&d={0}", endDate.Month-1);
request.AppendFormat("&e={0}", endDate.Day);
request.AppendFormat("&f={0}", endDate.Year);
request.AppendFormat("&g={0}", "d"); //daily

return request.ToString();
}

Thursday, April 22, 2010

Hello World!

Since the topic is Machine Learning and Artificial Intelligence applied to trading, why don't we start by looking at how nature deals with trading: http://www.rattraders.com/

I thought it was an interesting concept... they're essentially letting nature evolve trading strategies. While rat's tend to reproduce quite fast it's still quite slow compared to what your average computer can do.

In the spirit of evolution I made my own application that uses Genetic Programming algorithms to create trading strategies and I use them to trade on the stock market. I post my results on twitter in order to get a time-stamp: www.twitter.com/darwins_finches

So far the results have been quite promising: from 09-09-09 to 4-13-2010 we have accumulated about 25% profit when the stocks we trade have only gone up 18.77%. The S&P 500 has gone up 16.76% in the same period, so we're outperforming both the Buy & Hold strategy for our stocks and the S&P 500.