Wednesday, 18 January 2017

Playing with Docker - some initial results (pysystemtrade)

This post is about using Docker - a containerisation tool - to run automated trading strategies. I'll show you a simple example of how to use Docker with my python back testing library pysystemtrade to run a backtest in a container, and get the results out. However this post should hopefully be comprehensible to non pysystemtrade and non python speaking people as well.

PS: Apologies for the long break between posts: I've been writing my second book, and until the first draft is at the publishers posts will be pretty infrequent.


The logo of docker is a cute whale with some containers on its back. It's fun, but I'm worried that people will think it's okay to start using whales as a cheap alternative to ships. Listen people: It's not okay.
Source: docker.com


What the Heck is Docker and why do I need it?

As you'll know if you've read this post I currently run my trading system with two machines - a live and a backup. A couple of months ago the backup failed; some kind of disk failure. I humanely dispatched it to ebay (it is now in a better place, the new owner managing to successfully replace the disk: yes I am a software guy and I don't do hardware...).

A brief trip to ebay later and I had a new backup machine (actually I accidentally bought two machines, which means I now have a machine I can use for development). A couple of days ago I then had to spend around half a day setting up the new machine.

This is quite a complicated business, as the trading system consists of:


  1. An operating system (Linux mint in my case)
  2. Some essential packages (ssh; emacs, as I can't use vi; x11vnc as the machine won't normally have a monitor attached; git to download useful python code from github or my local network drive)
  3. Drive mountings to network drives
  4. The interactive brokers gateway
  5. A specific version of Python 
  6. A bunch of python libraries that I wrote all by myself
  7. A whole bunch of other python libraries like numpy, pandas etc: again all requiring specific versions to work (my legacy trading system is in frozen development so it is keyed to an obsolete version of various libraries which have since had pesky API changes)
  8. Some random non python package dependencies 
  9. A directory structure for data
  10. The actual data itself 

All that needs to be set up on the machine, which at times can be quite fiddly: for example many of the dependencies have dependencies, and sometimes a google is required. Although I have a 'build' file consisting of a document file mostly saying "do this, then this..." it can still be a tricky process. And some parts are... very... slow...

While moaning about this problem on twitter I kept hearing about something called Docker. I had also seen references to Docker in the geeky dead trees based comic I occasionally indulge myself with, and most recently at this ultra geeky linux blog written by an ex colleague.

At its simplest level Docker allows the above nightmare to be simplified to just the following steps:

  1. An operating system. Scarily this could be ANY operating system. I discuss this below.
  2. Some essential packages (these could be containerised but probably not worth it)
  3. Drive mountings to network drives
  4. The interactive brokers gateway (could also be put inside Docker; see here).
  5. Install Docker
  6. Run a docker container that contains everything in steps 5 to 10 of the original list

This will certainly save time every time I set up a new trading server; but unless you are running your own data centre that might seem a minimal time saving. Actually there are numerous other advantages to using Docker which I'll discuss in a second. But first let's look at a real example.


Installing Docker


Installing Docker is, to be fair, a little tricky. However it's something you should only need to do once; and it will in the long run prevent you from much pain installing other crud. It's also much easier than installing say pandas.

Here is how it's done: Docker installation

It's reasonably straightforward, although I found to my cost it won't work on a 32 bit linux distro. So I had to spend the first few hours of yesterday reinstalling the OS on my laptop, and on the development machine I intend to use for running pysystemtrade when it gets closer to being live.

Not a bad thing: I had to go through the pain of reinstalling my legacy code and dependencies to remind me of why I was doing this, took the opportunity to switch to a new IDE pycharm, and as a bonus finally wiped Windows 10 off the hard disk (I'd kept it 'just in case' but I've only used it twice in the last 6 months: as the rest of my household are still using Windows there are enough machines lying around if I need it).

The documentation for Docker is mostly excellent, although I had to do a little bit of googling to work out how to run the test example I'm going to present now (that's mostly because there is a lot of documentation and I got bored - probably if I'd read it all I wouldn't have had to google the answer).


Example of using Docker with pysystemtrade


The latest version of pysystemtrade on github includes a new directory: pysystemtrade/examples/dockertest/. You should follow along with that. All the command line stuff you see here is linux; windows users might have to read the documentation for Docker to see what's different.


Step one: Creating the docker image (optional)


The docker image is the starting state of your container. Think of a docker image as a little virtual machine (Docker is different from a true virtual machine... but you can google that distinction yourself) preloaded with the operating system, all the software and data you need to do your thing; plus the script that your little machine will run when it's loaded up.

Creating a docker image essentially front loads - and makes repeatable - the job of setting up the machine. You don't have to create your own images, since you can download them from the docker hub - which is just like the git hub. Indeed you'll do that in step two. Nevertheless it's worth understanding how an image is created, even if you don't do it yourself.

If you want to create your own image you'll need to copy the file Dockerfile (in the directory of pysystemtrade/examples/dockertest/) to the parent directory of pysystemtrade on your computer. For example the full path name for me is /home/rob/workspace3/pysystemtrade/... so I would move to the directory /home/rob/workspace3/. Then you'll need to run this command:

sudo docker build -t mydockerimage .

Okay; what did that just do? First let's have a look at the Dockerfile:


FROM python
MAINTAINER Rob Carver <rob@qoppac.com>
RUN pip3 install pandas
RUN pip3 install pyyaml
RUN pip3 install scipy
RUN pip3 install matplotlib
COPY pysystemtrade/ /pysystemtrade/
ENV PYTHONPATH /pysystemtrade:$PYTHONPATH
CMD [ "python3", "/pysystemtrade/examples/dockertest/dockertest.py" ]

Ignoring the second line, this does the following (also read this):


  • Loads a base image called python (this defaults to python 3, but you can get earlier versions). I'll talk about base images later in the post.
  • Loads a bunch of python dependencies (the latest versions; but again I could get earlier versions if I wanted)
  • Copies my local version of the pysystemtrade library into the image
  • Ensures that python can see that library
  • Runs a script within pysystemtrade


If you wanted to you could tag this image and push it on the docker hub, so that other people could use it (See here). Indeed that's exactly what I've done: here.

Note: Docker hub gives you one free private image by default, but as many public images as you need.

Note 2: If you are running this on a machine without pysystemtrade you will need to add an extra command to pull the code from github. I leave this as an exercise to the reader.


Step two: Running the container script


You're now ready to actually use the image you've created in a container. A container is like a little machine that springs into life with the image pre-loaded, does its stuff in a virtual machine like way, and then vanishes.

If you haven't created your own image, then you need to run this:

sudo docker run -t -v /home/rob/results:/results robcarver17/pysystemtrade

This will go to docker hub and get my image (warning this may take a few minutes).

OR with your own local image if you followed step one above:

sudo docker run -t -v /home/rob/results:/results mydockerimage


In both cases replace /home/rob/results with your own preferred directory for the backtest output. The stuff after the '-v' flag mounts that directory into the docker container, mapping it to the directory /results. Warning: the docker container will have complete power over that directory, so it's best to create a directory just for this purpose in case anything goes wrong.
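If you'd rather launch the container from a script than type the command by hand, the mapping can be assembled in python. This is only a minimal sketch: build_run_command is a hypothetical helper I've made up for illustration (it isn't part of pysystemtrade or Docker), and it does nothing more than rebuild the command line shown above, insisting on an absolute host directory.

```python
import os
import shlex


def build_run_command(host_dir, image, container_dir="/results", use_sudo=True):
    """Assemble the 'docker run' command line used in this post.

    build_run_command is a hypothetical helper, for illustration only.
    """
    if not os.path.isabs(host_dir):
        raise ValueError("host_dir should be an absolute path: %s" % host_dir)

    # host directory before the colon, directory seen inside the container after
    volume_mapping = "%s:%s" % (host_dir, container_dir)
    command = ["docker", "run", "-t", "-v", volume_mapping, image]
    if use_sudo:
        command = ["sudo"] + command
    return command


command = build_run_command("/home/rob/results", "mydockerimage")
print(" ".join(shlex.quote(part) for part in command))
```

The returned list could then be handed to subprocess.run(command, check=True) to actually start the container.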

The docker image will run, executing this script:


from systems.provided.futures_chapter15.basesystem import futures_system
from matplotlib.pyplot import show

resultsdir="/results"
system = futures_system(log_level="on")
print(system.accounts.portfolio().sharpe())
system.pickle_cache("", resultsdir+"/dockertest.pck")


This just runs a backtest, and then saves the result to /results/dockertest.pck. The '-t' flag means you can see it running. Remember the /results directory is actually a volume mapped on to the local machine. After the container has exited it will have created a file on the local machine called /home/rob/results/dockertest.pck


Step three: Perusing the results


Now in a normal local python session run the file dockertestresults.py:

from systems.provided.futures_chapter15.basesystem import futures_system
from matplotlib.pyplot import show

resultsdir="/home/rob/results"
system = futures_system(log_level="on")
system.unpickle_cache("", resultsdir+"/dockertest.pck")
# this will run much faster and reuse previous calculations
print(system.accounts.portfolio().sharpe())

Again you will need to change the resultsdir to reflect where you mapped the Docker volume earlier. This will load the saved back test, and recalculate the p&l (which is not stored in the systems object cache).
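The "write inside the container, read on the host" pattern doesn't depend on pysystemtrade at all. As a rough illustration using only the standard library, the sketch below pickles a dictionary into a directory and reads it back again. The function names, the file name and the numbers are all made up, and this is not how pickle_cache is implemented; a temporary directory simply stands in for both ends of the volume mapping.

```python
import os
import pickle
import tempfile


def save_results(results, resultsdir, filename="example.pck"):
    # Inside the container this directory would be the mounted volume, eg /results
    path = os.path.join(resultsdir, filename)
    with open(path, "wb") as f:
        pickle.dump(results, f)
    return path


def load_results(resultsdir, filename="example.pck"):
    # On the host this would be the directory given before the colon in '-v'
    path = os.path.join(resultsdir, filename)
    with open(path, "rb") as f:
        return pickle.load(f)


# A temporary directory plays the role of both ends of the volume mapping
with tempfile.TemporaryDirectory() as shared_dir:
    save_results({"sharpe": 0.5, "instruments": ["A", "B"]}, shared_dir)
    restored = load_results(shared_dir)
    print(restored)
```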


Okay... so what (Part one: backtesting)


You might be feeling a little underwhelmed by that example, but there are many implications of what we just did. Let's think about them.


Backtesting server


Firstly, what we just did could have happened on two machines: one to run the container, the other to analyse the results. If your computing setup is like mine (relatively powerful, headless servers, often sitting around doing nothing) that's quite a tasty prospect.


Backtesting servers: Land of clusters


I could also run backtests across multiple machines. Indeed there is a specific docker product (swarm) to make this easier (also check out Docker machine).


Easy setup


Right at the start I told you about the pain involved with setting up a single machine. With multiple machines in a cluster... that would be a real pain. But not with docker. It's just a case of installing essential services, docker itself, and then launching containers. 


Cloud computing


These multiple machines don't have to be in your house... they could be anywhere. Cloud computing is a good way of getting someone else to keep your machines running (if I was running modest amounts of outside capital, it would be the route I would take). But the task of spinning up, and preparing a new cloud environment is a pain. Docker makes it much easier (see Docker cloud).


Operating system independent


You can run the container on any OS that can install Docker... even (spits) Windows. The base image essentially gets a new OS; in the case of the base image python this is just a linux variant with python preloaded. You can also get different variants which have a much lower memory overhead.

This means access to a wider variety of cloud providers. It also provides redundancy for local machines: if both my machines fail I am only minutes away from running my trading system. Finally for users of pysystemtrade who don't use Linux it means they can still run my code. For example if you use this Dockerfile:

FROM python
MAINTAINER Rob Carver <rob@qoppac.com>
RUN pip3 install pandas
RUN pip3 install pyyaml
RUN pip3 install scipy
RUN pip3 install matplotlib
COPY pysystemtrade/ /pysystemtrade/
ENV PYTHONPATH /pysystemtrade:$PYTHONPATH


sudo docker build -t mydockerimage .
sudo docker run -t -i mydockerimage

... then you will be inside an interactive python session with access to the pysystemtrade libraries. Some googling indicates it's possible to run ipython and python notebooks inside docker containers as well, though I haven't tried this myself.


Okay... so what (Part two: production)


Docker also makes running production automated trading systems much easier: in fact I would say this is the main benefit for me personally. For example you can easily spin up one or more new trading machines agnostic of OS either locally or on a cloud. Indeed using multiple machines in my trading system is one of the things I've been thinking about for a while (see this series of tweets: one, two, three and so on).


Microservices


Docker makes it easier to adopt a microservices approach where we have lots of little processes rather than a few big ones. For example instead of running one piece of code to do my execution, I could run multiple pieces, one for each instrument I am trading. Each of these could live in its own container. Then if one container fails, the others keep running (something that doesn't happen right now).

The main advantage of Docker over true virtual machines is that each of those containers would be working off almost identical images (the only difference being the CMD command at the end; in practice you'd have identical images and put the CMD logic into the command line); Docker would share this common stuff massively reducing the memory load of running a hundred processes instead of one.
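To make the "identical images, different command line" idea a bit more concrete, here is a rough sketch of how the launch commands might be generated. The script path, the instrument codes and the container naming scheme are all hypothetical (I haven't written any of this yet); the only Docker behaviour it relies on is that anything after the image name replaces the image's CMD.

```python
def execution_commands(instruments, image="mydockerimage",
                       script="/pysystemtrade/run_execution.py"):
    """Build one 'docker run' command per instrument, all from the same image.

    The script path and the container naming scheme are hypothetical.
    """
    commands = []
    for instrument in instruments:
        commands.append([
            "docker", "run", "-d",
            "--name", "execution_%s" % instrument,
            image,
            # Anything after the image name replaces the image's CMD
            "python3", script, instrument,
        ])
    return commands


for command in execution_commands(["INSTRUMENT_A", "INSTRUMENT_B"]):
    print(" ".join(command))
```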


Data in container images


In the simple version of pysystemtrade as it now exists the data is effectively static .csv files that live inside the python directory structure. But in the future the data would be elsewhere: probably in databases. However it would make sense to keep some data inside the docker image, eg static information about instruments or configuration files. Then it would be possible to easily test and deploy changes to that static information.


Data outside container images


Not all data can live inside images; in particular dynamic data like market prices and system state information like positions held needs to be somewhere else.

Multiple machines means multiple places where data can be stored (machine A, B or a local NAS). Docker volumes allow you to virtualise that so the container doesn't know or care where the data it's using lives. The only work you'd have to do is define environment variables which might change if data is living in a different place to where it normally is, and then launch your container with the appropriate volume mappings.
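On the python side that could look something like the sketch below. The variable name PST_DATA_DIR, the default of /data and the prices sub-directory are hypothetical names I've picked for illustration; nothing like this exists in pysystemtrade yet.

```python
import os


def get_data_dir(env_var="PST_DATA_DIR", default="/data"):
    """Return the directory the code should read data from.

    PST_DATA_DIR and /data are hypothetical names, for illustration only.
    """
    return os.environ.get(env_var, default)


def price_file(instrument, data_dir=None):
    # The code only ever sees the path inside the container; which disk or
    # machine really sits behind it is decided by the '-v' mapping at launch.
    if data_dir is None:
        data_dir = get_data_dir()
    return os.path.join(data_dir, "prices", instrument + ".csv")


print(price_file("EXAMPLE"))
```

The environment variable itself can be set when the container is launched, using the '-e' flag of docker run alongside the '-v' mapping.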

Okay there are other ways of doing this (a messy script of symlinks in linux for example) but this is nice and tidy.

You can also containerise your storage using Docker data volumes but I haven't looked into that yet.


Message bus


I am in two minds about whether using a message bus to communicate between processes is necessary (rather than just shared databases; the approach I use right now). But if I go down that route containers need to be able to speak to each other. It seems like this is possible although this kind of technical stuff is a little beyond me; more investigation is required. 

(It may be that Swarm removes the need for a message bus in any case, with new containers launched passing key arguments)

Still at a minimum docker containers will need to talk to the IB Gateway (which could also live in a container... see here) so it's reassuring to know that's possible. But my next Docker experiment will probably be seeing if I can launch a Gateway instance (probably outside a container because of the two factor authentication I've grumbled about before unless I can use x11vnc to peek inside it) and then get a Docker container to talk to it. This is clearly a "gotcha" - if I can't get this to work then I can't use Docker to trade with! Watch this space.


Scheduling


At the moment my scheduling is very simple: I launch three big processes every morning using cron. Ideally I'd launch processes on demand; eg when a trade is required I'd run a process to execute it. I'd launch price capturing processes only when the market opens. If I introduce event driven trading systems into my life then I'd need processes that launched when specific price targets were reached.

It looks like Docker Swarm will enable this kind of thing very easily. In particular, because I'm not using python to do the process launching I won't violate the IB multiple gateway connection problem. I imagine I'd then be left with a very simple crontab on each machine to kick everything into life, and perhaps not even that.


Security


Security isn't a big deal for me, but there is something pleasing about only allowing images access to certain specific directories on the host machine.


Development and release cycle


Finally Docker makes it easier to have a development and release cycle. You can launch a docker container on one machine to test things are working. Then launch it on your production machine. If you have problems then you can easily revert to the last set of images that worked. You don't have to worry about reverting back to old python libraries and generally crossing your fingers and hoping it all works.

You can also easily run automated testing; a good thing if I ever get round to fixing all my tests.

Geeky note: You can only have one private image in your docker hub account; and git isn't ideal for storing large binaries. So another source control tool might be better for storing copies of images you want to keep private.


Summary


Development of pysystemtrade - as the eventual replacement to my legacy trading system - is currently paused whilst I finish my second book; but I'm sure that Docker will play a big part in it. It's a huge and complex beast with many possibilities which I need to research more. Hopefully this post has given you a glimpse of those possibilities.

Yes: there are other ways of achieving some of these goals (I look forward to people telling me I should use puppet or god knows what), but the massive popularity of Docker tells you why it's so good; it's very easy to use for someone like me who isn't a professional sysadmin or full on linux geek, and offers a complete solution to many of the typical problems involved with running a fully automated system.

PS You should know me better by now but to be clear: I have no connection with Docker and I am receiving no benefit pecuniary or otherwise for writing this post.

28 comments:

  1. Nice, I too am experimenting with docker. A couple of code changes to your Dockerfile though, 1) always use the USER command to switch to a non root user, a vulnerability was discovered just yesterday about not using SELinux and root users. 2) your pip installs should be on one line that way if you ever need to install a new package the entire image will be rebuilt. 3) if you deploy more than one container it might be wise to invest some time in learning YML so your container cluster can be deployed a little better and more efficiently.

    Other than that it really is cool to see how you're using it.

    1. Hi Mark. I will add the USER command in (will take me a few minutes to update both hubs and the blog post). For (2) I'm not sure I agree entirely as I can see use cases where I want to update pandas and not scipy since the former is in development but the latter is more stable. For (3) I already use YAML (assume that's what you mean eg in https://docs.docker.com/compose/compose-file/) for pysystemtrade configuration so that's one less thing I need to learn (thank goodness).

    2. Problem: running as a specific USER gives me permission issues (sudo can easily access the bound /home/rob/results directory, specific user cannot). Actually even in its present form the code isn't ideal since it creates a .pck file that has superuser only privileges.

      After a lot of googling it seems the best option is to create a data volume container, however it isn't clear to me if that will get rid of the permissions problem...

    3. Problem: running as a specific USER gives me permission issues

      I believe this is due to UNIX filesystem access rights. User and group ownership of the directories and files is not the same on the host as in the container.

      This is solved by setting the same numeric user id and group id on the directories and files on the host filesystem as the user has in the container. You can change it recursively on the host filesystem with: chown -R : , the uid and gid numbers you can find by logging into the container with "docker exec" and issuing "id ", the actual info is stored in /etc/group and /etc/passwd text files.

      Another way to do it is to make the directories and files on the host filesystem writable for everyone, chmod -R o+w , this is of course less good from a security standpoint, but it will solve your problem.

    4. Thanks. This is obviously the best solution.

  2. Nice one.

    Passing messages between containers is fairly straightforward. When using "docker links", it effectively maps the local IP address of the linked container to /etc/hosts within the container, so you simply address it by the linked name (e.g. http://mongo would give you the IP of a container linked with the name 'mongo').

    IB gateway, either within a container or standalone, should hook onto the host systems' network interface, so your trading container should also be able to do the same thing.

  3. My understanding is that Interactive Brokers allows you to run one TWS/Gateway instance at a time. How do you handle this limitation if you use Docker to run multiple instances of your trading system in parallel?

    1. There would only be one Gateway instance to which multiple processes would connect.

  4. As someone who works with docker daily, I'd recommend against it in this case. You have a mostly static environment and the added complexity will just add to your maintenance work. My recommendation would be a solid config management tool like ansible to provision your servers and for you to standardise on an os, like centos 7.

    1. Thanks for commenting; I guess I'll be able to evaluate when I start using it in anger.

    2. I would object. Using Docker and following the paradigm of "Infrastructure as Code" brings a lot of benefits and flexibility. Using a build tool like make/cmake/ansible/ant/puppet/etc to automatically manufacture your container and uploading it to a remote registry, you are then able to run your software from anywhere in the world on any computer, cluster or cloud supporting Docker. Not only good from a development standpoint but also great from a disaster recovery perspective: if your physical machine goes down you will be rapidly up and running in some other place with minimal downtime.

    3. What I wanted to say is that you can treat your infrastructure as disposable and you throw it away and rebuild it anytime you like. "Pets vs. Cattle"

      As opposed to manually installing the OS, manually installing the software, manually configuring the software, a time consuming installation and setup process, and being dependent on one server that is seen as fragile and needs to be carefully taken care of.

  5. Hi Rob,

    Great article!

    Just a basic question about backtesting strategies, particularly monthly rebalanced strategies.

    Say you have a strategy that gets rebalanced the last day of each month. Typically, you just run the weights on next months returns and have a monthly returns stream. However, what if you'd wanted insights about how the strategy performed on a daily basis. To do that would you simulate it as though you rebalanced daily using the most recent weights? Or would you simulate it mark-to-market on a daily basis?

    If the latter, is there an efficient way to do this?

    Thank you so much.

    1. A: Simulate it mark-to-market on a daily basis.

      Otherwise you'll be changing the behaviour of your strategy.

      You basically want to forward fill your positions from month start to month end after matching them against the data frame of prices.

    2. Thank you for your response.

      I hate to be a pest, but would you mind demonstrating how this is done in python? There's nothing on the internet about how to do this (at least that I've found). It would be greatly appreciated. It also gets tricky because I don't rebalance on the end of each month e.g the 10th trading day of the month.

    3. import pandas as pd

      positions=pd.DataFrame(...monthly...)
      prices=pd.DataFrame(...daily...)

      positions=positions.reindex(prices.index, method="ffill")

      Will work for any day of rebalancing.

    4. Yes, but isn't that just daily rebalancing. For instance, if I rebalance on 1/30 to a weight of 50% for market A and 50% for market B and I have $100, for example. Then on 2/1 the price for market A goes up 10% and market B is flat. So market A is now $50 * 1.1 = $55 and market B is still $50. So now my weight for market A is $55 / $105($55 + $50) is ~52% and market B is ~48%. Using your method it would say 50%/50% which is just daily rebalancing.

      Maybe I asked poorly. But I am trying to simulate what my weights and returns would be on a daily basis if I rebalance once a month without rebalancing at all between those periods.

      Using the example above, for instance, on 2/1 to find out what my weights "should" be I would see 52%/48% not 50%/50% and the returns for 2/2 would be based on the 52%/48% weights at EOD 2/1.

    5. The code is correct. You need to forward fill your positions, not your weights.

    6. Pardon my naivete, but what do you mean by positions?

      Thank you for responding, btw.

    7. Position = how many actual shares or futures contracts you hold.
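
      To see why forward filling positions (rather than weights) gives the drifting weights described in the question, here is a small worked sketch in plain python. The prices of 10 and the holdings of 5 units each are hypothetical numbers chosen to match the $100, 50%/50% example above; the helper functions are made up for illustration, and the manual forward fill stands in for the pandas reindex call.

```python
def forward_fill(rebalance_positions, dates):
    """Carry the most recent position forward to every date."""
    filled = {}
    current = None
    for date in dates:
        if date in rebalance_positions:
            current = rebalance_positions[date]
        filled[date] = current
    return filled


def weights(positions, prices):
    # Weight = value of each holding divided by total portfolio value
    values = {name: positions[name] * prices[name] for name in positions}
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


# Hypothetical numbers matching the comment: $100 split 50/50 at prices of 10
daily_prices = {
    "1/30": {"A": 10.0, "B": 10.0},
    "2/1": {"A": 11.0, "B": 10.0},  # A rises 10%, B is flat
}
rebalance_positions = {"1/30": {"A": 5.0, "B": 5.0}}

daily_positions = forward_fill(rebalance_positions, list(daily_prices))
for date, prices in daily_prices.items():
    w = weights(daily_positions[date], prices)
    print(date, {name: round(x, 3) for name, x in w.items()})
```

      The holdings stay at 5 units each, but on 2/1 the weights come out at roughly 52%/48% because market A is now worth $55 out of $105.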

  6. Okay great. Then each day during the month it's just multiplying the # of contracts by the contract values?

    1. Thank you so much. It is more than appreciated.

    2. So, I've been trying this entire day to make this work, unfortunately I cannot figure this out. I know you're very busy and have more important things to do but can I email you my code snippet and a sample of the data I am working with for some help?

      This is an act of desperation.

  7. There's a recent discussion thread over at nuclearphynance about Docker, and seems the sentiment about its actual ease of deployment is mixed, and one comment linked to this blog (a fellow Londoner, Rob, got to count for something right): https://thehftguy.com/2017/02/23/docker-in-production-an-update/

    Seems he's quite sour to Debian/Ubuntu, Cent OS, and so on.

    My dedicated server is running CentOS7 due to RHEL-only Linux support for Rithmic API...I'd considered using Docker but abandoned the idea until later...maybe Xen/LXD would be a good alternative?

    1. That's a really interesting article. I guess the key takeaway for me is not to build something that relies on docker being present; or to put it another way to create services that can run inside or outside of a container. I'd planned to do this anyway, as I'm releasing everything open source I don't think it's fair to force a particular architecture on other users. I'm using Linux Mint which is Ubuntu based FWIW. I'd also always planned to create daily copies of the data normally inside any docker volumes on to a real file system somewhere, since I never really trusted the idea of a docker volume wholeheartedly.

  8. While it's not clear they use it for front office systems, your former employer's competitor posted recently about containerised services. Clearly smart minds think alike!!

    https://tech.winton.com/blog/2017/04/continuous-deployment-of-containerised-microservic
