What I have been doing lately

Here is an overview of the projects I have been working on for some time: pytask, respy, gettsim, and sid.

Hi everybody,

I have not written a post in a long time, so I am resuming this blog by keeping you posted on some of my more recent projects.

Last year, I developed or contributed to several research applications. I learned a lot about software engineering and designing applications. Maybe I will find the time to write a post on some of the things I learned along the way.

pytask

The project I am most excited about is pytask, a build system designed for researchers to run their whole project pipeline, from data preparation through the analyses to compiling the reports.

I was highly frustrated with existing solutions, so I wrote my own build system. The interface is one of the highlights. It is similar to pytest's, which lowers the entry barrier, but it looks nicer because the output is rendered with rich. Under the hood, pytask uses pluggy to offer a plugin system.

If you already know my cookiecutter for reproducible research, you know Waf. pytask replaces Waf. I will probably not update the cookiecutter for the foreseeable future and, instead, I recommend Hans-Martin’s cookiecutter, which will support pytask.

Please take a look at pytask and try it out in your next project. I appreciate any feedback, comments, feature requests, and harsh criticism :).

I already gave a presentation about pytask’s design, which I will probably post here in a few weeks. Half of it is about plugin architectures in general and pluggy, and the other half is about pytask.

respy

Together with Janos, I created a framework for a certain class of econometric models called respy.

For the insiders, it is a framework for finite-horizon discrete choice dynamic programming models, also called Eckstein-Keane-Wolpin models. Researchers use them to study the human capital accumulation process in the labor market.

The documentation is quite extensive for such a young project, and the contributors taking over the project plan to extend it with even more examples and applications.

If you are an economist interested in structural modeling, it might be an excellent place to start. Even if you do not want to use this model, you might find some inspiration for your own.

I learned a lot about building interfaces people can use and about writing performant code by choosing the right design and using Numba where necessary.

gettsim

I redesigned the computational backend of gettsim with Janos and Hans-Martin. gettsim offers a representation of the German tax and transfer system and allows researchers to study the impact of reforms on the amount of taxes and benefits people face.

The task was to design an interface that allows users to modify or extend the pre-implemented tax and transfer system.

Our solution combines an idea borrowed from pytest’s fixtures with a DAG (directed acyclic graph).

A subset of the German tax and transfer system.

You can view the whole tax and transfer system as an extensive network. In this network, nodes are quantities like child benefits, capital gains, or taxes on capital gains. Edges represent how quantities relate to each other. For example, taxes paid on capital gains are derived from capital gains subject to income tax. Each quantity is either part of the data or computed by a function.

The network is a directed graph because edges point in one direction. And it is acyclic since there are no cycles in this graph, meaning you will never return to the same node following the edges. These properties make the network a directed acyclic graph or a DAG.
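To make the two properties concrete, here is a minimal sketch using Python's standard library (graphlib, available from Python 3.9). The node names are made up for illustration and are not gettsim's actual identifiers.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical mini network: each quantity maps to the quantities it depends on.
network = {
    "capital_gains": set(),
    "capital_gains_subject_to_tax": {"capital_gains"},
    "capital_gains_tax": {"capital_gains_subject_to_tax"},
}


def is_acyclic(graph):
    """Return True if following the edges never leads back to the same node."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True


acyclic = is_acyclic(network)

# Letting capital gains depend on the tax paid on them would close a loop.
cyclic_network = {**network, "capital_gains": {"capital_gains_tax"}}
has_cycle = not is_acyclic(cyclic_network)
```

The first network is a valid DAG, while the second one is rejected because the extra edge creates a cycle.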

This network view has a couple of benefits.

  • A quantity can be computed once and then passed to all nodes that depend on it, which saves runtime and reduces code duplication.

  • If you want to model a policy change, you can single out the relevant nodes in the network and modify the underlying functions.

  • If you are interested only in a subset of the tax and transfer system, you can subset the network and remove unnecessary nodes.
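The last point, subsetting the network, can be sketched in a few lines. This is only an illustration under the assumption that the network is stored as a plain dictionary of dependencies; the helper `required_nodes` and the node names are hypothetical and not part of gettsim.

```python
def required_nodes(graph, targets):
    """Collect the targets and everything they depend on, directly or indirectly."""
    needed = set()
    stack = list(targets)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            # Nodes without an entry are treated as plain input data.
            stack.extend(graph.get(node, ()))
    return needed


# Hypothetical network: each quantity maps to the quantities it depends on.
network = {
    "n_children": set(),
    "capital_gains": set(),
    "child_benefits": {"n_children"},
    "capital_gains_tax": {"capital_gains"},
}

# Keep only what is needed to compute child benefits.
subset = required_nodes(network, ["child_benefits"])
pruned = {node: deps for node, deps in network.items() if node in subset}
```

Everything related to capital gains drops out of `pruned`, so those functions would never need to run.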

This flexibility is highly desirable, but what does the interface look like for a user?

Here, we borrow the idea of pytest’s fixtures: using a fixture’s name as an argument of a test function gives you access to the fixture’s return value inside the test function. Similarly, a function in gettsim looks like this.

def child_benefits(n_children, parameters):
    return n_children * parameters["child_benefits"]

Here, n_children is either a variable in the input data or a function with the same name which computes the quantity.

From each function’s name and its argument names, we can build a DAG that determines an execution order for the functions.
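As a rough sketch of that idea, and not gettsim's or dags' actual implementation, the argument names can be read with `inspect.signature` and ordered with `graphlib.TopologicalSorter` (Python 3.9+). The function `n_children`, the input column names, the helpers `build_dag` and `execute`, and the benefit amount are assumptions made up for this example.

```python
import inspect
from graphlib import TopologicalSorter


def n_children(n_minor_children, n_children_in_education):
    # Hypothetical helper function; the input names are invented.
    return n_minor_children + n_children_in_education


def child_benefits(n_children, parameters):
    return n_children * parameters["child_benefits"]


functions = {func.__name__: func for func in (n_children, child_benefits)}


def build_dag(functions):
    """Map each function name to the names of its arguments (its dependencies)."""
    return {
        name: set(inspect.signature(func).parameters)
        for name, func in functions.items()
    }


def execute(functions, inputs):
    """Run the functions in dependency order, computing every quantity once."""
    dag = build_dag(functions)
    results = dict(inputs)
    for name in TopologicalSorter(dag).static_order():
        if name in functions:
            kwargs = {arg: results[arg] for arg in dag[name]}
            results[name] = functions[name](**kwargs)
    return results


inputs = {
    "n_minor_children": 2,
    "n_children_in_education": 1,
    "parameters": {"child_benefits": 200},  # made-up amount
}
order = list(TopologicalSorter(build_dag(functions)).static_order())
results = execute(functions, inputs)
```

In `order`, `n_children` always comes before `child_benefits`, because the latter names the former as an argument.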

Users can modify the collection of functions by overwriting existing functions or adding their own.
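Overwriting a function can be pictured as replacing one entry in a dictionary of functions. The sketch below is an assumption about how this could look, not gettsim's real interface: the resolver `compute`, the reform function, and all amounts are hypothetical.

```python
import inspect


def compute(target, functions, data, cache=None):
    """Resolve a quantity from the data or by calling its function (once)."""
    cache = {} if cache is None else cache
    if target in data:
        return data[target]
    if target not in cache:
        func = functions[target]
        kwargs = {
            name: compute(name, functions, data, cache)
            for name in inspect.signature(func).parameters
        }
        cache[target] = func(**kwargs)
    return cache[target]


def child_benefits(n_children, parameters):
    return n_children * parameters["child_benefits"]


def child_benefits_reform(n_children, parameters):
    # Hypothetical reform: a bonus for the third and every further child.
    bonus = max(n_children - 2, 0) * parameters["large_family_bonus"]
    return n_children * parameters["child_benefits"] + bonus


baseline_functions = {"child_benefits": child_benefits}
# The user overwrites the pre-implemented function under the same name.
reform_functions = {**baseline_functions, "child_benefits": child_benefits_reform}

data = {
    "n_children": 3,
    "parameters": {"child_benefits": 200, "large_family_bonus": 50},  # made up
}
baseline = compute("child_benefits", baseline_functions, data)
reform = compute("child_benefits", reform_functions, data)
```

Because every other function refers to `child_benefits` only by name, anything downstream would pick up the reformed version without further changes.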

You can find out more about the package in the documentation or check out the code to build a DAG in the standalone package dags.

sid

Last but not least, I have been working on an epidemiological model to predict the spread of infectious diseases. It is my COVID-19 project with Klara and Janos. It is called sid, and we hope to publish something soon.

Tobias Raabe

I am a data scientist and programmer living in Hamburg.