How I write tests

Posted on Wed 31 March 2021 in Blog • Tagged with Python, Testing, Tests

Hi everybody,

I assume that all of you write tests for Python programs with pytest. If you do not use pytest, or if you do not write tests at all, you should check out the following links, which provide some examples and an overview of pytest's capabilities.

You may also have heard about test-driven development (TDD), but I have little experience with it myself. If you know a great resource for beginners, send it my way and I can include it here.

What I did not find in these guides is a combination of patterns I use fairly often to write tests. Hopefully, it is useful for you as well. Let's go!

The function

First, here is the function we are going to …

Continue reading

What I have been doing lately

Posted on Wed 30 September 2020 in Blog • Tagged with Software Engineering

Hi everybody,

I have not written a post in a long time, so I am resuming this blog by updating you on some of the projects I have been working on.

In general, I have been developing or contributing to several research applications over the last year. I learned a lot about software engineering and designing applications. Maybe I will find the time to write a post on some of the things I learned along the way.


The project I am most excited about right now is pytask, a build system designed for researchers to run their project pipeline, from data preparation through analysis to compiling the reports.

I was extremely frustrated with existing solutions and programmed my own build system. One of the highlights is the interface, which is deliberately similar to pytest to lower the barrier to entry. It also borrows the plugin system from pytest which is based …

Continue reading

Matplotlib for publications

Posted on Thu 15 August 2019 in Blog • Tagged with Data Analysis, Visualization


This article shows how to create plots with matplotlib for publications where fonts and font sizes match the LaTeX document and graphics are not blocky, but allow for infinite zooming.

Continue reading

Numba - @vectorize and @guvectorize

Posted on Sun 14 April 2019 in Blog • Tagged with Numba, Numpy


In this post, I will explain how to use the @vectorize and @guvectorize decorators from Numba. You can use the former to write a function that generalizes from scalars to elements of arrays, and the latter for a function that generalizes from arrays to arrays of higher dimensions.

Continue reading

Numpy - Views vs. Copies

Posted on Mon 25 March 2019 in Blog • Tagged with Numpy


Continue reading

The Roy Model

Posted on Sat 19 January 2019 in Blog • Tagged with Roy Model, Selection Bias


Continue reading

A time series course with Julia

Posted on Wed 17 October 2018 in Blog • Tagged with Time Series, Julia

Last semester, I took a time series course where we implemented some models like the Hodrick-Prescott filter or structural vector autoregressive processes in Julia. The whole thing is available online with the notebooks running on Binder which allows you to go through the programming examples in your browser. If you plan to use Julia yourself and want to play around, it might be a place to start.

As I had not used Julia before and only heard about how fast it is, that it is statically typed, and so on, I was very interested in the beginning, but that changed quickly.

The main cause of frustration was that the Julia developers released three versions during the course. Version 0.6.4 was released on 9 July 2018, and versions 0.7.0 and 1.0.0 followed on 8 and 9 August, respectively. All versions changed the main …

Continue reading

Facilitate reproducible research with cookiecutter-research-template

Posted on Mon 27 August 2018 in Blog • Tagged with Research, Reproducibility, Waf, cookiecutter

This DAG is produced by a sample project for reproducible research. I extended this template with the templating engine cookiecutter and various other software engineering tools.

Continue reading

Identifying Software Patents

Posted on Tue 21 August 2018 in Blog • Tagged with Machine Learning, Deep Learning, Text Data

In 2015, I wrote my Bachelor's thesis on identifying software patents. This is useful and necessary in two ways. First, there is no official system to sort patents this way. The main system used by the USPTO focuses on the technological and functional form. A subclass dealing with dispensing solids contains manure spreaders and toothpaste tubes. In contrast, researchers are more interested in topics like automation or software. Second, I learned Python and made my first steps into the world of machine learning.

You can find the whole project on GitHub as well as the paper. There is also a script to download different kinds of data sets. The raw data takes up approximately 90 GB of disk space, whereas the data for replicating the previous results based on a simple algorithm is currently less than 1 GB.

Now, let us see what has been done so far.

The idea of the project …

Continue reading

How to download files with Python

Posted on Mon 11 June 2018 in Blog • Tagged with Python, Downloader


This is a short Python script to download files, resume interrupted downloads, and validate downloads with hash values. It is useful for distributing projects and data separately. You can find it at the end of the article. An interactive version of the notebook is available as a Binder notebook.
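To give an idea of the validation step only, here is a minimal sketch using the standard library. It is not the script from the article: the function names `file_sha256` and `is_valid_download`, the choice of SHA-256, and the chunk size are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file (hypothetical helper).

    The file is read in chunks so that large downloads do not have to
    fit into memory at once.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_download(path, expected_sha256):
    """Check a downloaded file against a published hash (hypothetical helper)."""
    # Hex digests are case-insensitive, so normalize before comparing.
    return file_sha256(path) == expected_sha256.lower()
```

A caller would compare each downloaded file against the hash published alongside the data and download it again if `is_valid_download` returns `False`.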

Continue reading

How to compile and distribute an R package with conda

Posted on Wed 21 March 2018 in Blog • Tagged with conda, Anaconda, R, conda-build, MRO

This article shows how to compile and distribute R packages to be used in your data science projects. This is useful because R does not really have a neat dependency-pinning tool like Python's requirements.txt or conda's environment.yml, and R is shipped with conda anyway. But if you want to use the MKL-accelerated Microsoft R Open instead of plain R, some packages are currently not provided in conda's default channels or conda-forge. Here is how to remove this obstacle.

Introduction to Anaconda and R

I like to manage my research projects with conda, which is the package manager for Anaconda, a popular Python distribution for data science. For one of my recent projects, I also needed to install R, and I was glad to find out that R is also available with conda.

First, you create your normal Python …

Continue reading

The Tragedy of Titanic

Posted on Sun 22 October 2017 in Blog • Tagged with Data Analysis, Machine Learning


Analysis of survival rates on the Titanic with a placebo test for whether traveling in couples increased the likelihood of survival. An interactive version of the notebook is available by clicking on the Binder badge above.

Continue reading