I have not written a post in a long time, so I am resuming this blog by keeping you posted on some of the projects I have been working on.
In general, I have been developing or contributing to several research applications over the last year. I learned a lot about software engineering and designing applications. Maybe I will find the time to write a post on some of the things I learned along the way.
In one of my recent projects, I needed to accelerate a discrete choice dynamic programming model. After I changed a part of the implementation, the program was indeed faster. But the most expensive operation according to profiling with snakeviz was now
~:0(<method 'copy' of 'numpy.ndarray' objects>). I was puzzled, as I was sure that there was no use of
np.copy() at all. After reading some StackOverflow posts and blog entries, it became clear that some operations and, more importantly, some indexing methods return copies instead of views. The difference between the two is that a view refers to the same underlying data in memory whereas a copy creates a new object. The disadvantages of a copy are the additional memory allocation and the time spent duplicating the data.
But, what operations return copies?
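As a quick illustration (my own toy example, not the project's code): basic slicing returns a view on the same buffer, while integer ("fancy") and boolean indexing return copies. `np.shares_memory` makes the difference visible:

```python
import numpy as np

a = np.arange(10)

view = a[2:5]        # basic slicing -> view on the same underlying data
copy = a[[2, 3, 4]]  # integer (fancy) indexing -> freshly allocated copy

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

# Writing through the view mutates the original array; the copy is detached.
view[0] = 99
print(a[2])  # 99
```

The same logic applies to boolean masks like `a[a > 5]`, which also allocate a new array.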
Last semester, I took a time series course where we implemented models like the Hodrick-Prescott filter or structural vector autoregressive processes in Julia. The whole thing is available online, with the notebooks running on Binder, which allows you to go through the programming examples in your browser. If you plan to use Julia yourself and want to play around, it might be a good place to start.
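The course material is in Julia, but the idea behind the HP filter is easy to sketch in a few lines of Python (a minimal illustration of the textbook formula, not the course implementation): the filter splits a series y into trend and cycle by solving (I + λD'D)τ = y, where D is the second-difference matrix and λ is the smoothing parameter.

```python
import numpy as np

def hp_filter(y, lam=1600.0):
    """Split a series into trend and cycle via the Hodrick-Prescott filter.

    Solves (I + lam * D'D) trend = y, where D takes second differences.
    lam=1600 is the conventional value for quarterly data.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Build the (n-2, n) second-difference matrix D.
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    trend = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    return trend, y - trend
```

A sanity check: for a perfectly linear series all second differences are zero, so the trend equals the series and the cycle vanishes. A production version would use sparse matrices for long series.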
This DAG is produced by a sample project for reproducible research, https://github.com/hmgaudecker/econ-project-templates. I extended this template with the templating engine cookiecutter and various other software engineering tools.
In 2015, I wrote my Bachelor's thesis on identifying software patents. This was useful and necessary in two ways. First, there is no official system that sorts patents along this dimension. The main classification used by the USPTO focuses on technological and functional form: a subclass dealing with dispensing solids contains both manure spreaders and toothpaste tubes. Researchers, in contrast, are more interested in topics like automation or software. Second, the project is how I learned Python and made my first steps into the world of machine learning.
You can find the whole project on GitHub, as well as the paper. There is also a script to download the different kinds of data sets. The raw data uses approximately 90GB of disk space, whereas the data for replicating the previous results, which are based on a simple algorithm, is currently less than 1GB.
Now, let us see what has been done so far.
This article shows how to compile and distribute R packages on anaconda.org to
be used in your data science projects. This is useful because R does not really
have a dependency-pinning tool as neat as Python's
conda, and R is shipped with
conda anyway. But
if you want to use the MKL-accelerated Microsoft R Open instead of plain R,
there are some packages which are currently not provided on
conda-forge. Here is how to overcome this obstacle.
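The basic workflow with conda-build looks roughly like this (a sketch of the standard commands; the package name `data.table` is just an example, and the exact output path depends on your conda installation):

```shell
# Generate a conda recipe from a CRAN package (requires conda-build).
conda skeleton cran data.table

# Build the package from the generated r-data.table/ recipe directory.
conda build r-data.table

# Upload the built artifact to your anaconda.org channel
# (requires anaconda-client and a prior `anaconda login`).
anaconda upload /path/to/conda-bld/r-data.table-*.tar.bz2
```

These commands depend on a working conda installation and network access, so treat them as a template rather than a copy-paste script.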