Divide and conquer

Working scientifically with statistics software means that the analysis should be run from batch files, which Stata calls do-files. This matters because results can then be reproduced and, if errors are found, the analysis can be run anew. I use a set-up that divides the empirical research into several steps, which lets me reproduce my work, save time, and ensure that I can back up my research without much hassle.

The key is to divide the process into several steps. All steps together should lead to the results that are in the paper, starting from the original data that you used.

  1. The first step is to make sure that the original data is in a safe place. This is the data that the first do-file reads in order to produce the processed format. Keep a spare copy of the original data on a (secured) CD-ROM or USB hard disk.
    Most of the time, the file is opened, the variables needed are selected, and the data is saved. If the original data is not in Stata format, the do-file can use dictionaries or insheet commands to read in non-Stata formats.
    Often several original data files are opened successively and then combined, e.g. with Stata's merge command. In the Fokker project more than 20 different files were opened and combined.
  2. The second step is data cleaning and data reorganization. It is ideal to have a separate do-file that recodes errors, drops data that should not be included, and so on, because you can then easily check whether the choices made while analysing the data are still valid in hindsight, when you are close to finishing the analysis. If all the changes and calculations of new variables are in one place, it is easy to change them and apply them to the entire data set.
  3. The third step is the actual data description and analysis using figures, tables, regressions and other statistics. This third step is often split into several do-files. The split can follow the sections of a paper, categories of output (tables, figures, regressions), or topics of the analysis (duration of unemployment, wage losses, information passing).
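
As a rough illustration of the first two steps, the two do-files might look something like the sketch below. All file names, folder names, variable names and the cut-off year are hypothetical and only serve the example. The sketch uses import delimited, which replaces insheet in Stata 13 and later; with an older version, insheet would take its place.

```stata
* ---- project_step1.do: read the original data, build the working file ----
* All file and variable names below are hypothetical.
clear all
set more off

* Read a non-Stata (comma-separated) original file and keep what is needed
import delimited using "original/wages.csv", clear
keep person_id year wage
save "work/wages.dta", replace

* Open a second original file, selecting variables while loading,
* and combine it with the first one
use person_id year unemp_weeks using "original/spells.dta", clear
merge 1:1 person_id year using "work/wages.dta"
tabulate _merge                  // inspect how well the files match
keep if _merge == 3              // keep matched observations only (a choice)
drop _merge
save "work/step1.dta", replace

* ---- project_step2.do: all cleaning and recoding in one place ----
use "work/step1.dta", clear
replace wage = . if wage < 0     // treat negative wages as coding errors
drop if year < 1990              // assumed start of the study period
generate log_wage = ln(wage)
label variable log_wage "Log wage"
save "work/step2.dta", replace
```

Because every recode and every dropped observation sits in the second file, revisiting one of these choices later means changing a single line and rerunning the steps.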

Finally, I write a short do-file that calls the do-files of the three steps. For example, if I have only one do-file for each step, this master do-file would look something like this:

do project_step1.do
do project_step2.do
do project_step3.do
exit
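
A slightly extended variant, offered only as a sketch, also writes a log file so that the output of a full run is kept next to the do-files. The log file name master.log is an assumption, not part of the set-up described here.

```stata
* master.do: run all steps from the original data to the final results
clear all
set more off

capture log close                        // close a log left open by an earlier run
log using "master.log", replace text     // hypothetical log file name

do project_step1.do
do project_step2.do
do project_step3.do

log close
exit
```

The capture prefix keeps the do-file from stopping with an error when no log is open.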

It is often useful to save interim results after each step and to work with them, especially if preparing the data takes a lot of time. However, do yourself the favour of checking once in a while that your master do-file can indeed reproduce your current results “from scratch”. Only then is it sufficient to back up the do-files along with the original data.
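
One possible way to make that check, sketched here under the assumption that the second step saves its interim result as work/step2.dta (a hypothetical name), is to set the current interim file aside, rebuild it from the original data, and compare the two with Stata's cf command.

```stata
* check_from_scratch.do (hypothetical): does a fresh run give the same data?
clear all
set more off

* Set the interim file currently in use aside
copy "work/step2.dta" "work/step2_previous.dta", replace

* Rebuild the interim file from the original data
do project_step1.do
do project_step2.do

* Compare the rebuilt file with the old one, variable by variable;
* cf stops with an error if any values differ
use "work/step2.dta", clear
cf _all using "work/step2_previous.dta", verbose
```

cf compares the files observation by observation, so the sketch assumes that both runs leave the data in the same sort order; sorting on the identifying variables before each save is one way to arrange that.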
