Cleaning up messy (string) variables

Working on firm level data (again), I have the experience of cleaning up hundreds of different spelinngs of occupations that should eventually be categorized into a set of occupations that should only differ when actual different occupations are needed.Let me call the variable occupation.
34. slesar po remontu la 38. slesar po rem. la 44. slesar po rem. i obsluzh. vent. i kondicionirovaniya 54. slesar po rem. i obsluzh. ven. i kondicionirovaniya 146. slesar po rem. la 205. slesar po remontu agregatov 259. slesar po rem.agregatov 313. slesar po remontu kompressornyh ustanovok i oborudovaniya 343. slesar po remontu oborud
Wonderful mess, acutally only a minor part of the full data-set.

One of the problems is that occupations might be capitalized differently, i.e. slesar is not the same as Slesar, or for that matter sLesar.

So, one step is to change all letters to be small:
replace occupation = lower(occupation)

More problematic are white space within a string, or at the beginning and end. Quite typical for variables that come from typists, or systems that do not enforce a common coding. However, these differences are hard to find, even browsing through the data. To trim leading or trailing spaces you can use:
replace occuaption = trim(occupation)
Now the ” slesar”, ” sleasar “, and “slesar” are the same again (all are now “slesar”). For extra spaces within the string you have to use itrim:
replace occupation = itrim(occupation)

Now “slesar po remontu” and “slesar po remontu” are the same again, as all empty spaces are reduced to a single one: “slesar po remontu”.

Eventually, what I did was only one line of code:

replace occupation=trim(itrim(lower(occupation)))

StataTex Blog

Cleaning up messy (string) variables

Leave a Reply Cancel reply