Useful string functions in Stata (updated list)

When I search the internet for help with Stata, it is usually because I need to work with string variables (such as names). There are some very good summaries that cover aspects of string variables (e.g., this page). In this post, which will be updated continuously, we present an assorted selection of string functions that we think are extremely useful for Stata users.

Identify string variables

The following code (thanks Nick Cox!) creates a local macro strvars that stores the names of all variables that are defined as string variables:

ds, has(type string)
local strvars "`r(varlist)'"
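
As a sketch of how the stored list might be used, the loop below trims every string variable in one go. The toy dataset and its variable names (state, city, pop) are made up for illustration.

```stata
clear
input str10 state str10 city pop
" Arizona" "Phoenix " 1
"Texas "   " Austin"  2
end

* collect the names of all string variables
ds, has(type string)
local strvars "`r(varlist)'"

* apply trim() to each of them; the numeric variable pop is left alone
foreach v of local strvars {
    replace `v' = trim(`v')
}
list
```

Because the loop runs over the macro rather than over hard-coded names, the same lines should work unchanged on a dataset with different string variables.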

Trim strings

Uncleaned data often contains unwanted empty spaces (such as state=" Arizona"). To remove those, you can use Stata’s trim functions (see also here):

replace var=trim(var) // removes leading and trailing blanks (" Arizona", "Arizona ")
replace var=ltrim(var) // only left trim (rtrim does the same for the right hand side)
replace var=stritrim(var) // reduces multiple blanks between words to one

Alternatively, you can remove every blank, including those between words, this way:

replace state=subinstr(state," ","",.)
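
To make the difference between these approaches concrete, here is a small sketch that applies each one to the same literal string (the example value is made up). The square brackets are only there to make leading and trailing blanks visible.

```stata
* trim() drops the outer blanks but keeps the three inner ones
display "[" trim("  New   Mexico ") "]"

* stritrim() additionally collapses the inner blanks to a single one
display "[" stritrim(trim("  New   Mexico ")) "]"

* subinstr() with " " removes every blank, so the two words are fused
display "[" subinstr("  New   Mexico ", " ", "", .) "]"
```

The last line prints [NewMexico], which is why the subinstr() approach is best reserved for variables where no blank at all is expected, such as codes or IDs.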

String to numerical and vice versa

To convert numerical values that are stored as strings into a numeric variable, use

destring stringvar, replace

Alternatively, use gen(newvarname) instead of replace to keep the original variable. To convert numerical variables to string variables, use tostring instead.
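
A minimal round trip might look like the sketch below. The dataset and the variable names (price_str, price, price_back) are hypothetical and only serve as an illustration.

```stata
clear
input str5 price_str
"12"
"250"
"1000"
end

* string -> numeric, keeping the original variable
destring price_str, generate(price)

* numeric -> string, again stored in a new variable
tostring price, generate(price_back)

describe
```

Using generate() in both directions leaves the source variable untouched, which makes it easy to compare the result with the original before dropping anything.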

Identify string components in otherwise numerical variables

A common problem when using destring is that some observations contain non-numerical pieces (such as * or anything else). One solution is to use the force option when destringing your variable, which turns those values into missing values. An alternative is to first tabulate all values that contain non-numerical characters:

gen notnumeric = real(number) == . 
tab number if notnumeric == 1 

The dummy variable notnumeric is 1 whenever real() cannot convert a value to a number, which is the case for values that contain a non-numeric character and also for empty strings. Tabulating these values usually helps to identify why a destring did not work.
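
The following sketch runs this check on a made-up variable with two problematic entries and then shows what destring with force does to them. The variable name number_num is an assumption for illustration.

```stata
clear
input str6 number
"12"
"7*"
"n/a"
"3.5"
end

* real() returns missing when the string cannot be read as a number
gen notnumeric = real(number) == .
tab number if notnumeric == 1

* force converts the flagged values to missing instead of refusing to convert
destring number, generate(number_num) force
list
```

Looking at the tabulated values first makes it possible to decide whether a stray character can simply be cleaned up or whether the value really should become missing.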

Calculate the number of words in a string variable

To count the blank-separated words in a string variable, use Stata’s wordcount() function:

gen NEWVAR=wordcount(LONG_STRING_VAR)
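
As a quick illustration on made-up data (the variable names title and nwords are hypothetical):

```stata
clear
input str40 title
"Stata"
"Useful string functions"
"Working with string variables in Stata"
end

* number of blank-separated words per observation
gen nwords = wordcount(title)
list
```

Here nwords is 1, 3 and 6 for the three observations.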

Extract the first word of a string variable

To extract the first word of a string variable that contains several words, use the following code. Note 1: you can replace " " with any other separator (such as a dash); Note 2: the second line is only needed if there are observations with only one word, because strpos() returns 0 for them and the first line then produces an empty string.

gen NEWVAR=substr(LONG_STRING_VAR,1, strpos(LONG_STRING_VAR, " ") - 1)
replace NEWVAR=LONG_STRING_VAR if wordcount(LONG_STRING_VAR)==1
replace NEWVAR=trim(NEWVAR)
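
For blank-separated words, Stata's built-in word() function offers a shorter route. The sketch below applies both approaches to a made-up variable and checks that they agree; the names name, first and first_alt are hypothetical.

```stata
clear
input str30 name
"Ada Lovelace"
"Plato"
"Grace Brewster Hopper"
end

* approach from the text: cut at the first blank, then patch one-word values
gen first = substr(name, 1, strpos(name, " ") - 1)
replace first = name if wordcount(name) == 1

* alternative: word(s, 1) returns the first blank-separated word directly
gen first_alt = word(name, 1)

* both approaches should give the same result on this data
assert first == first_alt
list
```

Note that word() only splits on blanks, so the substr()/strpos() version remains the one to adapt when the separator is a dash or another character.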
