When I search the internet for help with Stata, it is most often because I need to work with string variables (such as names). There are some very good summaries that cover aspects of string variables (e.g., this page). In this post, which will be continuously updated, we present an assortment of string functions that we think are extremely useful for Stata users.
Identify string variables
The following code (thanks, Nick Cox!) creates a local macro strvars that stores the names of all variables defined as string variables:
ds, has(type string)
local strvars "`r(varlist)'"
Uncleaned data often contain unwanted blanks (such as state=" Arizona"). To remove them, you can use Stata’s trim functions (see also here):
replace var=trim(var) // removes leading and trailing blanks (" Arizona", "Arizona ")
replace var=ltrim(var) // only left trim (rtrim does the same for the right-hand side)
replace var=stritrim(var) // reduces multiple blanks between words to one
Alternatively, you can remove every blank, including those between words, this way:
replace state=subinstr(state," ","",.)
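The trim functions can be combined with the strvars local from the first section to clean every string variable in one pass. The following is a minimal sketch, assuming Stata's shipped auto example dataset is used as stand-in data (that choice is ours, not part of the original tip):

```stata
* Assumption: the shipped example dataset stands in for your own data
sysuse auto, clear

* Collect the names of all string variables
ds, has(type string)
local strvars "`r(varlist)'"

* Trim leading/trailing blanks and collapse repeated inner blanks
foreach v of local strvars {
    replace `v' = stritrim(trim(`v'))
}
```

If the dataset has no string variables, the local is empty and the loop simply does nothing.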
String to numerical and vice versa
To convert numerical values stored as strings into a numeric variable, use
destring stringvar, replace
Alternatively, use gen(newvarname) instead of replace to keep the original variable. To convert numerical variables to string variables, use tostring.
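As a minimal sketch of the reverse direction, tostring accepts the same gen() and replace options. The toy data and the variable names (id, zipcode, zip_str, zip_padded) are hypothetical and only serve as illustration:

```stata
* Hypothetical toy data: a numeric ZIP code that has lost its leading zero
clear
input id zipcode
1 2138
2 90210
end

* Keep the numeric original and write the string version to a new variable
tostring zipcode, gen(zip_str)

* Optionally, a numeric format can pad the result with leading zeros
tostring zipcode, gen(zip_padded) format(%05.0f)
list
```

Using gen() rather than replace keeps the original variable around, which makes it easy to check the conversion.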
Identify string components in otherwise numerical variables
A common problem when using destring is that some observations might contain non-numerical characters (such as * or anything else). One solution is to specify the force option when destringing your variable, which converts every value that is not purely numeric to missing. An alternative is to first tabulate all values that contain non-numerical characters:
gen notnumeric = real(number) == .
tab number if notnumeric == 1
The dummy variable notnumeric is 1 whenever a value cannot be converted to a number, for instance because it contains a non-numeric character. Tabulating these values usually helps to identify why a destring did not work.
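Once the offending characters are known, one possible follow-up is destring's ignore() option, which strips the listed characters before converting. The sketch below uses hypothetical toy data and assumes that "*" is the only problematic character:

```stata
* Hypothetical toy data: numbers stored as strings, one flagged with "*"
clear
input str10 number
"12"
"7*"
"3.5"
end

* Flag values that real() cannot convert to a number
gen notnumeric = real(number) == .
tab number if notnumeric == 1

* Assumption: "*" is the only offending character, so it can be ignored
destring number, gen(number_num) ignore("*")
list
```

Unlike force, this keeps the numeric part of a flagged value instead of setting it to missing, so it is only appropriate when the stripped character carries no meaning you need.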
Calculate the number of words in a string variable
To calculate the number of blank-separated words in a string variable, use Stata’s wordcount() function.
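A minimal sketch of counting words with wordcount(); the toy data and the variable names fullname and nwords are hypothetical:

```stata
* Hypothetical toy data: a string variable with a varying number of words
clear
input str30 fullname
"Ada Lovelace"
"Grace Brewster Hopper"
"Plato"
end

* wordcount() returns the number of blank-separated words in each value
gen nwords = wordcount(fullname)
list fullname nwords
```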
Extract the first word of a string variable
To extract the first word of a multi-word string variable, use the following code. Note 1: you can replace " " with any other separator (such as a dash). Note 2: the second line is only needed if there are observations with only one word.
gen NEWVAR=substr(LONG_STRING_VAR,1, strpos(LONG_STRING_VAR, " ") - 1)
replace NEWVAR=LONG_STRING_VAR if wordcount(LONG_STRING_VAR)==1
replace NEWVAR=trim(NEWVAR)
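When the separator is a blank, a possible shortcut is Stata's word() function, which returns the n-th blank-separated word directly. This is an alternative we are suggesting, not the approach shown above; the toy data and the variable name FIRSTWORD are hypothetical:

```stata
* Hypothetical toy data: values with one or several words
clear
input str30 LONG_STRING_VAR
"New York City"
"Arizona"
end

* word(s, 1) returns the first blank-separated word of s
gen FIRSTWORD = word(LONG_STRING_VAR, 1)
list
```

Because word() only splits on blanks, the substr()/strpos() approach remains the one to use for other separators such as a dash.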