When I search the internet for help with Stata, it is most often because I need to work with string variables (such as names). There are some very good summaries that cover aspects of string variables (e.g., this page). In this post, which will be continuously updated, we present an assortment of string functions that we think are extremely useful for Stata users.
Table of Contents
- Identify string variables
- Trim strings
- String to numerical and vice versa
- Identify string components in otherwise numerical variables
- Calculate the number of words in a string variable
- Extract the first word of a string variable
- String functions for renaming variables
Identify string variables
The following code (thanks Nick Cox!) creates a local strvars that stores the names of all variables that are defined as string variables:
ds, has(type string)
local strvars "`r(varlist)'"
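As a small illustration of what the local is good for, the sketch below loops over every string variable and trims it. It assumes Stata's shipped auto dataset (loaded with sysuse), in which make is the only string variable; the loop itself is just one possible use, not part of the original tip.

```stata
* Minimal sketch: collect all string variables, then clean each one
sysuse auto, clear

ds, has(type string)
local strvars "`r(varlist)'"
display "String variables: `strvars'"

* Apply the same cleaning step to every string variable found
foreach v of local strvars {
    replace `v' = trim(`v')
}
```

If the dataset has no string variables, the local is empty and the loop simply does nothing.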
Trim strings
Uncleaned data often contains unwanted blanks (such as state=" Arizona"). To remove those, you can use Stata’s trim functions (see also here):
replace var=trim(var) // removes leading and trailing blanks (" Arizona", "Arizona ")
replace var=ltrim(var) // only left trim (rtrim does the same for the right-hand side)
replace var=stritrim(var) // reduces multiple blanks between words to one
Alternatively, you can remove every blank, including those between words, this way:
replace state=subinstr(state," ","",.)
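To see how these variants differ, here is a small sketch on made-up data. The variable names trimmed, collapsed, and noblanks are hypothetical and chosen only for this example.

```stata
* Toy data with leading, trailing, and repeated internal blanks
clear
set obs 2
gen str20 state = " Arizona"
replace state = "  New   Mexico " in 2

gen trimmed   = trim(state)                 // "Arizona", "New   Mexico"
gen collapsed = stritrim(trim(state))       // "Arizona", "New Mexico"
gen noblanks  = subinstr(state, " ", "", .) // "Arizona", "NewMexico"

list, clean
```

Note that the subinstr approach also glues multi-word values together, so it is usually only appropriate for codes or identifiers, not for names.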
String to numerical and vice versa
To transform numerical values stored as strings, use
destring stringvar, replace
Alternatively, use gen(newvarname) instead of replace to keep the original variable and store the result in a new one. To transform numerical variables into string variables, use tostring instead.
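A minimal round trip might look like the sketch below. The data and the variable names price, price_num, and price_str are invented for illustration.

```stata
* Toy data: numbers stored as strings
clear
set obs 3
gen str5 price = "10"
replace price = "20" in 2
replace price = "7" in 3

* String -> numeric, keeping the original variable
destring price, gen(price_num)

* Numeric -> string, again into a new variable
tostring price_num, gen(price_str)

describe price price_num price_str
```

Using gen() in both directions keeps the original column around, which makes it easy to check that nothing was lost in the conversion.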
Identify string components in otherwise numerical variables
A common problem when using destring is that some observations might contain non-numerical pieces (such as * or anything else). One solution would be to use the force option when destringing your variable, which sets every value that cannot be converted to missing. An alternative is to tabulate all variable values that contain non-numerical characters:
gen notnumeric = real(number) == .
tab number if notnumeric == 1
The dummy variable notnumeric is 1 whenever real() cannot convert the value to a number, which is the case if the value contains a non-numeric character (or is empty). Tabulating these values usually helps to identify why a destring did not work.
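The sketch below runs this check on invented data and then shows what force does to the offending value. The variable number_forced is a hypothetical name used only here.

```stata
* Toy data: one value carries a stray asterisk
clear
set obs 3
gen str6 number = "12"
replace number = "7*" in 2
replace number = "3.5" in 3

* Flag values that real() cannot convert
gen notnumeric = real(number) == .
tab number if notnumeric == 1

* With force, the unconvertible value becomes missing
destring number, gen(number_forced) force
list, clean
```

Inspecting the flagged values first makes it clear what force would silently discard.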
Calculate the number of words in a string variable
To calculate the number of blank-separated words in a string variable, use Stata’s wordcount() function:
gen NEWVAR=wordcount(LONG_STRING_VAR)
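For a quick check of how the count behaves with extra blanks, consider this sketch on made-up values:

```stata
* Toy data: the second value has leading, trailing, and repeated blanks
clear
set obs 2
gen str30 LONG_STRING_VAR = "New Mexico"
replace LONG_STRING_VAR = "  District of   Columbia " in 2

gen NEWVAR = wordcount(LONG_STRING_VAR)
list, clean
```

Surplus blanks do not inflate the count: the two observations get 2 and 3 words.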
Extract the first word of a string variable
To extract the first word of a multi-word string variable, use the following code. Note 1: you can replace " " with any other separator (such as a dash); Note 2: the second line is only needed if there are observations with only one word.
gen NEWVAR=substr(LONG_STRING_VAR,1, strpos(LONG_STRING_VAR, " ") - 1)
replace NEWVAR=LONG_STRING_VAR if wordcount(LONG_STRING_VAR)==1
replace NEWVAR=trim(NEWVAR)
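When the separator is a blank, Stata's word() function offers a shorter route. The sketch below is an alternative to the substr/strpos approach, not a replacement for it: word() only splits on blanks, so the substr version is still needed for other separators such as a dash. The data are invented.

```stata
* Toy data: multi-word, single-word, and messy-blank cases
clear
set obs 3
gen str30 LONG_STRING_VAR = "New Mexico"
replace LONG_STRING_VAR = "Arizona" in 2
replace LONG_STRING_VAR = "  North   Dakota" in 3

* word(s, n) returns the n-th blank-separated word of s
gen NEWVAR = word(LONG_STRING_VAR, 1)
list, clean
```

Single-word values and leading blanks need no special handling here.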
String functions for renaming variables
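To remove a suffix from a list of variable names (this works equally well with prefixes if you loop over prefix_* and anchor the pattern with ^ instead of $):
foreach varname of varlist *_suffix {
    local newname=regexr("`varname'","_suffix$","") // $ anchors the match at the end of the name
    rename `varname' `newname'
}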
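In Stata 12 and later, the group syntax of rename can do the same in one line. This is a sketch of that alternative on invented variables (income_suffix and age_suffix are hypothetical names), assuming a Stata version that supports wildcard renaming.

```stata
* Toy data with two variables sharing a suffix
clear
set obs 1
gen income_suffix = 1
gen age_suffix = 2

* Wildcard rename: the * on the right stands for whatever * matched on the left
rename *_suffix *

describe
```

The loop is still useful when the new name needs more than a simple prefix or suffix change, for example a regular-expression substitution in the middle of the name.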