Special thanks to John Coglianese for feedback and for supplying the list of "vital" Stata commands. Feedback and requests for additions to the list are always welcome!
The official Pandas documentation includes a "Comparison with Stata" page which is another great resource.
1 Intro/Note on Notation
Coding in Python is a little different than coding in Stata.
In Stata, you have one dataset in memory. The dataset is a matrix where each column is a "variable" with a unique name and each row has a number (the special variable `_n`). Everything in Stata is built around this paradigm.
Python is a general purpose programming language where a "variable" is not a column of data. Variables can be anything, a single number, a matrix, a list, a string, etc. The Pandas package implements a kind of variable called a DataFrame that acts a lot like the single dataset in Stata. It is a matrix where each column and each row has a name. The key distinction in Python is that a DataFrame is itself a variable and you can work with any number of DataFrames at one time. You can think of each column in a DataFrame as a variable just like in Stata, except that when you reference a column, you also have to specify the DataFrame.
The Stata-to-Python translations below are written assuming that you have a single DataFrame called `df`. Placeholders like `<varname>` and `<dtafile>` show where user-specified values go in each language. Note that in many cases, `<varname>` will be simple text in Stata (e.g., `avg_income`) while in Python it will be a string (`'avg_income'`). If you were to write `df[avg_income]` without quotes, Python would go looking for a variable--a list, a number, a string--that's been defined somewhere else. Because of this, `<varlist>` in Python represents a list of variable names: `['educ', 'income', 2017]`.
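For example, here's a minimal sketch with made-up data showing how column references work:

import pandas as pd

# A DataFrame is just another variable; you can have several in memory at once
df = pd.DataFrame({'educ': [12, 16, 18],
                   'avg_income': [30000, 50000, 70000]})

income = df['avg_income']            # a single column, referenced by a string
subset = df[['educ', 'avg_income']]  # a <varlist> is a list of column names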
2 Note on composability
One concept that will push your Python skills forward quickly is "composability." That is, you can combine base-level commands to "create" whole new commands. This is a little more difficult in Stata where each line of code acts on an entire dataset.
For example, let's say you know two commands in Python/pandas. First, `df[<condition>]`, which returns the rows of DataFrame `df` for which the Boolean `<condition>` is True. This is like `keep if <condition>` in Stata, except `df[<condition>]` is itself a DataFrame that can be acted upon independently of `df` itself. This means that you now know the general `if <condition>` syntax for every other pandas command.
So let's say the second command you know is `df.describe()`, which is the equivalent of Stata's `summarize`. This command doesn't have its own dedicated `if` qualifier. It doesn't need one. All you need is `df[<condition>].describe()`.
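Composing the two (with a hypothetical column `a`):

# Summary stats computed only on the rows where a > 7
df[df['a'] > 7].describe()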
In this way, you can create (i.e., "compose") new commands. You may notice that sometimes Python/pandas will require composed commands where Stata uses a one-liner (for example, see Stata's `drop varstem*` below). While it may be annoying to type a few more characters, this is actually a good thing! It means that there are fewer base commands to learn in pandas. It's like learning an alphabet with 26 letters and composing millions of words instead of learning millions of individual hieroglyphs.
This also means that sometimes there is more than one way to do things. Sometimes, it doesn't matter which way you do it. Other times, one way of doing things will execute more quickly. You can use the `%timeit` magic command in IPython to easily test which one is faster. But this will usually only be an issue with larger data sets (usually several gigabytes). For example, in large groupby operations, `.shift(fill_value=0)` is much, much slower than `.shift().fillna(0)` even though the end result is the same.
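For instance, you could time the two versions in IPython like this (assuming a DataFrame with hypothetical columns `group` and `x`):

In [1]: %timeit df.groupby('group')['x'].shift(fill_value=0)
In [2]: %timeit df.groupby('group')['x'].shift().fillna(0)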
3 Input/Output
Stata | Python |
---|---|
`log using <file>` | Python doesn't display results automatically like Stata. You have to explicitly call the `print` function. Using a Jupyter notebook is the closest equivalent. |
`help <command>` | `help(<command>)`, or `<command>?` in IPython/Jupyter |
`cd some/other/directory` | `import os; os.chdir('some/other/directory')`, but this is bad practice. Better practice is to use full pathnames whenever possible. |
`use my_file` | `import pandas as pd; df = pd.read_stata('my_file.dta')` |
`use var1 var2 using my_file` | `df = pd.read_stata('my_file.dta', columns=['var1', 'var2'])` |
`import excel using <excelfile>` | `df = pd.read_excel('<excelfile>')` |
`import delimited using my_file.csv` | `df = pd.read_csv('my_file.csv')` |
`save my_file, replace` | `df.to_stata('my_file.dta')` |
`outsheet using my_file.csv, comma` | `df.to_csv('my_file.csv')` |
`export excel using <excel_name>` | `df.to_excel('<excel_name>')` |
4 Sample Selection
Stata | Python |
---|---|
`keep if <condition>` | `df = df[<condition>]` |
`keep if a > 7` | `df = df[df['a'] > 7]` |
`drop if <condition>` | `df = df[~(<condition>)]`, where `~` is the logical negation operator in pandas and numpy (and bitwise negation for Python more generally) |
`keep if _n == 1` | `df = df.iloc[0, :]` |
`keep if _n == _N` | `df = df.iloc[-1, :]` |
`keep if _n == 7` | `df = df.iloc[6, :]` (remember to count from 0) |
`keep if _n <= 10` | `df = df.iloc[:10, :]` (the end of a slice is exclusive, so `:10` keeps the first ten rows, numbered 0 through 9) |
`keep var` | `df = df['var']` |
`keep var1 var2` | `df = df[['var1', 'var2']]` |
`keep varstem*` | `df = df.filter(like='varstem')` |
`drop var` | `df = df.drop('var', axis=1)` |
`drop var1 var2` | `df = df.drop(['var1', 'var2'], axis=1)` |
`drop varstem*` | `df = df.drop(df.filter(like='varstem').columns, axis=1)` |
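And because these commands compose, row and column selection can be combined with `.loc` (hypothetical column names):

# Keep rows where a > 7 and keep only the columns var1 and var2
df = df.loc[df['a'] > 7, ['var1', 'var2']]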
5 Data Info and Summary Statistics
Stata | Python |
---|---|
`describe` | `df.info()` or `df.dtypes` just to get data types. Note that Python does not have value labels like Stata does. |
`describe var` | `df['var'].dtype` |
`count` | `len(df)` or `df.shape[0]` |
`count if <condition>` | `(<condition>).sum()`, e.g., `(df['a'] > 7).sum()` |
`summ var` | `df['var'].describe()` |
`summ var if <condition>` | `df.loc[<condition>, 'var'].describe()` |
`summ var [aw = <weight>]` | Right now you have to calculate weighted summary stats manually. There are also some tools available in the Statsmodels package. |
`summ var, d` | `df['var'].describe()` plus `df['var'].quantile([.1, .25, .5, .75, .9])` or whatever other statistics you want. |
`tab var` | `df['var'].value_counts()` |
`tab var1 var2` | `pd.crosstab(df['var1'], df['var2'])` |
`tab <var1> <var2>, summarize(<func>)` | `df.groupby(['var1', 'var2']).agg(<func>).unstack('var2')` |
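A minimal sketch of the last row, with made-up data and the mean as the summary function:

import pandas as pd

df = pd.DataFrame({'var1': ['a', 'a', 'b', 'b'],
                   'var2': [1, 2, 1, 2],
                   'var3': [10., 20., 30., 40.]})

# Mean of var3 within each var1-by-var2 cell, like
# `tab var1 var2, summarize(var3)` showing means
table = df.groupby(['var1', 'var2'])['var3'].mean().unstack('var2')
print(table)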
6 Variable Manipulation
Stata | Python |
---|---|
`gen newvar = <expression>` | `df['newvar'] = <expression>` |
`gen newvar = oldvar + 7` | `df['newvar'] = df['oldvar'] + 7` |
`gen newvar = <expression> if <condition>` | `df.loc[<condition>, 'newvar'] = <expression>`. As with Stata, the rows of `df` that don't meet the condition will be missing (`numpy.nan`). |
`replace var = <expression> if <condition>` | `df.loc[<condition>, 'var'] = <expression>` |
`rename var newvar` | `df = df.rename(columns={'var': 'newvar'})`. You can also directly manipulate `df.columns` like a list: `df.columns = ['a', 'b', 'c']`. |
`inlist(var, <val1>, <val2>)` | `df['var'].isin((<val1>, <val2>))` |
`inrange(var, <val1>, <val2>)` | `df['var'].between(<val1>, <val2>)` |
`subinstr(<str>, " ", "_", .)` | `df['var'].str.replace(' ', '_')` |
`egen newvar = count(var)` | `newvar = df['var'].notnull().sum()`. NOTE: For these `egen` commands, `newvar` is a full (constant) column in Stata, while it is a scalar in Python. If you want a full constant column on your DataFrame, you can do `df['newvar'] = 7` or whatever the constant is. |
`egen <newvar> = max(var)` | `<newvar> = df['var'].max()` |
`egen <newvar> = mean(var)` | `<newvar> = df['var'].mean()` |
`egen <newvar> = total(var)` | `<newvar> = df['var'].sum()` |
`egen <newvar> = group(var1 var2)` | `<newvar> = econtools.group_id(df, cols=['var1', 'var2'])`. Please see the documentation for `group_id`. |
`egen newvar = <stat>(var), by(groupvar1 groupvar2)` | `df['newvar'] = df.groupby(['groupvar1', 'groupvar2'])['var'].transform('<stat>')` |
`collapse (sd) var (median) var /// (max) var (min) var, /// by(groupvar1 groupvar2)` | `df.groupby(['groupvar1', 'groupvar2'])['var'].agg(['std', 'median', 'max', 'min'])` |
`collapse (<stat>) var [iw = <weight>]` | Manually, or maybe Statsmodels has a tool. |
`collapse (mean) var1 var2, by(groupvar1 groupvar2)` | `df.groupby(['groupvar1', 'groupvar2'])[['var1', 'var2']].mean()` |
`recode var (1/5 = 1)` | N/A, see note below. |
`recode var (1/5 = 1), gen(<newvar>)` | N/A. |
`label var var <label>` | N/A. |
`label define <labelname> 1 <valuelabel>` | N/A. |
`label values var <labelname>` | N/A. |
`label list <labelname>` | N/A. |
Python doesn't have "labels" built into DataFrames like Stata does. However, you can use a dictionary to map data values to labels when necessary.
variable_labels = {
1: "First Category",
2: "Second Category",
3: "Last Category",
}
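For example, `.map` will attach these labels as a new column (assuming the coded values live in a column called `var`):

df['var_label'] = df['var'].map(variable_labels)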
7 Bysort
All `bysort` and `egen` commands will be compositions using `groupby.agg` or `groupby.transform`.
Stata | Python |
---|---|
`bys group_var1 group_var2: gen group_sum = sum(var)` | `df['group_sum'] = df.groupby(['group_var1', 'group_var2'])['var'].cumsum()` (in Stata, `gen` with `sum()` is a running sum) |
`bys group_var1 group_var2 (sort_var): keep if _n==1` | `df = df.sort_values(['group_var1', 'group_var2', 'sort_var']).drop_duplicates(['group_var1', 'group_var2'])` |
`bys group_var1 group_var2 (sort_var): keep if _n==j` where `j` is some arbitrary number | `df = (df.sort_values(['group_var1', 'group_var2', 'sort_var']).groupby(['group_var1', 'group_var2']).apply(lambda x: x.iloc[j, :]))` (remember to count from 0) |
`bys group_var1 group_var2 (sort_var): gen jth_val = var[j]` where `j` is some arbitrary number | `df['jth_val'] = (df.sort_values(['group_var1', 'group_var2', 'sort_var']).groupby(['group_var1', 'group_var2'])['var'].transform(lambda x: x.iloc[j]))` |
Note that `groupby` was also used above to accomplish a `collapse`. Like a Stata `collapse`, a pandas groupby-and-aggregate reduces the size of the data set. If you want to instead execute what's called a window function in SQL, which does not reduce the shape of the original data set, you can call `df.groupby(<groupvars>).transform(<function>)` instead of `df.groupby(<groupvars>).agg(<function>)`.
If for some reason `transform` doesn't work in your use case, you can also re-join the collapsed data back onto your original data set:
df = pd.read_csv('my_data.csv')
state_pop_sum = df.groupby('state')['population'].sum()
df = df.join(state_pop_sum.to_frame('state_pop_sum'), # Give the new column a name besides 'population'
on='state')
8 Panel Data
There is no general equivalent to `tsset` in Python. However, you can accomplish most if not all of the same tasks using a DataFrame's index (the row-wise equivalent of column names). In Stata, the "DataFrame" in memory always has the observation row number, denoted by the Stata built-in variable `_n`. In Python and pandas, a DataFrame index can be anything (though you can also refer to rows by the row number; see `.loc` vs `.iloc`). It can also be hierarchical with multiple levels. It is a much more general tool than `tsset`.
Stata | Python |
---|---|
`tsset panelvar timevar` | `df = df.set_index(['panelvar', 'timevar'])` |
`L.var` | `df['var'].shift()` NOTE: The index must be correctly sorted for `shift` to work the way you want it to. You will also probably need to use a `groupby`; see below. |
`L2.var` | `df['var'].shift(2)` |
`F.var` | `df['var'].shift(-1)` |
8.1 Examples
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df0 = pd.DataFrame({'var1': np.arange(6),
...: 'id': [1, 1, 2, 2, 3, 3],
...: 'period': [0, 1] * 3})
In [4]: print(df0)
var1 id period
0 0 1 0
1 1 1 1
2 2 2 0
3 3 2 1
4 4 3 0
5 5 3 1
In [5]: df = df0.set_index(['id', 'period']).sort_index()
In [6]: print(df)
var1
id period
1 0 0
1 1
2 0 2
1 3
3 0 4
1 5
In [7]: df['var1_lag'] = df.groupby(level='id')['var1'].shift()
In [8]: print(df)
var1 var1_lag
id period
1 0 0 NaN
1 1 0.0
2 0 2 NaN
1 3 2.0
3 0 4 NaN
1 5 4.0
In [9]: df['var1_for'] = df.groupby(level='id')['var1'].shift(-1)
In [10]: print(df)
var1 var1_lag var1_for
id period
1 0 0 NaN 1.0
1 1 0.0 NaN
2 0 2 NaN 3.0
1 3 2.0 NaN
3 0 4 NaN 5.0
1 5 4.0 NaN
9 Merging and Joining
NOTE: Merging in Python is like R, SQL, etc. See additional explanation below.
Stata | Python |
---|---|
`append using <filename>` | `df_joint = pd.concat([df1, df2])` (`df1.append(df2)` in older code; `DataFrame.append` is deprecated) |
`merge 1:1 <vars> using <filename>` | `df_joint = df1.merge(df2, on=<varlist>)` |
Merging with pandas DataFrames does not require you to specify "many-to-one" or "one-to-many". Pandas will figure that out based on whether the variables you're merging on are unique or not. However, you can specify what sub-sample of the merge to keep using the keyword argument `how`, e.g., `df_joint = df1.join(df2, how='left')`. `how='left'` is the default for `join`, while `how='inner'` is the default for `pd.merge`.
Pandas `how` | Stata `, keep()` | Intuition |
---|---|---|
`how='left'` | `keep(1 3)` | Keeps all observations in the "left" DataFrame. |
`how='right'` | `keep(2 3)` | Keeps all observations in the "right" DataFrame. |
`how='inner'` | `keep(3)` | Keeps observations that are in both DataFrames. |
`how='outer'` | `keep(1 2 3)` | Keeps all observations. |
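If you miss Stata's `_merge` variable, `pd.merge` can produce one with `indicator=True`. A minimal sketch with made-up DataFrames:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'y': [5, 6, 7]})

# indicator=True adds a `_merge` column with values 'left_only',
# 'right_only', or 'both', much like Stata's _merge
df_joint = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(df_joint)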
10 Reshape
Like with merging, reshaping a DataFrame in Python is a bit different because of the paradigm shift from the "only one data table in memory" model of Stata to "a data table is just another object/variable" of Python. But this difference also makes reshaping a little easier in Python.
The most fundamental reshape commands in Python/pandas are `stack` and `unstack`:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: long = pd.DataFrame(np.arange(8),
...: columns=['some_variable'],
...: index=pd.MultiIndex.from_tuples(
...: [('a', 1), ('a', 2),
...: ('b', 1), ('b', 2),
...: ('c', 1), ('c', 2),
...: ('d', 1), ('d', 2)]))
In [4]: long.index.names=['unit_id', 'time']
In [5]: long.columns.name = 'varname'
In [6]: long
Out[6]:
varname some_variable
unit_id time
a 1 0
2 1
b 1 2
2 3
c 1 4
2 5
d 1 6
2 7
In [7]: wide = long.unstack('time')
In [8]: wide
Out[8]:
varname some_variable
time 1 2
unit_id
a 0 1
b 2 3
c 4 5
d 6 7
In [9]: long2 = wide.stack('time')
In [10]: long2
Out[10]:
varname some_variable
unit_id time
a 1 0
2 1
b 1 2
2 3
c 1 4
2 5
d 1 6
2 7
Here Input 3 creates a DataFrame, Input 4 gives each of the index columns a name, and Input 5 names the columns. Coming from Stata, it's a little weird to think of the column names themselves having a "name", but the columns names are just an index like the row names are. It starts to make more sense when you realize columns don't have to be strings. They can be integers, like years or FIPS codes. In those cases, it makes a lot of sense to give the columns a name so you know what you're dealing with.
Input 7 does the reshaping using `unstack('time')`, which takes the index `'time'` and creates a new column for every unique value it has. Notice that the columns now have multiple levels, just like the index previously did. This is another good reason to label your index and columns. If you want to access either of those columns, you can do so as usual, using a tuple to differentiate between the two levels:
In [11]: wide[('some_variable', 1)]
Out[11]:
unit_id
a 0
b 2
c 4
d 6
Name: (some_variable, 1), dtype: int32
If you want to combine the two levels (like Stata defaults to), you can simply rename the columns:
In [13]: wide_single_level_column = wide.copy()
In [14]: wide_single_level_column.columns = [
...: '{}_{}'.format(var, time)
...: for var, time in wide_single_level_column.columns]
In [15]: wide_single_level_column
Out[15]:
some_variable_1 some_variable_2
unit_id
a 0 1
b 2 3
c 4 5
d 6 7
The `pivot` command can also be useful, but it's a bit more complicated than `stack` and `unstack`; it's better to revisit `pivot` after you are comfortable working with DataFrame indexes and columns.
Stata | Python |
---|---|
`reshape <wide/long> <stubs>, i(<i_vars>) j(<j_var>)` | wide: `df.unstack(<level>)`; long: `df.stack('j_var')`; see also `df.pivot` |
11 Econometrics
Stata | Python |
---|---|
`ttest var1, by(var2)` | `from scipy.stats import ttest_ind; ttest_ind(array1, array2)` |
`xi: i.var` | `pd.get_dummies(df['var'])` |
`i.var2#c.var1` | `pd.get_dummies(df['var2']).multiply(df['var1'], axis=0)` |
`reg yvar xvar if <condition>, r` | `import econtools.metrics as mt; results = mt.reg(df[<condition>], 'yvar', 'xvar', robust=True)` |
`reg yvar xvar if <condition>, vce(cluster cluster_var)` | `results = mt.reg(df[<condition>], 'yvar', 'xvar', cluster='cluster_var')` |
`areg yvar xvar1 xvar2, absorb(fe_var)` | `results = mt.reg(df, 'yvar', ['xvar1', 'xvar2'], fe_name='fe_var')` |
`predict newvar, resid` | `newvar = results.resid` |
`predict newvar, xb` | `newvar = results.yhat` |
`_b[var]`, `_se[var]` | `results.beta['var']`, `results.se['var']` |
`test var1 var2` | `results.Ftest(['var1', 'var2'])` |
`test var1 var2, equal` | `results.Ftest(['var1', 'var2'], equal=True)` |
`lincom var1 + var2` | `econtools.metrics.f_test` with appropriate parameters |
`ivreg2` | `econtools.metrics.ivreg` |
`outreg2` | `econtools.outreg` |
`reghdfe` | None (hoping to add it to Econtools soon). |
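A quick illustration of the dummy-variable rows above, with made-up data (note the `axis=0`, which aligns the multiplication by row rather than by column):

import pandas as pd

df = pd.DataFrame({'var1': [1.0, 2.0, 3.0, 4.0],
                   'var2': ['a', 'b', 'a', 'b']})

# xi: i.var2 -- one dummy column per unique value of var2
dummies = pd.get_dummies(df['var2'])

# i.var2#c.var1 -- interact each dummy with the continuous var1
interactions = dummies.multiply(df['var1'], axis=0)
print(interactions)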
12 Plotting
Visualizations are best handled by the packages Matplotlib and Seaborn.
Stata | Python |
---|---|
`binscatter` | `econtools.binscatter` |
`maptile` | No quick tool, but easy to do with Cartopy. |
`coefplot` | `ax.scatter(results.beta.index, results.beta)` often works. Depends on context. |
`twoway scatter y_var x_var` | `df.plot.scatter(x='x_var', y='y_var')` |
`twoway scatter y_var x_var if <condition>` | `df[<condition>].plot.scatter(x='x_var', y='y_var')` |
`twoway <connected/line/area/bar/rarea>` | As above with appropriate parameters passed to the `df.plot` function. However, it is better to use matplotlib or seaborn directly and call `ax.plot(df['var1'], df['var2'])`. Like with `merge`, it's a different paradigm that needs more explanation. |
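A minimal matplotlib sketch of that last row (column names are placeholders):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(df['x_var'], df['y_var'])       # line plot
ax.scatter(df['x_var'], df['y_var'])    # scatter on the same axes
ax.set_xlabel('x_var')
ax.set_ylabel('y_var')
plt.show()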
13 Other differences
13.1 Missing values
In Python, missing values are represented by a NumPy "not a number" object, `np.nan`. In Stata, missing (`.`) is larger than every number, so `10 < .` yields True. In Python, `np.nan` is never equal to anything. Any comparison involving `np.nan` is always False, even `np.nan == np.nan`.
To look for missing values in DataFrame columns, use any of the following.

- `df[<varname>].isnull()` returns a vector of True and False values for each row of `df[<varname>]`.
- `df[<varname>].notnull()` is the complement of `.isnull()`.
- The function `np.isnan(<arraylike>)` takes an array and returns True or False for each element of the array (a DataFrame is a special type of array).
Another important difference is that `np.nan` is a floating point data type, so any column of a DataFrame that contains missing numbers will be floats. If a column of integers gets changed so that even one element is `np.nan`, the whole column will be converted to floats.
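A quick demonstration of both points:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: np.nan == np.nan
Out[3]: False
In [4]: pd.Series([1, 2, np.nan]).isnull()
Out[4]:
0    False
1    False
2     True
dtype: bool
In [5]: pd.Series([1, 2, 3]).dtype
Out[5]: dtype('int64')
In [6]: pd.Series([1, 2, np.nan]).dtype   # one missing value forces floats
Out[6]: dtype('float64')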
13.2 Floating point equality
In Stata, decimal numbers are never equal to anything, e.g., `3.0 == 3` is False. In Python, the above equality check returns True (float-integer comparisons are exact, not within a tolerance). However, you should be wary when doing large-scale equality checks using floats, like when merging two data sets using a variable/column that is a float. This is true for any language! (Read up on floating point error if you're curious why measuring equality of floats is tricky.)
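For example:

In [1]: 3.0 == 3
Out[1]: True
In [2]: 0.1 + 0.2 == 0.3            # floating point error makes this fail
Out[2]: False
In [3]: import numpy as np
In [4]: np.isclose(0.1 + 0.2, 0.3)  # compare within a tolerance instead
Out[4]: True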