Data (I/O) Tools
load_or_build
Basic Usage
In Python, a decorator is a function that wraps another function to alter its
behavior. Decorators are applied using @ like so:
@decorator
def my_function():
    print("I'm not sure I trust that instrument.")
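To make the @ syntax concrete, here is a minimal sketch of a decorator. The names announce and add are hypothetical and exist only for this illustration; they are not part of econtools.

```python
from functools import wraps


def announce(func):
    """Hypothetical decorator: print a message, then call `func`."""
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__}')
        return func(*args, **kwargs)
    return wrapper


@announce
def add(a, b):
    return a + b


# The @announce line above is shorthand for: add = announce(add)
result = add(1, 2)
```

Calling add(1, 2) runs wrapper, which prints the message and then hands off to the original function.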
load_or_build() is a decorator for functions that return data (either a pandas
Series or DataFrame). It caches that data to disk.
Let’s say that you have a function prep_data that preps a dataset for
regressions.
import numpy as np
import pandas as pd

def prep_data():
    df = pd.read_csv('my_data.csv')
    df['wage_sq'] = df['wage'] ** 2
    df['ln_wage'] = np.log(df['wage'])
    # Manipulate data in other ways...
    return df
Maybe this function takes a long time to run and you’d like to save its output
to your hard drive so you don’t have to re-process it every time. Instead of
copy/pasting the file path associated with this function wherever the data is
saved or loaded, you can just use load_or_build():
import numpy as np
import pandas as pd
from econtools import load_or_build

@load_or_build('reg_data.pkl')
def prep_data():
    df = pd.read_csv('my_data.csv')
    df['wage_sq'] = df['wage'] ** 2
    df['ln_wage'] = np.log(df['wage'])
    # Manipulate data in other ways...
    return df
Here, @load_or_build('reg_data.pkl') adds the following functionality to
prep_data:
- If there’s a file named 'reg_data.pkl', load it and return it.
- If that file doesn’t exist, run prep_data like usual, save a copy to
  'reg_data.pkl' for next time, then return the data.
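The two cases above can be sketched as an ordinary decorator. This is not the econtools implementation: the name load_or_build_sketch is hypothetical, and the sketch uses the standard-library pickle module on a plain dictionary instead of pandas so that it runs without extra dependencies.

```python
import os
import pickle
import tempfile
from functools import wraps


def load_or_build_sketch(filepath):
    """Hypothetical stand-in: cache the function's return value at `filepath`."""
    def decorator(build_func):
        @wraps(build_func)
        def wrapper(*args, **kwargs):
            # Case 1: a saved copy exists, so load and return it.
            if os.path.isfile(filepath):
                with open(filepath, 'rb') as f:
                    return pickle.load(f)
            # Case 2: no saved copy, so build, save for next time, and return.
            data = build_func(*args, **kwargs)
            with open(filepath, 'wb') as f:
                pickle.dump(data, f)
            return data
        return wrapper
    return decorator


# Demo inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    calls = []

    @load_or_build_sketch(os.path.join(tmp, 'reg_data.pkl'))
    def prep_data():
        calls.append('built')  # record each real build
        return {'wage': [10, 20]}

    first = prep_data()   # builds and saves
    second = prep_data()  # loads from disk, no second build
```

The second call returns the same data, but the body of prep_data runs only once.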
Dynamic Filenames
Let’s say your data function takes an argument, like year:
def load_census_data(year):
    # Load data for year `year`...
    return df
In this case, you’ll need a different file name on disk for each year.
load_or_build() handles this with Python’s named string formatting: put the
argument’s name in curly brackets in the file name, like so:
@load_or_build('census_data_{year}.pkl')
def load_census_data(year):
    # Load data for year `year`...
    return df

if __name__ == '__main__':
    # Loads from 'census_data_2010.pkl'
    df = load_census_data(2010)
This works for both positional arguments and keyword arguments.
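One way a decorator could fill in such a file name for both positional and keyword arguments is to bind the call's arguments to the function's parameter names and pass them to str.format. The sketch below assumes this approach; the helper resolve_filename is hypothetical and may differ from what econtools does internally.

```python
import inspect


def resolve_filename(template, func, *args, **kwargs):
    """Hypothetical helper: fill `template` with the arguments `func` receives."""
    # Map positional and keyword arguments onto the function's parameter names.
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # include parameters left at their default values
    return template.format(**bound.arguments)


def load_census_data(year):
    ...  # placeholder body; only the signature matters here


by_position = resolve_filename('census_data_{year}.pkl', load_census_data, 2010)
by_keyword = resolve_filename('census_data_{year}.pkl', load_census_data, year=2010)
```

Both calls resolve to the same file name because the positional 2010 is matched to the parameter named year before formatting.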
Special Keyword Switches
load_or_build() adds two special keyword arguments to functions it decorates.
- _rebuild (default False): If _rebuild=True, load_or_build() will re-build
  the data output by the function and overwrite any saved version on disk.
- _load (default True): If _load=False, load_or_build() will not look for
  saved data on disk and will only run the function as though you didn’t use
  load_or_build() in the first place.
Examples:
@load_or_build('census_data_{year}.pkl')
def load_census_data(year):
    # Load data for year `year`...
    return df

if __name__ == '__main__':
    # Loads from 'census_data_2010.pkl'
    df = load_census_data(2010)
    # Runs `load_census_data` and overwrites what's on disk
    df = load_census_data(2010, _rebuild=True)
    # Doesn't load file on disk, only runs `load_census_data`
    df = load_census_data(2010, _load=False)
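The behavior of the two switches can be sketched by adding keyword-only arguments to a caching wrapper. As before, this is an illustration under assumptions and not the econtools source: load_or_build_sketch is a hypothetical name, it uses pickle instead of pandas, and it takes a fixed file path to keep the example short.

```python
import os
import pickle
import tempfile
from functools import wraps


def load_or_build_sketch(filepath):
    """Hypothetical stand-in with `_rebuild` and `_load` switches."""
    def decorator(build_func):
        @wraps(build_func)
        def wrapper(*args, _rebuild=False, _load=True, **kwargs):
            # _load=False: act as if the function were not decorated at all.
            if not _load:
                return build_func(*args, **kwargs)
            # Use the saved copy unless a rebuild was requested.
            if os.path.isfile(filepath) and not _rebuild:
                with open(filepath, 'rb') as f:
                    return pickle.load(f)
            # Build (or rebuild) and overwrite whatever is on disk.
            data = build_func(*args, **kwargs)
            with open(filepath, 'wb') as f:
                pickle.dump(data, f)
            return data
        return wrapper
    return decorator


with tempfile.TemporaryDirectory() as tmp:
    builds = []

    @load_or_build_sketch(os.path.join(tmp, 'census_data.pkl'))
    def load_census_data(year):
        builds.append(year)          # record each real build
        return [year, len(builds)]   # second item shows which build produced it

    a = load_census_data(2010)                 # builds and saves
    b = load_census_data(2010)                 # loads the saved copy
    c = load_census_data(2010, _rebuild=True)  # rebuilds and overwrites
    d = load_census_data(2010, _load=False)    # runs the function only
    e = load_census_data(2010)                 # loads the copy from the rebuild
```

Note that the _load=False call neither reads nor writes the file, so the final call still loads the data saved by the rebuild.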
save_cli
The function save_cli() adds a --save flag to the command line. When --save is
included on the command line, save_cli() returns True, and False otherwise.
This allows you to run a script without overwriting any tables or figures on
disk and avoid commenting/uncommenting lines of code that do the saving.
# script named "make_figure.py"
from econtools import save_cli

save = save_cli()
if save:
    # Code to save the figure
    pass
else:
    # Code to only display the figure
    pass
The save switch is then invoked on the command line using:
$ python make_figure.py --save # Saves figure
$ python make_figure.py # Does not save
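A flag like this can be built with the standard-library argparse module. The sketch below assumes that approach; save_cli_sketch and its argv parameter are hypothetical and are not the econtools signature.

```python
import argparse


def save_cli_sketch(argv=None):
    """Hypothetical stand-in: return True if --save was passed, else False."""
    parser = argparse.ArgumentParser()
    # store_true makes the flag default to False when it is absent.
    parser.add_argument('--save', action='store_true', help='save output to disk')
    args = parser.parse_args(argv)  # argv=None means "read sys.argv[1:]"
    return args.save


# Passing explicit argument lists imitates the two command lines above.
with_flag = save_cli_sketch(['--save'])
without_flag = save_cli_sketch([])
```

Accepting an explicit argument list is a design choice made here so the function can be exercised without a real command line.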
confirmer
confirmer() is a drop-in function that lets a script quickly get yes/no input
from the user. It accepts several variations of an affirmative answer (yes, Y,
YES, etc.) and forces a valid response by re-asking the question if an invalid
one is given.
# Script thermonuclear_war.py
from econtools import confirmer

question = "Shall we play a game?"
answer = confirmer(question, default_no=True)
if answer:
    # Action for 'yes' response
    pass
else:
    # Action for 'no' response
    pass
On the command line:
$ python thermonuclear_war.py
Shall we play a game? (y,[n]) >>> Y
# Code executed for 'yes' response
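The re-asking behavior described above amounts to a loop around input(). The following is a rough sketch and not the econtools code: confirmer_sketch and its input_fn parameter are hypothetical, as are the assumptions that an empty reply takes the bracketed default and that the prompt reads ([y],n) when the default is yes.

```python
def confirmer_sketch(question, default_no=True, input_fn=input):
    """Hypothetical stand-in: ask until the reply is a recognizable yes or no."""
    yes = {'y', 'yes'}
    no = {'n', 'no'}
    # Brackets mark the default answer (assumed format for the yes default).
    hint = '(y,[n])' if default_no else '([y],n)'
    while True:
        reply = input_fn(f'{question} {hint} >>> ').strip().lower()
        if not reply:
            return not default_no  # assumption: empty reply means the default
        if reply in yes:
            return True
        if reply in no:
            return False
        # Anything else is invalid, so loop and ask again.


# Scripted replies stand in for a user typing at the prompt.
replies = iter(['maybe', 'Y'])
answer = confirmer_sketch('Shall we play a game?',
                          input_fn=lambda prompt: next(replies))
```

Here the first reply is rejected and the question is asked a second time, after which Y is accepted regardless of case.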
read and write
These functions are primarily auxiliary functions used by load_or_build(), but
they can be used directly if needed.
read() uses the suffix of the passed filename to choose the correct pandas
method to read the data.
from econtools import read
df = read('my_data.csv') # uses pandas.read_csv
df = read('my_data.dta') # uses pandas.read_stata
df = read('my_data.pkl') # uses pandas.read_pickle
write() does the same for writing: it chooses the pandas method from the
filename’s suffix.
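Suffix-based dispatch of this kind can be sketched with a lookup table. The names below (READERS, WRITERS, method_for, read_sketch, write_sketch) are hypothetical, the argument order of write_sketch is an assumption, and the writer methods are assumed counterparts of the three readers listed above for a DataFrame.

```python
from pathlib import Path

# Suffix -> pandas reader name, taken from the examples above.
READERS = {'.csv': 'read_csv', '.dta': 'read_stata', '.pkl': 'read_pickle'}
# Suffix -> DataFrame writer method name (assumed counterparts).
WRITERS = {'.csv': 'to_csv', '.dta': 'to_stata', '.pkl': 'to_pickle'}


def method_for(path, table):
    """Look up the method name for the suffix of `path` in `table`."""
    suffix = Path(path).suffix.lower()
    if suffix not in table:
        raise ValueError(f'Unsupported file type: {suffix!r}')
    return table[suffix]


def read_sketch(path, **kwargs):
    import pandas as pd  # imported here so the lookup above works without pandas
    return getattr(pd, method_for(path, READERS))(path, **kwargs)


def write_sketch(df, path, **kwargs):
    # Calls, for example, df.to_csv(path) when `path` ends in '.csv'.
    return getattr(df, method_for(path, WRITERS))(path, **kwargs)


csv_reader = method_for('my_data.csv', READERS)
dta_writer = method_for('my_data.dta', WRITERS)
```

Keeping the suffix lookup separate from the pandas call means an unsupported file type fails early with a clear error.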