I'm building a futures market tick data replay system using Python and PyTables, with a fairly large dataset (200+ GB).
As far as I can tell, PyTables can only store my timestamps as numpy datetime64 values. This is an issue because the trading module needs to call methods like time, weekday, or month on the incoming data, so each timestamp has to become a datetime object or a pandas Timestamp first. Trying to convert billions of rows at runtime basically makes the system unusable:
pd.to_datetime(my_datetime64)
my_datetime64.astype(datetime.datetime)

are both far too slow when called once per row.
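For context, here is a minimal sketch of the gap between per-row and whole-array conversion; the array of nanosecond stamps is made up purely for illustration:

import numpy as np
import pandas as pd

# One million fake nanosecond-since-epoch stamps (illustrative only).
stamps = np.arange(1000000, dtype='int64') * 10**9 + 1220441851 * 10**9

# Per-row conversion: one Python-level call per tick, which is what
# makes billions of rows unusable.
slow = [pd.Timestamp(s) for s in stamps[:1000]]

# Whole-array conversion: a single vectorized call over the int64
# buffer, returning a DatetimeIndex in one shot.
fast = pd.to_datetime(stamps)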
Here's how I import my thousands of raw CSVs into the PyTables store. Note that the index is a pandas DatetimeIndex, which lets me pull information such as time, month, and year out of each timestamp:
import pandas as pd
from pandas import HDFStore

store = HDFStore(store_dir)
for file in files:
    df = pd.read_csv("/TickData/" + file)
    # Build a DatetimeIndex from the separate date and time columns
    df.index = pd.to_datetime(df['date'].apply(str) + " " + df['time'],
                              format='%Y%m%d %H:%M:%S.%f')
    df.drop(['date', 'time'], axis=1, inplace=True)
    store.append('ticks', df, complevel=9, complib='blosc')
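As an aside, when the reading side goes through pandas rather than raw PyTables, the DatetimeIndex comes back already built. A minimal sketch, assuming the same store path and key as above:

import pandas as pd

store = pd.HDFStore(store_dir, mode='r')
# select() with a chunksize yields DataFrames whose index is already
# a DatetimeIndex, so no per-row conversion is needed.
for chunk in store.select('ticks', chunksize=100000):
    months = chunk.index.month  # vectorized attribute access
store.close()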
Here's what the data looks like when I read back a chunk with the PyTables Table.read method. You can see that the pandas timestamps all come back as raw int64 values (nanoseconds since the epoch, the '<i8' index field):
array([(1220441851000000000, [b'ESU09'], [1281.0], [1]),
(1226937439000000000, [b'ESU09'], [855.75], [2]),
(1230045292000000000, [b'ESU09'], [860.0], [1]), ...,
(1244721917000000000, [b'ESU09'], [943.75], [1]),
(1244721918000000000, [b'ESU09'], [943.75], [2]),
(1244721920000000000, [b'ESU09'], [944.0], [15])],
dtype=[('index', '<i8'), ('values_block_0', 'S5', (1,)), ('values_block_1', '<f8', (1,)), ('values_block_2', '<i8', (1,))])
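Since the 'index' field in that dump is just int64 nanoseconds since the Unix epoch, a single vectorized cast covers a whole chunk at once; a sketch, where chunk stands for a structured array like the one above:

import numpy as np

# chunk['index'] holds nanoseconds since the epoch, so one cast
# reinterprets the entire chunk without any Python-level loop.
ts = chunk['index'].astype('datetime64[ns]')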
Here's how I read the rows back out of the table in chunks:
import tables
import pandas as pd

chunksize = 100000
nrows = 1000000000
n_chunks = nrows // chunksize + 1

h5f = tables.open_file(store_dir, 'r')
t = h5f.get_node('/', 'ticks')
for i in range(n_chunks):
    chunk = t.table.read(i * chunksize, (i + 1) * chunksize)
    for c in chunk:
        # This is where c[0], the timestamp, would be converted;
        # pd.to_datetime(c[0]) and pd.Timestamp(c[0]) are both too slow.
        pass
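For comparison, here is a sketch of the same loop converting one chunk at a time instead of one row at a time; it reuses t, chunksize, and n_chunks from the snippet above and assumes the field layout shown earlier:

import pandas as pd

for i in range(n_chunks):
    chunk = t.table.read(i * chunksize, (i + 1) * chunksize)
    # One vectorized conversion per chunk instead of one call per row.
    stamps = pd.DatetimeIndex(chunk['index'].astype('datetime64[ns]'))
    weekdays = stamps.weekday  # integer day-of-week, one entry per tick
    months = stamps.month      # integer month, one entry per tick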
My question ultimately is:
1. Is there a faster way to convert the datetime64s back into datetimes or pandas Timestamps, perhaps something involving Cython?
2. Or is there a way to store the pandas Timestamps in the HDF5 file so that they don't need to be converted on read?
Thanks