Friday, June 10, 2016

Speed up conversion of timestamps to datetime in Python


I'm building a futures-market tick data replay system using Python and PyTables with a pretty large dataset (200+ GB).

As far as I can tell, PyTables can only store numpy datetime64 values for my timestamps. This is an issue because I need to convert them to datetime objects or pandas Timestamps so that the trading module can call methods like time, weekday, or month on the incoming data. Converting billions of rows at runtime basically makes the system unusable.

pd.to_datetime(my_datetime64)
datetime.datetime(my_datetime64)

are both way too slow.
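For context, here is roughly what the per-row conversion and the follow-up calls look like (a minimal sketch; I'm using utcfromtimestamp as a stand-in for the stdlib path, since the stored values are raw nanosecond epochs):

import datetime
import numpy as np
import pandas as pd

raw = np.int64(1244721920000000000)                   # nanoseconds since epoch, as stored
ts = pd.to_datetime(raw)                              # pandas Timestamp - too slow per row
dt = datetime.datetime.utcfromtimestamp(raw / 1e9)    # stdlib datetime - also too slow per row

# the trading module then needs calls like these on every tick:
ts.time()
ts.weekday()
ts.month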

Here's how I import my thousands of raw CSVs into the PyTables store. Note that the index is a pandas DatetimeIndex, which lets me get information about the timestamp such as time, month, year, etc.

import pandas as pd
from pandas import HDFStore

store = HDFStore(store_dir)

for file in files:
    # build a DatetimeIndex from the separate date and time columns
    df = pd.read_csv("/TickData/" + file)
    df.index = pd.to_datetime(df['date'].apply(str) + " " + df['time'],
                              format='%Y%m%d %H:%M:%S.%f')
    df.drop(['date', 'time'], axis=1, inplace=True)
    store.append('ticks', df, complevel=9, complib='blosc')
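For comparison, reading chunks back through pandas itself rebuilds the DatetimeIndex automatically; a sketch of that path (I haven't benchmarked it against the raw PyTables reads shown below):

import pandas as pd

with pd.HDFStore(store_dir, mode='r') as store:
    sample = store.select('ticks', start=0, stop=100000)    # DataFrame with a DatetimeIndex
    print(sample.index.dtype)                               # datetime64[ns], no per-row conversion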

Here's what the data looks like when I read back a chunk with the PyTables table.read method - you can see that the pandas timestamps all come back as raw int64 nanosecond values:

array([(1220441851000000000, [b'ESU09'], [1281.0], [1]),
       (1226937439000000000, [b'ESU09'], [855.75], [2]),
       (1230045292000000000, [b'ESU09'], [860.0], [1]), ...,
       (1244721917000000000, [b'ESU09'], [943.75], [1]),
       (1244721918000000000, [b'ESU09'], [943.75], [2]),
       (1244721920000000000, [b'ESU09'], [944.0], [15])], 
      dtype=[('index', '<i8'), ('values_block_0', 'S5', (1,)), ('values_block_1', '<f8', (1,)), ('values_block_2', '<i8', (1,))])

Here's how I read them out of the table in chunks:

import tables

chunksize = 100000
nrows = 1000000000          # approximate row count of the table
n_chunks = nrows // chunksize + 1

h5f = tables.open_file(store_directory, 'r')
t = h5f.get_node('/', 'ticks')

for i in range(n_chunks):
    chunk = t.table.read(i * chunksize, (i + 1) * chunksize)
    for c in chunk:
        # this is where we would convert c[0], which is the timestamp:
        # pd.to_datetime(c[0]) or datetime.datetime(c[0]) - both are too slow
        pass
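What I'm wondering about for question 1 is converting each chunk's index column in a single vectorized call instead of row by row, then pulling the fields off the resulting DatetimeIndex; a rough sketch (reusing chunk and pd from the loop above, not benchmarked):

idx = pd.to_datetime(chunk['index'])   # whole int64 nanosecond column -> DatetimeIndex in one call
weekdays = idx.weekday                 # vectorized, instead of per-row .weekday()
months = idx.month
times = idx.time                       # array of datetime.time objects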

My question ultimately is:

1: Is there a faster way to convert the datetime64s back into datetimes or pandas Timestamps, perhaps something involving Cython?

OR 2: Is there a way to store the pandas Timestamps in the HDF5 file so that they don't need to be converted on read?

Thanks

