Saturday, 18 June 2016

Deserializing a huge JSON string to Python objects


I am using simplejson to deserialize a JSON string into Python objects. I have a custom object_hook that takes care of deserializing the JSON back into my domain objects.

The problem is that when the JSON string is huge (the server returns around 800K domain objects as a single JSON string), my Python deserializer takes almost 10 minutes to deserialize them.

I drilled down a bit further, and it looks like simplejson itself is not doing much work; rather, it delegates everything to the object_hook. I tried optimizing my object_hook, but that did not improve performance much either (I gained barely 1 minute).
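A minimal sketch of how the split between parsing time and hook time could be measured. The helper timed_loads, the no-op identity_hook, and the synthetic payload are all made up for illustration; the sketch falls back to the standard-library json module, which accepts the same object_hook argument, when simplejson is not installed.

```python
import time

try:
    import simplejson as json  # the library used in the question
except ImportError:
    import json  # stdlib fallback; loads() accepts the same object_hook argument


def timed_loads(text, object_hook=None):
    """Parse text and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = json.loads(text, object_hook=object_hook)
    return result, time.perf_counter() - start


def identity_hook(dct):
    # Hypothetical no-op hook: it isolates the cost of calling a hook once per
    # JSON object, without any domain-object construction.
    return dct


# Synthetic payload, made up for illustration (not the real server response).
payload = json.dumps([{"@CLASS": "sample.counter", "val2": i} for i in range(10000)])

plain, t_plain = timed_loads(payload)
hooked, t_hook = timed_loads(payload, identity_hook)
print("no hook: %.4fs, identity hook: %.4fs" % (t_plain, t_hook))
```

Passing the real _object_hook to the same helper would then show how much of the total time is spent inside the hook rather than in the parser.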

My question is: is there any other standard framework that is optimized for huge data sets, or is there a way to make use of the framework's own capabilities rather than doing everything at the object_hook level?

I see that without an object_hook the framework returns just a list of dictionaries, not a list of domain objects.
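To illustrate the difference, here is a small sketch. The Counter class and the simplified hook are hypothetical stand-ins for the real domain classes and hook, and the input string is a shortened, made-up variant of the server response.

```python
try:
    import simplejson as json  # the library used in the question
except ImportError:
    import json  # stdlib fallback with the same loads() interface


class Counter(object):
    # Hypothetical stand-in for a domain class that sets attributes from kwargs.
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def hook(dct):
    # Simplified hook: build a Counter for dicts tagged with the matching
    # @CLASS value and leave every other dict untouched.
    if dct.get("@CLASS") == "sample.counter":
        fields = {k: v for k, v in dct.items() if not k.startswith("@")}
        return Counter(**fields)
    return dct


text = '[{"@CLASS": "sample.counter", "val1": "ABC", "val2": 1131}]'

as_dicts = json.loads(text)                       # plain dictionaries
as_objects = json.loads(text, object_hook=hook)   # domain objects

print(type(as_dicts[0]).__name__, type(as_objects[0]).__name__)
```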

Any pointers here will be useful.

FYI I am using simplejson version 3.7.2

Here is my sample _object_hook:

def _object_hook(dct):
    if '@CLASS' in dct:  # the server marks domain objects with this @CLASS key
        clsname = dct['@CLASS']
        # Like Java's Class.forName: imports the module and returns the class.
        cls = get_class(clsname)
        # The server is written in Java, so I convert attribute names to Python naming conventions.
        dct = dict((convert_java_name_to_python(k), v) for k, v in dct.items())
        if cls is not None:
            obj_key = None
            if "@uuid" in dct:
                obj_key = dct.pop("@uuid")
            else:
                info("Class missing uuid: " + clsname)
            dct.pop("@CLASS", None)

            # I found this to be the most time-consuming step: the __init__ method of
            # my domain object sets all attributes based on the kwargs passed.
            obj = cls(**dct)
            if obj_key is not None:
                # shared_objs maps every uuid to its object; it is used later to replace references.
                shared_objs[obj_key] = obj
        else:
            warning("class not found: " + clsname)
            obj = dct

        return obj
    else:
        return dct
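The hook above calls get_class once per object and convert_java_name_to_python once per key, so with 800K objects the same few class names and attribute names are resolved over and over. One possible way to avoid that repeated work is to memoize both lookups. The sketch below is only an illustration: get_class_cached and to_python_name are hypothetical replacements, since the real helpers are not shown, and the camelCase to snake_case rule is an assumption about the naming convention.

```python
import importlib
import re
from functools import lru_cache

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


@lru_cache(maxsize=None)
def get_class_cached(clsname):
    """Resolve 'package.module.ClassName' to a class, or None if it is not found.

    Hypothetical stand-in for get_class; each name is resolved only once.
    """
    module_name, _, attr = clsname.rpartition(".")
    try:
        return getattr(importlib.import_module(module_name), attr)
    except (ImportError, AttributeError, ValueError):
        return None


@lru_cache(maxsize=None)
def to_python_name(java_name):
    """Convert a camelCase name to snake_case (assumed convention).

    Hypothetical stand-in for convert_java_name_to_python. Metadata keys
    that start with '@' are returned unchanged.
    """
    if java_name.startswith("@"):
        return java_name
    return _CAMEL_BOUNDARY.sub("_", java_name).lower()


print(to_python_name("firstName"), get_class_cached("collections.OrderedDict"))
```

Whether this helps depends on how expensive the real helpers are; if most of the time is spent in cls(**dct), as the comment in the hook suggests, the gain would be limited.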

A sample response:

    {"@CLASS":"sample.counter","@UUID":"86f26a0a-1a58-4429-a762-9b1778a99c82","val1":"ABC","val2":1131,"val3":1754095,"value4":{"@CLASS":"sample.nestedClass","@UUID":"f7bb298c-fd0b-4d87-bed8-74d5eb1d6517","id":1754095,"name":"XYZ","abbreviation":"ABC"}}

I have many levels of nesting, and the number of records I receive from the server is more than 800K.

