Saturday, June 18, 2016

Mismatch of memory used by consumer on Celery


I have a Celery task that loads some data (~4GB) when the worker starts, via a bootstep and an extra --init argument. The worker is started with a concurrency of 1.

celery worker -A config --config config.celery_settings.development -Q nlp --concurrency=1 --init nlp

Under Mac OS X, everything seems to work fine: two processes, one using little RAM (the consumer) and one using plenty (the worker).

> ps aux | grep -i celery
xxx 91879   0.0  0.1  7457408 13284 s003  S+    4:48PM   0:00.02 /Users/xxx/.envs/paper/bin/python3.4 /Users/xxx/.envs/paper/bin/celery worker -A config --config config.celery_settings.development -Q nlp --concurrency=1 --init nlp -l DEBUG
xxx 91229   0.0 29.9  7459456 5012028 s003  S+    4:48PM   0:18.79 /Users/xxx/.envs/paper/bin/python3.4 /Users/xxx/.envs/paper/bin/celery worker -A config --config config.celery_settings.development -Q nlp --concurrency=1 --init nlp -l DEBUG
xxx 99368   0.0  0.0  2446104    820 s002  S+    9:06AM   0:00.00 grep -i celery

But when I fire up the same worker in my AWS production environment (Ubuntu 14.04), the two processes use (roughly) the same amount of RAM.

> ps aux | grep -i celery 
ubuntu   19361 0.0 34.7 6273624 5714528 pts/0 Sl+  17:04   0:59 /home/ubuntu/.virtualenvs/production/bin/python3 /home/ubuntu/.virtualenvs/production/bin/celery worker -A config --config config.celery_settings.development -Q nlp --concurrency=1 --init nlp
ubuntu   19485  0.0 34.6 6273624 5702252 pts/0 Sl+  17:05   0:00 /home/ubuntu/.virtualenvs/production/bin/python3 /home/ubuntu/.virtualenvs/production/bin/celery worker -A config --config config.celery_settings.development -Q nlp --concurrency=1 --init nlp
ubuntu   19490  0.0  0.0  10460   940 pts/1    S+   17:05   0:00 grep --color=auto -i celery

Any clues as to what could cause this memory mismatch between the two OSes?
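One thing worth checking on the Linux box: the RSS column in ps counts copy-on-write pages that a forked child still shares with its parent, so two processes can each report ~5GB while that memory physically exists only once. A minimal sketch for inspecting this, assuming a Linux /proc filesystem (the script name is made up):

```python
# smaps_check.py -- sum the Rss and Shared_* counters from /proc/<pid>/smaps
# to see how much of a process's resident memory is shared with other processes.
import sys


def smaps_totals(pid):
    """Return (rss_kb, shared_kb) aggregated over all mappings of `pid`."""
    rss_kb = shared_kb = 0
    with open('/proc/{0}/smaps'.format(pid)) as f:
        for line in f:
            if line.startswith('Rss:'):
                rss_kb += int(line.split()[1])
            elif line.startswith(('Shared_Clean:', 'Shared_Dirty:')):
                shared_kb += int(line.split()[1])
    return rss_kb, shared_kb


if __name__ == '__main__':
    rss_kb, shared_kb = smaps_totals(sys.argv[1])
    print('Rss: {0} kB, of which shared: {1} kB'.format(rss_kb, shared_kb))
```

If Shared_Clean/Shared_Dirty account for most of the second process's Rss, the apparent doubling is mostly shared pages rather than real extra memory use.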

Below is the code sample I am using, which I think is the relevant part (Python 3.4, Celery 3.1.18).

I have the bootstep in place in config/celery.py:

from celery import bootsteps
from celery.bin import Option

celery_app.user_options['worker'].add(
    Option('--init', dest='init', default=None,
           help='Load data at task instantiation')
)

class InitArgs(bootsteps.Step):
    """BootStep to warm up task dispatchers of type Model, PaperEngine or
    ThreadEngine with data"""
    def __init__(self, worker, init, **options):
        for k, task in worker.app.tasks.items():
            if task.__name__.startswith('{0}_dispatcher'.format(init)):
                task.load()

celery_app.steps['worker'].add(InitArgs)
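The matching logic in that bootstep is just a prefix filter over the registered tasks. A standalone sketch of the same idea, with a made-up FakeTask stub standing in for the real entries of worker.app.tasks:

```python
class FakeTask:
    """Stand-in for a registered Celery task (hypothetical stub)."""
    def __init__(self, name):
        self.__name__ = name
        self.loaded = False

    def load(self):
        self.loaded = True


def warm_up(tasks, init):
    # Call load() on every task whose name starts with '<init>_dispatcher',
    # mirroring what InitArgs does with worker.app.tasks.
    for _name, task in tasks.items():
        if task.__name__.startswith('{0}_dispatcher'.format(init)):
            task.load()


tasks = {
    'proj.nlp_dispatcher': FakeTask('nlp_dispatcher'),
    'proj.other_dispatcher': FakeTask('other_dispatcher'),
}
warm_up(tasks, 'nlp')
# only nlp_dispatcher gets loaded
```

Note that this load() runs in whichever process executes the bootstep, which matters for where the 4GB ends up living.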

The task requiring the data load is defined in one of my Django apps, in tasks.py:

class EmbedPaperTask(Task):

    abstract = True
    model_name = None
    _model = None

    def __init__(self, *args, **kwargs):
        self.model_name = kwargs.get('model_name')

    def load(self):
        self._model = Model.objects.load(name=self.model_name)
        return self._model

    @property
    def model(self):
        if self._model is None:
            return self.load()
        return self._model        

@app.task(base=EmbedPaperTask, bind=True, name=task_name, model_name=model_name)
def nlp_dispatcher(self, *args, **kwargs):
    return self.model.tasks(*args, **kwargs)
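The base task's model property is a cache-on-first-access pattern; stripped of the Django and Celery machinery it boils down to the following (LazyLoader and fake_load are names I made up for the sketch):

```python
class LazyLoader:
    """Cache-on-first-access, like EmbedPaperTask.model."""
    _model = None

    def __init__(self, loader):
        self._loader = loader  # callable that does the expensive load

    def load(self):
        self._model = self._loader()
        return self._model

    @property
    def model(self):
        # Load on first access, then serve the cached object.
        if self._model is None:
            return self.load()
        return self._model


calls = []

def fake_load():
    calls.append(1)
    return {'weights': [1, 2, 3]}


obj = LazyLoader(fake_load)
obj.model  # triggers the load
obj.model  # served from cache; fake_load ran exactly once
```

The consequence is that whichever process first touches .model (or calls load(), as the bootstep does) is the one that ends up holding the data in memory.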

NB: I hope this is clear enough; I am new to this kind of thing.

