Sunday, June 19, 2016

Pipeline programming in parallel


I've got a Linux VM that is updated with new data files every 4 hours. The files are organized into directories numbered 01 to 10.

I've got an executable (convert.exe) that converts the uploaded files to a different file type.

I'd like to develop a pipeline that processes the files with convert.exe and then redirects them to another directory.

I've already programmed this serially as a Linux bash script, using the following code:

#iterate over the numbered data directories
for d in $(find /mnt/data01/dpad -mindepth 1 -name "DIR*" -type d); do

  #recursively iterate through files
  #for those that were modified within the last day (i.e. new files added)
  for f in $(find "$d" -type f -mtime -1); do

    #determine appropriate folder for file to move to
    newdirname=$(basename "$d")
    newfilename=$(basename "$f")

    #convert the file into the matching output directory
    mono convert.exe "$f" -o "/mnt/convertedfiles/$newdirname/$newfilename"
  done
done

However, I'd like to use the processing power I have access to and run the conversions in parallel across several CPUs, to get something closer to real-time conversion results.
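
From what I've read, the existing loop could already be parallelized at the shell level with GNU parallel; here's a minimal sketch (assuming the parallel utility is installed, and that 8 jobs is a sensible count, both guesses on my part):

#same traversal as the serial script, but the innermost conversions are
#fanned out over 8 parallel jobs; {} is the input file, {/} its basename
for d in $(find /mnt/data01/dpad -mindepth 1 -name "DIR*" -type d); do
  newdirname=$(basename "$d")
  find "$d" -type f -mtime -1 |
    parallel -j 8 mono convert.exe {} -o "/mnt/convertedfiles/$newdirname/{/}"
done

This would keep the per-directory logic intact, so the converted files land exactly where the serial script puts them.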

I was planning on changing to Python and using Snakemake to distribute the commands.
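
From the Snakemake documentation, I gather the whole job could be expressed as a Snakefile along these lines (a rough sketch on my part, assuming the data files sit directly inside the numbered directories; the paths are from my setup above):

#map every input file to its converted counterpart
DIRS, FILES = glob_wildcards("/mnt/data01/dpad/{dir}/{file}")

#default target: request every converted file
rule all:
    input:
        expand("/mnt/convertedfiles/{dir}/{file}", zip, dir=DIRS, file=FILES)

#one rule instance converts one file; Snakemake schedules many in parallel
rule convert:
    input:
        "/mnt/data01/dpad/{dir}/{file}"
    output:
        "/mnt/convertedfiles/{dir}/{file}"
    shell:
        "mono convert.exe {input} -o {output}"

Running it with snakemake -j 8 should then spread the conversions over 8 CPUs, and since Snakemake only builds outputs that are missing or older than their inputs, it would also take over the -mtime -1 "new files only" check from the bash version.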

I'm not new to programming, but I am new to Python and Snakemake.

Just wondering if anyone could provide some insight into how to go about starting this process?
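
One more piece I'm considering: since new data arrives every 4 hours, the whole run could presumably be triggered from cron. A guess at the crontab entry (the working directory and log path are made up):

#rerun the workflow every 4 hours on 8 cores
0 */4 * * * cd /mnt/data01 && snakemake -j 8 >> /var/log/convert.log 2>&1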

