Sunday, June 26, 2016

Storing 100 MB files and an efficient cross-join operation in MongoDB or another DB


Part of a project I am involved in consists of developing a scientific application for internal use that works with a large collection of files (about 20,000), each ~100 MB in size. The files are accompanied by metadata that is used to select subsets of the whole set.

Update after reading responses: yes, all processing takes place in a single server room.

The application selects two subsets of these files. In the first stage it processes each file individually and independently, yielding up to 30 items per file for the second stage. Each resulting item is also stored in a file, with sizes ranging from 5 to 60 KB.

In the second stage the app processes all possible pairs of first-stage results in which the first element of a pair comes from the 1st subset and the second from the 2nd: a cross-join, or Cartesian product, of the two sets.
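
To make the second stage concrete, here is a minimal sketch of what the cross-join amounts to; subset_1, subset_2 and process_pair below are just dummy placeholders, and the point is that itertools.product yields the pairs lazily instead of building them all up front.

    from itertools import product

    # Dummy stand-ins for first-stage results; in reality these would be
    # identifiers (or paths) of the 5-60 KB result files.
    subset_1 = ["a%03d" % i for i in range(3)]
    subset_2 = ["b%03d" % i for i in range(4)]

    def process_pair(x, y):
        # placeholder for the real microsecond-scale pair computation
        return x, y

    # product() yields the Cartesian product lazily, so the hundreds of
    # millions of pairs never have to exist in memory all at once.
    for x, y in product(subset_1, subset_2):
        process_pair(x, y)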

The typical number of items in the first subset is in the thousands, and in the second in the tens of thousands, so the number of possible pairs in the second stage is in the hundreds of millions.

Typical processing time is about 1 second for a single 100 MB source file and microseconds for a single pair of first-stage results. The application is not for real-time processing; the general use case is to submit a job for overnight calculation and obtain the results in the morning.
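
As a rough sanity check on those numbers (the item counts and per-pair time below are just illustrative assumptions), the pair computation by itself amounts to only minutes of CPU time:

    # Back-of-the-envelope with assumed numbers
    n_subset_1 = 3000                # "thousands" of first-stage items
    n_subset_2 = 30000               # "tens of thousands"
    pairs = n_subset_1 * n_subset_2              # 90 million pairs
    seconds_per_pair = 10e-6                     # assume ~10 microseconds
    print(pairs, "pairs, ~%.0f s of pure pair computation"
          % (pairs * seconds_per_pair))
    # -> 90000000 pairs, ~900 s, which suggests the overnight runtime is
    #    dominated by I/O and coordination rather than the pair arithmetic.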

We already have a version of the application, developed earlier when we had much less data. It is written in Python and uses the file system and standard Python data structures. The computations are performed on 10 PCs connected by self-developed software written with Twisted. Files are stored on a NAS and on the PCs' local drives. The app now performs very poorly, especially in the second stage and afterwards, during aggregation of the results.

Currently I am looking at MongoDB to accomplish this task. However, I do not have much experience with such tools and am open to suggestions.

I have conducted some experiments with MongoDB and PyMongo and found that loading a whole file from the database takes about 10 seconds over Gigabit Ethernet. The minimal chunk size needed for processing is ~3 MB, and it is retrieved in 320 ms. Loading files from a local drive is faster.
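
For reference, a measurement of this kind looks roughly as follows with GridFS (the host, database name and file path are placeholders):

    import time
    import gridfs
    from pymongo import MongoClient

    # Placeholder connection and file; adjust to the real deployment.
    client = MongoClient("mongodb://db-host:27017")
    fs = gridfs.GridFS(client["experiment"])

    with open("/path/to/source_file.dat", "rb") as fh:
        file_id = fs.put(fh, filename="source_file.dat")

    start = time.time()
    data = fs.get(file_id).read()      # pull the whole file over the network
    print("read %d bytes in %.2f s" % (len(data), time.time() - start))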

The MongoDB configuration contained a single line with a path.
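
That is, just the old single-option format with nothing else tuned, something like (the path is an example):

    dbpath = /data/mongodb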

However, a very appealing feature of the database is its ability to store metadata and support searching on it, as well as automatic replication. It is also persistent storage, so computations could be resumed after an accidental stop (currently we have to start over).
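
A sketch of what storing and querying that metadata could look like (the database, collection and field names are invented, not a worked-out schema):

    from pymongo import MongoClient, ASCENDING

    client = MongoClient()
    files = client["experiment"]["source_files"]

    # One document per 100 MB source file: its metadata plus a pointer to
    # the actual bytes (a GridFS id, or a filesystem path as asked below).
    files.insert_one({
        "filename": "source_file.dat",
        "instrument": "A",             # invented example metadata fields
        "acquired": "2016-06-01",
        "path": "/nas/raw/source_file.dat",
    })
    files.create_index([("instrument", ASCENDING), ("acquired", ASCENDING)])

    # Selecting a subset for a job is then an ordinary indexed query.
    subset_1 = files.find({"instrument": "A",
                           "acquired": {"$gte": "2016-01-01"}})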

So, my questions are:

Is MongoDB the right choice? If yes, what are the guidelines for a data model?

Is it possible to improve the retrieval time for files?

Or is it reasonable to store the files in a file system, as before, and keep only the paths to them in the database?
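
That hybrid layout would look roughly like this (again with invented names): MongoDB answers the metadata query, and the worker then reads the bytes directly from the NAS or local drive.

    from pymongo import MongoClient

    client = MongoClient()
    files = client["experiment"]["source_files"]

    # Metadata query in MongoDB, payload read from the filesystem.
    doc = files.find_one({"instrument": "A"})     # invented query field
    if doc is not None:
        with open(doc["path"], "rb") as fh:
            payload = fh.read()                   # ~100 MB from NAS/local disk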

Creating the list of all possible pairs for the 2nd stage was done in the client-side Python code and also took a rather long time (I haven't measured it).

Would the MongoDB server do better?
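
For context, here is a sketch (with invented collection and field names) of generating the pairs on the fly instead of materializing the list, keeping only the smaller subset in memory and streaming the larger one from the server; whether this kind of client-side generation or a server-side join is the better option is exactly what I am asking.

    from pymongo import MongoClient

    client = MongoClient()
    results = client["experiment"]["stage1_results"]

    # Keep the smaller subset (thousands of ids) in memory and stream the
    # larger one (tens of thousands) from the server in cursor batches; the
    # hundreds of millions of pairs are then generated on the fly.
    subset_1 = list(results.find({"subset": 1}, {"_id": 1}))
    subset_2 = results.find({"subset": 2}, {"_id": 1})

    for doc_2 in subset_2:
        for doc_1 in subset_1:
            pass   # process_pair(doc_1["_id"], doc_2["_id"]) would go here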

