Dump Large Datasets and Machine Learning Models with Joblib and Pickle
Suppose you are working with datasets that are relatively large, but not quite Big Data. When you need to dump them and their results to a local disk with limited space, which serialization method should you choose?
OSError: [Errno 28] No space left on device
Familiar with this type of error? It is a real headache when the algorithm has been running for several hours and the error is raised at the very last step: dumping the result.
Although some libraries ship their own dumping methods, the most common choices are joblib and pickle, both of which are recommended by sklearn for model persistence as well.
Pickle is commonly used because it can persist the data exactly as it was. For example, for a pandas DataFrame with a column in datetime format, if we use pd.DataFrame.to_csv() to dump it, that column comes back as strings when we reload it. Using pickle avoids this issue.
The pickle module can transform a complex object into a byte stream, and transform that byte stream back into an object with the same internal structure.
The code for pickle is as simple as:
import pickle

# To dump
with open('directory/filename.pickle', 'wb') as f:
    pickle.dump(file_to_dump, f)

# To load
with open('directory/filename.pickle', 'rb') as f:
    file = pickle.load(f)
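To see the dtype point from earlier in action, here is a small sketch (the file names and the toy DataFrame are just placeholders):
import pandas as pd

# A tiny DataFrame with a datetime column (toy example)
df = pd.DataFrame({'ts': pd.to_datetime(['2021-01-01', '2021-01-02']),
                   'value': [1, 2]})

# CSV round trip: the datetime column comes back as plain strings
df.to_csv('demo.csv', index=False)
print(pd.read_csv('demo.csv').dtypes)        # ts is object (string)

# Pickle round trip: the datetime dtype is preserved
df.to_pickle('demo.pickle')
print(pd.read_pickle('demo.pickle').dtypes)  # ts is datetime64[ns]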
One of the main features of Joblib is described in its documentation as follows.
Fast compressed Persistence: a replacement for pickle to work efficiently on Python objects containing large data ( joblib.dump & joblib.load ).
The normal usage of joblib would be:
import joblib

# To dump
f = 'directory/filename.joblib'
joblib.dump(file_to_dump, f)

# To load
file = joblib.load(f)
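The same one-liner works for machine learning models. As a rough sketch, assuming a scikit-learn estimator (the toy random forest and the file name below are just examples):
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Train a toy model (stand-in for your own estimator)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Persist the fitted model, then load it back later for prediction
joblib.dump(model, 'model.joblib')
restored = joblib.load('model.joblib')
print(restored.predict(X[:3]))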
However, when the above methods take too long or produce files that are too large, the easiest fix is to compress the files with joblib.
The supported compressors are ‘zlib’, ‘gzip’, ‘bz2’, ‘lzma’ and ‘xz’. If you simply pass compress=True, joblib picks a default for you, and it can also infer the compressor from the file extension (‘.z’, ‘.gz’, ‘.bz2’, ‘.xz’ or ‘.lzma’).
The sample code is like this:
f = 'directory/filename.joblib'
joblib.dump(file_to_dump, f + '.bz2', compress=('bz2', 3))
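If you want to check how much space compression actually saves on your data, you can compare file sizes on disk. A minimal sketch (the zero-filled array is just a highly compressible stand-in for your real object):
import os
import numpy as np
import joblib

data = np.zeros((1000, 1000))  # highly compressible toy data

joblib.dump(data, 'plain.joblib')                            # no compression
joblib.dump(data, 'packed.joblib.bz2', compress=('bz2', 3))  # bz2, level 3

print(os.path.getsize('plain.joblib'))       # roughly 8 MB for this array
print(os.path.getsize('packed.joblib.bz2'))  # a tiny fraction of that
Higher compression levels shrink the files further but take longer to write and read, so a moderate level such as 3 is a common compromise.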
Hope this will help you free some storage space.