How to add/delete/count multiple records in Redis using Python
2 min read · Dec 4, 2021
I am dealing with tens of billions of data points in Redis for our app to use. The data is deleted and updated incrementally, in batches. After some trial and error, I found that adding and deleting records in batches is the most efficient approach I know of so far.
Background:
Methodology: The data is stored in HDFS, and we use Hive to process it and perform the calculations. The results are then written to Redis for the app to use.
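As a hypothetical illustration of that last step (the file name, batch size and connection settings below are placeholders, and the daily Hive result is assumed to be exported as tab-separated key/value rows), the load into Redis can be done in fixed-size batches rather than one key at a time:
import redis
BATCH_SIZE = 10000  # placeholder batch size; tune for your value sizes
def load_hive_export(path, r):
    """Read tab-separated key/value rows and write them to Redis in batches."""
    batch = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            batch[key] = value
            if len(batch) >= BATCH_SIZE:
                r.mset(batch)  # one round trip for the whole batch
                batch = {}
    if batch:  # flush the last partial batch
        r.mset(batch)
r = redis.Redis(host="localhost", port=6379, db=0)  # placeholder connection
load_hive_export("d0_result.tsv", r)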
Challenges:
- The time range is long (one year), which leads to an extremely large dataset of more than ten billion records.
- The data is updated daily. Updating data at this scale requires careful planning of resources and strategy.
- Redis executes commands on a single thread, which means only one operation is processed at a time. Issuing billions of individual add/update/delete/count commands takes a very long time, which is painful (see the small comparison sketched after this list).
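To get a feel for the difference, here is a minimal, hypothetical benchmark (local Redis, 100,000 keys with a made-up "bench:key:" prefix; none of these values come from the production setup) comparing one SET call per key against a single MSET call. Most of the saving comes from collapsing 100,000 network round trips into one.
import time
import redis
r = redis.Redis(host="localhost", port=6379, db=0)  # placeholder connection
N = 100_000
data = {"bench:key:%d" % i: i for i in range(N)}
# One command (and one network round trip) per key
st = time.time()
for k, v in data.items():
    r.set(k, v)
print("individual SET calls: %.2f s" % (time.time() - st))
# All keys written with a single MSET command
st = time.time()
r.mset(data)
print("single MSET call: %.2f s" % (time.time() - st))
# Clean up the benchmark keys with one DEL command
r.delete(*data)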
Strategies:
- Create intermediate tables in Hive and update them daily by adding D0’s new data to the D-1 result and deleting the data from one year ago (D-1 year).
- Instead of overwriting the Redis DB from scratch (deleting all the old records and adding all the new ones), which takes O(n+n) running time, I compare the D0 result with the D-1 result and split the data into ‘To Add’, ‘To Delete’ and ‘To Update’. Since there are only roughly 50 million active users per day, far fewer records need to be changed, which greatly reduces the running time (a sketch of this diffing step follows the code below).
- When adding/updating/deleting/counting records in Python, operating on multiple keys per command saves a tremendous amount of time. The code is as below:
import logging
import time
import redis

target_redis = redis.Redis(host=host, port=port, db=db, password=password)

# To add/update one record
target_redis.set('a', 1)

# To add/update multiple records
target_redis.mset({'a': 1, 'b': 2, ...})

# To delete one key
target_redis.delete(keyrow)

# To delete multiple keys
keyrows = [keyrow1, keyrow2, ...]
target_redis.delete(*keyrows)

# To count records matching a key pattern, walk the keyspace with SCAN
logging.info("Count previous records.")
st = time.time()
cnt = 0
cursor = 0
while True:
    cursor, keys = target_redis.scan(cursor=cursor, match="keyrow*", count=100000)
    cnt += len(keys)
    if cursor == 0:
        break
et = time.time()
logging.info("%d records counted within %d s." % (cnt, et - st))
END.