Writing a large .aselmdb file is very slow
I am trying to collect data from several smaller ASE databases into a single large (~100 GB) .aselmdb file for training machine learning models. However, writing the large .aselmdb file is slow, and it seems to slow down further as the file grows.
Here is the code I am currently using:
import ase.db

for input_file in input_files:
    with ase.db.connect(input_file) as input_db:
        # Write to the output database in limited-size transactions
        N = len(input_db)
        n = 0
        while n < N:
            with ase.db.connect(output_db_path, append=True) as output_db:
                # Copy up to 1000 rows per transaction
                for _ in range(1000):
                    if n >= N:
                        break
                    row = input_db.get(n + 1)
                    output_db.write(row, key_value_pairs=row.key_value_pairs, data=row.data)
                    n += 1
The input and output file paths can be either .db or .aselmdb. I keep the connection to the input file open and write to the output database in transactions of 1000 rows at a time. The total dataset is around 100 GB and 12 million rows.
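For reference, here is a minimal sketch of the same batched copy written with select() and limit/offset instead of repeated get(id) calls, so contiguous row IDs are not assumed; I do not know whether this makes any difference to write speed, and input_files / output_db_path are placeholders as above:

import ase.db

chunk_size = 1000  # rows per output transaction
for input_file in input_files:
    with ase.db.connect(input_file) as input_db:
        N = len(input_db)
        for offset in range(0, N, chunk_size):
            # Open (and close) the output connection once per chunk
            with ase.db.connect(output_db_path, append=True) as output_db:
                for row in input_db.select(limit=chunk_size, offset=offset):
                    output_db.write(row,
                                    key_value_pairs=row.key_value_pairs,
                                    data=row.data)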
When the output file is a .db (the default SQLite format), it writes around 100 GB in 24 hours, but when the output is an .aselmdb, it only writes around 30 GB in 24 hours and around 60 GB in 4 days, i.e. the write rate slows down over time.
Do you have any recommendations for writing large .aselmdb files faster? Or is this format perhaps not suited to this amount of data?
Thank you.