Some notes, mostly so I have a record of this later.
Setting up Tokyo Tyrant instances, one per month. I expect about 4 million records a day, which is 120 million a month, so I set bnum to 480 million. That seems insane, but it's worth a shot.
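Just to spell out the sizing arithmetic (the 4× multiplier is my own rule of thumb from the Tokyo Cabinet suggestion that bnum be a few times the expected record count, not anything the docs mandate):

```python
records_per_day = 4_000_000
days_per_month = 30
records_per_month = records_per_day * days_per_month  # 120 million
bnum = 4 * records_per_month                          # 480 million buckets
print(bnum)
```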
One thing I noticed in shifting from one-day tests to a one-month populate, with bnum bumped from 8 million (2 × 4 million) to 480 million: a significant drop in populate speed across the four simultaneous processes (one for each of 4 months).
There is write delay of course, and that may be all of it, since the files are big now.
Perhaps there is a benefit to wider tables, rather than one row per data record? Say, one row per sensor per hour of data, or per 5 minutes, etc.?
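A minimal sketch of what I mean by a wider row, assuming a reading is just a (sensor, timestamp, value) tuple; the key format and column names here are made up for illustration, and it's Python rather than the Perl I'm actually using:

```python
from collections import defaultdict
from datetime import datetime

def hour_key(sensor_id, ts):
    """One row per sensor per hour, e.g. 'temp1:2009070608'."""
    return f"{sensor_id}:{ts:%Y%m%d%H}"

def widen(readings):
    """Fold (sensor, ts, value) readings into hour-wide rows,
    one column per minute-of-hour."""
    rows = defaultdict(dict)
    for sensor, ts, value in readings:
        rows[hour_key(sensor, ts)][f"m{ts.minute:02d}"] = value
    return dict(rows)
```

That turns up to 60 one-column rows into a single 60-column row, so the per-record overhead (and the number of buckets needed) shrinks by the same factor.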
Also, as I wrapped up my initial one-day tests, I got some random crashes in my Perl script stuffing the data in. Not sure why; it could be because I was tweaking parameters and such.
One final point: the size of one day of data in Tokyo Cabinet is about the same as one day of data in CouchDB. I was hoping for a much bigger size advantage (a smaller file). The source data is about a 100 MB unzipped CSV file, and it balloons to 600 MB with bnum set at 8 million in a table database. Of course, it isn't strictly the same data: I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday; etc.).
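The sort of split I mean, sketched in Python rather than the Perl I'm actually using (the field names are illustrative):

```python
from datetime import datetime

def split_timestamp(ts):
    """Explode a timestamp into separately queryable columns."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
        "dow": ts.strftime("%A"),  # day-of-week name, e.g. "Monday"
    }
```

Storing these as individual columns is exactly what inflates the file, but it makes "Mondays in July" a plain indexed lookup instead of date math at query time.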
So, slotting 4 months of data away. I’ll check it again on Monday and see if it worked.
And by the way, I’m sure I’m not the best at this because I haven’t used it much, but it is orders of magnitude faster to use the COPY command (via DBIx::Class) to load CSV data into PostgreSQL. Of course, I don’t want all of that data sitting in my relational database, but I’m just saying…
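For comparison, the shape of that bulk load (this is not my actual DBIx::Class code, and the table and column names are invented): the win comes from handing the whole CSV stream to COPY instead of doing row-by-row INSERTs.

```python
def copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV stream."""
    return f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"

# With psycopg2 (or the DBD::Pg handle underneath DBIx::Class), roughly:
# with open("day.csv") as f:
#     cur.copy_expert(copy_sql("readings", ["sensor_id", "ts", "value"]), f)
```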