350 points by storage_wiz 6 months ago flag hide 23 comments
deepindexer 6 months ago next
I've been working on a new file scanning technique for terabyte-sized files, and I'm proud to say I've managed to bring the scan time down to sub-second levels! The novel indexing technique behind it makes large-file scans far more feasible for data-intensive applications. AMA incoming ...
speedsorcerer 6 months ago next
Incredible work! Would you care to elaborate on the novel indexing technique used? I'm sure the community would love to read up on it, even if just a brief overview.
deepindexer 6 months ago next
Absolutely! The novel indexing technique creates a sparse table over the file, which tremendously accelerates scanning without compromising the scanned data (credit: pmi_terabytes.pdf).
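To make the idea concrete, here's a minimal sketch of what a sparse table over a large file could look like: sample one small entry per fixed-size block so a later scan can skip straight to candidate blocks instead of reading everything. All names, the stride size, and the sampling scheme are my own assumptions for illustration, not the author's actual design.

```python
import os

STRIDE = 64 * 1024 * 1024  # assumed: one index entry per 64 MiB block

def build_sparse_index(path, stride=STRIDE):
    """Return a list of (offset, sample) entries, one per block of the file."""
    index = []
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        for offset in range(0, size, stride):
            f.seek(offset)
            index.append((offset, f.read(16)))  # tiny sample per block
    return index

def blocks_to_scan(index, predicate):
    """Only block offsets whose sample matches the predicate need a full scan."""
    return [off for off, sample in index if predicate(sample)]
```

The win is that the index is tiny relative to the file (a few bytes per block), so deciding which blocks need a full scan is nearly free.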
blazingbits 6 months ago prev next
How about file integrity checks during the sub-second scans? It's essential not to sacrifice validation speed or accuracy for quicker scan times.
deepindexer 6 months ago next
Excellent question! Built-in validation checks are part of the indexing methodology, so data accuracy is preserved without introducing room for error. Details to follow in an upcoming blog post.
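One plausible way validation could piggyback on the index (purely my guess, since the details are pending the blog post) is to store a checksum per indexed block at build time and re-check blocks as they're scanned:

```python
import zlib

def index_with_checksums(data: bytes, block_size: int):
    """Split data into blocks and record (offset, crc32) for each one."""
    return [
        (off, zlib.crc32(data[off:off + block_size]))
        for off in range(0, len(data), block_size)
    ]

def verify_block(data: bytes, off: int, block_size: int, expected_crc: int):
    """Re-check one block's integrity during a scan."""
    return zlib.crc32(data[off:off + block_size]) == expected_crc
```

This keeps validation O(1) per block visited, so integrity checks add negligible overhead on top of the sparse scan.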
csharper50 6 months ago prev next
Very cool stuff, I recently faced scanning challenges for a petabyte dataset. Would love to know about your future plans involving this project.
deepindexer 6 months ago next
I'm planning on expanding the solution to multiple parallel nodes and eventually scaling to petabyte levels. Stay tuned for more updates!
whats_a_byte 6 months ago prev next
References pls for the technique, I want to understand how the magic is happening...
deepindexer 6 months ago next
You can find a detailed glimpse of the technique in 'pmi_terabytes.pdf'. Our team will publish the full work soon, so hold tight! :)
syseng007 6 months ago prev next
Did you consider using any parallel or distributed computational methods to further optimize the speed?
deepindexer 6 months ago next
The next iteration of the design will probably include parallelism or distribution. However, the current novel indexing technique has already delivered substantial speedups on a single machine.
goforjava 6 months ago prev next
That's really something, nicely done! Did you run load or stress tests to see how things fare under more strenuous circumstances?
deepindexer 6 months ago next
Yes, I subjected the algorithm to a plethora of tests; the results are encouraging with the sub-second threshold breached every single time!
algsguru 6 months ago prev next
What was the main difficulty while implementing this method, and any interesting hurdles overcome?
deepindexer 6 months ago next
There were quite a few, but I could mention the most prominent ones in a post next week as a follow-up to address the community's curiosity. Stay tuned!
mathemagician123 6 months ago prev next
Any insights on the algorithm complexity, Big O notations? Would be interesting to compare its performance!
deepindexer 6 months ago next
\mathcal{O}(N \log N), where N is the file size; in practice that works out to sub-second scan times. Happy to delve deeper in a further explanation.
bigironman 6 months ago prev next
What are the practical use cases you are looking to address with such technology?
deepindexer 6 months ago next
Potential use cases include large data repositories, log analysis, and data-intensive AI applications, all of which require very fast searching and validation.
efficientencoding 6 months ago prev next
Encryption of these massive files would require similar speeds and security for efficient usage of resources. Does it aid in securing file contents as well?
deepindexer 6 months ago next
Encryption/decryption is treated as a separate module; however, it benefits from the metadata access the index enables, which keeps processing prompt. A secure and efficient separation!
mrdatascientist 6 months ago prev next
Which storage protocols or formats took best advantage of your novel indexing method?
deepindexer 6 months ago next
I will need to perform a more fine-grained analysis, but preliminary results indicate HDFS and EXT4 file systems reap the largest benefits.