Handling scale #122
scholarsmate started this conversation in General
The Problems
Now that we can open and edit large files in the Data Editor, we have to re-think how to handle operations like search, replace, and data profiling.
Search
Let's begin with search. Suppose we load a 1M file consisting entirely of the letter 'A' and then search for 'A'. We get back an array of 1M offsets, one for each of the million 'A's, which takes up quite a bit of memory. Now imagine a 1G file, or a 1T file. Yes, scale breaks search.
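To make the memory cost concrete, here is a minimal sketch (plain Scala, not the Ωedit API) of an exhaustive search that collects every match offset; the result grows linearly with the number of matches, regardless of whether anyone ever looks past the first one.

```scala
// Sketch only: exhaustive search that collects every match offset.
// For a 1M-byte file of 'A' and the pattern "A", this builds a
// million-element result before anything is shown to the user.
def searchAll(data: Array[Byte], pattern: Array[Byte]): Vector[Long] = {
  val matches = Vector.newBuilder[Long]
  var i = 0
  while (i + pattern.length <= data.length) {
    var j = 0
    while (j < pattern.length && data(i + j) == pattern(j)) j += 1
    if (j == pattern.length) matches += i.toLong
    i += 1
  }
  matches.result()
}
```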
Replace
Replace is even worse. Suppose we load that 1M file full of the letter 'A', search for 'A', and replace every match with 'XYZ'. The UI replaces from top to bottom, so the match at the smallest offset is replaced first, then the next smallest, and so on. Each replace is done as an edit transaction by deleting the matched token and inserting the replacement token. Each edit transaction is maintained in a change stack and the model is mathematically updated. These operations are cheap, but not free. For 1M replacements, 2M changes are added to the change stack and 2M updates are made to the edit model. Each insert also stores the bytes to insert for when segments of the model are materialized. At this scale, memory usage starts to become significant and materializing the model starts taking longer. When replacing from top to bottom, each subsequent replace needs the model adjustments from the previous one (this is why the original algorithm did replace-all from bottom to top, sparing those additional model adjustments and speeding up the process). In a recent test, it took 3 hours for Ωedit to do 1M replacements as described, replacing 'A' with 'XYZ', taking a 1M file and writing out a 3M file. The CPU was pegged. Yes, scale really breaks replace.
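The order dependence is easiest to see with a small sketch. Assuming the match offsets were computed against the original data, processing them bottom-to-top keeps the remaining (smaller) offsets valid, while top-to-bottom forces an adjustment after every edit because 'XYZ' is longer than 'A'. This is plain Scala on a string buffer, not the UI or Ωedit code:

```scala
// Sketch only: replace all matches bottom-to-top so that offsets computed
// against the original data remain valid for the edits still to be done.
// Top-to-bottom would require shifting every remaining offset by
// (replacement.length - matchLen) after each edit.
def replaceBottomUp(data: String, offsets: Seq[Int], matchLen: Int, replacement: String): String = {
  val sb = new StringBuilder(data)
  offsets.sorted(Ordering[Int].reverse).foreach { off =>
    sb.replace(off, off + matchLen, replacement)  // delete match, insert replacement
  }
  sb.toString
}

// Example: "AAAA" with offsets 0..3, replacing 'A' with "XYZ":
// replaceBottomUp("AAAA", Seq(0, 1, 2, 3), 1, "XYZ")  // "XYZXYZXYZXYZ"
```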
Data Profiling
Data profiling calculates byte frequencies over a given range (start offset and length). By default the UI chooses 0 as the start offset and the computed file size as the length. This means the profiler must read every single byte, so the entire model must be materialized (it does this using a sliding viewport internally). As more and more changes are made, the model becomes more complex and fragmented, with bits and pieces scattered across the memory address space. Materialization slows down, which slows down the data profile. By default, the Scala side times out the profiling thread after 5 seconds. Without many changes it works fine on a 1.6GB file, but after 1K replaces it can no longer profile fast enough and the request times out. Yes, scale breaks data profiling.
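For reference, the counting itself is simple; the cost is in feeding it. Below is a minimal sketch of byte-frequency profiling over a range, reading through a fixed-size buffer. Ωedit profiles its edit model through a sliding viewport rather than a raw file, but the counting idea is the same:

```scala
import java.io.RandomAccessFile

// Sketch only: byte-frequency histogram over (start, length), read through a
// fixed-size buffer instead of loading the whole range at once.
def profileBytes(path: String, start: Long, length: Long, bufSize: Int = 64 * 1024): Array[Long] = {
  val freq = new Array[Long](256)
  val raf = new RandomAccessFile(path, "r")
  try {
    raf.seek(start)
    val buf = new Array[Byte](bufSize)
    var remaining = length
    var read = 0
    while (remaining > 0 && { read = raf.read(buf, 0, math.min(bufSize.toLong, remaining).toInt); read > 0 }) {
      var i = 0
      while (i < read) { freq(buf(i) & 0xff) += 1; i += 1 }
      remaining -= read
    }
  } finally raf.close()
  freq
}
```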
Possible Solutions
Search
For search, instead of finding and reporting every instance of a token, we find just the first instance. If you click Next, we search from that offset + len(token) and find the first instance after it, and so on. Previous can be done by searching backwards from that offset (search direction will need to be implemented in Ωedit). We will have to give up knowing the total number of matches in the materialized model, however.
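A rough sketch of what that could look like (hypothetical helper names and in-memory data, not the Ωedit API): only one match offset is held at a time, and Next/Previous resume from the current hit.

```scala
// Sketch only (hypothetical helpers): one match at a time instead of all offsets.
def findNext(data: IndexedSeq[Byte], pattern: IndexedSeq[Byte], from: Long): Option[Long] = {
  val idx = data.indexOfSlice(pattern, from.toInt)
  if (idx < 0) None else Some(idx.toLong)
}

// "Previous" scans backwards: the last match that starts before `before`.
def findPrevious(data: IndexedSeq[Byte], pattern: IndexedSeq[Byte], before: Long): Option[Long] = {
  val idx = data.lastIndexOfSlice(pattern, before.toInt - 1)
  if (idx < 0) None else Some(idx.toLong)
}

// Usage: after a hit at `off`, Next starts at off + pattern.length.
// var hit = findNext(data, pattern, 0L)
// hit = hit.flatMap(off => findNext(data, pattern, off + pattern.length))
```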
Replace
As with search, we find the first match, then the user can choose to replace the matched token with the replacement token, or skip to the next match. This becomes a problem when there are many matches; in the case of 1M 'A's to replace, it would effectively be a blocker. In that case the user can use sed, or wait for Ωedit to support large-scale operations in a single transaction using checkpoints: think checkpointing a model to file, then running a file-to-file transform (similar to sed, but as a library function), then having that transformed file become the next checkpoint in a new editing frame, and continuing with the editing session. This generalized solution can be leveraged for all kinds of whole-file transforms like {up,down}casing, {en,de}coding, (de)compression, {en,de}cryption, etc. These file transforms may not scale well either, but they will be faster than trying to do it all as a massive series of individual edits maintained by Ωedit. There are still plenty of challenges with this.
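To illustrate the checkpoint idea, here is a hedged sketch of the kind of file-to-file transform that could run between two checkpoints (hypothetical name, plain Java I/O, not an Ωedit function): the 'A' to 'XYZ' example done in a single streaming pass instead of a million individual edits.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, FileInputStream, FileOutputStream}

// Sketch only: streaming file-to-file replace of a single byte with a
// replacement sequence, one pass, constant memory.
def streamReplaceByte(inPath: String, outPath: String, target: Byte, replacement: Array[Byte]): Unit = {
  val in = new BufferedInputStream(new FileInputStream(inPath))
  val out = new BufferedOutputStream(new FileOutputStream(outPath))
  try {
    var b = in.read()
    while (b >= 0) {
      if (b.toByte == target) out.write(replacement) else out.write(b)
      b = in.read()
    }
  } finally { in.close(); out.close() }
}

// e.g. streamReplaceByte("checkpoint.dat", "checkpoint.next.dat", 'A'.toByte, "XYZ".getBytes)
```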
Profiling
We limit the length of the profiling range to something like 1M bytes from a given offset, so at most we profile a 1M section of a larger file. The user can select the start offset and the length, as long as the length doesn't exceed the profiling maximum.
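A small sketch of that clamping logic (hypothetical names and a 1 MiB cap, matching the limit suggested above):

```scala
// Sketch only: clamp a requested profiling range to a maximum length before
// handing it to the profiler.
val MaxProfileLength: Long = 1L << 20  // 1 MiB cap on any single profile

def clampProfileRange(requestedStart: Long, requestedLength: Long, fileSize: Long): (Long, Long) = {
  val start  = math.max(0L, math.min(requestedStart, fileSize))
  val length = math.min(math.min(requestedLength, MaxProfileLength), fileSize - start)
  (start, length)
}
```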