\$\begingroup\$

I needed to do transformations on a data stream, not on an element-by-element basis, but on subsequences. This is what I've come up with: https://github.com/peterjoel/rust-iter-replace/blob/master/src/lib.rs

I am relatively new to Rust and would love feedback on any aspect of the code: style, design, memory/performance no-nos. Performance (throughput) is particularly important in this application, as I'll be using it to "compress" very large (multi-GB) pieces of data, in memory and on disc.

Design Overview

The central struct, Replace, keeps track of two buffers (buffer_out and buffer_in), a set of partial match candidates, and some other members that hold state between invocations of next().

buffer_out holds data that is fully processed and ready to pass to the next iterator adapter: either unmatched data or the full replacement sequence. buffer_in holds data that may or may not turn out to be part of a match; it is copied to buffer_out as soon as it can be shown not to match, and erased when it does match. I chose a VecDeque for buffer_out because it is generally written to at the back and read from at the front. As I write this, I realise that buffer_in could have been just a Vec. Maybe I'll change that.
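To make the Vec idea concrete, here is a minimal sketch of the flush step with buffer_in as a Vec. The helper name flush_prefix is hypothetical and not part of the linked crate. It moves the first n items straight into buffer_out without collecting into an intermediate VecDeque. One caveat: draining a prefix of a Vec shifts the remaining items down, so this costs time proportional to what is left in buffer_in.

```rust
use std::collections::VecDeque;

/// Hypothetical helper (not in the original code): move the first `n`
/// items of `buffer_in` to the back of `buffer_out`, preserving order.
/// Panics if `n` exceeds `buffer_in.len()`, like `Vec::drain` itself.
fn flush_prefix<T>(buffer_in: &mut Vec<T>, buffer_out: &mut VecDeque<T>, n: usize) {
    // `drain(..n)` removes the prefix and yields its items by value;
    // `extend` pushes each one onto the back of the deque.
    buffer_out.extend(buffer_in.drain(..n));
}
```

Whether this beats a VecDeque for buffer_in probably depends on how long partial matches tend to be, so it would be worth measuring.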

The BTreeSet, candidates, keeps track of the index at which the first item of each partial match occurred. As soon as a partial match stops matching, it is discarded. I chose a BTreeSet because I need access to the smallest element to know when I can flush any part of buffer_in. But its elements are actually added in increasing order, which I'm not taking advantage of here, so there could be a better data structure that exploits that invariant.
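One possible structure that exploits the insertion order is a plain queue: if start indices only ever arrive in increasing order, a VecDeque stays sorted by construction and its front is always the smallest element. The sketch below assumes exactly that; the Candidates type and its method names are hypothetical, not taken from the linked code.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the `candidates` set. Start indices are
/// assumed to be pushed in strictly increasing order, so the queue is
/// always sorted without any tree bookkeeping.
#[derive(Debug, Default)]
struct Candidates {
    starts: VecDeque<usize>,
}

impl Candidates {
    /// Record a new partial match; `start` must exceed every stored index.
    fn push(&mut self, start: usize) {
        debug_assert!(self.starts.back().map_or(true, |&last| last < start));
        self.starts.push_back(start);
    }

    /// Smallest (oldest) start index, available in O(1) at the front.
    fn first(&self) -> Option<usize> {
        self.starts.front().copied()
    }

    /// Keep only the candidates accepted by `keep`. `VecDeque::retain`
    /// preserves relative order, so the queue stays sorted.
    fn prune<F: FnMut(usize) -> bool>(&mut self, mut keep: F) {
        self.starts.retain(|&s| keep(s));
    }

    /// Forget all partial matches, e.g. after a full match.
    fn clear(&mut self) {
        self.starts.clear();
    }
}
```

Pruning is still linear in the number of live candidates, as it is with the set, but pushing and reading the minimum become constant time.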

The section of code where I remove elements from the candidates set isn't great. I originally wrote a trait and had it as a method of BTreeSet, but I ran into some errors (I forget the details now) to do with the type signature, where I couldn't get it to match the expected type of the predicate used by filter. I may revisit that.
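For reference, newer standard libraries (Rust 1.53 and later) provide BTreeSet::retain, which removes elements in place and avoids the temporary vector of removals. The sketch below shows the pruning step written that way. The free function prune_candidates is a hypothetical name, and it takes the fields as separate arguments so the closure does not have to borrow self.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not in the original code): drop every candidate
/// whose next expected pattern item differs from `item`. `index` is the
/// 1-based position of `item` in the stream, matching how `fill_buffer`
/// counts, and every stored start is assumed to be smaller than `index`.
fn prune_candidates<T: PartialEq>(
    candidates: &mut BTreeSet<usize>,
    replace_from: &[T],
    index: usize,
    item: &T,
) {
    // `retain` keeps only the elements for which the closure returns true.
    candidates.retain(|&start| replace_from[index - start] == *item);
}
```

Inside fill_buffer the call would look roughly like prune_candidates(&mut self.candidates, self.replace_from, self.index, &item).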

Source

    use std::collections::{BTreeSet};
    use std::collections::VecDeque;

    pub struct Replace <'a, I, T: 'a + Ord > {
        iter: I,
        buffer_out: VecDeque<T>,
        buffer_in: VecDeque<T>,
        replace_from: &'a [T],
        replace_with: &'a [T],
        candidates: BTreeSet<usize>,
        index: usize,
        flushed_index: usize,
    }

    impl <'a, I, T> Replace <'a, I, T> where
        I: Iterator<Item = T>,
        T: Eq + Ord + Copy {

        pub fn adapt(iter: I, replace_from: &'a [T], replace_with: &'a [T]) -> Replace<'a, I, T> {
            Replace {
                iter: iter,
                buffer_out: VecDeque::new(),
                buffer_in: VecDeque::new(),
                replace_from: replace_from,
                replace_with: replace_with,
                candidates: BTreeSet::new(),
                index: 0,
                flushed_index: 0,
            }
        }

        fn fill_buffer(&mut self) {
            'consume: while let Some(item) = self.iter.next() {
                self.index += 1;
                // buffer all incoming items
                self.buffer_in.push_back(item);
                // Prune existing partial match candidates that don't match the next item
                let removes: Vec<_> = self.candidates.iter().cloned()
                    .filter(|start_index| {
                        self.replace_from[self.index - *start_index] != item
                    }).collect();
                for r in removes {
                    self.candidates.remove(&r);
                }
                // Keep track of new partial match candidates
                if self.replace_from[0] == item {
                    self.candidates.insert(self.index);
                }
                // if the length of the first match is the length of the replace sequence then it's a complete match
                match self.candidates.iter().cloned()
                    .next()
                    .into_iter()
                    .find(|x| self.index - x + 1 == self.replace_from.len()) {
                    None => {
                        // We can flush the inbound buffer up to the first partial match
                        // (or the full buffer if there are no partial matches)
                        let flush_index = self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index);
                        if flush_index > self.flushed_index {
                            let mut flush: VecDeque<_> = self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect();
                            self.buffer_out.append(&mut flush);
                            self.flushed_index = flush_index;
                            break 'consume;
                        }
                    },
                    Some(_) => {
                        // A match! So replace it and clear all the partial matches
                        self.candidates.clear();
                        for &x in self.replace_with.iter() {
                            self.buffer_out.push_back(x);
                        }
                        self.buffer_in.clear();
                        self.flushed_index = self.index;
                        break 'consume;
                    }
                }
            }
        }
    }

    pub trait ReplaceIter<'a, I, T> where
        I: Iterator<Item = T>,
        T: Ord {
        fn replace(self, from: &'a [T], to: &'a [T]) -> Replace<'a, I, T>;
    }

    impl <'a, I, T> ReplaceIter<'a, I, T> for I where
        I: Iterator<Item = T>,
        T: Eq + Ord + Copy {
        fn replace(self, from: &'a [T], to: &'a [T]) -> Replace<'a, I, T> {
            Replace::adapt(self, from, to)
        }
    }

    impl <'a, I, T> Iterator for Replace <'a, I, T> where
        I: Iterator<Item = T>,
        T: Eq + Ord + Copy {
        type Item = T;

        fn next(&mut self) -> Option<T> {
            if self.buffer_out.len() == 0 {
                self.fill_buffer();
            }
            self.buffer_out.pop_front()
        }
    }

And tests, which show the usage: