Go ships with the bufio package, which helps with buffered I/O, a technique for optimizing read and write operations. For writes, data is stored temporarily before being transmitted further (to a disk or socket, for example). Data accumulates until a certain size is reached, so fewer write operations are triggered, and each one boils down to a syscall that can be expensive when done frequently. For reads, buffering means retrieving more data in a single operation. This also reduces the number of syscalls and can use the underlying hardware more efficiently, for example by reading data in whole disk blocks. This post focuses on the Scanner provided by the bufio package. It helps to process a stream of data by splitting it into tokens and removing the space between them:

"foo bar baz"

If we’re interested only in the words, then the scanner helps retrieve “foo”, “bar” and “baz” in sequence (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "foo bar baz"
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(bufio.ScanWords)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output:

foo

bar

baz

Scanner uses buffered I/O while reading the stream — it takes io.Reader as an argument.

If you’re dealing with data already in memory, like a string or a slice of bytes, first check utilities like bytes.Split and strings.Split. It’s probably simpler to rely on those or other goodies from the bytes or strings packages when not working with a stream of data.

Under the hood, the scanner uses a buffer to accumulate the data it reads. When the buffer is not empty or EOF has been reached, the split function (SplitFunc) is called. So far we’ve seen one of the pre-defined split functions, but it’s possible to set any function with the signature:

func(data []byte, atEOF bool) (advance int, token []byte, err error)

The split function is called with the data read so far and can behave in 3 different ways, distinguished by the returned values…

1. Give me more data!

It says that the passed data is not enough to produce a token. It’s done by returning 0, nil, nil. When that happens, the scanner tries to read more data. If the buffer is full, it’s doubled in size before any reading. Let’s see how it works (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		return 0, nil, nil
	}
	scanner.Split(split)
	buf := make([]byte, 2)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

Output:

false 2 ab

false 4 abcd

false 8 abcdefgh

false 12 abcdefghijkl

true 12 abcdefghijkl

The above split function is very simple and greedy, always requesting more data. The scanner tries to read more while also making sure the buffer has enough space. In our case we’re starting with a buffer of size 2:

buf := make([]byte, 2)

scanner.Buffer(buf, bufio.MaxScanTokenSize)

After the split function is called for the very first time, the scanner doubles the size of the buffer, reads more data and calls the split function for the 2nd time. The same happens after the 2nd call. It’s visible in the output: the first call to split gets a slice of size 2, then 4, 8 and finally 12, since at that point there is no more data.

The default buffer size is 4096 bytes.

It’s worth discussing the atEOF parameter here. It’s designed to tell the split function that no more data will be available. That can happen either when EOF is reached or when a read call returns an error. If either happens, the scanner will never try to read again. The flag can be used e.g. to return an error (because of an incomplete token), which causes Scan to return false and stops the whole process. The error can later be checked using the Err method (source code):

package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		fmt.Printf("%t\t%d\t%s\n", atEOF, len(data), data)
		if atEOF {
			return 0, nil, errors.New("bad luck")
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	buf := make([]byte, 12)
	scanner.Buffer(buf, bufio.MaxScanTokenSize)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("error: %s\n", scanner.Err())
	}
}

Output:

false 12 abcdefghijkl

true 12 abcdefghijkl

error: bad luck

The atEOF parameter can also be used to process what is left inside the buffer. One of the pre-defined split functions, the one which scans the input line by line, behaves exactly this way. For input like:

foo

bar

baz

there is no \n at the end of the last line, so when ScanLines cannot find a newline character it simply returns the remaining characters as the last token (source code):

package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	input := "foo\nbar\nbaz"
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Not actually needed since it's the default split function.
	scanner.Split(bufio.ScanLines)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}

Output:

foo

bar

baz

2. Token found

This happens when the split function detects a token. It returns the number of bytes to advance the input by, plus the token itself. The reason for returning two values is that the token isn’t always equal to the bytes consumed. If the input is “foo foo foo” and the goal is to detect words (ScanWords), then the split function also skips the spaces in between, returning:

(4, "foo")

(4, "foo")

(3, "foo")

Let’s see it in action. This function looks only for consecutive occurrences of the string foo (source code):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

func main() {
	input := "foofoofoo"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		// The length check guards the final call, where data is empty.
		if len(data) >= 3 && bytes.Equal(data[:3], []byte{'f', 'o', 'o'}) {
			return 3, []byte{'F'}, nil
		}
		if atEOF {
			return 0, nil, io.EOF
		}
		return 0, nil, nil
	}
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
}

Output:

F

F

F

3. Error

If the split function returns an error, then the scanner stops (source code):

package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

func main() {
	input := "abcdefghijkl"
	scanner := bufio.NewScanner(strings.NewReader(input))
	split := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		return 0, nil, errors.New("bad luck")
	}
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Printf("%s\n", scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("error: %s\n", scanner.Err())
	}
}

Output:

error: bad luck

There is one special error which doesn’t stop the scanner immediately…

ErrFinalToken

The scanner offers an option to signal a so-called final token. It’s a special token which doesn’t break the loop (Scan still returns true), but any subsequent call to Scan stops immediately (source code):

func (s *Scanner) Scan() bool {
	if s.done {
		return false
	}
	...

It was proposed in #11836 and can be used to stop scanning when a special token is found (source code):

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func split(data []byte, atEOF bool) (advance int, token []byte, err error) {
	advance, token, err = bufio.ScanWords(data, atEOF)
	if err == nil && token != nil && bytes.Equal(token, []byte{'e', 'n', 'd'}) {
		return 0, []byte{'E', 'N', 'D'}, bufio.ErrFinalToken
	}
	return
}

func main() {
	input := "foo end bar"
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(split)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if scanner.Err() != nil {
		fmt.Printf("Error: %s\n", scanner.Err())
	}
}

Output:

foo

END