stringsext(1)
Jens Getreu
:manmanual: STRINGSUTF
:mansource: STRINGSUTF
:man-linkstyle: blue R <>

NAME

stringsext - search for valid strings, decode them and print their graphic characters as UTF-8.

stringsext is a Unicode enhancement of the GNU strings tool with additional functionality: stringsext recognizes Cyrillic, CJKV characters and other scripts in all supported multi-byte encodings, while GNU strings fails to find any of these scripts in UTF-16 and many other encodings.

SYNOPSIS

stringsext [options] [-e ENC...] [--] [FILE]
stringsext [options] [-e ENC...] [--] [-]

DESCRIPTION

stringsext prints all graphic character sequences in FILE or stdin that are at least MIN bytes long.

Unlike GNU strings, stringsext can be configured to search for valid characters not only in ASCII but also in many other input encodings, for example utf-8, utf-16be, utf-16le, big5-2003, euc-jp and koi8-r. --list-encodings shows a list of valid encoding names based on the WHATWG Encoding Standura.

When more than one encoding is specified, the scans are performed simultaneously in separate threads.

stringsext reads its input data from FILE. With no FILE, or when FILE is -, it reads standard input (stdin).

stringsext is mainly useful for determining the Unicode content of non-text files.

When invoked as stringsext -e ascii -c i, it can be used as a replacement for GNU strings.

OPTIONS

-c MODE, --control-chars=MODE
    Determine if and how control characters are printed.

    The search algorithm first scans for valid character sequences, which are then re-encoded into UTF-8 strings containing graphic (printable) and control (non-printable) characters.

    When MODE is p, all valid (control and graphic) characters are printed. Warning: Control characters may contain a harmful payload. An attacker may exploit a vulnerability of your terminal or post-processing software. Use with caution.
    MODE r never prints a control character but indicates its position instead: control characters in valid strings are first grouped and then replaced with the Unicode replacement character '�' (U+FFFD). This mode is most useful together with --radix because it keeps the whole valid character sequence on one line, which allows post-processing the output with line-oriented tools like grep. To ease post-processing, the output in MODE r is formatted slightly differently from the other modes: instead of indenting the byte counter, the encoding name and the found string with spaces as separator, only one tab is inserted.

    When MODE is i, all control characters are silently ignored. They are first grouped and then replaced with a newline character.

    See the output of --help for the default value of MODE.

-e ENC, --encoding=ENC
    Set (multiple) input search encodings.

    Encoding names ENC are identified according to the WHATWG Encoding Standard. --list-encodings prints a list of implemented encodings.

    See the output of --help for the default value of ENC.

-h, --help
    Print a synopsis of available options and default values.

-l, --list-encodings
    List available encodings as WHATWG Encoding Standard names and exit.

-n MIN, --bytes=MIN
    Print only strings at least MIN bytes long. The length is measured on the UTF-8 byte string. --help shows the default value.

-p FILE, --output=FILE
    Print to FILE instead of stdout.

-t RADIX, --radix=RADIX
    Print the offset within the file before each valid string. The single-character argument specifies the radix of the offset: o for octal, x for hexadecimal, or d for decimal. When a valid string is split into several graphic character sequences, the cut-off point is labelled according to --control-chars, but no additional offset is printed for each graphic character sequence.

    The exception to the above is --encoding=ascii --control-chars=i, for which the offset is always printed before each graphic character sequence.
    When the output of stringsext is piped to another filter, consider --control-chars=r to keep multi-line strings on one line.

-V, --version
    Print version info and exit.

EXIT STATUS

0
    Success.

other values
    Failure.

EXAMPLES

List available encodings:

    stringsext -l

Search for UTF-8 strings and strings in UTF-16 Big Endian encoding:

    stringsext -e utf-8 -e utf-16be somefile.bin

Or:

    cat somefile.bin | stringsext -e utf-8 -e utf-16be -

The following settings are designed to produce output that is bit-identical to that of GNU strings:

    stringsext -e ascii -c i         # equals `strings`
    stringsext -e ascii -c i -t d    # equals `strings -t d`
    stringsext -e ascii -c i -t x    # equals `strings -t x`
    stringsext -e ascii -c i -t o    # equals `strings -t o`

When used with pipes, -c r is required:

    stringsext -e ascii -e iso-8859-7 -c r somefile.bin | grep "Ιστορία"

LIMITATIONS

It is guaranteed that all valid string sequences are detected and printed, whatever their size. However, because binary data interpreted as multi-byte strings can yield false positives, the first characters of a valid string may not be recognised immediately. In practice, this effect occurs very rarely and the scanner synchronises quickly with the correct character boundaries.

When the size of a valid string exceeds FLAG_BYTES_MAX bytes, it may be split into two or more strings that are then printed separately. Note that this limitation refers to the valid string size and not to the graphic string size, which may be shorter. If a valid string is longer than WIN_LEN bytes, it is always split. For the values of these constants, refer to their definitions in the source code of your stringsext build. Original values are: FLAG_BYTES_MAX = 6144 bytes, WIN_LEN = 14342 bytes.
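
As an illustration of the --control-chars modes r and i described under OPTIONS, and of how the MIN length is measured, the following Python sketch may help. It is not the stringsext implementation: the function names are hypothetical, the input is assumed to be an already decoded valid string, and "control character" is assumed here to mean the C0 range, DEL and the C1 range.

```python
import re

# Assumption of this sketch: a control character is any code point in
# U+0000..U+001F (C0), U+007F (DEL) or U+0080..U+009F (C1).
# The trailing "+" groups consecutive control characters into one run.
CONTROL_RUN = re.compile(r"[\x00-\x1f\x7f-\x9f]+")


def render_r(valid: str) -> str:
    """Approximate MODE r: each run of control characters becomes one U+FFFD."""
    return CONTROL_RUN.sub("\ufffd", valid)


def render_i(valid: str) -> str:
    """Approximate MODE i: each run of control characters becomes one newline."""
    return CONTROL_RUN.sub("\n", valid)


def utf8_len(s: str) -> int:
    """Length in bytes of the string once encoded as UTF-8."""
    return len(s.encode("utf-8"))


sample = "abc\x00\x01def"          # two graphic sequences around two control characters
print(repr(render_r(sample)))      # the whole valid string stays on one line
print(repr(render_i(sample)))      # the graphic sequences end up on separate lines
print(utf8_len("h\u00e9llo"))      # 5 characters, 6 bytes in UTF-8
```

The single-line result of the r variant is what makes that mode convenient for line-oriented tools such as grep, while the i variant mirrors the one-sequence-per-line layout of GNU strings.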
RESOURCES

Project website: https://github.com/getreu/stringsext

COPYING

Copyright (C) 2016 Jens Getreu

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.