Document Number: N3572

Date: 2013-03-10

Project: Programming Language C++, Library Working Group

Reply-to: Mark Boyall wolfeinstein@gmail.com

Unicode Support in the Standard Library

Introduction

The purpose of this document is to propose new interfaces to support Unicode text, where the existing interfaces are quite deficient.

Motivation and Scope

This proposal is primarily motivated by two problems. The first is the overwhelming number of string types: primitive, Standard, and third-party. This proliferation of text types makes it impossible to reliably hold string data. The second is the poor support for Unicode within the C++ Standard library. Unicode is a complex topic, and today correctness depends on users implementing complex algorithms themselves. This is exacerbated by the existence of multiple string encodings and by poor conversion interfaces, which is why C++ is awash with third-party string types. The problem is made even worse by unrelated types that need to hold string data, such as exceptions. The existing exception hierarchy is of significantly limited usefulness, as it cannot hold Unicode exception data. This proposal aims to solve both problems by offering a fresh string class and a set of freestanding algorithms, which together constitute significant support for Unicode and alternative string encodings. Throughout, Unicode means version 6.2, the most recent finalized version.

The proposed library is not currently in use, and a reference implementation is still under construction. However, there are numerous implementations of the various subcomponents, such as Unicode algorithms and formatting routines. ICU implements virtually all of the proposed functionality and then some.

Impact on the Standard

There are no additional language or library features required, although fixing UTF-8 would be of benefit.

Design Decisions

The primary design decision taken here is to give one universal definition of a string- a range of Unicode codepoints. This decision was taken because it allows free-standing algorithms, and an interface that fits well with the rest of the Standard library. It also allows the string interface to be significantly simplified compared to the previous iteration. In addition, the library provides one single string type, best suited for each platform. This string type is intended to meet the requirements of, for example, the filesystem TS for storing paths.

Throwing an exception on Unicode validation failure is well known to be a limited solution in many cases. This part of the API is due for additional consideration, as this is only a first draft. In addition, because assignment through an iterator could be O(n), it was decided that the only kind of iterator offered over a string should be immutable: in many cases the operation would boil down to inserting a variable-size range of code units, which could be prohibitively expensive. Having the iterators yield an rvalue also makes them significantly simpler to offer, as they can decode on the fly to codepoints from their choice of encoding. Aside from this, however, the string was designed to be a familiar container, offering the minimal set of functions required to manipulate the sequence of codepoints.

Another problem is posed by UTF-8. As u8 literals do not have a distinct type, it is almost impossible to handle them correctly. There are other proposals for introducing char8_t, fixing UTF-8 literals, and introducing std::u8string, but this proposal does not assume they are accepted. Their acceptance would, however, be of significant benefit.

Finally, the std namespace is becoming very overloaded. It was decided that it would be best to split the components into subnamespaces. This not only aids with the organization of the library as a whole, but also provides a clear difference between old and new components.

Technical Specification

Currently, to avoid ambiguity, the specification is given as a series of declarations in C++11.

For iterators, usually only the iterator category and return value of operator* are specified, as the full specification of an iterator involves a lot of plumbing. If requested, these specifications can be expanded to the full definition.

In header <unicode>

namespace std {
namespace unicode {

enum class normal_form { nfc, nfd, nfkc, nfkd };

namespace policies {
    struct throw_exception {};
    template<char32_t> struct replacement_character {};
    struct undefined_behaviour {};
    struct discard {};
}

These policies define what happens when the encoding iterators encounter bad input. If the throw_exception policy is specified, an exception shall be thrown of type std::runtime_error. If replacement_character is specified, then the codepoint specified as the replacement character shall be the replacement output. When converting from codepoints to codeunits, the encoding shall specify a replacement character, and ignore the template parameter. The algorithm to determine how many replacement characters are issued is part of the Unicode Standard. If the undefined_behaviour policy is specified, then no validation shall take place, and if the input sequence is bad, then the behaviour is undefined. If discard is specified, then bad input shall be silently discarded.
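As an illustration of how such policies might be dispatched, the following is a minimal sketch of a UTF-8 decoder that selects its error handling by tag type. Everything in namespace sketch is hypothetical and only borrows the policy names; it is not the proposed interface. It also deliberately simplifies the replacement rule: it emits one replacement per rejected byte, whereas the Unicode Standard's recommended practice may group several bytes into a single replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

namespace sketch {

// Stand-ins for the proposed policy tags; not the proposal's actual definitions.
struct throw_exception {};
template <char32_t C> struct replacement_character {};
struct discard {};

// Decodes one UTF-8 sequence starting at s[i]. On success, stores the codepoint
// in `out`, advances `i` past the sequence, and returns true. On failure,
// advances `i` by a single byte and returns false (a simplification).
inline bool decode_one(const std::string& s, std::size_t& i, char32_t& out) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest codepoint allowed for this length (rejects overlongs)
    if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else { ++i; return false; }                     // not a valid lead byte
    if (s.size() - i < len) { ++i; return false; }  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return false; }  // missing continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
    if (cp < min || cp > 0x10FFFF || surrogate) { ++i; return false; }
    i += len;
    out = cp;
    return true;
}

// One overload per policy decides what bad input turns into.
inline void on_error(throw_exception, std::u32string&) {
    throw std::runtime_error("ill-formed UTF-8");
}
template <char32_t C>
void on_error(replacement_character<C>, std::u32string& out) { out.push_back(C); }
inline void on_error(discard, std::u32string&) {}

template <typename Policy>
std::u32string decode_utf8(const std::string& s, Policy p) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        char32_t cp = 0;
        if (decode_one(s, i, cp)) out.push_back(cp);
        else on_error(p, out);
    }
    return out;
}

// The throwing policy doubles as a validity check.
inline bool is_well_formed(const std::string& s) {
    try { decode_utf8(s, throw_exception{}); return true; }
    catch (const std::runtime_error&) { return false; }
}

}  // namespace sketch
```

Because the policy is a type, the choice of behaviour is made at compile time and costs nothing at run time; an undefined_behaviour policy would simply skip the checks in decode_one.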

The encoded_string class is templated on an encoding parameter. Each encoding is a traits-style class. The required members are:

typedef unspecified codeunit;
static constexpr codeunit replacement_character = unspecified;

The codeunit typedef is for the individual unit of storage for this specific encoding. This would be char16_t for UTF-16, char for narrow encoding, etc.

template<typename CodeunitIterator, typename Policy>
std::pair<unspecified, unspecified>
make_codepoint_range(CodeunitIterator begin, CodeunitIterator end, Policy p);

This function returns a pair of iterators of the same type, which represent a view of the codeunit range as codepoints. They shall have at least the same iterator category as the input, except that the maximum required category is bidirectional, even if the input is random access. The behaviour of these iterators shall correspond to the given Policy.
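To make the idea of a decoding view concrete, here is a minimal sketch of a forward iterator adaptor that presents UTF-16 code units as codepoints. The names utf16_codepoint_iterator, make_utf16_codepoint_range, and to_codepoints are hypothetical. The sketch performs no validation, which corresponds roughly to the undefined_behaviour policy, and it omits operator-- for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

namespace sketch {

// Decodes UTF-16 on the fly. Input is assumed to be well-formed.
template <typename It>
class utf16_codepoint_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;  // an rvalue: no char32_t is stored anywhere

    utf16_codepoint_iterator() = default;
    explicit utf16_codepoint_iterator(It pos) : pos_(pos) {}

    char32_t operator*() const {
        const char32_t hi = static_cast<char32_t>(*pos_);
        if (hi < 0xD800 || hi > 0xDBFF) return hi;  // single code unit
        const char32_t lo = static_cast<char32_t>(*std::next(pos_));
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
    utf16_codepoint_iterator& operator++() {
        const char32_t hi = static_cast<char32_t>(*pos_);
        // A high surrogate means the codepoint occupies two code units.
        std::advance(pos_, (hi >= 0xD800 && hi <= 0xDBFF) ? 2 : 1);
        return *this;
    }
    utf16_codepoint_iterator operator++(int) {
        utf16_codepoint_iterator tmp = *this;
        ++*this;
        return tmp;
    }
    friend bool operator==(const utf16_codepoint_iterator& a,
                           const utf16_codepoint_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf16_codepoint_iterator& a,
                           const utf16_codepoint_iterator& b) {
        return !(a == b);
    }

private:
    It pos_{};
};

template <typename It>
std::pair<utf16_codepoint_iterator<It>, utf16_codepoint_iterator<It>>
make_utf16_codepoint_range(It begin, It end) {
    return {utf16_codepoint_iterator<It>(begin), utf16_codepoint_iterator<It>(end)};
}

// Convenience wrapper that materializes the view.
inline std::u32string to_codepoints(const std::u16string& s) {
    auto r = make_utf16_codepoint_range(s.begin(), s.end());
    std::u32string out;
    for (auto it = r.first; it != r.second; ++it) out.push_back(*it);
    assert(out.size() <= s.size());  // every codepoint uses at least one code unit
    return out;
}

}  // namespace sketch
```

The adaptor cannot offer random access even over a random-access input, because finding the n-th codepoint requires walking the variable-width code units before it. This is why the required category stops at bidirectional.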

template<typename CodepointIterator, typename Policy>
std::pair<unspecified, unspecified>
make_codeunit_range(CodepointIterator begin, CodepointIterator end, Policy p);

This function returns a pair of iterator adaptors that view the original range of Unicode codepoints as code units, according to the given policy.
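The opposite direction can be sketched for UTF-8 as follows. This is an eager conversion rather than the lazy view the proposal describes, and the names append_utf8 and to_utf8 are hypothetical. It also shows why one codepoint may occupy between one and four code units, which is the reason mutable iterators would be expensive.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Appends the UTF-8 code units for `cp` to `out`.
// Precondition: cp is a Unicode scalar value (not a surrogate, at most U+10FFFF).
inline void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encodes a whole range of codepoints.
inline std::string to_utf8(const std::u32string& codepoints) {
    std::string out;
    for (char32_t cp : codepoints) append_utf8(out, cp);
    return out;
}

}  // namespace sketch
```

For example, to_utf8(U"\u20AC") yields the three code units E2 82 AC, so overwriting an ASCII letter with a euro sign in place would require shifting everything after it.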

template<typename ForeignEncoding, typename ForeignCodeunitIterator, typename Policy>
std::pair<unspecified, unspecified>
make_conversion_range(ForeignCodeunitIterator begin, ForeignCodeunitIterator end, Policy p);

Views a range of codeunits in the foreign encoding as a range of code units in this encoding. A reasonable implementation for any foreign encoding is to simply view it as Unicode codepoints and then view those as this encoding, but for specific encodings some cross-optimizations may be possible.
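The "pivot through codepoints" strategy described above might look like the sketch below. It is eager rather than a view, it only handles source encodings in which every code unit is exactly one codepoint (Latin-1 and UTF-32 here), and the decode and encode members are assumptions of this example, not members required by the proposal.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical foreign encoding: ISO-8859-1 maps one-to-one onto U+0000..U+00FF.
struct latin1 {
    using codeunit = unsigned char;
    static char32_t decode(codeunit u) { return u; }
};

struct utf32 {
    using codeunit = char32_t;
    static char32_t decode(codeunit u) { return u; }
};

struct utf16 {
    using codeunit = char16_t;
    static void encode(char32_t cp, std::u16string& out) {
        assert(cp <= 0x10FFFF);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split the remaining 20 bits into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
};

// Generic fallback: foreign code unit -> codepoint -> target code units.
template <typename From, typename To, typename Input>
std::basic_string<typename To::codeunit> convert(const Input& in) {
    std::basic_string<typename To::codeunit> out;
    for (auto u : in)
        To::encode(From::decode(static_cast<typename From::codeunit>(u)), out);
    return out;
}

inline std::u16string latin1_to_utf16(const std::string& s) {
    return convert<latin1, utf16>(s);
}
inline std::u16string utf32_to_utf16(const std::u32string& s) {
    return convert<utf32, utf16>(s);
}

}  // namespace sketch
```

A specialized path could skip the intermediate codepoint for particular pairs of encodings, which is the kind of cross-optimization the text allows.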

static constexpr bool is_self_synchronizing = unspecified;
static constexpr bool is_fixed_width = unspecified;
static constexpr unsigned max_width = unspecified;
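The proposal does not spell out the meaning of these three constants. Assuming that max_width is the largest number of code units a single codepoint can occupy, plausible values for the Unicode encodings would be as in the following sketch. The struct names and the helper worst_case_codeunits are hypothetical.

```cpp
#include <cstddef>

namespace sketch {

struct utf8_traits {
    using codeunit = char;
    static constexpr bool is_self_synchronizing = true;  // lead bytes differ from continuation bytes
    static constexpr bool is_fixed_width = false;
    static constexpr unsigned max_width = 4;             // code units per codepoint, at most
};

struct utf16_traits {
    using codeunit = char16_t;
    static constexpr bool is_self_synchronizing = true;  // high and low surrogates are distinct
    static constexpr bool is_fixed_width = false;
    static constexpr unsigned max_width = 2;
};

struct utf32_traits {
    using codeunit = char32_t;
    static constexpr bool is_self_synchronizing = true;  // every code unit is a whole codepoint
    static constexpr bool is_fixed_width = true;
    static constexpr unsigned max_width = 1;
};

// Upper bound on the buffer needed to hold `codepoints` codepoints.
template <typename Encoding>
constexpr std::size_t worst_case_codeunits(std::size_t codepoints) {
    return codepoints * Encoding::max_width;
}

}  // namespace sketch
```

Constants of this kind would let generic code choose, for instance, between a random-access fast path for fixed-width encodings and a sequential scan for the others.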

An implementation shall provide at least the following encodings:

namespace encoding {
    typedef unspecified utf8;
    typedef unspecified utf16;
    typedef unspecified utf32;
    typedef unspecified wide;
    typedef unspecified narrow;
    typedef unspecified system;
}

The narrow encoding is the encoding used for narrow string literals, such as "hello". The wide encoding is the encoding used for wide string literals, such as L"hello". An implementation shall make each encoding a separate type, even if two of them represent the same logical encoding; this permits overloading and specialization in portable code. The system encoding is an implementation-defined default, which shall be the encoding best suited to interoperation with platform APIs, especially operating system APIs: for example, UTF-16 on Windows and UTF-8 on Unix. The implementation may provide arbitrary additional encodings.

template<typename Char> using encoding_of = implementation-defined;

The encoding_of template returns the assumed encoding of a string whose codeunit type is std::decay<Char>::type. This shall be narrow where the decayed type is char, wide for wchar_t, utf16 for char16_t, and utf32 for char32_t.

template<typename Iterator>
using encoding_of_iterator = encoding_of<typename std::iterator_traits<Iterator>::value_type>;
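One possible way to realize these two aliases is a small specialization table keyed on the decayed character type. The sketch below uses empty tag structs in a hypothetical namespace sketch in place of the real encoding classes, and the helper encoding_of_impl is an assumption of this example.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

namespace sketch {

// Empty tags standing in for the proposed encoding classes.
struct narrow {};
struct wide {};
struct utf16 {};
struct utf32 {};

// Primary template is left undefined, so unsupported code unit types fail to compile.
template <typename T> struct encoding_of_impl;
template <> struct encoding_of_impl<char>     { using type = narrow; };
template <> struct encoding_of_impl<wchar_t>  { using type = wide; };
template <> struct encoding_of_impl<char16_t> { using type = utf16; };
template <> struct encoding_of_impl<char32_t> { using type = utf32; };

// std::decay strips references and cv-qualifiers before the lookup.
template <typename Char>
using encoding_of = typename encoding_of_impl<typename std::decay<Char>::type>::type;

template <typename Iterator>
using encoding_of_iterator =
    encoding_of<typename std::iterator_traits<Iterator>::value_type>;

}  // namespace sketch
```

With this table, sketch::encoding_of<const char16_t&> names utf16, and sketch::encoding_of_iterator<const wchar_t*> names wide.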

The string class is a container of Unicode codepoints. Because the freestanding algorithms operate on any range of Unicode codepoints, any container of Unicode codepoints may be used with them, but this class is provided as the minimal useful container. It may contain embedded null characters.

template<typename Encoding, typename Allocator = std::allocator<typename Encoding::codeunit>>
class encoded_string {
public:
    encoded_string();
    template<typename OtherEncoding, typename OtherAlloc>
    encoded_string(const encoded_string<OtherEncoding, OtherAlloc>&);
    encoded_string(encoded_string&&);
    encoded_string(const char*);

When the encoded_string interface deals with a const char* or std::string, it will assume narrow encoding, not UTF-8. A constructor which can take an encoding is available for UTF-8 const char*. When the encoded_string class takes input from an external source, it will validate that it is well-formed Unicode. If not, an exception shall be thrown.

    template<typename Encoding>
    encoded_string(const char*, Encoding = Encoding());
    encoded_string(const wchar_t*);
    encoded_string(const char16_t*);
    encoded_string(const char32_t*);
    template<typename T, typename Traits, typename Allocator, typename Encoding = encoding_of<T>>
    encoded_string(const std::basic_string<T, Traits, Allocator>&, Encoding e = Encoding());
    template<typename Iterator, typename Encoding = encoding_of_iterator<Iterator>>
    encoded_string(Iterator, Iterator, Encoding e = Encoding());

    using iterator = implementation_defined;
    using const_iterator = implementation_defined;
    using allocator_type = implementation_defined;
    using size_type = implementation_defined;
    using value_type = char32_t;

    template<typename Iterator, typename Encoding = encoding_of_iterator<Iterator>>
    void assign(Iterator, Iterator, Encoding e = Encoding()) &;
    void assign(const encoded_string&) &;
    void assign(encoded_string&&) &;

    template<typename other_encoding, typename other_alloc>
    encoded_string operator+(const encoded_string<other_encoding, other_alloc>&) const;
    encoded_string operator+(encoded_string&&) const;
    encoded_string operator+(const char*) const;
    encoded_string operator+(const wchar_t*) const;
    encoded_string operator+(const char16_t*) const;
    encoded_string operator+(const char32_t*) const;
    template<typename T, typename Traits, typename Allocator>
    encoded_string operator+(const std::basic_string<T, Traits, Allocator>&) const;

    template<typename other_encoding, typename other_alloc>
    encoded_string& operator+=(const encoded_string<other_encoding, other_alloc>&) &;
    encoded_string& operator+=(encoded_string&&) &;
    encoded_string& operator+=(const char*) &;
    encoded_string& operator+=(const wchar_t*) &;
    encoded_string& operator+=(const char16_t*) &;
    encoded_string& operator+=(const char32_t*) &;
    template<typename T, typename Traits, typename Allocator>
    encoded_string& operator+=(const std::basic_string<T, Traits, Allocator>&) &;

    encoded_string& operator=(const encoded_string&) &;
    encoded_string& operator=(encoded_string&&) &;
    encoded_string& operator=(const char*) &;
    encoded_string& operator=(const wchar_t*) &;
    encoded_string& operator=(const char16_t*) &;
    encoded_string& operator=(const char32_t*) &;
    template<typename T, typename Traits, typename Allocator>
    encoded_string& operator=(const std::basic_string<T, Traits, Allocator>&) &;

    iterator begin() &;
    const_iterator begin() const &;
    const_iterator cbegin() const &;
    iterator end() &;
    const_iterator end() const &;
    const_iterator cend() const &;

The iterator and const_iterator types are bidirectional iterators of Unicode codepoints. The value_type is char32_t. The invalidation semantics of iterators shall be those of std::string. Particularly, it is explicitly legal for iterators to refer to values inside the encoded_string value itself, and thus move or swap may invalidate iterators.

    void clear() &;
    bool empty() const;
    iterator erase(const_iterator where) &;
    iterator erase(const_iterator first, const_iterator last) &;
    void swap(encoded_string&);
    char32_t front() const;
    char32_t back() const;
    iterator insert(const_iterator where, char32_t codepoint);
    template<typename InputIterator, typename Encoding = encoding_of_iterator<InputIterator>>
    iterator insert(const_iterator where, InputIterator begin, InputIterator end, Encoding e = Encoding());
    template<typename Encoding, typename Allocator>
    iterator insert(const_iterator where, const encoded_string<Encoding, Allocator>&);
    template<typename T, typename Traits, typename Alloc, typename Encoding = encoding_of<T>>
    iterator insert(const_iterator where, const basic_string<T, Traits, Alloc>&, Encoding e = Encoding());
    void pop_back();
    void push_back(char32_t);
    void normalize(normal_form);

Performs an in-place normalization of the string's contents to the requested form.

    const typename Encoding::codeunit* codeunit_data() const;
    std::size_t codeunit_size() const;

codeunit_data returns the contents of the encoded_string as a null-terminated buffer. This pointer shall be valid for as long as the encoded_string is not mutated or destroyed. The codeunit_size function shall return the size of this buffer, except for the null terminator.

    void codeunit_reserve(std::size_t size);
    std::size_t codeunit_capacity() const;

};

using string = encoded_string<encoding::system, implementation-defined default>;

template<typename LHSEncoding, typename LHSAllocator, typename RHSEncoding, typename RHSAllocator>
bool operator<(const encoded_string<LHSEncoding, LHSAllocator>& lhs, const encoded_string<RHSEncoding, RHSAllocator>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator<(const basic_string<T, Traits, Alloc>& lhs, const encoded_string<Encoding, EncAlloc>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator<(const encoded_string<Encoding, EncAlloc>& lhs, const basic_string<T, Traits, Alloc>& rhs);

template<typename LHSEncoding, typename LHSAllocator, typename RHSEncoding, typename RHSAllocator>
bool operator==(const encoded_string<LHSEncoding, LHSAllocator>& lhs, const encoded_string<RHSEncoding, RHSAllocator>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator==(const basic_string<T, Traits, Alloc>& lhs, const encoded_string<Encoding, EncAlloc>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator==(const encoded_string<Encoding, EncAlloc>& lhs, const basic_string<T, Traits, Alloc>& rhs);

template<typename LHSEncoding, typename LHSAllocator, typename RHSEncoding, typename RHSAllocator>
bool operator<=(const encoded_string<LHSEncoding, LHSAllocator>& lhs, const encoded_string<RHSEncoding, RHSAllocator>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator<=(const basic_string<T, Traits, Alloc>& lhs, const encoded_string<Encoding, EncAlloc>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator<=(const encoded_string<Encoding, EncAlloc>& lhs, const basic_string<T, Traits, Alloc>& rhs);

template<typename LHSEncoding, typename LHSAllocator, typename RHSEncoding, typename RHSAllocator>
bool operator>(const encoded_string<LHSEncoding, LHSAllocator>& lhs, const encoded_string<RHSEncoding, RHSAllocator>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator>(const basic_string<T, Traits, Alloc>& lhs, const encoded_string<Encoding, EncAlloc>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator>(const encoded_string<Encoding, EncAlloc>& lhs, const basic_string<T, Traits, Alloc>& rhs);

template<typename LHSEncoding, typename LHSAllocator, typename RHSEncoding, typename RHSAllocator>
bool operator>=(const encoded_string<LHSEncoding, LHSAllocator>& lhs, const encoded_string<RHSEncoding, RHSAllocator>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator>=(const basic_string<T, Traits, Alloc>& lhs, const encoded_string<Encoding, EncAlloc>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator>=(const encoded_string<Encoding, EncAlloc>& lhs, const basic_string<T, Traits, Alloc>& rhs);

template<typename LHSEncoding, typename LHSAllocator, typename RHSEncoding, typename RHSAllocator>
bool operator!=(const encoded_string<LHSEncoding, LHSAllocator>& lhs, const encoded_string<RHSEncoding, RHSAllocator>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator!=(const basic_string<T, Traits, Alloc>& lhs, const encoded_string<Encoding, EncAlloc>& rhs);
template<typename T, typename Traits, typename Alloc, typename Encoding, typename EncAlloc>
bool operator!=(const encoded_string<Encoding, EncAlloc>& lhs, const basic_string<T, Traits, Alloc>& rhs);

For each primitive character type C among char, wchar_t, char16_t, and char32_t,

template<typename Enc, typename Alloc>
bool operator<(const encoded_string<Enc, Alloc>& lhs, const C* rhs);
template<typename Enc, typename Alloc>
bool operator<(const C* lhs, const encoded_string<Enc, Alloc>& rhs);
template<typename Enc, typename Alloc>
bool operator==(const encoded_string<Enc, Alloc>& lhs, const C* rhs);
template<typename Enc, typename Alloc>
bool operator==(const C* lhs, const encoded_string<Enc, Alloc>& rhs);
template<typename Enc, typename Alloc>
bool operator<=(const encoded_string<Enc, Alloc>& lhs, const C* rhs);
template<typename Enc, typename Alloc>
bool operator<=(const C* lhs, const encoded_string<Enc, Alloc>& rhs);
template<typename Enc, typename Alloc>
bool operator!=(const encoded_string<Enc, Alloc>& lhs, const C* rhs);
template<typename Enc, typename Alloc>
bool operator!=(const C* lhs, const encoded_string<Enc, Alloc>& rhs);
template<typename Enc, typename Alloc>
bool operator>(const encoded_string<Enc, Alloc>& lhs, const C* rhs);
template<typename Enc, typename Alloc>
bool operator>(const C* lhs, const encoded_string<Enc, Alloc>& rhs);
template<typename Enc, typename Alloc>
bool operator>=(const encoded_string<Enc, Alloc>& lhs, const C* rhs);
template<typename Enc, typename Alloc>
bool operator>=(const C* lhs, const encoded_string<Enc, Alloc>& rhs);

These comparison operators behave as if the data in lhs and rhs were passed to the respective iterator-based Unicode freestanding algorithm, defined shortly.

template<typename First, typename Second, typename FEncoding = encoding_of_iterator<First>, typename SEncoding = encoding_of_iterator<Second>>
bool less(First begin1, First end1, Second begin2, Second end2, FEncoding = FEncoding(), SEncoding = SEncoding(), std::locale = std::locale());
template<typename First, typename Second, typename FEncoding = encoding_of_iterator<First>, typename SEncoding = encoding_of_iterator<Second>>
bool less_or_equal(First begin1, First end1, Second begin2, Second end2, FEncoding = FEncoding(), SEncoding = SEncoding(), std::locale = std::locale());
template<typename First, typename Second, typename FEncoding = encoding_of_iterator<First>, typename SEncoding = encoding_of_iterator<Second>>
bool greater(First begin1, First end1, Second begin2, Second end2, FEncoding = FEncoding(), SEncoding = SEncoding(), std::locale = std::locale());
template<typename First, typename Second, typename FEncoding = encoding_of_iterator<First>, typename SEncoding = encoding_of_iterator<Second>>
bool greater_or_equal(First begin1, First end1, Second begin2, Second end2, FEncoding = FEncoding(), SEncoding = SEncoding(), std::locale = std::locale());
template<typename First, typename Second, typename FEncoding = encoding_of_iterator<First>, typename SEncoding = encoding_of_iterator<Second>>
bool equal(First begin1, First end1, Second begin2, Second end2, FEncoding = FEncoding(), SEncoding = SEncoding());
template<typename First, typename Second, typename FEncoding = encoding_of_iterator<First>, typename SEncoding = encoding_of_iterator<Second>>
bool not_equal(First begin1, First end1, Second begin2, Second end2, FEncoding = FEncoding(), SEncoding = SEncoding());

These six algorithms implement Unicode comparison functionality on the Unicode codepoints provided in the passed encodings. Equivalence is defined as canonical equivalence. Canonical equivalence and collation are defined by the Unicode Standard. The comparison is performed at L3 or greater.

template<typename Iterator>
std::pair<unspecified, unspecified> extended_grapheme_boundaries(Iterator begin, Iterator end, std::locale = std::locale());
template<typename Iterator>
std::pair<unspecified, unspecified> word_boundaries(Iterator begin, Iterator end, std::locale = std::locale());
template<typename Iterator>
std::pair<unspecified, unspecified> line_boundaries(Iterator begin, Iterator end, std::locale = std::locale());
template<typename Iterator>
std::pair<unspecified, unspecified> sentence_boundaries(Iterator begin, Iterator end, std::locale = std::locale());

All four iterator types (grapheme_iterator, word_iterator, line_break_iterator, and sentence_iterator) implement the respective Unicode Standard boundary analysis algorithms. The line breaking algorithm is defined in UAX #14 (http://www.unicode.org/reports/tr14/) and the other three in UAX #29 (http://www.unicode.org/reports/tr29/). The input iterators shall be at least forward iterators of Unicode codepoints. The boundary iterators all have a value_type of Iterator; each value is the position of a boundary in the input range.
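The shape of a boundary range, in which each element is itself an iterator into the input, can be illustrated with a deliberately reduced rule set. The sketch below is not UAX #29: it only refuses to break before codepoints in the Combining Diacritical Marks block (U+0300 to U+036F) and ignores every other rule, such as CR LF, Hangul syllables, and emoji sequences. It returns a std::vector eagerly instead of a lazy iterator pair, and the names simplified_cluster_boundaries and boundary_offsets are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

namespace sketch {

// Returns the positions at which a new cluster starts, plus the end position.
// Only one rule is applied: never break before U+0300..U+036F.
template <typename It>
std::vector<It> simplified_cluster_boundaries(It begin, It end) {
    std::vector<It> bounds;
    if (begin == end) return bounds;  // empty text has no boundaries
    for (It it = begin; it != end; ++it) {
        const char32_t cp = *it;
        const bool combining = cp >= 0x0300 && cp <= 0x036F;
        if (it == begin || !combining) bounds.push_back(it);
    }
    bounds.push_back(end);
    return bounds;
}

// Convenience wrapper that reports the boundaries as codepoint offsets.
inline std::vector<std::size_t> boundary_offsets(const std::u32string& s) {
    std::vector<std::size_t> offsets;
    for (auto it : simplified_cluster_boundaries(s.begin(), s.end()))
        offsets.push_back(static_cast<std::size_t>(std::distance(s.begin(), it)));
    assert(offsets.empty() || offsets.back() == s.size());
    return offsets;
}

}  // namespace sketch
```

Because each boundary is an iterator into the original range, a caller can pass two adjacent boundaries straight to other algorithms or to encoded_string::erase without any index arithmetic.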

template<typename Iterator, typename Out>
Out normalize(Iterator begin, Iterator end, Out out, normal_form);
template<typename T, typename Traits, typename Alloc, typename Encoding = encoding_of<T>>
basic_string<T, Traits, Alloc> normalize(basic_string<T, Traits, Alloc>, Encoding = Encoding());
template<typename Encoding, typename Alloc>
encoded_string<Encoding, Alloc> normalize(encoded_string<Encoding, Alloc>);

Implements normalization of the forward range over Unicode codepoints, with the output provided to the output iterator. The normal_form argument indicates which normal form is requested. Returns out.

template<typename Encoding, typename Allocator, typename Char, typename CharT>
std::basic_istream<Char, CharT>& operator>>(std::basic_istream<Char, CharT>&, encoded_string<Encoding, Allocator>&);

Reads until the next whitespace, as operator>>(std::istream&, std::string&) does. Shall perform the necessary conversion from encoding_of<Char> to Encoding.