doc.go

Documentation: github.com/rivo/uniseg

/*
Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
string width calculation for monospace fonts. Unicode Text Segmentation conforms
to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
Line Breaking conforms to Unicode Standard Annex #14
(https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters
(what people would usually refer to as a "character"), into words, and into
sentences. Or, in its simplest case, this package allows you to count the number
of characters in a string, especially when it contains complex characters such
as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
other languages. Additionally, you can use it to implement line breaking (or
"word wrapping"), that is, to determine where text can be broken over to the
next line when the width of the line is not big enough to fit the entire text.
Finally, you can use it to calculate the display width of a string for monospace
fonts.

# Getting Started

If you just want to count the number of characters in a string, you can use
[GraphemeClusterCount]. If you want to determine the display width of a string,
you can use [StringWidth]. If you want to iterate over a string, you can use
[Step], [StepString], or the [Graphemes] type (more convenient but less
performant). These provide all available information: grapheme clusters,
word boundaries, sentence boundaries, line breaks, and monospace character
widths. The specialized functions [FirstGraphemeCluster],
[FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
[FirstSentence], and [FirstSentenceInString] can be used if only one type of
information is needed.

# Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one
character. But its string representation actually has 14 bytes, so counting
bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't,
either: The flag has 4 Unicode code points, thus 4 runes. The standard library
call utf8.RuneCountInString("🏳️‍🌈") and the expression len([]rune("🏳️‍🌈")) both
return 4.
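
The byte and rune counts above can be checked with the standard library alone.
In this sketch the flag is spelled out with escape sequences so that its four
code points are visible. The counts helper is illustrative and not part of
this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper. It returns the number of bytes and the
// number of code points (runes) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Rainbow flag: U+1F3F3 (white flag), U+FE0F (Variation Selector-16),
	// U+200D (zero width joiner), U+1F308 (rainbow).
	flag := "\U0001F3F3\uFE0F\u200D\U0001F308"
	nBytes, nRunes := counts(flag)
	fmt.Println(nBytes, nRunes) // prints "14 4"
}
```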

The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
The [Graphemes] type and a variety of functions in this package will allow you
to split strings into their grapheme clusters.

# Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar
ones are selection (double-click mouse selection), cursor movement ("move to
next word" control-arrow keys), and the dialog option "Whole Word Search" for
search and replace. This package provides functions for determining word
boundaries.

# Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of
selecting or iterating through blocks of text that are larger than single words.
They are also used to determine whether words occur within the same sentence in
database queries. This package provides functions for determining sentence
boundaries.

# Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section
of text into lines such that it will fit in the available width of a page,
window, or other display area. This package provides functions to determine the
positions in a string where a line must be broken, may be broken, or must not be
broken.

# Monospace Width

Monospace width, as referred to in this package, is the width of a string in a
monospace font. This is commonly used in terminal user interfaces or text
displays or editors that don't support proportional fonts. A width of 1
corresponds to a single character cell. The C function [wcswidth()] and its
implementations in other programming languages are in widespread use for the
same purpose. However, there is no standard for the calculation of such widths,
and this package differs from wcswidth() in a number of ways, presumably to
generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following
exceptions:

  - Code points with grapheme cluster break properties Control, CR, LF, Extend,
    and ZWJ have a width of 0.
  - U+2E3A, Two-Em Dash, has a width of 3.
  - U+2E3B, Three-Em Dash, has a width of 4.
  - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
    (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
    have a width of 1.)
  - Code points with grapheme cluster break property Regional Indicator have a
    width of 2.
  - Code points with grapheme cluster break property Extended Pictographic have
    a width of 2, unless their Emoji Presentation flag is "No", in which case
    the width is 1.
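
To make the shape of these rules concrete, here is a deliberately incomplete
sketch that uses no Unicode property tables. It covers only the exceptions that
can be decided from the code point value alone (CR, LF, ZWJ, the two long
dashes, and the Regional Indicator range U+1F1E6 to U+1F1FF). The baseWidth
function is hypothetical and is not how this package is implemented. Control,
Extend, East-Asian Width, and Extended Pictographic all require property
lookups and are left out, so the fallback of 1 is wrong for those code points.

```go
package main

import "fmt"

// baseWidth is a hypothetical, partial illustration of the per-code-point
// rules. It is NOT the uniseg implementation and ignores every rule that
// needs a Unicode property table.
func baseWidth(r rune) int {
	switch {
	case r == '\r', r == '\n', r == 0x200D: // CR, LF, ZWJ
		return 0
	case r == 0x2E3A: // Two-Em Dash
		return 3
	case r == 0x2E3B: // Three-Em Dash
		return 4
	case r >= 0x1F1E6 && r <= 0x1F1FF: // Regional Indicator symbols
		return 2
	default:
		// Default assumption. Wrong for Control, Extend, Fullwidth/Wide,
		// and Extended Pictographic code points, which this sketch omits.
		return 1
	}
}

func main() {
	for _, r := range []rune{'a', '\n', 0x200D, 0x2E3A, 0x2E3B, 0x1F1E9} {
		fmt.Printf("U+%04X -> %d\n", r, baseWidth(r))
	}
}
```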

For Hangul grapheme clusters composed of conjoining Jamo and for Regional
Indicators (flags), all code points except the first one have a width of 0. For
grapheme clusters starting with an Extended Pictographic, any additional code
point will force a total width of 2, except if the Variation Selector-15
(U+FE0E) is included, in which case the total width is always 1. Grapheme
clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.

Note that whether these widths appear correct depends on your application's
rendering engine, the extent to which it conforms to the Unicode Standard, and
its choice of font.

[wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
*/
package uniseg