// Copyright 2021 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

/*
Package managedwriter provides a thick client around the BigQuery Storage API's BigQueryWriteClient.
More information about this write client can be found in the public documentation: https://cloud.google.com/bigquery/docs/write-api

Currently, this client targets the BigQueryWriteClient present in the v1 endpoint, and is intended as a more
feature-rich successor to the classic BigQuery streaming interface.  That interface is exposed as the Inserter
abstraction in cloud.google.com/go/bigquery, or as the tabledata.insertAll method if you're more familiar with
the BigQuery v2 REST methods.

# Creating a Client

To start working with this package, create a client:

	ctx := context.Background()
	client, err := managedwriter.NewClient(ctx, projectID)
	if err != nil {
		// TODO: Handle error.
	}

# Defining the Protocol Buffer Schema

The write functionality of BigQuery Storage requires data to be sent as protocol buffer
messages encoded in the proto2 wire format.  Because encoded protocol buffer messages are not
self-describing, you will also need to provide the protocol buffer schema.
This is communicated using a DescriptorProto message, defined within the protocol
buffer libraries: https://pkg.go.dev/google.golang.org/protobuf/types/descriptorpb#DescriptorProto

More information about protocol buffers can be found in the proto2 language guide:
https://developers.google.com/protocol-buffers/docs/proto

Details about data type conversions between BigQuery and protocol buffers can be
found in the public documentation: https://cloud.google.com/bigquery/docs/write-api#data_type_conversions

For cases where the protocol buffer is compiled from a static ".proto" definition,
this process is straightforward.  Instantiate an example message, then convert the
descriptor into a descriptor proto:

	m := &myprotopackage.MyCompiledMessage{}
	descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())

If the message uses advanced protocol buffer features like nested messages/groups,
or enums, the cloud.google.com/go/bigquery/storage/managedwriter/adapt subpackage
contains functionality to normalize the descriptor into a self-contained definition:

	m := &myprotopackage.MyCompiledMessage{}
	descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())
	if err != nil {
		// TODO: Handle error.
	}

The adapt subpackage also contains functionality for generating a DescriptorProto using
a BigQuery table's schema directly.

# Constructing a ManagedStream

The ManagedStream handles management of the underlying write connection to the BigQuery
Storage service.  You can either create a write stream explicitly and pass it in, or
create the write stream while setting up the ManagedStream.

It's easiest to register the protocol buffer descriptor you'll be using to send data when
setting up the managed stream using the WithSchemaDescriptor option, though you can also
set/change the schema as part of an append request once the ManagedStream is created.

	// Create a ManagedStream using an explicit stream identifier, either a default
	// stream or one explicitly created by CreateWriteStream.
	managedStream, err := client.NewManagedStream(ctx,
		managedwriter.WithStreamName(streamName),
		managedwriter.WithSchemaDescriptor(descriptorProto))
	if err != nil {
		// TODO: Handle error.
	}

In addition, NewManagedStream can create new streams implicitly:

	// Alternately, allow the ManagedStream to handle stream construction by supplying
	// additional options.
	tableName := fmt.Sprintf("projects/%s/datasets/%s/tables/%s", myProject, myDataset, myTable)
	managedStream, err := client.NewManagedStream(ctx,
		managedwriter.WithDestinationTable(tableName),
		managedwriter.WithType(managedwriter.BufferedStream),
		managedwriter.WithSchemaDescriptor(descriptorProto))
	if err != nil {
		// TODO: Handle error.
	}

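The destination table string passed to WithDestinationTable follows the fixed
"projects/.../datasets/.../tables/..." layout shown above.  As a minimal sketch using only the
standard library, the hypothetical helper tablePath (not part of this package) builds that string
and rejects empty components before any stream is created:

```go
package main

import (
	"errors"
	"fmt"
)

// tablePath is a hypothetical helper for illustration only. It builds the
// fully qualified table path in the form used by the examples above.
func tablePath(project, dataset, table string) (string, error) {
	if project == "" || dataset == "" || table == "" {
		return "", errors.New("project, dataset, and table must all be non-empty")
	}
	return fmt.Sprintf("projects/%s/datasets/%s/tables/%s", project, dataset, table), nil
}

func main() {
	// The identifiers below are placeholders, not real resources.
	name, err := tablePath("my-project", "my_dataset", "my_table")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

Validating the components up front tends to surface a misconfigured destination earlier than
waiting for the service to reject the stream.
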
# Writing Data

Use the AppendRows function to write one or more serialized proto messages to a stream. You
can choose to specify an offset in the stream to handle de-duplication for user-created streams,
but a "default" stream neither accepts nor reports offsets.

AppendRows returns a future-like AppendResult.  Calling GetResult on it blocks until the write
is successful or yields an error.

	// Define a couple of messages.
	mesgs := []*myprotopackage.MyCompiledMessage{
		{
			UserName:        proto.String("johndoe"),
			EmailAddress:    proto.String("jd@mycompany.mydomain"),
			FavoriteNumbers: []int64{1, 42, 12345},
		},
		{
			UserName:        proto.String("janesmith"),
			EmailAddress:    proto.String("smith@othercompany.otherdomain"),
			FavoriteNumbers: []int64{1, 3, 5, 7, 9},
		},
	}

	// Encode the messages into binary format.
	encoded := make([][]byte, len(mesgs))
	for k, v := range mesgs {
		b, err := proto.Marshal(v)
		if err != nil {
			// TODO: Handle error.
		}
		encoded[k] = b
	}

	// Send the rows to the service, and specify an offset for managing deduplication.
	result, err := managedStream.AppendRows(ctx, encoded, managedwriter.WithOffset(0))
	if err != nil {
		// TODO: Handle error.
	}

	// Block until the write is complete and return the result.
	returnedOffset, err := result.GetResult(ctx)
	if err != nil {
		// TODO: Handle error.
	}

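To illustrate why supplying an offset helps with de-duplication, the following toy model may be
useful.  It is a sketch under stated assumptions, not the service's implementation: toyStream,
appendAt, and errOffsetMismatch are hypothetical names, and the model simply accepts an append only
when the offset equals the current end of the stream.

```go
package main

import (
	"errors"
	"fmt"
)

// errOffsetMismatch is a hypothetical error used only by this sketch.
var errOffsetMismatch = errors.New("append offset does not match end of stream")

// toyStream models a user-created stream as an in-memory list of rows.
type toyStream struct {
	rows [][]byte
}

// appendAt accepts rows only when offset equals the current row count, and
// returns the offset at which the rows were written.
func (s *toyStream) appendAt(offset int64, rows [][]byte) (int64, error) {
	if offset != int64(len(s.rows)) {
		return 0, errOffsetMismatch
	}
	s.rows = append(s.rows, rows...)
	return offset, nil
}

func main() {
	s := &toyStream{}
	batch := [][]byte{[]byte("row-a"), []byte("row-b")}

	_, err := s.appendAt(0, batch)
	fmt.Println("first attempt error:", err)

	// Resending the same batch at the same offset is rejected, so the rows
	// are not stored twice.
	_, err = s.appendAt(0, batch)
	fmt.Println("resend error:", err)
	fmt.Println("rows stored:", len(s.rows))
}
```

In this model a writer that is unsure whether an append landed can resend it at the same offset
without risking duplicate rows, which is the property the offset option is meant to provide.
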
# Buffered Stream Management

For Buffered streams, users control when data is made visible in the destination table/stream
independently of when it is written.  Use FlushRows on the ManagedStream to advance the flush
point ahead in the stream.

	// We've written 1500+ rows in the stream, and want to advance the flush point
	// ahead to make the first 1000 rows available.
	flushOffset, err := managedStream.FlushRows(ctx, 1000)
	if err != nil {
		// TODO: Handle error.
	}

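The separation between writing and visibility can be pictured with a small model.  This is a
hedged sketch only: bufferedModel and its methods are hypothetical, and it tracks simple row
counts rather than reproducing the exact offset semantics of the service.

```go
package main

import (
	"errors"
	"fmt"
)

// bufferedModel is a hypothetical model of a Buffered stream.
type bufferedModel struct {
	written int64 // rows appended so far
	visible int64 // rows made visible by flushes
}

// appendRows records n additional rows as written but not yet visible.
func (b *bufferedModel) appendRows(n int64) {
	b.written += n
}

// flush advances the visible row count to n. The flush point never moves
// backwards and cannot pass the number of rows written.
func (b *bufferedModel) flush(n int64) (int64, error) {
	if n > b.written {
		return b.visible, errors.New("cannot flush past the end of written rows")
	}
	if n > b.visible {
		b.visible = n
	}
	return b.visible, nil
}

func main() {
	m := &bufferedModel{}
	m.appendRows(1500)

	visible, err := m.flush(1000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("visible rows:", visible)
	fmt.Println("still buffered:", m.written-m.visible)
}
```
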
# Pending Stream Management

Pending streams allow users to commit data from multiple streams together once the streams
have been finalized, meaning they'll no longer allow further data writes.

	// First, finalize the stream we're writing into.
	totalRows, err := managedStream.Finalize(ctx)
	if err != nil {
		// TODO: Handle error.
	}

	// parentName is the destination table, in the projects/.../datasets/.../tables/... form.
	req := &storagepb.BatchCommitWriteStreamsRequest{
		Parent:       parentName,
		WriteStreams: []string{managedStream.StreamName()},
	}
	// Using the client, we can commit data from multiple streams to the same
	// table atomically.
	resp, err := client.BatchCommitWriteStreams(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

# Error Handling and Automatic Retries

Like other Google Cloud services, this API relies on common components that can provide an
enhanced set of errors when communicating about the results of API interactions.

Specifically, the apierror package (https://pkg.go.dev/github.com/googleapis/gax-go/v2/apierror)
provides convenience methods for extracting structured information about errors.

The BigQuery Storage API service augments applicable errors with service-specific details in
the form of a StorageError message. The StorageError message is accessed via the ExtractProtoMessage
method in the apierror package. Note that the StorageError message does not implement Go's error
interface.

An example of accessing the structured error details:

	// By way of example, let's assume the response from an append call returns an error.
	_, err := result.GetResult(ctx)
	if err != nil {
		if apiErr, ok := apierror.FromError(err); ok {
			// We now have an instance of APIError, which directly exposes more specific
			// details about multiple failure conditions, including transport-level errors.
			storageErr := &storagepb.StorageError{}
			// ExtractProtoMessage returns nil when a matching message was found and copied.
			if e := apiErr.Details().ExtractProtoMessage(storageErr); e == nil {
				// storageErr now contains service-specific information about the error.
				log.Printf("Received service-specific error code %s", storageErr.GetCode().String())
			}
		}
	}

This library supports the ability to retry failed append requests, but this functionality is not
enabled by default.  You can enable it via the EnableWriteRetries option when constructing a new
managed stream.  Use of automatic retries can impact correctness when attempting certain exactly-once
write patterns, but is generally recommended for workloads that only need at-least-once writing.

With write retries enabled, failed writes will be automatically attempted a finite number of times
(currently 4) if the failure is considered retriable.

In support of retries, the AppendResult returned as part of an append call includes
TotalAttempts(), which returns the number of times that specific append was enqueued to the service.
Values larger than 1 indicate that the append was enqueued multiple times.

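The bounded-retry behavior described above can be sketched with the standard library alone.  This
is not the library's retry code: appendWithRetry, errTransient, and errPermanent are hypothetical
names, the retriability test is supplied by the caller, and backoff between attempts is omitted
for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values used only by this sketch.
var (
	errTransient = errors.New("transient failure")
	errPermanent = errors.New("permanent failure")
)

// appendWithRetry calls send until it succeeds, returns a non-retriable
// error, or maxAttempts is reached. It returns the number of attempts made,
// which plays the role of a total-attempts counter.
func appendWithRetry(maxAttempts int, send func() error, retriable func(error) bool) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = send()
		if err == nil || !retriable(err) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// Simulated append that fails twice with a transient error, then succeeds.
	send := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	isRetriable := func(err error) bool { return errors.Is(err, errTransient) }

	attempts, err := appendWithRetry(4, send, isRetriable)
	fmt.Println("attempts:", attempts, "error:", err)
}
```

Because a retried append may have been received by the service more than once, a pattern like
this suits at-least-once workloads better than exactly-once ones, as noted above.
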
# Usage of Contexts

The underlying RPC mechanism used to transmit requests and responses between this client and
the service uses a gRPC bidirectional streaming protocol, and the context provided when invoking
NewClient to instantiate the client is used to maintain those background connections.

This package also accepts a context when instantiating a new writer (NewManagedStream).  If the
writer's context becomes invalid, all subsequent AppendRows requests will be blocked.

Finally, there is a per-request context supplied as part of the AppendRows call on the ManagedStream
writer itself, useful for bounding individual requests.

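As a minimal sketch of bounding a single request with its own context, the example below uses
only the standard library.  slowAppend and appendTimesOut are hypothetical stand-ins for a real
append call; the point is the layering of a short-lived per-request context on top of a
long-lived one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowAppend is a hypothetical stand-in for an append call that needs `work`
// to complete. It returns early if ctx is cancelled or its deadline passes.
func slowAppend(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// appendTimesOut reports whether a simulated append needing workMs
// milliseconds is cut short by a per-request timeout of timeoutMs milliseconds.
func appendTimesOut(timeoutMs, workMs int) bool {
	// Long-lived context, analogous to the one given to the client or writer.
	base := context.Background()

	// Per-request context derived from it, bounding only this call.
	reqCtx, cancel := context.WithTimeout(base, time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	err := slowAppend(reqCtx, time.Duration(workMs)*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("slow request timed out:", appendTimesOut(50, 2000))
	fmt.Println("fast request timed out:", appendTimesOut(2000, 1))
}
```
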
# Connection Sharing (Multiplexing)

Note: This feature is EXPERIMENTAL and subject to change.

The BigQuery Write API enforces a limit on the number of concurrent open connections, documented
here: https://cloud.google.com/bigquery/quotas#write-api-limits

Users can choose to enable connection sharing (multiplexing) when using ManagedStream writers
that use default streams.  The intent of this feature is to simplify connection management for users
who wish to write to many tables, at a cardinality beyond the open connection quota.  Please note that
explicit streams (Committed, Buffered, and Pending) cannot leverage the connection sharing feature.

Multiplexing features are controlled by the package-specific custom ClientOption options exposed within
this package.  Additionally, some of the connection-related WriterOptions that can be specified when
constructing ManagedStream writers are ignored for writers that leverage the shared multiplex connections.

At a high level, multiplexing uses some heuristics based on the flow control of the shared connections
to infer whether the pool should add additional connections up to a user-specified limit per region,
and attempts to balance traffic from writers to those connections.

To enable multiplexing for writes to default streams, simply instantiate the client with the desired options:

	ctx := context.Background()
	client, err := managedwriter.NewClient(ctx, projectID,
		managedwriter.WithMultiplexing(),
		managedwriter.WithMultiplexPoolLimit(3),
	)
	if err != nil {
		// TODO: Handle error.
	}

Special Consideration:  The gRPC architecture is capable of its own sharing of underlying HTTP/2 connections.
Users who are sending significant traffic on multiple writers (independent of whether they're leveraging
multiplexing or not) may also wish to consider further tuning of this behavior.  The managedwriter library
sets a reasonable default, but this can be tuned further by leveraging the WithGRPCConnectionPool ClientOption,
documented here:
https://pkg.go.dev/google.golang.org/api/option#WithGRPCConnectionPool

A reasonable upper bound for the connection pool size is the number of concurrent writers for explicit streams
plus the configured size of the multiplex pool.

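The sizing rule of thumb above is simple arithmetic, sketched here for concreteness.
poolSizeUpperBound is a hypothetical helper, and the workload numbers are made up for
illustration.

```go
package main

import "fmt"

// poolSizeUpperBound returns the suggested ceiling described above: the number
// of concurrent writers on explicit streams plus the multiplex pool size.
// Negative inputs are treated as zero.
func poolSizeUpperBound(explicitWriters, multiplexPoolLimit int) int {
	if explicitWriters < 0 {
		explicitWriters = 0
	}
	if multiplexPoolLimit < 0 {
		multiplexPoolLimit = 0
	}
	return explicitWriters + multiplexPoolLimit
}

func main() {
	// Hypothetical workload: 5 writers on explicit streams and a multiplex
	// pool limit of 3.
	fmt.Println("suggested upper bound:", poolSizeUpperBound(5, 3))
}
```

The resulting value is a candidate argument for the WithGRPCConnectionPool option mentioned
above; whether a smaller pool suffices depends on the traffic each writer generates.
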
# Writing JSON Data

As an example, you can refer to this integration test that demonstrates writing JSON data to a stream:
https://github.com/googleapis/google-cloud-go/blob/7a46b5428f239871993d66be2c7c667121f60a6f/bigquery/storage/managedwriter/integration_test.go#L397

This integration test assumes the destination table already exists. In addition, it relies upon having a definition of
a BigQuery schema that is compatible with this table (for this example the schema is defined here:
https://github.com/googleapis/google-cloud-go/blob/2020edff24e3ffe127248cf9a90c67593c303e18/bigquery/storage/managedwriter/testdata/schemas.go#L31).

Given the schema, the test first uses the function setupDynamicDescriptors() to derive both a MessageDescriptor
and DescriptorProto from the schema. This function is defined here:
https://github.com/googleapis/google-cloud-go/blob/7a46b5428f239871993d66be2c7c667121f60a6f/bigquery/storage/managedwriter/integration_test.go#L100

The test initializes the ManagedStream it will write to with the derived DescriptorProto. It then iterates
through each of the JSON rows to be written. For each row, it first dynamically creates an empty Message based on
the derived MessageDescriptor. Then it loads the JSON row into the Message. Finally, it generates protocol buffer
bytes from the Message. These bytes are then sent to the ManagedStream within an AppendRows request.
*/
package managedwriter