// Copyright 2021 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

/*
Package managedwriter provides a thick client around the BigQuery storage API's BigQueryWriteClient.
More information about this new write client may also be found in the public documentation: https://cloud.google.com/bigquery/docs/write-api

Currently, this client targets the BigQueryWriteClient present in the v1 endpoint, and is intended as a more
feature-rich successor to the classic BigQuery streaming interface, which is presented as the Inserter abstraction
in cloud.google.com/go/bigquery, and the tabledata.insertAll method if you're more familiar with the BigQuery v2 REST
methods.

# Creating a Client

To start working with this package, create a client:

	ctx := context.Background()
	client, err := managedwriter.NewClient(ctx, projectID)
	if err != nil {
		// TODO: Handle error.
	}

# Defining the Protocol Buffer Schema

The write functionality of BigQuery Storage requires data to be sent using encoded
protocol buffer messages using proto2 wire format. As the protocol buffer is not
self-describing, you will need to provide the protocol buffer schema.
This is communicated using a DescriptorProto message, defined within the protocol
buffer libraries: https://pkg.go.dev/google.golang.org/protobuf/types/descriptorpb#DescriptorProto

More information about protocol buffers can be found in the proto2 language guide:
https://developers.google.com/protocol-buffers/docs/proto

Details about data type conversions between BigQuery and protocol buffers can be
found in the public documentation: https://cloud.google.com/bigquery/docs/write-api#data_type_conversions

For cases where the protocol buffer is compiled from a static ".proto" definition,
this process is straightforward. Instantiate an example message, then convert the
descriptor into a descriptor proto:

	m := &myprotopackage.MyCompiledMessage{}
	descriptorProto := protodesc.ToDescriptorProto(m.ProtoReflect().Descriptor())

If the message uses advanced protocol buffer features like nested messages/groups,
or enums, the cloud.google.com/go/bigquery/storage/managedwriter/adapt subpackage
contains functionality to normalize the descriptor into a self-contained definition:

	m := &myprotopackage.MyCompiledMessage{}
	descriptorProto, err := adapt.NormalizeDescriptor(m.ProtoReflect().Descriptor())
	if err != nil {
		// TODO: Handle error.
	}

The adapt subpackage also contains functionality for generating a DescriptorProto using
a BigQuery table's schema directly.

# Constructing a ManagedStream

The ManagedStream handles management of the underlying write connection to the BigQuery
Storage service. You can either create a write session explicitly and pass it in, or
create the write stream while setting up the ManagedStream.

It's easiest to register the protocol buffer descriptor you'll be using to send data when
setting up the managed stream using the WithSchemaDescriptor option, though you can also
set/change the schema as part of an append request once the ManagedStream is created.

	// Create a ManagedStream using an explicit stream identifier, either a default
	// stream or one explicitly created by CreateWriteStream.
	managedStream, err := client.NewManagedStream(ctx,
		managedwriter.WithStreamName(streamName),
		managedwriter.WithSchemaDescriptor(descriptorProto))
	if err != nil {
		// TODO: Handle error.
	}

In addition, NewManagedStream can create new streams implicitly:

	// Alternately, allow the ManagedStream to handle stream construction by supplying
	// additional options.
	tableName := fmt.Sprintf("projects/%s/datasets/%s/tables/%s", myProject, myDataset, myTable)
	managedStream, err := client.NewManagedStream(ctx,
		managedwriter.WithDestinationTable(tableName),
		managedwriter.WithType(managedwriter.BufferedStream),
		managedwriter.WithSchemaDescriptor(descriptorProto))
	if err != nil {
		// TODO: Handle error.
	}

# Writing Data

Use the AppendRows function to write one or more serialized proto messages to a stream. You
can choose to specify an offset in the stream to handle de-duplication for user-created streams,
but a "default" stream neither accepts nor reports offsets.

AppendRows returns a future-like object that blocks until the write is successful or yields
an error.

	// Define a couple of messages.
	mesgs := []*myprotopackage.MyCompiledMessage{
		{
			UserName:        proto.String("johndoe"),
			EmailAddress:    proto.String("jd@mycompany.mydomain"),
			FavoriteNumbers: []int64{1, 42, 12345},
		},
		{
			UserName:        proto.String("janesmith"),
			EmailAddress:    proto.String("smith@othercompany.otherdomain"),
			FavoriteNumbers: []int64{1, 3, 5, 7, 9},
		},
	}

	// Encode the messages into binary format.
	encoded := make([][]byte, len(mesgs))
	for k, v := range mesgs {
		b, err := proto.Marshal(v)
		if err != nil {
			// TODO: Handle error.
		}
		encoded[k] = b
	}

	// Send the rows to the service, and specify an offset for managing deduplication.
	result, err := managedStream.AppendRows(ctx, encoded, managedwriter.WithOffset(0))
	if err != nil {
		// TODO: Handle error.
	}

	// Block until the write is complete and return the result.
	returnedOffset, err := result.GetResult(ctx)
	if err != nil {
		// TODO: Handle error.
	}

# Buffered Stream Management

For Buffered streams, users control when data is made visible in the destination table/stream
independently of when it is written. Use FlushRows on the ManagedStream to advance the flush
point ahead in the stream.

	// We've written 1500+ rows in the stream, and want to advance the flush point
	// ahead to make the first 1000 rows available.
	flushOffset, err := managedStream.FlushRows(ctx, 1000)
	if err != nil {
		// TODO: Handle error.
	}

# Pending Stream Management

Pending streams allow users to commit data from multiple streams together once the streams
have been finalized, meaning they'll no longer allow further data writes.

	// First, finalize the stream we're writing into.
	totalRows, err := managedStream.Finalize(ctx)
	if err != nil {
		// TODO: Handle error.
	}

	req := &storagepb.BatchCommitWriteStreamsRequest{
		Parent:       parentName,
		WriteStreams: []string{managedStream.StreamName()},
	}
	// Using the client, we can commit data from multiple streams to the same
	// table atomically.
	resp, err := client.BatchCommitWriteStreams(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

# Error Handling and Automatic Retries

Like other Google Cloud services, this API relies on common components that can provide an
enhanced set of errors when communicating about the results of API interactions.

Specifically, the apierror package (https://pkg.go.dev/github.com/googleapis/gax-go/v2/apierror)
provides convenience methods for extracting structured information about errors.

The BigQuery Storage API service augments applicable errors with service-specific details in
the form of a StorageError message. The StorageError message is accessed via the ExtractProtoMessage
method in the apierror package. Note that the StorageError message does not implement Go's error
interface.

An example of accessing the structured error details:

	// By way of example, let's assume the response from an append call returns an error.
	_, err := result.GetResult(ctx)
	if err != nil {
		if apiErr, ok := apierror.FromError(err); ok {
			// We now have an instance of APIError, which directly exposes more specific
			// details about multiple failure conditions, including transport-level errors.
			storageErr := &storagepb.StorageError{}
			// ExtractProtoMessage returns a nil error when a matching detail was found.
			if e := apiErr.Details().ExtractProtoMessage(storageErr); e == nil {
				// storageErr now contains service-specific information about the error.
				log.Printf("Received service-specific error code %s", storageErr.GetCode().String())
			}
		}
	}

This library supports the ability to retry failed append requests, but this functionality is not
enabled by default. You can enable it via the EnableWriteRetries option when constructing a new
managed stream. Use of automatic retries can impact correctness when attempting certain exactly-once
write patterns, but is generally recommended for workloads that only need at-least-once writing.

With write retries enabled, failed writes will be automatically attempted a finite number of times
(currently 4) if the failure is considered retriable.
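
The bounded-retry behavior described above can be sketched with the standard library alone.
This is an illustration only, not the library's implementation: appendWithRetry, flakySend,
errRetriable, and maxAttempts are hypothetical names, and reading the quoted figure of 4 as a
cap on total attempts is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// maxAttempts mirrors the "currently 4" figure quoted in the text. Treating it
// as a cap on total attempts is an assumption of this sketch; it is not a
// constant exported by managedwriter.
const maxAttempts = 4

// errRetriable is a hypothetical marker for failures that are worth retrying.
var errRetriable = errors.New("retriable failure")

// appendWithRetry calls send until it succeeds, fails with a non-retriable
// error, or uses up maxAttempts. It returns the number of attempts made.
func appendWithRetry(send func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = send()
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errRetriable) {
			// Non-retriable failures are surfaced immediately.
			return attempt, err
		}
	}
	return maxAttempts, err
}

// flakySend returns a send function that fails with errRetriable for the
// first n calls and succeeds afterwards.
func flakySend(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errRetriable
		}
		return nil
	}
}

func main() {
	attempts, err := appendWithRetry(flakySend(2))
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

The attempt count returned by appendWithRetry plays a role similar to TotalAttempts() on an
AppendResult: a value larger than 1 means the append was sent more than once.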

In support of the retry changes, the AppendResult returned as part of an append call now includes
TotalAttempts(), which returns the number of times that specific append was enqueued to the service.
Values larger than 1 are indicative of a specific append being enqueued multiple times.

# Usage of Contexts

The underlying RPC mechanism used to transmit requests and responses between this client and
the service uses a gRPC bidirectional streaming protocol, and the context provided when invoking
NewClient to instantiate the client is used to maintain those background connections.

This package also accepts a context when instantiating a new writer (NewManagedStream). If that
context becomes invalid, all subsequent AppendRows requests on the writer will be blocked.

Finally, there is a per-request context supplied as part of the AppendRows call on the ManagedStream
writer itself, useful for bounding individual requests.

# Connection Sharing (Multiplexing)

Note: This feature is EXPERIMENTAL and subject to change.

The BigQuery Write API enforces a limit on the number of concurrent open connections, documented
here: https://cloud.google.com/bigquery/quotas#write-api-limits

Users can now choose to enable connection sharing (multiplexing) when using ManagedStream writers
that use default streams. The intent of this feature is to simplify connection management for users
who wish to write to many tables, at a cardinality beyond the open connection quota. Please note that
explicit streams (Committed, Buffered, and Pending) cannot leverage the connection sharing feature.

Multiplexing features are controlled by the package-specific custom ClientOption options exposed within
this package.
Additionally, some of the connection-related WriterOptions that can be specified when
constructing ManagedStream writers are ignored for writers that leverage the shared multiplex connections.

At a high level, multiplexing uses some heuristics based on the flow control of the shared connections
to infer whether the pool should add additional connections up to a user-specified limit per region,
and attempts to balance traffic from writers to those connections.

To enable multiplexing for writes to default streams, simply instantiate the client with the desired options:

	ctx := context.Background()
	client, err := managedwriter.NewClient(ctx, projectID,
		managedwriter.WithMultiplexing(),
		managedwriter.WithMultiplexPoolLimit(3),
	)
	if err != nil {
		// TODO: Handle error.
	}

Special Consideration: The gRPC architecture is capable of its own sharing of underlying HTTP/2 connections.
Users who are sending significant traffic on multiple writers (independent of whether they're leveraging
multiplexing or not) may also wish to consider further tuning of this behavior. The managedwriter library
sets a reasonable default, but this can be tuned further by leveraging the WithGRPCConnectionPool ClientOption,
documented here:
https://pkg.go.dev/google.golang.org/api/option#WithGRPCConnectionPool

A reasonable upper bound for the connection pool size is the number of concurrent writers for explicit streams
plus the configured size of the multiplex pool.

# Writing JSON Data

As an example, you can refer to this integration test that demonstrates writing JSON data to a stream:
https://github.com/googleapis/google-cloud-go/blob/7a46b5428f239871993d66be2c7c667121f60a6f/bigquery/storage/managedwriter/integration_test.go#L397

This integration test assumes the destination table already exists.
In addition, it relies upon having a definition of
a BigQuery schema that is compatible with this table (for this example the schema is defined here:
https://github.com/googleapis/google-cloud-go/blob/2020edff24e3ffe127248cf9a90c67593c303e18/bigquery/storage/managedwriter/testdata/schemas.go#L31).
Given the schema, this test first utilizes the function setupDynamicDescriptors() to derive both a MessageDescriptor
and DescriptorProto from the schema. This function is defined here:
https://github.com/googleapis/google-cloud-go/blob/7a46b5428f239871993d66be2c7c667121f60a6f/bigquery/storage/managedwriter/integration_test.go#L100
The test initializes the ManagedStream it will write to with the derived DescriptorProto. The test then iterates
through each of the JSON rows to be written. For each row, it first dynamically creates an empty Message based on
the derived MessageDescriptor. Then it loads the JSON row into the Message. Finally it generates protocol buffer
bytes from the Message. These bytes are then sent to the ManagedStream within an AppendRows request.
*/
package managedwriter