
gRPC vs REST — trade-offs

Protocol analysis for the engine API. REST is the default for HTTP/JSON; gRPC is optional, for high-throughput worker streaming. The two are not mutually exclusive: both coexist on top of the same backend logic.

Context

Camunda Zeebe uses gRPC. That decision has these trade-offs:

  • ✅ Efficient bidirectional streaming for workers.
  • ✅ Fast binary protocol.
  • ❌ Adoption friction (firewalls, gateway proxies, dev tools).
  • ❌ Poor browser support (needs grpc-web).

Our initial decision (analysis/rest-api-design): REST as the default, because it is better for adoption and developer experience (DX). Consider gRPC as a complement for worker streaming.

Technical comparison

| Dimension | REST/HTTP | gRPC |
|---|---|---|
| Transport | HTTP/1.1 or HTTP/2 | HTTP/2 (mandatory) |
| Serialization | JSON (text) | Protobuf (binary) |
| Schema | OpenAPI (optional) | .proto (mandatory) |
| Browser support | Native | Requires grpc-web proxy |
| Curl-able | Yes | No (needs grpcurl) |
| Streaming | SSE / WebSocket | Native bidi |
| Bandwidth | Higher (~2-5×) | Lower |
| CPU (serialization) | Medium | Low |
| Code generation | Optional (OpenAPI codegen) | Mandatory (protoc) |
| Tooling ecosystem | Massive | Smaller |
| Adoption friction | Low | Medium |
| Debugging | Trivial | Tools required |

Typical benchmarks

GET single instance (1KB payload):
  REST/JSON:    p50  3.2ms   p99  18ms
  gRPC/proto:   p50  2.5ms   p99  12ms
  (~25% faster gRPC, but both <20ms)

Batch fetch 1000 instances (1MB payload):
  REST/JSON:    p50  35ms    bandwidth 1.2 MB
  gRPC/proto:   p50  12ms    bandwidth 350 KB
  (gRPC 3× faster, 3× less bandwidth)

Streaming jobs (10 jobs/sec sustained):
  REST/long-poll:  CPU 5%   latency p99 80ms
  gRPC streaming:  CPU 2%   latency p99 25ms
  (gRPC clearly better for high-throughput streaming)
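
The bandwidth gap in these figures comes largely from the encoding: JSON repeats field names and writes numbers as text, while a schema-based binary format writes compact values only. The sketch below illustrates that effect with the standard library. It is not Protobuf: `encodeBinary` is a hypothetical stand-in that uses varints and a length-prefixed string, and the `Job` fields are assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// Job is a hypothetical record used only for this size comparison.
type Job struct {
	Key                int64  `json:"key"`
	ProcessInstanceKey int64  `json:"processInstanceKey"`
	Retries            int32  `json:"retries"`
	Type               string `json:"type"`
}

// encodeJSON returns the JSON text encoding of j.
func encodeJSON(j Job) []byte {
	b, _ := json.Marshal(j) // Job contains only marshalable types
	return b
}

// encodeBinary writes varint-encoded integers followed by a
// length-prefixed string. Stand-in for a binary format, not real Protobuf.
func encodeBinary(j Job) []byte {
	b := binary.AppendVarint(nil, j.Key)
	b = binary.AppendVarint(b, j.ProcessInstanceKey)
	b = binary.AppendVarint(b, int64(j.Retries))
	b = binary.AppendUvarint(b, uint64(len(j.Type)))
	return append(b, j.Type...)
}

func main() {
	j := Job{Key: 2251799813685249, ProcessInstanceKey: 2251799813685250, Retries: 3, Type: "charge-payment"}
	fmt.Println("json bytes:", len(encodeJSON(j)))
	fmt.Println("binary bytes:", len(encodeBinary(j)))
}
```

Real ratios depend on the payload shape, so the numbers printed here should not be read as a benchmark.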

When to use which

REST: default

  • ✅ Browser → engine (web UI, no proxy needed).
  • ✅ Curl / Postman / dev tools.
  • ✅ Single-shot operations (deploy, start, list).
  • ✅ Webhooks (callbacks from external systems).
  • ✅ Public-facing APIs.

gRPC: optional, performance-critical

  • ✅ Worker → engine streaming (high-throughput jobs).
  • ✅ Service mesh internal communication.
  • ✅ Engine ↔ engine (multi-node cluster).
  • ✅ Mobile clients (bandwidth-sensitive).

Design: both coexist

flowchart LR
    REST["REST API<br/>/api/v1"]
    GRPC["gRPC API<br/>:9090"]
    subgraph EngineCore[Engine core]
        HTTP[HTTP handlers]
        GH[gRPC handlers]
        Engine[Engine]
    end
    PG[(Postgres)]
    REST --> HTTP --> Engine
    GRPC --> GH --> Engine
    Engine --> PG

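Same business logic, two surfaces. Do NOT duplicate code:

// internal/api/handlers.go
// StartInstance is the transport-agnostic business logic shared by both surfaces.
func StartInstance(ctx context.Context, req StartInstanceRequest) (*Instance, error) {
    // Business logic
}

// internal/api/rest/handler.go
// decodeJSON, writeJSON and writeError are helpers assumed to exist in this package.
func RestStartInstance(w http.ResponseWriter, r *http.Request) {
    req, err := decodeJSON(r)
    if err != nil {
        writeError(w, err) // malformed body → 400
        return
    }
    resp, err := api.StartInstance(r.Context(), req)
    if err != nil {
        writeError(w, err) // domain error → HTTP status + problem+json
        return
    }
    writeJSON(w, resp)
}

// internal/api/grpc/handler.go
// Method and message names follow the .proto service definition.
func (s *Server) StartProcessInstance(ctx context.Context, req *pb.StartProcessInstanceRequest) (*pb.ProcessInstance, error) {
    resp, err := api.StartInstance(ctx, convertFromProto(req))
    if err != nil {
        return nil, err // convert to a gRPC status error before returning
    }
    return convertToProto(resp), nil
}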
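
A self-contained sketch of the same pattern, runnable with only the standard library, can make the split easier to test. The request fields, the `ErrInvalid` sentinel, the 201 status and the `callREST` helper are assumptions for illustration; a gRPC adapter would call `StartInstance` in the same way and map the error to a gRPC status instead.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Hypothetical request and response types for the example.
type StartInstanceRequest struct {
	ProcessID string `json:"processId"`
}

type Instance struct {
	Key       int64  `json:"key"`
	ProcessID string `json:"processId"`
}

var ErrInvalid = errors.New("processId is required")

// StartInstance is the single, transport-agnostic implementation.
func StartInstance(ctx context.Context, req StartInstanceRequest) (*Instance, error) {
	if req.ProcessID == "" {
		return nil, ErrInvalid
	}
	return &Instance{Key: 1, ProcessID: req.ProcessID}, nil
}

// RestStartInstance only translates HTTP/JSON to and from the shared function.
func RestStartInstance(w http.ResponseWriter, r *http.Request) {
	var req StartInstanceRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}
	inst, err := StartInstance(r.Context(), req)
	if errors.Is(err, ErrInvalid) {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusCreated)
	json.NewEncoder(w).Encode(inst)
}

// callREST exercises the handler in-process, without a network listener.
func callREST(body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/v1/instances", strings.NewReader(body))
	RestStartInstance(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	code, body := callREST(`{"processId":"order"}`)
	fmt.Println(code, body)
}
```

Because the adapter holds no business rules, the validation in `StartInstance` is tested once and applies to every surface.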

gRPC API design (if implemented)

Service definition (.proto)

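syntax = "proto3";
package workflow.v1;

// Required for google.protobuf.Empty and google.protobuf.Value used below.
import "google/protobuf/empty.proto";
import "google/protobuf/struct.proto";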

service WorkflowEngine {
    // Process management
    rpc DeployProcess(DeployProcessRequest) returns (DeployProcessResponse);
    rpc StartProcessInstance(StartProcessInstanceRequest) returns (ProcessInstance);
    rpc CancelInstance(CancelInstanceRequest) returns (google.protobuf.Empty);

    // Job worker
    rpc ActivateJobs(ActivateJobsRequest) returns (stream ActivatedJob);  // streaming
    rpc CompleteJob(CompleteJobRequest) returns (google.protobuf.Empty);
    rpc FailJob(FailJobRequest) returns (google.protobuf.Empty);
    rpc ThrowError(ThrowErrorRequest) returns (google.protobuf.Empty);

    // Variables
    rpc SetVariables(SetVariablesRequest) returns (google.protobuf.Empty);

    // Message
    rpc PublishMessage(PublishMessageRequest) returns (PublishMessageResponse);
}

message ActivateJobsRequest {
    string type = 1;
    string worker = 2;
    int32 max_jobs = 3;
    int64 timeout_ms = 4;
    repeated string fetch_variables = 5;
    string tenant_id = 6;
}

message ActivatedJob {
    int64 key = 1;
    string type = 2;
    int64 process_instance_key = 3;
    string element_id = 4;
    map<string, string> headers = 5;
    map<string, google.protobuf.Value> variables = 6;
    int32 retries = 7;
    int64 deadline_ms = 8;
    string tenant_id = 9;
}

Streaming workers

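// Worker side
stream, err := client.ActivateJobs(ctx, &pb.ActivateJobsRequest{
    Type:    "charge-payment",
    Worker:  "worker-1",
    MaxJobs: 32,
})
if err != nil {
    log.Fatalf("activate jobs: %v", err)
}

for {
    job, err := stream.Recv()
    if err == io.EOF {
        break // server closed the stream
    }
    if err != nil {
        log.Printf("stream error: %v", err) // reconnect with backoff in real code
        break
    }

    go handleJob(job) // process concurrently
}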

The engine pushes jobs as they become available. The worker receives them in order and processes them in parallel (goroutines).
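
Starting one goroutine per job is unbounded, so a burst of jobs can exhaust worker memory or downstream connections. One common way to cap parallelism is a channel used as a semaphore. The sketch below is an assumption about how a worker might do that, not part of the engine API; `processAll` and `runDemo` are hypothetical names, and jobs are plain integers to keep it runnable without gRPC.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs handle for every job received on jobs, with at most
// limit handlers in flight. It returns the peak concurrency observed,
// which is tracked here only for demonstration.
func processAll(jobs <-chan int, limit int, handle func(int)) int {
	sem := make(chan struct{}, limit)
	var (
		wg             sync.WaitGroup
		mu             sync.Mutex
		inFlight, peak int
	)
	for job := range jobs {
		sem <- struct{}{} // blocks while limit handlers are running
		wg.Add(1)
		go func(j int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot last
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			handle(j)
			mu.Lock()
			inFlight--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

// runDemo feeds n jobs through processAll and reports how many were
// handled and the peak concurrency.
func runDemo(n, limit int) (handled, peak int) {
	jobs := make(chan int)
	go func() {
		for i := 0; i < n; i++ {
			jobs <- i
		}
		close(jobs)
	}()
	var mu sync.Mutex
	peak = processAll(jobs, limit, func(int) {
		mu.Lock()
		handled++
		mu.Unlock()
	})
	return handled, peak
}

func main() {
	handled, peak := runDemo(100, 4)
	fmt.Println("handled:", handled, "peak within limit:", peak <= 4)
}
```

Because the receive loop blocks on the semaphore, a slow worker stops reading from the stream, which gives natural backpressure toward the engine.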

Comparison with REST long-polling:

REST:
  Client → GET /api/v1/jobs?type=charge-payment (long poll)
  Server holds connection up to 30s, returns jobs as JSON when available
  Connection closes; client re-opens it.

gRPC:
  Client → ActivateJobs (stream)
  Server holds the stream open (server-streaming in the .proto above)
  Multiple jobs sent over the same stream
  With a bidirectional variant, worker and server could also exchange pings and capacity updates

gRPC wins in clusters with thousands of workers and high throughput.
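
The REST side of this comparison can be sketched with the standard library. The handler below is a minimal illustration, assuming jobs arrive on a Go channel and that an empty poll is answered with 204 No Content; the `Job` type, the `poll` helper and the choice of 204 are assumptions, not the engine's actual contract.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Job is a hypothetical, minimal job payload.
type Job struct {
	Key  int64  `json:"key"`
	Type string `json:"type"`
}

// longPollHandler waits up to maxWait for a job. It answers 200 with the
// job as a JSON array, or 204 if nothing arrived in time.
func longPollHandler(jobs <-chan Job, maxWait time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		timer := time.NewTimer(maxWait)
		defer timer.Stop()
		select {
		case job := <-jobs:
			w.Header().Set("Content-Type", "application/json")
			json.NewEncoder(w).Encode([]Job{job})
		case <-timer.C:
			w.WriteHeader(http.StatusNoContent) // client re-polls
		case <-r.Context().Done():
			// Client went away; nothing to write.
		}
	}
}

// poll runs one in-process long-poll and returns the HTTP status.
func poll(jobs <-chan Job, waitMs int) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/v1/jobs?type=charge-payment", nil)
	longPollHandler(jobs, time.Duration(waitMs)*time.Millisecond)(rec, req)
	return rec.Code
}

func main() {
	jobs := make(chan Job, 1)
	jobs <- Job{Key: 1, Type: "charge-payment"}
	fmt.Println(poll(jobs, 50)) // a job is waiting
	fmt.Println(poll(jobs, 10)) // nothing arrives before the timeout
}
```

Each poll costs a full request/response cycle, which is the overhead the streaming variant avoids.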

Authentication headers

REST

GET /api/v1/instances/123 HTTP/1.1
Authorization: Bearer eyJ...
X-Tenant-ID: acme
X-Trace-ID: abc-123

gRPC

metadata:
  authorization: Bearer eyJ...
  x-tenant-id: acme
  x-trace-id: abc-123

Convention: lowercase keys in gRPC metadata (HTTP/2 requires lowercase header names).
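
When one backend serves both surfaces, it can help to normalize incoming REST headers into the same lowercase-keyed shape that gRPC metadata uses, so downstream code reads a single format. A minimal sketch, assuming a plain `map[string]string` as the shared representation; `toMetadata` and `demoMetadata` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// toMetadata copies the named HTTP headers into a map with lowercase
// keys. Headers that are absent or empty are skipped.
func toMetadata(h http.Header, keys ...string) map[string]string {
	md := make(map[string]string, len(keys))
	for _, k := range keys {
		if v := h.Get(k); v != "" {
			md[strings.ToLower(k)] = v
		}
	}
	return md
}

// demoMetadata builds the REST headers from the example above and
// normalizes them.
func demoMetadata() map[string]string {
	h := http.Header{}
	h.Set("Authorization", "Bearer eyJ...")
	h.Set("X-Tenant-ID", "acme")
	h.Set("X-Trace-ID", "abc-123")
	return toMetadata(h, "Authorization", "X-Tenant-ID", "X-Trace-ID")
}

func main() {
	md := demoMetadata()
	fmt.Println(md["authorization"], md["x-tenant-id"], md["x-trace-id"])
}
```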

Error handling

REST: HTTP status + RFC 7807 (obsoleted by RFC 9457)

HTTP/1.1 404 Not Found
Content-Type: application/problem+json

{
  "type": "https://docs.example.com/errors/instance-not-found",
  "title": "Instance not found",
  "status": 404,
  "detail": "Instance 12345 not found in tenant acme",
  "instance": "/api/v1/instances/12345"
}
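
A response like this one can be produced by a small helper. The sketch below is one possible shape, assuming a `Problem` struct with the five standard members; `writeProblem` and `demoProblem` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Problem carries the standard problem-details members.
type Problem struct {
	Type     string `json:"type"`
	Title    string `json:"title"`
	Status   int    `json:"status"`
	Detail   string `json:"detail,omitempty"`
	Instance string `json:"instance,omitempty"`
}

// writeProblem sends p with the problem+json media type and uses
// p.Status as the HTTP status code.
func writeProblem(w http.ResponseWriter, p Problem) {
	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(p.Status)
	json.NewEncoder(w).Encode(p)
}

// demoProblem writes the 404 example and returns what a client would see.
func demoProblem() (status int, contentType string) {
	rec := httptest.NewRecorder()
	writeProblem(rec, Problem{
		Type:     "https://docs.example.com/errors/instance-not-found",
		Title:    "Instance not found",
		Status:   http.StatusNotFound,
		Detail:   "Instance 12345 not found in tenant acme",
		Instance: "/api/v1/instances/12345",
	})
	return rec.Code, rec.Header().Get("Content-Type")
}

func main() {
	status, contentType := demoProblem()
	fmt.Println(status, contentType)
}
```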

gRPC: status codes

import "google.golang.org/grpc/codes"
import "google.golang.org/grpc/status"

return nil, status.Errorf(codes.NotFound, "instance %d not found", id)

Status code mapping:

| REST | gRPC | Use |
|---|---|---|
| 200 | OK | Success |
| 400 | InvalidArgument | Bad input |
| 401 | Unauthenticated | No auth |
| 403 | PermissionDenied | Authenticated but not allowed |
| 404 | NotFound | Resource missing |
| 409 | AlreadyExists / FailedPrecondition | Conflict |
| 429 | ResourceExhausted | Rate limit |
| 500 | Internal | Bug |
| 503 | Unavailable | Backpressure / down |
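
Keeping this mapping in one function helps both surfaces report the same failure consistently. The sketch below encodes the table using gRPC code names as strings so that it runs without the gRPC module; in real code the switch would be over `codes.Code` values. `httpStatusFor` is a hypothetical helper, and falling back to 500 for unlisted codes is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// httpStatusFor maps a gRPC status code name to the HTTP status used by
// the REST surface. Unlisted codes fall back to 500.
func httpStatusFor(grpcCode string) int {
	switch grpcCode {
	case "OK":
		return http.StatusOK
	case "InvalidArgument":
		return http.StatusBadRequest
	case "Unauthenticated":
		return http.StatusUnauthorized
	case "PermissionDenied":
		return http.StatusForbidden
	case "NotFound":
		return http.StatusNotFound
	case "AlreadyExists", "FailedPrecondition":
		return http.StatusConflict
	case "ResourceExhausted":
		return http.StatusTooManyRequests
	case "Unavailable":
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	for _, c := range []string{"OK", "NotFound", "ResourceExhausted", "Internal"} {
		fmt.Println(c, httpStatusFor(c))
	}
}
```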

Error details

gRPC allows structured error details via google.rpc.Status. One standard detail type is google.rpc.ErrorInfo:

message ErrorInfo {
    string reason = 1;       // "TENANT_QUOTA_EXCEEDED"
    string domain = 2;       // "workflow.example.com"
    map<string, string> metadata = 3;
}

The client can inspect the details:

// Assumes the standard type from
// google.golang.org/genproto/googleapis/rpc/errdetails.
if st, ok := status.FromError(err); ok {
    for _, detail := range st.Details() {
        if info, ok := detail.(*errdetails.ErrorInfo); ok {
            if info.Reason == "TENANT_QUOTA_EXCEEDED" {
                // handle
            }
        }
    }
}

REST equivalent: extension fields in problem+json.
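
On the REST side, a client can read such extension members straight out of the problem document. The sketch below assumes the extension members are named `reason` and `domain`, mirroring the ErrorInfo fields; those names and the `isQuotaExceeded` helper are illustrative choices, not an established contract.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// quotaProblem decodes the standard members plus two assumed extension
// members that mirror ErrorInfo.
type quotaProblem struct {
	Title  string `json:"title"`
	Status int    `json:"status"`
	Reason string `json:"reason"` // extension member (assumed name)
	Domain string `json:"domain"` // extension member (assumed name)
}

// isQuotaExceeded reports whether a problem+json body carries the
// TENANT_QUOTA_EXCEEDED reason. Malformed bodies yield false.
func isQuotaExceeded(body []byte) bool {
	var p quotaProblem
	if err := json.Unmarshal(body, &p); err != nil {
		return false
	}
	return p.Reason == "TENANT_QUOTA_EXCEEDED"
}

func main() {
	body := []byte(`{"title":"Quota exceeded","status":429,"reason":"TENANT_QUOTA_EXCEEDED","domain":"workflow.example.com"}`)
	fmt.Println(isQuotaExceeded(body))
}
```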

TLS and mTLS

REST: TLS 1.3 standard.

gRPC: TLS 1.3 plus optional mTLS (mutual TLS).

mTLS use case: workers authenticate with a client certificate and the engine validates it. No bearer tokens needed.

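// Server
creds := credentials.NewTLS(&tls.Config{
    Certificates: []tls.Certificate{serverCert}, // the server's own certificate
    ClientAuth:   tls.RequireAndVerifyClientCert,
    ClientCAs:    caPool,
    MinVersion:   tls.VersionTLS13,
})
grpcServer := grpc.NewServer(grpc.Creds(creds))

// Worker
creds := credentials.NewTLS(&tls.Config{
    Certificates: []tls.Certificate{clientCert},
    RootCAs:      serverCAPool,
    MinVersion:   tls.VersionTLS13,
})
// grpc.NewClient replaces the deprecated grpc.Dial in recent grpc-go releases.
conn, err := grpc.NewClient("wf.example.com:9090", grpc.WithTransportCredentials(creds))
if err != nil {
    log.Fatalf("connect to engine: %v", err)
}

Suited to zero-trust networks (service mesh with SPIFFE/SPIRE).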
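
The snippets above take `caPool` and `serverCAPool` as given. One way to build such a pool from PEM bytes with the standard library is sketched below. `newCAPool` is a hypothetical helper, and `selfSignedPEM` exists only to give the example a certificate to load; a real deployment would read the CA bundle from disk or from the mesh.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// newCAPool parses PEM-encoded certificates into a pool usable as
// tls.Config.ClientCAs or tls.Config.RootCAs.
func newCAPool(pemBytes []byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no valid certificates found in PEM input")
	}
	return pool, nil
}

// selfSignedPEM creates a throwaway CA certificate for the demo only.
func selfSignedPEM() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	pemBytes, err := selfSignedPEM()
	if err != nil {
		panic(err)
	}
	pool, err := newCAPool(pemBytes)
	fmt.Println(pool != nil, err)
}
```

Returning an error when no certificate parses avoids silently starting a server that trusts nobody, or a client that cannot verify the engine.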

Load balancing

REST: stateless, simple LB (HAProxy, nginx, ALB).

gRPC: one HTTP/2 connection multiplexes many streams, so the LB needs to be HTTP/2-aware:

  • L4 LB (TCP): connections are sticky, so there is no per-request balancing.
  • L7 LB (Envoy, Linkerd): balances individual streams across backends.

For gRPC, prefer Envoy / Linkerd / Istio (service mesh).
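
The difference is easy to see in a toy model. The sketch below counts requests per backend for a few long-lived connections, once with a balancer that pins each connection to a backend (L4) and once with a balancer that chooses per request (L7). Round-robin selection and the function names are assumptions made for the illustration.

```go
package main

import "fmt"

// l4Distribution pins each connection to one backend, chosen round-robin
// at connect time. Every request on that connection hits the same backend.
func l4Distribution(backends, conns, reqsPerConn int) []int {
	counts := make([]int, backends)
	for c := 0; c < conns; c++ {
		counts[c%backends] += reqsPerConn
	}
	return counts
}

// l7Distribution picks a backend per request, round-robin, regardless of
// which connection carried the request.
func l7Distribution(backends, conns, reqsPerConn int) []int {
	counts := make([]int, backends)
	for i := 0; i < conns*reqsPerConn; i++ {
		counts[i%backends]++
	}
	return counts
}

func main() {
	// 3 backends, 2 long-lived worker connections, 300 requests each.
	fmt.Println(l4Distribution(3, 2, 300)) // [300 300 0]
	fmt.Println(l7Distribution(3, 2, 300)) // [200 200 200]
}
```

With fewer connections than backends, the L4 case leaves a backend idle no matter how much traffic flows.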

Browser support

REST: native fetch / XHR.

gRPC: no. Browser APIs (fetch, XHR) do not expose the HTTP/2 framing and trailers that gRPC depends on. Options:

  • gRPC-Web: a proxy translates a browser-friendly protocol to gRPC (for example, an Envoy filter).
  • Connect-Web: like gRPC-Web but more DX-friendly. https://connectrpc.com/

For web apps (Tasklist, Operate): REST directly, no proxy needed.

DX tooling

REST

curl https://wf.example.com/api/v1/instances/123 \
    -H "Authorization: Bearer $TOKEN"

Postman, Insomnia, Bruno: GUI clients.

OpenAPI spec → SwaggerUI → interactive docs.

gRPC

grpcurl -d '{"id": 123}' wf.example.com:9090 workflow.v1.WorkflowEngine/GetInstance

BloomRPC, Kreya: GUI clients (less mature than REST).

buf (https://buf.build/) for proto management — modern alternative to protoc.

CI / codegen

REST:

# Optional: codegen from OpenAPI
oapi-codegen -package wfclient spec/openapi.yaml > wfclient/client.go

gRPC:

# Mandatory: codegen from .proto
buf generate
# Generates: client + server + types in Go, TypeScript, Java, etc.

gRPC tooling is more mature for multi-language use: a single .proto reliably generates code for 10+ languages. REST with OpenAPI has variable quality across generators.

Versioning

REST: path-based

/api/v1/instances
/api/v2/instances

gRPC: package-based

workflow.v1.WorkflowEngine
workflow.v2.WorkflowEngine

Same concept; both require discipline (additive changes stay non-breaking; breaking changes require a new MAJOR version).
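
On the REST side, path-based versioning can be as simple as registering each version under its own prefix so that v1 keeps working while v2 evolves. A minimal sketch with the standard library router; the handler bodies and the `get` helper are placeholders for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux mounts both API versions side by side on one router.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/instances", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "v1")
	})
	mux.HandleFunc("/api/v2/instances", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "v2")
	})
	return mux
}

// get issues an in-process GET and returns the status and body.
func get(path string) (int, string) {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	fmt.Println(get("/api/v1/instances"))
	fmt.Println(get("/api/v2/instances"))
}
```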

Industry adoption

| Product | API | Notes |
|---|---|---|
| Temporal | gRPC | Worker-heavy, choice OK |
| Zeebe (Camunda 8) | gRPC | Worker-heavy |
| Conductor (Netflix) | REST | Optimized for adoption |
| Argo Workflows | REST + Kubernetes API | k8s-native |
| AWS Step Functions | REST (AWS API) | Standard AWS pattern |
| Stripe / Twilio | REST + Webhooks | DX-optimized |

REST dominates "general purpose APIs". gRPC dominates "high-throughput service-to-service".

Consolidated decision

M1: REST only. Covers 95% of cases. Workers go through REST long-polling with HTTP/2 streaming (an alternative to SSE).

M2: Consider gRPC as a complement if:

  • There is demand from high-throughput clients.
  • Workers on the hot path show latency issues over REST.
  • Internal service mesh adoption justifies it.

No gRPC in M1 because:

  • It adds complexity (proto, codegen, tooling).
  • Adoption friction (most teams know REST).
  • HTTP/2 streaming over REST covers 80% of the benefit.

If gRPC is added: keep REST as primary, gRPC optional, same backend logic.

Performance benchmark goals

To validate the decision, run a baseline test:

Scenario: 1000 workers fetching jobs at 100 jobs/sec total.

REST/long-poll:
  Engine CPU: ~3%
  Network: ~5 MB/s
  Latency p99 job-pickup: <100ms

gRPC streaming:
  Engine CPU: ~1%
  Network: ~1.5 MB/s
  Latency p99 job-pickup: <30ms

If REST meets the SLO → no urgency for gRPC.
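
Checking the SLO comes down to computing p99 over the measured job-pickup latencies and comparing it with the target. The sketch below uses the nearest-rank percentile definition, which is one of several in common use; the function names and the synthetic samples are assumptions for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank percentile pct (1..100) of samples.
// It returns 0 for an empty input.
func percentile(samples []float64, pct int) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...) // do not reorder the caller's slice
	sort.Float64s(s)
	rank := (pct*len(s) + 99) / 100 // integer ceil(pct/100 * n)
	if rank < 1 {
		rank = 1
	}
	return s[rank-1]
}

// meetsSLO reports whether p99 latency is strictly below the target.
func meetsSLO(samplesMs []float64, sloMs float64) bool {
	return percentile(samplesMs, 99) < sloMs
}

func main() {
	// Synthetic latencies 1..100 ms, for illustration only.
	samples := make([]float64, 100)
	for i := range samples {
		samples[i] = float64(i + 1)
	}
	fmt.Println(percentile(samples, 99)) // 99
	fmt.Println(meetsSLO(samples, 100))  // true
}
```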

What will NOT change between REST and gRPC

  • Business semantics (same operations, same errors).
  • Persistence layer (Postgres).
  • Engine internals.
  • Audit logging.
  • Authentication providers (OIDC).

Only the transport layer differs.

Roadmap

  • M1: REST only.
  • M2: gRPC optional for workers (REST compatibility maintained).
  • M3: gRPC for engine-to-engine internal communication.
  • M4: Connect protocol (https://connectrpc.com/) considered for better DX.

References