gRPC vs REST — trade-offs¶
Protocol analysis for the engine API. REST is the default for HTTP/JSON; gRPC is optional for high-throughput worker streaming. The two are not mutually exclusive: both coexist on the same backend logic.
Context¶
Camunda Zeebe uses gRPC. Consequences of that decision:
- ✅ Efficient bidirectional streaming for workers.
- ✅ Fast binary protocol.
- ❌ Adoption friction (firewalls, gateway proxies, dev tools).
- ❌ Poor browser support (needs grpc-web).
Our initial decision (analysis/rest-api-design): REST as the default, because it is better for adoption and DX. Consider gRPC as a complement for worker streaming.
Technical comparison¶
| Dimension | REST/HTTP | gRPC |
|---|---|---|
| Transport | HTTP/1.1 o HTTP/2 | HTTP/2 (mandatory) |
| Serialization | JSON (text) | Protobuf (binary) |
| Schema | OpenAPI (optional) | .proto (mandatory) |
| Browser support | Native | Requires grpc-web proxy |
| Curl-able | Yes | No (needs grpcurl) |
| Streaming | SSE / WebSocket | Native bidi |
| Bandwidth | Higher (~2-5×) | Lower |
| CPU (serialization) | Medium | Low |
| Code generation | Optional (OpenAPI codegen) | Mandatory (protoc) |
| Tooling ecosystem | Massive | Smaller |
| Adoption friction | Low | Medium |
| Debugging | Trivial | Tools required |
Typical benchmarks¶
GET single instance (1KB payload):
REST/JSON: p50 3.2ms p99 18ms
gRPC/proto: p50 2.5ms p99 12ms
(~25% faster gRPC, but both <20ms)
Batch fetch 1000 instances (1MB payload):
REST/JSON: p50 35ms bandwidth 1.2 MB
gRPC/proto: p50 12ms bandwidth 350 KB
(gRPC 3× faster, 3× less bandwidth)
Streaming jobs (10 jobs/sec sustained):
REST/long-poll: CPU 5% latency p99 80ms
gRPC streaming: CPU 2% latency p99 25ms
(gRPC clearly better for high-throughput streaming)
When to use which¶
REST: defaults¶
- ✅ Browser → engine (web UI, no proxy needed).
- ✅ Curl / Postman / dev tools.
- ✅ Single-shot operations (deploy, start, list).
- ✅ Webhooks (callbacks from external systems).
- ✅ Public-facing APIs.
gRPC: optional, performance-critical¶
- ✅ Worker → engine streaming (high-throughput jobs).
- ✅ Internal service-mesh communication.
- ✅ Engine ↔ engine (multi-node cluster).
- ✅ Mobile clients (bandwidth-sensitive).
Design: both coexist¶
flowchart LR
REST["REST API<br/>/api/v1"]
GRPC["gRPC API<br/>:9090"]
subgraph EngineCore[Engine core]
HTTP[HTTP handlers]
GH[gRPC handlers]
Engine[Engine]
end
PG[(Postgres)]
REST --> HTTP --> Engine
GRPC --> GH --> Engine
Engine --> PG
Same business logic, two surfaces. Do not duplicate code:
// internal/api/handlers.go
// StartInstance holds the shared business logic used by both transports.
func StartInstance(ctx context.Context, req StartInstanceRequest) (*Instance, error) {
	// Business logic
}
// internal/api/rest/handler.go
func RestStartInstance(w http.ResponseWriter, r *http.Request) {
	req := decodeJSON(r)
	resp, err := api.StartInstance(r.Context(), req)
	if err != nil {
		writeError(w, err) // hypothetical helper: maps err to a problem+json response
		return
	}
	writeJSON(w, resp)
}
// internal/api/grpc/handler.go
func (s *Server) StartInstance(ctx context.Context, req *pb.StartInstanceRequest) (*pb.Instance, error) {
	resp, err := api.StartInstance(ctx, convertFromProto(req))
	if err != nil {
		return nil, err // do not convert a nil response
	}
	return convertToProto(resp), nil
}
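The layout above can be sketched as a runnable program using only the standard library. This is a minimal illustration under stated assumptions: the gRPC side is represented by a plain function over proto-shaped structs (`pbStartRequest` and `pbInstance` are hypothetical stand-ins, not generated code), and the names `startInstance`, `callREST`, and `callGRPC` do not come from the original design.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Transport-neutral types owned by the core.
type StartInstanceRequest struct {
	ProcessID string `json:"processId"`
}

type Instance struct {
	Key       int64  `json:"key"`
	ProcessID string `json:"processId"`
}

var errInvalid = errors.New("processId is required")

// startInstance is the single place where business rules live.
func startInstance(ctx context.Context, req StartInstanceRequest) (*Instance, error) {
	if req.ProcessID == "" {
		return nil, errInvalid
	}
	return &Instance{Key: 1, ProcessID: req.ProcessID}, nil
}

// restStartInstance adapts HTTP/JSON to the core call.
func restStartInstance(w http.ResponseWriter, r *http.Request) {
	var req StartInstanceRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "malformed JSON", http.StatusBadRequest)
		return
	}
	inst, err := startInstance(r.Context(), req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(inst)
}

// Hypothetical stand-ins for protoc-generated messages.
type pbStartRequest struct{ ProcessId string }
type pbInstance struct {
	Key       int64
	ProcessId string
}

// grpcStartInstance adapts proto-shaped types to the same core call.
func grpcStartInstance(ctx context.Context, req *pbStartRequest) (*pbInstance, error) {
	inst, err := startInstance(ctx, StartInstanceRequest{ProcessID: req.ProcessId})
	if err != nil {
		return nil, err
	}
	return &pbInstance{Key: inst.Key, ProcessId: inst.ProcessID}, nil
}

// callREST drives the REST adapter in-process and returns status and body.
func callREST(body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/v1/instances", strings.NewReader(body))
	restStartInstance(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

// callGRPC drives the gRPC-shaped adapter with a background context.
func callGRPC(processID string) (*pbInstance, error) {
	return grpcStartInstance(context.Background(), &pbStartRequest{ProcessId: processID})
}

func main() {
	code, body := callREST(`{"processId":"order"}`)
	fmt.Println(code, body)
	inst, err := callGRPC("order")
	fmt.Println(inst.Key, inst.ProcessId, err)
}
```

Both adapters only translate between wire types and the core types, so validation rules such as the required `processId` are written once.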
gRPC API design (if implemented)¶
Service definition (.proto)¶
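syntax = "proto3";
package workflow.v1;
// Needed for google.protobuf.Empty and google.protobuf.Value used below.
import "google/protobuf/empty.proto";
import "google/protobuf/struct.proto";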
service WorkflowEngine {
// Process management
rpc DeployProcess(DeployProcessRequest) returns (DeployProcessResponse);
rpc StartProcessInstance(StartProcessInstanceRequest) returns (ProcessInstance);
rpc CancelInstance(CancelInstanceRequest) returns (google.protobuf.Empty);
// Job worker
rpc ActivateJobs(ActivateJobsRequest) returns (stream ActivatedJob); // streaming
rpc CompleteJob(CompleteJobRequest) returns (google.protobuf.Empty);
rpc FailJob(FailJobRequest) returns (google.protobuf.Empty);
rpc ThrowError(ThrowErrorRequest) returns (google.protobuf.Empty);
// Variables
rpc SetVariables(SetVariablesRequest) returns (google.protobuf.Empty);
// Message
rpc PublishMessage(PublishMessageRequest) returns (PublishMessageResponse);
}
message ActivateJobsRequest {
string type = 1;
string worker = 2;
int32 max_jobs = 3;
int64 timeout_ms = 4;
repeated string fetch_variables = 5;
string tenant_id = 6;
}
message ActivatedJob {
int64 key = 1;
string type = 2;
int64 process_instance_key = 3;
string element_id = 4;
map<string, string> headers = 5;
map<string, google.protobuf.Value> variables = 6;
int32 retries = 7;
int64 deadline_ms = 8;
string tenant_id = 9;
}
Streaming workers¶
// Worker side
stream, err := client.ActivateJobs(ctx, &pb.ActivateJobsRequest{
	Type:    "charge-payment",
	Worker:  "worker-1",
	MaxJobs: 32,
})
if err != nil {
	return err // could not open the stream
}
for {
	job, err := stream.Recv()
	if err == io.EOF {
		break // server closed the stream
	}
	if err != nil {
		return err // handle or reconnect; never fall through with a nil job
	}
	go handleJob(job) // process concurrently
}
The engine pushes jobs as they become available. The worker receives them in order and processes them in parallel (goroutines).
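Spawning one goroutine per received job leaves concurrency unbounded. The following sketch shows one way to keep the "receive in order, process in parallel" behaviour while capping in-flight handlers. It is an illustration only: a Go channel stands in for the gRPC stream, and `Job`, `consume`, and `runDemo` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a simplified stand-in for the ActivatedJob message.
type Job struct {
	Key  int64
	Type string
}

// consume reads jobs in arrival order and runs handle for each one in its own
// goroutine, never running more than maxInFlight handlers at once. It returns
// the number of jobs handled once the source is closed and all handlers finish.
func consume(jobs <-chan Job, maxInFlight int, handle func(Job)) int {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var wg sync.WaitGroup
	count := 0
	for job := range jobs { // a closed channel plays the role of io.EOF
		sem <- struct{}{} // blocks while maxInFlight handlers are running
		wg.Add(1)
		count++
		go func(j Job) {
			defer wg.Done()
			defer func() { <-sem }()
			handle(j)
		}(job)
	}
	wg.Wait()
	return count
}

// runDemo pushes n jobs through consume and reports how many were handled
// and the highest number of handlers observed running at the same time.
func runDemo(n, maxInFlight int) (handled, peak int) {
	jobs := make(chan Job)
	go func() {
		for i := 0; i < n; i++ {
			jobs <- Job{Key: int64(i), Type: "charge-payment"}
		}
		close(jobs)
	}()
	var mu sync.Mutex
	running := 0
	handled = consume(jobs, maxInFlight, func(Job) {
		mu.Lock()
		running++
		if running > peak {
			peak = running
		}
		mu.Unlock()
		mu.Lock()
		running--
		mu.Unlock()
	})
	return handled, peak
}

func main() {
	handled, peak := runDemo(100, 8)
	fmt.Println("handled:", handled, "peak concurrency:", peak)
}
```

Setting the cap to the `MaxJobs` value sent in the request would keep the worker from accepting more work than it asked for.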
Comparison with REST long-polling:
REST:
Client → GET /api/v1/jobs?type=charge-payment (long poll)
Server holds the connection for up to 30s and returns jobs as JSON when available
Connection closes, client re-opens.
gRPC:
Client → ActivateJobs (stream)
Server keeps the stream open (server-streaming, per the .proto above)
Multiple jobs are sent over the same stream
Server can also push keepalives on the stream; worker → server capacity updates would need a bidirectional variant (stream on the request as well)
gRPC wins in clusters with thousands of workers and high throughput.
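The REST half of this comparison can be sketched with the standard library. The example is a simplified illustration, not the engine's implementation: the handler returns one job per poll, the hold time is shortened from 30s to 50ms, and requests are driven in-process through `httptest` instead of a real socket. `longPollHandler`, `pollOnce`, and `runDemo` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type Job struct {
	Key  int64  `json:"key"`
	Type string `json:"type"`
}

// longPollHandler holds each request open until a job arrives on queue or
// hold elapses. It answers 200 with one job, or 204 when the hold times out.
func longPollHandler(queue <-chan Job, hold time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		timer := time.NewTimer(hold)
		defer timer.Stop()
		select {
		case job := <-queue:
			w.Header().Set("Content-Type", "application/json")
			json.NewEncoder(w).Encode(job)
		case <-timer.C:
			w.WriteHeader(http.StatusNoContent)
		case <-r.Context().Done():
			// Client went away; nothing to write.
		}
	}
}

// pollOnce performs one long-poll round trip against h.
// ok is false when the server timed out without a job.
func pollOnce(h http.Handler, target string) (job Job, ok bool, err error) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	switch rec.Code {
	case http.StatusNoContent:
		return job, false, nil
	case http.StatusOK:
		err = json.NewDecoder(rec.Body).Decode(&job)
		return job, err == nil, err
	default:
		return job, false, fmt.Errorf("unexpected status %d", rec.Code)
	}
}

// runDemo enqueues n jobs, then polls until the server reports an empty hold.
func runDemo(n int) ([]Job, error) {
	queue := make(chan Job, n)
	for i := 0; i < n; i++ {
		queue <- Job{Key: int64(i + 1), Type: "charge-payment"}
	}
	h := longPollHandler(queue, 50*time.Millisecond)

	var got []Job
	for {
		job, ok, err := pollOnce(h, "/api/v1/jobs?type=charge-payment")
		if err != nil {
			return got, err
		}
		if !ok {
			return got, nil // empty poll: a real worker would simply re-open
		}
		got = append(got, job)
	}
}

func main() {
	jobs, err := runDemo(3)
	fmt.Println(len(jobs), err)
}
```

Each job costs a full request/response cycle here, which is the overhead the streaming variant avoids.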
Authentication headers¶
REST¶
GET /api/v1/instances/123 HTTP/1.1
Authorization: Bearer eyJ...
X-Tenant-ID: acme
X-Trace-ID: abc-123
gRPC¶
Convention: lowercase keys in gRPC metadata.
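As an illustration of the convention, the sketch below carries the REST headers from the example above into a metadata-shaped map with lowercase keys. It avoids the gRPC dependency by declaring a local `map[string][]string`, which is the same underlying shape as grpc-go's `metadata.MD`; `toMetadata` and `demoMetadata` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// toMetadata copies the named HTTP headers into a gRPC-metadata-shaped map,
// lowercasing each key. Headers that are absent are skipped.
func toMetadata(h http.Header, names ...string) map[string][]string {
	md := make(map[string][]string, len(names))
	for _, name := range names {
		if vals := h.Values(name); len(vals) > 0 {
			md[strings.ToLower(name)] = append([]string(nil), vals...)
		}
	}
	return md
}

// demoMetadata builds metadata from the REST example headers.
func demoMetadata() map[string][]string {
	h := http.Header{}
	h.Set("Authorization", "Bearer eyJ...")
	h.Set("X-Tenant-ID", "acme")
	h.Set("X-Trace-ID", "abc-123")
	return toMetadata(h, "Authorization", "X-Tenant-ID", "X-Trace-ID")
}

func main() {
	md := demoMetadata()
	fmt.Println(md["authorization"], md["x-tenant-id"], md["x-trace-id"])
}
```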
Error handling¶
REST: HTTP status + RFC 7807¶
HTTP/1.1 404 Not Found
Content-Type: application/problem+json
{
"type": "https://docs.example.com/errors/instance-not-found",
"title": "Instance not found",
"status": 404,
"detail": "Instance 12345 not found in tenant acme",
"instance": "/api/v1/instances/12345"
}
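A minimal sketch of how a handler could emit the response above with the standard library. The `Problem` struct and the `writeProblem`, `instanceNotFound`, and `render` helpers are hypothetical names chosen for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Problem carries the five standard RFC 7807 members.
type Problem struct {
	Type     string `json:"type"`
	Title    string `json:"title"`
	Status   int    `json:"status"`
	Detail   string `json:"detail,omitempty"`
	Instance string `json:"instance,omitempty"`
}

// writeProblem sends p with the problem+json media type and p.Status.
func writeProblem(w http.ResponseWriter, p Problem) {
	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(p.Status)
	json.NewEncoder(w).Encode(p)
}

// instanceNotFound reproduces the 404 example for a given tenant and id.
func instanceNotFound(w http.ResponseWriter, tenant string, id int64) {
	writeProblem(w, Problem{
		Type:     "https://docs.example.com/errors/instance-not-found",
		Title:    "Instance not found",
		Status:   http.StatusNotFound,
		Detail:   fmt.Sprintf("Instance %d not found in tenant %s", id, tenant),
		Instance: fmt.Sprintf("/api/v1/instances/%d", id),
	})
}

// render runs the handler in-process and decodes what a client would see.
func render(tenant string, id int64) (status int, contentType string, p Problem, err error) {
	rec := httptest.NewRecorder()
	instanceNotFound(rec, tenant, id)
	err = json.Unmarshal(rec.Body.Bytes(), &p)
	return rec.Code, rec.Header().Get("Content-Type"), p, err
}

func main() {
	status, ct, p, err := render("acme", 12345)
	fmt.Println(status, ct, p.Detail, err)
}
```

Setting the header before `WriteHeader` matters: headers changed after the status line is written are ignored.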
gRPC: status codes¶
import "google.golang.org/grpc/codes"
import "google.golang.org/grpc/status"
return nil, status.Errorf(codes.NotFound, "instance %d not found", id)
Status code mapping:
| REST | gRPC | Use |
|---|---|---|
| 200 | OK | Success |
| 400 | InvalidArgument | Bad input |
| 401 | Unauthenticated | No auth |
| 403 | PermissionDenied | Authorized but not allowed |
| 404 | NotFound | Resource missing |
| 409 | AlreadyExists / FailedPrecondition | Conflict |
| 429 | ResourceExhausted | Rate limit |
| 500 | Internal | Bug |
| 503 | Unavailable | Backpressure / down |
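The table can be expressed as a lookup from gRPC code to HTTP status, which is the direction that is unambiguous (two gRPC codes share 409). The sketch follows this document's table, not the defaults of any particular gateway, and uses a local string type rather than the `codes` package so that it runs without the gRPC dependency. `Code` and `httpStatus` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
)

// Code names the gRPC status codes used in the mapping table.
type Code string

const (
	OK                 Code = "OK"
	InvalidArgument    Code = "InvalidArgument"
	Unauthenticated    Code = "Unauthenticated"
	PermissionDenied   Code = "PermissionDenied"
	NotFound           Code = "NotFound"
	AlreadyExists      Code = "AlreadyExists"
	FailedPrecondition Code = "FailedPrecondition"
	ResourceExhausted  Code = "ResourceExhausted"
	Internal           Code = "Internal"
	Unavailable        Code = "Unavailable"
)

// httpStatus returns the REST status for a gRPC code.
// Codes outside the table fall back to 500.
func httpStatus(c Code) int {
	switch c {
	case OK:
		return http.StatusOK
	case InvalidArgument:
		return http.StatusBadRequest
	case Unauthenticated:
		return http.StatusUnauthorized
	case PermissionDenied:
		return http.StatusForbidden
	case NotFound:
		return http.StatusNotFound
	case AlreadyExists, FailedPrecondition:
		return http.StatusConflict
	case ResourceExhausted:
		return http.StatusTooManyRequests
	case Unavailable:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	for _, c := range []Code{NotFound, AlreadyExists, FailedPrecondition, ResourceExhausted} {
		fmt.Println(c, httpStatus(c))
	}
}
```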
Error details¶
gRPC allows structured error details via google.rpc.Status:
message ErrorInfo {
string reason = 1; // "TENANT_QUOTA_EXCEEDED"
string domain = 2; // "workflow.example.com"
map<string, string> metadata = 3;
}
The client can inspect the details:
if st, ok := status.FromError(err); ok {
	for _, detail := range st.Details() {
		if info, ok := detail.(*ErrorInfo); ok {
			if info.Reason == "TENANT_QUOTA_EXCEEDED" {
				// handle
			}
		}
	}
}
REST equivalent: extension fields in problem+json.
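On the REST side, a client could read such extension fields as sketched below. The member names `reason`, `domain`, and `metadata` are an assumption that mirrors `ErrorInfo`; the original does not fix them, and the sample body (including its `type` URL) is invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProblemWithInfo is a problem+json document with assumed extension members.
type ProblemWithInfo struct {
	Type   string `json:"type"`
	Title  string `json:"title"`
	Status int    `json:"status"`
	Detail string `json:"detail"`
	// Extension members; names assumed to mirror ErrorInfo.
	Reason   string            `json:"reason"`
	Domain   string            `json:"domain"`
	Metadata map[string]string `json:"metadata"`
}

// sampleBody is a hypothetical 429 response used only for this example.
const sampleBody = `{
  "type": "https://docs.example.com/errors/quota-exceeded",
  "title": "Tenant quota exceeded",
  "status": 429,
  "detail": "Tenant acme exceeded its instance quota",
  "reason": "TENANT_QUOTA_EXCEEDED",
  "domain": "workflow.example.com",
  "metadata": {"tenant": "acme"}
}`

// parseProblem decodes a problem+json body including the extension members.
func parseProblem(body []byte) (ProblemWithInfo, error) {
	var p ProblemWithInfo
	err := json.Unmarshal(body, &p)
	return p, err
}

// isQuotaExceeded reports whether body carries the quota reason.
func isQuotaExceeded(body []byte) bool {
	p, err := parseProblem(body)
	return err == nil && p.Reason == "TENANT_QUOTA_EXCEEDED"
}

func main() {
	p, err := parseProblem([]byte(sampleBody))
	fmt.Println(p.Reason, p.Metadata["tenant"], err)
	fmt.Println(isQuotaExceeded([]byte(sampleBody)))
}
```

Clients branch on the machine-readable `reason` in both transports, so the same reason strings can be shared between the gRPC details and the REST extension fields.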
TLS and mTLS¶
REST: TLS 1.3 standard.
gRPC: TLS 1.3 + optional mTLS (mutual TLS).
mTLS use case: workers authenticate with a client certificate and the engine validates it. No bearer tokens needed.
// Server
creds := credentials.NewTLS(&tls.Config{
	ClientAuth: tls.RequireAndVerifyClientCert, // reject workers without a valid cert
	ClientCAs:  caPool,
})
grpcServer := grpc.NewServer(grpc.Creds(creds))
// Worker
creds := credentials.NewTLS(&tls.Config{
	Certificates: []tls.Certificate{clientCert},
	RootCAs:      serverCAPool,
})
conn, err := grpc.Dial(":9090", grpc.WithTransportCredentials(creds))
if err != nil {
	log.Fatalf("dial engine: %v", err)
}
defer conn.Close()
A good fit for zero-trust networks (service mesh with SPIFFE/SPIRE).
Load balancing¶
REST: stateless, simple LB (HAProxy, nginx, ALB).
gRPC: one HTTP/2 connection multiplexes many streams, so the LB needs to be HTTP/2-aware:
- L4 LB (TCP): connections are sticky, so there is no per-request balancing.
- L7 LB (Envoy, Linkerd): balances streams across backend connections.
For gRPC, prefer Envoy / Linkerd / Istio (service mesh).
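The difference between the two balancing levels can be shown with a toy simulation. It is a deliberately simplified model: both strategies use plain round robin, and `perConnection` and `perRequest` are hypothetical names, not part of any load balancer's API.

```go
package main

import "fmt"

// perConnection models an L4 balancer: each connection is assigned to a
// backend once, and every request on that connection lands there.
func perConnection(backends int, requestsPerConn []int) []int {
	load := make([]int, backends)
	for conn, n := range requestsPerConn {
		load[conn%backends] += n
	}
	return load
}

// perRequest models an L7 balancer: each request (stream) is assigned to a
// backend independently of the connection it arrived on.
func perRequest(backends int, requestsPerConn []int) []int {
	load := make([]int, backends)
	next := 0
	for _, n := range requestsPerConn {
		for i := 0; i < n; i++ {
			load[next%backends]++
			next++
		}
	}
	return load
}

func main() {
	// One busy worker connection and two quiet ones, three backends.
	conns := []int{1000, 10, 10}
	fmt.Println(perConnection(3, conns)) // [1000 10 10]
	fmt.Println(perRequest(3, conns))    // [340 340 340]
}
```

With long-lived gRPC connections, a single busy worker can pin almost all load to one engine node under L4 balancing, which is why a stream-aware proxy is recommended.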
Browser support¶
REST: native fetch / XHR.
gRPC: no. Browser APIs (fetch / XHR) do not expose the HTTP/2 trailers that gRPC relies on. Options:
- gRPC-Web: a proxy translates a browser-friendly protocol to gRPC (e.g. an Envoy filter).
- Connect-Web: like gRPC-Web but more DX-friendly. https://connectrpc.com/
For webapps (Tasklist, Operate): REST directly, no proxy needed.
DX tooling¶
REST¶
Postman, Insomnia, Bruno: GUI clients.
OpenAPI spec → SwaggerUI → interactive docs.
gRPC¶
BloomRPC, Kreya: GUI clients (less mature than REST).
buf (https://buf.build/) for proto management — modern alternative to protoc.
CI / codegen¶
REST:
# Optional: codegen from OpenAPI
oapi-codegen -package wfclient spec/openapi.yaml > wfclient/client.go
gRPC:
# Mandatory: codegen from .proto
buf generate
# Generates: client + server + types in Go, TypeScript, Java, etc.
gRPC tooling is more mature for multi-language use: one .proto reliably generates code for 10+ languages. REST with OpenAPI has variable quality across generators.
Versioning¶
REST: path-based¶
gRPC: package-based¶
Same concept; both require discipline (additive changes stay non-breaking, and breaking changes get a new MAJOR version).
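To make the two schemes concrete, the sketch below extracts the major version from a REST path and from a gRPC full method name. It assumes the `/api/v1` prefix and the `workflow.v1` package used elsewhere in this document, and relies on the gRPC convention that a full method name has the form `/package.Service/Method`. `restVersion`, `grpcVersion`, and `isVersion` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// isVersion reports whether s looks like "v" followed by digits, e.g. "v1".
func isVersion(s string) bool {
	if len(s) < 2 || s[0] != 'v' {
		return false
	}
	for _, r := range s[1:] {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// restVersion extracts "v1" from a path such as "/api/v1/instances/123".
func restVersion(path string) (string, bool) {
	parts := strings.Split(strings.TrimPrefix(path, "/"), "/")
	if len(parts) >= 2 && parts[0] == "api" && isVersion(parts[1]) {
		return parts[1], true
	}
	return "", false
}

// grpcVersion extracts "v1" from a full method name such as
// "/workflow.v1.WorkflowEngine/StartProcessInstance". The version is taken
// to be the last segment of the proto package, just before the service name.
func grpcVersion(fullMethod string) (string, bool) {
	service, _, found := strings.Cut(strings.TrimPrefix(fullMethod, "/"), "/")
	if !found {
		return "", false
	}
	segs := strings.Split(service, ".")
	if len(segs) < 2 {
		return "", false
	}
	if v := segs[len(segs)-2]; isVersion(v) {
		return v, true
	}
	return "", false
}

func main() {
	fmt.Println(restVersion("/api/v1/instances/123"))
	fmt.Println(grpcVersion("/workflow.v1.WorkflowEngine/StartProcessInstance"))
}
```

In both cases a breaking change means a new identifier (`/api/v2`, `workflow.v2`) served alongside the old one.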
Industry adoption¶
| Product | API | Notes |
|---|---|---|
| Temporal | gRPC | Worker-heavy, choice OK |
| Zeebe (Camunda 8) | gRPC | Worker-heavy |
| Conductor (Netflix) | REST | Optimized for adoption |
| Argo Workflows | REST + Kubernetes API | k8s-native |
| AWS Step Functions | REST (AWS API) | Standard AWS pattern |
| Stripe / Twilio | REST + Webhooks | DX-optimized |
REST dominates "general purpose APIs". gRPC dominates "high-throughput service-to-service".
Consolidated decision¶
M1: REST only. Covers 95% of cases. Workers use REST long-polling with HTTP/2 streaming (as an alternative to SSE).
M2: Consider gRPC as a complement if:
- There is demand from high-throughput clients.
- Workers on the hot path show latency issues over REST.
- Internal service mesh adoption justifies it.
No gRPC in M1 because:
- It adds complexity (proto, codegen, tooling).
- It adds adoption friction (most teams know REST).
- HTTP/2 streaming over REST covers 80% of the benefit.
If gRPC is added: keep REST as primary, gRPC optional, same backend logic.
Performance benchmark goals¶
To validate the decision, run a baseline test:
Scenario: 1000 workers fetching jobs at 100 jobs/sec total.
REST/long-poll:
Engine CPU: ~3%
Network: ~5 MB/s
Latency p99 job-pickup: <100ms
gRPC streaming:
Engine CPU: ~1%
Network: ~1.5 MB/s
Latency p99 job-pickup: <30ms
If REST meets the SLO → no urgency for gRPC.
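Checking the SLO comes down to computing p99 over measured job-pickup latencies and comparing it with the target. The sketch below uses the nearest-rank percentile definition, which is one common choice and an assumption here; the sample data in `main` is synthetic, and `percentile` and `meetsSLO` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of
// latency samples given in milliseconds. It returns 0 for an empty slice.
func percentile(samplesMs []int, p int) int {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]int(nil), samplesMs...)
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer form of ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsSLO reports whether p99 is strictly below limitMs.
func meetsSLO(samplesMs []int, limitMs int) bool {
	return len(samplesMs) > 0 && percentile(samplesMs, 99) < limitMs
}

func main() {
	// Synthetic samples: 1, 2, ..., 100 ms.
	samples := make([]int, 100)
	for i := range samples {
		samples[i] = i + 1
	}
	fmt.Println(percentile(samples, 99)) // 99
	fmt.Println(meetsSLO(samples, 100))  // true: meets the <100ms REST target
	fmt.Println(meetsSLO(samples, 30))   // false: would miss the <30ms gRPC target
}
```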
What will NOT change between REST and gRPC¶
- Business semantics (same operations, same errors).
- Persistence layer (Postgres).
- Engine internals.
- Audit logging.
- Authentication providers (OIDC).
Only the transport layer changes.
Roadmap¶
- M1: REST only.
- M2: optional gRPC for workers (REST compatibility maintained).
- M3: gRPC for engine-to-engine internal communication.
- M4: Connect protocol (https://connectrpc.com/) considered for better DX.
References¶
- analysis/rest-api-design: complete REST design
- analysis/worker-sdk-go-design: worker SDK
- analysis/api-versioning-strategy: versioning
- concepts/grpc-api: gRPC concept page
- gRPC vs REST: Performance comparison
- HTTP/2 streaming (RFC 7540)
- Connect protocol