Enterprise Architecture and Microservices FAQ

Question 1

How does OKAXI design systems that carry heavy load?

Accepted Answer

OKAXI runs an event-driven architecture with Kafka as the backbone. Every interaction between Microservices goes through an event topic rather than a synchronous REST call, so the receiving service processes events at its own capacity and is not forced to keep pace with the producer. On the OKAXI retail engagement, sustained throughput reaches 5000 events per second without degradation, and peak bursts hit 12000 events per second with zero event loss.

Question 2

What is the core role of Kafka in asynchronous processing?

Accepted Answer

Kafka plays five roles in the OKAXI stack. First, it is a durable buffer that persists events to disk before acknowledging the producer. Second, it is a partitioned log for parallel consumers, with each partition read by exactly one consumer instance within a consumer group. Third, it is a replay log, so a consumer can seek back to an older offset and reprocess events. Fourth, it provides fanout, delivering the same event to multiple consumer groups independently. Fifth, it decouples producer speed from consumer speed: a slow consumer accumulates lag on the broker instead of throttling the producer, which is the natural backpressure referred to elsewhere in this FAQ.
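
The partitioning, replay, and fanout roles can be illustrated with a small in-memory model. This is only a sketch and not Kafka client code: the PartitionedLog class and its method names are hypothetical, and CRC32 stands in for the key hash (Kafka's own partitioner uses a different hash function).

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy stand-in for a Kafka topic: append-only partitions plus per-group offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] -> next offset that group will read
        self.offsets = defaultdict(lambda: defaultdict(int))

    def append(self, key, value):
        """Route by key hash so one key always lands in the same partition."""
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def poll(self, group, partition):
        """Return unread events for this group and advance only its own offset."""
        start = self.offsets[group][partition]
        events = self.partitions[partition][start:]
        self.offsets[group][partition] = len(self.partitions[partition])
        return events

    def seek(self, group, partition, offset):
        """Rewind a group to an older offset so it can reprocess events."""
        self.offsets[group][partition] = offset
```

Because each consumer group keeps its own offsets, two groups polling the same partition both receive every event (fanout), and a seek by one group does not disturb the other (replay).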

Question 3

How does OKAXI avoid bottlenecks during a traffic burst?

Accepted Answer

There are four main mechanisms. First, Kafka topic partitioning on a natural key (customer_id, region) to spread load evenly across consumers. Second, an elastic consumer pool that auto-scales based on topic lag metrics. Third, asynchronous processing that separates the latency-critical path from the heavy computation path. Fourth, circuit breakers on downstream API calls so the system fails fast when an external dependency slows.
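
The fourth mechanism, the circuit breaker, can be sketched as follows. This is a minimal illustration under assumptions: the class name, the failure threshold of 3, and the 30-second reset timeout are hypothetical values and not OKAXI settings.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on a slow dependency, which is the fail-fast behaviour described above.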

Question 4

Can OKAXI use a different message broker than Kafka?

Accepted Answer

Yes. OKAXI carries integration templates for RabbitMQ, NATS, AWS SQS, and Redis Streams. The choice depends on the concrete requirement. Kafka fits high throughput with long retention. RabbitMQ fits complex routing and priority queues. NATS fits ultra-low latency. SQS fits serverless workloads on AWS. Redis Streams fits a lightweight queue colocated with the cache tier.

Question 5

How does backpressure work in the OKAXI event-driven architecture?

Accepted Answer

Backpressure is handled naturally through the Kafka offset model. Producers publish events fast, consumers process at their own pace, and lag accumulates on the broker. When lag exceeds a configured threshold, an alert fires and auto-scaling spins up new consumer instances. The producer is only throttled when Kafka disk fills up or partition count hits the limit. A real-time dashboard shows consumer lag per topic.
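
The lag-driven scaling decision can be sketched as below. The function names and the events_per_consumer parameter (the backlog one consumer is expected to absorb) are assumptions made for illustration. The cap at the partition count reflects the fact that a consumer group cannot usefully run more consumers than partitions.

```python
import math


def consumer_lag(end_offsets, committed_offsets):
    """Total lag: sum over partitions of (log end offset - committed offset)."""
    return sum(end - committed_offsets.get(p, 0) for p, end in end_offsets.items())


def desired_consumers(lag, lag_threshold, current, events_per_consumer, partitions):
    """Scale out only when lag crosses the threshold; never exceed the partition count."""
    if lag <= lag_threshold:
        return current
    needed = math.ceil(lag / events_per_consumer)
    return min(max(needed, current), partitions)
```

For example, end offsets of 1000 and 1500 against committed offsets of 400 and 600 give a lag of 1500 events. With a threshold of 1000 and 500 events per consumer, the sketch asks for three consumers.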

Question 6

How does OKAXI guarantee that data is not lost across Microservices?

Accepted Answer

OKAXI applies the outbox pattern combined with the saga pattern. Outbox: the business transaction and the event write happen inside one local database transaction. A relay process reads the outbox table and publishes to Kafka, ensuring at-least-once delivery. Saga: a long-running workflow is split into smaller steps, and each step carries a compensating action to roll back when a later step fails. The combination prevents lost events and inconsistent business state.
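
The outbox half of the answer can be sketched with SQLite standing in for the service database. The table names, the orders.placed topic, and the publish callback are hypothetical; a real relay would hand the event to a Kafka producer.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT, total REAL);
    CREATE TABLE outbox (
        event_id TEXT PRIMARY KEY, topic TEXT, payload TEXT,
        published INTEGER DEFAULT 0
    );
""")


def place_order(customer_id, total):
    """Write the business row and its event in one local transaction."""
    order_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "customer_id": customer_id, "total": total})
    with conn:  # commits on success, rolls back both inserts on any exception
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total))
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "orders.placed", payload),
        )
    return order_id


def relay_once(publish):
    """Publish pending rows, marking each one only after publish() returns."""
    rows = conn.execute(
        "SELECT event_id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, event_id, payload)  # a crash after this line causes a re-send
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)
```

Marking the row as published only after the publish call returns is what makes delivery at-least-once: a crash between the two steps produces a duplicate rather than a lost event, so consumers need to deduplicate.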

Question 7

How do idempotency keys work in OKAXI APIs?

Accepted Answer

Every mutating endpoint accepts an Idempotency-Key header containing a UUID. The server stores the key together with its response in Redis with a 24-hour TTL. A repeat request with the same key returns the cached response rather than executing the work again. Kafka producers also tag every event with an event_id UUID, and the consumer deduplicates through a processed_events table before applying business logic. Together, the two idempotency layers give effectively exactly-once business semantics on top of at-least-once transport.
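
Both layers can be sketched in memory. The IdempotencyStore class and the consume function are hypothetical names; a dictionary stands in for Redis and a set stands in for the processed_events table.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for Redis: key -> (expires_at, cached response)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}

    def run_once(self, key, handler):
        """Return the cached response for a repeated key, else execute and cache."""
        entry = self.entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]
        response = handler()
        self.entries[key] = (self.clock() + self.ttl, response)
        return response


processed_events = set()  # stand-in for the processed_events table


def consume(event_id, apply_business_logic):
    """Skip events already seen; at-least-once delivery may redeliver them."""
    if event_id in processed_events:
        return False
    apply_business_logic()
    processed_events.add(event_id)
    return True
```

The sketch ignores concurrency. In a real service the key check and the write would need to be atomic, and the processed_events insert would normally share a database transaction with the business write so that the two cannot diverge.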

Question 8

How does OKAXI handle distributed transactions?

Accepted Answer

OKAXI does not use 2-phase commit in production. Every cross-service workflow ships as Saga choreography or Saga orchestration. Choreography: each service reacts to an event from the previous service. Orchestration: one central service drives every step and handles compensation. The pattern choice depends on workflow complexity and the required visibility.
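
The orchestration variant can be sketched as a loop over (action, compensation) pairs. The run_saga function is a hypothetical, simplified orchestrator: it runs in one process and does not persist saga state, which a production orchestrator would need in order to survive a crash.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure.

    Returns True when every step succeeds, False after compensating.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Roll back in reverse order so the most recent effect is undone first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

If a workflow reserves stock, charges a card, and then fails on a third step, the sketch refunds the charge before releasing the stock. The compensation of the failing step itself is not run, because that step never completed.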

Question 9

How does OKAXI handle network partitions and split-brain?

Accepted Answer

OKAXI applies the CAP theorem by picking AP for the event delivery layer and CP for the configuration and identity layer. The event layer uses a Kafka cluster with replication factor 3 and min.insync.replicas 2, which tolerates a transient single-broker loss. The configuration layer uses etcd or ZooKeeper with quorum writes, rejecting writes when quorum is not met. Consumer-side services detect partitions through heartbeat and fall back to a local cache.
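
The arithmetic behind the replication settings can be written out as a short sketch. It assumes producers use acks=all, which the answer above does not state; under that setting Kafka rejects a write when fewer than min.insync.replicas replicas are in sync. The function names are hypothetical.

```python
def accepts_acks_all_write(in_sync_replicas, min_insync_replicas=2):
    """With acks=all, a write succeeds only if enough replicas are in sync."""
    return in_sync_replicas >= min_insync_replicas


def tolerated_broker_losses(replication_factor=3, min_insync_replicas=2):
    """Replicas that can drop out before acks=all writes start failing."""
    return replication_factor - min_insync_replicas
```

With replication factor 3 and min.insync.replicas 2, one replica can be lost and writes continue. Losing a second one makes acks=all writes fail until a replica rejoins the in-sync set.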

Question 10

What data consistency model does OKAXI choose?

Accepted Answer

OKAXI applies eventual consistency across services and strong consistency inside each service boundary. Every service owns a local ACID database. Cross-service consistency is achieved through eventual replication and saga compensation. Clients that require strong consistency across services receive guidance to use a synchronous orchestration layer on top, with the latency cost made explicit during the architecture review.

Question 11

How are OKAXI Microservices designed to be stateless?

Accepted Answer

Services do not store state in memory or on local disk. State lives in three separate layers. First, databases (PostgreSQL, MongoDB) for persistent state. Second, a Redis cluster for sessions and cache. Third, Kafka for the event log. Service instances can scale up or be killed at any moment without data loss. Container restarts or node failures do not affect business state.

Question 12

How do session state and cache distribution work?

Accepted Answer

OKAXI uses a Redis cluster with hash slot partitioning for both the session store and the cache. Sticky sessions are avoided so any instance can serve any request. The session token is a signed JWT, and the server only stores a blacklist for revocation. Cache TTL is short (seconds to minutes) for hot data and long (up to 24 hours) for lookup tables. Cache invalidation fires through Kafka events whenever the source data changes.
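
The combination of TTL expiry and event-driven invalidation can be sketched as follows. The TTLCache class and the shape of the change event (a dictionary with a "key" field) are assumptions; in the setup described above the cache would be Redis and the event would arrive from a Kafka topic.

```python
import time


class TTLCache:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl_seconds):
        self.entries[key] = (self.clock() + ttl_seconds, value)

    def get(self, key):
        """Return the cached value, or None when the entry is missing or expired."""
        entry = self.entries.get(key)
        if entry is None or entry[0] <= self.clock():
            self.entries.pop(key, None)
            return None
        return entry[1]

    def on_change_event(self, event):
        """Drop the cached copy as soon as the source data changes."""
        self.entries.pop(event["key"], None)
```

The TTL bounds how stale an entry can get if an invalidation event is missed, while the event handler removes the entry immediately in the normal case.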

Question 13

What is the OKAXI auto-scaling policy?

Accepted Answer

OKAXI uses Kubernetes HPA driven by two signals. CPU utilisation above 70% triggers scale up. A custom metric on Kafka consumer lag triggers scale up when it rises above a configured threshold. Scale down is slower (5-minute cooldown) to avoid thrashing during traffic oscillation. A pre-warmed pod pool stays ready for expected peaks such as campaign launches or sale events. The cluster autoscaler adds nodes when the number of pending pods exceeds a configured threshold.
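
The replica calculation can be sketched with the formula from the Kubernetes HPA documentation, desired = ceil(current replicas x current metric / target metric), taking the largest result when several metrics are configured. The function names, the replica bounds, and the lag target of 10000 used in the example are assumptions, and the sketch leaves out the HPA tolerance band and the scale-down cooldown.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50):
    """HPA-style calculation for one metric, clamped to the replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))


def scale_decision(current_replicas, metrics, **bounds):
    """metrics is a list of (current, target) pairs; the largest demand wins."""
    return max(
        desired_replicas(current_replicas, current, target, **bounds)
        for current, target in metrics
    )
```

With 4 replicas, CPU at 90% against the 70% target asks for 6 replicas, and a lag of 20000 against a target of 10000 asks for 8, so the sketch scales to 8.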

Question 14

What is the OKAXI database scaling strategy?

Accepted Answer

OKAXI applies multiple strategies based on use case. Read-heavy workloads use read replicas with the connection pool routing reads to the replica. Single-tenant write-heavy workloads use partitioning on a natural key (customer_id or a time range). Multi-tenant workloads use sharding or a database-per-tenant model. Multi-master is avoided except for special cases because conflict resolution is complex. The client receives a concrete recommendation after the architecture review.
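
Two of these strategies, key-based shard selection and read-replica routing, can be sketched as below. The shard_for function and the ReadWriteRouter class are hypothetical names, and hashing the key with SHA-256 followed by a modulo is just one possible placement scheme.

```python
import hashlib
import itertools


def shard_for(customer_id, shard_count):
    """Map a natural key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def target(self, is_write):
        return self.primary if is_write else next(self._replicas)
```

A plain modulo remaps most keys when shard_count changes, so a system that expects to add shards often would likely prefer consistent hashing or a lookup table. Reads served from replicas can also lag slightly behind the primary.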

Question 15

What is the OKAXI load balancer strategy?

Accepted Answer

OKAXI uses L7 and L4 load balancing at different tiers. Ingress uses L7 with Nginx or Envoy for HTTP routing, TLS termination, and rate limiting. The internal service mesh uses L4 for the gRPC binary protocol to minimise parsing cost. Sticky sessions are avoided across the board, and every load balancer uses round-robin or least-connection. Health checks run every 5 seconds to remove unhealthy pods from the pool quickly.
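
Least-connection selection combined with health filtering can be sketched as follows. The Backend dataclass and the pick_least_connections function are hypothetical; in practice Nginx or Envoy performs this selection internally.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active_connections: int = 0
    healthy: bool = True  # would be updated by the periodic health check


def pick_least_connections(backends):
    """Choose the healthy backend with the fewest in-flight connections."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")
    return min(candidates, key=lambda b: b.active_connections)
```

A pod that fails its health check is skipped even when it has the fewest connections, which is how unhealthy pods drop out of the pool.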

Question 16

How does OKAXI collect distributed logs?

Accepted Answer

OKAXI runs an OpenTelemetry agent inside every service container. The agent collects logs, metrics, and traces and forwards them to a collector cluster. The collector routes logs to Loki or to the ELK stack, depending on client preference. The standard log format is structured JSON with correlation_id, request_id, and customer_id fields. Searching and filtering happen through Grafana or Kibana. The retention policy is 30 days hot and 1 year cold on object storage.
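
The structured log format can be sketched with the Python standard logging module. The JsonFormatter class and the fields other than correlation_id, request_id, and customer_id are assumptions about what such a record might contain.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the correlation fields."""

    FIELDS = ("correlation_id", "request_id", "customer_id")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            # Fields arrive through logging's `extra` argument; absent ones become null.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as logger.info("order placed", extra={"correlation_id": "c-1", "request_id": "r-1", "customer_id": "42"}) then emits a single JSON line that the collector can index by those fields.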

Question 17

How is distributed tracing implemented?

Accepted Answer

OKAXI uses the OpenTelemetry SDK for Go, Python, Java, and Node backends. A trace covers the root span at Ingress, child spans for every service hop, database queries, and external API calls. Traces are stored in Jaeger or Tempo. The sampling rate is configured per service: 100% in debug environments and 1 to 10% in production depending on traffic. The trace ID propagates through HTTP headers and Kafka message headers.
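
Trace ID propagation can be sketched by hand, assuming the W3C Trace Context traceparent format that OpenTelemetry uses by default. The helper names are hypothetical, and in practice the OpenTelemetry SDK propagators do this work rather than application code.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge, for example at Ingress."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent):
    """Keep the trace ID and flags, mint a new span ID for the next hop."""
    match = TRACEPARENT.match(parent)
    if match is None:
        return new_traceparent()  # malformed or missing header: start a fresh trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The same string can travel as an HTTP header or as a Kafka message header, so a consumer span joins the trace that the producer started.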

Question 18

How does OKAXI run real-time metrics and alerting?

Accepted Answer

OKAXI uses Prometheus to scrape metrics from every service every 15 seconds. Grafana powers the real-time dashboards. AlertManager routes alerts to Slack, email, or PagerDuty depending on severity. The core metrics tracked are request rate, error rate, and latency at p50/p95/p99 (RED method), CPU, memory, and disk (USE method), and industry-specific business KPIs. Alert thresholds are configured against the target SLO.
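
The latency percentiles can be illustrated with a nearest-rank calculation over raw samples. This is a sketch of the definition only: the percentile function is a hypothetical helper, and Prometheus typically estimates quantiles from histogram buckets rather than from raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For 100 latency samples of 1 ms through 100 ms, the sketch reports p50 as 50 ms, p95 as 95 ms, and p99 as 99 ms, which shows why p99 reacts to a few slow requests that p50 hides.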

Question 19

How does OKAXI detect and isolate faults in Microservices?

Accepted Answer

There are three detection layers. First, automated alerts fire when error rate or latency exceeds the SLO. Second, distributed traces are kept for every failing request, and the on-call engineer pulls the trace ID from the logs to follow the chain of service hops. Third, internal chaos engineering injects faults regularly to verify circuit breakers and fallbacks. Once the faulty service is identified, traffic is drained through the load balancer while the team fixes the root cause.

Question 20

What is the OKAXI SLI, SLO, and SLA approach?

Accepted Answer

SLI (Service Level Indicator) is the actual measurement: availability, latency, error rate, freshness. SLO (Service Level Objective) is the internal target, for example availability of 99.9%, API latency p95 under 300 ms, and sustained throughput of 1000 RPS. SLA (Service Level Agreement) is the commitment to the client, typically set one tier looser than the SLO to leave a safety margin. The error budget is computed from the SLO. When the budget is exhausted, the team freezes new feature releases and focuses on reliability work.
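
The error budget arithmetic can be sketched as follows. The function names and the 30-day window are assumptions; the 99.9% figure comes from the example above.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures
```

A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime. On a request basis, 250 failures out of one million requests spends roughly a quarter of the budget, and a negative result from budget_remaining signals the release freeze described above.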
