Apr 24, 2025

Scaling Our Chat Infrastructure

Written by: Jaisal Friedman

Since Counsel’s inception in 2023, asynchronous chat has been the backbone of our product and care model. By November 2024, message volumes had surged, and it was clear that we had to re-architect our infrastructure for scale, reliability, and performance.

This blog post is a behind-the-scenes look at how we rebuilt our chat platform. If you’re building anything chat-related, especially in healthcare, we hope you can learn from our experience.

The Problem

We built our original chat system on Twilio Conversations, mainly for its cross-platform SMS support, which is critical for reaching our patients. But the more we scaled, the more issues we found:

  • Dropped messages: Users and physicians missed messages when the Twilio SDK silently failed to connect.

  • No retries or alerts: Failed messages weren’t retried or flagged, leaving users thinking they’d responded when they hadn’t.

  • Sluggish startup experience: The app took up to 2 minutes to load conversations because we had no control over how Twilio’s SDK loaded data.

In short, our chat infrastructure met the demand of our early users, but it didn’t reflect the quality of care we always strive to provide.

The New Architecture

We rebuilt our chat system around three key principles: control, observability, and fault tolerance.

  1. API-first design: We wrapped all write actions in our own APIs. Twilio is now just an implementation detail. These APIs are idempotent and transactional, so we never end up with inconsistent data.

  2. Startup optimization: Instead of loading all messages on launch, we fetch only paginated, recent data. This alone cut physician app startup time from 2 minutes to under 1 second.

  3. Resilient offline support: When the SDK fails to connect, we don’t leave users stranded. Messages queue locally, retry automatically, and notify users if delivery ultimately fails.
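
As a rough illustration of the startup optimization in point 2, here is a minimal sketch of loading only the newest page of a thread and paging backwards on demand. The `PagedMessage` shape, the `fetchPage` helper, and the cursor convention are assumptions made for this example, not our production API; the in-memory array stands in for whatever store actually holds the messages.

```typescript
// Hypothetical message shape: only `index` matters for paging.
interface PagedMessage {
  index: number; // position within the thread, ascending
  body: string;
}

interface MessagePage {
  items: PagedMessage[]; // oldest first within the page
  nextCursor: number | null; // pass as `before` to load older messages; null when none remain
}

// Returns up to `pageSize` messages older than `before`,
// or the newest messages when `before` is omitted.
// Assumes `thread` is sorted by ascending index.
function fetchPage(
  thread: PagedMessage[],
  pageSize: number,
  before?: number
): MessagePage {
  if (pageSize < 1) {
    throw new RangeError("pageSize must be at least 1");
  }
  const older =
    before === undefined ? thread : thread.filter((m) => m.index < before);
  const items = older.slice(Math.max(0, older.length - pageSize));
  const hasMore = older.length > items.length;
  return { items, nextCursor: hasMore ? items[0].index : null };
}
```

On launch the client would call `fetchPage(thread, pageSize)` once, then request older pages with the returned cursor only when the user scrolls up.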

Tackling Message Reconciliation

One of our trickiest bugs was handling messages that were “sent” locally, but never made it to Twilio. To fix this, we introduced a reconciliation loop between local and remote messages using Redux sagas. If a message isn’t confirmed by the server (via HTTP or WebSocket), it stays flagged as unconfirmed, and the UI reflects that state.

The code below shows how we implemented the reconciliation logic with sagas:

// Redux state
export type Threads = { [threadId: string]: Thread };
export type Messages = { [threadId: string]: Message[] };
export type LocalMessages = { [threadId: string]: LocalMessage[] };

// Redux saga - reconciliation logic that listens on any message dispatch action
export function* watchMessagesSaga() {
  // We don't care about every message update, only the latest ones batched together. takeLatest will cancel any previous in-flight updates.
  yield takeLatest(MessageActions, reconcileRemoteAndLocalMessagesSaga);
}

function* reconcileRemoteAndLocalMessagesSaga(action: MessageAction) {
  const allRemoteMessages = yield* selectTyped(selectMessages);
  const allLocalMessages = yield* selectTyped(selectLocalMessages);
  const threadsToUpdate = getConvoIdsToUpdate(action);

  for (const threadId of threadsToUpdate) {
    const remoteMessages = allRemoteMessages[threadId] ?? [];
    const localMessages = allLocalMessages[threadId] ?? [];

    const newLocalMessages = reconcileLocalAndRemoteMessages(
      localMessages,
      remoteMessages
    );

    yield put(
      setLocalMessagesForThread({ threadId, localMessages: newLocalMessages })
    );
  }
}

/**
 * Combine local and remote messages, only keep the local message if it has not yet been replaced with a remote message
 * Returns all messages (local and remote) sorted by timestamp
 */
export const reconcileLocalAndRemoteMessages = (
  localMessages: LocalMessage[],
  remoteMessages: ReduxMessage[]
): LocalMessage[] => {
  const remoteUnconfirmedMessageIds = new Set(
    remoteMessages.map((msg) => msg.attributes.messageId)
  );

  const remoteConfirmedMessageIds = new Set(
    remoteMessages.map((msg) => msg.sid)
  );

  /**
   * We construct the local message with a temporary id, which is replaced by a server-side ID when the message is confirmed.
   * We store the locally generated message id in the message attributes for future matching.
   *
   * Case 1: LocalMessage has not been replaced yet with remoteMessage, compare id with set of unconfirmed remote message ids
   * Case 2: LocalMessage already replaced with remoteMessage, compare id with set of confirmed remote message ids
   */
  const remainingLocalMessages = localMessages.filter(
    (localMessage) =>
      localMessage.unconfirmed &&
      !remoteUnconfirmedMessageIds.has(localMessage.id) &&
      !remoteConfirmedMessageIds.has(localMessage.id)
  );

  return [
    ...convertReduxToLocalMessages(remoteMessages),
    ...remainingLocalMessages,
  ].sort(sortLocalMessagesFnAsc);
};
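
To make the two cases concrete, here is a trimmed-down, self-contained version of the same filter applied to sample data. The `RemoteMsg` and `LocalMsg` shapes, the `reconcile` name, and the ids are simplified stand-ins invented for this example; the real `LocalMessage` and `ReduxMessage` types carry more fields.

```typescript
// Simplified, hypothetical shapes for illustration only.
interface RemoteMsg {
  sid: string; // server-side ID
  attributes: { messageId: string }; // locally generated ID echoed back by the server
  timestamp: number;
}

interface LocalMsg {
  id: string;
  unconfirmed: boolean;
  timestamp: number;
}

function reconcile(local: LocalMsg[], remote: RemoteMsg[]): LocalMsg[] {
  const echoedLocalIds = new Set(remote.map((m) => m.attributes.messageId));
  const serverIds = new Set(remote.map((m) => m.sid));

  // Keep a local message only while no remote message accounts for it.
  const pending = local.filter(
    (m) => m.unconfirmed && !echoedLocalIds.has(m.id) && !serverIds.has(m.id)
  );

  const confirmed = remote.map(
    (m): LocalMsg => ({ id: m.sid, unconfirmed: false, timestamp: m.timestamp })
  );
  return [...confirmed, ...pending].sort((a, b) => a.timestamp - b.timestamp);
}

const remoteExample: RemoteMsg[] = [
  { sid: "srv-1", attributes: { messageId: "tmp-1" }, timestamp: 1 },
];
const localExample: LocalMsg[] = [
  { id: "tmp-1", unconfirmed: true, timestamp: 1 }, // already echoed by the server: dropped
  { id: "tmp-2", unconfirmed: true, timestamp: 2 }, // never reached the server: kept and flagged
];
const mergedExample = reconcile(localExample, remoteExample);
```

Here `mergedExample` contains the confirmed `srv-1` followed by the still-unconfirmed `tmp-2`, which is the message the UI would flag.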

Transactional APIs & Idempotency

Chat operations now go through an internal API layer that keeps Twilio in sync with our own database. To make this safe and repeatable, we implemented a higher-order rollback function and used unique idempotency keys on every request. Inspired by Stripe, we made every POST request idempotent.
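
The post doesn't show the rollback helper itself, so the following is only a sketch of one way such a higher-order function could look. `RollbackStep` and `withRollback` are hypothetical names: each step pairs an action with its compensating undo, and a failure undoes the completed steps in reverse order before rethrowing.

```typescript
// Hypothetical step shape: an action plus the compensation that reverses it.
interface RollbackStep {
  name: string;
  run: () => Promise<void>;
  undo: () => Promise<void>;
}

// Higher-order function: takes the steps and returns a function that runs them
// in order, undoing the completed ones (newest first) if any step throws.
function withRollback(steps: RollbackStep[]): () => Promise<void> {
  return async () => {
    const completed: RollbackStep[] = [];
    try {
      for (const step of steps) {
        await step.run();
        completed.push(step);
      }
    } catch (err) {
      for (const step of completed.reverse()) {
        try {
          await step.undo();
        } catch {
          // Best effort: a real system would log and alert on a failed undo.
        }
      }
      throw err; // surface the original failure to the caller
    }
  };
}
```

For a message send, the steps might be "insert the row in our database" and "write to Twilio", so a failed Twilio write removes the row instead of leaving the two stores out of sync.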

When the same message is submitted twice concurrently, each request checks our database and locks the row for update in the same step:

  • If it’s already there, we return the same response.

  • If not, we insert it, write to Twilio, and log it.

This guarantees consistency and shields our users from bugs in connectivity or retry logic.
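
The idea can be sketched in a few lines. This is an in-memory illustration, not our server code: `IdempotentSender` and its `deliver` callback are hypothetical, and a `Map` of promises stands in for the database row and its lock, so that a concurrent duplicate shares the first request's result instead of writing again.

```typescript
type SendResult = { messageId: string };

class IdempotentSender {
  // Stand-in for a table keyed by a unique idempotency key.
  private readonly results = new Map<string, Promise<SendResult>>();
  private readonly deliver: (body: string) => Promise<SendResult>;

  constructor(deliver: (body: string) => Promise<SendResult>) {
    this.deliver = deliver;
  }

  send(idempotencyKey: string, body: string): Promise<SendResult> {
    // Already seen (finished or still in flight): return the same response.
    const existing = this.results.get(idempotencyKey);
    if (existing) {
      return existing;
    }
    // First time: perform the write and remember its outcome under the key.
    const pending = this.deliver(body).catch((err) => {
      // Forget failed attempts so the client can retry with the same key.
      this.results.delete(idempotencyKey);
      throw err;
    });
    this.results.set(idempotencyKey, pending);
    return pending;
  }
}
```

Two concurrent calls to `send` with the same key trigger a single `deliver` call and resolve to the same `messageId`. In a real service the lookup and insert would happen inside one database transaction rather than in process memory.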

Resilient Offline Support

Several parts of our client-side app rely on a WebSocket-based SDK to fetch data. Because WebSockets can be unreliable on mobile devices, we wanted to decouple the usability of the app from the connectivity state of the SDK.

To achieve this, we first implemented a class that wraps all calls to the underlying WebSocket server.

/**
 * Abstraction around our WebSocket client that handles queuing for offline-first support
 */
export abstract class BaseChatManager {
  client: WebSocketClient | null = null;

  /** Last connection state reported by the SDK client */
  private connectionState: ConnectionState = "unknown";

  /**
   * Promise that resolves only once the SDK client reaches the connected state,
   * so callers can await it to queue work until then.
   */
  private waitForConnection!: Promise<void>;
  private waitForConnectionResolve:
    | ((value: void | PromiseLike<void>) => void)
    | undefined;
  private waitForConnectionReject:
    | ((reason?: ChatManagerRejectedError) => void)
    | undefined;

  public constructor() {
    this.createConnectionPromise();
  }

  public releaseClient() {
    // Reject awaiting callers.
    // The waitForInitialized* fields mirror the waitForConnection* fields; their declarations are not shown here.
    if (this.waitForInitializedReject) {
      this.waitForInitializedReject(
        new ChatManagerRejectedError("[BaseChatManager] Releasing client")
      );
      this.waitForInitializedReject = undefined;
      this.waitForInitializedResolve = undefined;
    }
    if (this.waitForConnectionReject) {
      this.waitForConnectionReject(
        new ChatManagerRejectedError("[BaseChatManager] Releasing client")
      );
      this.waitForConnectionReject = undefined;
      this.waitForConnectionResolve = undefined;
    }
    // Start a fresh promise for the next connection attempt
    this.createConnectionPromise();
  }

  /**
   * Wrap SDK actions in this to ensure the function waits on the connection to the SDK.
   */
  private wrapAction<Args extends unknown[], Return>(
    fn: (this: BaseChatManager, ...args: Args) => Promise<Return>,
    functionName: string,
  ) {
    return async (...args: Args): Promise<Return> => {
      // Create a bound version of the function with the correct this context and client
      // The bound function is what awaits so that we still detect disconnects in between retries
      const appliedFn = async () => {
        // NOTE: we don't want to call await this.waitForConnection and kick it to the promise chain if we don't have to
        // So we check if connectionState is not "connected" first
        if (this.connectionState !== "connected") {
          await this.waitForConnection;
        }
        return fn.call(this, ...args);
      };

      return await retryAsyncFn(
        appliedFn,
        functionName
      );
    };
  }

  private createConnectionPromise() {
    this.waitForConnection = new Promise((resolve, reject) => {
      this.waitForConnectionResolve = resolve;
      this.waitForConnectionReject = reject;
    });
  }
  
  private setConnectionState(state: ConnectionState) {
    // Resolve the existing connection promise if the state is connected
    if (this.waitForConnectionResolve && state === "connected") {
      this.waitForConnectionResolve();
      this.waitForConnectionResolve = undefined;
      this.waitForConnectionReject = undefined;
      // Leave the connection promise in a resolved state but release the resolve/reject functions so they can't be called again.
    }
    this.connectionState = state;
    this.handleConnectionStateChanged(state);
  }
  
  private onDisconnected() {
    this.handleDisconnected();
    // Reject the existing connection promises if they exist (there shouldn't be any if we were connected then went to a disconnected state)
    if (this.waitForConnectionReject) {
      this.waitForConnectionReject(
        new ChatManagerRejectedError("[BaseChatManager] Disconnected")
      );
      this.waitForConnectionReject = undefined;
      this.waitForConnectionResolve = undefined;
    }

    // Create a new connection promise for the next time waitForConnection() is called
    this.createConnectionPromise();
  }
  
  /**
   * onConnectionStateChanged is called each time the websocket client detects a connection state change.
   * It may receive duplicate events.
   */
  private onConnectionStateChanged(newState: ConnectionState) {
    if (newState === this.connectionState) {
      return;
    }
    if (
      this.connectionState === "connected" &&
      DISCONNECTED_STATES.includes(newState)
    ) {
      this.onDisconnected();
    }

    this.setConnectionState(newState);
  }

  /**
   * Example implementation of a function that awaits connection before proceeding.
   */
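  public getLastMessageReadIndexForThread = this.wrapAction(
    async function GetLastMessageReadIndexForThread(threadId: string) {
      // wrapAction only calls this once connected, but `client` is typed as nullable, so guard it
      if (!this.client) {
        throw new ChatManagerRejectedError("[BaseChatManager] Client not set");
      }
      return await this.client.getLastMessageReadIndexForThread(threadId);
    },
    "GetLastMessageReadIndexForThread"
  );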
}
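
The wrapper above delegates to `retryAsyncFn`, which isn't shown in this post. As a minimal sketch of what such a helper could look like, assuming a fixed attempt budget with exponential backoff: the `maxAttempts`, `baseDelayMs`, and `sleep` parameters are hypothetical additions for this example, and the real helper may behave differently.

```typescript
// Hypothetical retry helper: retries `fn` with exponential backoff.
// `sleep` is injectable so the backoff can be stubbed out in tests.
async function retryAsyncFn<T>(
  fn: () => Promise<T>,
  functionName: string,
  maxAttempts = 3,
  baseDelayMs = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Each attempt re-runs `fn`, so a wrapper that awaits the connection
      // inside `fn` re-checks connectivity before every retry.
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait 1x, 2x, 4x, ... the base delay between attempts.
        await sleep(baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }
  throw new Error(
    `${functionName} failed after ${maxAttempts} attempts: ${String(lastError)}`
  );
}
```

Passing the function name along keeps the final error message traceable to the SDK call that exhausted its retries.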

Second, we made sure that every critical piece of data also has an HTTP-based API, which our app falls back to once the SDK has been stuck connecting for longer than 10 seconds.
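
A fallback like this can be sketched as a timeout race. The helper names `withTimeout` and `fetchWithFallback` and the two loader callbacks are assumptions for illustration. Note that this simplified version also falls back when the SDK call fails outright, not only when it is stuck connecting.

```typescript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms
    );
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Try the SDK-backed loader first; use the HTTP loader if it is too slow or fails.
async function fetchWithFallback<T>(
  viaSdk: () => Promise<T>,
  viaHttp: () => Promise<T>,
  timeoutMs = 10000
): Promise<T> {
  try {
    return await withTimeout(viaSdk(), timeoutMs);
  } catch {
    return viaHttp();
  }
}
```

Here `viaSdk` would be one of the wrapped SDK actions and `viaHttp` the equivalent HTTP call, so the screen can render even while the WebSocket is still connecting.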

Why We Stayed on Twilio

We also debated moving off Twilio entirely. Options like Ably or a homegrown WebSocket solution looked appealing, but supporting SMS remains a core requirement for us. In the meantime, we’ve architected our system so that swapping out Twilio in the future should be straightforward.

The Outcomes

Since rolling out this new infrastructure:

  • Message delivery failures have dropped to near-zero

  • Uptime has stayed above 99.9%

Most importantly, we’re now designed to deliver reliable, high-quality care at scale. Infrastructure may not be flashy, but it’s what makes our mission of multiplying the world’s clinical capacity possible.

If you’re building in health tech and struggling with real-time communication, we’ve been there, so feel free to reach out.

If you’re an engineer who likes working on hard, meaningful problems, we’re hiring.

About Counsel Health

Counsel is a virtual medical practice specializing in messaging-based care. We provide patients unlimited access to medical advice from expert physicians.
