I rebuilt the backend of FernLabour.com on Cloudflare Workers and Durable Objects in Rust. The result is an event-sourced system where each labour runs inside its own Durable Object with its own SQLite event store, projections, and WebSocket subscribers.
Quick context before diving in. FernLabour is a labour tracking app: a mother in labour (or someone with her) times contractions, posts updates, and pushes real-time notifications to the family members who want to follow along without being in the room.
It's more than a single-device stopwatch, though. The mother can invite loved ones to subscribe and can control what level of access they have: she can allow birth partners and other trusted people to actively participate in tracking contractions and posting updates from their own devices, while others may be invited purely as observers. Multiple people can therefore be involved in recording and updating the labour at the same time (ever tried to make a distributed stopwatch? It's a whole other blog post). Subscribed family members typically sit in an observer role: they don't time contractions or modify the labour state, but instead receive real-time updates via the web app, WhatsApp, or SMS.
graph LR
Mother["Mother (full control over roles/permissions)"]
Subscriber["Subscribers (role-based: partner / observer / family)"]
Labour[("Labour")]
Mother -- "defines permissions;\ntimes contractions;\nposts updates;" --> Labour
Subscriber -- "may time contractions\nor observe (role-based)" --> Labour
Labour -- "real-time updates\nweb / WhatsApp / SMS" --> Subscriber
The original backend was written in Python and ran on GCP. It worked fine, but it cost around £93 a month: a Compute Engine instance for Keycloak and Pub/Sub consumers (£30), Cloud SQL for Postgres (£26), a load balancer (£20), Cloud Run services (£15), plus some extra bits. A lot for a small side project. I knew I'd eventually run out of free credits and would have to either shut it down or rethink the architecture.
Instead of continuing to work on the GCP stack, I experimented with Cloudflare Workers and Durable Objects. They offer a very different model: per-entity compute with state bundled in the same process, automatic global placement near whoever first used the thing, and no idle cost when nothing is happening. I also used the rewrite as an excuse to move the backend to Rust, largely because it's a language I enjoy working in.
The resulting Cloudflare setup costs about £3.80 ($5) per month for the Workers Paid plan, and the current traffic still sits comfortably inside the free usage tier.
This post is a long walkthrough of the final architecture. If you're new to Cloudflare Workers, we'll start with the basics. If you're experienced, the core insights are in the middle: details on CQRS and event sourcing with Durable Objects, why some projectors run synchronously versus asynchronously, and the process manager's role in orchestrating actions.
Iâll be pointing at real code from the fern-labour-cloudflare repo as we go.
The usual mental model for a backend is a set of separate services: application servers handling requests, a database for persistence, a message bus for background work, and often a cache alongside them. Cloudflare's stack replaces that collection of infrastructure with a smaller set of primitives that run at the edge across hundreds of data centres, with a very different set of constraints.
Here's just enough of it to understand what comes next.
A Cloudflare Worker is a serverless function that runs in a V8 isolate at one of Cloudflare's data centres. Isolates start in milliseconds instead of seconds, which means Workers can actually scale to zero; there's no container to keep warm, so you pay per request and nothing in between.
Workers can be written in JavaScript, TypeScript, Python, or Rust. I went with Rust, which compiles to WebAssembly through the workers-rs crate.
A Worker can be public via a URL or private via a service binding. A service binding lets one Worker call another inside the same Cloudflare process, with no HTTP round-trip and no overhead. This makes it easier to split a large app into smaller, self-contained microservices.
Workers are stateless. Durable Objects (DOs) are the stateful counterpart.
A Durable Object is a globally addressable single thread of execution with its own embedded storage. You address it by a string ID, and Cloudflare guarantees that all requests to that ID go to the same instance, in one data centre, on one thread. If the instance doesn't exist, it's created on demand. If nothing is talking to it, it goes to sleep.
Each DO gets up to 10 GB of durable SQLite storage, which runs in the same process as your code. SQL queries are synchronous and return in microseconds to low milliseconds.
Durable Objects have a built-in scheduling primitive called alarms. You tell a DO to wake up at a given time, or right after the current request finishes, and it re-enters via an alarm() handler.
Only one alarm may run per DO at a time. If the handler throws, Cloudflare retries it with exponential backoff for up to 6 attempts. This becomes the system's background work mechanism.
A Durable Object can accept WebSocket upgrades and broadcast messages to every connected client. Hibernation Mode is the interesting part: the DO can go dormant while WebSockets stay open, and Cloudflare hands the connections back when a message arrives. You don't pay for idle time.
This lets me keep a live connection to every subscriber watching a labour without paying for an always-on process per labour.
D1 is Cloudflare's managed SQLite-as-a-service, separate from DO storage. One D1 database can be queried from any Worker, from anywhere. It's a relational store that lives alongside your Workers rather than inside any one of them.
That's the stack. The rest of the post is about how the labour backend uses it: Workers handle the HTTP endpoints, one Durable Object per labour holds the state and event stream for that labour, and D1 holds the few read models that need to be queried from outside any single DO. Service bindings are how the Workers communicate without going over the network.
graph LR
Client(["Client"])
subgraph pub["Public"]
API["API Worker"]
end
subgraph priv["Private Workers"]
Auth["Auth Service"]
User["User Service"]
end
subgraph dos["Durable Objects"]
LabourDO[("Labour DO\none per labour")]
end
D1[("D1\nshared read models")]
Client -- "HTTP" --> API
API -- "service binding" --> Auth
API -- "service binding" --> User
API -- "command or\nper-labour query" --> LabourDO
API -- "cross-aggregate\nquery" --> D1
LabourDO -. "async projector" .-> D1
A quick refresher on the two terms. Event sourcing means the source of truth is the stream of events themselves: every change to the domain is appended to the event store as an immutable event. The current state is derived by replaying them. CQRS (Command Query Responsibility Segregation) means the write and read paths run against different models, and often different databases.
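The event-sourcing half is small enough to sketch in plain Rust. The types below are illustrative stand-ins, not the app's real events or aggregate:

```rust
// Minimal event-sourcing sketch: state is derived by folding events.
// LabourEvent and LabourState here are invented for illustration.
#[derive(Debug, Clone, PartialEq)]
enum LabourEvent {
    LabourPlanned,
    ContractionStarted { at: u64 },
    ContractionEnded { at: u64 },
}

#[derive(Debug, Default, PartialEq)]
struct LabourState {
    planned: bool,
    contraction_count: u32,
    contraction_active: bool,
}

impl LabourState {
    // Pure transition function: old state + event -> new state.
    fn apply(mut self, event: &LabourEvent) -> Self {
        match event {
            LabourEvent::LabourPlanned => self.planned = true,
            LabourEvent::ContractionStarted { .. } => self.contraction_active = true,
            LabourEvent::ContractionEnded { .. } => {
                self.contraction_active = false;
                self.contraction_count += 1;
            }
        }
        self
    }
}

// Replay: current state is a fold over the full event stream.
fn replay(events: &[LabourEvent]) -> LabourState {
    events.iter().fold(LabourState::default(), LabourState::apply)
}
```

The important property is that `apply` is pure: the same event stream always folds to the same state, which is what makes replay, projections, and retries safe later on.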
The two show up together often and share a few well-known awkward spots: read models lag behind the write side, events have to be processed in order without any being dropped, and concurrent commands against the same aggregate need serialising.
Durable Objects give you most of the right primitives out of the box: a single thread per aggregate, an embedded SQLite store in the same process, alarms for background work, and hibernating WebSockets for fan-out.
There are catches: storage is billed per SQL row read and written, each object runs on a single thread so a hot aggregate cannot scale out, and no query can span the storage of more than one object.
For the write path of an event-sourced aggregate, though, the fit is still pretty good.
The labour app allocates one Durable Object per labour. So each labour gets its own "server" in a data centre close to the mother, with its own SQLite database, and its own WebSocket connections. Storage and compute are sharded per aggregate.
Letâs follow a request.
The client talks to a public Worker called the API Worker. This is the only publicly exposed surface of the labour backend. Everything else is a private Worker or a DO reachable only through it.
It authenticates the request, determines which labour it's about, and forwards it to the appropriate Durable Object.
Before anything else, the API Worker validates the caller's token by calling the auth service. The auth service is a separate Worker that verifies tokens from multiple issuers (Clerk for end users, a Cloudflare-issued service token for internal callers).
The API Worker doesn't call the auth service over HTTP. It calls it over a service binding, in the auth middleware:
let user = match ctx
.data
.auth_service
.authenticate(
&authorization,
vec![
auth_issuers::CLOUDFLARE.to_string(),
auth_issuers::CLERK.to_string(),
],
)
.await
{...}
Service bindings use Cloudflare's internal routing rather than DNS and TLS, avoiding network overhead. They allow you to split a backend into multiple Workers without the latency of network round-trips.
Once authenticated, the command handler pulls the labour_id from the command body, looks up the matching DO, and forwards the request:
let labour_id = command.labour_id();
let mut do_response = ctx
.data
.do_client
.send_command(labour_id, command_payload, &user, url)
.await?;
From Cloudflare's side, this becomes an internal request routed through the platform. The API Worker obtains a handle to the Durable Object for that labour ID and sends an HTTP-shaped request to it. Cloudflare then routes the request to whichever data centre hosts the DO, creating one if necessary.
Every DO has a fetch method that receives these forwarded requests. Mine lives in durable_object/mod.rs and looks like this:
async fn fetch(&self, req: Request) -> Result<Response> {
if req.path() == "/websocket" {
return self
.services
.write_model()
.websocket_service
.upgrade_connection(req, &self.state)
.await;
}
let result = route_request(req, &self.services).await?;
if result.status_code() == 204 {
self.alarm_manager.set_alarm(0).await?;
}
Ok(result)
}
The DO's internal router dispatches to a command processor, a query handler, or a WebSocket upgrade handler. The command processor runs the domain logic against the aggregate and appends any new events to the event store.
The aggregate itself is cached in an SQL table on the DO and updated as events are appended (incremental snapshotting), rather than rebuilt from the event store on every command. That caching logic lives in the aggregate_repository. A textbook event-sourced system rehydrates the aggregate by replaying its events each time, which is fine in a typical setup. With Durable Objects, there is a slight catch: the Durable Object billing model charges per SQL row read, so replaying a long event history on every command can get expensive quickly.
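The incremental-snapshot idea can be sketched in plain Rust (the types are invented for illustration; the real logic lives in the aggregate_repository): cache the state together with the sequence of the last event folded in, then fold only newer events.

```rust
// Sketch of incremental snapshotting with illustrative types: instead of
// replaying the whole stream on every command, cache (state, last_sequence)
// and fold only events whose sequence is newer than the snapshot.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    count: u32,         // stand-in for the aggregate state
    last_sequence: i64, // sequence of the last event folded in
}

// Stand-in transition function: every event bumps a counter.
fn apply(state: u32, _event: &str) -> u32 {
    state + 1
}

// Catch the snapshot up using only rows with sequence > last_sequence,
// so a command reads O(new events) instead of O(full history).
fn catch_up(mut snap: Snapshot, events: &[(i64, String)]) -> Snapshot {
    for (seq, event) in events {
        if *seq > snap.last_sequence {
            snap.count = apply(snap.count, event);
            snap.last_sequence = *seq;
        }
    }
    snap
}
```

Under the per-row-read billing model, a command then pays for rows proportional to what was appended since the snapshot, not the aggregate's lifetime history.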
This next part is the most important property of the whole architecture: once a command is deserialised, the write path is synchronous from start to finish. route_request is async only for HTTP deserialisation. After that, loading the cached aggregate, applying the command, appending events, and updating the cache all run with no await.
We're so used to async persistence that a fully synchronous write path feels odd at first. But it's exactly what the single-threaded DO model gives you. A DO only gives you serialised commands for free if you never yield the thread mid-command. Every .await in the command path is a chance for a second command to slip in ahead of you, and then you're back to locks and lost updates.
Keep the path synchronous, and two commands that hit the same DO at the same instant run in order. The one that finishes deserialising first runs to completion while the second waits. When the second starts, the aggregate already includes the first commandâs events.
If the command succeeded (204 No Content), we schedule an alarm to fire in 0ms and return the 204 immediately. The response goes back to the client before any side effects run. This keeps the command path fast; the alarm performs the side effects.
The alarm handler takes on three types of work: updating read models within the Durable Object, broadcasting events to connected clients, and projecting selected data to shared read models outside the object. The first category runs synchronously on the DO's own SQLite storage. The others run asynchronously because they involve network calls.
async fn alarm(&self) -> Result<Response> {
let alarm_services = self.services.async_processors();
let sequence_before = alarm_services
.sync_projection_processor
.get_last_processed_sequence();
// 1. Local read models (sync, in-process SQL)
let sync_result = alarm_services
.sync_projection_processor
.process_projections();
// 2. Broadcast the new events to WebSocket clients
if sync_result.is_ok() {
alarm_services
.websocket_event_broadcaster
.broadcast_new_events(&self.state, sequence_before)
.ok();
}
// 3. Async read models (projected out to D1)
let async_result = alarm_services
.async_projection_processor
.process_projections()
.await;
// 4. Policies and their side effects
let process_manager_result = self
.services
.process_management()
.process_manager
.on_alarm()
.await;
// If any stage failed, return an error so Cloudflare retries us
if sync_result.is_err() || async_result.is_err() || process_manager_result.is_err() {
return Err(worker::Error::RustError("Error in alarm handling".to_string()));
}
// 5. If the process manager produced new events, schedule another alarm
if alarm_services.sync_projection_processor.has_unprocessed_events() {
self.alarm_manager.set_alarm(0).await?;
}
Response::empty()
}
The alarm coordinates all of that work. Each stage is responsible for a different part of the system.
Before getting into projectors, a note on where the events actually live. The event store is a single SQLite table inside each DO, defined in event_store.rs. It holds every event for that one aggregate: append-only, ordered by an autoincrementing sequence, and scoped to the DO it lives in. Nothing is ever updated or deleted.
CREATE TABLE events (
sequence INTEGER PRIMARY KEY AUTOINCREMENT,
aggregate_id TEXT NOT NULL,
event_type TEXT NOT NULL,
event_data TEXT NOT NULL,
event_version INTEGER NOT NULL DEFAULT 1,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
user_id TEXT NOT NULL
);
Because the store is per-DO, the sequence is per-aggregate too. Projectors track their progress by sequence, and replay is just reading rows in order.
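An in-memory stand-in makes the store's contract easy to see (the real store is the SQLite table above; this sketch only mirrors its append-only, read-forward semantics):

```rust
// Minimal in-memory stand-in for the per-DO event store: append-only,
// with a per-aggregate autoincrementing sequence. Rows are never
// updated or deleted.
struct EventStore {
    rows: Vec<(i64, String)>, // (sequence, event_data)
}

impl EventStore {
    fn new() -> Self {
        Self { rows: Vec::new() }
    }

    // Append assigns the next sequence number and returns it.
    fn append(&mut self, event_data: &str) -> i64 {
        let seq = self.rows.len() as i64 + 1;
        self.rows.push((seq, event_data.to_string()));
        seq
    }

    // Projectors and the WebSocket broadcaster read forward from a
    // checkpoint: everything strictly after `since`, up to a batch limit.
    fn events_since(&self, since: i64, limit: usize) -> Vec<(i64, String)> {
        self.rows
            .iter()
            .filter(|(seq, _)| *seq > since)
            .take(limit)
            .cloned()
            .collect()
    }
}
```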
The first thing the alarm does is run the sync projectors. A projector turns events into read models: given a batch of events and the current model, it returns a new one.
Sync projectors write their read models into the DO's own SQLite database, alongside the event store. The update is an in-process SQL call with no network hop.
Each projector keeps its own checkpoint (the last sequence it processed), so they make progress independently and can be paused or reset on their own. The loop is in sync_processor.rs:
pub fn process_projections(&self) -> Result<bool> {
let checkpoint_map: HashMap<String, ProjectionCheckpoint> = self
.checkpoint_repository
.get_all_checkpoints()
.map_err(|e| anyhow!("Failed to fetch all checkpoints: {e}"))?
.into_iter()
.map(|cp| (cp.projector_name.clone(), cp))
.collect();
for (projector_name, projector) in &self.projectors {
if checkpoint_map
.get(projector_name)
.is_some_and(|cp| cp.is_faulted(MAX_PROJECTOR_ERROR_COUNT))
{
warn!("Skipping faulted projector {projector_name}");
continue;
}
match self.process_single_projector_with_checkpoint(
projector_name,
projector.as_ref(),
checkpoint_map.get(projector_name),
) {...};
}
// ...
}
The checkpoint tracks the last successful sequence and an error count. If a projector fails five times in a row, it's marked faulted and skipped on subsequent runs. That keeps one broken projector from blocking the others.
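The checkpoint bookkeeping can be sketched in isolation (illustrative types mirroring the behaviour described above, not the repo's exact code):

```rust
// Per-projector checkpoint with fault tracking: a run of consecutive
// failures marks the projector faulted; any success resets the counter.
const MAX_PROJECTOR_ERROR_COUNT: u32 = 5;

struct ProjectionCheckpoint {
    last_sequence: i64,
    error_count: u32,
}

impl ProjectionCheckpoint {
    fn is_faulted(&self, max_errors: u32) -> bool {
        self.error_count >= max_errors
    }

    // A successful run advances the checkpoint and clears the fault count.
    fn record_success(&mut self, new_sequence: i64) {
        self.last_sequence = new_sequence;
        self.error_count = 0;
    }

    // A failed run leaves last_sequence alone so the batch is retried.
    fn record_failure(&mut self) {
        self.error_count += 1;
    }
}
```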
The labour app has sync projectors for the current labour summary, the list of contractions, the list of labour updates, and the mother's subscription token. These are the read models a client hits when loading the labour screen.
Because they're in the same SQLite database as the event store, a query is a local SQL call and returns basically instantly.
Once the sync projectors have caught up, the alarm fans the new events out to every connected WebSocket client. The broadcaster is in event_broadcaster.rs:
pub fn broadcast_new_events(&self, state: &State, since_sequence: i64) -> Result<()> {
let new_events = self
.event_store
.events_since(since_sequence, self.default_batch_size)?;
for ws in state.get_websockets() {
for event in &new_events {
if let Err(e) = ws.send(&event.event_data) {
warn!(error = ?e, "Failed to send event to client");
}
}
}
// ...
}
The WebSocket layer uses Durable Object WebSocket Hibernation. On first connect, the client upgrades via a regular fetch call. We validate the token, attach the user to the socket so later messages can be authorised, and return the WebSocket response. This all happens in the WebSocket service:
let WebSocketPair { client, server } = WebSocketPair::new()?;
state.accept_web_socket(&server);
server.serialize_attachment(&user)?;
From that point, the DO can hibernate. When a client sends a message, Cloudflare wakes the DO, checks the attachment to verify the user, and calls the websocket_message handler. When the alarm runs and we call ws.send(...), the socket is woken from the Cloudflare side without the DO needing to stay alive in between.
Thousands of labour DOs can hold sockets for their subscribers without burning idle compute. Subscribers get real-time updates over sockets that have been open for hours, and I don't pay for the idle time.
On the client side, the WebSocket messages aren't rendered directly. They're used as cache invalidation signals for React Query: when an event arrives, the frontend marks the relevant queries stale and lets them refetch, at which point the read model has been updated. The client never has to reconstruct state from the event stream itself; the DO's read models stay the source of truth, and the socket just tells the client when to ask again.
Sync projectors are great, but they're per-DO. A query for labour history, for example, can't be answered by any single DO, because the data lives across several of them.
That's what async projectors are for.
An async projector takes the same events and writes them out to read models in D1, the shared database. They're called "async" because the write itself is: D1 is a separate Cloudflare service, so every update crosses the network. (The sync projectors also run in the alarm, after the command response has gone out; the distinction isn't where they sit in the request path, it's that one writes to in-process SQLite and the other doesn't.)
D1 writes go over the network, so we don't want one on every event. The async projector holds the projected read model in an in-DO cache (backed by DO storage) and only writes to D1 when the model changes. The shape of an async projector is defined by the AsyncProjector trait, and async_processor.rs is what drives them from the alarm:
async fn process(
&self,
cache: &Rc<dyn CacheTrait>,
events: &[EventEnvelope<LabourEvent>],
max_sequence: i64,
) -> Result<()> {
let cached_state: CachedReadModelState<LabourStatusReadModel> = cache
.get(self.cache_key.clone())?
.unwrap_or_else(CachedReadModelState::empty);
let before = cached_state.model.clone();
let mut current_model = cached_state.model;
for envelope in events {
current_model = self.project_event(current_model, envelope);
}
if before != current_model
&& let Some(new_model) = &current_model
{
self.repository.overwrite(new_model).await?;
}
cache.set(
self.cache_key.clone(),
&CachedReadModelState::new(max_sequence, current_model),
)?;
Ok(())
}
So the flow is: load the cached model and its last processed sequence from the in-DO cache, fold the new events on top, write the result out to D1 only if the model actually changed, then store the updated model and sequence back in the cache.
Two things fall out of this. We don't have to replay the projector over every event on each run, and we don't have to fetch the current model back from D1 to fold new events into it either. The cache holds the model locally; new events are folded on top, and D1 is only hit when the result has actually changed.
The labour status read model backs the labour history page. A mother can only have one active labour at a time, but she may accumulate several over the years, and each one is its own DO. Loading the history page means pulling her list of labours from across those DOs, which a single DO can't answer, but D1 can. Fine-grained per-labour data (every contraction, every update, every subscription) stays in the DO and is never projected to D1. Sharding storage per aggregate keeps the D1 side small.
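The write-if-changed step is easy to sketch with stand-in types (the real code sits behind the AsyncProjector trait; the event encoding below is invented for illustration):

```rust
// Fold new events onto the cached read model and compare before/after:
// the remote write is only needed when the fold actually changed the model.
#[derive(Clone, PartialEq, Debug, Default)]
struct StatusModel {
    phase: String,
}

// Stand-in projection function: only "phase:*" events touch this model.
fn project(mut model: StatusModel, event: &str) -> StatusModel {
    if let Some(phase) = event.strip_prefix("phase:") {
        model.phase = phase.to_string();
    }
    model
}

// Returns (new_model, needs_remote_write). The expensive D1 round-trip
// is skipped when folding the batch produced an identical model.
fn fold_and_diff(cached: StatusModel, events: &[&str]) -> (StatusModel, bool) {
    let before = cached.clone();
    let after = events.iter().fold(cached, |m, e| project(m, e));
    let changed = after != before;
    (after, changed)
}
```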
Both are driven by the same events from the same event store. Only the location of the read model and its freshness change.
The write path (command in, events out, projections updated) is the complex part. The read path is simpler, but it's worth making explicit since CQRS means the two sides are completely separate.
Per-labour reads go to the DO. The DO's fetch handler routes query requests to a query handler, which reads directly from the sync projector read models in DO SQLite. SQLite runs in-process, and queries are automatically cached, so repeated reads often return in sub-millisecond time without touching disk. There's no separate cache layer to manage.
Cross-aggregate reads skip the DO entirely. The API Worker queries D1 directly. D1 is slower than a DO query (network round-trip instead of an in-process call), but that's the right trade-off: aggregate-level data stays in its DO, and D1 only holds the small set of cross-cutting read models that async projectors write into it.
So far, we've covered the write path, the read path, and the alarm that coordinates background work. What we haven't covered is the part that reacts to events by doing things in the world: sending an email, generating a subscription token, issuing another command.
That's the job of the process manager. It turns events into actions in two stages. First, it decides what should happen in response to an event. Then it executes those decisions and tracks their progress so that failures can be retried safely.
The decision part is handled by policies. A policy is a pure function from an event (plus the aggregate state, for context) to a list of effects. Effects are what should happen. The policy and effect traits live in the event-sourcing-rs package.
Here's the policy that fires when a labour is planned:
impl HasPolicies<Labour, Effect> for LabourPlanned {
fn policies() -> &'static [PolicyFn<Self, Labour, Effect>] {
&[generate_subscription_token]
}
}
fn generate_subscription_token(event: &LabourPlanned, ctx: &PolicyContext<Labour>) -> Vec<Effect> {
vec![Effect::GenerateSubscriptionToken {
labour_id: event.labour_id,
idempotency_key: IdempotencyKey::for_command(
event.labour_id,
ctx.sequence,
"generate_subscription_token",
),
}]
}
And the one for a posted announcement:
fn notify_subscribers_on_announcement(
event: &LabourUpdatePosted,
ctx: &PolicyContext<Labour>,
) -> Vec<Effect> {
if event.labour_update_type != LabourUpdateType::ANNOUNCEMENT || event.application_generated {
return vec![];
}
let sender_id = ctx.state.mother_id().to_string();
ctx.state
.subscriptions()
.iter()
.filter(|s| s.status() == &SubscriberStatus::SUBSCRIBED)
.flat_map(|subscription| {
subscription.contact_methods().iter().map(move |channel| {
Effect::SendNotification(NotificationIntent {
idempotency_key: IdempotencyKey::for_notification(
event.labour_id,
ctx.sequence,
subscription.subscriber_id(),
"announcement",
),
context: NotificationContext::Subscriber { /* ... */ },
})
})
})
.collect()
}
This one runs when a LabourUpdatePosted event appears. Non-announcements and system-generated updates get dropped. For everything else, it walks the subscriber list and emits one SendNotification effect per contact method for every still-subscribed subscriber.
The policy doesn't actually send anything, though. It returns SendNotification effects with a fully populated NotificationIntent inside, and the executor does the sending later. That separation keeps the policy a pure function of state and events, so you can test it without mocks, and it can't double-send by accident because it isn't the thing that sends.
Each effect also carries an idempotency key built from the aggregate ID, the event sequence, and a stable discriminator. If the same policy runs twice for the same event (alarm retry, replay), it produces the same effect with the same key, and the ledger rejects the duplicate.
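A sketch of the key's construction, assuming a simple string encoding (the repo's IdempotencyKey may encode differently; the property that matters is determinism over the same inputs):

```rust
// Deterministic idempotency key from aggregate ID, event sequence, and a
// stable discriminator. Re-running the same policy for the same event
// always yields the same key, so the ledger can drop duplicates.
fn idempotency_key(aggregate_id: &str, sequence: i64, discriminator: &str) -> String {
    format!("{aggregate_id}:{sequence}:{discriminator}")
}
```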
Once policies have produced a list of effects, the process manager writes them into a ledger, another SQLite table inside the DO. The ledger is defined in ledger.rs:
CREATE TABLE pending_effects (
effect_id TEXT PRIMARY KEY,
event_sequence INTEGER NOT NULL,
effect_type TEXT NOT NULL,
effect_payload TEXT NOT NULL,
idempotency_key TEXT NOT NULL UNIQUE,
status TEXT NOT NULL DEFAULT 'PENDING',
attempts INTEGER NOT NULL DEFAULT 0,
last_attempt_at DATETIME,
last_error TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);The INSERT OR IGNORE on idempotency_key keeps duplicates out.
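An in-memory stand-in shows the dedup behaviour the ledger relies on (the real ledger is the SQLite table above; this sketch mirrors its UNIQUE-key semantics only):

```rust
use std::collections::HashMap;

// Status of a recorded effect, mirroring the ledger's status column.
#[derive(Debug, PartialEq)]
enum EffectStatus {
    Pending,
    Completed,
}

// In-memory stand-in for the effect ledger, keyed by idempotency key.
struct Ledger {
    by_key: HashMap<String, EffectStatus>,
}

impl Ledger {
    fn new() -> Self {
        Self { by_key: HashMap::new() }
    }

    // Same semantics as INSERT OR IGNORE on a UNIQUE column: returns true
    // if the effect was newly recorded, false if it was a duplicate.
    fn insert_or_ignore(&mut self, idempotency_key: &str) -> bool {
        if self.by_key.contains_key(idempotency_key) {
            return false;
        }
        self.by_key
            .insert(idempotency_key.to_string(), EffectStatus::Pending);
        true
    }

    // A completed effect is never dispatched again.
    fn mark_completed(&mut self, idempotency_key: &str) {
        if let Some(status) = self.by_key.get_mut(idempotency_key) {
            *status = EffectStatus::Completed;
        }
    }
}
```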
The process manager runs in the alarm, not the command transaction, so there is a gap: events are appended first, and effects are derived and written later. This is a weaker guarantee than a true outbox, where events and side-effects are saved transactionally. What saves us is that policies are pure functions of the event data. A retried alarm re-derives the exact same effects with the exact same keys, and INSERT OR IGNORE drops the duplicates.
Once effects are on the ledger, the process manager walks through pending rows and hands each one to an executor that performs it:
pub async fn dispatch_pending_effects(&self) -> Result<()> {
let pending = self.ledger.get_pending_effects(self.max_retry_attempts)?;
for record in pending {
self.ledger.mark_dispatched(&record.effect_id)?;
let effect: Effect = serde_json::from_str(&record.effect_payload)?;
match self.executor.execute(&effect).await {
Ok(()) => self.ledger.mark_completed(&record.effect_id)?,
Err(e) => {
let exhausted = record.attempts + 1 >= self.max_retry_attempts;
self.ledger.mark_failed(&record.effect_id, &e.to_string(), exhausted)?;
}
}
}
Ok(())
}
The executor is where pure data turns back into real side effects. SendNotification becomes a service-binding call to the notification worker. IssueCommand re-enters the command processor (appending more events, which the process manager picks up on the next alarm). GenerateSubscriptionToken generates the token and issues a SetSubscriptionToken command against the same aggregate.
An effect that fails is retried up to six times (courtesy of the alarm retry policy plus ledger tracking). An effect that succeeds is marked as completed and never runs again.
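The retry bookkeeping reduces to a small state machine (illustrative, not the repo's ledger code):

```rust
// Each dispatch attempt bumps a counter; an effect whose attempts reach
// the cap is marked exhausted so later alarm runs skip it.
const MAX_RETRY_ATTEMPTS: u32 = 6;

#[derive(Debug, PartialEq)]
enum Outcome {
    Completed,  // succeeded; never runs again
    RetryLater, // failed, but retries remain
    Exhausted,  // failed for the last time; skipped from now on
}

// Returns the updated attempt count and what should happen next.
fn record_attempt(attempts_so_far: u32, succeeded: bool) -> (u32, Outcome) {
    let attempts = attempts_so_far + 1;
    if succeeded {
        (attempts, Outcome::Completed)
    } else if attempts >= MAX_RETRY_ATTEMPTS {
        (attempts, Outcome::Exhausted)
    } else {
        (attempts, Outcome::RetryLater)
    }
}
```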
There is still a dual-write problem here: a gap exists between executing the effect and marking it as complete. But in general, there are fewer opportunities to get into trouble here than in a distributed system separated by the network. Service binding calls don't leave the Cloudflare process, so most of the usual networking failures aren't in the picture. The downstream Workers are serverless too, so there's no process to be "down" in the old sense. What's left are transient Cloudflare errors, and the alarm retry policy handles those.
Put it all together, and one command triggers: an append to the event store, sync projection updates in DO SQLite, a WebSocket broadcast to subscribers, async projection writes out to D1, and any policy-derived effects dispatched through the ledger.
Each step is either transactional or retried. Nothing silently drops work. If the alarm fails halfway, Cloudflare retries the entire handler, and each step is idempotent with respect to its own checkpoint or ledger row.
flowchart TD
Client(["Client"])
API["API Worker"]
Auth["Auth Service"]
D1[("D1\nshared read models")]
Client --> API
API -- "authenticate" --> Auth
Auth -. "user" .-> API
%% Write side
API -- "command" --> DO_W
subgraph DO["Labour Durable Object"]
direction TB
DO_W["command handler\n→ domain logic\n→ append events"] --> SCHED["set alarm · 0 ms"]
DO_R["query handler\n→ read models\ncached"]
end
SCHED -- "204" --> Client
SCHED --> AH
subgraph AH["Alarm handler (after response)"]
direction TB
SYNC["sync projectors\nupdate DO read models"] --> WS["WebSocket broadcast"]
WS --> ASYNC_P["async projectors\nwrite to D1"]
ASYNC_P --> PM["process manager\n→ effect ledger\n→ dispatch effects"]
end
ASYNC_P -. "D1 write" .-> D1
PM -- "service binding" --> EXT["other Workers\nnotifications, etc."]
%% Read side
API -- "per-labour query" --> DO_R
API -- "cross-aggregate query" --> D1
DO_R -- "response" --> Client
D1 -- "response" --> Client
The labour backend doesn't live alone. A handful of supporting Workers sit around it.
Auth service. Validates tokens from multiple issuers (Clerk for end users, a Cloudflare-issued service token for internal callers). Every other Worker calls it via a service binding.
User service. Wraps Clerk to return user details (name, email, phone). I could have built custom auth (and I still might), but Clerk was faster.
Notification service. An application that reuses the same Worker + DO + CQRS + event-sourcing pattern. One DO per notification. It's split into three Workers: a Notification Worker that owns the NotificationAggregate DO, a Generation Worker that turns templates into HTML emails and SMS/WhatsApp bodies, and a Dispatch Worker that routes the output to Resend (email) or Twilio (SMS/WhatsApp) and handles the delivery webhooks. It's probably the most overengineered way to send an email imaginable, but it was a useful test of the architecture, and it works.
Contact service. A state-based Worker that stores contact-us messages in D1 and pings me on Slack.
The pattern across all of these is the same: small Workers, private service bindings between them, DOs for aggregates that need per-entity state, D1 for cross-aggregate queries.
To avoid repeating boilerplate, I pulled the event-sourcing primitives into a crate, event-sourcing-rs. It owns the Aggregate trait, the EventStore trait, the SyncProjector/AsyncProjector traits, the policy/effect traits, and the checkpoint/cache utilities.
The other packages (labour-shared, notifications-shared, workers-shared) hold the types that cross Worker boundaries: commands, DTOs, service client traits, CORS helpers, and auth issuer constants (the project's dumping ground).
Writing this backend was more fun than I expected, but also more difficult than I anticipated. The Durable Objects documentation was still a little sparse when I started writing this back in November, so much of my understanding came from reading and re-reading blog posts, digging through the documentation, and asking questions in the Cloudflare Discord.
Which leads to my one real complaint about Durable Objects. They're designed so that for the common case, they "just work", and you don't have to think about concurrency details and transactions. The flip side is that the moment you stray from the examples in the docs, you're on your own. Input and output gates are the biggest source of this: they're what makes a lot of the "just works" behaviour work, but figuring out what they actually do (and when you are breaking them) means stitching together half a dozen different blog posts and Discord threads.
On the implementation side, Rust turned out to be a good fit for this kind of system, but I wouldn't push it over TypeScript for Workers in general. An event-sourced domain is essentially a closed set of events and commands, and Rust's enums plus exhaustive match are designed for exactly that shape of problem. The strength of this shows when you add a new command/event variant: the compiler walks you through every place in the codebase that hasn't been updated to handle it. Projectors, policies, the aggregate's apply function, serialisation code, anywhere the event is matched on. Refactoring is actually pleasant: just follow the compiler errors until the project builds again.
The cost of all this, though, is that workers-rs is still rough in places, WASM binary sizes get chunky fast if you're not careful (the labour backend sits at 2.2MB), and new Cloudflare features land in the TypeScript SDK well before they land in Rust. Unless you specifically want Rust, TypeScript is probably the less painful path.
Regarding the architecture, I use event sourcing at work, but the inner workings are mostly obscured by layers of abstraction and framework magic. Fitting it onto a platform that isn't usually used for this, with new constraints and no other examples to draw inspiration from, forced me to think through each step carefully. I did understand event sourcing before, but having to make those choices myself gave me a much deeper feel for how it works and where the trade-offs lie.
Separately from this project, if I were starting something new alone or with a small team, Cloudflare would be the platform I'd reach for first.
You get global distribution without thinking about regions or deployments. A user in Tokyo hits a Worker running near Tokyo; a user in London hits one running near London. When a Durable Object is first created, Cloudflare places it close to where the first request came from, and subsequent requests for that aggregate route there.
That model fits a certain kind of app well. A lot of products have small, self-contained groups of users sharing state: a family, a team, a labour in my case. Putting state next to the people using it keeps interactions fast by default.
The development model eliminates much of the infrastructure work. Deploying a GCP Cloud Run container on a subdomain was a painful day of work the first time I did it; in Cloudflare, it's just one line of configuration. This means you can spend more time on product and less time on platform wiring, which matters early on when you're trying to determine whether what you are making is useful.
The honest trade-off is lock-in, and I was wary of it going in. My GCP setup already had some of it. I spent a lot of time writing a Pub/Sub library to handle events, and that code was entirely vendor-specific with no life outside GCP. But the dockerised APIs themselves could theoretically have run anywhere: another cloud, a VPS, etc.
This architecture has more lock-in, obviously. It's specifically designed around the constraints and features of Durable Objects: the single-threaded execution model, the embedded SQLite, the alarm-driven background work, and the hibernating WebSockets. None of these features port cleanly to any other platform.
What saves it, a bit, is the layered architecture. The domain and application layers are pure Rust with no Cloudflare dependencies. If I ever had to migrate, those layers could probably come along unchanged; the infrastructure adapters would need to be rewritten against whatever primitives the new platform offered.
Honestly, though, if Cloudflare ever made this untenable, I'd rather shut the product down than migrate it. Running the same architecture elsewhere would cost more money and time than I would like to spend on the product now.
That's the post. Thanks for reading.