First Impressions of Cloudflare Workflows
I recently explored Cloudflare's Workflows product. Cloudflare Workflows is an implementation of a durable execution engine and its API takes the same approach popularized by Cadence/Temporal: your workflow's step results are cached, allowing the engine to resume a workflow by replaying its history. Other tools in this space include Restate, Trigger.dev, and Inngest. The latter two aim to be batteries included and have some extra bells and whistles, with the pricing to match.
My goal with this post is to discuss the platform's limitations with a couple comparisons to other workflow offerings. If you're looking for a tutorial, more details about how the product works or durable workflows in general, I recommend reading their documentation and comparing to Temporal's implementation and features to get an idea of the space. A chatbot can be helpful for learning these concepts.
The quick conclusion: Cloudflare Workflows can be a good choice if your use case fits within its limitations, especially if you're already on Cloudflare. Compared to similar offerings, Workflows is missing a couple of features. Most notably, there is no built-in graceful cancellation. Workflow creation throughput is limited to 100 instances per 10 seconds if you create them individually, or 10,000 per 10 seconds if you can create them in batches of 100.
Since I'm focusing on the limitations, you might walk away with a negative impression. Some positives up front: it's cheap, it's easy to get started, and it's good enough for many uses.
§Usage
To write a workflow, you extend their base class and implement the `run` method. The `step` API provides four methods: `do`, `waitForEvent`, `sleep`, and `sleepUntil`. Here's a basic order fulfillment workflow:
```typescript
type MyWorkflowEvent = {
  orderId: string;
};

type DeliveryNotification = {
  deliveredAt: Date;
};

export class MyWorkflow extends WorkflowEntrypoint<Env, MyWorkflowEvent> {
  async run(event: WorkflowEvent<MyWorkflowEvent>, step: WorkflowStep) {
    const {
      // a unique id for this workflow
      instanceId,
      // a Date object for this instance's creation time
      timestamp,
      // the MyWorkflowEvent event that triggered this workflow
      payload,
    } = event;
    const { orderId } = payload;

    // step.do runs your callback. If there's an exception, it is
    // automatically retried. If it succeeds, the result is saved.
    const shipment = await step.do("create-shipment", async () => {
      return await shippingService.createOrderShipment({
        orderId,
        // for delivery notifications
        workflowId: instanceId,
      });
    });

    const delivery: WorkflowStepEvent<DeliveryNotification> =
      await step.waitForEvent("delivery-status", {
        // this is a selector for the event type we want to listen to
        type: "@shipping/delivery-status",
        // optional timeout - waitForEvent will throw if it times out
        timeout: "30 days",
      });

    await step.do("delivery-notification", async () => {
      await notificationService.orderDelivered({
        orderId,
        deliveredAt: delivery.payload.deliveredAt,
      });
    });

    // You can also sleep:
    await step.sleep("short-break", "30 seconds");

    // Returning a value from a workflow sets the instance's output value
    return { orderId, deliveredAt: delivery.payload.deliveredAt };
  }
}
```
In the above example, Cloudflare is likely to evict the worker isolate while we wait for the delivery status event. Then, when the delivery service sends this event to our workflow, Cloudflare will start a fresh worker and run our code from the top. The execution journal will show that the `create-shipment` step already succeeded, so its result will be returned immediately and the callback for that step will not run again.
I didn't show this, but you can pass an options object to `step.do` to configure retries and a timeout.
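As a sketch of what that options object can look like, here's a `step.do` call with explicit retry and timeout settings. The field names (`retries.limit`, `delay`, `backoff`, `timeout`) follow my reading of Cloudflare's docs, so verify them against the current API reference; the `StepLike` interface is a stubbed-down stand-in for the runtime's `WorkflowStep` type so the example stands alone.

```typescript
// Simplified stand-in for the relevant slice of Cloudflare's WorkflowStep
// type; in a real worker this comes from the runtime.
type StepConfig = {
  retries?: {
    limit: number;
    delay: string | number;
    backoff?: "constant" | "linear" | "exponential";
  };
  timeout?: string | number;
};

interface StepLike {
  do<T>(name: string, config: StepConfig, fn: () => Promise<T>): Promise<T>;
}

// A step with explicit retry/timeout configuration. The callback body is a
// placeholder; the config shape is the point here.
export async function createShipmentStep(step: StepLike, orderId: string) {
  return step.do(
    "create-shipment",
    {
      retries: { limit: 5, delay: "10 seconds", backoff: "exponential" },
      timeout: "10 minutes",
    },
    async () => ({ shipmentId: `ship-${orderId}` }), // placeholder work
  );
}
```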
§Limitations
§Cancellation and Termination
Cloudflare Workflows supports termination, but there is no way to perform any cleanup. This is a forced termination akin to `kill -9`. An exception is thrown in your code, and you can catch it, but as soon as you attempt any I/O beyond `console.log`, the worker is killed. Temporal, by contrast, provides both cancel and terminate APIs.
For some use cases, this is fine. For others, you need a way to unwind workflow progress. For example, we might want to cancel the shipment in our example order workflow. While you can handle this elsewhere, colocation is often preferable. Temporal gives you the tools to do this out of the box with the `cancel` API and `Saga` class.
It's possible, but tricky, to implement cancellation yourself with careful use of `waitForEvent`.
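One possible shape for that, sketched with stand-in types so it's self-contained: race the real event against a hypothetical `@app/cancel` event and run compensation steps yourself when cancel wins. This assumes the engine tolerates two outstanding `waitForEvent` calls (check the current docs), and it glosses over timeout handling: a `waitForEvent` that times out throws, which would reject the race.

```typescript
// Stand-ins for the slice of Cloudflare's step API used below.
type StepEvent<T> = { payload: T };
interface EventStep {
  waitForEvent<T>(
    name: string,
    opts: { type: string; timeout?: string },
  ): Promise<StepEvent<T>>;
  do<T>(name: string, fn: () => Promise<T>): Promise<T>;
}

// Wait for delivery, but let a hypothetical "@app/cancel" event win the
// race. Each arm tags its result so we don't rely on the resolved event
// carrying its own type field.
export async function deliveryOrCancel(step: EventStep, shipmentId: string) {
  const winner = await Promise.race([
    step
      .waitForEvent<{ deliveredAt: string }>("delivery-status", {
        type: "@shipping/delivery-status",
        timeout: "30 days",
      })
      .then((e) => ({ kind: "delivered" as const, event: e })),
    step
      .waitForEvent<{ reason: string }>("cancel", {
        type: "@app/cancel", // hypothetical cancel signal
        timeout: "30 days",
      })
      .then((e) => ({ kind: "cancelled" as const, event: e })),
  ]);

  if (winner.kind === "cancelled") {
    // Compensation has to be explicit: undo the shipment we created earlier.
    await step.do("cancel-shipment", async () => {
      // e.g. await shippingService.cancelShipment({ shipmentId });
      return { cancelledShipment: shipmentId };
    });
  }
  return winner.kind;
}
```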
§Workflow Introspection
The `get` method on the workflow binding in your worker provides a workflow handle. `instance.status()` will get you the workflow status (queued, running, complete, etc.), the output value if it succeeded, or the error message if it failed. It does not give you details about which steps succeeded or failed.
Detailed information, including details about step execution, is available in the REST API. This is one way you could have a cleanup workflow determine what needs to be cleaned up. But, Cloudflare's REST APIs are heavily rate limited: you get 1200 requests per 5 minutes. Depending on your scale, this might be plenty, or it might round to zero (especially considering it's shared with other APIs).
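For illustration, fetching instance details might look like the sketch below. The URL path and Bearer-token auth follow the general shape of Cloudflare's v4 API, but treat the exact path and response fields as assumptions to verify against the current API reference; and given the shared rate limit, cache or batch these lookups.

```typescript
// Build the (assumed) REST URL for a workflow instance's detailed status.
export function instanceUrl(
  accountId: string,
  workflowName: string,
  instanceId: string,
): string {
  return (
    "https://api.cloudflare.com/client/v4" +
    `/accounts/${accountId}/workflows/${workflowName}/instances/${instanceId}`
  );
}

// Fetch instance details, including per-step execution info, using an API
// token. Needs Node 18+ or a Workers runtime for global fetch.
export async function fetchInstanceDetails(
  accountId: string,
  workflowName: string,
  instanceId: string,
  apiToken: string,
): Promise<unknown> {
  const res = await fetch(instanceUrl(accountId, workflowName, instanceId), {
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  if (!res.ok) throw new Error(`instance lookup failed: ${res.status}`);
  return res.json();
}
```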
§Platform Limits
There are various platform limits applied to workflows, some of which are inherited from the workers limits (since workflows are implemented atop workers).
You can only create 100 workflows per 10 seconds, which is quite restrictive. You can get higher throughput using the `createBatch` API:
> Create (trigger) a batch of new instances of the given Workflow, up to 100 instances at a time.
>
> This is useful when you are scheduling multiple instances at once. A call to createBatch is treated the same as a call to create (for a single instance) and allows you to work within the instance creation limit.
You can achieve batching by sending your events to a queue rather than directly creating workflows. Your consumer can dequeue batches, and create the set of workflows in a single call. It's important to use a consistent workflow ID in this scenario in order to avoid creating duplicate workflows for retried or duplicate events.
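A sketch of that consumer, with minimal stand-ins for the queue and workflow binding types (the real ones come from `@cloudflare/workers-types`); `orderId` doubles as the instance id so redelivered messages don't create duplicate workflows:

```typescript
type OrderMessage = { orderId: string };

// Minimal stand-ins for the Cloudflare types used here.
interface Message<T> { body: T; ack(): void }
interface MessageBatch<T> { messages: Message<T>[] }
interface WorkflowBinding {
  createBatch(instances: { id: string; params: OrderMessage }[]): Promise<unknown>;
}

// Map a dequeued batch to createBatch's input, reusing orderId as the
// workflow instance id for idempotency.
export function toInstances(batch: MessageBatch<OrderMessage>) {
  return batch.messages.map((m) => ({ id: m.body.orderId, params: m.body }));
}

// Queue consumer: one createBatch call per dequeued batch (keep batches at
// or under 100 messages to stay within the documented limit).
export async function consumeOrders(
  batch: MessageBatch<OrderMessage>,
  workflow: WorkflowBinding,
) {
  await workflow.createBatch(toInstances(batch));
  for (const m of batch.messages) m.ack();
}
```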
You're limited to 25 concurrent instances on the free tier or 4500 on the paid tier. Given the paid tier is $25 / month, this is very good bang for the buck compared to other cloud offerings like Inngest and Trigger.dev.
Both limits are account-wide, not per workflow. If you have several different workflows and moderate traffic, you can easily hit the create rate limit.
Other limits, like 1024 steps per workflow and max 1GB instance state (100 MB on the free tier), are not much different from other offerings.
§Developing with Wrangler
When developing locally with `wrangler`, none of these limits are enforced. So it's possible to shoot yourself in the foot: something that works locally can fail in production. If you've used Cloudflare Workers before, you've probably experienced this; it's one of the more annoying things about developing for Cloudflare.
§Workflow Versioning
Making changes to durable workflows is tricky. Cloudflare has no documentation around this. You're on your own, so be careful. I haven't experimented with upgrading a workflow and I'm guessing some general principles will hold:
- Appending a step to the end of a workflow should always be fine
- Adding a step in the middle of a workflow is probably going to cause a problem
- If you change the result type of a step, make sure wherever it's consumed can also handle the previous type
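One defensive pattern (my own suggestion, not from Cloudflare's docs) is to stamp a version number into the trigger payload and gate newly added steps on it, so instances created before a deploy replay with their original step sequence. The `StepLike` interface below is a simplified stand-in for `WorkflowStep`:

```typescript
// Stand-in for the slice of WorkflowStep used here.
interface StepLike {
  do<T>(name: string, fn: () => Promise<T>): Promise<T>;
}

type OrderParams = { orderId: string; version?: number };

// Hypothetical workflow body: the fraud-check step was added in "v2" of the
// code, but only runs for instances whose payload says version >= 2, so
// in-flight v1 instances replay exactly the steps they started with.
export async function runOrderFlow(step: StepLike, params: OrderParams) {
  const version = params.version ?? 1;

  const shipment = await step.do("create-shipment", async () => ({
    shipmentId: `ship-${params.orderId}`,
  }));

  if (version >= 2) {
    await step.do("fraud-check", async () => ({ ok: true })); // added later
  }

  return shipment;
}
```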
§The Verdict
Cloudflare Workflows is a basic implementation of durable execution. For the indie dev already on Cloudflare, it can be a fine choice, though the DX of Trigger.dev and Inngest is generally better. Each of those includes a React SDK for subscribing to workflow outputs and status changes, along with some features to help develop AI workflows.
Bigger users need to carefully evaluate the limits, particularly workflow creation rate limits, and decide if they're acceptable. I'm guessing Cloudflare is more than willing to increase these limits for certain customers, but it might cost you.