I need to send the same query to several identical endpoints (about five) in different Kubernetes clusters. The goal is to aggregate the results without waiting on slow clusters, report any failures to the user, and carry on with whatever data did arrive.
Here is an illustration of the expected output:
$ node find_broken_pods.js
tomato: error: timed out
Cluster  Pod         Reason
potato   bar-foos    Invalid PVC
potato   foo-eggnog  Invalid image
yoghurt  spam-baz    Invalid name

$ node find_broken_pods.js
Cluster  Pod         Reason
potato   bar-foos    Invalid PVC
potato   foo-eggnog  Invalid image
yoghurt  spam-baz    Invalid name
tomato   what-bat    Insufficient jQuery
In the first run, the `tomato` cluster timed out, but the other clusters still returned their results. In the second run, every cluster responded in time.
Initially, I developed the following solution:
export async function queryAll(): Promise<{cluster: string; result: V1Pod[]}[]> {
  const out: {cluster: string; result: V1Pod[]}[] = [];
  const promises: Promise<number>[] = [];
  for (const cluster of Object.values(CLUSTERS)) {
    promises.push(
      Promise.race([
        // Rejects after 5 s if the cluster has not answered.
        new Promise<number>((_, reject) => setTimeout(() => reject(new Error(`${cluster}: timed out`)), 5000)),
        // Resolves once the cluster's pods have been added to `out`.
        new Promise<number>((resolve, _) =>
          getAllPods(cluster)
            .then(pods => resolve(out.push({cluster: cluster, result: pods})))
        ),
      ])
    );
  }
  await Promise.all(promises);
  return out;
}
This version runs the queries concurrently, but a single rejection (such as one timeout) makes `Promise.all` reject, so the whole function fails. To address this, I modified it as follows:
export async function queryAll(): Promise<{cluster: string; result?: V1Pod[]; error?: string}[]> {
  const out: {cluster: string; result?: V1Pod[]; error?: string}[] = [];
  const promises: Promise<number>[] = [];
  for (const cluster of Object.values(CLUSTERS)) {
    promises.push(
      Promise.race([
        // Resolves (instead of rejecting) after 5 s and records the timeout.
        new Promise<number>((resolve, _) =>
          setTimeout(() => {
            resolve(out.push({cluster: cluster, error: 'timed out'}));
          }, 5000)
        ),
        // Resolves once the cluster's pods have been added to `out`.
        new Promise<number>((resolve, _) =>
          getAllPods(cluster)
            .then(pods => resolve(out.push({cluster: cluster, result: pods})))
        ),
      ])
    );
  }
  await Promise.all(promises);
  return out;
}
What I observe now is that both branches of every `Promise.race` run to completion, so the losing branch still pushes into `out`. The result is either the data from all clusters, or the data plus timeout entries:
- If nothing times out, `Promise.all` does not wait for the `setTimeout`s, but the Node process stays alive until they have all fired.
- If anything times out, `Promise.all` waits for that timer, and by then the `setTimeout`s of the other clusters end up firing too.
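To make the behaviour in the two bullets concrete, here is a minimal, self-contained sketch. The names `sleep` and `demoRace` are hypothetical and not part of the script above; the point is only that `Promise.race` settles with the first result and leaves the other promise running.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>(resolve => setTimeout(() => resolve(), ms));
}

// Races a fast and a slow task; each records in `events` when it finishes.
async function demoRace(events: string[]): Promise<string> {
  const fast = sleep(10).then(() => {
    events.push('fast finished');
    return 'fast';
  });
  const slow = sleep(50).then(() => {
    events.push('slow finished');
    return 'slow';
  });
  // race() adopts the first settled promise; it does not cancel `slow`.
  return Promise.race([fast, slow]);
}

const events: string[] = [];
demoRace(events).then(async winner => {
  console.log(winner, events); // only the fast task has reported so far
  await sleep(100);
  console.log(events); // the losing task has run its callback as well
});
```

The losing callback still executes and still mutates shared state, which matches the extra `timed out` entries and the delayed process exit described above.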
I expected the losing branch of the `Promise.race` to be cancelled somehow, or at least prevented from running its callback. Clearly my approach is flawed. How can I make this fault tolerant?
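One possible direction, given here as a sketch under stated assumptions rather than a definitive fix: clear the timer once the race settles, and turn each cluster's outcome into a value instead of pushing into a shared array. `Pod`, `ClusterResult` and `withTimeout` are hypothetical names, and `clusters` and `getAllPods` are passed in as parameters only so the example is self-contained.

```typescript
// Stand-in types; the real script would use V1Pod from its Kubernetes client.
type Pod = { name: string };
type ClusterResult = { cluster: string; result?: Pod[]; error?: string };

// Rejects if `work` does not settle within `ms`. The timer is cleared either
// way, so it cannot fire late or keep the Node process alive.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('timed out')), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Queries every cluster concurrently; a failure becomes an `error` entry
// for that cluster instead of rejecting the whole batch.
async function queryAll(
  clusters: string[],
  getAllPods: (cluster: string) => Promise<Pod[]>,
  ms = 5000,
): Promise<ClusterResult[]> {
  return Promise.all(
    clusters.map(cluster =>
      withTimeout(getAllPods(cluster), ms).then(
        (pods): ClusterResult => ({ cluster, result: pods }),
        (err): ClusterResult => ({
          cluster,
          error: err instanceof Error ? err.message : String(err),
        }),
      ),
    ),
  );
}
```

Because every per-cluster promise resolves to a `ClusterResult`, `Promise.all` never rejects here, and the returned array keeps the order of `clusters`. The underlying request for a timed-out cluster is not aborted by this sketch; only its result is ignored.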