I'm trying to use the 'pdfjs-dist' package to extract text from a PDF file stored in my AWS S3 bucket. However, when I run the code, I get the following error:
Error: Setting up fake worker failed: "Cannot find module './pdf.worker.js'
I don't understand this error or how to resolve it. My code looks like this:
import { NextResponse } from "next/server";
import * as PDFJS from 'pdfjs-dist';
import { TextItem } from 'pdfjs-dist/types/src/display/api';

export async function POST() {
  let myFiledata = await fetch("url to my S3 bucket")
  if (myFiledata.ok) {
    let pdfDoc = await PDFJS.getDocument(await myFiledata.arrayBuffer()).promise
    const numPages = pdfDoc.numPages
    for (let i = 0; i < numPages; i++) {
      let page = await pdfDoc.getPage(i + 1)
      let textContent = await page.getTextContent()
      const text = textContent.items.map((item) => (item as TextItem).str).join('');
      console.log(text)
    }
    return new NextResponse("Success", { status: 200 });
  } else {
    return new NextResponse("Internal Error", { status: 500 });
  }
}
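To make the text-joining step in the loop concrete, here is a minimal sketch that runs without pdfjs-dist. The `TextItemLike` and `MarkedContentLike` interfaces and the `joinPageText` helper are hypothetical stand-ins, not the library's own types. They mirror only the `str` and `hasEOL` fields as I understand them from the pdfjs-dist typings. As far as I know, marked-content items only appear when `includeMarkedContent` is enabled, so the filter is a precaution rather than a requirement.

```typescript
// Simplified, assumed stand-ins for the item shapes returned by getTextContent().
interface TextItemLike {
  str: string;
  hasEOL?: boolean;
}
interface MarkedContentLike {
  type: string;
  id?: string;
}
type PageItem = TextItemLike | MarkedContentLike;

// Type guard: only text items carry a `str` field.
function isTextItem(item: PageItem): item is TextItemLike {
  return "str" in item;
}

// Hypothetical helper: joins the text items of one page,
// inserting a newline wherever an item is flagged as ending a line.
function joinPageText(items: PageItem[]): string {
  return items
    .filter(isTextItem)
    .map((item) => item.str + (item.hasEOL ? "\n" : ""))
    .join("");
}

// Made-up sample data standing in for textContent.items.
const sample: PageItem[] = [
  { str: "Hello, ", hasEOL: false },
  { type: "beginMarkedContent" },
  { str: "world", hasEOL: true },
  { str: "next line" },
];

console.log(JSON.stringify(joinPageText(sample)));
```

In the route handler, this would take the place of the `(item as TextItem).str` cast, which yields `undefined` for any item without a `str` field.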
I'm not sure what steps to take next. Do I need to place a specific file somewhere in my project, or is the problem purely in the code?
I tried several ways of pointing pdf.js at the worker file, by setting:
PDFJS.GlobalWorkerOptions.workerSrc
and assigning it various paths, such as:
pdfjs-dist/legacy/build/pdf.worker.entry.js;
pdfjs-dist/legacy/build/pdf.worker.entry.entry;
pdfjs-dist/legacy/build/pdf.worker
pdfjs-dist/build/pdf.worker.min.js
I even tried loading the worker from a CDN URL that matches my pdfjs version:
https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfVersion}/pdf.worker.js
However, each attempt fails because the path or URL is not recognized, so I may be setting it incorrectly or I may simply be far from the actual solution.
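For clarity, the attempts above can be sketched as follows. The `cdnWorkerUrl` and `isRemoteUrl` helpers are hypothetical names introduced only for illustration, and the version string is a placeholder. In the real code the version would presumably come from `PDFJS.version`, and the chosen value would be assigned to `PDFJS.GlobalWorkerOptions.workerSrc`. The sketch only separates the candidates into module-style paths and remote URLs, on the assumption that a server-side route handler resolves the two differently.

```typescript
// Hypothetical helper: builds the cdnjs worker URL for a given pdf.js version.
function cdnWorkerUrl(pdfVersion: string): string {
  return `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfVersion}/pdf.worker.js`;
}

// Hypothetical helper: true for values a browser would fetch over the network.
function isRemoteUrl(workerSrc: string): boolean {
  return /^https?:\/\//.test(workerSrc);
}

// Placeholder version; the real code would likely use PDFJS.version.
const pdfVersion = "3.11.174";

// Some of the values tried for PDFJS.GlobalWorkerOptions.workerSrc.
const candidates: string[] = [
  "pdfjs-dist/legacy/build/pdf.worker.entry.js",
  "pdfjs-dist/legacy/build/pdf.worker",
  "pdfjs-dist/build/pdf.worker.min.js",
  cdnWorkerUrl(pdfVersion),
];

for (const workerSrc of candidates) {
  const kind = isRemoteUrl(workerSrc) ? "remote URL" : "module path";
  // In the route handler, the assignment would look like:
  //   PDFJS.GlobalWorkerOptions.workerSrc = workerSrc;
  console.log(`${kind}: ${workerSrc}`);
}
```

Each candidate was tried on its own, one assignment per run, before calling `PDFJS.getDocument`.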