I'm trying to find the x, y, width, and height of images on the canvas, similar to what the div/span text layer provides for text. I would also like each image as a separate file. I plan to use this for academic articles in two ways: scrolling to a figure after a link is clicked, and showing the images in a separate gallery. In the svg:img elements I see a URL that shows only the image at its original size; is there a way to get that URL without searching through the SVG object? The images in the SVG are placed on the page using a series of transforms. Is there a way to get the x, y, width, and height without measuring the size of the rendered svg:img?
// get svg which contains img elements
const opList = await page.getOperatorList()
let svgGfx = new pdfjsLib.SVGGraphics(page.commonObjs, page.objs);
svgGfx.embedFonts = true;
const svg = await svgGfx.getSVG(opList, viewport) // image url in svg:img elements
I also found this code:
let objs = [];
for (var i = 0; i < opList.fnArray.length; i++) {
// paintInlineImageXObject paintImageXObject
if (opList.fnArray[i] == pdfjsLib.OPS.paintImageXObject) {
objs.push(opList.argsArray[i][0]);
}
}
const img = await page.objs.get(objs[0]); //width, height, data, kind=2, what is 'kind'?
This returns the image data as a Uint8ClampedArray, but its length is divisible by 3 (so RGB) rather than 4 (RGBA, which is what canvas wants). Adding the alpha channel in a loop seems slow; is there a faster way? This 'search the ops list' approach might also expose the image x, y, width, and height, but I'm still not sure how to get at them.
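On the RGB to RGBA question, one option is to write each output pixel as a single 32-bit value instead of branching per byte. The following is only a sketch: `rgbToRgba` is a hypothetical helper name, it assumes tightly packed 24-bit RGB input (which, as far as I can tell, is what `kind` 2, `ImageKind.RGB_24BPP` in PDF.js, denotes), and whether it is actually faster should be measured on your own data.

```javascript
// Expand tightly packed RGB bytes into RGBA bytes with full opacity.
// Hypothetical helper; assumes rgb.length === width * height * 3.
function rgbToRgba(rgb, width, height) {
  const pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new RangeError(`expected ${pixelCount * 3} bytes, got ${rgb.length}`);
  }
  const rgba = new Uint8ClampedArray(pixelCount * 4);
  // View the same buffer as 32-bit words so each pixel is one store.
  const rgba32 = new Uint32Array(rgba.buffer);
  // Byte order inside a 32-bit word depends on the platform.
  const littleEndian = new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;
  for (let i = 0, j = 0; i < pixelCount; i++, j += 3) {
    const r = rgb[j];
    const g = rgb[j + 1];
    const b = rgb[j + 2];
    rgba32[i] = littleEndian
      ? (0xff000000 | (b << 16) | (g << 8) | r) >>> 0
      : ((r << 24) | (g << 16) | (b << 8) | 0xff) >>> 0;
  }
  return rgba;
}
```

In a browser the result can be passed to `new ImageData(rgbToRgba(img.data, img.width, img.height), img.width, img.height)` and then to `putImageData`.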
I've found that each image is wrapped in a g tag whose transform can be used to place the image, like so:
this.props.images.map((img, i) => {
const { x, y, width, height } = img;
const mat = img.gTransform
.replace("matrix(", "")
.replace(")", "")
.split(" ");
return (
<image
key={i}
x={mat[4] + "px"}
y={
this.props.svgHeight - (parseInt(mat[5]) + parseInt(mat[3])) -3
}
width={mat[0]}
height={mat[3] + "px"}
href={img["xlink:href"]}
style={{outline: '1px solid blue'}}
// transform={
// img.transform
// }
/>
);
})}
Seems to extract good quality images and place them well in the pdfs I've tried. Is this the best approach?
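The string handling in the snippet above can be made a little more robust by parsing the matrix into numbers first. This is a sketch under assumptions: `parseSvgMatrix` and `imageRectFromMatrix` are hypothetical names, the transform is assumed to be a plain `matrix(a b c d e f)` without rotation or skew, and the vertical flip mirrors the arithmetic in the snippet (its `- 3` offset is exposed as an optional `fudge` parameter rather than hard-coded).

```javascript
// Parse "matrix(a b c d e f)" (space or comma separated) into six numbers.
function parseSvgMatrix(transform) {
  const match = /matrix\(([^)]*)\)/.exec(transform);
  if (!match) {
    throw new Error(`not a matrix() transform: ${transform}`);
  }
  const values = match[1].trim().split(/[\s,]+/).map(Number);
  if (values.length !== 6 || values.some(Number.isNaN)) {
    throw new Error(`expected six numeric values in: ${transform}`);
  }
  return values;
}

// Derive a rectangle from the matrix, flipping y against the SVG height.
// Assumes no rotation or skew, so a and d act as width and height.
function imageRectFromMatrix(transform, svgHeight, fudge = 0) {
  const [a, , , d, e, f] = parseSvgMatrix(transform);
  return { x: e, y: svgHeight - (f + d) - fudge, width: a, height: d };
}
```

Using `Number` instead of `parseInt` keeps fractional coordinates, which `parseInt` would silently truncate.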
I am also interested in extracting the coordinates of images.
I am using page.getOperatorList and filtering for PDFJS.OPS.paintImageXObject, like you show there. In ops.argsArray there is the file name, width and height, and we still only get width and height with page.objs.get.
So, I worry the only way to find the coordinates is to keep track of the geometry ops (like transform, maybe others?) and calculate the coordinates. This would mean re-implementing something the display layer of PDFJS already does. I wonder if there is some API of the display layer which could be used to resolve this.
It turns out that for the PDFs I'm trying, one thing seems to hold: every paintImageXObject op follows a transform op. So far it seems like it will be enough for me to keep track of the last transform matrix, replacing it each time a new transform op is encountered, and then take its last two elements as the coordinates each time a paintImageXObject op is encountered.
Could you please explain a bit?
How did you get the transform property from the page.objs.get(objs[0]) object?
I am also trying to get the x,y coordinates of the image.
Basically each paintImageXObject op was preceded by a transform op, and that transform op had corresponding args which were six numbers, the last two of which were the X and Y coordinates. I think this will be true for many PDFs.
https://github.com/TomasHubelbauer/puppeteer-globus-scraper/blob/master/index.mjs#L63
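The "remember the last transform" idea described above could look roughly like this. It is a sketch, not the code from the linked repository: `locateImagesByLastTransform` is a hypothetical name, the op codes are passed in (with PDF.js you would pass `pdfjsLib.OPS`), and reading width and height from the first and fourth matrix entries assumes the image is not rotated or skewed. The coordinates come straight from the operator arguments, so they are presumably in PDF space rather than viewport pixels.

```javascript
// Pair each paintImageXObject op with the most recent transform op.
// opList is expected to have parallel fnArray / argsArray arrays.
function locateImagesByLastTransform(opList, OPS) {
  const images = [];
  let lastTransform = null;
  for (let i = 0; i < opList.fnArray.length; i++) {
    const fn = opList.fnArray[i];
    const args = opList.argsArray[i];
    if (fn === OPS.transform) {
      // Six numbers: [a, b, c, d, e, f]; e and f are the translation.
      lastTransform = Array.from(args);
    } else if (fn === OPS.paintImageXObject && lastTransform) {
      const [a, , , d, e, f] = lastTransform;
      images.push({ objId: args[0], x: e, y: f, width: a, height: d });
    }
  }
  return images;
}
```

Usage would be along the lines of `locateImagesByLastTransform(await page.getOperatorList(), pdfjsLib.OPS)`.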
I also think the only way to get this information is by keeping track of paintImageXObject operators in the operator list (and transforms).
FYI, the approach I mentioned above didn't end up working well for me at all: many other transformations may influence the position and scale of the final image in the document, and it soon became clear that one would have to pretty much reimplement PDF.js to reliably keep track of them.
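To illustrate why the last transform alone is not enough: transforms accumulate, and save/restore ops push and pop the accumulated matrix. The sketch below tracks that accumulated matrix and maps the unit square (the space a PDF image is painted into) through it. It is deliberately incomplete, and the names `multiply`, `unitSquareBounds`, and `locateImagesWithCtm` are hypothetical. It assumes the op codes object has `save`, `restore`, `transform`, and `paintImageXObject` entries, and it ignores nested form XObjects, inline images, and the viewport transform, all of which a real renderer has to handle.

```javascript
// Compose two [a, b, c, d, e, f] matrices: apply m2 first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Axis-aligned bounds of the unit square after transforming by m.
function unitSquareBounds(m) {
  const xs = [];
  const ys = [];
  for (const [u, v] of [[0, 0], [1, 0], [0, 1], [1, 1]]) {
    xs.push(m[0] * u + m[2] * v + m[4]);
    ys.push(m[1] * u + m[3] * v + m[5]);
  }
  const x = Math.min(...xs);
  const y = Math.min(...ys);
  return { x, y, width: Math.max(...xs) - x, height: Math.max(...ys) - y };
}

// Walk the operator list while maintaining the accumulated matrix.
function locateImagesWithCtm(opList, OPS) {
  let ctm = [1, 0, 0, 1, 0, 0];
  const stack = [];
  const images = [];
  for (let i = 0; i < opList.fnArray.length; i++) {
    const fn = opList.fnArray[i];
    const args = opList.argsArray[i];
    if (fn === OPS.save) {
      stack.push(ctm);
    } else if (fn === OPS.restore) {
      if (stack.length > 0) ctm = stack.pop();
    } else if (fn === OPS.transform) {
      ctm = multiply(ctm, args);
    } else if (fn === OPS.paintImageXObject) {
      images.push({ objId: args[0], ...unitSquareBounds(ctm) });
    }
  }
  return images;
}
```

With an outer transform of `[2, 0, 0, 2, 10, 10]` followed by `[50, 0, 0, 25, 5, 5]`, this places the image at (20, 20) with size 100 by 50, whereas reading only the last transform would report (5, 5).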
While reading the PDF.js source code, I discovered imageLayer:
https://github.com/TomasHubelbauer/albert/blob/master/index.js
const imageLayer = {
beginLayout: () => { },
endLayout: undefer,
appendImage: ({ left: x, top: y, width, height, imgData, objId }) => {
if (!imgData) {
// TODO: Fallback: commonObjs
const img = page.objs.get(objId);
const canvas = window.document.createElement('canvas');
canvas.width = img.naturalWidth;
canvas.height = img.naturalHeight;
const context = canvas.getContext('2d');
context.drawImage(img, 0, 0);
imgData = context.getImageData(0, 0, img.naturalWidth, img.naturalHeight);
// TODO: Verify this is always the case for images which come like this
y -= height; // `top` is destructured as `y` above, so adjust `y`
}
console.log(number, x, y, width, height, imgData);
if (!imgData) {
alert('No image data!');
throw new Error('No image data!');
}
const canvas = document.createElement('canvas');
canvas.width = imgData.width;
canvas.height = imgData.height;
const context = canvas.getContext('2d');
/** @type {Uint8ClampedArray} */
let array;
switch (imgData.data.length) {
case imgData.width * imgData.height * 3: {
array = new Uint8ClampedArray(imgData.width * imgData.height * 4);
for (let index = 0; index < array.length; index++) {
// Set alpha channel to full
if (index % 4 === 3) {
array[index] = 255;
}
// Copy RGB channel components from the original array
else {
array[index] = imgData.data[~~(index / 4) * 3 + (index % 4)];
}
}
break;
}
case imgData.width * imgData.height * 4: {
array = imgData.data;
break;
}
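default: {
alert('Unknown imgData format!');
// Stop here: `array` is undefined and putImageData below would fail
throw new Error('Unknown imgData format!');
}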
}
context.putImageData(new ImageData(array, imgData.width, imgData.height), 0, 0);
const data = { width: imgData.width, height: imgData.height, url: canvas.toDataURL() };
item.images.push({ x, y, width, height, data });
},
};
page.render({ canvasContext: context, viewport, imageLayer });
I think using imageLayer is a much better suggestion, and if I had been aware of this API before, I would not have even bothered trying to work out the transforms myself. I don't think it's very well documented, though, which is why I discovered it much later.