Pdf.js: location / coordinates of images on canvas

Created on 25 Jan 2019 · 8 Comments · Source: mozilla/pdf.js

I'm trying to find the x, y, width, and height of images on the canvas, similar to the div/span text layer. I would also like each image as a separate file. I want to do two things with this for academic articles: scroll to a figure after a link is clicked, and show the images in a separate gallery. In the svg:img there is a URL that shows only the image at its original size. Is there a way to get that URL without searching through the SVG object? The images in the SVG are placed on the page using a series of transforms. Is there a way to get the x, y, width, and height without measuring the size of the rendered svg:img?

// Get an SVG that contains the image elements
const opList = await page.getOperatorList();
const svgGfx = new pdfjsLib.SVGGraphics(page.commonObjs, page.objs);
svgGfx.embedFonts = true;
const svg = await svgGfx.getSVG(opList, viewport); // image URL is in the svg:img elements

All 8 comments

I also found this code:

const objs = [];
for (let i = 0; i < opList.fnArray.length; i++) {
  // Only paintImageXObject is matched here; see also paintInlineImageXObject
  if (opList.fnArray[i] === pdfjsLib.OPS.paintImageXObject) {
    objs.push(opList.argsArray[i][0]); // first arg is the image's object id
  }
}
const img = await page.objs.get(objs[0]); // width, height, data, kind=2, what is 'kind'?

This gets the Uint8ClampedArray of image data, but its length is divisible by 3 (so RGB) rather than 4 (RGBA, which is what canvas wants). Adding the alpha channel in a loop seems slow; is there a faster way? This 'search the ops list' approach might also expose the image's x, y, width, and height, but I'm still not sure how to get at them.
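
For the RGB to RGBA question, here is a minimal sketch of one way to do the expansion. It assumes the data is tightly packed RGB, three bytes per pixel with no row padding; rgbToRgba is a hypothetical helper name, not something PDF.js provides. It walks the two buffers with separate cursors, so there is no division or modulo per byte, though I have not benchmarked it.

```javascript
// Expand tightly packed RGB bytes into RGBA bytes with an opaque alpha channel.
// `rgbToRgba` is a hypothetical helper name, not part of PDF.js.
function rgbToRgba(rgb, width, height) {
  const pixelCount = width * height;
  if (rgb.length !== pixelCount * 3) {
    throw new RangeError("expected " + pixelCount * 3 + " RGB bytes, got " + rgb.length);
  }
  const rgba = new Uint8ClampedArray(pixelCount * 4);
  // One iteration per pixel: copy three colour bytes, then write the alpha byte.
  for (let src = 0, dst = 0; src < rgb.length; src += 3, dst += 4) {
    rgba[dst] = rgb[src];
    rgba[dst + 1] = rgb[src + 1];
    rgba[dst + 2] = rgb[src + 2];
    rgba[dst + 3] = 255; // fully opaque
  }
  return rgba;
}

// Two pixels: pure red, then pure blue.
const rgba = rgbToRgba(new Uint8ClampedArray([255, 0, 0, 0, 0, 255]), 2, 1);
console.log(Array.from(rgba)); // [255, 0, 0, 255, 0, 0, 255, 255]
```

In a browser the result can be passed to new ImageData(rgba, width, height) and drawn with putImageData. As far as I know, kind=2 corresponds to what PDF.js calls ImageKind.RGB_24BPP, which would explain the three bytes per pixel.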

I've found that images are wrapped in a g tag whose transform can be used to place the images, like so:

{this.props.images.map((img, i) => {
  // gTransform looks like "matrix(a b c d e f)"; split it into its six values
  const mat = img.gTransform
    .replace("matrix(", "")
    .replace(")", "")
    .split(" ");

  return (
    <image
      key={i}
      x={mat[4] + "px"}
      y={
        this.props.svgHeight - (parseFloat(mat[5]) + parseFloat(mat[3])) - 3
      }
      width={mat[0]}
      height={mat[3] + "px"}
      href={img["xlink:href"]}
      style={{ outline: "1px solid blue" }}
      // transform={
      //   img.transform
      // }
    />
  );
})}

This seems to extract good-quality images and place them well in the PDFs I've tried. Is this the best approach?
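
The string handling above can be pulled into a small helper. This is only a sketch: parseMatrix and rectFromMatrix are hypothetical names, the placement rule is the one from the snippet (x from the fifth value, width and height from the first and fourth, y flipped against the SVG height), and the extra -3 offset is left out because I don't know where it comes from. It assumes an unrotated image.

```javascript
// Parse an SVG transform string such as "matrix(1 0 0 1 10 20)" into six numbers.
// `parseMatrix` and `rectFromMatrix` are hypothetical helper names.
function parseMatrix(transform) {
  const match = /matrix\(([^)]*)\)/.exec(transform);
  if (!match) return null;
  // Values may be separated by spaces, commas, or both.
  const values = match[1].trim().split(/[\s,]+/).map(Number);
  return values.length === 6 && values.every(Number.isFinite) ? values : null;
}

// Placement rule from the snippet above: x = e, width = a, height = d,
// and y measured from the top of an SVG that is `svgHeight` tall.
function rectFromMatrix([a, , , d, e, f], svgHeight) {
  return { x: e, y: svgHeight - (f + d), width: a, height: d };
}

const mat = parseMatrix("matrix(200 0 0 100 50 300)");
console.log(rectFromMatrix(mat, 800)); // { x: 50, y: 400, width: 200, height: 100 }
```

Parsing to numbers up front avoids the string concatenation and integer truncation that can creep in when the split values are used directly.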

I am also interested in extracting the coordinates of images.

I am using page.getOperatorList and filtering for PDFJS.OPS.paintImageXObject, as you show there. ops.argsArray holds the file name, width, and height, and page.objs.get still only gives us the width and height.

So I worry that the only way to find the coordinates is to keep track of the geometry ops (transform, and maybe others?) and calculate the coordinates ourselves. That would mean reimplementing something the display layer of PDFJS already does. I wonder whether the display layer has some API that could be used for this.

It turns out that, for the PDFs I'm trying, one thing seems to hold: every paintImageXObject op follows a transform op. So far it seems it will be enough to keep track of the last transform matrix, replacing it each time a new transform op is encountered, and then take its last two elements as the coordinates each time a paintImageXObject op is encountered.

Could you please explain a bit?
How did you get the transform property from the page.objs.get(objs[0]) object?
I am also trying to get the x,y coordinates of the image.

Basically, each paintImageXObject op was preceded by a transform op, and that transform op had corresponding args: six numbers, the last two of which were the X and Y coordinates. I think this will be true for many PDFs.

https://github.com/TomasHubelbauer/puppeteer-globus-scraper/blob/master/index.mjs#L63
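
To make that concrete, here is a rough sketch of the "remember the last transform" idea. It is not the code from the linked repository. collectImagePlacements is a hypothetical helper name, the ops argument is assumed to look like pdfjsLib.OPS, and the [objId, width, height] layout of the image args is taken from what was reported earlier in this thread. It ignores stacked or nested transforms entirely.

```javascript
// Pair every paintImageXObject op with the most recent transform op.
// `ops` is expected to look like pdfjsLib.OPS; `collectImagePlacements` is a
// hypothetical helper name. Stacked or nested transforms are ignored here.
function collectImagePlacements(opList, ops) {
  const placements = [];
  let lastTransform = null;
  for (let i = 0; i < opList.fnArray.length; i++) {
    const fn = opList.fnArray[i];
    const args = opList.argsArray[i];
    if (fn === ops.transform) {
      lastTransform = args; // six numbers: [a, b, c, d, e, f]
    } else if (fn === ops.paintImageXObject && lastTransform) {
      const [objId, width, height] = args; // assumed layout, per this thread
      placements.push({
        objId,
        pixelWidth: width,
        pixelHeight: height,
        x: lastTransform[4],
        y: lastTransform[5],
        matrix: Array.from(lastTransform),
      });
    }
  }
  return placements;
}

// Assumed usage with a loaded page:
//   const opList = await page.getOperatorList();
//   const placements = collectImagePlacements(opList, pdfjsLib.OPS);
```

The x and y values are in PDF user space, where the origin is at the bottom-left of the page, so they still need converting before they can be used as canvas coordinates.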

I also think the only way to get this information is by keeping track of paintImageXObject operators in the operator list (and transforms).

FYI, the approach I mentioned above didn't end up working well for me at all. Many other transformations can influence the position and scale of the final image in the document, and it soon became clear that one would have to pretty much reimplement PDF.js to reliably keep track of them.
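
For anyone who still wants to go down that road, this is a sketch of what "keeping track" looks like once transforms are composed and save/restore are honoured. All three function names are hypothetical, and ops is again assumed to look like pdfjsLib.OPS. It only handles save, restore, transform, and paintImageXObject; other operators that also change the matrix are not covered, which is exactly why this gets out of hand.

```javascript
// Compose two matrices [a, b, c, d, e, f]: the result applies `m` first, then `ctm`.
function multiply(ctm, m) {
  return [
    ctm[0] * m[0] + ctm[2] * m[1],
    ctm[1] * m[0] + ctm[3] * m[1],
    ctm[0] * m[2] + ctm[2] * m[3],
    ctm[1] * m[2] + ctm[3] * m[3],
    ctm[0] * m[4] + ctm[2] * m[5] + ctm[4],
    ctm[1] * m[4] + ctm[3] * m[5] + ctm[5],
  ];
}

// Images are painted into the unit square, so map its four corners through
// the matrix and take the bounding box (this also copes with flips and rotation).
function unitSquareBounds([a, b, c, d, e, f]) {
  const xs = [e, a + e, c + e, a + c + e];
  const ys = [f, b + f, d + f, b + d + f];
  const x = Math.min(...xs);
  const y = Math.min(...ys);
  return { x, y, width: Math.max(...xs) - x, height: Math.max(...ys) - y };
}

// Walk the operator list while maintaining a current matrix and a save stack.
function trackImageBounds(opList, ops, initial = [1, 0, 0, 1, 0, 0]) {
  const stack = [];
  const images = [];
  let ctm = initial;
  opList.fnArray.forEach((fn, i) => {
    const args = opList.argsArray[i];
    if (fn === ops.save) stack.push(ctm);
    else if (fn === ops.restore) ctm = stack.pop() || ctm;
    else if (fn === ops.transform) ctm = multiply(ctm, args);
    else if (fn === ops.paintImageXObject) {
      images.push({ objId: args[0], ...unitSquareBounds(ctm) });
    }
  });
  return images;
}
```

Passing viewport.transform as the initial matrix should give boxes in canvas coordinates rather than PDF user space, but I have only reasoned about that, not verified it against real documents.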

While reading the PDF.js source code, I discovered imageLayer:

https://github.com/TomasHubelbauer/albert/blob/master/index.js

const imageLayer = {
  beginLayout: () => { },
  endLayout: undefer,
  appendImage: ({ left: x, top: y, width, height, imgData, objId }) => {
    if (!imgData) {
      // TODO: Fallback: commonObjs
      const img = page.objs.get(objId);
      const canvas = window.document.createElement('canvas');
      canvas.width = img.naturalWidth;
      canvas.height = img.naturalHeight;
      const context = canvas.getContext('2d');
      context.drawImage(img, 0, 0);
      imgData = context.getImageData(0, 0, img.naturalWidth, img.naturalHeight);

      // TODO: Verify this is always the case for images which come like this
      // `top` was destructured as `y` above, so adjust `y`
      y -= height;
    }

    console.log(number, x, y, width, height, imgData);
    if (!imgData) {
      alert('No image data!');
      throw new Error('No image data!');
    }

    const canvas = document.createElement('canvas');
    canvas.width = imgData.width;
    canvas.height = imgData.height;
    const context = canvas.getContext('2d');

    /** @type {Uint8ClampedArray} */
    let array;
    switch (imgData.data.length) {
      case imgData.width * imgData.height * 3: {
        array = new Uint8ClampedArray(imgData.width * imgData.height * 4);
        for (let index = 0; index < array.length; index++) {
          // Set alpha channel to full (255 = opaque)
          if (index % 4 === 3) {
            array[index] = 255;
          }
          // Copy RGB channel components from the original array
          else {
            array[index] = imgData.data[~~(index / 4) * 3 + (index % 4)];
          }
        }

        break;
      }
      case imgData.width * imgData.height * 4: {
        array = imgData.data;
        break;
      }
      default: {
        alert('Unknown imgData format!');
        throw new Error('Unknown imgData format!');
      }
    }

    context.putImageData(new ImageData(array, imgData.width, imgData.height), 0, 0);
    const data = { width: imgData.width, height: imgData.height, url: canvas.toDataURL() };
    item.images.push({ x, y, width, height, data });
  },
};

page.render({ canvasContext: context, viewport, imageLayer });

I think using imageLayer is a much better suggestion, and had I been aware of this API earlier, I would not have bothered trying to work out the transforms myself. I don't think it's very well documented, though, which is why I discovered it so late.
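
If only the positions are needed, the imageLayer object can be much smaller. The sketch below reuses the callback names and the appendImage fields from the snippet above; createImageCollector is a hypothetical helper name, and whether your PDF.js version accepts the imageLayer option is something to check, since I have only seen it in the source rather than in the documentation.

```javascript
// Build an imageLayer-style object that only records where images are painted.
// The callback names and appendImage fields mirror the snippet above;
// `createImageCollector` itself is a hypothetical helper.
function createImageCollector() {
  const images = [];
  let resolveDone;
  const done = new Promise((resolve) => {
    resolveDone = resolve;
  });
  const imageLayer = {
    beginLayout() {
      images.length = 0; // start from a clean list for each render
    },
    endLayout() {
      resolveDone(images); // signal that all images have been reported
    },
    appendImage({ left, top, width, height, objId }) {
      images.push({ x: left, y: top, width, height, objId });
    },
  };
  return { imageLayer, images, done };
}

// Assumed usage, given a PDF.js version that accepts the imageLayer option:
//   const { imageLayer, done } = createImageCollector();
//   page.render({ canvasContext: context, viewport, imageLayer });
//   const images = await done; // [{ x, y, width, height, objId }, ...]
```

The resulting rectangles are what I would feed into the scroll-to-figure and gallery features asked about in the original question.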
