React-pdf: How to highlight sentence over multiple lines?

Created on 23 Jul 2020 · 5 Comments · Source: wojtekmaj/react-pdf

Before you start - checklist

  • [x] I have checked sample and test suites to see real life basic implementation
  • [x] I have read documentation in README
  • [x] I have checked if this question is not already asked

What are you trying to achieve? Please describe.

I would like to highlight patterns which are spread over multiple lines.

If I try to highlight a sentence which is broken by a line break, nothing will be highlighted since each line belongs to its own tag.

Describe solutions you've tried

I thought of looking for the rest of the sentence in the following span in the DOM, but this solution seems really laborious.

Most helpful comment

Hah, that's a good one!

For this to work you need to:

  • Get all page text items, which you can do in <Page />'s onLoadSuccess callback
  • Implement customTextRenderer to hook into the text rendering mechanism

    • Try finding a full match in the current text item.

    • If found, use highlightPattern to highlight the match.

    • If not, try finding a full match in the current text item together with its n previous/next neighbours (in my case, n was 1).

      • If found, find out what part of this highlight, if any, is in the current text item. Use highlightPattern to highlight the partial match.

      • If not, return the text item untouched, nothing to do here.


And this is a simplified version, since it only supports single matches! For multiple matches, you will need to make it even more complicated. Anyway, in code it would look like so:

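import React, { useCallback, useState } from 'react';
import { Document, Page } from 'react-pdf';

// Assumed to be defined elsewhere: `samplePDF` (the file to render) and
// `highlightPattern(text, pattern)`, a helper that wraps matches in <mark>

const stringToHighlight = 'Donec sodales placerat dui';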

// You might want to merge the items a little smarter than that
function getTextItemWithNeighbors(textItems, itemIndex, span = 1) {
  return textItems.slice(
    Math.max(0, itemIndex - span), 
    itemIndex + 1 + span
  )
    .filter(Boolean)
    .map(item => item.str)
    .join('');
}

function getIndexRange(string, substring) {
  const indexStart = string.indexOf(substring);
  const indexEnd = indexStart + substring.length;

  return [indexStart, indexEnd];
}

function Test() {
  const [textItems, setTextItems] = useState();

  const onPageLoadSuccess = useCallback(async page => {
    const textContent = await page.getTextContent();
    setTextItems(textContent.items);
  }, []);

  const customTextRenderer = useCallback(textItem => {
    if (!textItems) {
      // Text items not loaded yet, render the text without highlights
      return textItem.str;
    }

    const { itemIndex } = textItem;

    // Plain substring check; .match() would interpret the string as a regex
    const matchInTextItem = textItem.str.includes(stringToHighlight);

    if (matchInTextItem) {
      // Found full match within current item, no need for black magic
      return highlightPattern(textItem.str, stringToHighlight);
    }

    // Full match within current item not found, let's check if we can find it
    // spanned across multiple lines

    // Get text item with neighbors
    const textItemWithNeighbors = getTextItemWithNeighbors(textItems, itemIndex);

    const matchInTextItemWithNeighbors = textItemWithNeighbors.includes(stringToHighlight);

    if (!matchInTextItemWithNeighbors) {
      // No match
      return textItem.str;
    }

    // Now we need to figure out if the match we found was at least partially
    // in the line we're currently rendering
    const [matchIndexStart, matchIndexEnd] = getIndexRange(textItemWithNeighbors, stringToHighlight);
    const [textItemIndexStart, textItemIndexEnd] = getIndexRange(textItemWithNeighbors, textItem.str);

    if (
      // Match entirely in the previous line (end indexes are exclusive)
      matchIndexEnd <= textItemIndexStart ||
      // Match entirely in the next line
      matchIndexStart >= textItemIndexEnd
    ) {
      return textItem.str;
    }

    // Match found was partially in the line we're currently rendering. Now
    // we need to figure out what does "partially" exactly mean

    // Find partial match in a line
    const indexOfCurrentTextItemInMergedLines = textItemWithNeighbors.indexOf(textItem.str);

    const matchIndexStartInTextItem = Math.max(0, matchIndexStart - indexOfCurrentTextItemInMergedLines);
    const matchIndexEndInTextItem = matchIndexEnd - indexOfCurrentTextItemInMergedLines;

    const partialStringToHighlight = textItem.str.slice(matchIndexStartInTextItem, matchIndexEndInTextItem);

    return highlightPattern(textItem.str, partialStringToHighlight);
  }, [stringToHighlight, textItems]);

  return (
    <Document file={samplePDF}>
      <Page
        customTextRenderer={customTextRenderer}
        onLoadSuccess={onPageLoadSuccess}
        pageNumber={1}
      />
    </Document>
  );
}
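
One possible way to lift the single-match limitation is to search the whole page text once and then map every match back onto the individual text items. The following is only a sketch under that assumption: `findAllRanges`, `getHighlightRangesPerItem` and `splitIntoSegments` are hypothetical helpers, not part of react-pdf, and the items are joined with '' just like getTextItemWithNeighbors does.

```javascript
// Hypothetical helpers for illustration only; none of these are react-pdf APIs.

// Find every [start, end) range where `needle` occurs in `haystack`
// (non-overlapping, plain substring search).
function findAllRanges(haystack, needle) {
  const ranges = [];
  if (!needle) return ranges;
  let from = 0;
  while (true) {
    const start = haystack.indexOf(needle, from);
    if (start === -1) break;
    ranges.push([start, start + needle.length]);
    from = start + needle.length;
  }
  return ranges;
}

// For each text item, return the [start, end) ranges, relative to that
// item's own string, that fall inside a match in the concatenated text.
function getHighlightRangesPerItem(textItems, needle) {
  const strings = textItems.map((item) => item.str);
  const matches = findAllRanges(strings.join(''), needle);

  let offset = 0;
  return strings.map((str) => {
    const itemStart = offset;
    const itemEnd = offset + str.length;
    offset = itemEnd;

    return matches
      // Keep only matches that overlap this item
      .filter(([start, end]) => start < itemEnd && end > itemStart)
      // Clamp to the item and convert to item-local indexes
      .map(([start, end]) => [
        Math.max(start, itemStart) - itemStart,
        Math.min(end, itemEnd) - itemStart,
      ])
      // Drop empty ranges (possible for empty text items)
      .filter(([start, end]) => end > start);
  });
}

// Split one item's string into plain and highlighted segments.
function splitIntoSegments(str, ranges) {
  const segments = [];
  let cursor = 0;
  for (const [start, end] of ranges) {
    if (start > cursor) {
      segments.push({ text: str.slice(cursor, start), highlight: false });
    }
    segments.push({ text: str.slice(start, end), highlight: true });
    cursor = end;
  }
  if (cursor < str.length) {
    segments.push({ text: str.slice(cursor), highlight: false });
  }
  return segments;
}
```

Inside customTextRenderer, the ranges for the item being rendered could be looked up by textItem.itemIndex, and each segment with highlight set to true wrapped in a <mark>. Computing the ranges once per page, rather than per item, would also avoid repeating the search for every text item.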

CodeSandbox working demo

Yeah, I hate it too.

All 5 comments

Thank you for the algo and the piece of code.

However I noticed that depending on how the pdf is rendered, it may not work.

Do you know why, in some PDFs, each line is wrapped in a <span> tag, while in others each token is wrapped? In the second case, it's hard to make the algo work.

"However I noticed that depending on how the pdf is rendered, it may not work."

Absolutely!

Things to consider:

  1. getTextItemWithNeighbors might need to "grab" more neighbors if the text to highlight is particularly large or the text nodes in PDFs are particularly small
  2. getTextItemWithNeighbors is a very naive implementation, e.g. if text nodes are "hello" and "world" it'll simply return "helloworld". You may consider .trim()ming the text nodes and adding spaces by .join(' ') instead of .join('') to make it a little smarter, but in general, it's a separate programming issue and I think you can handle this ;)
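
A slightly smarter merge along the lines of point 2 might look like the sketch below. `mergeTextItems` is a hypothetical name, not a react-pdf function; it trims each text node, joins the non-empty ones with a single space, and records where each item ends up in the merged string so a match can later be mapped back to its item.

```javascript
// Hypothetical helper, not part of react-pdf: merge a window of text items
// into one string with normalized whitespace, remembering each item's position.
function mergeTextItems(textItems, itemIndex, span = 1) {
  const first = Math.max(0, itemIndex - span);
  const last = Math.min(textItems.length - 1, itemIndex + span);

  let merged = '';
  const offsets = {}; // item index -> [start, end) within `merged`

  for (let i = first; i <= last; i += 1) {
    const str = (textItems[i].str || '').trim();
    if (!str) continue; // skip empty or whitespace-only items
    if (merged) merged += ' '; // single space between neighbors
    offsets[i] = [merged.length, merged.length + str.length];
    merged += str;
  }

  return { merged, offsets };
}
```

Note that the recorded offsets refer to the trimmed strings, so any index computed from them needs adjusting if the original textItem.str had leading whitespace. Joining with a space also assumes that no word is split across two text nodes; in PDFs where that happens, this merge would miss matches that .join('') would find. A larger span can be passed when text nodes are short, as point 1 suggests.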

Tried to implement this and no matter what I do it leads to an infinite re-render loop. Please consider having a look.

@pedro-surf You have a working example in my comment above, so you need to either share the full code with us or find the differences yourself. Perhaps you're creating your custom text renderer on every render because you forgot to use useCallback? Just a blind guess though.
