What are you trying to achieve? Please describe.
I would like to highlight patterns which are spread over multiple lines.
If I try to highlight a sentence which is broken by a line break, nothing will be highlighted since each line belongs to its own tag.
Describe solutions you've tried
I thought of looking for the rest of the sentence in the following span in the DOM, but this solution seems really laborious.
Hah, that's a good one!
For this to work you need to:
- Get the page's text items using <Page />'s onLoadSuccess callback
- Use customTextRenderer to hook into the text rendering mechanism
- Merge n previous/next neighbours together (in my case, n was 1).
And this is a simplified version, since it only supports single matches! For multiple matches, you will need to make it even more complicated. Anyway, in code it would look like so:
import { useCallback, useState } from 'react';
import { Document, Page } from 'react-pdf';

// Note: highlightPattern(text, pattern) and samplePDF are assumed to be
// defined elsewhere; they are not part of this snippet.

const stringToHighlight = 'Donec sodales placerat dui';

// You might want to merge the items a little smarter than that
function getTextItemWithNeighbors(textItems, itemIndex, span = 1) {
  return textItems
    .slice(Math.max(0, itemIndex - span), itemIndex + 1 + span)
    .filter(Boolean)
    .map(item => item.str)
    .join('');
}

// Returns the [start, end) range of the first occurrence of substring
function getIndexRange(string, substring) {
  const indexStart = string.indexOf(substring);
  const indexEnd = indexStart + substring.length;
  return [indexStart, indexEnd];
}

function Test() {
  const [textItems, setTextItems] = useState();

  const onPageLoadSuccess = useCallback(async page => {
    const textContent = await page.getTextContent();
    setTextItems(textContent.items);
  }, []);

  const customTextRenderer = useCallback(textItem => {
    if (!textItems) {
      // Text items not loaded yet, render the text unchanged
      return textItem.str;
    }

    const { itemIndex } = textItem;

    // Plain substring check, so the search string is not treated as a regex
    if (textItem.str.includes(stringToHighlight)) {
      // Found full match within current item, no need for black magic
      return highlightPattern(textItem.str, stringToHighlight);
    }

    // Full match within current item not found, let's check if we can find it
    // spanned across multiple lines

    // Get text item with neighbors
    const textItemWithNeighbors = getTextItemWithNeighbors(textItems, itemIndex);

    if (!textItemWithNeighbors.includes(stringToHighlight)) {
      // No match
      return textItem.str;
    }

    // Now we need to figure out if the match we found was at least partially
    // in the line we're currently rendering
    const [matchIndexStart, matchIndexEnd] = getIndexRange(textItemWithNeighbors, stringToHighlight);
    const [textItemIndexStart, textItemIndexEnd] = getIndexRange(textItemWithNeighbors, textItem.str);

    // End indexes are exclusive, hence <= and >=
    if (
      // Match entirely in the previous line
      matchIndexEnd <= textItemIndexStart ||
      // Match entirely in the next line
      matchIndexStart >= textItemIndexEnd
    ) {
      return textItem.str;
    }

    // Match found was partially in the line we're currently rendering. Now
    // we need to figure out what "partially" exactly means

    // Find partial match in a line
    const matchIndexStartInTextItem = Math.max(0, matchIndexStart - textItemIndexStart);
    const matchIndexEndInTextItem = matchIndexEnd - textItemIndexStart;
    const partialStringToHighlight = textItem.str.slice(matchIndexStartInTextItem, matchIndexEndInTextItem);

    return highlightPattern(textItem.str, partialStringToHighlight);
  }, [textItems]);

  return (
    <Document file={samplePDF}>
      <Page
        customTextRenderer={customTextRenderer}
        onLoadSuccess={onPageLoadSuccess}
        pageNumber={1}
      />
    </Document>
  );
}
Yeah, I hate it too.
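For anyone who wants to check the index arithmetic without rendering a PDF, here is a minimal framework-free sketch of the same idea. getPartialMatch is a hypothetical helper name (it is not part of react-pdf), and the sample items are made up. Unlike the snippet above, it works out the current item's offset from the lengths of the preceding items instead of using indexOf, which should avoid picking the wrong position when a neighbor repeats the same text.

```javascript
// Hypothetical helper (not part of react-pdf): returns the part of
// items[itemIndex].str that overlaps a match of `needle` in the merged text
// of the item and its `span` neighbors on each side, or '' if there is none.
function getPartialMatch(items, itemIndex, needle, span = 1) {
  const start = Math.max(0, itemIndex - span);
  const windowItems = items.slice(start, itemIndex + 1 + span);
  const merged = windowItems.map(item => item.str).join('');

  const matchStart = merged.indexOf(needle);
  if (needle === '' || matchStart === -1) return '';
  const matchEnd = matchStart + needle.length; // exclusive

  // Offset of the current item inside the merged string, summed from the
  // lengths of the items that precede it in the window.
  const itemStart = windowItems
    .slice(0, itemIndex - start)
    .reduce((sum, item) => sum + item.str.length, 0);
  const str = items[itemIndex].str;
  const itemEnd = itemStart + str.length; // exclusive

  // No overlap between the match and the current item
  if (matchEnd <= itemStart || matchStart >= itemEnd) return '';

  return str.slice(Math.max(0, matchStart - itemStart), matchEnd - itemStart);
}

// Made-up text items: the sentence is broken after "sodales "
const items = [
  { str: 'Lorem ipsum Donec sodales ' },
  { str: 'placerat dui sit amet' },
  { str: 'consectetur' },
];
const needle = 'Donec sodales placerat dui';

const results = items.map((_, index) => getPartialMatch(items, index, needle));
console.log(results); // [ 'Donec sodales ', 'placerat dui', '' ]
```

Each returned string is what you would pass as the second argument to your highlighting function for that item. Like the original, this only handles the first match and only sees span neighbors on each side.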
Thank you for the algo and the piece of code.
However, I noticed that depending on how the PDF is rendered, it may not work.
Do you know why in some PDFs each line is wrapped in a <span> tag, while in others each token is wrapped? In the second case, it's hard to make the algo work.
> However I noticed that depending on how the pdf is rendered, it may not work.
Absolutely!
Things to consider:
- getTextItemWithNeighbors might need to "grab" more neighbors if the text to highlight is particularly large or the text nodes in PDFs are particularly small
- getTextItemWithNeighbors is a very naive implementation, e.g. if text nodes are "hello" and "world" it'll simply return "helloworld". You may consider .trim()ming the text nodes and adding spaces by .join(' ') instead of .join('') to make it a little smarter, but in general, it's a separate programming issue and I think you can handle this ;)
Tried to implement this and no matter what I do it leads to an infinite re-render loop. Please consider having a look
@pedro-surf You have a working example in my comment above, so you need to either share the full code with us or find the differences yourself. Perhaps you're creating your custom text renderer on every render because you forgot to use useCallback? Just a blind guess though.
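Going back to the earlier advice about making getTextItemWithNeighbors smarter, here is a rough sketch of the trim-and-join idea with a configurable number of neighbors. mergeTextItems is a hypothetical name and the sample items are made up; treat it as one possible starting point rather than a tested solution for every PDF.

```javascript
// Hypothetical variant of getTextItemWithNeighbors: trims each text node,
// drops the ones that end up empty, and joins the rest with single spaces.
function mergeTextItems(textItems, itemIndex, span = 1) {
  return textItems
    .slice(Math.max(0, itemIndex - span), itemIndex + 1 + span)
    .filter(Boolean)
    .map(item => item.str.trim())
    .filter(str => str.length > 0)
    .join(' ');
}

// Made-up text items, including stray whitespace and an empty node
const textItems = [
  { str: 'hello ' },
  { str: ' world' },
  { str: '' },
  { str: 'again' },
];

const mergedNarrow = mergeTextItems(textItems, 1); // one neighbor on each side
const mergedWide = mergeTextItems(textItems, 1, 2); // two neighbors on each side

console.log(mergedNarrow); // hello world
console.log(mergedWide); // hello world again
```

Two caveats. Once the nodes are trimmed and joined with spaces, positions in the merged string no longer line up with positions in textItem.str, so the index arithmetic in the renderer has to account for that. And if the PDF wraps each token (or part of a word) in its own node, joining with spaces may insert a space in the middle of a word.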