Simply put, I want to extract all text nodes from an HTML document while excluding specific elements, such as 'pre' or 'script' tags, together with all of their descendants.
I have read that querySelectorAll is not very efficient for this and that TreeWalker is considered the most efficient method. Is this true?
The issue with my code is that it excludes specific elements but still retrieves their children.
I wrote a JavaScript function that uses a TreeWalker to retrieve all text nodes in the document.
Some elements, such as "pre" or a "div" with a specific class, should be filtered out of the results.
I can filter out the text directly inside these elements, but the text inside their child elements is still retrieved, so I cannot eliminate them entirely.
How can I address this challenge?
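A likely cause of this symptom, assuming the filter only inspects the direct parent of each text node, is that text nested one level deeper has a different parent and slips through. The sketch below walks the whole ancestor chain instead. The helper name `hasExcludedAncestor` and its parameters are hypothetical, and it assumes HTML elements whose `className` is a plain string.

```javascript
// Hypothetical helper: returns true when the text node sits anywhere inside
// an excluded element, not only directly under it.
function hasExcludedAncestor(textNode, excludedTags, excludedClasses) {
  for (let el = textNode.parentElement; el; el = el.parentElement) {
    // nodeName is upper-case for elements in an HTML document.
    if (excludedTags.includes(el.nodeName)) return true;
    // Assumption: className is a space-separated string (true for HTML elements).
    const classes = String(el.className || '').split(/\s+/);
    if (classes.some((name) => excludedClasses.includes(name))) return true;
  }
  return false;
}
```

Inside an `acceptNode` callback this could be used as `if (hasExcludedAncestor(node, ['STYLE', 'SCRIPT', 'PRE'], ['this_is_code'])) return NodeFilter.FILTER_REJECT;`. In a browser, `node.parentElement.closest('pre, script, style, .this_is_code')` should express the same check with a built-in method.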
This page provided me with some insight: getElementsByTagName() equivalent for textNodes
The documentation for document.createTreeWalker can be found at:
https://developer.mozilla.org/en-US/docs/Web/API/Document/createTreeWalker#parameters
<!DOCTYPE html>
<html>
<head>
<script>
function nativeTreeWalker() {
  var walker = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT,
    {
      acceptNode: function (node) {
        // ===========================
        // Filter out specific elements
        // Yet this is unable to filter their child elements????
        if (['STYLE', 'SCRIPT', 'PRE'].includes(node.parentElement?.nodeName)) {
          return NodeFilter.FILTER_REJECT;
        }
        // ===========================
        // Filter out empty (whitespace-only) text nodes
        if (!/^\s*$/.test(node.data)) {
          return NodeFilter.FILTER_ACCEPT;
        }
        // Whitespace-only text: skip explicitly instead of returning undefined
        return NodeFilter.FILTER_SKIP;
      }
    },
    true // Skip child elements, protect integrity
  );
  var node;
  var textNodes = [];
  while ((node = walker.nextNode())) {
    textNodes.push(node.nodeValue);
  }
  return textNodes;
}
window.onload = function () {
  console.log(nativeTreeWalker());
};
</script>
</head>
<body>
get the text
<p> </p>
<div>This is text, get</div>
<p>This is text, get too</p>
<pre>
This is code,Don't get
<p>this is code too, don't get</p>
</pre>
<div class="this_is_code">
This is className is code, Don't get
<span>this is code too, don't get</span>
</div>
</body></html>
The expected outcome of the above code should be:
0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
length: 3
Instead, the actual output is:
0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
3: "this is code too, don't get"
4: "\n This is className is code, Don't get\n "
5: "this is code too, don't get"
length: 6
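One possible way to get the expected three entries is to let the walker see elements as well as text. With `SHOW_TEXT` alone the filter is only ever called for text nodes, so `FILTER_REJECT` never gets a chance to prune an element's subtree. The sketch below is an assumption-laden illustration, not a definitive fix: `collectText`, `EXCLUDED_TAGS`, and `EXCLUDED_CLASS` are hypothetical names, and the numeric constants mirror the standard `NodeFilter` and `Node` values so the filter can also be checked outside a browser.

```javascript
// Numeric values of the DOM constants, repeated here so the filter logic
// does not depend on the NodeFilter global.
const FILTER_ACCEPT = 1; // NodeFilter.FILTER_ACCEPT
const FILTER_REJECT = 2; // NodeFilter.FILTER_REJECT
const FILTER_SKIP = 3;   // NodeFilter.FILTER_SKIP
const ELEMENT_NODE = 1;  // Node.ELEMENT_NODE
const SHOW_ELEMENT_AND_TEXT = 0x1 | 0x4; // SHOW_ELEMENT | SHOW_TEXT

// Hypothetical configuration matching the markup in the question.
const EXCLUDED_TAGS = ['STYLE', 'SCRIPT', 'PRE'];
const EXCLUDED_CLASS = 'this_is_code';

function acceptNode(node) {
  if (node.nodeType === ELEMENT_NODE) {
    const excluded =
      EXCLUDED_TAGS.includes(node.nodeName) ||
      node.classList.contains(EXCLUDED_CLASS);
    // REJECT prunes the element and all of its descendants;
    // SKIP omits the element itself but still visits its children.
    return excluded ? FILTER_REJECT : FILTER_SKIP;
  }
  // Text node: keep it only if it contains non-whitespace characters.
  return /^\s*$/.test(node.data) ? FILTER_SKIP : FILTER_ACCEPT;
}

// Browser-only: collects the values of all accepted text nodes under root.
function collectText(root) {
  const walker = document.createTreeWalker(root, SHOW_ELEMENT_AND_TEXT, { acceptNode });
  const texts = [];
  let node;
  while ((node = walker.nextNode())) {
    texts.push(node.nodeValue);
  }
  return texts;
}
```

Calling `collectText(document.body)` on the page above is intended to return only the three expected strings. Only three arguments are passed to `createTreeWalker` here, since the current DOM standard defines just the root, `whatToShow`, and the filter.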