How to use JavaScript to get all text-containing HTML elements while excluding specified elements and their descendants

Question

In short:

I want to extract all the text nodes from an HTML document while excluding specific elements, such as 'pre' or 'script' tags, together with all of their descendants.

I have read that querySelectorAll is not very efficient and that TreeWalker is the most efficient way to do this. Is that true?

I wrote a JavaScript function that collects all text nodes in the HTML. Elements such as "pre", or a "div" with a particular class, should be filtered out of the results.

My code does exclude the text directly inside those elements, but the text inside their child elements is still returned, so I cannot exclude them entirely.

How can I solve this?

This page gave me some insight: "getElementsByTagName() equivalent for textNodes"

document.createTreeWalker's documentation can be found at:
https://developer.mozilla.org/en-US/docs/Web/API/Document/createTreeWalker#parameters

<!DOCTYPE html>
<html>
<head>
<script>
function nativeTreeWalker() {
    var walker = document.createTreeWalker(
        document.body, 
        NodeFilter.SHOW_TEXT,
        {acceptNode: function(node) {

          // ===========================
          // Filtering of specific elements
          // Yet unable to filter child elements????
          if (['STYLE', 'SCRIPT', 'PRE'].includes(node.parentElement?.nodeName)) {
            return NodeFilter.FILTER_REJECT;
          }
          // ===========================

          // Filtering empty elements
          if (! /^\s*$/.test(node.data) ) {
            return NodeFilter.FILTER_ACCEPT;
          }
        }
        },
        true  // Skip child elements, protect integrity
    );

    var node;
    var textNodes = [];
    while(node = walker.nextNode()){
        textNodes.push(node.nodeValue);
    }
    return textNodes
}

window.onload = function(){
  console.log(nativeTreeWalker())
}
</script>
</head>
<body>
get the text
<p> </p>
<div>This is text, get</div>
<p>This is text, get too</p>

<pre>
  This is code,Don't get
  <p>this is code too, don't get</p>
</pre>

<div class="this_is_code">
  This is className is code, Don't get
  <span>this is code too, don't get</span>
</div>
</body></html>

The expected outcome of the above code should be:

0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
length: 3

Instead of:

0: "\nget the text\n"
1: "This is text, get"
2: "This is text, get too"
3: "this is code too, don't get"
4: "\n This is className is code, Don't get\n "
5: "this is code too, don't get"
length: 6

javascript typescript

Answer 1

The expected output in your question may need adjusting. For instance, in the example below, the top-level text node "Don't get code:" is a direct child of body, so it is a valid node under your stated criteria and appears in the results.

You can get the result you want with the TreeWalker API. Because the walker is created with NodeFilter.SHOW_TEXT, the filter only ever sees text nodes, so checking just the immediate parent misses text nested deeper inside an excluded element. The key is to test whether the text node has any ancestor that matches the excluded elements, which Element.closest() does:

<!doctype html>
<html>
<head>
<script type="module">
function filterTextNode (textNode) {
  // Drop text nodes that are empty or whitespace-only
  if (!textNode.textContent?.trim()) return NodeFilter.FILTER_REJECT;
  // closest() tests the parent and every ancestor, so nested text is excluded too
  const ancestor = textNode.parentElement?.closest('pre,script,style,.this_is_code');
  if (ancestor) return NodeFilter.FILTER_REJECT;
  return NodeFilter.FILTER_ACCEPT;
}

function getFilteredTexts (textNodeFilterFn) {
  const walker = document.createTreeWalker(
    document.body,
    NodeFilter.SHOW_TEXT,
    {acceptNode: textNodeFilterFn},
  );
  const results = [];
  let node = walker.nextNode();
  while (node) {
    results.push(node.textContent);
    node = walker.nextNode();
  }
  return results;
}

function main () {
  const texts = getFilteredTexts(filterTextNode);
  console.log(texts);
}

main();
</script>
</head>
<body>
  <p> </p>
  
  get text:
  <div>This is text, get</div>
  <p>This is text, get too</p>
  
  Don't get code:
  <pre>
    This is code,Don't get
    <p>this is code too, don't get</p>
  </pre>
  
  <div class="this_is_code">
    This is className is code, Don't get
    <span>this is code too, don't get</span>
  </div>
</body>
</html>
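
A possible alternative, offered as a sketch rather than as part of the code above: in a TreeWalker, FILTER_REJECT only prunes a subtree when it is returned for the element itself, and a walker created with only SHOW_TEXT never passes elements to the filter. If the walker also shows elements, each excluded element can be rejected once and its descendants are never visited. The names EXCLUDED_SELECTOR, pruningFilter and getTextsPruned are hypothetical, chosen only for illustration.

```javascript
// Hypothetical names for illustration; the selector mirrors the one used above.
const EXCLUDED_SELECTOR = 'pre,script,style,.this_is_code';

// Filter intended for a walker created with SHOW_ELEMENT | SHOW_TEXT.
function pruningFilter(node) {
  if (node.nodeType === 1) { // 1 === Node.ELEMENT_NODE
    // REJECT on an element skips the element and its entire subtree.
    // SKIP leaves the element out of the results but still visits its children.
    return node.matches(EXCLUDED_SELECTOR)
      ? NodeFilter.FILTER_REJECT
      : NodeFilter.FILTER_SKIP;
  }
  // Text node: accept it only if it contains a non-whitespace character.
  return /\S/.test(node.nodeValue)
    ? NodeFilter.FILTER_ACCEPT
    : NodeFilter.FILTER_REJECT;
}

// Collects the values of all accepted text nodes below `root`.
// Assumes a browser environment where `document` and `NodeFilter` exist.
function getTextsPruned(root) {
  const walker = document.createTreeWalker(
    root,
    NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT,
    { acceptNode: pruningFilter },
  );
  const results = [];
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    results.push(node.nodeValue);
  }
  return results;
}

// In a browser: console.log(getTextsPruned(document.body));
```

On the sample page above, getTextsPruned(document.body) should yield the same text nodes as the closest() version. The difference is that text inside excluded elements is never visited, instead of being checked node by node.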
